pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-22 06:11:27 +08:00

Author	SHA1	Message	Date
Sam Ginzburg	f4adf6dd1d	lint	2024-11-04 12:30:35 -08:00
Sam Ginzburg	41a1e2557f	[inductor] Error on unsupported autotuner configs	2024-11-04 12:25:03 -08:00
PyTorch UpdateBot	2ce2e4df4e	Update slow tests (#139051 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139051 Approved by: https://github.com/pytorchbot	2024-11-04 11:49:06 +00:00
Bob Ren	12d225d91c	add opaque unary sin and cos to SYMPY_INTERP (#139569 ) Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569 Approved by: https://github.com/ezyang	2024-11-04 07:37:11 +00:00
Sun, Jiayi	3337439dc0	[inductor] modify the heuristic for disabling vectorization (#136422 ) Summary Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristics for disabling vectorization from a performance perspective. I changed the heuristic to: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable the vectorization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-11-04 07:33:32 +00:00
James Wu	f4ee5a243d	Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402 ) Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times. In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish. In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel. The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen. Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402 Approved by: https://github.com/oulgen	2024-11-04 06:37:18 +00:00
cyy	3179eb15ae	[1/N] Remove usage of C array (#139567 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-11-04 04:52:46 +00:00
Yuxin Wu	cadc50e7e9	LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696 ) In the same spirit as https://github.com/pytorch/pytorch/pull/105695 Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696 Approved by: https://github.com/kwen2501 Co-authored-by: Ke Wen <kw2501@fb.com>	2024-11-04 04:43:42 +00:00
Jason Ansel	ed30fa74ab	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-04 04:28:40 +00:00
Jason Ansel	b6fb135c2c	[inductor] Simplify remove_kernel_local_buffers (#139452 ) I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365, #139370	2024-11-04 04:28:40 +00:00
Jason Ansel	3d633f12ba	[inductor] Move remove_kernel_local_buffers to Kernel (#139370 ) This method mutates the kernel, so it fits better in that class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365	2024-11-04 04:28:33 +00:00
Jason Ansel	66d5e2405d	[inductor] Remove Node.last_usage mutation (#139365 ) I can't figure out why this is needed. Let's see if tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365 Approved by: https://github.com/shunting314 ghstack dependencies: #139364	2024-11-04 04:28:25 +00:00
Jason Ansel	d189f92eb1	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-04 04:28:18 +00:00
Animesh Jain	e6ff07f00e	[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560 ) This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476 We have a dict tag optimization where, if the dict tag does not change, we skip guards on all the items of the dict that are "immutable". We considered tensors as immutable in such scenarios. This is critical for guard eval performance, because generally users don't change their parameters. If I try to remove this optimization, we see slowdowns, e.g., 3.03x to 2.95x on the conv_mixer TIMM benchmark. So, I am adding a flag which keeps the current state but allows users to remove this optimization. Not ideal, but given how important guard eval perf is, we are in the gray area of the unsoundness vs. performance tradeoff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560 Approved by: https://github.com/jansel	2024-11-04 00:54:20 +00:00
cyy	7f387fa612	[10/N] Fix extra warnings brought by clang-tidy-17 (#139385 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385 Approved by: https://github.com/Skylion007	2024-11-04 00:47:19 +00:00
briancoutinho	3242049daa	[profiler] Annotate triton kernels with kernel hash (#139531 ) As above, annotates triton kernel hash in the profile attributes. Added a new unit test in profiler to triton/dynamo events. Testplan: Running new unit test in CI Internal: buck2 run @mode/dev-nosan caffe2/test:profiler -- -r test_pt2_triton_attributes Running on an example, this is how the kernel hash file looks ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 1670242, "tid": 1670242, "ts": 2413669097354.058, "dur": 95.812, "args": { "External id": 3,"kernel_hash": "cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu", "grid": "grid(100,)", "Record function id": 0, "stream": 0, "Concrete Inputs": ["", "", "", "100"], "kernel_file": "/tmp/torchinductor_bcoutinho/qa/cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu.py", "kernel_backend": "triton", "Input type": ["float", "float", "float", "Scalar"], "Input Strides": [[10, 1], [10, 1], [10, 1], []], "Input Dims": [[10, 10], [10, 10], [10, 10], []], "Ev Idx": 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139531 Approved by: https://github.com/davidberard98	2024-11-03 23:19:35 +00:00
Yifu Wang	924e726c3a	[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` In combination, these aim to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility. To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`: - `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`. - `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755 Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw	2024-11-03 21:37:31 +00:00
Bob Ren	5d07651c72	only use hint_size in _smart_symbol_sort for size type symbols (#139571 ) Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_exponential_kstest_cpu_bfloat16` when specialize_float = False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139571 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482, #139484, #139486	2024-11-03 21:15:08 +00:00
cyy	57a49018b1	[5/N] Fix Wextra-semi warning (#139465 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139465 Approved by: https://github.com/ezyang	2024-11-03 20:40:50 +00:00
cyy	03e83111f5	Remove unnecessary check of CUDA 10.2 (#139566 ) Since PyTorch now requires higher CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139566 Approved by: https://github.com/ezyang	2024-11-03 20:04:37 +00:00
leslie-fang-intel	d84a344410	[Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/138823, `coordinate_descent_tuning` provides no benefit on CPU, where it is preferable to lower `mm`/`bmm` into ATEN kernels or the CPP GEMM Template. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537 Approved by: https://github.com/jansel	2024-11-03 10:10:29 +00:00
Edward Z. Yang	585dbfa583	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-03 06:29:57 +00:00
PyTorch MergeBot	3a2ab9584f	Revert "[executorch hash update] update the pinned executorch hash (#139536 )" This reverts commit 468d592fbc12dfc67d89f954781ccbf540241470. Reverted https://github.com/pytorch/pytorch/pull/139536 on behalf of https://github.com/huydhn due to This is breaking trunk, need to fix before relanding ([comment](https://github.com/pytorch/pytorch/pull/139536#issuecomment-2453313984))	2024-11-03 06:25:41 +00:00
Bob Ren	a1370259ba	always specialize float on export path (#139486 ) This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout. Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482, #139484	2024-11-03 04:47:12 +00:00
Bob Ren	25f243ff5d	Update tensorify pass to specialize symfloats we didn't tensorify away (#139564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139564 Approved by: https://github.com/huydhn	2024-11-03 04:27:43 +00:00
Nikita Shulga	b3ad45733b	[Lint] Clang-format all metal kernels (#139530 ) Except Quantized.metal, where linting breaks all the ASCII art Pull Request resolved: https://github.com/pytorch/pytorch/pull/139530 Approved by: https://github.com/cyyever, https://github.com/Skylion007 ghstack dependencies: #139522	2024-11-03 04:14:20 +00:00
PyTorch UpdateBot	468d592fbc	[executorch hash update] update the pinned executorch hash (#139536 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139536 Approved by: https://github.com/pytorchbot, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2024-11-03 03:14:06 +00:00
PyTorch MergeBot	067d2a089d	Revert "Expose Storage _use_count API in Python (#139426 )" This reverts commit e31136d07bbfb10735df101df953c73d22dde24b. Reverted https://github.com/pytorch/pytorch/pull/139426 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor job in trunk ([comment](https://github.com/pytorch/pytorch/pull/139426#issuecomment-2453269063))	2024-11-03 02:40:45 +00:00
Bob Ren	b8b60e0bc5	add is_integer to support example_value function whitelist (#139484 ) Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_is_integer_dynamic_shapes when specialize_float=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139484 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482	2024-11-03 02:01:38 +00:00
Ke Wen	f121eab018	[c10d] Remove dead Dynamo marker (#139545 ) Per discussion with @anijain2305, `dynamo_unsupported_distributed_c10d_ops` is not referenced anywhere. Removing this dead code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139545 Approved by: https://github.com/Skylion007	2024-11-03 00:40:26 +00:00
Syed Tousif Ahmed	0f06dff4d7	Restores release_lock_on_cudamalloc behavior in CUDACachingAllocator (#139430 ) In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code: ```CPP if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) { // At scope exit, acquire the lock again. This provides safety against // any potential exceptions in the cudaMallocMaybeCapturing function. auto sg = c10::make_scope_exit([&]() { lock.lock(); }); lock.unlock(); p.err = cudaMallocMaybeCapturing(&ptr, size); } else { p.err = cudaMallocMaybeCapturing(&ptr, size); } if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) { TORCH_CHECK( lock.owns_lock(), "Failed to acquire lock after cudaMalloc"); } ``` into: ```CPP if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) { // At scope exit, acquire the lock again. This provides safety against // any potential exceptions in the cudaMallocMaybeCapturing function. auto sg = c10::make_scope_exit([&]() { lock.lock(); }); lock.unlock(); } auto active_pool = MemPoolContext::getActiveMemPool(); if (active_pool && active_pool->allocator() && p.pool->owner_PrivatePool) { ptr = active_pool->allocator()->raw_alloc(size); p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation; } else { p.err = cudaMallocMaybeCapturing(&ptr, size); } if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) { TORCH_CHECK( lock.owns_lock(), "Failed to acquire lock after cudaMalloc"); } ``` This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does. As a result, my change doesn't let `release_lock_on_cudamalloc` do unlock..execute alloc..lock; instead, it just does unlock..lock. This PR rectifies that change, and in addition adds an ASSERT ensuring the active pool and p.pool are the same (mirroring the behavior from released_cached_blocks). Thanks @zvon82 for reporting this! Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430 Approved by: https://github.com/ezyang	2024-11-03 00:04:30 +00:00
Yukio Siraichi	a3cb8ee38b	AOTAutograd: Make general `SymInt` hashable when merging view inputs. (#139553 ) Fix: #139111 This PR wraps `SymInt` input arguments with `SymIntEqByExpr`, making them hashable when merging view inputs (`merge_view_inputs` function). Pull Request resolved: https://github.com/pytorch/pytorch/pull/139553 Approved by: https://github.com/ezyang	2024-11-02 23:57:11 +00:00
Yuanhao Ji	b46e1fc141	[Dynamo] Fix graph break when `tensor.split()` is called within a device context manager (#139270 ) Fixes: #139183 Note: this case can also be reproduced on cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/139270 Approved by: https://github.com/ezyang Co-authored-by: Vincent Moens <vincentmoens@gmail.com>	2024-11-02 23:55:51 +00:00
Jane Xu	e31136d07b	Expose Storage _use_count API in Python (#139426 ) Would be nice to replace the torch._C._storage_Use_Count call in https://github.com/pytorch/torchtune/pull/1936, at least without needing to know about _cdata in OSS code. Initially keeping it private as Tensor._use_count is also private. Favored over https://github.com/pytorch/pytorch/pull/139109 for solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139426 Approved by: https://github.com/soulitzer	2024-11-02 23:36:31 +00:00
axel	f6e5d09682	Raise error for int64 and bool dtypes in nanmean, even for empty tensors (#138745 ) This PR ensures that the `nanmean()` function raises a `RuntimeError` when using `int64` or `bool` dtypes, even for empty tensors. Previously, non-empty tensors correctly raised errors for unsupported dtypes, while empty tensors did not. This change brings consistent error handling for both cases, addressing the need raised in an issue by @hyperkai (Issue [#131043](https://github.com/pytorch/pytorch/issues/131043)). ### Changes - Added checks in `nanmean_out()` to raise errors for `int64` and `bool` dtypes regardless of tensor size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138745 Approved by: https://github.com/ezyang	2024-11-02 22:52:40 +00:00
Bob Ren	232af152b5	Fix graph breaks related to specialized float inputs (#139482 ) Fixes issue with timm models where example_value = 0.09999 proxy.node.target = <built-in function sub> would fall through to ``` unimplemented( "torch.* op returned non-Tensor " + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}", case_name="unsupported_operator", ) ``` and graph break Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482 Approved by: https://github.com/ezyang ghstack dependencies: #139451	2024-11-02 21:58:46 +00:00
PyTorch MergeBot	854be65fa0	Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 )" This reverts commit 55038aa66162372acc1041751d5cc5c8ed9bc304. Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2453085932))	2024-11-02 18:29:10 +00:00
Ke Wen	e9eb7b1b13	[CI] Skip test_cuda_tracker_equivalence for ROCm (#139543 ) Test fails on ROCm, skipping it for this platform. Resolves https://github.com/pytorch/pytorch/issues/139515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139543 Approved by: https://github.com/huydhn	2024-11-02 15:39:07 +00:00
PyTorch MergeBot	92d7f29e59	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit f6be44c74e012fb4329e6e716ebb78e9f5092a3b. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))	2024-11-02 13:11:04 +00:00
PyTorch MergeBot	709752e0bb	Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154 )" This reverts commit 293fbb42d207058d49f0ae40ca408214ee88b76b. Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))	2024-11-02 13:04:00 +00:00
Edward Z. Yang	f6be44c74e	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-02 11:50:11 +00:00
Ke Wen	55038aa661	[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 ) Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172 There was a split-vs-P2P bug: When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option that was meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug has slipped through so far because there is no CI test with the following recipe: eager init + new group + P2P in that new group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013 Approved by: https://github.com/shuqiangzhang	2024-11-02 07:47:55 +00:00
PyTorch MergeBot	2a3fe06ce0	Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598 )" This reverts commit 39ec5a20ea3d7bc8c2147f8363f8a06f4bb1e953. Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))	2024-11-02 07:19:22 +00:00
PyTorch MergeBot	f3238106fd	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit 030f70b40bca62993bd65d03c58ded45601abe35. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))	2024-11-02 06:53:48 +00:00
PyTorch MergeBot	0863d6a08e	Revert "[inductor] Remove SIMDKernel.last_usage (#139364 )" This reverts commit 286d3ce266ce01ca905afb1cc9ea5d81abf79ff7. Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:11 +00:00
PyTorch MergeBot	9331640e26	Revert "[inductor] Remove Node.last_usage mutation (#139365 )" This reverts commit 1e934b473cabe6bc003f66d9811082e97c958a31. Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	dc4b459737	Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370 )" This reverts commit b57b4b7f9b168389def15ea06a4dcf9e5f6f4f04. Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	66a401c9e1	Revert "[inductor] Simplify remove_kernel_local_buffers (#139452 )" This reverts commit 73c0762a34ef152450287dbc365cb8db930031b7. Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	98e11b0021	Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 )" This reverts commit c53beab3775671b5b7ec6106737c0d8939b8455a. Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
Bob Ren	fdd298dcb7	add hex method on SymFloat (#139451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451 Approved by: https://github.com/ezyang	2024-11-02 05:33:19 +00:00
PyTorch MergeBot	8d1eaa3da6	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit a6630bcf8736e4d66375688dfd8b45c401de3fef. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))	2024-11-02 03:38:15 +00:00
drisspg	540f3ef9b1	Fix flex_decode to build offsets off of strides (#139516 ) Fixes PR: https://github.com/pytorch/pytorch/issues/139462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516 Approved by: https://github.com/Chillee	2024-11-02 03:17:46 +00:00
Bin Bao	293fbb42d2	[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154 Approved by: https://github.com/angelayi ghstack dependencies: #139153	2024-11-02 03:10:05 +00:00
Bin Bao	a46a79fe92	[AOTI] Ignore .o files in package_aoti (#139153 ) Summary: There is no point to package .o files since a .so file is included in that package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153 Approved by: https://github.com/angelayi	2024-11-02 03:10:05 +00:00
Jason Ansel	c53beab377	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-02 03:04:22 +00:00
Justin Chu	387b120549	[ONNX] Remove type promotion rule for pow (#139527 ) ONNX supports different input types in Pow, so type promotion is not needed. The resulting graph is the following: ```py ONNXProgram( model= < ir_version=9, opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1}, producer_name='pytorch', producer_version='2.6.0a0+git59a1af5', domain=None, model_version=None, > graph( name=main_graph, inputs=( %"x"<FLOAT16,[3]> ), outputs=( %"pow_1"<FLOAT16,[3]> ), ) { 0 \| # node_Constant_0 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)} 1 \| # node_Pow_1 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0") return %"pow_1"<FLOAT16,[3]> } ... , exported_program= ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x: "f16[3]"): # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0 pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0); x = None return (pow_1,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)]) Range constraints: {} ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527 Approved by: https://github.com/titaiwangms	2024-11-02 02:19:50 +00:00
Matthew Sterrett	7e65060410	Adds support for accelerated sorting with x86-simd-sort (#127936 ) Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads. <details> <summary><b>Contiguous Benchmarks</b></summary> ``` float32, normally distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214 128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697 1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241 10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577 100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766 1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416 10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323 int32_t, uniformly distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779 128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698 1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285 10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997 100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472 1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611 10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952 ``` </details> Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction. 
<details> <summary><b>Discontiguous Benchmarks</b></summary> ``` float, normal distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243 128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923 1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952 10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974 100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743 1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133 10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621 int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908 128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862 1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408 10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889 100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902 1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381 10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742 ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936 Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel	2024-11-02 02:14:01 +00:00
Chen, Zejun	edd3f5a94d	[profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461 ) Fix: https://github.com/pytorch/pytorch/issues/139460 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16	2024-11-02 01:02:29 +00:00
Angela Yi	092fe2f422	Handle nan case when checking mutations (#139483 ) Test Plan: PT2 readiness models Differential Revision: D65340986 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483 Approved by: https://github.com/zou3519	2024-11-02 00:49:05 +00:00
William Wen	b71e813bce	[dynamo, 3.13] fix bytecode nop tests (#139323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323 Approved by: https://github.com/jansel	2024-11-02 00:39:36 +00:00
Bin Bao	8c17830dea	[AOTI] Unify how weights are stored as data section (#139471 ) Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macos. Unify the code by adopting that approach for Linux as well. Test Plan: CI Differential Revision: D65302273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471 Approved by: https://github.com/chenyang78	2024-11-02 00:23:24 +00:00
PyTorch UpdateBot	aa54b2467f	[executorch hash update] update the pinned executorch hash (#139133 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139133 Approved by: https://github.com/pytorchbot	2024-11-02 00:14:47 +00:00
eellison	ee2f8a50d3	Class rename (#139490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490 Approved by: https://github.com/exclamaforte, https://github.com/zou3519 ghstack dependencies: #139295	2024-11-02 00:10:17 +00:00
PyTorch MergeBot	c95adb9c5b	Revert "use more elements per thread for narrow dtypes (#139449 )" This reverts commit f5b9e725d14a9a2906b7f1701d97cb4e95891a92. Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a bunch of tests are failing after it lands, it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2452723863))	2024-11-01 23:42:16 +00:00
PyTorch MergeBot	b617d4813c	Revert "fix dynamo tracking numpy 2 ops (#138686 )" This reverts commit 124eac255e3af04379721af09631a45a05c7fb05. Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))	2024-11-01 23:34:06 +00:00
Nikita Shulga	77b72d686e	[BE][MPS] Make metal shaders compile cleanly (#139522 ) I.e. without warnings, by deleting dead code and fixing one signed-unsigned comparison warning Also, pass `-Werror` to metal compiler if WERROR options is set Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522 Approved by: https://github.com/Skylion007	2024-11-01 23:22:47 +00:00
eellison	2382b3b6d8	[Easy] Add joint graph passes, fallback_random to bisector (#139295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295 Approved by: https://github.com/zou3519, https://github.com/exclamaforte	2024-11-01 23:21:53 +00:00
Gabriel Ferns	1e73842029	Refactor FxGraphDrawer to use HTML-like labels (#137726 ) Fixes https://github.com/pytorch/pytorch/issues/137499 Testing: Added a new unit test to make sure that the regression case succeeds. I'm debating about whether to make the borders visible. I'm partial to no borders, but it might make it harder for some people to read? ![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932) Vs. ![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726 Approved by: https://github.com/eellison, https://github.com/malfet	2024-11-01 23:19:50 +00:00
David Berard	60542eeb33	[inductor] set sanitize_overflow=False for triton kernels (#139502 ) In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502 Approved by: https://github.com/bertmaher	2024-11-01 23:10:21 +00:00
Huy Do	da395384a2	Delete Windows GPU jobs in periodic (#139336 ) As an outcome of https://fburl.com/gdoc/voce5o06, we could stop running Windows GPU tests on periodic pending the green light from MS. No one is monitoring these jobs atm. We already have Windows CUDA and CPU build jobs in trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139336 Approved by: https://github.com/ZainRizvi, https://github.com/wdvr, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-11-01 22:26:22 +00:00
Shuqiang Zhang	4c64a7f33f	[pgnccl] add a restart test for PGs in blocking mode (#139496 ) Summary: Restarting (aborting and re-initializing) a PG is a basic need if we want to achieve in-process restart of PGs without tearing down the whole process. Add this test to verify that this is supported by current NCCL. Note that this restart test passes steadily only in blocking mode for now. In nonblocking mode, there is a problem in either NCCL init or abort that needs further investigation. Test Plan: new UT Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501	2024-11-01 22:13:37 +00:00
Huy Do	0b13bdd877	Delete parallelnative jobs in periodic (#139328 ) As an outcome of https://fburl.com/gdoc/voce5o06, we can now clean up parallelnative build and test jobs in periodic. There is not much value in running them anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/139328 Approved by: https://github.com/wdvr, https://github.com/malfet	2024-11-01 22:05:13 +00:00
Huy Do	8eb75cbad6	Delete iOS jobs from periodic (#139345 ) As an outcome of https://fburl.com/gdoc/voce5o06 and after confirming with @iseeyuan, we can now clean up iOS lite interpreter jobs on PyTorch CI. There is not much value in running them anymore. It's stated in https://github.com/pytorch/ios-demo-app/blob/master/README.md that ExecuTorch is the replacement now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139345 Approved by: https://github.com/wdvr, https://github.com/malfet	2024-11-01 22:04:27 +00:00
Huy Do	8ad76efb8d	Delete Vulkan jobs from periodic (#139354 ) As an outcome of https://fburl.com/gdoc/voce5o06, we can clean up this job now as the backend has been marked as deprecated https://pytorch.org/tutorials/prototype/vulkan_workflow.html, to be replaced by the ExecuTorch Vulkan delegate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139354 Approved by: https://github.com/wdvr, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-11-01 22:03:12 +00:00
Mikayla Gawarecki	a979318ef7	Add section to serialization note re weights_only (#139433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433 Approved by: https://github.com/malfet ghstack dependencies: #138936, #139221	2024-11-01 21:51:50 +00:00
Nikita Shulga	a1f854f270	[MPS] Compile kernels into Metallib (#138636 ) The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but more and more often one has had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel was added to the MPS backend in https://github.com/pytorch/pytorch/pull/82307 ). Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550 ). But as the number of kernels keeps growing, it's time to take the next step and properly compile them into `.metallib`. This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adding a check on whether full Xcode is installed or just DeveloperTools
- If full Xcode is installed, compiling and linking shaders into .metallib for the Metal-3.0 (available on MacOS 13) and Metal-3.1 (available on MacOS 14, can use bfloat) standards and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. The logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping .metal into `_metallib.h`, which contains the shader source wrapped in `MetalShaderLibrary`

The bulk of the changes introduced in this PR just move code around. I.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder and the embedded shader definition is replaced with the following

```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:

| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
- Build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636 Approved by: https://github.com/manuelcandales	2024-11-01 21:47:20 +00:00
Edward Z. Yang	a6630bcf87	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-01 21:43:25 +00:00
Xuan Zhang	9c2ffce71a	add condition for freeable input buffer (#139480 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480 Approved by: https://github.com/yf225 ghstack dependencies: #139396	2024-11-01 21:15:40 +00:00
Huy Do	18f3b3c991	Clean up Android jobs in CI (#139350 ) As an outcome of https://fburl.com/gdoc/voce5o06 and after confirming with @iseeyuan, we can now clean up Android lite interpreter jobs on PyTorch CI. There is not much value in running them anymore. It's stated in https://github.com/pytorch/android-demo-app/blob/master/README.md that ExecuTorch is the replacement now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139350 Approved by: https://github.com/ZainRizvi	2024-11-01 21:10:19 +00:00
Sam Larsen	c412a42ae2	[pt2 logging] move remote cache get/put logging up one level (#139423 ) Summary: I need to refactor the way we record CompilationMetrics. It will be much easier to do in OSS and having the relevant timing code in the OSS area of the codebase will make this much easier. I doubt this meaningfully changes the values we see. Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x Differential Revision: D65290089 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423 Approved by: https://github.com/oulgen	2024-11-01 21:06:59 +00:00
Animesh Jain	0e57f2b589	[invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326 Approved by: https://github.com/zou3519 ghstack dependencies: #139216, #139130	2024-11-01 21:02:32 +00:00
Animesh Jain	6a268c3fbb	[invoke_subgraph] Generate fake_inputs correctly (#139130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130 Approved by: https://github.com/zou3519 ghstack dependencies: #139216	2024-11-01 21:02:32 +00:00
Animesh Jain	4c756cacfd	[invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216 Approved by: https://github.com/zou3519	2024-11-01 21:02:32 +00:00
Justin Chu	5d67efb809	[ONNX] New registration API (#135403 ) The ONNX custom ops registration API.

## Design

1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] | Callable]" parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
   - When there is a single callable, it is the translation function for the op
   - When there is a Sequence of callables, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
   - The translation functions can be plain Python functions that call onnxscript ops (traced), or onnxscript functions.
   - Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```

Support for functions that handle complex inputs will be in separate PRs. fixes https://github.com/pytorch/pytorch/issues/138391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403 Approved by: https://github.com/titaiwangms	2024-11-01 20:58:54 +00:00
Natalia Gimelshein	f5b9e725d1	use more elements per thread for narrow dtypes (#139449 ) Fix perf issue for narrow type by accessing more elements per thread Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449 Approved by: https://github.com/Chillee, https://github.com/eqy	2024-11-01 20:41:13 +00:00
Jason Ansel	73c0762a34	[inductor] Simplify remove_kernel_local_buffers (#139452 ) I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365, #139370	2024-11-01 20:36:39 +00:00
Bert Maher	dcdcb8b364	Avoid overflow in float32-to-int32 test (#139489 ) Summary: Triton has added some integer overflow detection when kernels are compiled with `debug=True`, and this test results in integer overflow (2.0 is 0x40000000, times 2 is 0x80000000 which overflows a signed int32). Assertion `int32 overflow detected for operation mul` failed Fixes #139479 Test Plan: ``` python inductor/test_torchinductor.py -k test_float32_to_int32_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489 Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78	2024-11-01 20:22:19 +00:00
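The arithmetic in the message above can be checked by hand. The following is a minimal sketch, assuming the test reinterprets the float32 value 2.0 as an integer bit pattern and then doubles it; the helper name `float32_bits` is hypothetical and is not part of the PyTorch test.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


bits = float32_bits(2.0)  # 0x40000000
doubled = bits * 2        # 0x80000000, one past INT32_MAX

print(hex(bits), hex(doubled), doubled > INT32_MAX)
```

Because `0x80000000` equals `INT32_MAX + 1`, a signed 32-bit multiply of this value by 2 wraps around, which is the condition the Triton assertion reports.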
Yifu Wang	0dbc284a72	[SymmetricMemory] expose signal_pads as tensors in Python (#138754 )

## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754 Approved by: https://github.com/weifengpy, https://github.com/lw	2024-11-01 20:17:15 +00:00
Haifeng Jin	124eac255e	fix dynamo tracking numpy 2 ops (#138686 ) Fixes #136559 As we upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in dynamo tracking. This PR changes the filtering rules to include them while keeping behavior with NumPy 1 unchanged. Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed. For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list. I used the following script to check the differences before and after the change for both NumPy 1 & 2. The output is empty for NumPy 1 since there is no change. The output is a list of `numpy.random` functions that are considered supported for NumPy 2.
```py
from torch._dynamo import trace_rules
import numpy as np


def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv


def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv


rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686 Approved by: https://github.com/lezcano, https://github.com/williamwen42	2024-11-01 19:51:40 +00:00
Mikayla Gawarecki	ea0e09b3f3	Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221 ) Fixes https://github.com/pytorch/pytorch/issues/129698 https://github.com/pytorch/pytorch/pull/139106 without pickletools Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221 Approved by: https://github.com/malfet ghstack dependencies: #138936	2024-11-01 19:31:39 +00:00
rzou	f3b485eb2a	[reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064 ) This is to match the default layout constraint for custom operators. By default, Inductor should match the stride order of inputs to a triton kernel. IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been more than two weeks since this landed. You can flip the config locally as a workaround. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064 Approved by: https://github.com/albanD, https://github.com/eellison	2024-11-01 19:21:16 +00:00
Colin L. Rice	abc5d59dcb	config: create Config objects with JK support (#138766 ) This teaches install_config_module (and the underlying code) to understand Config objects. Additionally, we've added a JK option that resolves the JK. This config gets stored within the _ConfigEntry class and is evaluated when __getattr__ is called. If justknobs is set, it'll call justknobs_check to see the result. Due to preceding work, basically everything works correctly here; we only had to update a couple of tests and modify the getattr behaviour. Note that we are updating the justknobs_check function to support a default option, to make default work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766 Approved by: https://github.com/ezyang	2024-11-01 19:20:37 +00:00
eqy	6fc63b4ef1	[ROCM][CUDA][NCCL] Disable `test_lowering_one_shot_all_reduce` on ROCM (#139414 ) I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed #138029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139414 Approved by: https://github.com/huydhn, https://github.com/yifuwang	2024-11-01 18:39:47 +00:00
Jason Davies	391ee62180	Ensure scalar tensor device matches attn_mask for convert_boolean_attn_mask_cudnn. (#139450 ) This is causing a small performance hit when using SDPA with the cuDNN backend due to unnecessary host-to-device memcpy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139450 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-11-01 18:38:02 +00:00
Sam Larsen	d8b606ecb5	[fx graph cache] Support freezing with FX graph caching (#136505 ) Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505 Approved by: https://github.com/eellison	2024-11-01 18:29:29 +00:00
vladimirrotariu	7d644f025f	make equation behind torch.isclose element-wise (#138459 ) The current formula behind torch.isclose, according to the docs, is ![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd) However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs said that `input` and `other` are the first, respectively second tensor to compare. I propose the following change, to stress the element-wise nature of the norms in the equation: ![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459 Approved by: https://github.com/soulitzer	2024-11-01 18:18:33 +00:00
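The two formula images in the message above are not recoverable here. As a rough illustration of the element-wise reading, here is a minimal sketch in plain Python, assuming the documented condition `|input_i - other_i| <= rtol * |other_i| + atol` with the default tolerances `rtol=1e-05` and `atol=1e-08`. The function `isclose_elementwise` is a hypothetical stand-in, not the PyTorch implementation, and it ignores NaN and infinity handling.

```python
def isclose_elementwise(inp, other, rtol=1e-05, atol=1e-08):
    """Apply the closeness test to each pair (inp[i], other[i]) independently."""
    return [abs(a - b) <= atol + rtol * abs(b) for a, b in zip(inp, other)]


# Each position gets its own verdict; no norm over the whole tensor is taken.
print(isclose_elementwise([1.0, 2.0, 1e-9], [1.0, 2.1, 0.0]))
```

Note that the tolerance scales with `other_i` only, so the test is not symmetric in its two arguments.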
Nikita Shulga	1857be1b48	Fix S390 builds (#139491 ) Caused by https://github.com/pytorch/pytorch/pull/137918 By guarding all cpuinfo use with `!defined(__s390x__ ) && !defined(__powerpc__)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139491 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2024-11-01 18:16:29 +00:00
Nikita Shulga	51adab0829	[MPS] Fix reduction ops outputs for empty tensors (#139446 ) By adding a switch for all reduction types that either sets the output to a given value or raises a runtime error. Before this change, reduction ops returned uninitialized values in many cases. Fixes https://github.com/pytorch/pytorch/issues/139400 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446 Approved by: https://github.com/Skylion007	2024-11-01 17:32:12 +00:00
Bin Bao	7d081cabfb	[AOTI] Forward fix #139458 (#139485 ) Summary: A new test added in https://github.com/pytorch/pytorch/pull/139458 only fails in certain CI instances. Skip for now as the failing test has a low priority. @diff-train-skip-merge (to silence the fb bot so that I can land this myself) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139485 Approved by: https://github.com/huydhn, https://github.com/hl475	2024-11-01 17:14:40 +00:00
Scott Wolchok	3e0f4d18eb	[PyTorch] Support non-zero beta in fp16_gemv_trans (#138275 ) No real reason to have the zero-beta restriction, so let's lift it. Testing: intentionally broke new paths locally to verify test coverage existed Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138275 Approved by: https://github.com/malfet ghstack dependencies: #139082, #139083, #137918, #138005	2024-11-01 16:49:05 +00:00
Scott Wolchok	195b1b9a9b	[PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures (#138005 ) Following up on previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv. Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138005 Approved by: https://github.com/malfet ghstack dependencies: #139082, #139083, #137918	2024-11-01 16:49:05 +00:00
Scott Wolchok	fad5d89321	[PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM (#137918 ) This is the first big milestone we've been building towards! (Following rev also hooks this up to actual gemv.) Testing: To check perf, I ran python torchchat.py generate stories110M --dtype fp16 --device cpu on an x86 machine without AVX512FP16. Observed roughly 5x tokens/sec increase. Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/137918 Approved by: https://github.com/malfet ghstack dependencies: #139082, #139083	2024-11-01 16:48:56 +00:00
Scott Wolchok	d79c5143d8	[PyTorch] Add efficient isnan for NEON half (#139083 ) Same as the efficient one for float when f16 hardware support is available. Testing: Added exhaustive isnan test coverage Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139083 Approved by: https://github.com/malfet ghstack dependencies: #139082	2024-11-01 16:40:51 +00:00
Scott Wolchok	9ecd7d1587	[PyTorch] Add efficient isnan for NEON float (#139082 ) Just test x != x rather than applying element-by-element scalar isnan. Testing: vec_test_all_types checks IsNan Differential Revision: [D65001633](https://our.internmc.facebook.com/intern/diff/D65001633/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139082 Approved by: https://github.com/malfet	2024-11-01 16:40:51 +00:00
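The `x != x` trick relies on the IEEE 754 rule that NaN compares unequal to everything, including itself. A minimal sketch in plain Python follows; the helper `isnan_by_self_compare` is hypothetical and only illustrates the comparison, not the NEON vector code.

```python
import math


def isnan_by_self_compare(values):
    """Flag NaNs by testing each value against itself; only NaN is unequal to itself."""
    return [v != v for v in values]


samples = [0.0, 1.5, float("inf"), float("nan")]
flags = isnan_by_self_compare(samples)

# The self-comparison agrees with the scalar library check on every sample.
print(flags == [math.isnan(v) for v in samples])
```

In a vectorized setting the appeal is that one lane-wise comparison replaces a per-element scalar call.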
sanchitintel	3cbf0c0bbf	[Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688 ) # Summary The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights. This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension. The figure below is from the document linked above - <img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94"> ## Performance data Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.

| M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
| 4096 | 4096 | 4096 | 45.844 ms | 9.322 ms | 5.2181 ms |
| 4096 | 11008 | 4096 | 127.618 ms | 24.6258 ms | 13.6046 ms |
| 4096 | 4096 | 11008 | 121.953 ms | 25.4692 ms | 10.2669 ms |
| 4096 | 32000 | 4096 | 478.450 ms | 75.3942 ms | 48.21 ms |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688 Approved by: https://github.com/jgong5	2024-11-01 16:32:22 +00:00
Jason Ansel	b57b4b7f9b	[inductor] Move remove_kernel_local_buffers to Kernel (#139370 ) This method mutates the kernel, so it fits better in that class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365	2024-11-01 16:28:15 +00:00
Jason Ansel	1e934b473c	[inductor] Remove Node.last_usage mutation (#139365 ) I can't figure out why this is needed. Let's see if tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365 Approved by: https://github.com/shunting314 ghstack dependencies: #139364	2024-11-01 16:28:15 +00:00
Jason Ansel	286d3ce266	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-01 16:28:15 +00:00
Shuqiang Zhang	df0c1eceb9	[pgnccl][simple] clean up unused members of PGNCCL (#139436 ) Summary: Found those unused members when prototyping something else. Better to remove unused members. Test Plan: CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436 Approved by: https://github.com/Skylion007	2024-11-01 16:25:04 +00:00
Bin Bao	33dce10ece	[AOTI][reland] Update zero size computation in clone_preserve_strides (#139458 ) Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensors correctly. Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458 Approved by: https://github.com/hl475	2024-11-01 13:51:02 +00:00
Huy Do	560a0704c5	Use a different test name for testConversionToStringView (#139448 ) Summary: The change comes from D65214804 (https://github.com/pytorch/pytorch/pull/139239) `buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` doesn't like having 2 `testConversionToString` in the same suite `StringViewTest`, so just need to use a different name there. Test Plan: `buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` passes Differential Revision: D65314266 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139448 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-11-01 13:25:16 +00:00
Yifu Wang	e6e140c3d7	[Inductor] fix a compilation time regression caused by user-visible output handling (#139420 ) This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732. The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference. ``` Before the offending PR compilation_latency mean=176.022 seconds compilation_latency mean=176.564 seconds On the offending PR compilation_latency mean=180.096 seconds compilation_latency mean=179.101 seconds On the fix compilation_latency mean=173.153 seconds compilation_latency mean=174.182 seconds ``` (I think the fix being faster than the baseline is due to noise) The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR had the assumption that `len(output_node.args[0])` is rather small. However, it has been proven false by the benchmark (it was 1900+ for timm_models/hrnet_w18). The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420 Approved by: https://github.com/ezyang	2024-11-01 08:27:40 +00:00
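The inefficiency described in the entry above follows a common pattern: calling `list.index` once per element makes the total cost quadratic in the list length. A minimal sketch of that pattern and of the precompute-once alternative is given here; the function names and the string stand-ins for graph nodes are hypothetical and are not Inductor's actual code.

```python
from typing import Dict, Hashable, List


def output_indices_by_scan(outputs: List[Hashable]) -> List[int]:
    """Quadratic pattern: one linear `list.index` scan per output."""
    return [outputs.index(node) for node in outputs]


def output_indices_precomputed(outputs: List[Hashable]) -> Dict[Hashable, int]:
    """Linear pattern: walk the outputs once and remember each position."""
    index_of: Dict[Hashable, int] = {}
    for idx, node in enumerate(outputs):
        # setdefault keeps the first position, matching list.index semantics.
        index_of.setdefault(node, idx)
    return index_of


# Hypothetical stand-in for output_node.args[0] (1900+ entries in the benchmark).
outputs = [f"node_{i}" for i in range(2000)]
index_of = output_indices_precomputed(outputs)
```

With the mapping built once, each per-node lookup is a dictionary access rather than a scan, which is why the cost no longer grows with the square of the number of outputs.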
Yang Wang	307ee7926e	[Workflow][1/3] Remove benchmack tests from rerun disbled tests (#139337 ) Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774) # Overview Remove benchmark tests from rerun-disabled-tests, this is considered non-unittest. See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0) # Manual Test - Test run Inductor.yml: https://github.com/pytorch/pytorch/actions/runs/11603287758/job/32309968542?pr=139337 - Test run inductor-unittest.yml ([3cbd83d](`3cbd83d3d5`)) https://github.com/pytorch/pytorch/actions/runs/11605399925/job/32315737205?pr=139337 # Steps to fix the issue - [x] [THIS PR] Create inductor-unittest.yml to handle unit test and daily rerun for inductor - [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124 - [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test Pull Request resolved: https://github.com/pytorch/pytorch/pull/139337 Approved by: https://github.com/huydhn	2024-11-01 08:23:51 +00:00
Yang Wang	f7407b3de0	[Workflow][2/3] Remove benchmack tests from rerun disbled test (#139407 ) Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774) # Overview Remove benchmark tests from rerun-disabled-tests, this is considered non-unittest. See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0) # Steps to fix the issue - [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor - [x] [THIS PR] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124 - [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test Pull Request resolved: https://github.com/pytorch/pytorch/pull/139407 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2024-11-01 08:09:31 +00:00
Shunting Zhang	5e4c8b671c	[inductor] loaf-fix (#139376 ) Fix https://github.com/pytorch/pytorch/issues/128063 . Now for this snippet ``` def f(x): y = torch.sum(torch.sum(x, dim=-1)) z = x / 10.0 z_t = z.t().contiguous().t() return y, z, z_t ``` Inductor could generate a single kernel for the first reduction and the two pointwise kernels (if loop-ordering after fusion is enabled). And the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise kernels may each access `x` once even if they are fused.) The PR needs to fix 2 subtle bugs regarding LOAF (loop ordering after fusion). 1. When we reorder loops for a FusedSchedulerNode, we check whether each sub-node's sizes match. But some nodes have their sizes in `list` type (if their loops are not reordered) while others have their sizes in `tuple` type (if their loops are reordered). I could change the upstream code to uniformly use either `list` or `tuple`. But without strong enforcement, future code could break this. So I just convert sizes to a uniform type before comparison. 2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376 Approved by: https://github.com/jansel, https://github.com/eellison	2024-11-01 07:54:32 +00:00
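The first bug in the entry above rests on a plain Python fact: a `list` and a `tuple` with the same elements never compare equal. A minimal sketch, assuming flat integer sizes (the real scheduler sizes are more structured) and a hypothetical helper name `sizes_match`:

```python
from typing import Sequence


def sizes_match(a: Sequence[int], b: Sequence[int]) -> bool:
    """Compare two size sequences regardless of list/tuple container type."""
    return tuple(a) == tuple(b)


reordered = (8, 4)      # sizes stored as a tuple
not_reordered = [8, 4]  # the same sizes stored as a list

naive_equal = reordered == not_reordered                  # container types differ
normalized_equal = sizes_match(reordered, not_reordered)  # compares contents only
```

Normalizing at the comparison site, rather than at every producer, is the choice the commit message describes, since it keeps working even if later code reintroduces mixed container types.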
lingzhi98	39ec5a20ea	[Partitioner] Enumerate partitions by iterating partition ids (#136598 ) Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598 Approved by: https://github.com/tarun292 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-01 07:42:36 +00:00
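The change in the entry above can be illustrated with two small dictionaries. This is only a sketch under assumed data shapes (node name to partition id, and partition id to node set); the helper names are hypothetical and the partitioner's real structures may differ.

```python
from typing import Dict, List, Set

# Hypothetical data: node name -> partition id, and partition id -> node set.
assignment: Dict[str, int] = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
partitions_by_id: Dict[int, Set[str]] = {0: {"a", "b"}, 1: {"c", "d", "e"}}


def partition_ids_via_assignment(assignment: Dict[str, int]) -> List[int]:
    """Visits every node (N entries) and deduplicates the ids."""
    return sorted(set(assignment.values()))


def partition_ids_via_partitions(partitions_by_id: Dict[int, Set[str]]) -> List[int]:
    """Visits every partition (P entries); P is usually much smaller than N."""
    return sorted(partitions_by_id)
```

Both functions return the same ids; the second simply touches P keys instead of N values, which is what shrinks the enclosing loop.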
andras_matyassy	61df90e3f6	Add TORCHDYNAMO_EXTENDED_ADVICE (#137159 ) (#137196 ) Fixes #137159 Happy to contribute to this project for the first time. If I missed any contribution guidelines, please let me know, I'm happy to adjust. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137196 Approved by: https://github.com/ezyang	2024-11-01 06:43:26 +00:00
angelayi	86db2cd194	[export] Initial draft export (#139383 ) Differential Revision: [D65288590](https://our.internmc.facebook.com/intern/diff/D65288590) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139383 Approved by: https://github.com/zou3519	2024-11-01 06:25:44 +00:00
FFFrog	300ca6368f	Remove depracated alias macro(2/3) (#137559 ) Detailed Descriptions: - Remove AT_ASSERTM Macro Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559 Approved by: https://github.com/ezyang	2024-11-01 06:17:57 +00:00
William Wen	0c47657b05	[dynamo] ignore False/None callback in fail_on_recompile/force_backend stances (#139215 ) Fix https://github.com/pytorch/pytorch/issues/139202 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139215 Approved by: https://github.com/jansel	2024-11-01 06:15:28 +00:00
cyy	4a2da52137	[1/N] Replace c10::sv with std::sv (#139453 ) Picks some safe replacements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453 Approved by: https://github.com/Skylion007	2024-11-01 05:39:37 +00:00
cyy	6ef6b3f586	Remove const fromDLPack overload (#139156 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139156 Approved by: https://github.com/ezyang	2024-11-01 04:12:46 +00:00
Will Constable	84416618a6	[Pipelining] Update schedules to use I, B actions. (#138886 ) Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD) consistently. Previously, schedules would issue a 'B' operation and leave it ambiguous whether that operation should be BACKWARD_INPUT or FULL_BACKWARD, depending on a separate flag (use_full_backward) passed to the schedule class, which would determine which behavior was taken at runtime. Now, use_full_backward is removed and the schedule class is required to produce unambiguous IR. The logic for 'use_full_backward' is removed from the runtime. _validate_pipeline_order is replaced with _simulate_comms_compute. Both offer similar functionality, to validate the correctness of a schedule IR. 'validate' operates on compute-only IR, while simulate operates on compute + comm IR. To convert from using validate to simulate, you have to first insert comm actions via '_add_send_recv'. 'simulate' was inefficiently written before this PR and needed to be optimized to run quickly for extra large schedules with >32 ranks and microbatches per rank used in some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886 Approved by: https://github.com/H-Huang	2024-11-01 03:54:06 +00:00
Bob Ren	094d288f40	Update tensorify pass to specialize symfloats we didn't tensorify away (#138868 ) As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things: 1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation) 2) It updates the tensorify pass to do the backup specialization This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows: 1) Integrate turning off specialize float only in the automatic dynamic pass. 2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised. 3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868 Approved by: https://github.com/ezyang	2024-11-01 03:18:02 +00:00
James Wu	c8a648d4df	Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309 ) This diff considerably changes the column format of PT2 Compile Events: - Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention. - Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba. Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309 Approved by: https://github.com/oulgen	2024-11-01 02:40:25 +00:00
Yu, Guangye	46bca8a4b6	Export XPU oneDNN header to the public (#139177 ) # Motivation Export the oneDNN header publicly so that, for example, a third-party extension can use `GpuStreamManager` to manage a `dnnl::stream` for submitting oneDNN kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139177 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/malfet	2024-11-01 02:36:16 +00:00
Yang Wang	04382efe5e	[Bash][3/3] Remove benchmack tests from rerun disbled test (#139422 ) Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774) # Overview Remove benchmark tests from rerun-disabled-tests, this is considered non-unittest. See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0) # Steps to fix the issue - [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor - [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124 - [x] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test Pull Request resolved: https://github.com/pytorch/pytorch/pull/139422 Approved by: https://github.com/huydhn	2024-11-01 01:49:58 +00:00
Gabriel Ferns	030f70b40b	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-11-01 01:24:40 +00:00
cyyever	8ace3e8023	Add sv starts/ends_with (#139261 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-01 01:17:42 +00:00
Mikayla Gawarecki	2a309c0997	Fix weights_only for BUILD instructions for user allowlisted objects with __slots__ (#138936 ) Previously `BUILD` instruction missed handling for `__slots__`. This only applies for things allowlisted via `add_safe_globals`/`safe_globals` that use slots. ### Background When does pickle serialize a `BUILD` instruction? When `state` is not `None` and `state_setter` is `None` [[link](`c5b99f5c2c/Lib/pickle.py (L765)`)]. In this case, the docs tell us that either `__setstate__` or a `__dict__` update will be performed [[link](https://github.com/python/cpython/blob/3.13/Lib/pickletools.py#L1984)] `__reduce__`/`__reduce_ex__` are expected to return tuples of length 2 to 6 where `state` is the 3rd argument. When user doesn't patch `__reduce__` but patches `__setstate__`/`__getstate__`, state will be what is yielded by `__getstate__` Note the return type for [`__getstate__` ](https://docs.python.org/3/library/pickle.html#object.__getstate__) - For a class that has no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is None. - For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is `self.__dict__`. - For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is a tuple consisting of two dictionaries: `self.__dict__`, and a dictionary mapping slot names to slot values. Only slots that have a value are included in the latter. 
- For a class that has [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__) and no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__), the default state is a tuple whose first item is None and whose second item is a dictionary mapping slot names to slot values described in the previous bullet. see handling in pickle code `c5b99f5c2c/Lib/pickle.py (L1846-L1867)` Before this PR, we didn't account for the fact that when `__setstate__` is not defined, `state` might be a tuple so this would fail ```python from dataclasses import dataclass # Define the dataclass @dataclass class MyDataClass: __slots__ = ["x", "y"] x: int y: str # Create an instance of the dataclass my_data = MyDataClass(x=2, y=3) # Save the dataclass to a file torch.save(my_data, "my_data.pt") with torch.serialization.safe_globals([MyDataClass]): loaded_my_data = torch.load("my_data.pt", weights_only=True) # AttributeError: 'MyDataClass' object has no attribute '__dict__' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138936 Approved by: https://github.com/malfet	2024-11-01 00:59:29 +00:00
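The default-state rules quoted in the entry above can be observed with the standard library alone. The sketch below uses a plain slots-only class instead of the dataclass from the PR, and `apply_build_state` is a hypothetical helper that mirrors how a `BUILD` step can apply a `(dict-or-None, slots-dict)` tuple when no `__setstate__` is defined; it is not the code in `torch`'s unpickler.

```python
import pickle


class Point:
    __slots__ = ["x", "y"]  # no instance __dict__

    def __init__(self, x: int, y: str) -> None:
        self.x = x
        self.y = y


def apply_build_state(obj: object, state: object) -> None:
    """Hypothetical helper: apply BUILD state when no __setstate__ exists."""
    slot_state = None
    if isinstance(state, tuple) and len(state) == 2:
        state, slot_state = state  # (dict-or-None, slots-dict)
    if state:
        obj.__dict__.update(state)
    if slot_state:
        for name, value in slot_state.items():
            setattr(obj, name, value)


original = Point(2, "three")
# The third element of the reduce tuple is the state that BUILD receives.
state = original.__reduce_ex__(pickle.HIGHEST_PROTOCOL)[2]

restored = Point.__new__(Point)
apply_build_state(restored, state)
```

For this slots-only class the state is a two-item tuple whose first item is `None`, so a handler that only updates `__dict__` has nothing to write to, which matches the `AttributeError` shown in the PR description.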
Jason Ansel	c2ffd41a86	[inductor] Enable AMD cooperative reduction tests (#139230 ) Fixes #139099 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139230 Approved by: https://github.com/eellison	2024-11-01 00:55:13 +00:00
Jason Ansel	f9ef880c0b	[inductor] Refactor kernel args into SIMDKernelFeatures (#139327 ) This is a refactor PR to move stuff around. I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-01 00:30:14 +00:00
PyTorch MergeBot	b6b9596607	Revert "[dynamo] Fix constant propagation in builtins and UserClasses (#131354 )" This reverts commit 44257c063e2f7bd9b35e6e4973f89d7f1cb65442. Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break some internal tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2451050605))	2024-11-01 00:13:20 +00:00
IvanKobzarev	d33849908d	[aotd] Fuse tangents subclasses runtime traversals (#139068 ) Reason: Currently we have multiple traversals for tangents in runtime: - To check that types and structure are identical to what we guessed during tracing time - Coerce metadata - Coerce memory_format - Unwrap_tensor_subclass All of them are traversing tangents via __tensor_flatten__ calls the tree of Subclasses. Change: To do everything in one traversal at runtime (including flattening) Implementation details: Add memory_format information inside SubclassCreationMeta, for PlainTensors keep not only (int) of unwrapped_index, but memory_format too. Preparing memory_format is optional (controlled by with_memory_format=True). 2. Removing unused subclass_utils.create_metadata_for_subclass which does not have any usages inside torch and would require update of the logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068 Approved by: https://github.com/bdhirsh	2024-11-01 00:03:02 +00:00
Xuan Zhang	86602a66d7	[orm] fix live_memory computation in lpmf algorithm (#139396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139396 Approved by: https://github.com/yf225	2024-10-31 23:45:30 +00:00
PyTorch MergeBot	3d3551506d	Revert "[dynamo, 3.13] fix bytecode nop tests (#139323 )" This reverts commit c2d754441f8e941c208579661a04b5ed1e5e71bc. Reverted https://github.com/pytorch/pytorch/pull/139323 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in instruction count metric ([comment](https://github.com/pytorch/pytorch/pull/139323#issuecomment-2451017609))	2024-10-31 23:34:00 +00:00
Chirag Pandya	6727f343b5	[c10d][fr][easy] Move check_no_missing_dump_files (#139417 ) Summary: Move check_no_missing_dump_files to after the "just print" location. This allows us to print dump_files when there are actual missing files. Test Plan: ``` torchfrtrace -j ~/pyper-training-online-924394600 --selected-ranks 1 2 Inferred common prefix nccl_trace_rank_ loaded 95 files in 0.040270328521728516s built groups, memberships Rank 1 Rank 2 ------------------------------------------------------------------ ------------------------------------------------------------------ broadcast(input_sizes=[[2]], state=completed) broadcast(input_sizes=[[2]], state=completed) ``` Without this change, the command was erroring out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2024-10-31 22:55:01 +00:00
Will Constable	8e8040a5c2	[Pipelining] Optimize ready_to_schedule logic (#138924 ) Used in both simulator and add_send_recv pass, the ready_to_schedule logic works by looking at all the previously scheduled ops on a rank to see if any of them 'unblocks' the current op to be scheduled. For example, to schedule a FORWARD op, a previous RECV_F op is needed, unless this is stage 0 or there is a previous stage on the same rank that ran FORWARD already. The old implementation iteratively compared the candidate op to the previous ops. The new implementation uses set lookups to reduce complexity. It also maintains the set of previous ops as ops are scheduled rather than constructing a set on demand. I did not save benchmark results, but this results in a 10-100x speedup which is most noticeable for unit tests with artificially huge schedule IR, the largest of which took longer than 20m before (I never let it finish) but now takes less than 14s. Most schedules take less than 10ms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924 Approved by: https://github.com/H-Huang ghstack dependencies: #138928, #131762	2024-10-31 22:49:45 +00:00
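The FORWARD rule given as an example in the entry above can be sketched with both lookup styles. The `(op, stage, microbatch)` tuple, the function names, and the decision to treat every non-FORWARD op as always ready are simplifying assumptions for illustration, not the pipelining module's real IR or logic.

```python
from typing import List, Set, Tuple

# Hypothetical simplified action: (op name, stage index, microbatch index).
Action = Tuple[str, int, int]


def ready_by_scan(candidate: Action, prev_ops: List[Action]) -> bool:
    """Old style: linear scan over everything scheduled so far."""
    op, stage, mb = candidate
    if op != "FORWARD" or stage == 0:
        return True
    unblockers = (("RECV_F", stage, mb), ("FORWARD", stage - 1, mb))
    return any(p in unblockers for p in prev_ops)


def ready_by_set(candidate: Action, prev_ops: Set[Action]) -> bool:
    """New style: constant-time membership tests against a maintained set."""
    op, stage, mb = candidate
    if op != "FORWARD" or stage == 0:
        return True
    return ("RECV_F", stage, mb) in prev_ops or ("FORWARD", stage - 1, mb) in prev_ops


scheduled: Set[Action] = set()
schedule_order: List[Action] = []
for action in [("FORWARD", 0, 0), ("RECV_F", 1, 0), ("FORWARD", 1, 0)]:
    if ready_by_set(action, scheduled):
        scheduled.add(action)  # maintain the set as ops are scheduled
        schedule_order.append(action)
```

Keeping `scheduled` up to date as each op is appended avoids rebuilding a set on every query, which is the second half of the optimization the commit message mentions.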
Will Constable	c82e0d117a	[Pipelining] Support separate dI / dW and V-schedules (#131762 ) ### Separate dI / dW: PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD or separate dI / dW operations. Separating the B and W may add execution overhead or may be suboptimal in cases where BW are 'fused', but it is worthwhile when separating B, W lets the schedule be more efficient by filling in bubbles. In some cases, the schedule will still issue B followed by W at certain points, so in these cases just merge them back into BW ops and execute them as full backwards rather than executing a B followed by a W. ### V-schedules: V-schedules have a special case where the last rank has 2 adjacent stages. E.g. if rank3 had stage 3 and stage 4, then we should implement direct transfer of stage3 outputs to stage4 inputs without a send/recv. In the scheduling logic, we also must allow scheduling the stage 4 forward after running stage 3 forward, without expecting a stage 4 RECV_F. In the runtime, we pass activations between adjacent stages without using SEND/RECV ops since the stages are on the same rank/process. We add new APIs to PipelineStage abstraction for passing the activations both during forward and backward. Currently the implementation directly modifies the 'recv buffers' the stage is managing, so the forward/backward execution logic does not need to know the difference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762 Approved by: https://github.com/H-Huang ghstack dependencies: #138928	2024-10-31 22:49:45 +00:00
Zhengxu Chen	45da80b970	reland D65167805 "[export] Update min_val and max_val to Optional[int] in serialization." (#139394 ) Summary: had a land racing with another diff D65166035 to fix the schema. According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test Differential Revision: D65273170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394 Approved by: https://github.com/yiming0416	2024-10-31 22:28:32 +00:00
Nikita Shulga	01136fb9e0	Update `MPS_ERROR_RUNTIME_TOO_LOW` message (#139427 ) https://github.com/pytorch/pytorch/pull/133141 updated min os requirement to 13.0, but missed the message Fixes https://github.com/pytorch/pytorch/issues/139425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139427 Approved by: https://github.com/seemethere, https://github.com/kit1980	2024-10-31 22:04:08 +00:00
Donald Tolley	c1e7d85ce6	Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049 ) #### Summary This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others. #### Changes - `weighted_huber_loss`: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter. - `wmse_loss` (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks. - `wmae_loss` (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers. #### Code Details - Input Validation: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors. - Reduction Options: Supports `none`, `mean`, and `sum` reductions to suit various computational needs. - Backward Compatibility: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument. #### Usage Example ```python import torch input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32) target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32) weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32) loss = weighted_huber_loss(input, target, weights, delta=1.0) print(loss) ``` --- Feedback on these implementations is welcome; please let me know if further modifications are required. Resolves #132465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049 Approved by: https://github.com/mikaylagawarecki Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>	2024-10-31 21:59:43 +00:00
Simon Fan	82e74ad40e	[aot autograd] refactor CompiledFunction.backward: control flow (3/N) (#139347 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139347 Approved by: https://github.com/zou3519 ghstack dependencies: #139331, #139343	2024-10-31 21:53:03 +00:00
Simon Fan	8134456a27	[aot autograd] refactor CompiledFunction.backward: epilogue (2/N) (#139343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139343 Approved by: https://github.com/zou3519 ghstack dependencies: #139331	2024-10-31 21:53:03 +00:00
Simon Fan	04ce9ec087	[aot autograd] refactor CompiledFunction.backward: prologue (1/N) (#139331 ) So for functional autograd + CA, most nodes are inlined in aot autograd. But user-defined callables aren't safe to make_fx unless dynamo traces through them. The AOT backward must be inlined by dynamo time. We plan to directly insert calls to the backward in the graph: - call prologue - call bwd graph - call epilogue Restructuring our AOT bwd implementation will make this implementation easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331 Approved by: https://github.com/zou3519	2024-10-31 21:53:03 +00:00
angelayi	8c22e09e39	[aoti] Add masked_select to cshim (#139071 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071 Approved by: https://github.com/desertfire	2024-10-31 21:52:53 +00:00
PyTorch MergeBot	b9acbde4fd	Revert "Update tensorify pass to specialize symfloats we didn't tensorify away (#138868 )" This reverts commit a49457279919b324d8ca1db85636d16d6dfd4e0f. Reverted https://github.com/pytorch/pytorch/pull/138868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the new tests are failing on fbcode ([comment](https://github.com/pytorch/pytorch/pull/138868#issuecomment-2450863895))	2024-10-31 21:46:06 +00:00
Laith Sakka	6a1c451479	Don't uselessly recompute axiom dict every static eval call (#138967 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967 Approved by: https://github.com/ezyang	2024-10-31 21:16:55 +00:00
PyTorch MergeBot	c4d9428b17	Revert "[AOTI] Update zero size computation in clone_preserve_strides (#139224 )" This reverts commit 206a8dde68faef052dfeedabb4180179ab24015e. Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914))	2024-10-31 21:05:07 +00:00
Joel Schlosser	ddb291a881	Fix and test several NJT reductions (#139317 ) I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit. It applies to the following ops: * `sum` / `mean` / `prod` * `all` / `any` * `amin` / `amax` * `min` / `max` * `argmin` / `argmax` The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, args, kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim. Extensive test coverage includes: reductions across ragged dim * reductions across non-batch, non-ragged dims * reductions across both batch and ragged dims * multiple dim reductions (for ops that support this) * full reduction -> scalar Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317 Approved by: https://github.com/cpuhrsch	2024-10-31 20:55:38 +00:00
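The identity-element idea in the entry above can be shown without tensors: padding each ragged row with the reduction's identity leaves the result unchanged while making every row the same length. This is a pure-Python sketch with a hypothetical `reduce_ragged` helper; it is not the `_apply_reduction` helper from the PR and does not use nested tensors.

```python
from functools import reduce
from typing import Callable, List, Sequence


def reduce_ragged(
    rows: Sequence[Sequence[float]],
    func: Callable[[float, float], float],
    identity: float,
) -> List[float]:
    """Reduce each variable-length row after padding it to a common length."""
    width = max((len(row) for row in rows), default=0)
    # Padding with the identity element cannot change the reduction result.
    padded = [list(row) + [identity] * (width - len(row)) for row in rows]
    return [reduce(func, row, identity) for row in padded]


ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
sums = reduce_ragged(ragged, lambda a, b: a + b, 0.0)
maxes = reduce_ragged(ragged, max, float("-inf"))
```

The identity depends on the op: 0 for `sum`, 1 for `prod`, negative infinity for `max`, and so on, which is why the helper described in the commit takes it as an argument.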
PyTorch MergeBot	abb0dd4b00	Revert "[inductor] patterns to remove pointless view/permute pairs (#139136 )" This reverts commit 2b86cd74a60ca2483173ba3012506aeac85ab2d7. Reverted https://github.com/pytorch/pytorch/pull/139136 on behalf of https://github.com/ZainRizvi due to Sorry but this PR seems to have broken on trunk. The failure: distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op [GH job link](https://github.com/pytorch/pytorch/actions/runs/11615060962/job/32346609889) [HUD commit link](`2b86cd74a6`) ([comment](https://github.com/pytorch/pytorch/pull/139136#issuecomment-2450796414))	2024-10-31 20:54:17 +00:00
Justin Chu	76b5ee1119	[ONNX] Set flags correctly in tests (#139413 ) Previously the flag was set via an envvar; since the envvar was read at initialization, the flag may not have been set correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139413 Approved by: https://github.com/titaiwangms	2024-10-31 20:46:23 +00:00
Jerry Zhang	938803df94	Add bfloat16 support for per tensor/channel cpu/cuda fake quantize ops (#139306 ) Summary: Fixes https://fb.workplace.com/groups/2240361332735959/permalink/8190736677698365 Test Plan: buck2 test 'fbcode//mode/dev' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)' buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_tensor_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)' buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cuda (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)' buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)' Differential Revision: D65221710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139306 Approved by: https://github.com/navsud	2024-10-31 20:41:15 +00:00
drisspg	53c9c19e76	[Autotune Inductor] Some clean up and dataclassing (#139157 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139157 Approved by: https://github.com/eellison	2024-10-31 20:04:55 +00:00
William Wen	c2d754441f	[dynamo, 3.13] fix bytecode nop tests (#139323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323 Approved by: https://github.com/jansel	2024-10-31 20:03:43 +00:00
Guilherme Leobas	1518cf426b	Remove `@skipIfTorchDynamo` from test_extremal_numerics_l1_loss_cpu test (#139318 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139318 Approved by: https://github.com/zou3519, https://github.com/williamwen42	2024-10-31 19:57:28 +00:00
PyTorch MergeBot	886579af99	Revert "Use static_assert to detect get_type_index used in device code (#139173 )" This reverts commit d391ed3f4ec6b1a78f7b34e27cba74b37d885475. Reverted https://github.com/pytorch/pytorch/pull/139173 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139173#issuecomment-2450695123))	2024-10-31 19:50:19 +00:00
Shivam Raikundalia	ac7acfb894	[Profiler] Create Auto-Trace Frontend for Trace ID (#139310 ) Summary: This PR adds Auto-Trace implementation for Trace ID. By default, the python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the python generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace we can simply keep overwriting the backend trace id whenever autotrace is run. If we one day want to eventually do something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling. We also implement a custom callback in the frontend such that users can generate their own trace ids if they wish to. This works similarly to the default, the only difference being that they have to manually set this callback after a profiler is generated. We use a specific call to set this rather than putting it in the frontend initializer in case users want to change the trace_id for different repeats. Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched. Differential Revision: D65178308 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310 Approved by: https://github.com/shengfukevin	2024-10-31 19:02:57 +00:00
Xuehai Pan	7faf0ad913	[dynamo] fix `deque.maxlen` support when extending elements from left (#139279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139279 Approved by: https://github.com/jansel	2024-10-31 18:38:11 +00:00
bskrlj	8e27833e30	Ensure SWA boundary conditions w.r.t. definition (#133773 ) According to the documentation, decay is a number in the [0,1] range, [i.e.](https://pytorch.org/docs/stable/optim.html) ``` Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999. ``` An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts as suggested in this PR ensures valid compute range (one way to enforce correct behavior, there are perhaps more suitable ones). Papers `torch` cites for reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415). Fixes https://github.com/pytorch/pytorch/issues/133772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773 Approved by: https://github.com/janeyx99	2024-10-31 18:24:08 +00:00
Will Constable	547d921462	[Pipelining] Remove unused special case from simulator (#138928 ) The special case was added during experimentation with batched send/recv ops. The ops needed to be jointly scheduled or the simulator would think that each op was unschedulable since each contained a recv that depended on the other's send. The workaround I added was to let the scheduler 'peek' one op ahead for unblocking, which let batched ops be scheduled but also changed the behavior of non-batched ops. It let RECV ops be simulated one step earlier than the unblocking SEND ops, which shortened the simulated duration of schedules. Removing this workaround simplifies the simulator but more importantly lends itself to optimizing the runtime of the simulator by making it much easier to avoid copying or extending lists of previous ops on each iteration. It also restores the output of the simulator for non-batched ops to a more natural output where RECV must happen at the same time or later than matching SEND, rather than possibly a step earlier. For example, for this test: `python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0` Before: ``` Step 0: 0F0 1RECV_F0 Step 1: 0SEND_F0 Step 2: 0F1 1RECV_F1 Step 3: 0SEND_F1 1F0 Step 4: 0RECV_B0 1B0 Step 5: 0B0 1SEND_B0 Step 6: 1F1 Step 7: 0RECV_B1 1B1 Step 8: 0B1 1SEND_B1 ``` After: ``` Rank 0 Rank 1 Step 00: 0F0 Step 01: 0SEND_F0 1RECV_F0 Step 02: 0F1 Step 03: 0SEND_F1 1RECV_F1 Step 04: 1F0 Step 05: 1B0 Step 06: 0RECV_B0 1SEND_B0 Step 07: 0B0 1F1 Step 08: 1B1 Step 09: 0RECV_B1 1SEND_B1 Step 10: 0B1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928 Approved by: https://github.com/H-Huang	2024-10-31 17:48:35 +00:00
Nikita Shulga	9d096e4d9f	Don't use deprecated type properties in UpsampleKernel (#139399 ) By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399 Approved by: https://github.com/Skylion007 ghstack dependencies: #139353, #139358	2024-10-31 17:32:19 +00:00
Bin Bao	206a8dde68	[AOTI] Update zero size computation in clone_preserve_strides (#139224 ) Summary: clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensor correctly. Fix that. Differential Revision: [D65250405](https://our.internmc.facebook.com/intern/diff/D65250405) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139224 Approved by: https://github.com/angelayi	2024-10-31 17:07:18 +00:00
eellison	f93ebb2cf4	[Easy] Refactor post grad application of passes (#139293 ) Refactors GraphTransformObserver to hook into the bisect manager pass application. And reworks post grad passes to use it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139293 Approved by: https://github.com/exclamaforte ghstack dependencies: #139292	2024-10-31 17:05:27 +00:00
Shuqiang Zhang	5075046db2	[c10d] separate comm init from getNCCLComm (#139362 ) Summary: This PR is a no-op, but it clearly separates the init logic from getNCCLComm. getNCCLComm is now a purely read-only function. Test Plan: existing CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139362 Approved by: https://github.com/wconstab	2024-10-31 16:58:20 +00:00
James Wu	864beebb41	[easy] Add start event metadata to collected metadata for PT2 Compile Events (#139289 ) We should be logging metadata from event starts to PT2 Compile Events too. Differential Revision: [D65070086](https://our.internmc.facebook.com/intern/diff/D65070086/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139289 Approved by: https://github.com/oulgen	2024-10-31 16:52:30 +00:00
Tomasz Bohutyn	dd6263e2fb	Implement HPUHooksInterface (#137338 ) Fixes #137262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137338 Approved by: https://github.com/guangyey, https://github.com/albanD Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2024-10-31 16:26:19 +00:00
PyTorch MergeBot	87f1990697	Revert "Don't uselessly recompute axiom dict every static eval call (#138967 )" This reverts commit 24b695ae2d5d85a3bda0e493fb4631d5e0add290. Reverted https://github.com/pytorch/pytorch/pull/138967 on behalf of https://github.com/ZainRizvi due to Sorry, looks like this PR introduced a failure that was incorrectly classified as flaky, and the log classifier didn't identify the right log line either ([comment](https://github.com/pytorch/pytorch/pull/138967#issuecomment-2450228525))	2024-10-31 15:54:18 +00:00
Shunting Zhang	2b86cd74a6	[inductor] patterns to remove pointless view/permute pairs (#139136 ) These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph. Consider this snippet: ``` class LinearAndCEL(nn.Module): def __init__(self): super().__init__() self.linear = nn.Linear(C, V) self.ce = nn.CrossEntropyLoss() def forward(self, x, y): return self.ce(self.linear(x).view(B * T, V), y.view(-1)) ``` `x` passed to `forward` is a 3D tensor of shape [B, T, C]. `self.linear` will first view `x` as a [B x T, C] tensor, do the matmul to produce a [B x T, V] tensor, and then view this output back to a 3D tensor of shape [B, T, V]. User code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes appears in the backward part when we compute gradients. The view ops make it hard to chunk linear+CEL. When a view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's nice to do in general. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-10-31 15:35:46 +00:00
Sam Larsen	d21a25c6b7	[fx graph cache] Refactor FxGraphCachePickler, step 2 (#138683 ) Summary: Move all the custom `_reduce_*` functions inside the FxGraphCachePickler class. This is mostly a cosmetic change since they're conceptually members of FxGraphCachePickler. But also in an upcoming diff, I'll add a member variable to the class to control how we handle constant tensors, so it will be convenient to be able to query that setting via `self`. I made the analogous changes to AOTAutogradCachePickler for consistency. Test Plan: unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/138683 Approved by: https://github.com/eellison ghstack dependencies: #138681, #138682	2024-10-31 15:12:18 +00:00
Nikita Shulga	92a2a9ded2	[BE] And delete `DeprecatedTypProperties` cast (#139358 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358 Approved by: https://github.com/ezyang ghstack dependencies: #139353	2024-10-31 14:39:22 +00:00
FFFrog	ea07718a5a	Remove redundant warning compress (#139367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139367 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-10-31 14:39:19 +00:00
augusto.yjh	c934ed6567	init kineto after torch module initialized (#131448 ) Fixes #131020 As discussed in the issue thread, we can use `KINETO_DAEMON_INIT_DELAY_S` to delay the initialization of `kineto` in case `kineto` is initialized before `libtorch_cuda.so`. It's not clear how to choose a proper value for the environment variable `KINETO_DAEMON_INIT_DELAY_S`, so here's a trick to make `kineto` initialize after the `torch` module is initialized. I'm not sure whether this is an acceptable trick, please take a look at this PR, thanks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131448 Approved by: https://github.com/sraikund16, https://github.com/briancoutinho	2024-10-31 13:24:24 +00:00
rzou	ccaa2a206a	[inductor] make requires_stride_order more unbacked-symint-aware (#137063 ) Previously, we tried to sort SymInt strides to determine the stride order. This PR makes the sorting more unbacked symint aware: given a Tensor with sizes (u0, u1, u2), it has strides (u1 * u2, u1, 1), which is sortable under the guard_size_oblivious assumptions. Test Plan: - test case Pull Request resolved: https://github.com/pytorch/pytorch/pull/137063 Approved by: https://github.com/eellison	2024-10-31 13:11:02 +00:00
Wu, Chunyuan	3192bdeea4	[AOTI] Use `len(serialized_weights)` when calculating `consts_size` (#139054 ) Fixes the failure of INT8 DLRM using AOTI. The previous code calculates `consts_size` directly using `tensor` from `graph.constants`: ``` consts_size = sum( get_nbytes_of_tensor(tensor, all_cuda) for (name, tensor) in graph.constants.items() if name not in graph.folded_constants ) ``` Meanwhile, the actual bytes to serialize (`serialized_weights`) are computed from `graph.get_original_value_of_constant(name)`: ``` serialized_weights = b"".join( _to_bytes(graph.get_original_value_of_constant(name), all_cuda) for name in graph.constants.keys() if name not in graph.folded_constants ) ``` `tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)`, thus making `consts_size` inconsistent with the actual byte size of `serialized_weights` and resulting in the runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205. This PR directly gets `consts_size` using `len(serialized_weights)`, which fixes the inconsistency. We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issues on CI machines (earlier CPUs without VNNI). Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-10-31 09:54:16 +00:00
Laith Sakka	24b695ae2d	Don't uselessly recompute axiom dict every static eval call (#138967 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967 Approved by: https://github.com/ezyang	2024-10-31 07:46:35 +00:00
Scott Wolchok	73fde0d940	[PyTorch] Unbreak C10_ALWAYS_INLINE_ATTRIBUTE on MSVC (#139363 ) At least one recent version refuses to accept it on a lambda, so disable. Differential Revision: [D65250256](https://our.internmc.facebook.com/intern/diff/D65250256/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D65250256/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/139363 Approved by: https://github.com/ngimel, https://github.com/malfet	2024-10-31 07:40:05 +00:00
Huy Do	f98bc9a49d	Revert D65167805 (#139371 ) Summary: This diff reverts D65167805, which broke the release pipeline Test Plan: NA Differential Revision: D65245198 @diff-train-skip-merge (to silence facebook-github-bot until I have a stamp to land this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139371 Approved by: https://github.com/malfet	2024-10-31 07:25:28 +00:00
Nikita Shulga	86e6513c86	[BE] Remove deprecated `AT_DISPATCH_ALL_TYPES_AND_HALF` (#139353 ) It's been deprecated for 2 years now, time to delete Pull Request resolved: https://github.com/pytorch/pytorch/pull/139353 Approved by: https://github.com/ezyang	2024-10-31 07:06:19 +00:00
Jeff Daily	a7479fa282	TunableOp use dense size calculations as minimum sizes (#139137 ) Fixes #139116. Also fixes other unreported issues with torch.bmm due to incorrect size calculations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139137 Approved by: https://github.com/yoyoyocmu	2024-10-31 06:01:58 +00:00
Nhat Minh Luu	261d90c18f	Add docs page for `torch.inf` and `torch.nan` (#138430 ) Fixes #131040 ## Description Add docs for `torch.inf` and `torch.nan`, ## Checklist - [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER") - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430 Approved by: https://github.com/ezyang	2024-10-31 05:46:46 +00:00
cyy	f95c71867e	[9/N] Fix extra warnings brought by clang-tidy-17 (#139286 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286 Approved by: https://github.com/ezyang	2024-10-31 05:20:31 +00:00
FFFrog	42b5e191ae	Fix the example of fx/interpreter (#139368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139368 Approved by: https://github.com/ezyang	2024-10-31 05:12:43 +00:00
Yu, Guangye	d08dbd0436	Update torch-xpu-ops commit pin (#139041 ) # Motivation This PR intends to update torch-xpu-ops commit pin. It mainly includes the following two highlighted changes: 1. split the DLL library into 4 smaller libraries to avoid the 2G limitation on Windows; 2. some new operators added, for example, `cdist`, `pdist`, `maxunpool2d`, `maxunpool3d`, `upsample_trilinear3d`, Bessel operators, etc. # Additional Context We have to supply XPU device check logic in `cdist` and `pdist` ops. This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix Windows build issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041 Approved by: https://github.com/EikanWang, https://github.com/ezyang	2024-10-31 05:06:06 +00:00
Bob Ren	74b7fb9519	Add conjugate method on SymFloat (#139249 ) Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249 Approved by: https://github.com/ezyang	2024-10-31 04:55:36 +00:00
kshitij12345	0cf4cc3d5f	[fx] split_module subgraph should always have an output node (#139275 ) Fixes https://github.com/pytorch/pytorch/issues/138207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139275 Approved by: https://github.com/ezyang	2024-10-31 04:53:19 +00:00
Sam Larsen	e3e3ab805b	[fx graph cache] Refactor FxGraphCachePickler (#138682 ) Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters. To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods. Test Plan: unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682 Approved by: https://github.com/eellison ghstack dependencies: #138681	2024-10-31 03:31:51 +00:00
cyy	70ba471957	[3/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139248 ) Follows #139158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139248 Approved by: https://github.com/ezyang	2024-10-31 03:29:19 +00:00
cyy	1dd503c6fb	[4/N] Fix Wextra-semi warning (#139256 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139256 Approved by: https://github.com/ezyang	2024-10-31 03:01:14 +00:00
Piotr Bialecki	bd88d40e5f	[Submodule] update submodule onnx==1.17.0 (#139128 ) Follow-up PR of: https://github.com/pytorch/pytorch/pull/138719 CC @malfet @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/139128 Approved by: https://github.com/malfet	2024-10-31 02:50:00 +00:00
cyy	29297731bb	[5/N] Don't skip ASAN on some tests (#139265 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139265 Approved by: https://github.com/ezyang	2024-10-31 02:49:03 +00:00
Wu, Chunyuan	d7411c0cc1	[AOTI] add C shim for QConvPointWise (#138540 ) This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects: 1. The parameter order of `qconv_pointwise` and `qlinear_pointwise` are quite different, we aligned the schema of `qconv_pointwise` to have similar parameter order as `qlinear_pointwise` to make it more consistent. 2. We always converted `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions). Instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and don't need to change the C shim layer API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540 Approved by: https://github.com/jgong5, https://github.com/desertfire ghstack dependencies: #138691, #138806	2024-10-31 02:03:01 +00:00
Oguz Ulgen	69ea2e726c	Consolidate Triton cache into Inductor cache (#138239 ) Summary: This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate. Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892 I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR. In terms of instrumentation 1) Dynamo compile: `triton_bundler_time_saved_s` this is sum of all triton.compile calls. We dont have to use the specific number, can use this as a binary value. 2) Events table: I used dynamo_timed to measure how much time we spend on bundler collect and write functions which is all the work we do in this diff 3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well Test Plan: Updated unit tests Adhoc running ``` TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner ``` gives https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json <img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93"> Differential Revision: D64504909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239 Approved by: https://github.com/ezyang	2024-10-31 01:37:16 +00:00
Edward Z. Yang	c7f1fccd7a	Globally enable Python dispatcher for all of Inductor compilation (#137621 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137621 Approved by: https://github.com/eellison	2024-10-31 01:35:23 +00:00
PyTorch MergeBot	289e03a429	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit 8840889c3f6565b7975150adebcbe062f19035ee. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2448824206))	2024-10-31 01:32:15 +00:00
Yidi Wu	38429938de	[cond] make cond not throw warnings on constant pred in eager mode (#138837 ) We don't raise warnings for torch.cond in eager mode; the motivation is in https://github.com/pytorch/pytorch/issues/138782. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138837 Approved by: https://github.com/zou3519	2024-10-31 01:13:19 +00:00
Saurabh Mishra	b90503d9ae	[DCP] Unit Test to validate the stateful and non-stateful loads (#139251 ) Summary: Unit Test to validate the stateful and non-stateful loads. This test is a follow up to the fix in [#138575](https://github.com/pytorch/pytorch/pull/138575) which addresses an issue in stateful dict's in-place updates in distributed checkpoint loading. Also, added additional code comments regarding the stateful and non-stateful loads. Test Plan: ``` buck2 test //caffe2/test/distributed/checkpoint/e2e:test_e2e_save_and_load ``` https://www.internalfb.com/intern/testinfra/testrun/8162774562859797 Differential Revision: D65188659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139251 Approved by: https://github.com/LucasLLC, https://github.com/fegin	2024-10-31 01:12:51 +00:00
Nichols A. Romero	7ed0d69004	[ROCm] Increase hipBLASLt default workspace size (#139300 ) This PR increases hipBLASLt default workspace size to 76 MB which is the recommended default. This PR does not contain any bug fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139300 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2024-10-31 00:56:54 +00:00
PyTorch MergeBot	42d790bb65	Revert "Add conjugate method on SymFloat (#139249 )" This reverts commit bcf8a0124fbadb469f6766eb7555a75ea0fa9d43. Reverted https://github.com/pytorch/pytorch/pull/139249 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/139249#issuecomment-2448755839))	2024-10-31 00:45:48 +00:00
eellison	4db6b740bc	[Easy] GraphTransformObserver Refactoring (#139292 ) Uses `torch._inductor.config.trace.log_url_for_graph_xform` by default as the log url. It was only ever instantiated with this as the log_url argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139292 Approved by: https://github.com/shengfukevin, https://github.com/shunting314	2024-10-31 00:33:28 +00:00
Yu, Guangye	8fa0bc3358	Use cached dnnl::stream in GpuStreamManager (#139176 ) # Motivation The code changes in `GpuStreamManager` class intend to help manage `dnnl::stream` efficiently. # Additional Context Use the following code as a simple benchmark. ```python import torch import time device = torch.device("xpu") M, N, K = 64, 64, 64 # You can change these dimensions as needed torch.manual_seed(0) A = torch.randn(M, K, device=device) B = torch.randn(K, N, device=device) # Warm-up for _ in range(10): torch.matmul(A, B) s1 = torch.xpu.Stream() s2 = torch.xpu.Stream() # Measure the time for the GEMM operation start_time = time.time() with torch.xpu.stream(s1): for _ in range(50000): C = torch.matmul(A, B) with torch.xpu.stream(s2): for _ in range(50000): D = torch.matmul(A, B) torch.xpu.synchronize() end_time = time.time() # Calculate elapsed time elapsed_time = end_time - start_time # Print the results print(f"Time taken for GEMM operation: {elapsed_time:.6f} seconds") ``` Compared with the old implementation, which takes 2.077069s, the new implementation takes 2.023017s, which means ~2% performance improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139176 Approved by: https://github.com/gujinghui, https://github.com/jgong5	2024-10-31 00:23:39 +00:00
Brian Hirsh	f81223938c	support nesting of suppress_guards, suppress guards when generating compiled autograd graph (#138968 ) Fixes https://github.com/pytorch/pytorch/issues/138920. See comments there for details. I still need to try to get a smaller repro to write an actual test. But with the guards suppressed, I no longer see the specialization in the CA graph in the linked example: ``` aot1_view_3: ... = torch.ops.aten.view.default(aot1_tangents_1, [aot1_sym_size_int, 48, 1]) aot1_view_4: ... = torch.ops.aten.view.default(aot1_view_3, [aot1_sym_size_int, 48]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138968 Approved by: https://github.com/yf225, https://github.com/xmfan	2024-10-31 00:13:39 +00:00
cyy	d391ed3f4e	Use static_assert to detect get_type_index used in device code (#139173 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139173 Approved by: https://github.com/r-barnes, https://github.com/ezyang	2024-10-31 00:06:53 +00:00
Catherine Lee	f747bd2947	Move slow test query to ClickHouse (#139322 ) Example run: https://github.com/pytorch/pytorch/actions/runs/11602255032/job/32306827867?pr=139322 (pr creation commented out), also tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/139322 Approved by: https://github.com/huydhn	2024-10-30 23:58:27 +00:00
cz2h	48854cbfc4	Add missing operator and corresponding unittest (#138309 ) Fixes https://github.com/pytorch/pytorch/issues/129690 Add operator.neg and operator.pos into _SYM_BOOL_OPS. Provide a simple unit test under export/test_serialize.py that can reproduce the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138309 Approved by: https://github.com/ezyang, https://github.com/angelayi	2024-10-30 23:50:24 +00:00
Sherlock Huang	f32b9a5145	Fx graph always return tuple in fuse_as_graphmodule (#139236 ) Summary: As title. Test Plan: Let's see what OSS CI says Differential Revision: D65147426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139236 Approved by: https://github.com/ezyang	2024-10-30 23:31:06 +00:00
Bob Ren	a494572799	Update tensorify pass to specialize symfloats we didn't tensorify away (#138868 ) As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things: 1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation) 2) It updates the tensorify pass to do the backup specialization This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows: 1) Integrate turning off specialize float only in the automatic dynamic pass. 2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised. 3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868 Approved by: https://github.com/ezyang	2024-10-30 23:28:25 +00:00
Bob Ren	bcf8a0124f	Add conjugate method on SymFloat (#139249 ) Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249 Approved by: https://github.com/ezyang	2024-10-30 23:28:09 +00:00
Bob Ren	a426837f85	Don't set replacement if lhs is in the free symbols of the rhs (#139250 ) Fixes python test/dynamo/test_functions.py FunctionTests.test_is_integer when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139250 Approved by: https://github.com/ezyang	2024-10-30 23:21:30 +00:00
Catherine Lee	754b262bdb	Move close_nonexistent_disable_issues.py queries to ClickHouse (#139296 ) Example run: https://github.com/pytorch/pytorch/actions/runs/11601996563/job/32305991204?pr=139296 (commented out the part that actually closes issues but the queries run) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139296 Approved by: https://github.com/huydhn	2024-10-30 23:09:39 +00:00
Edward Z. Yang	ae6cbd4256	Block more keys from config serialization (#139285 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139285 Approved by: https://github.com/jovianjaison, https://github.com/markkm, https://github.com/c00w	2024-10-30 23:05:59 +00:00
Will Constable	4a8d12227e	[Pipelining] add schedule simulator and chrometrace dump (#138134 ) Schedule simulator is useful for detecting hangs in schedules and validating that they won't hang. It also inserts bubbles (None actions) at any timestep where a rank can not enqueue its next action due to unmet dependencies, which can serve as a rough metric for schedule efficiency. The output can be visualized. The simulator expects a full comm + compute schedule as input. Chrometrace dump is a basic visualization utility. It currently just renders one 'process' per rank, and lets users visualize the schedule in a UI instead of as text. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134 Approved by: https://github.com/H-Huang	2024-10-30 23:00:58 +00:00
PyTorch MergeBot	ec5fbee6c0	Revert "Drop caffe2 string_utils (#139217 )" This reverts commit 1797a2035d92d25d3dcc46fd8facdd6569b30c53. Reverted https://github.com/pytorch/pytorch/pull/139217 on behalf of https://github.com/huydhn due to Chatting with @r-barnes, this is still used in lots of place internally ([comment](https://github.com/pytorch/pytorch/pull/139217#issuecomment-2448568071))	2024-10-30 22:23:32 +00:00
Yukio Siraichi	fef5e94657	`addmm`: error on output dtype mismatch. (#138520 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138520 Approved by: https://github.com/ezyang ghstack dependencies: #138515	2024-10-30 21:46:39 +00:00
Yukio Siraichi	6da3a043a8	Add test for consistency between meta and CPU devices. (#138515 ) Reference: https://github.com/pytorch/pytorch/issues/138399 This PR introduces an `OpInfo` test that checks whether running each `out=` operation using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically, it tests the case where the output tensors are not of the expected data type. According to the `out=` specification, some operations should error. I have added XFAIL to the set of operations that are currently failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515 Approved by: https://github.com/ezyang	2024-10-30 21:46:39 +00:00
Catherine Lee	24c9683355	[mergebot] Add ci-no-td label on revert (#139218 ) Just in case? Pull Request resolved: https://github.com/pytorch/pytorch/pull/139218 Approved by: https://github.com/wdvr	2024-10-30 21:36:09 +00:00
Gabriel Ferns	8840889c3f	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-10-30 21:35:50 +00:00
Richard Zou	ad0883a288	[real_tensor_prop] Infer Fake kernels during real tensor prop (#139213 ) This PR changes real_tensor_prop to also infer fake kernels when the operator doesn't have it. We infer the fake output to be of the same properties as the real output, with unbacked symints in the sizes and some stride order. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/139213 Approved by: https://github.com/pianpwk ghstack dependencies: #139212	2024-10-30 21:29:33 +00:00
Zhengxu Chen	03ec25053a	[export] Update min_val and max_val to Optional[int] in serialization. (#139223 ) Summary: According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity. Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_serialize_infinite_sym_int Differential Revision: D65167805 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139223 Approved by: https://github.com/yiming0416	2024-10-30 21:14:17 +00:00
Xu Han	6d5944c9f1	turn off USE_MIMALLOC_ON_MKL temporary. (#139204 ) Fixes #138994 Turn off `USE_MIMALLOC_ON_MKL` temporarily, since it caused the failure in https://github.com/pytorch/pytorch/issues/138994. For a complete fix, we need to fix the `USE_STATIC_MKL` lost-functionality issue (https://github.com/pytorch/pytorch/pull/138996) and then determine the correct MKL linking type (shared/static). That still needs some time to pass all CI and builder scripts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139204 Approved by: https://github.com/ezyang	2024-10-30 21:09:21 +00:00
Eddie Yan	05cb98f91d	[TF32][Inductor] Account for TF32 in `test_inductor_layout_optimization_input_mutations` (#138948 ) Tests using a conv2d kernel which can dispatch to a TF32-backed implementation Pull Request resolved: https://github.com/pytorch/pytorch/pull/138948 Approved by: https://github.com/ezyang	2024-10-30 20:34:16 +00:00
Huy Do	77e25d57b0	Create ciflow/inductor-periodic (#138763 ) This is related to https://github.com/pytorch/pytorch/issues/138476. This would save about 1/8 of the total cost: not a big number, but still a saving. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138763 Approved by: https://github.com/desertfire	2024-10-30 19:59:44 +00:00
Richard Zou	ef380f7b8e	[real tensor prop] Add some asserts for custom ops (#139212 ) When we see a custom op: - check that its mutation annotations are correct - check that its aliasing constraints match our constraints for custom ops. Otherwise, there may be undefined behavior. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212 Approved by: https://github.com/angelayi	2024-10-30 19:29:11 +00:00
Alex Baden	5c6d35482e	[Inductor] Support Triton AttrsDescriptor cls field (#139193 ) Fixes #139179 Adding corresponding changes to https://github.com/triton-lang/triton/pull/4888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139193 Approved by: https://github.com/bertmaher	2024-10-30 18:16:38 +00:00
Pian Pawakapan	180d283156	[export] avoid debug name crash for dim hints (#139104 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139104 Approved by: https://github.com/ezyang	2024-10-30 18:12:44 +00:00
Yifu Wang	7765d1ef70	Preliminary registered-buffer collective support via Inductor (#138029 ) ``` NOTE [lowering-time collective optimization] In collective communication libraries such as NCCL, every rank maintains communication buffers that are remotely accessible by some peers. Depending on the underlying transport, remote accessibility may be established via mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these buffers are private to the communication library by default, and communication ops copy user data in and out of these buffers. To prevent these copies, an optimization commonly known as "user buffer registration" can be employed. This allows direct establishment of remote accessibility on user buffers, eliminating the need for copying. However, this optimization introduces stringent usage requirements, which are typically hard to satisfy without being intrusive to the user code: - Establishing remote accessibility is expensive and often done ahead of time. In such implementations, all ranks must agree on the set of allocations used for every collective op. Failing to meet this requirement can lead to runtime errors or even silent correctness issues. - Even if the collective communication library supports gracefully falling back to "unregistered" implementations, the fallback mechanism would nullify the optimization. - Some communication mechanisms impose stricter requirements than others. For example, CUDA's multicast + multi-mem instructions require all ranks to agree not only on the allocations used for every collective but also on the offsets within these allocations. To support all different mechanisms with optimal results, we aim to satisfy the strictest requirement for this family of optimizations - we ensure that every collective op invocation is guaranteed to operate on the same allocation, at the same offset, in every iteration. 
For eligible collective ops, we identify communication buffers at lowering time and optionally choose to lower the op to a different kernel (communication libraries like NCCL handle both registered and non-registered buffers transparently within the same op, though some may require different ops for different cases). Later, the codegen will perform "persistent allocation" to satisfy the aforementioned constraints, and optionally, perform buffer planning to optimize overall memory usage. ``` ### Changes - Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file. - Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer. - Added codegen allocation support for comm buffers of type "symm_mem". - Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`. - Added an Inductor config for collective optimizations in general (`config._collective`). ### Limitation Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029 Approved by: https://github.com/Chillee ghstack dependencies: #138028	2024-10-30 18:11:09 +00:00
Yifu Wang	421473c234	get_symm_mem_workspace(): print helpful error during graph capture (#138028 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138028 Approved by: https://github.com/weifengpy	2024-10-30 18:11:09 +00:00
Howard Huang	f4ab8b48c5	Allow schedules to run with single stage (#138925 ) Ran into issues (https://github.com/pytorch/pytorch/pull/138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138925 Approved by: https://github.com/wconstab	2024-10-30 17:33:16 +00:00
Antoni Viros	ad637a4c5c	Add support for index_put_ in NT (#135722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722 Approved by: https://github.com/jbschlosser	2024-10-30 17:17:59 +00:00
angelayi	f14f245747	[export] Remove custom forward func in swap (#139126 ) Differential Revision: [D65100694](https://our.internmc.facebook.com/intern/diff/D65100694) Remove the custom forward function and instead move the pytree flatten/unflatten ops into the graph. This allows us to natively run via the interpreter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139126 Approved by: https://github.com/avikchaudhuri	2024-10-30 16:50:57 +00:00
Roy Hvaara	4b83302585	[MPS] Update error message for supported autocast type (#139192 ) Autocast in MPS currently only supports dtype of `torch.float16`. This PR updates the error message to reflect this. This PR was created using [Copilot Workspace](https://copilot-workspace.githubnext.com/pytorch/pytorch/issues/139190?shareId=5b510fda-380c-4e86-8e91-6b67a078f180) with no human input other than clicking buttons. Fixes #139190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139192 Approved by: https://github.com/malfet	2024-10-30 16:48:29 +00:00
iupaikov-amd	996c40e85e	Adjusted install_user script for Ubuntu 24.04 support (#138815 ) Fixes #138812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138815 Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet	2024-10-30 16:31:09 +00:00
Arjun Vikram	29eb65fce8	Fix in-place state dict updates for distributed checkpoint loading (#138575 ) `dcp.load()` is documented as "operating in place", updating the state of existing state_dict elements instead of replacing them wherever possible. However, it appears that in the case of a stateful element, the code both updates its state in-place, then replaces it with a copy of itself in the state_dict. This looks like a simple oversight, so here's a PR that should fix it! [From the docs:](https://pytorch.org/docs/stable/distributed.checkpoint.html) > DCP is different than torch.save and torch.load in a few significant ways: ... > - It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead. This manifested as a strange bug in TorchTitan, causing a model loaded from a checkpoint to be saved incorrectly, resulting in a twice-resumed model being subtly broken. Let me know if this makes sense, and if there's anything else I should add! Thanks for all the work on PyTorch! Pull Request resolved: https://github.com/pytorch/pytorch/pull/138575 Approved by: https://github.com/kwen2501, https://github.com/fegin	2024-10-30 16:10:24 +00:00
Bin Bao	04eb15da44	[AOTI] Unify the default value of allow_stack_allocation (#139147 ) Summary: Unify the default value of allow_stack_allocation for fbcode and OSS Differential Revision: D65064673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139147 Approved by: https://github.com/hl475	2024-10-30 16:01:23 +00:00
Yoshimasa Niwa	6e85266a47	[MPS] Fixes SiLU on non-contiguous tensors (#139006 ) Similar to #123049: `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous on versions prior to macOS 15.0. Originally the problem was found at jy0205/Pyramid-Flow#113. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139006 Approved by: https://github.com/malfet	2024-10-30 15:44:59 +00:00
PyTorch MergeBot	49bfbed2eb	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit 383eba522922f0b7c525b88ed4348c64b40b95cf. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2447382819))	2024-10-30 14:43:15 +00:00
cyyever	456c87c8a2	[8/N] Fix extra warnings brought by clang-tidy-17 (#139151 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151 Approved by: https://github.com/ezyang Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-30 14:20:08 +00:00
Tom Ritchford	44257c063e	[dynamo] Fix constant propagation in builtins and UserClasses (#131354 ) * Fixes https://github.com/pytorch/pytorch/issues/118675 * Replaces https://github.com/pytorch/pytorch/pull/118994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-10-30 12:47:20 +00:00
PyTorch MergeBot	a951d99e16	Revert "Move reduce to template parameter in vectorized_reduction (#138672 )" This reverts commit 9b2c99d731695b76205d617ddc1e799ba11ae1a0. Reverted https://github.com/pytorch/pytorch/pull/138672 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138672#issuecomment-2446927015))	2024-10-30 12:12:13 +00:00
Xuehai Pan	9bbe4a67ad	[dynamo] support `maxlen` for `collections.deque` (#138194 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138194 Approved by: https://github.com/jansel, https://github.com/malfet	2024-10-30 10:08:02 +00:00
Edward Z. Yang	a4b35767cb	Don't have random print in convert_frame (#139203 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139203 Approved by: https://github.com/Skylion007	2024-10-30 09:35:37 +00:00
YangQun1	a19bdfb36e	[compiled autograd] reorder backward hooks to match eager behavior (#138553 ) Fixes #138538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138553 Approved by: https://github.com/xmfan	2024-10-30 08:46:45 +00:00
wz337	b71ab3fc85	[DTensor][Bug Fix]Fix 2D DTensor mm with mesh_shape (1, n) or (n, 1) (#139134 ) Fixes #138742. In the issue, matrix multiplication with DTensor failed when the mesh is more than 1D and one of its mesh dimensions has size 1. We were missing tests covering this corner case, where mesh_shape is (n, 1) or (1, n). The DTensor mm op is correct when the mesh is 1D with shape (self.world_size, ) or 2D with no mesh dimension of size 1. In this PR, we fixed the corner case by updating `gen_einsum_strategies` in `_einsum_strategy.py`. Specifically, we cannot skip generating `mesh_dim_strategies` when `mesh_dim <= 1`, as this is not valid for an nD mesh in which one of the mesh dimensions has size 1. Without the fix, the OpStrategy generated for a 2D mesh with mesh_shape of (1,n) or (n,1) is wrong, because the generated OpStrategy is 1D. ``` all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]] OpStrategy(all_strategies)::: [(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)[(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1) ``` After the fix, the generated OpStrategy is correct, with a 2D strategy. 
``` all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]][[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]] OpStrategy(all_strategies) = [(RR, RR) -> RR, (RS(1), RS(0)) -> RP, (RS(0), RR) -> RS(0), (RR, RS(1)) -> RS(1), (S(1)R, S(0)R) -> PR, (S(1)S(1), S(0)S(0)) -> PP, (S(1)S(0), S(0)R) -> PS(0), (S(1)R, S(0)S(1)) -> PS(1), (S(0)R, RR) -> S(0)R, (S(0)S(1), RS(0)) -> S(0)P, (S(0)S(0), RR) -> S(0)S(0), (S(0)R, RS(1)) -> S(0)S(1), (RR, S(1)R) -> S(1)R, (RS(1), S(1)S(0)) -> S(1)P, (RS(0), S(1)R) -> S(1)S(0), (RR, S(1)S(1)) -> S(1)S(1)] @ mesh: (4, 1) ``` ***** As a follow up, we should add more test coverage for DTensor ops with a 2D mesh, including 2D meshes where one mesh dimension has size 1. ***** Pull Request resolved: https://github.com/pytorch/pytorch/pull/139134 Approved by: https://github.com/fegin	2024-10-30 08:09:39 +00:00
Nikita Shulga	ceab24def4	[CI] Unify numpy version for python-3.9 and 3.10 configs (#139244 ) Per dependabot, numpy-1.21 is subject to CVE-2021-34141, so perhaps it's OK not to test against it Pull Request resolved: https://github.com/pytorch/pytorch/pull/139244 Approved by: https://github.com/huydhn	2024-10-30 06:47:38 +00:00
Scott Wolchok	3495ef78a2	Unbreak fp16 dot issues caused by #137917 (#139262 ) See comment for explanation. In short, doing the fixup in float. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139262 Approved by: https://github.com/huydhn	2024-10-30 05:10:19 +00:00
cyy	4e5f9afc7f	Enable c10::sv and std::sv constexpr conversions (#139239 ) This is a small step towards moving c10::sv to std::sv; this tiny change shouldn't break META builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139239 Approved by: https://github.com/malfet	2024-10-30 03:57:47 +00:00
leslie-fang-intel	cd8f7730f4	[PT2E][Quant] Remove Redundant Method in X86 Quantizer (#139161 ) Summary Remove the redundant X86 Inductor Quantizer methods `get_supported_quantization_configs`, `get_supported_operator_for_quantization_config` and `get_supported_operators`. They are not required to implement a customized Quantizer and are not mentioned in the existing documentation on how to use the X86 Inductor Quantizer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139161 Approved by: https://github.com/jgong5	2024-10-30 03:31:17 +00:00
Xia, Weiwen	edcab61f93	Skip test for PT2E quantized ops in fbcode (#138792 ) Skip those tests as they are failing in fbcode. Submit this PR per request from @jerryzh168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138792 Approved by: https://github.com/jerryzh168	2024-10-30 02:37:38 +00:00
eqy	b4e4f84a06	Fix regex in `test_static_inputs_address_mutation_log` for Python 3.12 (#139229 ) Otherwise Python 3.12's `re` seems to be unhappy with `re.error: global flags not at the start of the expression at position 113` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139229 Approved by: https://github.com/ezyang	2024-10-30 02:36:31 +00:00
cyy	b0f84aad5d	[3/N] Fix Wextra-semi warnings (#139165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139165 Approved by: https://github.com/ezyang	2024-10-30 02:08:13 +00:00
PyTorch MergeBot	5861279f47	Revert "Add support for index_put_ in NT (#135722 )" This reverts commit b4836e5b5ce2891e9af21790d255720e2dbf8e91. Reverted https://github.com/pytorch/pytorch/pull/135722 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/135722#issuecomment-2445651914))	2024-10-30 01:53:55 +00:00
Richard Barnes	1797a2035d	Drop caffe2 string_utils (#139217 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139217 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2024-10-30 01:13:16 +00:00
cyy	da1c1a9884	[4/N] Don't skip ASAN on some tests (#139189 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139189 Approved by: https://github.com/ezyang	2024-10-30 00:59:32 +00:00
Nikita Shulga	ba40dc19d2	[CI] Run aarch64 build/tests on every trunk commit (#139228 ) As we have sccache now, should be reasonably fast Pull Request resolved: https://github.com/pytorch/pytorch/pull/139228 Approved by: https://github.com/kit1980	2024-10-30 00:49:06 +00:00
Nikita Shulga	f643499ddd	Fix `vec128_half_neon.h` compilation with GCC (#139235 ) `mask` is already defined as `uint16x8_t` no need to reinterpret it `bd369bb182/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h (L220)` Fixes ``` var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)': /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t' 227 \| vreinterpretq_u16_f16(mask), \| ^~~~ \| \| \| uint16x8_t In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23, from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4, from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6, from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2, from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1: /usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note: initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)' 5841 \| vreinterpretq_u16_f16 (float16x8_t __a) \| ~~~~~~~~~~~~^~~ ``` introduced by https://github.com/pytorch/pytorch/pull/137911 Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)` otherwise compilation fails with ``` /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)': /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t' 77 \| return vaddvq_f32(t0 + t1); \| ~~~^~~~ \| \| \| at::vec::SVE256::Vectorized<float> 
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51, from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17, from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8, from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11, from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8, from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18, from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3, from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4, from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2, from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1: /usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note: initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)' 10423 \| vaddvq_f32 (float32x4_t __a) \| ~~~~~~~~~~~~^~~ In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1: /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)': /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t' 119 \| return vaddvq_f32(x); \| ^ \| \| \| at::vec::SVE256::Vectorized<float> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139235 Approved by: https://github.com/huydhn	2024-10-30 00:48:57 +00:00
Angela Yi	d9e87fb339	[draft-export] Include guards for constraint violation errors (#138748 ) Summary: Constraint violation errors in draft export now include the guard that was added and where it was added. Example output: ``` 1. Constraint violation error. The specified input dynamic_shapes spec was found to be incorrect during tracing. Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}. This occured at the following stacktrace: File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1736, in _wrapped_call_impl File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1747, in _call_impl File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/scripts/angelayi/draft_export/test_draft_export.py, lineno 138, in forward. Because of this, we have modified the dynamic shapes structure to be the following: ``` dynamic_shapes = {'a': {0: 3}} ``` ``` As a result of this diff, `dynamic` logs are also permanently turned on during draft export. Otherwise we cannot capture the `[guard added]` logs from symbolic_shapes.py. Test Plan: `buck2 run @//mode/dev-nosan scripts/angelayi/draft_export:test_draft_export -- -r "test_shape_failure" ` Differential Revision: D64862374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138748 Approved by: https://github.com/ezyang	2024-10-30 00:24:17 +00:00
Antoni Viros	b4836e5b5c	Add support for index_put_ in NT (#135722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722 Approved by: https://github.com/jbschlosser	2024-10-30 00:03:21 +00:00
Syed Tousif Ahmed	341a28f0ce	Refactors empty_cache to return only MemPool memory to the system (#133602 ) Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool. In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602 Approved by: https://github.com/ezyang	2024-10-29 23:58:44 +00:00
Nikita Shulga	bd369bb182	Workaround torch.deploy failures (#139195 ) Summary: torch.deploy is backed with an older version of `typing_extensions`, but this runtime could not care less about type-checking. So pretend that it has `TypeIs` by replacing it with `TypeGuard`. Fixes test failures introduced by https://github.com/pytorch/pytorch/pull/133814 / D65030974 Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TestNumpy'` Differential Revision: D65145409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139195 Approved by: https://github.com/Skylion007	2024-10-29 23:36:16 +00:00
titaiwangms	fcb36a69cd	[ONNX] Add a test file for _building.py (#139107 ) Fixes #138761 Add a test file for _building.py to verify and guarantee the correct behavior of OpRecorder. Note that the tests do not validate the model itself, but the expected behavior of the evaluator adding extra ops during input preprocessing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139107 Approved by: https://github.com/justinchuby	2024-10-29 23:25:31 +00:00
Colin L. Rice	a0e095dd9f	config: Modify install_config_module to use a layered approach (#138758 ) This modifies the config system to use a single mapping of config -> ConfigEntry and to store the default and user values within them. We could have used multiple dicts (i.e. user_override and default), but as we add more fields (justknobs in this PR, perhaps testing and env variables later), it quickly becomes painful. There are a couple of design decisions we could change. 1) All configs we save store the resolved value, not the default and user override separately. 2) All configs we load apply the resolved value as a user override. This means that certain complexities of default behaviour and deletion (as well as JK) will change if you save + load a config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138758 Approved by: https://github.com/ezyang	2024-10-29 23:19:36 +00:00
cyyever	46d0b635b9	[CMake] Remove pthread linking (#134436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134436 Approved by: https://github.com/r-barnes	2024-10-29 23:14:40 +00:00
eqy	c9bd712305	[CUDA][AMP] Speed up fp16/bf16 casts on H100+ (#137053 ) Similar to #110251 we're seeing cases where vectorization can benefit casts to fp16/bf16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137053 Approved by: https://github.com/drisspg	2024-10-29 23:01:16 +00:00
Scott Wolchok	b29c170bee	[PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917 ) Remove reasons to gate it on ARM. Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137917 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916	2024-10-29 22:38:01 +00:00
Scott Wolchok	fc2d0da773	[PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (#137916 ) The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler. Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137916 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915	2024-10-29 22:38:01 +00:00
Scott Wolchok	5be1556d4a	[PyTorch] Clean up Registers/ElementsPerIteration constants (#137915 ) In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.) Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137915 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914	2024-10-29 22:37:49 +00:00
Scott Wolchok	aafbea49b9	[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (#137914 ) This is in preparation for supporting x86 as well; we need to be in this directory so that we can get rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling request from @malfet to split the ARM64 fast path stuff into its own file. BFloat16 will be in a later diff. Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137914 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913	2024-10-29 22:37:37 +00:00
Scott Wolchok	6502d6cf17	[PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (#137913 ) float16_t is ARM-specific. Half is not. Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137913 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912	2024-10-29 22:37:30 +00:00
Scott Wolchok	9ede4b2746	[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (#137912 ) Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512fp16/bf16 to fix https://github.com/pytorch/torchchat/issues/1253 .) Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137912 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911	2024-10-29 22:37:24 +00:00
Scott Wolchok	41d7471413	[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (#137911 ) We can do most of what this header does (by line count) anyway by converting to and from float. Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137911 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #137661	2024-10-29 22:37:17 +00:00
Scott Wolchok	837538f040	[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert (#137661 ) NEON vectors are 128-bit and don't belong with 256 stuff. Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137661 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-10-29 22:37:10 +00:00
Joel Schlosser	23d590e518	More flexible test parametrization with @reparametrize (#138369 ) Background: The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example: ```python class MyTestClass(TestCase): @parametrize("x", range(3)) @parametrize("y", [False, True]) def test_foo(self, x, y): # Invoked with: # x=0, y=False # x=1, y=False # x=2, y=False # x=0, y=True # x=1, y=True # x=2, y=True ... ``` Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info according to what is supported for each op. ```python class MyTestClass(TestCase): @ops(op_db) def test_foo(self, op, device, dtype): # Invoked with each OpInfo in the db along with each device / dtype that corresponds # with this op according to the OpInfo entry. ... ``` Note that this is in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately. This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. 
Examples: ```python # adapter_fn that adds a new arg def include_is_even_arg(test_name, param_kwargs): x = param_kwargs["x"] is_even = x % 2 == 0 new_param_kwargs = dict(param_kwargs) new_param_kwargs["is_even"] = is_even is_even_suffix = "_even" if is_even else "_odd" new_test_name = f"{test_name}{is_even_suffix}" yield (new_test_name, new_param_kwargs) # adapter_fn that excludes certain values def exclude_odds(test_name, param_kwargs): x = param_kwargs["x"] is_even = x % 2 == 0 yield None if not is_even else (test_name, param_kwargs) class MyTestClass(TestCase): @reparametrize(parametrize("x", range(5)), include_is_even_arg) def test_foo(self, x, is_even): # Invoked with both the x value and the new is_even arg ... @reparametrize(parametrize("x", range(5)), exclude_odds) def test_bar(self, x): # Only invoked with even x values ... ``` For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369 Approved by: https://github.com/janeyx99	2024-10-29 22:14:38 +00:00
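The adapter mechanism described in the commit above can be illustrated without PyTorch. The following is only a sketch of the idea, not the `@reparametrize` implementation: `parametrize_x` and `reparametrize_cases` are hypothetical helpers, and the only thing taken from the commit message is the adapter contract (a generator that receives `test_name` and `param_kwargs` and yields either a new pair or `None` to drop the case).

```python
def parametrize_x(values):
    """Hypothetical base parametrizer: yields (test_name, param_kwargs) pairs."""
    for x in values:
        yield (f"test_foo_x_{x}", {"x": x})


def exclude_odds(test_name, param_kwargs):
    # Same adapter shape as in the commit message: yield None to drop a case.
    is_even = param_kwargs["x"] % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)


def reparametrize_cases(base_cases, adapter_fn):
    """Hypothetical driver: apply adapter_fn to each base case, skipping None."""
    for test_name, param_kwargs in base_cases:
        for adapted in adapter_fn(test_name, param_kwargs):
            if adapted is not None:
                yield adapted


# Only the even values of x survive the adapter.
cases = list(reparametrize_cases(parametrize_x(range(5)), exclude_odds))
```

Because the adapter is a generator, the same shape could also expand one base case into several, which is what makes it more flexible than a plain cross product.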
Jean Schmidt	ebaa774f96	Migrate inductor and torchbench workflows to start experimenting with a100 on aws (#139079 ) Excluding nightly workflows, as they are more critical and run less frequently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139079 Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/huydhn	2024-10-29 22:11:25 +00:00
drisspg	80c7c7178e	Make sure all SDPA tests are ran with tensor cores enabled (#135592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592 Approved by: https://github.com/eqy	2024-10-29 20:53:10 +00:00
Nikita Shulga	c81d4fd0a8	Upgrade sccache to v0.8.2 for CPU targets (#121323 ) This essentially reverts https://github.com/pytorch/pytorch/pull/95997 but switches to builds from source to official mozilla's sccache repo for CPU builds, except the PCH one, see https://github.com/pytorch/pytorch/issues/139188 - Define `SCCACHE_REGION` for the jobs that need it. - Enable aarch64 builds to use sccache, which allows one to do incremental rebuilds under 10 min, see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296 Fixes https://github.com/pytorch/pytorch/issues/121559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121323 Approved by: https://github.com/atalman	2024-10-29 19:54:36 +00:00
Jake Schmidt	2b577ae58f	Implement NJT embedding backward (#138627 ) Fixes #138352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138627 Approved by: https://github.com/jbschlosser	2024-10-29 18:44:58 +00:00
drisspg	a884462bca	Add workspace to TritonTemplates (#138050 ) # Add workspace buffer support for Triton templates ## Summary Adds support for templates to allocate and use temporary workspace buffers ## Key Changes - Add `WorkspaceArg` support in Triton template system - Automatic workspace allocation/deallocation around kernel execution - Zero-initialization support for workspace buffers - Seamless integration with existing tensor management ## Example Usage ```python def generate(self, ...): workspace_arg = WorkspaceArg( count=1024*1024, # 1MB workspace zero_fill=True # Zero-initialized ) return TritonTemplateCaller(..., workspace_arg=workspace_arg) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-10-29 18:17:54 +00:00
Xilun Wu	7964bcc3dc	[DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945 ) Summary: This PR fixes a size miscalculation in DeviceMesh's create_sub_mesh(). Error Description: When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a slice of the mesh, which we call a "submesh". Users can also slice a submesh from a flattened mesh. For example: ``` flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2") alias_flattened_mesh = device_mesh["dim0-2"] # this mesh slice leads to error in current impl ``` It triggers an error in the size calculation `reduce(lambda, mesh_dim)` in `create_sub_mesh`: ``` IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4) ``` Fix: The lambda was used incorrectly: for `lambda x, y`, `x` is the accumulated value while `y` is the current element of the iterable. Test: `pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945 Approved by: https://github.com/wz337	2024-10-29 17:56:56 +00:00
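The commit above turns on the argument order of `functools.reduce`. The sketch below is not the DeviceMesh code: `mesh_shape` and `dims` are hypothetical stand-ins for a 4-D mesh and the dimensions being flattened, and the buggy variant is only an assumed illustration of how treating the accumulator as an index can run out of range.

```python
from functools import reduce

# Hypothetical stand-ins: a 4-D mesh shape and the dims being flattened.
mesh_shape = (2, 3, 4, 5)
dims = [0, 1, 2]

# Correct: x is the running product (accumulator), y is the next dim index.
size = reduce(lambda x, y: x * mesh_shape[y], dims, 1)


def buggy_size():
    # Wrong: treats the accumulator x as if it were a dim index. After the
    # first step x holds a product (6 here), which is not a valid index.
    return reduce(lambda x, y: mesh_shape[x] * mesh_shape[y], dims)
```

Passing an explicit initial value (`1`) keeps the accumulator's meaning unambiguous from the first call onward.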
cyy	82a6d2db3f	[2/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139158 ) Follows #139007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139158 Approved by: https://github.com/Skylion007	2024-10-29 17:16:37 +00:00
Abatom	c98c88a211	[Bugfix] UnicodeDecodeError: 'utf-8' codec can't decode byte (#139062 ) Fixes #113564 When I used PyTorch's profiler to analyze the performance of vLLM, I encountered the following error. This error is similar to #113564. After analysis and troubleshooting, I changed the temporary file from text mode to binary mode, and it no longer reported an error and ran normally. ```bash ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 722, in stop ERROR 10-28 10:25:50 engine.py:160] self._transit_action(self.current_action, None) ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 751, in _transit_action ERROR 10-28 10:25:50 engine.py:160] action() ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 745, in _trace_ready ERROR 10-28 10:25:50 engine.py:160] self.on_trace_ready(self) ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 444, in handler_fn ERROR 10-28 10:25:50 engine.py:160] prof.export_chrome_trace(os.path.join(dir_name, file_name)) ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 220, in export_chrome_trace ERROR 10-28 10:25:50 engine.py:160] fout.writelines(fin) ERROR 10-28 10:25:50 engine.py:160] File "<frozen codecs>", line 322, in decode ERROR 10-28 10:25:50 engine.py:160] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 5896: invalid start byte ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139062 Approved by: https://github.com/ezyang	2024-10-29 17:16:26 +00:00
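The failure mode in the commit above can be reproduced with the standard library alone. This is a minimal sketch and not the profiler code: the file names and `payload` are made up, and the only detail borrowed from the traceback is that `fout.writelines(fin)` fails in text mode on a byte that is not valid UTF-8.

```python
import os
import tempfile

# Bytes that are not valid UTF-8 (0x8e cannot start a UTF-8 sequence).
payload = b'{"name": "kernel\x8e"}\n'

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "trace.tmp")
    dst = os.path.join(tmp, "trace.json")
    with open(src, "wb") as f:
        f.write(payload)

    # Text mode decodes each line while copying and fails on the invalid byte.
    try:
        with open(src, "r", encoding="utf-8") as fin, \
                open(dst, "w", encoding="utf-8") as fout:
            fout.writelines(fin)
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode copies the bytes through without decoding them.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.writelines(fin)
    with open(dst, "rb") as f:
        copied = f.read()
```

Copying in binary mode preserves the file byte for byte, so a later reader still has to handle any invalid sequences, but the copy itself cannot fail on decoding.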
Boyuan Feng	68134a320e	[Flex Attention] Paged Attention (#137164 ) This PR adds paged attention for flex attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137164 Approved by: https://github.com/drisspg	2024-10-29 17:05:22 +00:00
cyy	3907f36808	Turn some variables and functions into static (#136847 ) Re-check some files, mark variables and functions as static, and fix other warnings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136847 Approved by: https://github.com/ezyang	2024-10-29 17:01:56 +00:00
Henry Tsang	3f9f6048da	[aoti] Print output name for sympy.Expr as well (#138524 ) To avoid ``` NotImplementedError: unsupported type of output=s0s1 ``` It seems like this was caused by the use of `_scaled_dot_product_flash_attention`. Fallback kernel: ``` FallbackKernel( python_kernel_name='torch.ops.aten._scaled_dot_product_flash_attention.default', name=buf55, layout=MultiOutputLayout(device=device(type='cuda', index=0)), inputs=[ComputedBuffer(name='buf52', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0s1, 64], stride=[384s0s1, 64s0s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99da20>, ranges=[1, 6, s0s1, 64])), ComputedBuffer(name='buf53', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0s1, 64], stride=[384s0s1, 64s0s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99d480>, ranges=[1, 6, s0s1, 64])), ComputedBuffer(name='buf54', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0s1, 64], stride=[384s0s1, 64s0s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99c430>, ranges=[1, 6, s0s1, 64]))], constant_args=(0.125,), kwargs={'scale': 0.125}, output_view=None, python_kernel_name=torch.ops.aten._scaled_dot_product_flash_attention.default, cpp_kernel_name=at::_ops::_scaled_dot_product_flash_attention::call, ordered_kwargs_for_cpp_kernel=['scale'], op_overload=aten._scaled_dot_product_flash_attention.default, arg_properties=[{'name': 'query', 'type': Tensor, 'default_value': None}, {'name': 'key', 'type': Tensor, 'default_value': None}, {'name': 'value', 'type': Tensor, 'default_value': None}, {'name': 'dropout_p', 'type': float, 'default_value': 0.0}, {'name': 'is_causal', 'type': bool, 'default_value': False}, {'name': 
'return_debug_mask', 'type': bool, 'default_value': False}], kwarg_properties=None, unbacked_bindings=None, mutation_outputs=[], origin_node=None, origins=OrderedSet([_scaled_dot_product_flash_attention]) ) ``` codegen with this pr ``` // Topologically Sorted Source Nodes: [scaled_dot_product_attention], Original ATen: [aten._scaled_dot_product_flash_attention] double var_147 = 0.125; AtenTensorHandle buf56_handle; AtenTensorHandle buf57_handle; auto buf55_4 = s0s1; auto buf55_5 = s0*s1; AtenTensorHandle buf58_handle; AtenTensorHandle buf59_handle; AtenTensorHandle buf60_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cuda__scaled_dot_product_flash_attention(convert_arrayref_tensor_to_tensor(buf52), convert_arrayref_tensor_to_tensor(buf53), convert_arrayref_tensor_to_tensor(buf54), 0.0, 0, 0, &var_147, &buf56_handle, &buf57_handle, nullptr, nullptr, &buf55_4, &buf55_5, &buf58_handle, &buf59_handle, &buf60_handle)); RAIIAtenTensorHandle buf56(buf56_handle); RAIIAtenTensorHandle buf57(buf57_handle); RAIIAtenTensorHandle buf58(buf58_handle); RAIIAtenTensorHandle buf59(buf59_handle); RAIIAtenTensorHandle buf60(buf60_handle); ``` Differential Revision: D64724460 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138524 Approved by: https://github.com/chenyang78	2024-10-29 16:02:45 +00:00
Jason Ansel	a762dc0357	[inductor] Multi-kernel + cooperative reductions (#138893 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138893 Approved by: https://github.com/shunting314 ghstack dependencies: #138533	2024-10-29 15:45:17 +00:00
Jason Ansel	77b0ae832d	[inductor] Allow cooperative + persistent reductions (#138533 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138533 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-10-29 15:45:17 +00:00
Vishwa Raj Singh	9d7a0869f0	Make DDP Quantization hooks backend Agnostic (#138816 ) The current DDP quantization hooks code uses the .cuda() API to move tensors and parameters to backend devices, which restricts the hooks to the CUDA backend. This change makes the code backend agnostic by moving tensors/parameters based on tensor.device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138816 Approved by: https://github.com/kwen2501	2024-10-29 15:02:45 +00:00
Nikita Shulga	869d1ad0b4	[BE] Nested namespace in quantized folder (#139166 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139166 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-10-29 14:53:07 +00:00
Wu, Chunyuan	489c66fdb3	[AOTI] fix pointer_to_list (#138806 ) Fixes the `pointer_to_list` function to take `(ptr + i)` instead of `ptr`. This fixes the runtime error when running INT8 yolo-v7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138806 Approved by: https://github.com/jgong5, https://github.com/desertfire ghstack dependencies: #138691	2024-10-29 14:33:16 +00:00
Wu, Chunyuan	9af1816974	[AOTI] add C shim for _weight_int8pack_mm (#138691 ) Fixes the error of running WOQ-INT8 LLaMA: ``` E In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3, E from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4: E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque, AtenTensorOpaque)’: E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope E 117 \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle)); E \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-10-29 13:53:36 +00:00
amathewc	69d401d010	Update test_quantize_pt2e.py with HPU support (#137863 ) MOTIVATION We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices. CHANGES - Add support for HPU devices within the test_move_exported_model_bn using TEST_HPU flag - Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances. - Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137863 Approved by: https://github.com/jerryzh168	2024-10-29 13:01:03 +00:00
Yuanhao Ji	b9618c9b88	[Dynamo] Add `itertools.compress()` support (#139061 ) Use polyfill to add `itertools.compress()` support in Dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139061 Approved by: https://github.com/jansel	2024-10-29 10:25:55 +00:00
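For readers unfamiliar with the term, a polyfill here means a pure-Python function with the same behavior as the C-implemented builtin. The sketch below follows the equivalent given in the Python documentation for `itertools.compress`; whether Dynamo's polyfill is written exactly this way is an assumption.

```python
import itertools


def compress(data, selectors):
    """Pure-Python equivalent of itertools.compress (illustrative sketch)."""
    # Keep each item whose paired selector is truthy; stop at the shorter input.
    return (item for item, keep in zip(data, selectors) if keep)


ours = list(compress("ABCDEF", [1, 0, 1, 0, 1, 1]))
reference = list(itertools.compress("ABCDEF", [1, 0, 1, 0, 1, 1]))
```

Both lists contain the same selected items, which is the property a polyfill has to preserve.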
cyy	e201460f8a	[2/N] Fix Wextra-semi warnings (#139142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139142 Approved by: https://github.com/ezyang	2024-10-29 08:14:37 +00:00
Sam Ginzburg	93d7f90c3a	[inductor] getting AOT inductor to treat None args correctly (#139114 ) Differential Revision: [D65102228](https://our.internmc.facebook.com/intern/diff/D65102228) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139114 Approved by: https://github.com/aakhundov	2024-10-29 08:11:53 +00:00
Nikita Shulga	8b08559c80	Move more workflows to 3.9 (#139145 ) Specifically mergebot and others should be using 3.9 now Pull Request resolved: https://github.com/pytorch/pytorch/pull/139145 Approved by: https://github.com/kit1980, https://github.com/Skylion007, https://github.com/huydhn	2024-10-29 05:39:46 +00:00
PyTorch MergeBot	38645e8a3e	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit 8aedc649bdd0789b0ea9b9348d552fb1b0e437ff. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))	2024-10-29 04:54:37 +00:00
chuanqiw	ea93e09896	[CI] Align XPU CI build with CD to fix build issue (#139050 ) Works for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139050 Approved by: https://github.com/ezyang	2024-10-29 04:53:53 +00:00
Yuanhao Ji	e52ccb3ca6	[Device] Replace hardcoded devices with 'torch._C._get_accelerator()' (#139032 ) I noticed some hard-coded device selections like `"cuda" if torch.cuda.is_available() else "cpu"` that can be replaced with `torch._C._get_accelerator()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139032 Approved by: https://github.com/ezyang	2024-10-29 04:51:47 +00:00
cyy	a0865b00fb	[1/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139007 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139007 Approved by: https://github.com/ezyang	2024-10-29 04:48:13 +00:00
cyy	0274d16c01	Fix clang-tidy warnings in jit code (#138974 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138974 Approved by: https://github.com/ezyang	2024-10-29 04:33:40 +00:00
Yiming Zhou	48b55ca1b1	[export] Fix non-strict retracing with kwargs (#138927 ) Summary: `torch.fx.Interpreter.run()` only takes args as input. Currently we pass kwargs as well, which causes errors during retracing. Flattening the kwargs and concatenating them with args solves the issue. Several previously failing tests under `_retraceability_non_strict` now pass. Test Plan: ``` buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability_non_strict ``` Differential Revision: D64980053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138927 Approved by: https://github.com/angelayi	2024-10-29 04:31:21 +00:00
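The idea of flattening kwargs and concatenating them with args can be shown in plain Python. This is a simplified sketch, not the export code: `run_positional`, `flatten_inputs`, and `kwarg_order` are hypothetical names, and the real change may flatten nested inputs in a different way.

```python
def run_positional(*args):
    """Stand-in for an API that accepts positional inputs only."""
    x, y, scale = args
    return (x + y) * scale


def flatten_inputs(args, kwargs, kwarg_order):
    """Append kwarg values to args in the order the callee expects."""
    return (*args, *(kwargs[name] for name in kwarg_order))


# The keyword input is moved into its positional slot before the call.
flat = flatten_inputs((1, 2), {"scale": 10}, kwarg_order=["scale"])
result = run_positional(*flat)
```

The ordering list matters: a positional-only callee has no way to recover which value belonged to which keyword, so the caller has to fix the order up front.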
Nikita Shulga	3342b533bb	Update setuptool to 72.1.0 (#139144 ) As older versions are affected by CVE-2024-6345 Also, update `typing_extensions` to 4.11 to support `TypeIs`, otherwise some of the workflows report the following error (but succeed somehow), see [this](https://github.com/pytorch/pytorch/actions/runs/11566785190/job/32196549021): ``` 2024-10-29T03:55:01.3601410Z + /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_11566785190 --no-capture-output python3 -c 'import torch' 2024-10-29T03:55:01.3602260Z ~/runner/_work/_temp ~/runner/_work/pytorch/pytorch 2024-10-29T03:55:01.8043630Z Traceback (most recent call last): 2024-10-29T03:55:01.8044540Z File "<string>", line 1, in <module> 2024-10-29T03:55:01.8045670Z File "/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/torch/__init__.py", line 37, in <module> 2024-10-29T03:55:01.8046690Z from typing_extensions import ParamSpec as _ParamSpec, TypeIs as _TypeIs 2024-10-29T03:55:01.8048010Z ImportError: cannot import name 'TypeIs' from 'typing_extensions' (/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/typing_extensions.py) ``` Also delete macOS-X86, as we no longer build those. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139144 Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/huydhn	2024-10-29 04:24:51 +00:00
Scott Wolchok	61d0686168	[PyTorch] Use intrusive_ptr(p, DontIncreaseRefcount) directly in TensorBase unsafe borrow ctor (#138934 ) We observed ASAN failures stemming from `5ea6777861/torch/csrc/autograd/python_variable.cpp (L403)` . Since it's possible that `tensor` is dead here, `borrowed()` needs to avoid dereferencing it. `intrusive_ptr::reclaim` dereferences the pointer in builds with debug checks enabled, so use the DontIncreaseRefcount ctor directly instead. Differential Revision: [D64990707](https://our.internmc.facebook.com/intern/diff/D64990707/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138934 Approved by: https://github.com/ezyang	2024-10-29 04:20:11 +00:00
PyTorch MergeBot	6aef58a249	Revert "Dont decompose aten.baddmm in inductor (#137904 )" This reverts commit c066f4a055020ae994dd10a1b1fafbe3774108cd. Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the test is failing in trunk, maybe a landrace? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2443158194))	2024-10-29 04:08:11 +00:00
Will Feng	4ee514144b	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 ) This PR aims to support the following use case: ```python def all_reduce_eager(x): y = x * x req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True) assert isinstance(req, torch.distributed.Work) return y @torch.compile(fullgraph=True) def all_reduce_wait_compiled(y): torch.ops.c10d_functional.wait_tensor(y) return y * y x = torch.ones(1280, 1280, device="cuda") + self.rank with allow_inflight_collective_as_graph_input_ctx(): y = all_reduce_eager(x) z = all_reduce_wait_compiled(y) ``` where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel. ---- Update: Did two items to prevent regression to existing use cases: 1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case 2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users). The risk of this new version of PR causing regression should be very low. 
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-29 03:31:19 +00:00
cyy	d8f99f39cb	Avoid unnecessary tensor constructions (#139039 ) Because Variable is an alias of Tensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/139039 Approved by: https://github.com/Skylion007	2024-10-29 02:23:23 +00:00
Animesh Jain	e80fe7f13a	[dynamo][guards] Skip guards on empty nn module hooks (#138942 ) This brings some unsoundness in guards. Earlier we were skipping the empty nn module hooks dict guard only on inbuilt nn modules, but as seen in https://github.com/pytorch/pytorch/issues/138386, there could still be significant guard overhead. With this PR, we reduce the guard eval latency from 420 us to 280 us (1.5x reduction). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138942 Approved by: https://github.com/ezyang, https://github.com/jansel ghstack dependencies: #139040, #138954	2024-10-29 02:11:47 +00:00
Animesh Jain	2aa5348356	[dynamo][guards] Skip no tensor aliasing guards on parameters (#138954 ) This is another unsound guard eval optimization. It's rare in practice to compile a function with two different parameters as inputs, and then later call the function with one parameter input as two different inputs (aliasing). This further reduces guard overhead from 280 us to 240 us for the model in https://github.com/pytorch/pytorch/issues/138386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138954 Approved by: https://github.com/jansel ghstack dependencies: #139040	2024-10-29 02:11:47 +00:00
Animesh Jain	dee7e715ba	[dynamo][refactor] Remaining cleanup from config-cleanup of enable_cpp_guard_manager (#139040 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139040 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-10-29 02:11:39 +00:00
Jeff Daily	7c7b2d89ba	[ROCm] set hipblas workspace (#138791 ) Fixes #138532. This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791 Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-29 01:37:55 +00:00
eqy	07b0d633b8	[cuDNN][SDPA] Bail out of cuDNN SDPA for seqlen 1 inputs (#138531 ) Forwarded #138529 to the cuDNN team, but for now we want to avoid dispatching to unsupported cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138531 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-29 01:03:36 +00:00
Syed Tousif Ahmed	1637a40796	Adds snapshot API for MemPools to get pool memory segments (#133601 ) Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns the memory state of only a given pool. In this PR, we extend the snapshot API so that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601 Approved by: https://github.com/ezyang	2024-10-29 01:01:47 +00:00
eellison	c066f4a055	Dont decompose aten.baddmm in inductor (#137904 ) Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator. Fix for https://github.com/pytorch/pytorch/issues/137897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904 Approved by: https://github.com/ngimel	2024-10-29 00:54:29 +00:00
Jason Ansel	2b937e4e6d	[inductor] Cooperative reductions (#137756 ) Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756 Approved by: https://github.com/eellison	2024-10-29 00:45:53 +00:00
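The work split in the generated kernel above can be modeled on the CPU. The following is a minimal sketch in plain Python, not Inductor code: `cooperative_sum` is a hypothetical name, the outer loop stands in for the `RSPLIT` programs that would run concurrently on the GPU, and the `workspace` list stands in for `ws_ptr`. Only the chunk arithmetic (`num_rblocks`, `rsplit_chunk`, `rsplit_start`, `rsplit_end`) is taken from the kernel.

```python
def cooperative_sum(values, rblock=4, rsplit=3):
    """Sum `values` using the same chunking as the kernel's RSPLIT programs."""
    rnumel = len(values)
    # Same ceiling divisions as in the generated kernel.
    num_rblocks = (rnumel + rblock - 1) // rblock
    rsplit_chunk = (num_rblocks + rsplit - 1) // rsplit * rblock

    workspace = [0.0] * rsplit  # stands in for ws_ptr (one slot per program)
    for rsplit_id in range(rsplit):  # each iteration models one program
        rsplit_start = rsplit_chunk * rsplit_id
        rsplit_end = rsplit_chunk * (rsplit_id + 1)
        partial = 0.0
        for roffset in range(rsplit_start, rsplit_end, rblock):
            for rindex in range(roffset, roffset + rblock):
                if rindex < rnumel:  # plays the role of rmask
                    partial += values[rindex]
        workspace[rsplit_id] = partial

    # On the GPU a barrier sits here; afterwards the partials are combined.
    return sum(workspace)


print(cooperative_sum(list(range(10)), rblock=4, rsplit=3))
```

With 10 elements, `rblock=4`, and `rsplit=3`, each modeled program covers one block of 4 indices, and the mask discards the two out-of-range indices in the last chunk.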
cyy	383d9e3de6	[4/N] Fix cppcoreguidelines-special-member-functions warnings (#139027 ) Follows #138796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139027 Approved by: https://github.com/ezyang	2024-10-29 00:18:18 +00:00
wz337	5b39734a0a	[DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097 ) We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error: ``` RuntimeError: No backend for the parent process group or its backend does not support splitting ``` This also fixes test failures not being asserted when using `with_comms()` or `with_comms(eager_init=False)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097 Approved by: https://github.com/XilunWu	2024-10-29 00:04:06 +00:00
cyy	aa2b17c330	[3/N] Don't skip ASAN on some tests (#139058 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139058 Approved by: https://github.com/ezyang	2024-10-28 23:57:23 +00:00
cyy	5ab81099e3	[2/N] Fix object slice (#139036 ) Follows #138880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139036 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-10-28 23:56:36 +00:00
Svetlana Karslioglu	e00ead400c	Add a temporary Survey about the search (#139096 ) - Add a link to the new search survey - Add .css classes needed for the search banner Pull Request resolved: https://github.com/pytorch/pytorch/pull/139096 Approved by: https://github.com/seemethere, https://github.com/cjyabraham	2024-10-28 23:43:25 +00:00
Adnan Akhundov	ab09c4d913	Add host-side TMA support to AOTInductor (#138878 ) This adds host-side Triton TMA support to AOTInductor. Notes: - Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific). - C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions. - Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen. - This PR concludes the host-side Triton TMA support in PT2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878 Approved by: https://github.com/desertfire, https://github.com/chenyang78 ghstack dependencies: #138759, #138877	2024-10-28 23:39:53 +00:00
Simon Fan	fd9f4e6770	Back out "[compiled autograd] tls access helpers (#138061 )" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821 )" (#139086 ) Summary: Original commit changeset: 9bf80c1492d7 Original Phabricator Diff: D64796226 Original commit changeset: aa1d9ef8f6e6 Original Phabricator Diff: D64796212 Differential Revision: D65072644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086 Approved by: https://github.com/malfet	2024-10-28 23:37:05 +00:00
Nikita Shulga	18ad44e830	[BE] Test collect env against torch-2.* (#139122 ) And also update Python version to 3.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139122 Approved by: https://github.com/kit1980	2024-10-28 23:17:38 +00:00
dependabot[bot]	ba749755f5	Bump rexml from 3.3.3 to 3.3.9 in /ios/TestApp (#139088 ) Bumps [rexml](https://github.com/ruby/rexml) from 3.3.3 to 3.3.9. - [Release notes](https://github.com/ruby/rexml/releases) - [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md) - [Commits](https://github.com/ruby/rexml/compare/v3.3.3...v3.3.9) --- updated-dependencies: - dependency-name: rexml dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-28 15:47:10 -07:00
dependabot[bot]	23fb8baf37	Bump certifi from 2024.2.2 to 2024.7.4 in /tools/build/bazel (#130173 ) Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4. - [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04) --- updated-dependencies: - dependency-name: certifi dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-28 15:44:49 -07:00
Tugsbayasgalan Manlaibaatar	b7524b05d2	Make test_export training IR compatible (#138517 ) In this PR, I make test_export compatible with training IR. The idea is that when we flip the IR to non-functional training IR, all these tests should be green. The changes involve reading through each test case and adding the necessary decompositions etc. to make sure the tests pass. For example, if a test expects to see mutated buffers returned, we need to get them by running run_decomp. Differential Revision: [D64732360](https://our.internmc.facebook.com/intern/diff/D64732360) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138517 Approved by: https://github.com/avikchaudhuri	2024-10-28 22:38:19 +00:00
William Wen	904816d1ed	[dynamo] handle 3.13.0 __dict__ watcher bug (#138284 ) https://github.com/python/cpython/pull/116115 introduced a bug (https://github.com/python/cpython/issues/125608) where changing the attributes of an object may not fire the dict watchers registered to the object's `__dict__`. It has been fixed by https://github.com/python/cpython/pull/125611 but will only be in 3.13.1+. This PR disables the dict watcher guard shortcut for `__dict__`s on 3.13.0 and warns the user to try using 3.13.1+ instead. We also added a simple test to check for this functionality in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138284 Approved by: https://github.com/jansel ghstack dependencies: #138030	2024-10-28 22:25:21 +00:00
William Wen	35be6aef69	[dynamo] add some cpython debugging methods (#138030 ) This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.
```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap();
# can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C.
x = torch.sin(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030 Approved by: https://github.com/jansel	2024-10-28 22:25:21 +00:00
Joseph Macaranas	edf2a1be97	[ROCm][CK] Explicit cast values to half (#138751 ) Addresses ambiguous conversions and calls introduced by these two pull requests: [[ROCm] CK-based GEMM](https://github.com/pytorch/pytorch/pull/131004) [[AMD] Fix torch ck backend build with 6.2.1](https://github.com/pytorch/pytorch/pull/138434) Co-authored-by: cjatin <cjatin@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138751 Approved by: https://github.com/jeffdaily Co-authored-by: pruthvistony <pruthvigithub@gmail.com> Co-authored-by: cjatin <cjatin@users.noreply.github.com>	2024-10-28 22:00:26 +00:00
Ma Jian	ded83d2b16	support torch._utils._flatten_dense_tensors/_unflatten_dense_tensors … (#139023 ) Fixes #138897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139023 Approved by: https://github.com/ezyang	2024-10-28 21:59:07 +00:00
Guilherme Leobas	8785353f2f	Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941 Approved by: https://github.com/bdhirsh ghstack dependencies: #133337	2024-10-28 21:58:59 +00:00
Guilherme Leobas	6baccb430b	Update TwoTensor impl. to accept `outer_size/outer_stride` (#133337 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133337 Approved by: https://github.com/bdhirsh	2024-10-28 21:58:59 +00:00
cyy	f4f0f2995d	Fix Wextra-semi warnings (#139000 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139000 Approved by: https://github.com/ezyang	2024-10-28 21:48:51 +00:00
William Wen	52c80f663d	change name of dynamo CI chard to dynamo_wrapped (#138233 ) Implements https://github.com/pytorch/pytorch/issues/118127 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138233 Approved by: https://github.com/clee2000	2024-10-28 21:42:33 +00:00
PyTorch MergeBot	02339e674d	Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 )" This reverts commit 74878ac271feecfa3ff3d32f78c7d889bcac97d6. Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](`74878ac271`) ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2442667605))	2024-10-28 21:30:28 +00:00
Mikayla Gawarecki	1a275fea4b	Remove numpy dependency for maia serialization (#137600 ) See rationale in #137444 description Pull Request resolved: https://github.com/pytorch/pytorch/pull/137600 Approved by: https://github.com/albanD	2024-10-28 20:57:35 +00:00
Jack Zhang	dd688099af	Update unbacked symints in torch.nonzero more precisely (#137663 ) ### Summary The fake impl for `nonzero` sets the symint's upper range to `sys.maxsize - 1` if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape. See https://github.com/pytorch/pytorch/pull/134899 as a merged solution for a similar problem for a different op. ### Test plan Added unit test to verify upper bound reduction calculation (`python test/export/test_export.py TestExport.test_nonzero_dynamic`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137663 Approved by: https://github.com/ezyang	2024-10-28 20:57:23 +00:00
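The bound described in this commit can be illustrated with a short calculation: a tensor cannot have more nonzero entries than it has elements, so the product of the per-dimension upper bounds caps the count. This is a minimal sketch, not the PyTorch fake implementation; the function name and the use of `None` for a dimension with no known upper bound are assumptions made for illustration, and the product rule is presumed from the description rather than confirmed by it.

```python
import sys
from math import prod


def nonzero_count_upper_bound(dim_upper_bounds):
    """Upper bound on the number of rows torch.nonzero could return.

    `dim_upper_bounds` holds one upper bound per input dimension;
    None (an assumption of this sketch) marks an unbounded dimension.
    """
    default = sys.maxsize - 1  # the previous blanket bound from the commit message
    if any(bound is None for bound in dim_upper_bounds):
        return default
    # No more nonzero entries than elements; never exceed the old bound.
    return min(prod(dim_upper_bounds), default)


print(nonzero_count_upper_bound([4, 8]))
print(nonzero_count_upper_bound([None, 3]) == sys.maxsize - 1)
```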
Bin Bao	8fa0479dd8	[inductor] Enable cpp wrapper for test_torchinductor (#138579 ) Summary: Expand cpp wrapper testing to test_torchinductor. Using skip_cpp_wrapper to skip failing tests for now, and fixes are coming later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138579 Approved by: https://github.com/chenyang78, https://github.com/benjaminglass1	2024-10-28 20:35:25 +00:00
PyTorch MergeBot	e5595f10c8	Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 )" This reverts commit a688c57033b4536ef59356cdad241d65ca52a869. Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))	2024-10-28 20:13:46 +00:00
Joel Schlosser	8ba9063002	FlexAttention support for NJT (#136792 ) This PR adds FlexAttention + NJT support. In particular: * To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically. * Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately * Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported * Tests that FlexAttention with a causal mask matches causal SDPA * Adds a new public API for FlexAttention usage: * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space. * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term. Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S, D)
key = ... # NJT of shape (B, H, S, D)
value = ... # NJT of shape (B, H, S, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S), sum(S))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
TODO: ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though * ~~Some cleanup~~ * ~~`njt_score_mod_adapter`~~ * ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~ * Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices? * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though. * ~~Demonstrate non-causal mask~~ * Support non-contiguous NJTs with holes (booted to future PR) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792 Approved by: https://github.com/drisspg ghstack dependencies: #138841	2024-10-28 20:01:27 +00:00
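The "stacked sequence" -> relative index conversion mentioned in this commit can be sketched without tensors. This is an illustration only, not the FlexAttention implementation: the commit message says the real code materializes a `seq_idx` of length `sum(S)`, whereas this sketch swaps in a binary search over cumulative offsets. All function names here are hypothetical, and restricting the mask to positions in the same sequence is an assumption about the intended semantics.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(seq_lengths):
    """Start offset of each sequence inside the stacked sequence, plus the total."""
    return [0, *accumulate(seq_lengths)]


def to_relative(stacked_idx, offsets):
    """Map a stacked-sequence index to (sequence id, index within that sequence)."""
    seq_id = bisect_right(offsets, stacked_idx) - 1
    return seq_id, stacked_idx - offsets[seq_id]


def nested_causal_mask(q_idx, kv_idx, offsets):
    """Causal mask evaluated in per-sequence space; different sequences never attend."""
    q_seq, q_rel = to_relative(q_idx, offsets)
    kv_seq, kv_rel = to_relative(kv_idx, offsets)
    return q_seq == kv_seq and q_rel >= kv_rel


offsets = build_offsets([3, 2, 4])        # three ragged sequences, 9 positions total
print(to_relative(4, offsets))            # position 4 falls in the second sequence
print(nested_causal_mask(4, 3, offsets))  # same sequence, query after key
print(nested_causal_mask(3, 2, offsets))  # adjacent positions, different sequences
```

The user-facing `causal_mask` from the commit only ever sees the relative indices, which is why it can stay as simple as `q_idx >= kv_idx`.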
Ryan Guo	4cd985a886	[dynamo] Remove some files from `dynamo_expected_failures` (#138935 ) Some tests in `test/dynamo` are marked as "expected failure" when testing with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files named after those tests to the `dynamo_expected_failures` folder. However, a lot of those dynamo tests seem to be passing with `PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from `dynamo_expected_failures`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138935 Approved by: https://github.com/anijain2305	2024-10-28 19:41:26 +00:00
Avik Chaudhuri	9e06b5b5cb	fix unflatten with HOPs (#138978 ) Summary: Unflatten was broken for HOPs for a couple of reasons: (1) we didn't expect `get_attr` nodes in the exported program, but they can occur to hold graph arguments to HOPs; such attributes must be moved from the exported program to the corresponding unflattened submodule containing the HOP call. (2) we don't record metadata for graph arguments on serialization (there's nothing to hold it in our schema), and accordingly the `get_attr` nodes we create on deserialization don't have `nn_module_stack` metadata, which obviously wrecks unflatten. Test Plan: added a couple of tests Differential Revision: D65013647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138978 Approved by: https://github.com/zhxchen17	2024-10-28 19:30:56 +00:00
Mwiza Kunda	c2ded9ec0d	Fix dot reference checks (#138596 ) dot reference implementation should be consistent with the cpu / cuda implementations since it may be used for meta dispatch i.e.
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```
However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596 Approved by: https://github.com/bdhirsh	2024-10-28 19:11:40 +00:00
Richard Barnes	068f7e7a78	torch::optional -> std::optional (#138987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987 Approved by: https://github.com/Skylion007	2024-10-28 19:09:46 +00:00
PyTorch MergeBot	228963ad60	Revert "Add test for consistency between meta and CPU devices. (#138515 )" This reverts commit 006130d8eae834d17e3d3e21e61c506740cce6dc. Reverted https://github.com/pytorch/pytorch/pull/138515 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is failing in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/138515#issuecomment-2442357471))	2024-10-28 18:45:09 +00:00
Dan Zimmerman	f466df63a9	[torch] Address -Wreturn-type warning when compiling for AMD (#138951 ) Summary: Yep yep see title Test Plan: CI Differential Revision: D64971115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138951 Approved by: https://github.com/cyyever, https://github.com/adamomainz	2024-10-28 18:26:40 +00:00
Sergii Dymchenko	817e57f832	Remove Python 3.8 from README (#139089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139089 Approved by: https://github.com/clee2000, https://github.com/malfet	2024-10-28 18:12:11 +00:00
Laith Sakka	475ba1df8d	Expliclty avoid recording when should_record_events is false in record_shapeenv_event (#138965 ) Looking at the function record_shapeenv_event, it's hard to tell that it does not always run: we disable it by setting the top-level is_recording to True when self.should_record_events is false. This change makes that explicit, to avoid confusion and overloading is_recording. Alternatively, we could rename is_recording to do_no_record. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138965 Approved by: https://github.com/ezyang ghstack dependencies: #138804	2024-10-28 18:12:06 +00:00
Will Feng	a688c57033	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 ) This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-28 18:11:23 +00:00
Nikita Shulga	5c49db98b4	[EZ] Update minversion to 3.9.0 (#139085 ) Fixes https://github.com/pytorch/pytorch/issues/138979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139085 Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/seemethere, https://github.com/Skylion007	2024-10-28 18:04:29 +00:00
Ke Wen	74878ac271	[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 ) Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172 There was a split-vs-P2P bug: When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option that was meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug has gone unnoticed so far because there is no CI test with the following recipe: eager init + new group + P2P in that new group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013 Approved by: https://github.com/shuqiangzhang	2024-10-28 18:03:25 +00:00
Bin Bao	fb2c750e9d	[AOTI][refactor] Move convert_arrayref_tensor_to_tensor logic (#139030 ) Summary: Move convert_arrayref_tensor_to_tensor codegen logic to cpp_wrapper_cpu_array_ref.py Test Plan: CI Differential Revision: D64904187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139030 Approved by: https://github.com/hl475	2024-10-28 18:00:41 +00:00
Masaki Kozuki	949fdd2997	remove redundant `a` (#139046 ) As per title, only one "a" is sufficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139046 Approved by: https://github.com/Skylion007	2024-10-28 17:47:24 +00:00
Catherine Lee	66a3c249ae	Linter for no workflows on fork (#138849 ) Minor: adds a linter that ensures that all jobs run on pull_request, schedule, push, etc. have an `if: github.repository_owner == 'pytorch'` or are dependent on a job that has that check. There is also a setting in GitHub repos that can disable all workflows for that repo. A lot of these are unnecessary because many jobs use reusable workflows that have that check. However, this is a one-time change, so I'm not that bothered. Unfortunately I can't put this at the workflow level, which would make this better. Lots of weird string parsing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138849 Approved by: https://github.com/malfet	2024-10-28 17:46:50 +00:00
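The rule this linter enforces (a job either carries the repository-owner check itself or depends on a job that does) can be sketched as follows. This is a hypothetical illustration, not the linter added in the PR: it assumes the workflow's `jobs` mapping has already been parsed into plain dictionaries, and the function name is invented for this example.

```python
REPO_CHECK = "github.repository_owner == 'pytorch'"


def jobs_missing_owner_check(jobs):
    """Return the names of jobs that neither have the check nor inherit it via `needs`."""

    def is_guarded(name, seen=()):
        if name in seen:  # guard against dependency cycles
            return False
        job = jobs.get(name, {})
        if REPO_CHECK in str(job.get("if", "")):
            return True
        needs = job.get("needs", [])
        if isinstance(needs, str):  # `needs` may be a single job name or a list
            needs = [needs]
        return any(is_guarded(dep, seen + (name,)) for dep in needs)

    return sorted(name for name in jobs if not is_guarded(name))


example_jobs = {
    "build": {"if": "github.repository_owner == 'pytorch'"},
    "test": {"needs": "build"},  # covered through its dependency
    "docs": {},                  # would be flagged
}
print(jobs_missing_owner_check(example_jobs))
```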
Jack Zhang	01b055abe3	Make masked_scatter core aten (#137949 ) Summary: Making `masked_scatter` core aten since it is hard to decompose and we now have a portable kernel for it Test Plan: N/A Differential Revision: D64368725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137949 Approved by: https://github.com/larryliu0820	2024-10-28 17:31:53 +00:00
Edward Z. Yang	bca696ae81	Switch times to us in CompilationMetrics and improvements (#138975 ) Companion logger diff: https://www.internalfb.com/diff/D65012523 * Using float seconds for timestamps is bad because our internal system defaults to float32 precision and you don't even get second precision for timestamps in float32 * We decided to use microseconds instead of milliseconds because at millisecond granularity you can end up with the same timestamp if compilation is happening very quickly; much better to force non-overlapping spans * Because there are so many new fields and I don't feel like reimplementing each on BwdCompilationMetrics, BwdCompilationMetrics is no more, it's just that everything in CompilationMetrics is now optional. * The actual frame compile times collection is not modified (still float) to reduce blast radius, so I just convert to microseconds before making the record. At float64 precision (Python's default), you get about microsecond precision on timestamps so it shouldn't be a data problem (https://www.leebutterman.com/2021/02/01/store-your-unix-epoch-times-as-float64.html) * I rename some entries for clarity. In particular, whenever a timing contains all of its lower phases (e.g., how Inductor also contains Triton compilation) we put "cumulative" in its name. If something doesn't happen at compile time but is delayed until we have actual real inputs, we put "runtime" in its name. Test plan: ``` buck2 run @mode/opt @mode/inplace //scripts/oulgen:runner ``` And then inspect https://fburl.com/scuba/dynamo_compile/sandbox/mslu7f5w and verify the us columns are populated and meaningful. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138975 Approved by: https://github.com/masnesral	2024-10-28 17:17:18 +00:00
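The float32 precision point in this commit can be checked with the standard library alone. A minimal sketch: the helper names are hypothetical, the timestamp is an arbitrary value from late October 2024, and the rounding used in the conversion is an assumption, since the commit message does not say how the seconds-to-microseconds conversion is rounded.

```python
import struct


def roundtrip_float32(value):
    """Store a Python float (float64) as float32 and read it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


def seconds_to_us(seconds):
    """Convert float seconds to integer microseconds (rounding is an assumption)."""
    return round(seconds * 1_000_000)


timestamp = 1730135838.25  # Unix epoch seconds as float64
as_float32 = roundtrip_float32(timestamp)

# Near 1.7e9 adjacent float32 values are 128 seconds apart,
# so the stored timestamp is off by far more than a second.
print(as_float32, abs(as_float32 - timestamp))

# Integer microseconds derived from the float64 value keep the sub-second part.
print(seconds_to_us(timestamp))
```

Integer microseconds also sidestep the duplicate-timestamp problem the message raises for milliseconds, since two events less than a millisecond apart still get distinct values.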
cyy	9b2c99d731	Move reduce to template parameter in vectorized_reduction (#138672 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138672 Approved by: https://github.com/soulitzer	2024-10-28 17:13:12 +00:00
Prajesh Praveen Anchalia	3685c630b8	[pytorch] Plumb compile context from dynamo.export to aot_compile (#138793 ) Summary: tlparse shows unknown for certain items when _export.aot_compile() passes the graph obtained from dynamo.export() to inductor.aot_compile(), we also do not have access to the dynamo trace in the GraphModule exported by dynamo. This change plumbs through the compile_context into aot_compile as a part of GraphModule.meta without a major change to APIs within dynamo. Addresses issue: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505 Test Plan: ``` buck2 test mode/opt //caffe2/test/dynamo:test_dynamo Buck UI: https://www.internalfb.com/buck2/ad64c267-65be-47cf-a94f-e4b26e6e030b Test UI: https://www.internalfb.com/intern/testinfra/testrun/9288674286334710 Network: Up: 83KiB Down: 314KiB (reSessionID-1dad223b-c91d-4718-97a4-bb2c81e480f0) Jobs completed: 10750. Time elapsed: 19:18.5s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 5365. Fail 2. Fatal 0. Skip 4. Build failure 0 buck2 test mode/opt //caffe2/test/dynamo:test_dynamo_fb Buck UI: https://www.internalfb.com/buck2/179a60bb-34e1-43b3-97ad-91af8a93ab01 Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275046340687 Network: Up: 201KiB Down: 1.8GiB (reSessionID-36f33983-6d78-4ec9-aa1b-34cee80dcb4f) Jobs completed: 17. Time elapsed: 42.9s. Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1) Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. 
Build failure 0 ``` https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxZGXf6/index.html Repor fixed: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505 Differential Revision: D64863946 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138793 Approved by: https://github.com/ezyang	2024-10-28 17:07:44 +00:00
Edward Z. Yang	91ded0576d	Add sym_log2 (#137980 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980 Approved by: https://github.com/bobrenjc93	2024-10-28 17:03:14 +00:00
Yukio Siraichi	006130d8ea	Add test for consistency between meta and CPU devices. (#138515 ) Reference: https://github.com/pytorch/pytorch/issues/138399 This PR introduces an `OpInfo` test that checks whether running each `out=` operation using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically, it tests the case where the output tensors are not of the expected data type. According to the `out=` specification, some operations should error. I have added XFAIL to the set of operations that are currently failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515 Approved by: https://github.com/ezyang	2024-10-28 16:58:48 +00:00
PyTorch MergeBot	4dd04db5d0	Revert "[Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643 )" This reverts commit 4d92d6e60436b1aeffbf4dfce51f16923505251b. Reverted https://github.com/pytorch/pytorch/pull/138643 on behalf of https://github.com/wdvr due to reverting due to a large number of internal failures, see below ([comment](https://github.com/pytorch/pytorch/pull/138643#issuecomment-2442036958))	2024-10-28 16:18:38 +00:00
eellison	d90717e4e2	Add option to save real tensors in TORCH_COMPILE_DEBUG repro (#138110 ) This PR adds a utility that tries to construct the corresponding real tensor values of fake tensors by seeing if their meta storage is contained in the meta converter. Then, we are able to save real tensor values for fx_graph_runnable if `TORCH_COMPILE_DEBUG_SAVE_REAL=1` is set. Differential Revision: [D64502744](https://our.internmc.facebook.com/intern/diff/D64502744) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138110 Approved by: https://github.com/ezyang	2024-10-28 16:18:22 +00:00
Nichols A. Romero	2922b9fee1	[ROCm] Fix ADDMM hipBLASLt regression (#138267 ) Fixes #138067 A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604 The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267 Approved by: https://github.com/eqy, https://github.com/jeffdaily	2024-10-28 16:07:11 +00:00
Sam Larsen	ad933578ed	[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681 ) Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not. Test Plan: The new test fails with fast mode commented out, but succeeds when enabled: `python test/inductor/test_codecache.py -k test_stable_strings` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681 Approved by: https://github.com/oulgen	2024-10-28 15:23:56 +00:00
PyTorch MergeBot	3b0f39336c	Revert "Adds snapshot API for MemPools to get pool memory segments (#133601 )" This reverts commit 00504aa6b8b0ae68761b89f023184202e8c79bc8. Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))	2024-10-28 15:12:20 +00:00
Xu Han	5916def695	Fix MKL status check wrong to MKLDNN. (#139049 ) Fix the MKL status check, which wrongly referred to MKLDNN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139049 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-10-28 14:28:56 +00:00
银河渡舟	4d8090cabb	Avoid file encoding issues when loading cpp extensions (#138565 ) I've found that when using `torch.utils.cpp_extension.load` on my Windows system, decoding errors occur when my .cpp/.cu files contain certain non-English characters. `test.py`: ```py from torch.utils.cpp_extension import load my_lib = load(name='my_cuda_kernel', sources=['my_cuda_kernel.cu'], extra_cuda_cflags=['-O2', '-std=c++17']) # ...... ``` `my_cuda_kernel.cu`: ```cpp #include <torch/types.h> #include <torch/extension.h> // 向量化 <------ some chinese characters // ...... ``` Errors will be reported as: ``` Traceback (most recent call last): File "E:\test\test.py", line 8, in <module> my_lib = load( ^^^^^ File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1314, in load return _jit_compile( ^^^^^^^^^^^^^ File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1680, in _jit_compile version = JIT_EXTENSION_VERSIONER.bump_version_if_changed( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 46, in bump_version_if_changed hash_value = hash_source_files(hash_value, source_files) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 17, in hash_source_files hash_value = update_hash(hash_value, file.read()) ^^^^^^^^^^^ UnicodeDecodeError: 'gbk' codec can't decode byte 0x96 in position 141: illegal multibyte sequence ``` The issue lies in the fact that the `open()` function in Python is platform-dependent, which can cause decoding errors when a file contains characters that are not supported by the default encoding. 
PyTorch uses file contents to generate a hash string: `60c1433041/torch/utils/_cpp_extension_versioner.py (L16-L17)` On my Windows system the default encoding is `gbk`, but all of my cpp files are in `utf-8`. I think there is a simple solution to this problem: just change the file reading mode to binary mode, which avoids issues related to file encoding. It works perfectly on my computer. ```diff - with open(filename) as file: + with open(filename, 'rb') as file: hash_value = update_hash(hash_value, file.read()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138565 Approved by: https://github.com/malfet, https://github.com/janeyx99 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-28 14:06:34 +00:00
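A minimal sketch of the idea behind the fix above: hashing raw bytes never involves a decoding step, so the result does not depend on the platform's default encoding. The helper below is a hypothetical stand-in that uses `hashlib.sha256` for illustration; it is not PyTorch's `update_hash`, and the file name is made up.

```python
import hashlib
import os
import tempfile


def hash_source_files(paths):
    """Hash file contents as raw bytes, independent of the locale encoding."""
    digest = hashlib.sha256()  # assumption: any stable byte hash works here
    for path in paths:
        with open(path, "rb") as f:  # binary mode: nothing is decoded
            digest.update(f.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_cuda_kernel.cu")
    # Write a UTF-8 source file containing non-ASCII characters.
    with open(src, "w", encoding="utf-8") as f:
        f.write("// 向量化\n#include <torch/extension.h>\n")

    source_digest = hash_source_files([src])

    # Text mode with a codec that cannot decode the bytes fails, which is
    # the same class of error as the 'gbk' traceback in the commit message.
    try:
        with open(src, encoding="ascii") as f:
            f.read()
        text_mode_ok = True
    except UnicodeDecodeError:
        text_mode_ok = False

print(source_digest, text_mode_ok)
```

Binary mode is also the more faithful choice for a change detector, since two files that differ only in byte encoding are genuinely different inputs to the compiler.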
cyy	1ec76dd1dc	Enable clang-tidy on torch/csrc/distributed (#139043 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043 Approved by: https://github.com/Skylion007	2024-10-28 13:56:54 +00:00
PyTorch MergeBot	60d1c7138d	Revert "[inductor] Cooperative reductions (#137756 )" This reverts commit fed37dbfbceefe306af648ff4fe1e0124c4d7844. Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322))	2024-10-28 13:24:33 +00:00
PyTorch MergeBot	2487a834a4	Revert "Add sym_log2 (#137980 )" This reverts commit 5d450d7facd7480482132408acc4c23d80933bab. Reverted https://github.com/pytorch/pytorch/pull/137980 on behalf of https://github.com/jeanschmidt due to lint broke from this onwards on main ([comment](https://github.com/pytorch/pytorch/pull/137980#issuecomment-2441570186))	2024-10-28 13:21:08 +00:00
Edward Z. Yang	8274dadac5	Make OpaqueUnaryFn pickleable (#138395 ) Fixes https://github.com/pytorch/pytorch/issues/138070 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138395 Approved by: https://github.com/XuehaiPan, https://github.com/bobrenjc93	2024-10-28 13:10:04 +00:00
cyy	4d9b5a87e4	[3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796 Approved by: https://github.com/ezyang	2024-10-28 10:53:11 +00:00
Edward Z. Yang	2265c2d48c	Add pytorch.wait_counter.actual_codegen_and_compile WaitCounter (#138010 ) The current pytorch.wait_counter.codegen_and_compile scopes over cache hit/miss, so it doesn't accurately say if you're actually spending time doing Inductor compile or not. This counter /only/ is triggered when we're actually about to spend time in Inductor. It covers Inductor lowering, codegen as well as Triton compilation. It does NOT cover Triton compilation that occurs when you cache hit. Some more bikeshedding may be needed. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138010 Approved by: https://github.com/markkm	2024-10-28 08:06:24 +00:00
Michael Lazos	46132dc026	[Dynamo] Refactor wrap_fx_proxy (#138933 ) During the work to dedup graphs for hierarchical compilation I tried to tame the `wrap_fx_proxy_cls` mess by separating the wrapping into three distinct scenarios (vs a jumble of conditionals). These are: 1) wrapping a preexisting tensor (`_wrap_fx_preexisting_tensor`) 2) wrapping and tracing a new op into the graph (`_wrap_fx_proxy`) 3) handling a value that is some other proxyable data structure See `wrap_fx_proxy_cls` for the conditional tree handling these three cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138933 Approved by: https://github.com/williamwen42	2024-10-28 08:05:33 +00:00
PyTorch MergeBot	9ca749d6cd	Revert " [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796 )" This reverts commit 7cb3cef05f4b1d1b448a82a01420e2a9ed1ccfe0. Reverted https://github.com/pytorch/pytorch/pull/138796 on behalf of https://github.com/wdvr due to reverting since this started failing a windows test ([comment](https://github.com/pytorch/pytorch/pull/138796#issuecomment-2440710865))	2024-10-28 07:06:00 +00:00
Tuan Trieu	633dcf1a2d	Constant folding for lifted graph (#135060 ) Summary: The current implementation for lifted graphs takes a dict of [constant name: constant value]. The constant value is used to run_node and execute the constant graph to get the folded values, and then new getattr nodes are created for the folded values. We don't have constant values for lifted graphs during model compilation on MTIA. I think it is more general to allow the constant folding pass to take only the constant names to produce the constant graph, and to represent the folded nodes as placeholders to make it consistent with the lifted graph. Additionally, this mimics the real situation on Sigmoid, where Sigmoid executes the constant graph, gets the folded values, and sets the folded values on the main graph. This diff updates the pass to work with a list of constant names. Test Plan: ``` buck run mode/opt caffe2/test:test_export -- -r split_const_gm ``` Differential Revision: D62144791 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135060 Approved by: https://github.com/SherlockNoMad Co-authored-by: Tuan Trieu <tuant@meta.com>	2024-10-28 06:28:31 +00:00
Angela Yi	a99e8eeb97	Propagate real tensor tracing with torchbind + fixing side effects (#138797 ) Summary: * Fixed real tensor tracing w/ torchbind objs by passing the cloned tensor obj. For now I just catch the exception and have an error message if the `_clone` fails, but up for discussion on what to do here * Separate question, should we require people to set up FakeScriptObjects and stuff for draft mode? * Prevent side effects from happening when we do the first pass of custom ops profiling by cloning/copying everything. Not sure if deepcopying the model will succeed in all cases... But also I guess this path can be removed once custom ops profiling turns into one pass. Test Plan: `buck2 run @//mode/dev-nosan //scripts/angelayi/draft_export:test_draft_export` Reviewed By: ydwu4 Differential Revision: D64124825 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138797 Approved by: https://github.com/ydwu4	2024-10-28 06:27:36 +00:00
Simon Fan	dd9ff9f139	[compiled autograd] add tests for bwd hooks relative firing order (#139004 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139004 Approved by: https://github.com/yf225 ghstack dependencies: #139003	2024-10-28 05:55:56 +00:00
Simon Fan	fac74687a6	[compiled autograd] fix node origin graph comments (#139003 ) the comment update was done after prehooks were already collected, so prehooks would appear as part of the previous node Pull Request resolved: https://github.com/pytorch/pytorch/pull/139003 Approved by: https://github.com/yf225	2024-10-28 05:55:56 +00:00
cyy	f9ae3fac8c	[Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903 Approved by: https://github.com/ezyang	2024-10-28 05:29:25 +00:00
cyy	39aa3cb8d6	Re-enable skipped ubsan tests (#139008 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139008 Approved by: https://github.com/ezyang	2024-10-28 05:21:31 +00:00
Charles Coulombe	d2052ea84d	Update test_multiarray.py to support numpy 2.0+ (#138461 ) Import _core instead of core. Addresses partially #137182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138461 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-10-28 04:30:50 +00:00
Bob Ren	4c6ae39afd	Fix some nits in symbolic_shapes.py (#139018 ) While I was reading through this file for understanding, I fixed some nits. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139018 Approved by: https://github.com/ezyang	2024-10-28 04:27:12 +00:00
PyTorch UpdateBot	1fad37a023	[audio hash update] update the pinned audio hash (#138402 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138402 Approved by: https://github.com/pytorchbot	2024-10-28 04:04:28 +00:00
PyTorch UpdateBot	6f5d538972	[executorch hash update] update the pinned executorch hash (#138661 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138661 Approved by: https://github.com/pytorchbot	2024-10-28 03:44:00 +00:00
Aaron Gokaslan	d72241d045	[Ez][BE]: Fix one more incorrect TypeIs (#139010 ) One other case where the side conditions could cause inaccurate typing info. Follow up to #138990 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139010 Approved by: https://github.com/malfet	2024-10-28 03:36:45 +00:00
cyy	f7dc13806e	[2/N] Don't skip ASAN on some tests (#138663 ) Follows #138571 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138663 Approved by: https://github.com/ezyang	2024-10-28 03:35:57 +00:00
Edward Z. Yang	5d450d7fac	Add sym_log2 (#137980 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980 Approved by: https://github.com/bobrenjc93	2024-10-28 03:09:11 +00:00
Laith Sakka	c056dc4cb8	In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804 ) Title + we avoid calling defer_assert when we statically know the guard results. timing for pnasnet5large ``` TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052 ``` matches the timing without the diff ``` TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804 Approved by: https://github.com/ezyang	2024-10-28 02:19:55 +00:00
cyyever	7cb3cef05f	[3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796 Approved by: https://github.com/ezyang	2024-10-28 01:38:02 +00:00
cyy	d2ec289787	Turn header static function into inline (#138671 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138671 Approved by: https://github.com/ezyang	2024-10-27 20:07:39 +00:00
Edward Z. Yang	192385e261	Add sym_sum to TorchInGraphFunctionVariable (#138848 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138848 Approved by: https://github.com/Skylion007	2024-10-27 20:04:35 +00:00
Xu Han	beb15c80fb	print USE_STATIC_MKL for further debug. (#138902 ) Print `USE_STATIC_MKL` for further debugging. If we use `MKL`, show its link method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138902 Approved by: https://github.com/ezyang	2024-10-27 18:08:30 +00:00
Nikita Shulga	652a2ab93e	[BE] Skip `print(foo)` tests (#139009 ) Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output. Modify `test_print_non_contiguous` to get tensors string representation, which is an equivalent operation Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009 Approved by: https://github.com/Skylion007	2024-10-27 18:04:03 +00:00
Ke Wen	ee11e2da1e	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-27 17:40:43 +00:00
Jason Ansel	fed37dbfbc	[inductor] Cooperative reductions (#137756 ) Example generated code for `(x+y).sum()`: ```py @triton.jit def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr): xnumel = 1 rnumel = 1048576 rsplit_id = tl.program_id(0) num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK rsplit_start = rsplit_chunk * rsplit_id rsplit_end = rsplit_chunk * (rsplit_id + 1) xoffset = tl.program_id(1) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1) rbase = tl.arange(0, RBLOCK)[None, :] _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32) for roffset in range(rsplit_start, rsplit_end, RBLOCK): rindex = roffset + rbase rmask = rindex < rnumel r0 = rindex tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp2 = tmp0 + tmp1 tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK]) tmp5 = _tmp4 + tmp3 _tmp4 = tl.where(rmask, tmp5, _tmp4) tmp4 = tl.sum(_tmp4, 1)[:, None] if RSPLIT > 1: tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32)) tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None) if RSPLIT > 1: triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True) if RSPLIT > 1: tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first') tmp4 = tl.sum(tmp4_peers, 1)[:, None] if rsplit_id == (0 % RSPLIT): tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756 Approved by: https://github.com/eellison ghstack dependencies: #138970	2024-10-27 16:31:38 +00:00
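The split arithmetic in the generated kernel above can be followed in plain Python. This is a sequential sketch of the same index math (`num_rblocks`, `rsplit_chunk`, the `rmask` bound check, and the final combine over a workspace), not Triton code; the function names are hypothetical, and the list `workspace` stands in for `ws_ptr` with the barrier implied by running the splits one after another.

```python
def rsplit_ranges(rnumel, rblock, rsplit):
    """Return the [start, end) reduction range owned by each split."""
    num_rblocks = (rnumel + rblock - 1) // rblock
    rsplit_chunk = (num_rblocks + rsplit - 1) // rsplit * rblock
    return [(rsplit_chunk * i, rsplit_chunk * (i + 1)) for i in range(rsplit)]


def cooperative_sum(x, y, rblock, rsplit):
    """Compute sum(x + y) as rsplit partial sums that are then combined."""
    rnumel = len(x)
    workspace = []  # stand-in for the ws_ptr scratch buffer
    for start, end in rsplit_ranges(rnumel, rblock, rsplit):
        partial = 0
        for roffset in range(start, end, rblock):
            for r in range(roffset, roffset + rblock):
                if r < rnumel:  # same role as rmask in the kernel
                    partial += x[r] + y[r]
        workspace.append(partial)
    # On the GPU a barrier separates the partial stores from this combine.
    return sum(workspace)


x = list(range(1000))
y = [1] * 1000
total_even = cooperative_sum(x, y, rblock=64, rsplit=4)
total_uneven = cooperative_sum(x, y, rblock=64, rsplit=3)
print(total_even, total_uneven)
```

Because each chunk is rounded up to a whole number of blocks, the last split may own a range that extends past `rnumel`; the bound check is what keeps those extra indices from contributing.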
Jason Ansel	3217ae2082	[inductor] Only apply score_fusion_memory_threshold to horizontal fusions (#138970 ) PR #136782 made `x.sum()+1` become two kernels, which hurts compile times as @ezyang noticed and breaks a lot of the tests in this stack. This reworks that heuristic to not apply as often. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138970 Approved by: https://github.com/shunting314	2024-10-27 16:31:38 +00:00
Wouter Devriendt	bae3426af7	reimport pr137735 due to merging check issues (#138959 ) This is a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959 Approved by: https://github.com/mikaylagawarecki	2024-10-27 16:31:34 +00:00
PyTorch MergeBot	144d75d934	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit 07e30eae2a8241e531890b6c9a33ab5a80c5ccaf. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))	2024-10-27 15:39:33 +00:00
PyTorch MergeBot	d969b34377	Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804 )" This reverts commit f1a677cba5ef7514f2cf303753d3117528867a33. Reverted https://github.com/pytorch/pytorch/pull/138804 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to fail pr_time_benchmarks job in trunk ([comment](https://github.com/pytorch/pytorch/pull/138804#issuecomment-2440069407))	2024-10-27 15:36:46 +00:00
Aaron Gokaslan	5d074746e9	[BE]: Add better optional typing (#138426 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426 Approved by: https://github.com/XuehaiPan, https://github.com/malfet	2024-10-27 14:19:00 +00:00
Bin Bao	d9534a50a9	[AOTI][refactor] Separate header codegen (#138882 ) Summary: Move arrayref specific header codegen logic to cpp_wrapper_cpu_array_ref.py, and consolidate some header files codegen logic Test Plan: CI Differential Revision: D64899248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138882 Approved by: https://github.com/hl475	2024-10-27 14:14:27 +00:00
Yu, Guangye	40c098f731	Introduce a device-agnostic runtime API design (#132204 ) # Motivation According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design. I personally prefer the Simple Version APIs that no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator. And it is flexible to expand these APIs to handle multiple types of accelerator scenarios. The design does NOT break the previous design philosophies. I also believe that namespace torch.accelerator is better. It lets users know that the APIs they are calling are running on an accelerator rather than CPU. This is important. Meanwhile, we can follow a simple API design principle: 1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter. 2. Device-specific APIs should be placed under device-specific submodules. 3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter. Also, I list the pros and cons of Simple Version here: Pros: - `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience; - more concise, making it easier for developers to write device-agnostic code. Cons: - no obvious drawbacks. 
# Additional Context I list the new APIs here: ```python torch.accelerator.is_available() -> bool: torch.accelerator.current_accelerator() -> torch.device: torch.accelerator.device_count() -> int: torch.accelerator.current_device_idx() -> int: torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None: torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream: torch.accelerator.set_stream(stream: torch.Stream) -> None: torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None: ``` Following the discussion with Alban, we decided to rename `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. A separate PR will add support for device and stream context managers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204 Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD	2024-10-27 10:37:09 +00:00
Ke Wen	1152726feb	[PGNCCL] Use recursive mutex in NCCLComm (#138997 ) Fixes #138995: [PGNCCL][BUG] mutex acquired in recursive way may deadlock The fix: use `std::recursive_mutex` to replace `std::mutex`. Found and proposed by @dsjohns2. Thanks! Pull Request resolved: https://github.com/pytorch/pytorch/pull/138997 Approved by: https://github.com/dsjohns2	2024-10-27 08:58:47 +00:00
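The deadlock pattern fixed above is not specific to C++. As a rough analogy only, the sketch below uses Python's `threading.Lock` and `threading.RLock` in place of `std::mutex` and `std::recursive_mutex`. The `Comm` class and its methods are hypothetical and do not mirror `NCCLComm`; the inner acquire uses a timeout so the non-reentrant case reports failure instead of hanging.

```python
import threading


class Comm:
    """Hypothetical object whose methods all take the same lock."""

    def __init__(self, lock):
        self._lock = lock
        self._aborted = False

    def is_aborted(self):
        # A timeout is used so this sketch never blocks forever.
        if not self._lock.acquire(timeout=0.1):
            return None  # the lock is already held: a real mutex would deadlock
        try:
            return self._aborted
        finally:
            self._lock.release()

    def abort(self):
        with self._lock:
            # Calls another locked method while still holding the lock.
            was_aborted = self.is_aborted()
            self._aborted = True
            return was_aborted


# A reentrant lock lets the owning thread acquire it again.
recursive_result = Comm(threading.RLock()).abort()

# A plain lock cannot be re-acquired by its owner; the inner call times out.
plain_result = Comm(threading.Lock()).abort()

print(recursive_result, plain_result)
```

A reentrant lock is the smallest change when locked methods call each other; the alternative is to split each method into a public locking wrapper and a private body that assumes the lock is held.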
Shunting Zhang	4681539f42	[inductor] force strides for efficient attn bwd (#138879 ) Try to fix https://github.com/pytorch/pytorch/issues/138772 . aten._scaled_dot_product_efficient_attention_backward requires the out and gradient_out to have stride order (3, 1, 2, 0). When Inductor layout optimization is enabled, Inductor may change tensor strides if they are not user visible. For efficient_attention_backward, Inductor tries to follow eager strides. But the eager strides Inductor gets for backward graph may be the one after optimization. There are a few possible fixes: 1. change the kernel to allow stride order other than (3, 1, 2, 0). This is probably hard 2. backout https://github.com/pytorch/pytorch/pull/112045/files and don't do layout optimization if the model contains efficient_attention. 3. Force (3, 1, 2, 0) strides order for the relevant tensors 4. Pass original eager layouts to Inductor for the backward graph. Let Inductor follow those layouts for tensors with extra layout requirement. The PR implements option 3. Option 4 looks more general to me, I think we can do this in long term. I tried to add a test but failed to repro: https://gist.github.com/shunting314/fe37a246aad269de9ea00199446688f6 Here is the original command to repro the issue: ``` TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 time python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64 ``` benchmark.py is https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/138879 Approved by: https://github.com/drisspg, https://github.com/eellison	2024-10-27 04:54:15 +00:00
Edward Z. Yang	c480a479b1	Make automatic_dynamic state live per CodeId, rather than on code object (#138740 ) This changes semantics if you are dealing with multiple code objects which have exactly the same filename/firstlineno/name, but are distinct objects, and need non-aliasing automatic dynamic state. Otherwise, this should be equivalent (modulo lifetime). I want to do this because when I do PGO I can't index on code object identity; I need a stable identifier. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138740 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #138693, #138717	2024-10-27 03:08:41 +00:00
Edward Z. Yang	14a45d7793	Refactor core algorithm for automatic dynamic shapes (#138717 ) While working on automatic dynamic PGO (https://github.com/pytorch/pytorch/pull/138052) one abstract property I was looking for out of profile information is that it formed a semilattice: I could join together two profiles and get a merged profile that is consistent with the profiles that I saw in both cases. While working on this data structure that supported joins, I realized that the base automatic dynamic algorithm could be implemented in this way, therefore this refactor. The basic recipe is that we now support a join operation on FrameStateSizeEntry. Intuitively, if you join two sizes that are equal, you get back that size (join(2, 2) == 2), but if you join two different sizes you get a special singleton auto_dynamic indicating that the size of the tensor is dynamic (join(2, 3) == auto_dynamic). So now, the automatic dynamic algorithm is: (1) compute the FrameStateSizeEntry that corresponds to the concrete values we've seen, and (2) join it into the ambient FrameStateSizeEntry. As a bonus, compiler collectives can buy into the same abstraction (we're simply distributing FrameStateSizeEntry from each node to every other node). For convenience, I also added the necessary `auto_unset` extra state which is the identity element (which makes our semilattice bounded from both top and bottom). Here, join(2, auto_unset) == 2. While doing this, there was a complication: the infer stride algorithm wasn't technically a semilattice. Here, I did what I suggested in the original code review https://github.com/pytorch/pytorch/pull/130232 which is stop using a heuristic, and instead replicate the stride inference algorithm in automatic dynamic. This means that when I join strides together, I don't join their concrete values, instead, if a stride can be inferred as the contiguous stride for a particular inner dimension, then you represent it as InferStride(dim). 
There's an example in code which I recommend looking at. Some other extra things that are happening in this PR: * I tried to deduplicate the size/stride automatic dynamic logic as much as possible. So hopefully less code to review here. * I had to reimplement all the logging. For the most part I tried to track the logging as closely to the original as possible, but I think we could be emitting less Chrome events here * The `marked_dynamic` handling is still preserved as is, but I kind of don't like it and we should figure out how to put it somewhere else Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138717 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #138693	2024-10-27 03:08:41 +00:00
Mu-Chu Lee	28013aa527	[AOTInductor] Disable comprehensive_padding when use_runtime_constant_folding=True (#138872 ) Summary: Disable comprehensive_padding when use_runtime_constant_folding=True. We need to disable the comprehensive padding because it modifies the stride thus the stride information between the constant graph and main graph will differ. Test Plan: ``` buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/643940255/17/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR_EP" --aot-inductor-config="{'max_autotune': True, 'aot_inductor.use_runtime_constant_folding': True}" ``` Reviewed By: 22quinn, henryoier Differential Revision: D64927546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138872 Approved by: https://github.com/chenyang78	2024-10-27 01:12:27 +00:00
Mu-Chu Lee	fee17d530d	[AOTInductor] Add relu_nan_to_num option for pre-grad passes (#138545 ) Summary: Add a relu_nan_to_num in pre-grad pass. Test Plan: Included in commit Differential Revision: D64724780 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138545 Approved by: https://github.com/chenyang78	2024-10-27 00:57:11 +00:00
Richard Barnes	42994234a6	std::value/std::type -> std::_v/std::_t (#138746 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-10-26 20:59:24 +00:00
cyy	fb36daac9f	[7/N] Fix extra warnings brought by clang-tidy-17 (#138972 ) Fix extra warnings brought by clang-tidy-17 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138972 Approved by: https://github.com/Skylion007	2024-10-26 19:09:47 +00:00
Yifu Wang	3a6f014381	[Inductor] improve the stride preservation logic of user-visible outputs (#136732 ) ## Context Previously, the stride preservation of user-visible nodes worked as follows: - After joint-graph tracing, we recorded the names of user-visible nodes and passed them to GraphLowering. - In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`. - We obtained the original strides by checking `node.meta["val"].stride()`. However, there's a problem with this approach: the nodes in output_node.args[0] and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable. ## This PR - After joint graph tracing: - Record the original strides for all nodes in `output_nodes.args[0]` as `output_node.meta["original_output_strides"]` (recording for all nodes in case we need the info for other purposes such as debugging). - Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`. - Remove the original plumbing of `user_visible_outputs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732 Approved by: https://github.com/Chillee	2024-10-26 18:49:14 +00:00
Nikita Shulga	1d83a893c5	[BE][MPS] Use templates in Repeat shader (#138962 ) - Instead of generating shader from templated code on host, just define two specializations of one kernel template - Get rid of unused `threads_per_threadgroup` argument - Replace `if (typeid(scalar_t) == typeid(int32_t))` with `if constexpr (std::is_same_v<scalar_t, int32_t>)` in the host code Pull Request resolved: https://github.com/pytorch/pytorch/pull/138962 Approved by: https://github.com/janeyx99	2024-10-26 17:42:07 +00:00
Taras	e78c4ded48	Use the unicode variant of the Windows API (#47422 ) (#138605 ) Use the unicode variant of the Windows API in c10/util/Backtrace.cpp - #47422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138605 Approved by: https://github.com/peterjc123, https://github.com/malfet	2024-10-26 17:41:39 +00:00
cyy	1a73255102	Concat namespaces in jit code (#138976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138976 Approved by: https://github.com/Skylion007	2024-10-26 17:41:27 +00:00
Aaron Gokaslan	4de93d1ead	[BE][Ez]: Fix bad TypeIs conversion (#138990 ) Fixes a TypeIs / TypeGuard conversion error. Follow-up to #133814. Thanks to @ezyang for reminding me to double check the side conditions here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138990 Approved by: https://github.com/malfet	2024-10-26 17:37:40 +00:00
Laith Sakka	705f5b3489	Several enhancements for check_results.py (#137925 ) 1) always generate expected_results.csv up to accuracy of first three digits ex: 112313212312 --> 1120000000 .. etc 2) regenerate all record in expected_results.csv and not just failed ones , why? because if we change something by 1.3% and noise 1.5% we want to reflect that. 3) add "please update all results that changed significantly, and not only the failed ones" ``` (myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv te st_check_result/result_test.csv out WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results. please update all results that changed significantly, and not only the failed ones REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results. please update all results that changed significantly, and not only the failed ones REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results. please update all results that changed significantly, and not only the failed ones MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it. new expected results file content if needed: a,instruction count,9011000000,0.01 b,memory,20010000000,0.1 c,something,107100000000,0.1 There was some failures you can use the new reference expected result stored at path:out and printed above ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925 Approved by: https://github.com/aorenste	2024-10-26 16:27:55 +00:00
Yuanhao Ji	1a2dc89f17	[Dynamo] Allow `torch.cond()` to handle emply arguments (#138190 ) Fixes #138150 ```python import torch @torch.compile(fullgraph=True) def foo(x, y, z): def f(): return y + 2 def g(): return z + 1 return torch.cond(x, f, g) print(foo(torch.zeros(1), torch.ones(1), torch.ones(1))) # tensor([2.]) print(foo(torch.ones(1), torch.ones(1), torch.ones(1))) # tensor([3.]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138190 Approved by: https://github.com/ezyang, https://github.com/zou3519	2024-10-26 15:26:21 +00:00
Animesh Jain	c84f9b2069	[dynamo][guards] Log average time of constructed guard_manager (#138941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138941 Approved by: https://github.com/jansel ghstack dependencies: #138512, #138896	2024-10-26 15:14:46 +00:00
Animesh Jain	dba6887dc6	[dynamo][refactor][config-cleanp] Use guard_manager consistently instead of check_fn (#138896 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138896 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #138512	2024-10-26 15:14:46 +00:00
Aaron Gokaslan	49ed365b22	[BE]: Update Typeguard to TypeIs for better type inference (#133814 ) Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814 Approved by: https://github.com/ezyang	2024-10-26 15:07:13 +00:00
James Wu	eb6c7b93a7	Log AOTAutogradCache state to PT2 Compile Events (#138604 ) Same as previous diff for inductor, but for autograd instead Differential Revision: [D64765199](https://our.internmc.facebook.com/intern/diff/D64765199/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138604 Approved by: https://github.com/oulgen	2024-10-26 15:04:38 +00:00
Laith Sakka	f1a677cba5	In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804 ) Title + we avoid calling defer_assert when we statically know the guard results. Timing for pnasnet5large: ``` TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052 ``` This matches the timing without the diff: ``` TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804 Approved by: https://github.com/ezyang	2024-10-26 15:03:53 +00:00
Joel Schlosser	14a17ad630	Elide calls to is_nested in Dynamo-traced graphs (#138841 ) Before this PR, calling `is_nested` in-graph would result in graph code like the following: ```python class GraphModule(torch.nn.Module): def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"): l_nt_ = L_nt_ # Note this useless line! getattr_1 = l_nt_.is_nested; getattr_1 = None add: "f64[3, s1, 5]" = l_nt_ + 2; l_nt_ = None return (add,) ``` This PR follows what is done for `is_sparse` / `is_quantized`: store it onto `TensorVariable` and have `getattr` calls to `is_nested` return the stored value as a constant. This removes the useless line above from the graph. Note that guarding is handled through tensor type check guards, so no need to guard on `is_nested` status. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138841 Approved by: https://github.com/soulitzer	2024-10-26 15:03:32 +00:00
Adnan Akhundov	3234b251b3	Fix typos in CreateTMADescriptorVariable (#138877 ) This fixes some leftover typos in CreateTMADescriptorVariable.call_function (and close). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138877 Approved by: https://github.com/davidberard98, https://github.com/zou3519, https://github.com/Skylion007 ghstack dependencies: #138759	2024-10-26 15:03:07 +00:00
Xu Han	043864afdf	enable test_x86inductor_quantizer.py UTs on Windows. (#138937 ) These UTs were failing months ago, but as the main branch moved forward, some PRs fixed them. Let's turn them on. Local test passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138937 Approved by: https://github.com/jansel	2024-10-26 12:48:51 +00:00
Wu, Chunyuan	a3aca24ae5	[AOTI] add C shim for QLinearPointwise (#138439 ) This PR adds C shim for `QLinearPointwisePT2E` and `QLinearPointwiseBinaryPT2E`. The below changes are needed: - We moved the qlinear API out of the anonymous namespace since we need to call it in the shim layer. - We fixed the code which generated the `inputs` and `constant_args` so that we can directly leverage the `codegen` of the parent class. - `x_scale` and `x_zp` are ensured to be tensor during the lowering stage, thus we can remove the code which handles whether they're tensor or not. `fb0da32377/torch/_inductor/mkldnn_lowerings.py (L492-L496)` `fb0da32377/torch/_inductor/mkldnn_lowerings.py (L499-L503)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138439 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-10-26 08:04:15 +00:00
Simon Fan	99608ceed6	Scoped extension building for C++ backed custom ops tests (#136695 ) FIXES #125579 #131103 #133197 #133283 #134738 #135369 #135685 Tests that create C++ extensions can cause flakiness in CI due to library namespace conflict and test ordering. We can build them in temp dirs to ensure isolation. An alternative is to build these as part of the build process and have build time errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136695 Approved by: https://github.com/zou3519	2024-10-26 07:41:00 +00:00
Laith Sakka	10e2840ce3	Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548 ) update_hint_regression has been behaving, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression. I have one concern with the way we do the regression detection: small changes below the threshold level will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and potentially refresh the expected results file regularly, even when there are no failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548 Approved by: https://github.com/aorenste	2024-10-26 07:28:49 +00:00
Ke Wen	07e30eae2a	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ### Why can we set non-blocking as default? If the setting is dangling -- i.e. neither passed in by the user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so it has the same effect as blocked init, with no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-26 06:53:15 +00:00
Syed Tousif Ahmed	00504aa6b8	Adds snapshot API for MemPools to get pool memory segments (#133601 ) Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool. In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601 Approved by: https://github.com/ezyang	2024-10-26 03:34:59 +00:00
Kiuk Chung	940658405b	[test/test_cuda] Use temp file for test_improper_device_name (#138856 ) Use `tempfile.NamedTemporaryFile()` to have test_specify_improper_device_name save/load to a tmp file rather than the current-working-directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/138856 Approved by: https://github.com/Skylion007	2024-10-26 02:42:25 +00:00
Yidi Wu	0ac9a663ec	[hop] always trace subgraph with fake to support .item in eager mode (#138771 ) Fixes https://github.com/pytorch/pytorch/issues/138664 When we eagerly run torch.cond with autograd keys set, we'll create_fw_bw_graph using real tensors. This PR forces fakification when we cannot detect the fake mode, so as to trace the .item calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138771 Approved by: https://github.com/zou3519, https://github.com/malfet	2024-10-26 02:17:17 +00:00
Ryan Guo	f14247d5aa	[dynamo] Accurately identify mutated cells captured by multiple functions (#138632 ) This patch changes `mutated_closure_cell_contents: Set[str]` to `mutated_closure_cell_ids: Set[int]` so that Dynamo can more accurately identify closure cells across different instances of `UserFunctionVariable`. This prevents Dynamo from mistakenly treating a cell as immutable even though it will be mutated when referenced as a closure cell from another function. More context in https://github.com/pytorch/pytorch/issues/138112#issuecomment-2420580779. Fixes #138112. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138632 Approved by: https://github.com/jansel ghstack dependencies: #138639	2024-10-26 02:17:07 +00:00
Ryan Guo	1e1f0ceb40	Allow Lazy Module to be modelled as `UnspecializedNNModuleVariable` (#138639 ) This patch - removes the `is_lazy_module` check from `is_dynamic_nn_module`, and adds a regression test. - removes a series of dynamo expected failures on lazy modules. The few ones I checked all were failing due to speculation log divergence, similar to #138489. Note that #100047 introduced the conditional removed in this patch, and it was trying to fix #100001. But I've confirmed locally that #100001 no longer repros after this patch. Fixes #138489. See more context in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138639 Approved by: https://github.com/jansel	2024-10-26 02:17:07 +00:00
Aaron Gokaslan	4af93fdb77	[BE]: Update cudnn_frontend submodule to 1.8.0 (#138709 ) Update cudnn frontend. Let's see what breaks @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/138709 Approved by: https://github.com/eqy	2024-10-26 01:55:33 +00:00
Yukio Siraichi	565a53d326	Use DLPack for creating tensors out of custom classes, when available. (#138697 ) Fixes #120614 Takes over #120615 In summary, this PR: - Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](`cdfe1bffd1/torch/csrc/utils/tensor_new.cpp (L266)`)) - Creates the tensor by using the DLPack machinery, instead of an element-by-element copy - No changes since #120615 - Adds a test, making sure the DLPack machinery is used - Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods - Creates a new tensor from an instance of `TensorDLPackWrapper` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138697 Approved by: https://github.com/ezyang Co-authored-by: Wenzel Jakob <wenzel.jakob@epfl.ch>	2024-10-26 01:27:05 +00:00
Xia, Weiwen	e299193423	Bug fix: Use oneDNN for `torch._int_mm` CPU only when avx512_vnni is supported (#136942 ) Fixes #136746 If AVX512_VNNI is not supported, overflow occurs inside oneDNN. Fall back to ref path in such case. UT is also updated to catch the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136942 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-10-26 01:17:11 +00:00
Scott Wolchok	a3de067975	[PyTorch] Use 128-bit vectors for ARM64 (#137426 ) The correct vector length for ARM64 is 128 bits (16 bytes). We were previously using double this, apparently just because that would be the same length as AVX2. Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542, #138655, #138716, #138744	2024-10-26 00:20:35 +00:00
Kiuk Chung	7ada814107	[c10/util] Add explicit include of <mutex> to c10/util/env.cpp (#138854 ) Add explicit include of `<mutex>` to `c10/util/env.cpp` since it has usages of `std::lock_guard` which is defined in the header `<mutex>`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138854 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-10-26 00:16:05 +00:00
cyy	1605d4aeb8	Fix object slice (#138880 ) To avoid casting Tensor to Tensorbase Pull Request resolved: https://github.com/pytorch/pytorch/pull/138880 Approved by: https://github.com/Skylion007	2024-10-26 00:13:19 +00:00
Ke Wen	939fc4e335	[PGNCCL] Fix P2P data corruption in non-blocking mode (#138860 ) In non-blocking mode, it seems a single `ncclRecv` or `ncclSend` call can "early return" `ncclSuccess` before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption. Side note: per NCCL, it is legal to call `ncclSend` or `ncclRecv` only if there is only one P2P op. This is true whether we are in blocking or non-blocking mode. In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead. Added a test `test_non_blocking_p2p`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138860 Approved by: https://github.com/shuqiangzhang	2024-10-25 23:58:43 +00:00
Ke Wen	54d13a9348	[c10d][CI] Improve world size setting in some tests (#138846 ) Following change in #137161 , bumping world size for some test suites. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846 Approved by: https://github.com/fduwjj	2024-10-25 23:02:17 +00:00
Ke Wen	a57e418c1f	[PGNCCL] Use ncclSend and ncclRecv (#138875 ) Stop routing to `torch::cuda::nccl`. Use native `ncclSend` and `ncclRecv` APIs instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138875 Approved by: https://github.com/shuqiangzhang	2024-10-25 22:17:10 +00:00
Max Podkorytov	4d92d6e604	[Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643 ) Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last. This way, the channels-last CK instances will be added to benchmark choices in max autotune # Testing ``` pytest test/inductor/test_ck_backend.py -k conv2d ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643 Approved by: https://github.com/chenyang78	2024-10-25 22:11:44 +00:00
PyTorch MergeBot	36b7135c6f	Revert "[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681 )" This reverts commit 6cadf616aeb612f3c866b734268919ad1616ffaf. Reverted https://github.com/pytorch/pytorch/pull/138681 on behalf of https://github.com/jeanschmidt due to Introduced regressions on linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138681#issuecomment-2438945493))	2024-10-25 22:07:30 +00:00
Jiawen Liu	14b8028c81	[Pytorch][ATEN] Enable FP8 NCCL in Pytorch ATEN (#138776 ) Summary: Enable FP8 NCCL in Pytorch ATEN to unblock FP8 collective communication such as FP8 all-to-all Test Plan: CI & D64374424 Differential Revision: D64866426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138776 Approved by: https://github.com/eqy, https://github.com/jianyuh	2024-10-25 21:56:47 +00:00
Sam Larsen	86b45bde19	[pt2] Add logger logging for remote fx graph cache get + put (#138164 ) Summary: Capture the timing for the remote fx graph cache get and put operations and add them to the logger logging. Test Plan: 1) Landed D64483593 and waited for logger actualization. 2) Ran test script on devserver: `buck2 run mode/opt scripts/slarsen/torch_compile_model:run` 3) Queried dynamo_compile/sandbox: ``` (pytorch-3.10_4) devvm2296:~/local/pytorch-3.10_4 $ scuba -e="select time,co_filename,remote_fx_graph_cache_get_time_s,remote_fx_graph_cache_put_time_s from \`dynamo_compile/sandbox\` where remote_fx_graph_cache_put_time_s is not null" +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+ \| time \| co_filename \| remote_fx_graph_cache_get_time_s \| remote_fx_graph_cache_put_time_s \| +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+ \| 1729136266 \| null \| 0.05652284622192383 \| 0.9691152572631836 \| \| 1729136263 \| /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/289bb46b326874c6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py \| 0.8298435211181641 \| 0.18642282485961914 \| +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+ ``` Reviewed By: oulgen Differential Revision: D64484025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138164 Approved by: 
https://github.com/jamesjwu, https://github.com/ezyang	2024-10-25 21:30:18 +00:00
Menglu Yu	78377ec130	[PT2][Optimus] Normalize Clamp to use kwargs (#138723 ) Summary: The current clamp normalization does not cover torch.clamp calls whose min and max are not normalized to kwargs, so the batch fusion of clamp can hit the problem where min and max are both empty. Test Plan: ``` buck2 run mode/opt servicelab/ai_ml/auto_tune:local_model_pt2 -- --flow_id 654509735 --test_mode split ``` GPU type: NVIDIA PG509-210 =============Print full analysis for offsite_cvr_oba_optout_dedicated_model================ \| Metric \| Value \| \|:-------------------\|:-----------------\| \| GPU type \| A100 \| \| Batch size \| 10 \| \| Latency \| 227.13 ms \| \| Model size \| 2322763344 bytes \| \| Flops/example \| 1136.52 G \| \| TFLOPS \| 50.04 \| \| MFU \| 16.04% \| \| Activation/example \| 2722.49 MB \| I1023 112249.043 local_model_with_pt2.py:25] benchmark results [('batch_size', 10), ('latency_ms', 22712), ('model_size_bytes', 2322763344), ('flops_per_example', 113652), ('tflops_g', 5003), ('mfu', 1603), ('activation_per_example_mb', 272249) Differential Revision: D64848369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138723 Approved by: https://github.com/jackiexu1992	2024-10-25 21:05:39 +00:00
Shivam Raikundalia	a874ec85e8	[Functorch] Fix devices Parameter Type in benchmark_utilization Function (#138774 ) Summary: Issue described in https://github.com/pytorch/pytorch/issues/136697 Original user does not have CLA privileges so this is my commandeer Test Plan: OSS CI Differential Revision: D64872833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138774 Approved by: https://github.com/davidberard98	2024-10-25 19:25:18 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	3a0c361899	Remove presere ops (#138371 ) Summary: CI #buildall Test Plan: CI Reviewed By: StellarrZ Differential Revision: D64151426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138371 Approved by: https://github.com/bdhirsh	2024-10-25 19:13:55 +00:00
Ting Lu	b988388bac	Add CUDA 12.6 to Linux CD docker images (#138563 ) Reference https://github.com/pytorch/builder/pull/1003/files Related to #138440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138563 Approved by: https://github.com/malfet	2024-10-25 19:10:07 +00:00
Eddie Yan	846b4e614b	[TF32][cuDNN][Convolution] Add some missing TF32 decorators (#138768 ) Newer cuDNN versions seem to be able to dispatch to cuDNN kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/138768 Approved by: https://github.com/Skylion007	2024-10-25 19:03:42 +00:00
Yidi Wu	c6bb9b53f4	[scan] better error handling and remove redundant tests (#137967 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137967 Approved by: https://github.com/zou3519	2024-10-25 19:01:25 +00:00
Guilherme Leobas	7d283309d8	Avoid calling `realize()` on LazyVariableTracker on reconstruct (#138495 ) Fixes: https://github.com/pytorch/pytorch/issues/137686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138495 Approved by: https://github.com/zou3519	2024-10-25 19:01:15 +00:00
chilli	392221b390	Made DDPOptimizer work with HOPs (#138787 ) Fixes https://github.com/pytorch/pytorch/issues/137481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138787 Approved by: https://github.com/yf225 ghstack dependencies: #138733, #138794, #138881	2024-10-25 18:59:01 +00:00
chilli	07dbc42881	Stop force realizing to prevent recursion errors unless it's much bigger (#138881 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138881 Approved by: https://github.com/shunting314 ghstack dependencies: #138733, #138794	2024-10-25 18:59:01 +00:00
Colin	de54246c42	Recomend pip install -r requirements in the unit testing guidelines. (#137797 ) Somehow make setup-env, as recommended in CONTRIBUTING.MD, does not install all dependencies required to run tests. This makes it slightly clearer when running tests. The specific repro on my side was ``` git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4 make setup-env conda activate pytorch-deps python test/test_utils_internal.py ``` which is what my reading of the instructions implies should be correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797 Approved by: https://github.com/albanD	2024-10-25 18:47:44 +00:00
Edward Z. Yang	03f9136870	Add wait counter on cuda::device_synchronize (#138883 ) The wait counter is typically only minute precision, but if there is a collective in the queue it will show up. We think this explains up to eight minutes of delay in some compile traces we're looking at, but the counter would definitively prove it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D64944970](https://our.internmc.facebook.com/intern/diff/D64944970) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138883 Approved by: https://github.com/eqy	2024-10-25 18:13:57 +00:00
Edward Z. Yang	dbbdfd9df5	Add pytorch.wait_counter.dynamo_compile (#138072 ) I was discussing with James March how the current fx_codegen_and_compile counter doesn't actually capture all compile time. This one is more accurate and corresponds closely to the existing events in dynamo_compile table. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072 Approved by: https://github.com/markkm	2024-10-25 18:12:34 +00:00
Huy Do	77587f43d2	Add one more shard for CPU pull jobs (#138894 ) The first shard is close to 3.5 hours and timing out flakily in trunk now, for example https://github.com/pytorch/pytorch/actions/runs/11509141659/job/32039126506. So, I think we could just add one more shard in the same spirit as https://github.com/pytorch/pytorch/pull/137433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138894 Approved by: https://github.com/Skylion007	2024-10-25 18:09:50 +00:00
Xinran / Allan Rui	ba6526814a	Add dtype attribute to CSEVariable (#136778 ) Summary: - This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op. - There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen. Test Plan: CI Differential Revision: D61815079 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778 Approved by: https://github.com/eellison, https://github.com/blaine-rister	2024-10-25 18:00:30 +00:00
Adam Mainz	d0640b945b	[inductor][nit] removing unnecessary else statements (#138789 ) Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read Test Plan: CI Differential Revision: D64882385 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789 Approved by: https://github.com/aakhundov	2024-10-25 17:59:25 +00:00
Richard Barnes	69af467d4f	Eliminate c10::value_or_else (#138818 ) Test Plan: Sandcastle Differential Revision: D64857418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138818 Approved by: https://github.com/malfet, https://github.com/Skylion007	2024-10-25 17:59:01 +00:00
Gagan Jain	a6287b5c27	Fixing issue in move pass for copying Parameter (#138855 ) Summary: Fixing bug for Parameter copy during move pass of exported graph. Test Plan: UT runs on APS models. Differential Revision: D64876951 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138855 Approved by: https://github.com/pianpwk Co-authored-by: Gagan Jain <gaganj@meta.com>	2024-10-25 17:57:27 +00:00
Brian Hirsh	375d71cc5a	plumb is_export flag to FunctionalTensorMode in analysis pass (#138836 ) Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`. Test Plan: CI Differential Revision: D64915263 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836 Approved by: https://github.com/tugsbayasgalan	2024-10-25 17:56:14 +00:00
Richard Barnes	3d0aa6f049	Update readme with std::optional (#138914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138914 Approved by: https://github.com/malfet	2024-10-25 17:40:58 +00:00
PyTorch MergeBot	6f66398ab8	Revert "[aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731 )" This reverts commit 245026af2d2f26c74993cb90e01bddbd627c6797. Reverted https://github.com/pytorch/pytorch/pull/138731 on behalf of https://github.com/jeanschmidt due to introduced regressions on linux-focal-cuda12.1-py3.10-gcc9-bazel-test ([comment](https://github.com/pytorch/pytorch/pull/138731#issuecomment-2438417669))	2024-10-25 17:37:32 +00:00
PyTorch MergeBot	447bb72822	Revert "[c10d][CI] Improve world size setting in some tests (#138846 )" This reverts commit 9c35e33d9b02e384f0d504f942a916e9e849b163. Reverted https://github.com/pytorch/pytorch/pull/138846 on behalf of https://github.com/jeanschmidt due to introduced breaks in linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138846#issuecomment-2438415315))	2024-10-25 17:35:27 +00:00
Xuan Zhang	2980aed65b	[inductor][memory] restructuring memory.py and turn on the flag (#137205 ) Addressing additional comments given in PR https://github.com/pytorch/pytorch/pull/134874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137205 Approved by: https://github.com/eellison	2024-10-25 17:19:34 +00:00
Animesh Jain	817b4988e4	[dynamo][config-cleanup] Remove enable_cpp_guard_manager=False codepath (#138512 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138512 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-10-25 16:41:55 +00:00
eellison	fe18a221eb	Add debug backend that applies CrossRefFakeMode, use in compiler bisector (#138651 ) I was debugging an internal ne divergence for a while that turned out to be caused by a bad meta. I added an explicit config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable CrossRefFakeMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as a config option. It only tests ops that have a meta, to avoid the memory overhead of hitting the fallback path and running in eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-10-25 15:58:36 +00:00
Sam Larsen	6cadf616ae	[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681 ) Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not. Test Plan: The new test fails with fast mode commented out, but succeeds when enabled: `python test/inductor/test_codecache.py -k test_stable_strings` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681 Approved by: https://github.com/oulgen	2024-10-25 15:52:58 +00:00
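The object-identity effect described in the commit above (#138681) can be reproduced with the standard `pickle` module alone. This is a minimal sketch, not the FxGraphPickler code: the helper names `dumps_default` and `dumps_fast` are hypothetical, and it assumes CPython's `Pickler.fast` attribute, which disables the memo.

```python
import io
import pickle


def dumps_default(obj):
    """Pickle with the memo enabled (the default behaviour)."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def dumps_fast(obj):
    """Pickle with the memo disabled via Pickler.fast."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
    pickler.fast = True
    pickler.dump(obj)
    return buf.getvalue()


a = "cuda"
b = "".join(["cu", "da"])  # equal to a, but a distinct object
shared = [a, a]            # the same string object twice
distinct = [a, b]          # two equal but separate string objects

# With the memo, the second occurrence in `shared` becomes a back-reference,
# so the two byte streams differ even though the lists compare equal.
memo_differs = dumps_default(shared) != dumps_default(distinct)

# Without the memo, every string is written out in full both times.
fast_matches = dumps_fast(shared) == dumps_fast(distinct)

print(memo_differs, fast_matches)
```

Fast mode gives up shared-reference and cycle handling, so this only suits data where identity does not need to survive a round trip, such as a cache key.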
Yuanhao Ji	78a0158540	[Dynamo] Improve `args` in `higher_order_ops` [1/N] (#138799 ) Replaced hard-coded argument indices with meaningful variable names. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138799 Approved by: https://github.com/zou3519	2024-10-25 13:55:41 +00:00
Nikita Shulga	45b8155a07	[CI] Run periodic jobs only on pytorch/pytorch repo (#138874 ) GitHub by default tries not to run periodic jobs on forks; see https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/disabling-and-enabling-a-workflow However, there is a special test repo called `pytorch/canary` that will run those workflows for the next 60 days, which is a waste of resources. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138874 Approved by: https://github.com/huydhn	2024-10-25 13:42:37 +00:00
IvanKobzarev	245026af2d	[aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138731 Approved by: https://github.com/bdhirsh	2024-10-25 12:35:52 +00:00
Howard Huang	2c82f73647	[Pipelining] Clean up hooks in zero bubble (#138720 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138720 Approved by: https://github.com/wconstab ghstack dependencies: #138119, #138504, #138735	2024-10-25 12:06:54 +00:00
Howard Huang	12755f45ff	[Pipelining] small comments and variable renames (#138735 ) Addressing the comments in previous PRs to update the variable names and add additional code comments Pull Request resolved: https://github.com/pytorch/pytorch/pull/138735 Approved by: https://github.com/wconstab ghstack dependencies: #138119, #138504	2024-10-25 12:06:54 +00:00
Ke Wen	9c35e33d9b	[c10d][CI] Improve world size setting in some tests (#138846 ) Following change in #137161 , bumping world size for some test suites. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846 Approved by: https://github.com/fduwjj	2024-10-25 10:40:21 +00:00
Edward Z. Yang	a1175e3437	[BE] Strides are always non-negative, remove pointless test (#138784 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138784 Approved by: https://github.com/Chillee	2024-10-25 10:39:32 +00:00
Mwiza Kunda	22d2e2d9a0	Set RUNPATH so installed tests can find the required shared libraries (#136627 ) This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on. For example, currently: ```bash venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy ./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627 Approved by: https://github.com/malfet	2024-10-25 09:38:08 +00:00
Xuehai Pan	86d4b7d60b	[FX][export][dynamo] use `tuple` instead of `list` in normalized `args_spec` (#138212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138212 Approved by: https://github.com/jansel	2024-10-25 06:43:55 +00:00
cyyever	ce631939f0	[Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692 Approved by: https://github.com/ezyang	2024-10-25 05:32:38 +00:00
Nikita Shulga	b999daf7a9	Add sets to list of safe objects to de-serialize (#138866 ) Lists, dicts and tuples are already allowed, so it's a bit weird to exclude set from the list of basic containers. Test plan (in addition to unittest): ```python torch.save({1, 2, 3}, "foo.pt") torch.load("foo.pt", weights_only=True) ``` Fixes https://github.com/pytorch/pytorch/issues/138851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138866 Approved by: https://github.com/mikaylagawarecki Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>	2024-10-25 05:23:08 +00:00
dependabot[bot]	907f001a68	Bump onnx from 1.16.1 to 1.17.0 in /.ci/docker (#138719 ) Bumps [onnx](https://github.com/onnx/onnx) from 1.16.1 to 1.17.0. Release notes: https://github.com/onnx/onnx/releases; commits: https://github.com/onnx/onnx/compare/v1.16.1...v1.17.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138719 Approved by: https://github.com/ezyang Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-25 03:53:25 +00:00
David Berard	94e341c6a3	[user triton] fix codegen for tl.constexpr globals (#138757 ) Fixes #138509 tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added but the corrected string was not being used in the un-annotated branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757 Approved by: https://github.com/oulgen	2024-10-25 03:00:42 +00:00
Will Feng	36c6ad71ba	[tlparse] Add `dynamo_graph_break_reason` logging to trace_structured (#138778 ) A common challenge during torch.compile enablement is answering the user's question: "where is the graph break?". This PR makes it easier to answer by surfacing graph breaks and their corresponding user stack trace / compiler stack trace in a direct link, e.g. `0_0_0/dynamo_graph_break_reason_0.txt`, from tlparse index.html. ![image](https://github.com/user-attachments/assets/79cd43f5-af14-4d08-9d5b-cb47d8203851) ![image](https://github.com/user-attachments/assets/23233ee2-0d56-4526-bf9a-d22c337c4d18) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138778 Approved by: https://github.com/ezyang	2024-10-25 02:00:04 +00:00
chilli	9425c0767d	Fix free symbol handling in FlexAttention (#138794 ) Fixes https://github.com/pytorch/pytorch/issues/136196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138794 Approved by: https://github.com/Skylion007 ghstack dependencies: #138733	2024-10-25 01:20:42 +00:00
Adnan Akhundov	f737e3fe2f	[inductor] Fix ReinterpretView call in TMADescriptor IR (#138759 ) As a result of #137768, `ReinterpretView` call in the `TMADescriptor` has become invalid. This leads to some TMA tests breaking in test_triton_kernels.py. In this PR, we fix this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138759 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-10-25 00:45:44 +00:00
Yifu Wang	ed9169df98	Removed the typing information for already deleted ProcessGroupCudaP2P (#138753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753 Approved by: https://github.com/weifengpy	2024-10-25 00:32:07 +00:00
Shivam Raikundalia	2f4af0f4e6	[Profiler] Disable Dynamo-Sensitive Profiler Tests (#138762 ) Summary: During compilation, a profiler context gets ignored so we should temporarily turn off tests that are failing due to dynamo. Once profiler integration with dynamo is introduced we can reintroduce these tests Test Plan: Make sure CI is passing again Differential Revision: D64867447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762 Approved by: https://github.com/davidberard98	2024-10-25 00:25:49 +00:00
Avik Chaudhuri	1d98a526dd	preserve signatures with multiple calls + buffer mutations (#138669 ) As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures). This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead. Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition. Differential Revision: D64806175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669 Approved by: https://github.com/tugsbayasgalan	2024-10-25 00:13:25 +00:00
Shuqiang Zhang	4c91481656	[c10d] allow sub group to be eagerly inited even if default one is not (#138665 ) Summary: Currently, eager mode is applied either to all PGs or NONE of them. There are cases where we don't want to initialize the comms for default PG, but we still want to initialize the comms for sub PG. Now with a device_id passed to new group, we can achieve this case Test Plan: newly added UT Tags: Resolves https://github.com/pytorch/pytorch/issues/137018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665 Approved by: https://github.com/kwen2501 ghstack dependencies: #138781	2024-10-24 23:51:28 +00:00
Avik Chaudhuri	277b32c930	fix unflatten training ir test suffix (#138840 ) Test Plan: none Differential Revision: D64917214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138840 Approved by: https://github.com/zhxchen17	2024-10-24 23:42:54 +00:00
Chirag Pandya	425ce2a7ee	[c10d] use a promise to delay watchdog shutdown (#138828 ) Summary: We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread before it has time to write out the Flight Recorder logs. This change: 1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes. 2. Uses a promise between the watchdog thread and the heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out its log on timeout. Test Plan: Tested on my local job and flight recorder successfully executed for the job. https://fburl.com/mlhub/38fj5yne The watchdog thread gives heartbeat thread time to write out the logs. In the logs we see: ``` [trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish. ``` Differential Revision: D64857928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828 Approved by: https://github.com/d4l3k, https://github.com/fduwjj	2024-10-24 23:42:29 +00:00
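The bounded handshake described in the commit above (#138828) can be illustrated in a few lines. The actual change lives in the C++ `ProcessGroupNCCL` code; what follows is only a Python analogy using `concurrent.futures.Future` as the promise, and every name in it (`heartbeat_monitor`, `watchdog_shutdown`, `run`, the timings) is hypothetical.

```python
import threading
import time
from concurrent.futures import Future, TimeoutError as FutureTimeout


def heartbeat_monitor(dump_done: Future, dump_seconds: float) -> None:
    """Stand-in for the heartbeat thread: write the dump, then fulfil the promise."""
    time.sleep(dump_seconds)  # placeholder for writing the Flight Recorder logs
    dump_done.set_result(True)


def watchdog_shutdown(dump_done: Future, max_wait_seconds: float) -> bool:
    """Wait for the dump to finish, but never longer than max_wait_seconds."""
    try:
        return dump_done.result(timeout=max_wait_seconds)
    except FutureTimeout:
        return False  # the dump took too long; shut down anyway


def run(dump_seconds: float, max_wait_seconds: float) -> bool:
    """Start a monitor thread and report whether the dump finished in time."""
    dump_done: Future = Future()
    worker = threading.Thread(
        target=heartbeat_monitor, args=(dump_done, dump_seconds), daemon=True
    )
    worker.start()
    return watchdog_shutdown(dump_done, max_wait_seconds)


print(run(dump_seconds=0.05, max_wait_seconds=1.0))  # dump finishes within the bound
print(run(dump_seconds=1.0, max_wait_seconds=0.05))  # bound expires first
```

Compared with a fixed sleep, the waiter returns as soon as the dump completes and still has a hard upper bound when it does not.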
Henry Tsang	751987eed1	[pt2] improve error logs for torch.cond and aoti package (#138647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138647 Approved by: https://github.com/ydwu4, https://github.com/angelayi	2024-10-24 23:38:07 +00:00
Henry Tsang	3e4ba18eb5	[aoti] fix typo in codegen_dynamic_scalar (#138760 ) Summary: appears to be a typo Test Plan: ci Differential Revision: D64867271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138760 Approved by: https://github.com/ezyang	2024-10-24 23:16:30 +00:00
Pian Pawakapan	09848c892a	[aot_compile] propagate ShapeEnv during lowering (#138362 ) We found that during `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused. Differential Revision: D64613290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362 Approved by: https://github.com/angelayi	2024-10-24 22:22:14 +00:00
Angela Yi	51f6b946ae	[torchbind] Add generic __deepcopy__ method (#137613 ) Summary: Added a generic `__deepcopy__` method which will use the torchbind object's existing `__getattr__` and `__setattr__` to copy the torchbind object. This will later be used in [D64124825](https://www.internalfb.com/diff/D64124825) Differential Revision: D64124826 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137613 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-10-24 22:14:55 +00:00
James Wu	282e6383c1	Add inductor cache metrics (#138603 ) Each inductor event should have exactly one hit, miss, bypass, etc. Add it to the inductor compile event. Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well. Here's what triton events look like: {F1941513932} And this on a cache hit (since we still redo this work): {F1941514350} Inductor cache info: {F1941528530} Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/) @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603 Approved by: https://github.com/oulgen	2024-10-24 22:09:34 +00:00
Yiming Zhou	e78a3e260b	[export] Add serdes_non_strict to tests (#138662 ) Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped. Test Plan: ``` buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict ``` Differential Revision: D64709285 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662 Approved by: https://github.com/avikchaudhuri	2024-10-24 21:35:32 +00:00
Bob Ren	500b2bc781	Have as_tensor always return a float64 tensor in dynamo (#138598 ) As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598 Approved by: https://github.com/ezyang ghstack dependencies: #138595	2024-10-24 20:50:28 +00:00
ernest-lu	5b50b0a9bc	remove dead code (#138690 ) Fixes issue-138673: [issue](https://github.com/pytorch/pytorch/issues/138673) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138690 Approved by: https://github.com/Aidyn-A, https://github.com/colesbury	2024-10-24 20:29:24 +00:00
Scott Wolchok	10a34dcd57	[PyTorch] Fix out-of-bounds array access in atomic_add_vec (#138744 ) There is no guarantee that `len` here is enough for a full vector. This was causing at least one test failure on https://github.com/pytorch/pytorch/pull/137426. Differential Revision: [D64857786](https://our.internmc.facebook.com/intern/diff/D64857786/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138744 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542, #138655, #138716	2024-10-24 19:37:12 +00:00
Scott Wolchok	0af7632c10	[PyTorch] Fix ASAN failures for vec_test_all_types Cast test (#138716 ) The size of the destination array was too small. Differential Revision: [D64843491](https://our.internmc.facebook.com/intern/diff/D64843491/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138716 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542, #138655	2024-10-24 19:37:12 +00:00
Scott Wolchok	cbafe1e7f3	[PyTorch] Unbreak VectorizedN fmadd/fmsub/clamp (#138655 ) These are ternary ops, not binary ops. Differential Revision: [D64794253](https://our.internmc.facebook.com/intern/diff/D64794253/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138655 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542	2024-10-24 19:37:02 +00:00
Scott Wolchok	ead5738ff2	[PyTorch] Fix inductor bug with unrolled vectorized prod (#138542 ) This issue is one of two inductor bugs blocking land of #137426. Turned out to be simple Differential Revision: [D64734116](https://our.internmc.facebook.com/intern/diff/D64734116/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138542 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486 Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-10-24 19:36:51 +00:00
Scott Wolchok	6aa673377b	[PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486 ) In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do. Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486 Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet	2024-10-24 19:36:41 +00:00
Shunting Zhang	239a21f37e	[Inductor] don't set XBLOCK larger than xnumel (#138730 ) When fp8 dtype is involved, Inductor may set min_elem_per_thread to be a positive value. This will force increasing XBLOCK even for a small xnumel (e.g. 1). Inductor will report an error later when sanity-checking the triton config. The simple fix here is to just not let XBLOCK be larger than xnumel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730 Approved by: https://github.com/Chillee ghstack dependencies: #136782	2024-10-24 18:31:10 +00:00
PyTorch MergeBot	e7f1e306df	Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager `async_op=True` collective (#137763 )" This reverts commit 362ca54f03f9bb72ba7633ed580fb788b1a8dea9. Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/wdvr due to this change is breaking our prod training pipeline (verified with bisect) by increasing memory consumption 4x and causing OOM ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2435962833))	2024-10-24 17:46:09 +00:00
PyTorch MergeBot	8197e4c70d	Revert "[sparse] add search for optimal alg_id to torch.compile (#137427 )" This reverts commit 39bfba3f561e3125ce035de0bf90c8c7bcccd3ce. Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))	2024-10-24 17:27:06 +00:00
IvanKobzarev	5ea6777861	[subclass] Unwrap_tensor_subclasses micro optimization (#138498 ) unwrap_tensor_subclasses -> get_plain_tensors Is used at runtime. For small models this overhead is feasible in comparison with small compiled kernel. 1/ Removing asserts from runtime path 2/ Removing list creation with using optional output list to append argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498 Approved by: https://github.com/bdhirsh	2024-10-24 16:54:54 +00:00
Shuqiang Zhang	fe458eef80	[c10d] fix a logic of using ncclCommSplit (#138781 ) Summary: Currently, whether split should be used depends on the size of subgroup. It's possible that default PG is not eagerly initialized yet, but split is still called. This PR fixes this issue by removing split's dependency on subgroup size Test Plan: Modified UT Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781 Approved by: https://github.com/kwen2501	2024-10-24 16:16:35 +00:00
Irem Yuksel	b021486405	Enable Windows Arm64 (#133088 ) This PR enables Pytorch for Windows on Arm64 - CPU only. Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible. We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088 Approved by: https://github.com/malfet Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com> Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com> Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2024-10-24 16:10:44 +00:00
eqy	f7bb11dcc2	[cuDNN][cuDNN Frontend] Check in test for previously broken dBias check (#138725 ) see https://github.com/pytorch/pytorch/issues/137347, let's try to land before https://github.com/pytorch/pytorch/pull/138709 CC @malfet @drisspg @Skylion007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138725 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-10-24 15:33:58 +00:00
Richard Barnes	8f62832189	c10::nullopt -> std::nullopt (#138701 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138701 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-10-24 15:03:32 +00:00
Sam Larsen	7e62ac51a1	[pt2] [testing] Skip inductor_freezing - test_cpp_wrapper_cuda internally (#138366 ) Summary: It's been failing CI since probably forever; skip for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/138366 Approved by: https://github.com/eellison	2024-10-24 14:40:13 +00:00
Isuru Fernando	5c88a9f6c0	Assume that indices are non-negative in _unsafe_masked_index (#137315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137315 Approved by: https://github.com/eellison	2024-10-24 12:39:31 +00:00
Nick Westlake	0d9fb51028	Fix lru_cache where config is used (#134235 ) Ensure that any use of functools.lru_cache does not prevent config from being changed after the function has already run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134235 Approved by: https://github.com/masnesral	2024-10-24 10:43:34 +00:00
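The failure mode named in #134235 can be illustrated with a minimal sketch. Everything here is hypothetical (the `Config` class and the function names are illustrative stand-ins, not PyTorch code): a function wrapped in `functools.lru_cache` that reads a config value inside its body keeps returning its first result after the config changes. One possible remedy, assumed here and not necessarily what the PR does, is to pass the config value as an argument so it becomes part of the cache key.

```python
import functools


class Config:
    """Hypothetical stand-in for a mutable config module."""
    max_threads = 4


@functools.lru_cache(maxsize=None)
def stale_thread_count():
    # Reads the config inside the cached body, so the first result is frozen.
    return Config.max_threads * 2


@functools.lru_cache(maxsize=None)
def _thread_count_for(max_threads):
    # The config value is an argument, so it is part of the cache key.
    return max_threads * 2


def fresh_thread_count():
    # Look up the current config value on every call, then hit the cache.
    return _thread_count_for(Config.max_threads)


first_stale = stale_thread_count()
first_fresh = fresh_thread_count()

Config.max_threads = 8  # change the config after both functions have run

second_stale = stale_thread_count()   # still reflects max_threads == 4
second_fresh = fresh_thread_count()   # reflects max_threads == 8
```

Keying the cache on the value keeps the memoization benefit while letting later config changes take effect.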
Richard Barnes	e7d4de0e59	Eliminate C10_TYPENAME_CONSTEXPR (#138702 ) Test Plan: Sandcastle Differential Revision: D64833560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138702 Approved by: https://github.com/malfet	2024-10-24 10:21:01 +00:00
Yu, Guangye	0efa590d43	[CI] Fix XPU CI failure (#138548 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/138577. # Solution 1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170 2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`. 3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py` 4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py` 5. CUDA-bias code is introduced by https://github.com/pytorch/pytorch/issues/138472, we just generalize it to `GPU_TYPE`. # Additional Context > Why update torch-xpu-ops commit pin here? We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364). > What does the feature of torch-xpu-ops update? 1. Add some foreach ops, like `unary ops` and `foreach_clamp_max` etc; 2. Add some maxpool ops forward and backward, like `averge_pool3d` and `max_pool3d` 3. Add some other ops, like `log_normal_`, `index_copy`, and `mode` etc; 4. fix build failure related to `C10_UNUSED`; Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548 Approved by: https://github.com/malfet, https://github.com/EikanWang	2024-10-24 07:56:26 +00:00
Richard Barnes	dbf0fa811a	Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479 ) BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479 Approved by: https://github.com/eqy, https://github.com/malfet	2024-10-24 07:51:05 +00:00
Xu Han	96b30dcb25	[Windows][cpu] mkl use mimalloc as allocator on Windows (#138419 ) We did a lot of optimization for PyTorch Windows, and we got good progress of it. But still some models have performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion From the blog conclusion, we found the `ResNet50` is typical case of it. Let's focus on the `ResNet50`, and collect the profiling log: ```cmd (nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1 aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300 aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300 aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300 aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300 aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300 aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300 aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300 aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100 aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100 aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900 aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900 aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900 aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000 aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100 aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100 aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300 aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300 aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100 aten::mean 
0.01% 1.737ms 0.08% 13.845ms 138.448us 100 aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100 aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300 aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100 aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100 aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100 aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200 aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100 aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100 aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100 aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100 aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100 aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100 aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200 aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100 aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200 --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 17.448s Execution time: 20.02380895614624 ``` We found that the kernel consuming the most CPU time is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`. Actually, we had already optimized memory allocation by integrating mimalloc into the pytorch C10 module. It helps PyTorch Windows boost a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory. We still have potential to improve PyTorch Windows performance by optimizing `MKL` and `MKLDNN`'s intermediary temporary memory. So, I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which would help MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR Changes: 1.
Add cmake option: `USE_MIMALLOC_ON_MKL`. It is a sub-option of `USE_MIMALLOC`. 2. Wrap and export mi_malloc APIs in C10, when `USE_MIMALLOC_ON_MKL` is `ON`. 3. Add MklAllocationHelp.cpp to register allocation APIs to MKL, when `USE_MIMALLOC_ON_MKL` is `ON`. For `oneDNN`, it is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-10-24 05:29:47 +00:00
chilli	a94c501b84	Fixed max-autotune in FlexAttention to reset kernel options appropriately (#138733 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138733 Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng	2024-10-24 05:18:09 +00:00
cyy	2bcfbf2505	[Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465 ) Follows #137404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465 Approved by: https://github.com/ezyang	2024-10-24 04:58:49 +00:00
cyy	53e356a1c0	[2/N] Enable cppcoreguidelines-special-member-functions (#138670 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138670 Approved by: https://github.com/sraikund16	2024-10-24 04:35:18 +00:00
Animesh Jain	cfdf658a91	[dynamo][modules] Support overridden __call__ on nn modules (#138619 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138619 Approved by: https://github.com/williamwen42 ghstack dependencies: #138657	2024-10-24 03:49:26 +00:00
Animesh Jain	b1acd0978e	[dynamo] Support range_iterator as a function input (#138657 ) Fixes https://github.com/pytorch/pytorch/issues/138654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138657 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-10-24 03:49:26 +00:00
Doru Bercea	e5c3d7ab77	[ROCm] Improve performance of reductions on 1D and 2D tensors. (#137737 ) This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types: - 1D tensors - 2D tensors when reducing along dimension 0 - 2D tensors when reducing along dimension 1 Runtime reduction between 0 and 75% depending on tensor shape. The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737 Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell	2024-10-24 03:41:16 +00:00
fduwjj	d8f22a1141	[c10d] Reorder GIL checker and c++ stack trace print with comments (#138734 ) We found one case when the GIL deadlock happens and then FR timeout, I am wondering if we can do the GIL check before cpp stack trace print which can lead to hang Pull Request resolved: https://github.com/pytorch/pytorch/pull/138734 Approved by: https://github.com/c-p-i-o	2024-10-24 02:21:37 +00:00
Colin L. Rice	0b9320b7c5	fx_graph_cache: Remove custom amd JK (#137501 ) This split in JKs was never actually used (We just set both JKs to the same values except when we accidentally didn't due to being humans who make mistakes). This simplifies the overall JK structure and eventually, will let us delete the duplicate JK Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501 Approved by: https://github.com/oulgen	2024-10-24 01:30:39 +00:00
Howard Huang	32a3dbc645	[Pipelining] Free memory usage earlier in last stage (#138504 ) This fix is similar to that done in #138119, except this is an edge case for the last stage. For the last stage we perform backward on the `loss` which we detached in the previous PR. However, we also hold the `stage_outputs` alive because we return all the output chunks in `merge_output_chunks()` after the step is over. This will also still keep the autograd graph alive, so detaching these tensors frees the memory earlier. pre-fix: <img width="1780" alt="image" src="https://github.com/user-attachments/assets/bb78bde7-fd5c-4eba-bfc9-f0359e20bbab"> post-fix: <img width="1788" alt="image" src="https://github.com/user-attachments/assets/a26102d9-9db2-4fc8-946c-336b8430657c"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138504 Approved by: https://github.com/wconstab ghstack dependencies: #138119	2024-10-24 00:44:03 +00:00
Howard Huang	8945309c08	[Pipelining] fix extra memory usage in zero bubble (#138119 ) Full debugging details in here: https://docs.google.com/document/d/1Pe_E0KWAfsJ6MCvKZ5aR28rTXX-rYLg13XxwXd6AALw/edit?usp=sharing In zero bubble, we have two methods `stage_backward_input` and `stage_backward_weight`. During `stage_backward_input` we compute the gradients of the input with respect to the stage outputs and also retain the autograd graph (unlike 1F1B, where `retain_graph=False`). The output / loss was still being retained across the next schedule step() because we return the loss to the user and use the output in the next step. To allow autograd to free the variables in the graph we need to detach the output/loss once we no longer need it for autograd. Pre-fix: <img width="1021" alt="image" src="https://github.com/user-attachments/assets/6c8bf469-32b1-4dac-85ff-b97991f9f0e3"> Post-fix: <img width="1039" alt="image" src="https://github.com/user-attachments/assets/a1875038-e80b-4dd4-84f2-38727d7792dc"> without AC (7B model on titan): 10% memory improvement with AC (7B model on titan): 50% memory improvement Pull Request resolved: https://github.com/pytorch/pytorch/pull/138119 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-10-24 00:44:03 +00:00
Nikita Shulga	889717aabd	[CI/CD] Disable split build (#138752 ) See https://github.com/pytorch/pytorch/issues/138750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138752 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-10-23 22:38:30 +00:00
Nikita Shulga	1b31248933	[EZ] Fix typo in test_mps.py (#138738 ) s/emedding_weight/embedding_weight/ Stolen from `074766d9b4` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738 Approved by: https://github.com/atalman	2024-10-23 22:15:35 +00:00
drisspg	c92459488b	Fix test on windows (#138641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138641 Approved by: https://github.com/huydhn	2024-10-23 21:53:32 +00:00
Animesh Jain	dd4dd85210	[hierarchical-compilation][inductor] Support invoke_subgraph HOP (#138031 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138031 Approved by: https://github.com/eellison ghstack dependencies: #137538, #138036, #137965	2024-10-23 21:32:14 +00:00
Gabriel Ferns	7622ede3cd	Add dump_launch_params config in triton/inductor (#137143 ) Summary: Moves the checking of TORCHINDUCTOR_DUMP_LAUNCH_PARAMS into the config module to pull it out of the critical path. Test Plan: Existing unit tests cover this env variable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137143 Approved by: https://github.com/eellison	2024-10-23 21:20:46 +00:00
Edward Z. Yang	9eadd7434e	Refactor: Move _nested_int_aware_sort top level (#138693 ) I need to use it from some other places later in the PR stack Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138693 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-10-23 21:15:05 +00:00
Pian Pawakapan	9b77d3109b	[export] fix test_unbacked_bindings_for_divisible_u_symint (#138607 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138607 Approved by: https://github.com/angelayi	2024-10-23 21:10:05 +00:00
Richard Barnes	dbd6ada8c3	Clean up a c10::optional and fix documentation (#138700 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138700 Approved by: https://github.com/Skylion007	2024-10-23 20:42:28 +00:00
Tom Ritchford	8aedc649bd	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 19:13:44 +00:00
Catherine Lee	cd9c6e9408	Do not run CI on forks (#138714 ) Add `if: github.repository_owner == 'pytorch'` for some jobs that were missing it Fixes #138564 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138714 Approved by: https://github.com/huydhn, https://github.com/kit1980	2024-10-23 18:23:05 +00:00
Laith Sakka	ed313a5ca2	Introduce torch.sym_add, variadic add (#138660 ) Tested internally here: https://www.internalfb.com/diff/D64057744 This is a reland after previous internal failures. main change is ``` if min is None and max is None: torch._check_is_size(size) return ``` Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660 Approved by: https://github.com/ezyang, https://github.com/bobrenjc93	2024-10-23 17:42:41 +00:00
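The quadratic cost described in #138660 can be sketched with a toy model. This is an illustration only: `flatten_add` is a hypothetical stand-in for a sum constructor whose cost is linear in its number of arguments, not sympy's `Add` and not the implementation of `torch.sym_add`. Building a sum of N terms through a chain of binary additions touches O(N^2) terms in total, while one variadic call touches N.

```python
def flatten_add(*args):
    """Toy sum constructor: flattens nested sums, costing O(number of terms)."""
    terms = []
    for a in args:
        if isinstance(a, tuple) and a and a[0] == "add":
            terms.extend(a[1:])  # absorb the terms of a nested sum
        else:
            terms.append(a)
    flatten_add.work += len(terms)  # count every term this call handled
    return ("add", *terms)


flatten_add.work = 0


def chained_sum(values):
    # N - 1 binary additions; each one re-processes the growing sum.
    acc = values[0]
    for v in values[1:]:
        acc = flatten_add(acc, v)
    return acc


def variadic_sum(values):
    # One constructor call over all N terms.
    return flatten_add(*values)


values = [f"s{i}" for i in range(100)]

flatten_add.work = 0
chained = chained_sum(values)
chained_work = flatten_add.work

flatten_add.work = 0
single = variadic_sum(values)
single_work = flatten_add.work
```

Both paths produce the same flattened sum; only the amount of work differs, which is the motivation the commit gives for keeping the summation as a single FX node.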
Laith Sakka	72ea7ba89f	Generate slice.Tensor view operations instead of as_strided when split is used in the original program. (#137225 ) test_recompile asserts that the changes do not add more recompilation by comparing with the eager backend. The reason for this change is that slice can be lowered in a more efficient way. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137225 Approved by: https://github.com/zou3519	2024-10-23 17:42:16 +00:00
Tom Ritchford	1bc73f3157	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 17:42:11 +00:00
Felix Su	c272526ea5	[SJD] [RFC] force setting last progress time (#138615 ) Summary: Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization. The workaround is to update the healthcheck time when calling `get_last_progress_time`. Test Plan: Logs show that the progress time value is being changed despite the client not being set. Behavior when watchdog is enabled with SJD config is left unchanged. Differential Revision: D64733766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615 Approved by: https://github.com/gag1jain	2024-10-23 15:29:00 +00:00
PyTorch MergeBot	cdfe1bffd1	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit 8fbf866904661b16cba4c799af81121557ba9da8. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))	2024-10-23 14:49:49 +00:00
Jeremy Hadidjojo	2f007e5de5	Make trace log dir persist through multiple `set_logs()` calls (#137793 ) Summary: Currently, calling `torch._logging.set_logs()` resets the log directory leading to multiple tlparse outputs. This prevents the dir from resetting after the first call. Reviewed By: ezyang Differential Revision: D64118047 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793 Approved by: https://github.com/ezyang	2024-10-23 14:23:03 +00:00
Alex Baden	ecf2240243	[Inductor] New Triton Attrs Descriptor Fixups (#138390 ) Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390 Approved by: https://github.com/jansel, https://github.com/huydhn	2024-10-23 14:13:49 +00:00
Jean Schmidt	75c6787a16	[CI] Introduces experiment `awsa100` to `inductor-perf-compare.yml` workflow using `_runner-determinator.yml` (#138204 ) Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`. It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark. Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels, update the workflows to use `linux.aws.a100`, and decommission the gcp fleet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn	2024-10-23 13:47:26 +00:00
Richard Barnes	04103f6ae9	Eliminate c10 string_utils (#138499 ) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/138499 Approved by: https://github.com/swolchok	2024-10-23 13:40:19 +00:00
Sun, Jiayi	c2d26418c3	[Quant][Inductor] expand quantization conv-binary(-unary) pattern fusion inside inductor (#138051 ) ### Summary Expand quantization conv-binary(-unary) pattern fusion inside inductor to support the following two patterns: Pattern 1: ``` Conv(X) extra input \ / Add \| Optional(relu) \| Y ``` Pattern 2: ``` extra input Conv(X) \ / Add \| Optional(relu) \| Y ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138051 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2024-10-23 13:12:17 +00:00
chuanqiw	2f1842fa83	[CD] fix xpu support packages version (#138189 ) Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138189 Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/atalman	2024-10-23 12:25:43 +00:00
Ke Wen	8fbf866904	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #137855, #138488, #138374, #138384	2024-10-23 08:51:54 +00:00
Sheng Fu	2d7e586c13	Fixed dead lock in execution trace (#136892 ) Summary: This diff fixes a deadlock in execution trace. ExecutionTraceObserver takes a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/output values are evaluated, which will acquire the Python GIL in some use cases. In this case, the lock order is ET lock -> GIL. One of the ads applications takes the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by the ET observer. In this case, the lock order is: GIL -> ET lock. That is why the deadlock happens. To fix it, I changed the ET lock scope, so the input/output evaluation is no longer inside the scope of the ET lock. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda Differential Revision: D63556608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892 Approved by: https://github.com/aaronenyeshi	2024-10-23 07:53:56 +00:00
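The lock-order inversion in #136892 and the "narrow the lock scope" remedy can be sketched in plain Python. This is a hypothetical model, not the C++ ExecutionTraceObserver: `gil_like` stands in for the GIL (an `RLock`, since a thread that holds the GIL can keep using it), `observer_lock` stands in for the ET lock, and all function names are illustrative. The point is that the observer evaluates its inputs before taking its own lock, so no thread ever waits for `gil_like` while holding `observer_lock`.

```python
import threading

gil_like = threading.RLock()      # stand-in for the Python GIL (assumption)
observer_lock = threading.Lock()  # stand-in for the ET observer lock (assumption)
records = []


def evaluate_inputs(x):
    # Evaluating values may need the interpreter lock.
    with gil_like:
        return repr(x)


def record_operator(x):
    # Evaluate first, outside the observer lock, then lock only to append.
    value = evaluate_inputs(x)
    with observer_lock:
        records.append(value)


def worker_holding_gil(n):
    # Mirrors the application path: GIL first, then the observed call.
    for i in range(n):
        with gil_like:
            record_operator(("gil", i))


def worker_plain(n):
    # Mirrors the observer being entered without the GIL held.
    for i in range(n):
        record_operator(("plain", i))


threads = [
    threading.Thread(target=worker_holding_gil, args=(200,)),
    threading.Thread(target=worker_plain, args=(200,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)
```

If `evaluate_inputs` were instead called inside the `with observer_lock:` block, one thread could hold `observer_lock` while waiting for `gil_like` and the other could hold `gil_like` while waiting for `observer_lock`, which is the cycle the commit message describes.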
titaiwangms	cab5f54dee	[ONNX] Fix sequence handling in graph building (#138656 ) Previous to this PR, op.Concat is called without required attributes: axis, and val and arg seems wrongly coded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138656 Approved by: https://github.com/justinchuby	2024-10-23 07:47:58 +00:00
Ting Lu	5402677021	add CUDA 12.6 to conda docker image (#138417 ) Adds cuda 12.6 to common installation script. Adds cuda 12.6 to conda docker image build matrix. fixes #138440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138417 Approved by: https://github.com/cyyever, https://github.com/atalman	2024-10-23 07:30:51 +00:00
Bob Ren	5ceef8c470	Add support for SymFloats in split_module fx pass (#138599 ) As discussed with @ezyang, this set of diffs extracts fixes for problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599 Approved by: https://github.com/ezyang	2024-10-23 06:56:13 +00:00
Bob Ren	96c86758e2	Support conditionals on sym node variables in the __bool__ and __len__ case (#138595 ) As discussed with @ezyang, this set of diffs extracts fixes for problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595 Approved by: https://github.com/ezyang	2024-10-23 06:44:09 +00:00
titaiwangms	72dde6e84b	[ONNX] Avoid optimize `onnx_dynamo-fallback` (#138265 ) Prior to this PR, when a model failed to export, it fell back to the legacy torchscript exporter. However, we didn't stop once it was exported with the torchscript exporter: an optimization was still applied to the graph. Ideally the optimization would also boost the performance of the model exported with the legacy torchscript exporter, but currently, for benchmarking purposes and given what the fallback guarantees to users, we should keep it simple and only return the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265 Approved by: https://github.com/xadupre, https://github.com/justinchuby	2024-10-23 04:13:32 +00:00
Prajesh Praveen Anchalia	bb65c9b883	[PyTorch] Classify Unsupported mutated Dynamic Shapes as User Error (#137054 ) Summary: We don't need an assert for unsupported dynamic shape inputs; this removes the assert and raises a user exception instead. Differential Revision: D63661569 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137054 Approved by: https://github.com/bdhirsh	2024-10-23 03:15:37 +00:00
cyy	fbd14315f9	Update ruff to 0.7.0 (#138597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138597 Approved by: https://github.com/ezyang	2024-10-23 03:00:30 +00:00
Sam Larsen	06b5330674	[easy] Log subproc pool creation (#138642 ) Summary: Request from internal to log subproc pool creation Test Plan: ``` $ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642 Approved by: https://github.com/eellison	2024-10-23 02:41:42 +00:00
cyy	86cca3fb05	[1/N] Don't skip ASAN on some tests (#138571 ) Clang15's ASAN is new enough so that it's possible to re-evaluate the disabled ASAN tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138571 Approved by: https://github.com/ezyang	2024-10-23 02:38:45 +00:00
Henry Tsang	d437df342b	[tests] fix broken tests caused by AotEagerAndRecordGraphs typo (#138492 ) Summary: Name change happened in https://github.com/pytorch/pytorch/pull/138231 AttributeError: module 'torch._dynamo.testing' has no attribute 'AOTEagerAndRecordGraphs'. Did you mean: 'AotEagerAndRecordGraphs'? Test Plan: ci Differential Revision: D64704686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138492 Approved by: https://github.com/aakhundov	2024-10-23 02:25:21 +00:00
Wouter Devriendt	fee2f331ce	Update torchbench.txt (#138569 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138569 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-10-23 01:42:25 +00:00
Ke Wen	f2ebf6d94a	[PGNCCL] Ensure comm is ready before all accesses (#138384 ) Previously we only waited for comm to become ready after its initialization. That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc. Therefore, we just ensure comm is ready every "next time" we need to access ncclComm. The place to add such a gatekeeper is `getNcclComm`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384 Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj ghstack dependencies: #137855, #138488, #138374	2024-10-23 01:36:58 +00:00
Mikayla Gawarecki	37149d032c	Fix .to(cpu) for Storage (#138011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138011 Approved by: https://github.com/albanD	2024-10-23 01:31:48 +00:00
Bin Bao	555bddbef7	[AOTI][refactor] Move use_minimal_arrayref_interface logic (#138250 ) Summary: Move use_minimal_arrayref_interface specific logic from CppWrapperCpu to CppWrapperCpuArrayRef. This is a copy-on-write style refactor, to simplify the default AOTI generated code. Differential Revision: [D64598715](https://our.internmc.facebook.com/intern/diff/D64598715) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138250 Approved by: https://github.com/chenyang78 ghstack dependencies: #138544, #138379	2024-10-23 01:00:34 +00:00
Bin Bao	2cee5a39ad	[AOTI] Fix check_model_with_multiple_inputs in test_aot_inductor (#138379 ) Summary: Add missing use_minimal_arrayref_interface setting to check_model_with_multiple_inputs. Differential Revision: [D64635211](https://our.internmc.facebook.com/intern/diff/D64635211) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138379 Approved by: https://github.com/hl475 ghstack dependencies: #138544	2024-10-23 00:54:29 +00:00
Richard Barnes	d428d81c7f	Remove some pre-cpp17 stuff (#138410 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138410 Approved by: https://github.com/Skylion007	2024-10-23 00:38:03 +00:00
Tugsbayasgalan Manlaibaatar	f4b3813989	Wrap autograd and autocast ops in training IR (#138516 ) Differential Revision: [D64732361](https://our.internmc.facebook.com/intern/diff/D64732361) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138516 Approved by: https://github.com/yushangdi ghstack dependencies: #138261	2024-10-23 00:37:54 +00:00
PyTorch MergeBot	9f7b987087	Revert "[Inductor] New Triton Attrs Descriptor Fixups (#138390 )" This reverts commit 215999452eb5517213b3a31f72eb9a7e843d12a0. Reverted https://github.com/pytorch/pytorch/pull/138390 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it still has another lint error ([comment](https://github.com/pytorch/pytorch/pull/138390#issuecomment-2430566004))	2024-10-23 00:37:28 +00:00
Tugsbayasgalan Manlaibaatar	69f18587d6	Move test_serialize to training IR (#138261 ) Differential Revision: [D64572253](https://our.internmc.facebook.com/intern/diff/D64572253) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138261 Approved by: https://github.com/yushangdi	2024-10-23 00:32:32 +00:00
Laith Sakka	662d07e93e	Remove parallel_and and parallel_or (#138135 ) Not used, suggested by @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135 Approved by: https://github.com/ezyang	2024-10-23 00:22:22 +00:00
cyy	38d3c27849	[1/N] Enable cppcoreguidelines-special-member-functions (#137405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405 Approved by: https://github.com/ezyang	2024-10-23 00:16:53 +00:00
wz337	7e951c1675	[EZ][DTensor] Update DTensor readme to use the new import path (#138625 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625 Approved by: https://github.com/XilunWu	2024-10-23 00:08:36 +00:00
William Wen	3441ea7d74	[dynamo] reset compiler stance after test (#138277 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-10-23 00:07:33 +00:00
PyTorch UpdateBot	a825667670	[executorch hash update] update the pinned executorch hash (#135287 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135287 Approved by: https://github.com/pytorchbot, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2024-10-22 23:40:57 +00:00
eellison	5942b29850	Disabling amp context when invoking compiler (#138624 ) Fix for https://github.com/pytorch/pytorch/issues/133974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138624 Approved by: https://github.com/bdhirsh, https://github.com/drisspg	2024-10-22 23:21:55 +00:00
Alex Baden	215999452e	[Inductor] New Triton Attrs Descriptor Fixups (#138390 ) Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390 Approved by: https://github.com/jansel	2024-10-22 23:16:05 +00:00
PyTorch MergeBot	10f16cc7da	Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526 )" This reverts commit 8aacbee8e0d6c03096f2ce94b70e2a8fab17ee81. Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/wdvr due to this one has failing internal tests, not related to a landrace with #138398 - reverting this one ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2430460176))	2024-10-22 22:53:56 +00:00
Jesse Cai	39bfba3f56	[sparse] add search for optimal alg_id to torch.compile (#137427 ) Summary: This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal alg_id and cache it when running with `torch.compile` Seeing speedups on both bfloat16 and float8 dtypes: <img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b"> <img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6"> * `torch._cslt_sparse_mm_search` has been modified to return optimal split-k parameters as well as max alg_id. * max_id is now available in `torch.backends.cusparselt` via `torch.backends.cusparselt.get_max_alg_id()` * fixed meta registrations for float8 Test Plan: python test/test_sparse_semi_structured.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427 Approved by: https://github.com/cpuhrsch	2024-10-22 22:39:42 +00:00
Nikita Shulga	b4cfb9c014	[EZ] Use `at::detail` nested namespace in Dispatch.h (#138633 ) Instead of `namespace at { namespace detail {` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138633 Approved by: https://github.com/Skylion007	2024-10-22 22:13:21 +00:00
Bin Bao	54fbd897d9	[AOTI][refactor] Clean up test_aot_inductor skip list (#138544 ) Summary: Remove skips for already fixed tests. Change remaining skip to xfail so that the failure list can be more proactively maintained. Differential Revision: [D64761257](https://our.internmc.facebook.com/intern/diff/D64761257) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138544 Approved by: https://github.com/chenyang78, https://github.com/hl475	2024-10-22 21:32:49 +00:00
James Wu	a16476b671	Add support for adding extra metadata to chromium events, log to separate columns (#138477 ) This diff does a few things: ## Add metadata to events in progress Adds the ability to add extra metadata to Chromium Events via `add_event_data`. Metadata can only be added to chromium events that have started, but not ended (so, in progress events) - When you add the data, the metadata is appended to the metadata when you call log_event_end(). - The metadata appears in chromium events in tlparse. It also gets logged to scuba. ## New `dynamo` chromium event We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes: ``` __start__ -> dynamo (dynamo compile metrics) -> entire_frame_compile (compile.inner) -> backend_compile (i.e. aotdispatch) -> create_aot_dispatch_function -> inductor_compile -> ... ``` BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here. FAQ: Why can't we use `entire_frame_compile` as the event? This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo. ## Log metadata as separate columns (Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. 
Now that this table is more mature, I think logging keys to separate columns is a better system. Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477 Approved by: https://github.com/aorenste	2024-10-22 21:17:44 +00:00
Matthew Francis-Landau	3b2b5486ea	Fixes issue with torch._dynamo.assume_constant_result with global functions (#132431 ) This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten. Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function. This causes that function to be overwritten in the global scope. This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431 Approved by: https://github.com/jansel	2024-10-22 21:14:26 +00:00
Yiming Zhou	e3af290165	[export] Add retraceability_non_strict to tests (#138380 ) Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped. Test Plan: ``` buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability ``` Differential Revision: D64611532 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380 Approved by: https://github.com/angelayi	2024-10-22 21:05:51 +00:00
Nikita Shulga	d1be61ce4e	Update copyrights to 2024 (#138638 ) Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well Fixes https://github.com/pytorch/pytorch/issues/138630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638 Approved by: https://github.com/atalman	2024-10-22 21:00:58 +00:00
dependabot[bot]	dbd0a39c79	Bump webrick from 1.7.0 to 1.8.2 in /ios/TestApp (#136593 ) Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2. - [Release notes](https://github.com/ruby/webrick/releases) - [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2) --- updated-dependencies: - dependency-name: webrick dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-22 13:32:50 -07:00
Joel Schlosser	f089d5ffef	Improve input validation for NJT pointwise ops (#138602 ) Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops. I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602 Approved by: https://github.com/soulitzer	2024-10-22 20:13:12 +00:00
cyy	1c77b13c06	[6/N] Fix extra warnings brought by clang-tidy-17 (#138572 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138572 Approved by: https://github.com/Skylion007	2024-10-22 19:46:38 +00:00
Ti-Tai Wang	a71723bf12	[ONNX] Add complex constant support (#138279 ) Transform complex python constant to float representation as well, like what we have with tensors. PS: I find it's not reasonable to add "complex->float" in IR side, so I put it here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138279 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-10-22 19:42:59 +00:00
Mark Kim-Mulgrew	c7a20939b4	Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589 Approved by: https://github.com/clee2000	2024-10-22 19:36:01 +00:00
atalman	078dca1ce8	Aarch64 binary builds - fix passing env_file to Docker (#138588 ) Aarch64 builds skipped the logic of sourcing the binary env file. As a result, the PYTORCH_EXTRA_INSTALL_REQUIREMENTS passed to Aarch64 builds did not include the triton dependency constraint. This PR makes sure Aarch64 builds follow the same path as our regular manywheel builds. To work around this issue we had to inject triton in aarch64 builds for release 2.5, which is not ideal: https://github.com/pytorch/builder/pull/2011 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138588 Approved by: https://github.com/jeanschmidt, https://github.com/malfet	2024-10-22 19:04:19 +00:00
eqy	c0e8458aab	[Flex Attention] Don't compute fill order to compute stride order just to get fill order back (#138376 ) Was a bit confusing to read when working on #138354 "computer-assisted proof"
```
import random

def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413

def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,
    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order

def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out

for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376 Approved by: https://github.com/drisspg	2024-10-22 18:40:39 +00:00
Max Podkorytov	2dab4ccb65	[Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen (#137947 ) Plug into lowering and end to end test in a later PR Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2024-10-22 18:25:23 +00:00
Zain Rizvi	6e4c19289c	[EZ] [BE] Remove (now) unused scale config (#138511 ) Final step of moving scale config files to test-infra repo. Details in https://github.com/pytorch/test-infra/pull/5767 Scale configs are now read from test-infra. This PR is just cleaning up stale files Pull Request resolved: https://github.com/pytorch/pytorch/pull/138511 Approved by: https://github.com/clee2000	2024-10-22 18:08:42 +00:00
Stefan-Alin Pahontu	f7e36d8d6f	Fix for MSVC problem on Windows Arm64 (#136765 ) This PR proposes a workaround for an internal issue introduced in MSVC 14.37 for Windows Arm64 target. It is still an ongoing problem. The fix will be released with the future versions of Visual Studio 2022 but until then the changes to cpu/vec/vec_base.h should be sufficient. We also opened a new ticket on Visual Studio Developer Community, it can be found here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136765 Approved by: https://github.com/malfet Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>	2024-10-22 18:07:58 +00:00
PyTorch MergeBot	fc9093c3d2	Revert "Remove C10_DEPRECATED (#138406 )" This reverts commit 70ec86d7542d461ff6f01ba1a1c9a4f38637af8e. Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))	2024-10-22 18:00:41 +00:00
Catherine Lee	cc93c1e5e4	Upload artifacts during test run (#125799 ) Zip and upload artifacts while run_test is running. Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799 Approved by: https://github.com/huydhn	2024-10-22 16:48:57 +00:00
Animesh Jain	2e48788a35	[hierarchical-compilation][invoke_subgraph] Use tracing context to cache artifacts of dispatch keys (#137965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137965 Approved by: https://github.com/zou3519 ghstack dependencies: #137538, #138036	2024-10-22 15:33:42 +00:00
Animesh Jain	e045e8f0df	[hierarchical-compilation][invoke_subgraph] Graph break on input mutation or aliasing (#138036 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138036 Approved by: https://github.com/zou3519 ghstack dependencies: #137538	2024-10-22 15:33:42 +00:00
Animesh Jain	4dd4d38ca9	[hierarchical-compilation][hop] Introduce invoke_subgraph (#137538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538 Approved by: https://github.com/zou3519	2024-10-22 15:33:34 +00:00
Jeff Daily	046f02d2de	[ROCm] index_put performance improvement (#138259 ) On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests were exercising the large index case so a new unit test was added. It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259 Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy	2024-10-22 15:21:43 +00:00
Bin Bao	2827befe61	[AOTI][reland] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138541 ) Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. A code check is also added to regex match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed. Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541 Approved by: https://github.com/frank-wei	2024-10-22 14:17:27 +00:00
Colin L. Rice	bb8bc7d6b3	config: simplify most of the config handling and fix some bugs (#138377 ) This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them. Cleanups - This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose. - I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows. - I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes). - I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired. - I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values. 
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher. For release notes, Not sure exactly how to communicate this, but something like "ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly". Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison	2024-10-22 13:40:26 +00:00
Edward Z. Yang	1b61313acd	Add type stub for SymInt.rsub (#138543 ) Fixes https://github.com/pytorch/pytorch/issues/138478 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138543 Approved by: https://github.com/malfet	2024-10-22 13:27:32 +00:00
Pearu Peterson	8c840fb921	Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626 ) As in the title. Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626 Approved by: https://github.com/amjames, https://github.com/cpuhrsch	2024-10-22 09:52:25 +00:00
Simon Fan	5a13282c75	[compiled autograd] tls access helpers (#138061 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138061 Approved by: https://github.com/yf225 ghstack dependencies: #137953, #137821	2024-10-22 08:03:52 +00:00
Simon Fan	49fa437097	[compiled autograd] Compiled autograd configs in TLS (#137821 ) Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821 Approved by: https://github.com/jansel, https://github.com/yf225 ghstack dependencies: #137953	2024-10-22 08:03:52 +00:00
Simon Fan	75259145ec	[compiled autograd] directly use python Logger class in cpp (#137953 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953 Approved by: https://github.com/jansel, https://github.com/yf225	2024-10-22 08:03:52 +00:00
angelayi	60c1433041	[aoti] Cond symint input support (#138373 ) If the input is a symint, we don't want to add the aoti_torch_assign_tensors_out Pull Request resolved: https://github.com/pytorch/pytorch/pull/138373 Approved by: https://github.com/larryliu0820, https://github.com/desertfire	2024-10-22 07:53:22 +00:00
Pian Pawakapan	51045e6251	make DimHints compatible with Dims (#138490 ) Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`. This PR stops that, and uses `Dim()` objects to guide DimHints. The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid. Current behavior is that:
```
import torch
from torch.export import Dim, export

class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant. Differential Revision: D64636101 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490 Approved by: https://github.com/avikchaudhuri	2024-10-22 07:43:48 +00:00
drisspg	9a9a0abc28	[SDPA-CUDNN] Make CuDNN Attention Opt in (#138522 ) # Summary Currently we have a `cudnn_order` that says on H100 w/ new enough CuDNN backend (we ship a 9.1 version in OSS) try to run CuDNN attention first. We have already encountered a few bugs with the release of 2.5: 1. https://github.com/pytorch/pytorch/issues/138529 2. https://github.com/huggingface/diffusers/issues/9704 3. https://github.com/pytorch/pytorch/pull/138354 In light of the above we are going to make the CuDNN backend Opt-in by default. This can be done easily with the context manager for choosing backends, i.e.:
```Python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# q, k, v are the query, key, and value tensors
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```
This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager). Cc @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet	2024-10-22 07:23:06 +00:00
Gabriel Ferns	2b4af6fa74	Mark torch.get_device as overridable at the python level (#132706 ) Summary: - add a value to `get_testing_overrides` function for `torch.get_device()` - remove `torch.get_device()` from the `get_ignored_functions` list Test Plan: Existing override testing infra, which should pick up the updates to these two variables. Closes the loop on: https://github.com/pytorch/pytorch/pull/132567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706 Approved by: https://github.com/ezyang	2024-10-22 07:20:42 +00:00
Pian Pawakapan	84e5f34fd1	bug in unbacked_bindings for au0 (#138136 ) Summary: we were storing au0 instead of u0 in unbacked_bindings / unbacked_var_to_val Test Plan: - Differential Revision: D64508936 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138136 Approved by: https://github.com/ezyang	2024-10-22 07:04:30 +00:00
Sam Larsen	a80b87353c	[pt2] Log is_forward field to dynamo_compile scuba table (#138505 ) Differential Revision: [D64711721](https://our.internmc.facebook.com/intern/diff/D64711721) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138505 Approved by: https://github.com/oulgen	2024-10-22 05:50:49 +00:00
Chien-Chin Huang	0b4a071a1d	[CP] Implement AllGather based context parallelism (#132820 ) Summary: This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820 Approved by: https://github.com/XilunWu	2024-10-22 05:25:50 +00:00
Ke Wen	6b29d40e9b	[PGNCCL] Add default value for `nccl_nonblocking_timeout` (#138374 ) - Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1). - Reuse C10D_CHECK_TIMEOUT in other CHECK macros Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374 Approved by: https://github.com/eqy ghstack dependencies: #137855, #138488	2024-10-22 05:06:18 +00:00
Syed Tousif Ahmed	03c72976a5	Properly uses ref-counting for torch.cuda.use_mem_pool (#133600 ) This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`. The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold an extra reference as well. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-10-22 03:21:53 +00:00
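The ref-counting scheme described in the row above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyAllocator`, `ToyMemPool`, and `use_pool` are hypothetical names invented for illustration and are not the CUDACachingAllocator or `torch.cuda.MemPool` implementation. It shows one count per pool that is bumped both by the pool object itself and by every active context manager, so the pool is only released once both kinds of holders are gone.

```python
from contextlib import contextmanager


class ToyAllocator:
    """Hypothetical stand-in for per-pool bookkeeping (not PyTorch code)."""

    def __init__(self) -> None:
        self.use_count: dict[int, int] = {}  # pool id -> number of holders
        self.freed: list[int] = []           # pool ids released so far

    def acquire(self, pool_id: int) -> None:
        self.use_count[pool_id] = self.use_count.get(pool_id, 0) + 1

    def release(self, pool_id: int) -> None:
        self.use_count[pool_id] -= 1
        if self.use_count[pool_id] == 0:
            # Last holder is gone: the pool can be released.
            del self.use_count[pool_id]
            self.freed.append(pool_id)


class ToyMemPool:
    """Hypothetical user-facing pool object that holds its own reference."""

    def __init__(self, allocator: ToyAllocator, pool_id: int) -> None:
        self.allocator = allocator
        self.pool_id = pool_id
        allocator.acquire(pool_id)  # the extra reference held by the object

    def close(self) -> None:
        self.allocator.release(self.pool_id)


@contextmanager
def use_pool(pool: ToyMemPool):
    """Hypothetical context manager: one reference per active `with` block."""
    pool.allocator.acquire(pool.pool_id)
    try:
        yield pool
    finally:
        pool.allocator.release(pool.pool_id)
```

In this toy model, leaving the `with use_pool(pool)` block does not free the pool while the `ToyMemPool` object is still open; the pool is freed only when `close()` drops the last reference.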
Colin Peppler	89067402d4	[easy] in ROCmTemplate set kwargs when creating Buffer (#138521 ) Summary: https://github.com/pytorch/pytorch/pull/137768 makes Inductor IR kw only Test Plan: CI Differential Revision: D64723804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138521 Approved by: https://github.com/tenpercent, https://github.com/chenyang78	2024-10-22 03:13:16 +00:00
cyy	f881094366	Use Wmissing-prototypes on torch_cuda (#136080 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080 Approved by: https://github.com/ezyang	2024-10-22 02:04:19 +00:00
Tugsbayasgalan Manlaibaatar	9f7c26bef3	Fix training IR bug by changing passes order (#138292 ) Inserting runtime_assertions causes gm to have different names but the graph signature was populated earlier. To avoid this kind of error in the future, I refactored these steps into a helper function. Differential Revision: [D64576251](https://our.internmc.facebook.com/intern/diff/D64576251) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138292 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #138266	2024-10-22 01:24:14 +00:00
Sergii Dymchenko	012ff2a0aa	Don't try to load cufile (#138501 ) Trying to load it caused a big issue with 2.5.0 release - https://github.com/pytorch/pytorch/issues/138324 cufile is not actually used currently by default, see #133489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138501 Approved by: https://github.com/atalman, https://github.com/mikaylagawarecki, https://github.com/malfet	2024-10-22 01:13:27 +00:00
Tugsbayasgalan Manlaibaatar	5adc33d3b8	Training IR should preserve custom metadata (#138266 ) Differential Revision: [D64576252](https://our.internmc.facebook.com/intern/diff/D64576252) @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138266 Approved by: https://github.com/yushangdi	2024-10-22 01:09:56 +00:00
Shunting Zhang	0a38c0ec89	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-22 00:50:00 +00:00
PyTorch MergeBot	3b186c5659	Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303 )" This reverts commit 1417b2cd0562e0e4d4349024ef7c731b99214890. Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))	2024-10-22 00:46:48 +00:00
wz337	d7e0e1dbc4	[DeviceMesh] Use `split_group` to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129 ) Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129 Approved by: https://github.com/kwen2501	2024-10-22 00:00:05 +00:00
Matthew Francis-Landau	a7f49de485	Fixes issue with enums in a tuple for dynamo (#133123 ) Currently when tuples values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside of the tuple will not be properly encoded. In this case, if an enum is contained inside of a tuple, it will cause invalid python code to be generated Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123 Approved by: https://github.com/jansel	2024-10-21 23:45:11 +00:00
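The failure mode described in the row above can be reproduced without Dynamo. The following is a minimal sketch, assuming a hypothetical `Color` enum and a hypothetical `encode` helper (neither is taken from the PR): calling `repr()` on a tuple that contains an enum member yields text that is not valid Python, whereas encoding each element individually yields an expression that can be parsed and evaluated.

```python
import enum


class Color(enum.Enum):  # hypothetical enum, for illustration only
    RED = 1


def is_valid_expression(source: str) -> bool:
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<generated>", "eval")
        return True
    except SyntaxError:
        return False


def encode(value) -> str:
    """Hypothetical encoder: recurse into tuples instead of repr()-ing them."""
    if isinstance(value, tuple):
        inner = ", ".join(encode(v) for v in value)
        # A one-element tuple needs a trailing comma to stay a tuple.
        return f"({inner},)" if len(value) == 1 else f"({inner})"
    if isinstance(value, enum.Enum):
        # Emit "Color.RED" rather than the "<Color.RED: 1>" form that repr() gives.
        return f"{type(value).__name__}.{value.name}"
    return repr(value)


args = (Color.RED, 2)
naive = repr(args)      # the enum member is rendered as "<Color.RED: 1>"
encoded = encode(args)  # every element is rendered as evaluable source
```

Here `is_valid_expression(naive)` is false because of the angle-bracket enum repr, while `encoded` evaluates back to the original tuple when `Color` is in scope.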
Mikayla Gawarecki	e24871eb3c	Add environment variable to force no weights_only load (#138225 ) In preparation for `weights_only` flip, if users don't have access to the `torch.load` call Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225 Approved by: https://github.com/albanD	2024-10-21 23:26:15 +00:00
Will Feng	ec4ce094b2	[Traceable FSDP2][CI] Skip more tests on rocm (#138497 ) Some of the test checks don't work well with rocm. Fixes https://github.com/pytorch/pytorch/issues/138409. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138497 Approved by: https://github.com/fduwjj	2024-10-21 23:11:01 +00:00
Animesh Jain	77868697b7	[inductor][subgraph] Add size asserts (#138424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138424 Approved by: https://github.com/eellison ghstack dependencies: #137555	2024-10-21 22:43:49 +00:00
Parikshit Shah	853da168fc	[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785 ) Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (`f7b8d36c28`). We think the unit test is buggy, and we also fix the same. Test Plan: tbd Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785 Approved by: https://github.com/basilwong Co-authored-by: Huy Do <huydhn@gmail.com>	2024-10-21 21:45:13 +00:00
Bob Ren	20a2d39557	Log all failing test repros to scuba (#138394 ) This has the benefit that 1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba 2) We can do analysis (eg. set different two sets of tests across two PRs) 3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH. I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9 I then reverted the breaking change and published this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394 Approved by: https://github.com/ezyang	2024-10-21 21:35:47 +00:00
mwlon	ef52bbbf23	More appropriate socket errors and debug messages (#130347 ) Fixes #128998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130347 Approved by: https://github.com/fduwjj	2024-10-21 21:28:40 +00:00
Ke Wen	364340c7ee	[Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR (#138488 ) Forward fix for build issue introduced by #137855: ``` In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2: fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR' 508 \| int split_color{NCCL_SPLIT_NOCOLOR - 1}; \| ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488 Approved by: https://github.com/fduwjj ghstack dependencies: #137855	2024-10-21 21:14:20 +00:00
Joel Schlosser	134f6cda7e	Support record_stream() for NJT (#137099 ) Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete. This is an ask from torchrec as the op is used there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099 Approved by: https://github.com/ngimel	2024-10-21 21:10:42 +00:00
Richard Barnes	70ec86d754	Remove C10_DEPRECATED (#138406 ) Looking in the code I see ``` // NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses // the "__declspec(deprecated)" implementation and not the C++14 // "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on // MSVC, but ran into issues with some older MSVC versions. ``` But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_. Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`. Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-10-21 20:57:27 +00:00
David Berard	bb2e090b7d	[user triton] typing triton_kernel_wrap.py (#138230 ) Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-10-21 20:34:49 +00:00
atalman	60081c29ec	Use cuda 12.4 pytorch_extra_install_requirements as default (#138458 ) Since cuda 12.4 binaries are default binaries on pypi now. The pytorch_extra_install_requirements need to use 12.4. This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458 Approved by: https://github.com/malfet	2024-10-21 20:16:37 +00:00
Sam Ginzburg	c1ead6fba3	Bugfix for passing None args to user defined Triton kernel (#138472 ) add test fewer failing tests more tests passing tests passing lint Pull Request resolved: https://github.com/pytorch/pytorch/pull/138472 Approved by: https://github.com/aakhundov	2024-10-21 20:00:04 +00:00
Tom Ritchford	8ad191ae21	[dynamo] Replace __str__ with __repr__ in some places (#136316 ) ## The problem In a typical debugger, `repr()` is used to display variables and not `str()`. Several classes in Dynamo have a `__str__()` method that returns useful information and a `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore. `str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)". ## The solution In the Python object model, if there is no `__str__` method, `__repr__` is used instead (but not the other way around). So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier. The specific classes changed were all in `torch._dynamo.variables`: * `builtin.BuiltinVariable` * `constant.ConstantVariable` * `constant.EnumVariable` * `functions.UserMethodVariable` * `lazy.LazyVariableTracker` * `lazy.LazySymNodeFormatString` * `misc.GetAttrVariable` * `misc.NullVariable` * `user_defined.UserDefinedObjectVariable` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316 Approved by: https://github.com/XuehaiPan, https://github.com/jansel	2024-10-21 19:50:38 +00:00
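The fallback rule that the row above relies on can be checked in plain Python. A minimal sketch, using two hypothetical classes (`OnlyRepr` and `OnlyStr`, not Dynamo classes): `str()` falls back to `__repr__` when no `__str__` is defined, while `repr()` never falls back to `__str__`, and containers format their elements with `repr()`.

```python
class OnlyRepr:
    """Hypothetical class that defines __repr__ but not __str__."""

    def __repr__(self) -> str:
        return "OnlyRepr()"


class OnlyStr:
    """Hypothetical class that defines __str__ but not __repr__."""

    def __str__(self) -> str:
        return "OnlyStr()"


only_repr = OnlyRepr()
only_str = OnlyStr()

# str() uses __repr__ when __str__ is missing.
str_of_only_repr = str(only_repr)

# repr() ignores __str__ and uses the default object repr.
repr_of_only_str = repr(only_str)

# Lists call repr() on their elements, which is also what debuggers typically show.
list_of_only_repr = str([only_repr])
list_of_only_str = str([only_str])
```

This is why a class with only `__str__` shows up as an opaque `<... object at 0x...>` inside lists and in a debugger, and why renaming the method to `__repr__` keeps `str(x)` working.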
Huy Do	41f7d01ccf	Increase Docker push timeout limit from 15 to 30m (#138487 ) Some images now take more than 15 to finish pushing and keep timing out, for example, https://github.com/pytorch/pytorch/actions/runs/11442231435/job/31832143440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138487 Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/ZainRizvi	2024-10-21 19:44:52 +00:00
PyTorch MergeBot	32d4582e02	Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814 )" This reverts commit 16caa8c1b3a02e47b5f52d3c2d40d7931cc427dc. Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to checking if this will solve inductor errors ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2427565425))	2024-10-21 19:40:58 +00:00
Xuehai Pan	ff2f751bfb	[tools] fix nightly pull tool when the conda environment not exists (#138448 ) Now, `conda env remove --name env` exits with errors if the given environment does not exist. This PR checks the existence of the environment before trying to remove it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138448 Approved by: https://github.com/ezyang	2024-10-21 19:35:48 +00:00
PyTorch MergeBot	071f6f2de8	Revert "[ROCm] Fix ADDMM hipBLASLt regression (#138267 )" This reverts commit 14a3e12985e4550440a8a1755d3418e9b02b4950. Reverted https://github.com/pytorch/pytorch/pull/138267 on behalf of https://github.com/jeffdaily due to this PR went too far when partially reverting #137604; the env var default should be the same on ROCm and CUDA ([comment](https://github.com/pytorch/pytorch/pull/138267#issuecomment-2427550465))	2024-10-21 19:33:13 +00:00
Xuehai Pan	abbd71d29d	[BE][Easy] enable PYFMT for `torch.fx` (#138443 ) Reproduce command: ```bash ghstack checkout https://github.com/pytorch/pytorch/pull/138443 git checkout HEAD~1 torch/ lintrunner -a --take "PYFMT" --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138443 Approved by: https://github.com/ezyang	2024-10-21 19:15:49 +00:00
Animesh Jain	8231180147	[dynamo][refactor] Refactor Wrap HOP to reuse it for invoke_subgraph (#137555 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137555 Approved by: https://github.com/zou3519	2024-10-21 18:26:29 +00:00
Justin Chu	c6609ece84	[ONNX] Remove deprecated export_to_pretty_string (#137790 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790 Approved by: https://github.com/titaiwangms, https://github.com/xadupre ghstack dependencies: #137789	2024-10-21 18:17:48 +00:00
Aaron Orenstein	07cc4bd3e2	typing compile_fx.py (#138033 ) Type annotations for compile_fx. - Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed. - There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033 Approved by: https://github.com/Skylion007	2024-10-21 18:14:59 +00:00
Will Feng	81738403a2	[Distributed] Fix extra context on device 0 (#135273 ) This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: `1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)` When there is no "preset" CUDA context (which is the case for python garbage collector), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- that's where rank 1, 2, ... can create extra context on device 0! ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-21 17:52:21 +00:00
Justin Chu	6e38c87ad0	[ONNX] Remove ExportTypes (#137789 ) Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward. Differential Revision: [D64412947](https://our.internmc.facebook.com/intern/diff/D64412947) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789 Approved by: https://github.com/titaiwangms, https://github.com/xadupre	2024-10-21 17:50:28 +00:00
FFFrog	af0bc75460	Remove deprecated alias macro(1/3) (#137556 ) Detailed Descriptions: - Remove AT_ERROR Macro Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556 Approved by: https://github.com/ezyang	2024-10-21 17:32:32 +00:00
Aaron Gokaslan	16caa8c1b3	[BE]: Update Typeguard to TypeIs for better type inference (#133814 ) Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814 Approved by: https://github.com/ezyang	2024-10-21 17:20:06 +00:00
PyTorch MergeBot	9bb327bfc6	Revert "[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785 )" This reverts commit a8b912f39d36bd2e6d204808d866439d0075f1a5. Reverted https://github.com/pytorch/pytorch/pull/137785 on behalf of https://github.com/ezyang due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/137785#issuecomment-2427295668))	2024-10-21 17:18:56 +00:00
Ryan Guo	02dd3b8e32	[dynamo][NFC] Remove unused method `InliningInstructionTranslator.check_replace_is_safe` (#137906 ) This method was no longer needed after #113725; the checking logic is now in `SideEffects.check_allowed_side_effect`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906 Approved by: https://github.com/Skylion007, https://github.com/anijain2305 ghstack dependencies: #137905	2024-10-21 16:43:34 +00:00
Catherine Lee	1032ce6bd3	Only upload test/test-reports as artifacts (#138019 ) Fixes https://github.com/pytorch/pytorch/issues/137851 This is possibly too restrictive but I spot checked and I don't think any of the files outside of test/test-reports are important, but I can't guarantee that someone was putting something elsewhere and expecting for it to still be zipped Outputs can be seen on HUD by clicking show artifacts Some examples: Logs <img width="293" alt="image" src="https://github.com/user-attachments/assets/9a2db9b1-0f62-4209-909b-4f56a908619d"> XMLs <img width="234" alt="image" src="https://github.com/user-attachments/assets/a639fe38-a112-4ea5-abba-ad1d5b25bb43"> JSONs <img width="180" alt="image" src="https://github.com/user-attachments/assets/be7a49ac-5258-4bc5-981d-3f134ebd343d"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138019 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi	2024-10-21 16:43:30 +00:00
Ryan Guo	0a4197490c	Delay mul/pow expansion for `_SympyT` to enable more folding (#138235 ) Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g., ``` (a + b)^2 / (a + b) --> (a + b) ``` which won't happen if we expand eagerly during product construction: ``` (a^2 + 2ab + b^2) / (a + b) --> no change ``` Fixes #136044. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235 Approved by: https://github.com/ezyang	2024-10-21 16:38:47 +00:00
David Berard	701ddf962a	[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089 ) replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example. This also adds metadata to register_replacement patterns, including pad_mm. This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now. Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089 Approved by: https://github.com/aakhundov	2024-10-21 16:33:12 +00:00
Yuanhao Ji	279ddfc6ee	Add type check for `dilation` in `torch.quantized_max_pool3d()` (#137845 ) Fixes #136716 repro: ```python import torch input = torch.randn([1, 1, 1, 1, 1]) input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32) torch.quantized_max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1)) # crash input = torch.randn([1, 1, 1, 1, 1]) input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32) result = torch.nn.functional.max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1)) # crash ``` result: ``` RuntimeError: Expected dilation >= 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137845 Approved by: https://github.com/albanD	2024-10-21 16:15:57 +00:00
Parikshit Shah	a8b912f39d	[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785 ) Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (`f7b8d36c28`). We think the unit test is buggy, and we also fix the same. Test Plan: tbd Differential Revision: D64246495 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785 Approved by: https://github.com/basilwong	2024-10-21 15:30:07 +00:00
cyy	7ec21a6f0f	Enable clang-tidy on torch/csrc/api (#138437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138437 Approved by: https://github.com/r-barnes	2024-10-21 14:22:38 +00:00
FFFrog	8aacbee8e0	Make Context to be Device-agnostic Step by Step (2/N) (#136526 ) ---- - add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526 Approved by: https://github.com/ezyang, https://github.com/EikanWang ghstack dependencies: #138323	2024-10-21 13:51:54 +00:00
FFFrog	649f8117ad	Add deprecated warning for lazyInitXXX API (#138323 ) Detailed Descriptions: Involved APIs are as follows: - ``lazyInitCUDA`` - ``lazyInitHIP`` - ``lazyInitXPU`` - ``lazyInitMTIA`` - ``lazyInitPrivateUse1`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138323 Approved by: https://github.com/malfet	2024-10-21 13:51:54 +00:00
Bin Bao	1417b2cd05	[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303 ) Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303 Approved by: https://github.com/hl475	2024-10-21 13:47:50 +00:00
PyTorch UpdateBot	8f3efb8797	Update slow tests (#133203 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133203 Approved by: https://github.com/pytorchbot	2024-10-21 12:00:52 +00:00
cyy	14fc6b70ea	Remove torch/csrc/api/include/torch/linalg.h (#138435 ) Only one place in OSS uses it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138435 Approved by: https://github.com/r-barnes	2024-10-21 07:04:27 +00:00
Xiaodong Wang	5f940a44af	[AMD] Fix torch ck backend build with 6.2.1 (#138434 ) Summary: It's complaining about missing __hip_bfloat162 definition w/o this header. Differential Revision: D64673284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138434 Approved by: https://github.com/yaoyj11, https://github.com/houseroad	2024-10-21 06:38:38 +00:00
Will Feng	362ca54f03	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager `async_op=True` collective (#137763 ) This PR aims to support the following use case: ```python def all_reduce_eager(x): y = x * x req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True) assert isinstance(req, torch.distributed.Work) return y @torch.compile(fullgraph=True) def all_reduce_wait_compiled(y): torch.ops.c10d_functional.wait_tensor(y) return y * y ``` where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel. ------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_work_registry` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_work_registry` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D64511994](https://our.internmc.facebook.com/intern/diff/D64511994) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-21 06:02:57 +00:00
cyy	a170ff4167	Prepare to enable ASAN on CUDA (#138404 ) See which tests fail Pull Request resolved: https://github.com/pytorch/pytorch/pull/138404 Approved by: https://github.com/ezyang	2024-10-21 03:55:29 +00:00
Richard Barnes	9ad2736627	Remove extraneous C++14 comment (#138408 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138408 Approved by: https://github.com/Skylion007	2024-10-21 03:54:41 +00:00
PyTorch MergeBot	6987bfb40a	Revert "[dynamo][NFC] Remove unused method `InliningInstructionTranslator.check_replace_is_safe` (#137906 )" This reverts commit 3c7d9d6c7fa565e811675be7dd84e5ef7c8ba7a0. Reverted https://github.com/pytorch/pytorch/pull/137906 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137906#issuecomment-2425505452))	2024-10-21 03:42:38 +00:00
wz337	fb0da32377	[DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117 ) As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension). For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117 Approved by: https://github.com/kwen2501	2024-10-21 03:04:24 +00:00
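The refactor described in the commit above is a case of hoisting loop-invariant work out of an inner loop. The following is a minimal sketch of that pattern only, not the DeviceMesh code itself: `resolve_dim_options`, `build_subgroups`, and the `calls` counter are hypothetical names introduced for illustration, and the subgroup layout is simplified.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical counter: how many times per-dimension options were resolved.
calls = {"options": 0}


def resolve_dim_options(dim: int) -> Tuple[str, None, str]:
    """Hypothetical stand-in for getting backend, pg_options and group_desc."""
    calls["options"] += 1
    return ("nccl", None, f"mesh_dim_{dim}")


def build_subgroups(mesh_shape: Tuple[int, ...]) -> List[Dict]:
    """Describe one subgroup per (dimension, subgroup index) pair."""
    total_ranks = math.prod(mesh_shape)
    groups = []
    for dim, dim_size in enumerate(mesh_shape):
        # Hoisted to the outer loop: these values are identical for every
        # subgroup of this mesh dimension, so resolve them once per dimension.
        backend, pg_options, group_desc = resolve_dim_options(dim)
        for subgroup_index in range(total_ranks // dim_size):
            groups.append(
                {
                    "dim": dim,
                    "index": subgroup_index,
                    "backend": backend,
                    "pg_options": pg_options,
                    "group_desc": group_desc,
                }
            )
    return groups
```

With a `(2, 4)` mesh shape, this sketch resolves the options once per mesh dimension, regardless of how many subgroups each dimension has.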
cyy	a05b64a38f	[5/N] Fix extra warnings brought by clang-tidy-17 (#138403 ) Follows #137983 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138403 Approved by: https://github.com/ezyang	2024-10-21 02:59:54 +00:00
cyy	82eb09aafd	[Environment Variable][4/N] Use thread-safe getenv functions (#137843 ) Follows #137328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843 Approved by: https://github.com/ezyang	2024-10-21 02:58:59 +00:00
Shuqiang Zhang	2d3455e7d9	[c10d] try fix the unstableness of test_get_future_result (#138415 ) Summary: Depending on the platform, either the NCCL error or the timeout may be raised first on rank 0. We now try to force the timeout by not exiting the other ranks. Test Plan: Tests pass locally Tags: Fixes https://github.com/pytorch/pytorch/issues/138397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138415 Approved by: https://github.com/kwen2501	2024-10-21 01:17:30 +00:00
cyy	e7b8a9a4c1	[5/N] Fix clang-tidy warnings in torch/csrc/api/ (#138389 ) Follows #138382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138389 Approved by: https://github.com/ezyang	2024-10-21 01:12:37 +00:00
Will Feng	e4ad02892f	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-20 23:48:54 +00:00
Isuru Fernando	4f45a052ad	Fix try_solve for s1*s2 == 0 when both symbols are unknown (#137919 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137919 Approved by: https://github.com/ezyang	2024-10-20 23:33:08 +00:00
Alnis Murtovi	09cf163ae3	Fix for mixed_mm tests failures on SM70 and lower (#138183 ) This PR fixes mixed_mm tests that are failing on SM70 and lower as discussed here https://github.com/pytorch/pytorch/pull/123762#issuecomment-2406601729. The failure occurs because some of the mixed_mm tests expect triton code to be generated, but on SM70 and lower, the generation of triton code is skipped (see https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L693). These tests will now be skipped when running on SM70 and lower. I do not have access to an SM70 GPU, so I was not able to test these changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138183 Approved by: https://github.com/ezyang	2024-10-20 21:14:31 +00:00
PyTorch MergeBot	a1899b5a9e	Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843 )" This reverts commit 239ad73cb1c8a91f0a2de21d27af3d98f5a8dddc. Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/yf225 due to Sorry for reverting your PR but I believe this PR breaks the binary builds. Example: https://ossci-raw-job-status.s3.amazonaws.com/log/31790258895, with error message: `getenv is not a member of c10::utils`, might be easier to search for `not a member of` in the log ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2425192780))	2024-10-20 19:48:14 +00:00
Will Feng	a9f4f89cd5	[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178 ) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178 Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501	2024-10-20 19:38:18 +00:00
cyy	239ad73cb1	[Environment Variable][4/N] Use thread-safe getenv functions (#137843 ) Follows #137328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843 Approved by: https://github.com/ezyang	2024-10-20 13:05:04 +00:00
drisspg	07fd61e106	[SDPA] Fix warning message (#138278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138278 Approved by: https://github.com/eqy, https://github.com/Skylion007	2024-10-20 08:00:56 +00:00
Huy Do	f568d48890	Enable git long paths checkout on Windows (#138411 ) Checking out PyTorch on Windows starts to fail after ROCm change https://github.com/pytorch/pytorch/pull/131004 in which one of the submodule path, `third_party/composable_kernel`, is getting too long https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376 According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround. ### Testing Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411 Approved by: https://github.com/wdvr	2024-10-20 07:18:44 +00:00
PyTorch MergeBot	f8303740f7	Revert "Enable git long paths checkout on Windows (#138411 )" This reverts commit 12283035f8c08cd3487bfaac25ccef7da90952ba. Reverted https://github.com/pytorch/pytorch/pull/138411 on behalf of https://github.com/huydhn due to Opps, I forgot Windows binary build, let me revert and reland this one ([comment](https://github.com/pytorch/pytorch/pull/138411#issuecomment-2424661640))	2024-10-20 06:50:48 +00:00
Huy Do	12283035f8	Enable git long paths checkout on Windows (#138411 ) Checking out PyTorch on Windows starts to fail after ROCm change https://github.com/pytorch/pytorch/pull/131004 in which one of the submodule path, `third_party/composable_kernel`, is getting too long https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376 According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround. ### Testing Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411 Approved by: https://github.com/wdvr	2024-10-20 06:32:34 +00:00
PyTorch MergeBot	d1027c2be6	Revert "Update sympy version constraint to 1.13.3 (#138338 )" This reverts commit d8279ad9d162b5ce71699f462d3664c3745b14f5. Reverted https://github.com/pytorch/pytorch/pull/138338 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think a bunch of inductor tests and test_dynamic_shapes are failing in trunk after this lands `d8279ad9d1` ([comment](https://github.com/pytorch/pytorch/pull/138338#issuecomment-2424487225))	2024-10-20 03:19:02 +00:00
Jeff Daily	3f3b692a00	[ROCm] CK-based GEMM (#131004 ) - composable_kernel as a third_party submodule - "ck" as a `torch.backends.cuda.preferred_linalg_library()` - reference CK gemm implementations for float, bfloat16, and half types Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004 Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>	2024-10-20 02:57:43 +00:00
Animesh Jain	0a2407b93c	[dynamo] Support omegaconf DictConfig (#138378 ) Fixes https://github.com/pytorch/pytorch/issues/138224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378 Approved by: https://github.com/jansel ghstack dependencies: #138359	2024-10-20 02:43:17 +00:00
Animesh Jain	f892543c1f	[dynamo] Support TypedDict (#138359 ) Seen in vLLM. Fixes https://github.com/pytorch/pytorch/issues/132629 Fixes https://github.com/pytorch/pytorch/issues/133613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138359 Approved by: https://github.com/jansel Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-20 02:43:17 +00:00
cyy	1f349eed61	[4/N] Fix extra warnings brought by clang-tidy-17 (#137983 ) Follows #137552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137983 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-10-20 01:02:33 +00:00
Richard Barnes	b1b7c714ed	Add deprecated C10_UNUSED and C10_NODISCARD macros back (#138398 ) For backwards compatibility. Disallow internal use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138398 Approved by: https://github.com/malfet	2024-10-20 00:21:19 +00:00
Jeongseok (JS) Lee	d8279ad9d1	Update sympy version constraint to 1.13.3 (#138338 ) `sympy` was pinned to version 1.13.1 due to test failures with version 1.13.2 on Windows and macOS, as reported in https://github.com/pytorch/pytorch/pull/133235. Now that a newer version, 1.13.3, has been released, this PR aims to verify whether the test failure has been resolved and also to allow building with newer versions for packaging purposes (e.g., https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277#discussion_r1806721862). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138338 Approved by: https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-20 00:20:02 +00:00
Nichols A. Romero	14a3e12985	[ROCm] Fix ADDMM hipBLASLt regression (#138267 ) Fixes #138067 A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604 The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267 Approved by: https://github.com/malfet	2024-10-20 00:19:10 +00:00
PyTorch MergeBot	47e80abc7a	Revert "[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089 )" This reverts commit fb44658415e50b5be6a187ff3f14243c0fdf3daf. Reverted https://github.com/pytorch/pytorch/pull/138089 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_original_aten_preserved_pad_mm test runs OOM in trunk `fb44658415` ([comment](https://github.com/pytorch/pytorch/pull/138089#issuecomment-2424297269))	2024-10-19 23:55:01 +00:00
Will Feng	fcedf93d1e	[Traceable FSDP2] Add `_compiled_autograd_enabled` global state variable (#138187 ) After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR. Fixes https://github.com/pytorch/pytorch/issues/138177. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187 Approved by: https://github.com/xmfan, https://github.com/awgu	2024-10-19 19:10:31 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
David Berard	fb44658415	[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089 ) replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example. This also adds metadata to register_replacement patterns, including pad_mm. This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have an `add(z, mm(x, y))`, then the metadata won't be transferred right now. Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089 Approved by: https://github.com/aakhundov	2024-10-19 16:37:08 +00:00
Bob Ren	38ea487338	Re-raise in _run_sympy_handler to reduce log spew (#138356 ) Fixes: https://github.com/pytorch/pytorch/issues/138069 I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew. I'm unsure whether it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing fixes for this kind of log spew? Happy to add a test if someone can point me in the right direction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356 Approved by: https://github.com/ezyang	2024-10-19 16:02:45 +00:00
Nikita Shulga	c0879d0c21	Fix lint Regression caused by `fddabc6e0b` that was force merged	2024-10-19 08:33:41 -07:00
cyy	cdc9f14227	[4/N] Fix clang-tidy warnings in torch/csrc/api/ (#138382 ) Follows #138328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138382 Approved by: https://github.com/ezyang	2024-10-19 13:32:51 +00:00
Richard Barnes	fddabc6e0b	C10_UNUSED to [[maybe_unused]] (#6357 ) (#138364 ) Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-10-19 13:17:43 +00:00
cyy	2f6a70bfea	Enable more UBSAN checks (#138288 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138288 Approved by: https://github.com/ezyang	2024-10-19 13:00:26 +00:00
cyy	675e16e137	[3/N] Fix clang-tidy warnings in torch/csrc/api/ (#138328 ) Follows #136998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138328 Approved by: https://github.com/ezyang	2024-10-19 07:07:39 +00:00
PyTorch MergeBot	795255a7c8	Revert "[Traceable FSDP2] Add `_compiled_autograd_enabled` global state variable (#138187 )" This reverts commit 0c913b35aaea9ca33510239e939957ec5fe66d78. Reverted https://github.com/pytorch/pytorch/pull/138187 on behalf of https://github.com/yf225 due to linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu) test_compiled_autograd_ctx failed ([comment](https://github.com/pytorch/pytorch/pull/138187#issuecomment-2423609108))	2024-10-19 06:12:47 +00:00
Nikita Shulga	de16159e56	[MPS] Fix sliced cast (#138314 ) This fixes an internal crash caused by an invalid buffer size computation when the sliced API is used. Not sure what the purpose was of ```c++ IntArrayRef baseShape; if (src.is_view()) { baseShape = src._base().sizes(); } else { baseShape = getIMPSAllocator()->getBufferShape(src.storage().data()); } int flattenedShaped = 1; for (const auto i : c10::irange(baseShape.size())) { flattenedShaped *= baseShape[i]; } ``` as `flattenedShaped` can be computed much more easily as `[srcBuf length]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do. When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its own data type, rather than the size in the new dtype. Fixes https://github.com/pytorch/pytorch/issues/137800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314 Approved by: https://github.com/albanD, https://github.com/DenisVieriu97	2024-10-19 05:17:09 +00:00
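The arithmetic behind the MPS fix above can be illustrated with a small sketch. It only mirrors the idea of deriving the element count from the buffer's byte length and the current element size; `flattened_numel` is a hypothetical helper, not an MPS or PyTorch function.

```python
def flattened_numel(buffer_nbytes: int, element_size: int) -> int:
    """Element count of a buffer when viewed with the given element size.

    Hypothetical helper mirroring `[srcBuf length] / src.element_size()`.
    """
    if element_size <= 0:
        raise ValueError("element_size must be positive")
    # Integer division: trailing padding bytes never add a partial element.
    return buffer_nbytes // element_size


# A 16-byte buffer allocated for uint8 (1 byte per element) holds 16 elements,
# but viewed as float16 (2 bytes per element) it holds only 8.
uint8_count = flattened_numel(16, 1)
float16_count = flattened_numel(16, 2)
```

Using the original tensor's shape here would report 16 elements for the float16 view, which is the mismatch the commit message describes.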
Will Feng	0c913b35aa	[Traceable FSDP2] Add `_compiled_autograd_enabled` global state variable (#138187 ) After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR. Fixes https://github.com/pytorch/pytorch/issues/138177. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187 Approved by: https://github.com/xmfan, https://github.com/awgu ghstack dependencies: #138245, #138174	2024-10-19 04:33:35 +00:00
Will Feng	8f118e53d7	[CI] Fix CompiledDDP failure when the gradient is not contiguous; Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138174 ) Summary: As title `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174 Approved by: https://github.com/yf225, https://github.com/kwen2501 ghstack dependencies: #138245 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-19 04:33:35 +00:00
Jeongseok Lee	3cfd244495	Add USE_SYSTEM_NVTX option (#138287 ) ## Summary We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages. This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior). ## Test Plan The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](`cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287 Approved by: https://github.com/albanD	2024-10-19 04:26:01 +00:00
Michael Lazos	a20a17fd6f	[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669 ) Fixes https://github.com/pytorch/pytorch/issues/114369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669 Approved by: https://github.com/anijain2305	2024-10-19 04:12:45 +00:00
PyTorch UpdateBot	88eb15a3e3	[audio hash update] update the pinned audio hash (#138139 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138139 Approved by: https://github.com/pytorchbot	2024-10-19 04:02:21 +00:00
Wouter Devriendt	7d076b9e3a	updated EC2 fetching of metadata to use IMDSv2 (#138286 )	2024-10-18 20:58:47 -07:00
PyTorch MergeBot	ac7f52b301	Revert "[inductor] add a threshold for membw saving during fusion (#136782 )" This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8. Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))	2024-10-19 03:43:42 +00:00
Ke Wen	fecd370ea1	[c10d] Fix color value for comm split being negative (#137855 ) Fixes https://github.com/pytorch/pytorch/issues/137856. ### Issue 1 Today under `ProcessGroupNCCL::Options`, color is declared as: ``` int64_t split_color{0}; ``` When passing this variable to `ncclCommSplit`, which accepts `int`, the value may overflow and become negative, as in #137856. But the NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`). But that's not all. ### Issue 2 `split_color` is pybind'ed to the python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain: ``` [rank0]: TypeError: (): incompatible function arguments. The following argument types are supported: [rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None ``` This is because python `int` represents a wider range than C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from python to C++. The PR takes the hash value modulo `c_int`'s max value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855 Approved by: https://github.com/wconstab	2024-10-19 03:17:19 +00:00
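The last step of the commit above, reducing a potentially large Python hash so that it fits a C `int`, can be sketched as follows. This is an illustration under stated assumptions rather than the code from the PR: `C_INT_MAX` and `to_split_color` are hypothetical names, and the maximum is derived with `ctypes` instead of being hard-coded.

```python
import ctypes

# Largest value a C `int` can hold on this platform
# (2**31 - 1 where `int` is 32 bits wide).
C_INT_MAX = (1 << (8 * ctypes.sizeof(ctypes.c_int) - 1)) - 1


def to_split_color(hash_value: int) -> int:
    """Hypothetical helper: fold any Python int into the range [0, C_INT_MAX).

    Python's `%` returns a non-negative result for a positive modulus,
    so negative hash values are also mapped to non-negative colors.
    """
    return hash_value % C_INT_MAX
```

Because the result is always non-negative and below `C_INT_MAX`, it round-trips through `ctypes.c_int` without overflow.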
Richard Barnes	542f7c8383	Eliminate C10_NODISCARD (#138336 ) Test Plan: Sandcastle Reviewed By: swolchok Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336 Approved by: https://github.com/Skylion007	2024-10-19 02:54:06 +00:00
fduwjj	a4b6ef178c	[c10d] Reorder cpp stack dump and FR dump and add log prefix to loggings (#138368 ) The rationale behind this PR is to: 1. Move the dump of C++ traces after the FR dump, because the FR dump is timed, meaning that it will not block forever, while the dumping of C++ traces is likely to block; hence we swap the order. Ideally we also want to make the cpp stacktrace dump a future wait; if we want to go down this path, we can do that in another PR. 2. Add the log prefix to the logs that did not have it yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368 Approved by: https://github.com/c-p-i-o	2024-10-19 02:43:41 +00:00
Rachel Guo	ea412d5554	[AOTI] Fix a special case compile time data type codegen for sym int variables (#138106 ) Summary: This change unblocks the CFR AOTI lowering runtime error. TL;DR: In this model, one triton kernel expects a scalar input dtype as i64, but getting an i32. The reason is "auto" can infer a smaller data type if the variable it passed in e.g. is i32. thus cause CUDA IMA. Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`. This diff manually cast it to i64 for all symbolic arguments in compile time for i64 triton kernel inputs, instead of use `auto var_x = {arg}` in cpp wrapper code. Test Plan: Verified in FLB locally: ``` PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16 --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"``` Differential Revision: D64490039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106 Approved by: https://github.com/ColinPeppler	2024-10-19 02:30:53 +00:00
Xu Han	d5035f0aab	fix codecache write_atomic path issue on Windows. (#138331 ) Fixes #138211 The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists. This does not happen on Linux, so I wrote a small piece of reproduction code to figure out what happened. After stepping through the repro code: ```python import os import shutil import sys from pathlib import Path _IS_WINDOWS = sys.platform == "win32" def test_case(): cwd = os.getcwd() path1 = os.path.join(cwd, "haha1.txt") path2 = Path(os.path.join(cwd, "haha2.txt")) try: path2.rename(path1) except FileExistsError as e_file_exist: if _IS_WINDOWS: # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename shutil.copy2(path2, path1) os.remove(path2) else: raise e_file_exist except BaseException as e: raise e print("run here.") if __name__ == "__main__": test_case() ``` we found that `path2.rename(path1)` can be broken down into: 1. copy file2's content to file1. 2. delete file2. So we can implement equivalent code for the Windows path: ```python shutil.copy2(src=tmp_path, dst=path) os.remove(tmp_path) ``` which gives the current PR. TODO: need cherry-pick to release/2.5 branch, CC: @atalman . Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-19 01:27:12 +00:00
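The fallback described in the commit above can be packaged as a small standalone function. This is a sketch based on the snippets quoted in the commit message, not the actual `write_atomic` implementation; `move_into_place` is a hypothetical name.

```python
import os
import shutil
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"


def move_into_place(tmp_path: Path, path: Path) -> None:
    """Move tmp_path over path, tolerating an existing target on Windows."""
    try:
        # On POSIX this silently replaces an existing file at `path`.
        tmp_path.rename(path)
    except FileExistsError:
        if not _IS_WINDOWS:
            raise
        # On Windows, Path.rename raises FileExistsError when the target
        # exists, so fall back to copying the content and removing the source.
        shutil.copy2(src=tmp_path, dst=path)
        os.remove(tmp_path)
```

Note that the copy-then-remove fallback takes two steps, so it is not a single atomic operation in the way a successful rename is.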
Aleksei Nikiforov	949b6f685d	Enable -Werror on s390x (#136527 ) Enable -Werror on s390x Example of original issue on s390x: https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704 Most of warnings are not specific to s390x, but specific to gcc-13 or gcc-14. To test it on s390x an image with gcc-13 is needed. For s390x it's tested for new regressions on every merge due to trunk workflow. `-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe. `-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes used clang versions complain about copying objects when they could be moved. For now also disable it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527 Approved by: https://github.com/malfet	2024-10-19 01:18:42 +00:00
Nikita Shulga	4a3c9400fe	Update cpuinfo submodule (#138351 ) To suppress error on ARM systems where PR_SVE_GET_VL is missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138351 Approved by: https://github.com/Skylion007	2024-10-19 01:12:29 +00:00
wz337	ff598f2f4d	[DTensorTestbase] Add an optional `eager_init` flag to `with_comms()` to support eager init nccl communicator for DeviceMesh test case (#138108 ) Add an optional `eager_init` flag to `with_comms`. When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization. Otherwise, `device_id` is still `None` and this goes through the normal lazy call. Default for `eager_init` is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108 Approved by: https://github.com/kwen2501	2024-10-19 01:04:55 +00:00
nihui	b3ae1b1b73	[CMake] remove duplicated cmake options for Gloo and C10D (#138318 ) Just a trivial fix :P The cmake options on lines 345 to 357 are identical to those on lines 358 to 369; remove the duplicated lines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318 Approved by: https://github.com/janeyx99	2024-10-19 00:26:25 +00:00
Shunting Zhang	6647320de2	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default, we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-19 00:22:43 +00:00
PyTorch MergeBot	e8b1409dcf	Revert "[user triton] typing triton_kernel_wrap.py (#138230 )" This reverts commit 2f61b69603756c1fcaef71b231e598df31e20f42. Reverted https://github.com/pytorch/pytorch/pull/138230 on behalf of https://github.com/wdvr due to Reverting this, as it started failing tests on main ([comment](https://github.com/pytorch/pytorch/pull/138230#issuecomment-2423354596))	2024-10-18 23:12:29 +00:00
Jason Ansel	4632594546	[inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252 ) There are some places where it would be nice to use this, but the scheduler hasn't yet been created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252 Approved by: https://github.com/eellison ghstack dependencies: #138170	2024-10-18 23:05:54 +00:00
Jason Ansel	85a6a782e5	[inductor] Generalize WorkspaceArg for graph-level semaphores (#138170 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170 Approved by: https://github.com/Chillee	2024-10-18 23:05:54 +00:00
Simon Fan	13bcb065f5	[compiled autograd] enable some reentrant tests (#137290 ) Some seem to fail due to queue_callback usage Pull Request resolved: https://github.com/pytorch/pytorch/pull/137290 Approved by: https://github.com/yf225	2024-10-18 22:25:08 +00:00
PyTorch MergeBot	47e4045566	Revert "[pt2] Log is_forward field to dynamo_compile scuba table (#138097 )" This reverts commit 4e9273c84edafdcfff57521dde6675b967181ba8. Reverted https://github.com/pytorch/pytorch/pull/138097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a land race with https://github.com/pytorch/pytorch/pull/137803 ([comment](https://github.com/pytorch/pytorch/pull/138097#issuecomment-2423297516))	2024-10-18 22:00:40 +00:00
Aaron Shi	bd7cbddfe3	[CODEOWNERS] Remove aaronenyeshi from Profiler paths (#138346 ) As title, remove aaronenyeshi from Profiler paths. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138346 Approved by: https://github.com/sraikund16	2024-10-18 21:46:00 +00:00
Ke Wen	c88b77af9c	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 21:39:39 +00:00
Cen Zhao	7faa1284ab	[ptd][amd] call alltoallv instead of send/recv (#136368 ) Summary: as $title AMD provides a2av API, we should just use it instead of implementing PTD's own set of send/recv. we should not skip 0B send/recv within a2av, it may lead to dead lock: see details https://github.com/ROCm/rccl/pull/1349 Test Plan: before: mvai-job will timeout on all2all https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240913-1426-327e119d?job_attempt=1&version=0&env=PRODUCTION after: https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240919-1932-ebce94e6?job_attempt=0&tab=execution_details&env=PRODUCTION latest APS job: https://fburl.com/mlhub/vn6dj7zp Differential Revision: D63076315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136368 Approved by: https://github.com/xw285cornell	2024-10-18 21:31:57 +00:00
Shivam Raikundalia	5b58697cc7	[Profiler] Clang bugs in Collection [1/n] (#138296 ) Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bugs instead of the variable name ones because that will introduce a lot of lines of code and can make things hard to read Test Plan: Format tests pass. Differential Revision: D64411171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296 Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007	2024-10-18 21:06:50 +00:00
James Wu	295de00908	[PT2 Compile Events] Revamp PT2 Compile/chromium event logging [1/?] (#138093 ) This diff is the first step of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709 It implements the following changes: - Only log spans to scuba, so no start events are ever logged - Log events as the full event name, without "START" or "END" - Only log to scuba major phases from chromium events. These are: - entire_frame_compile (dynamo) - backend_compile (aotdispatch) - inductor_compile (inductor) - codegen (inductor codegen) Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well: - When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace to not have two identical rows. The fn_name is available as metadata on the chromium event, if interested - Log new events for pre and post grad passes. These do not log to scuba. By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add much more metadata and information to each individual event type. Diffs for that will come later. IMPLEMENTATION NOTES: - The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one. - There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. 
I'm not sure what we're spending time doing in that time, but I'll find out with a profile later. Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093 Approved by: https://github.com/ezyang	2024-10-18 20:36:08 +00:00
Ryan Guo	3c7d9d6c7f	[dynamo][NFC] Remove unused method `InliningInstructionTranslator.check_replace_is_safe` (#137906 ) This method was no longer needed after #113725; the checking logic is now in `SideEffects.check_allowed_side_effect`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906 Approved by: https://github.com/Skylion007, https://github.com/anijain2305 ghstack dependencies: #137905	2024-10-18 20:20:42 +00:00
Ryan Guo	162eba2dee	[dynamo] Remove `mutable_local.source` and index on `VariableTracker` rather than `MutableLocalBase` (#137905 ) This patch addresses parts of the side-effect refactor proposed in #133027; specifically, it does 3 things: 1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars` to index on `VariableTracker` rather than `MutableLocalBase`. 2. Remove the `source` field from `MutableSideEffects` and `AttributeMutation`, and use `VariableTracker.source` instead. 3. Plumb an `overridden_sources: Dict[Source, Source]` from `handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't update `VariableTracker.source` in place, while still preserving what `handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for certain `VariableTracker`). (1) and (2) are merged into one patch because of a dependency between (a) `OutputGraph.handle_aliases_for_stolen_lists`, which iterates over `SideEffects.store_attr_mutations.keys()` and potentially updates their source fields to be completely different, and (b) `SideEffects.codegen_update_mutated`, which happens after the above and uses `cg(var.mutable_local.source)`. If we apply (1) only, (b) breaks, and if we apply (2) only, (a) breaks. (3) is needed for correctness; see comments in the PR for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905 Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos	2024-10-18 20:20:42 +00:00
PyTorch MergeBot	7b39fb5712	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))	2024-10-18 20:09:40 +00:00
Zain Rizvi	cd1e9b0e60	[EZ] Remove canary scale config (#138361 ) Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767 Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy. Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361 Approved by: https://github.com/wdvr	2024-10-18 20:02:00 +00:00
Benjamin Glass	1ac42b5f3e	graph.py: Refine unspec variable finding (#137303 ) Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303 Approved by: https://github.com/eellison ghstack dependencies: #135701	2024-10-18 20:00:25 +00:00
Will Constable	d5bb70afe3	[Pipelining] Remove unnecessary {0,1} qualifier from regex (#138271 ) There should always be 1 action. This may be an artifact from trying to extend the regex to handle the fused SEND_F_RECV_B style actions, which was abandoned. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138271 Approved by: https://github.com/H-Huang ghstack dependencies: #138142	2024-10-18 19:52:07 +00:00
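To illustrate the commit above: a `{0,1}` qualifier on the action group lets a string with no action at all still match, while requiring exactly one action rejects it. The patterns and the `<stage><action><microbatch>` string format below are simplified, hypothetical stand-ins and are not the regex used in the pipelining code.

```python
import re

# Hypothetical action strings of the form "<stage><action><microbatch>", e.g. "2F0".
# Both patterns are illustrative only.
OPTIONAL_ACTION = re.compile(r"(\d+)(F|B|W){0,1}(\d*)")
REQUIRED_ACTION = re.compile(r"(\d+)(F|B|W)(\d*)")


def parses(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


# "20" carries no action letter: the optional form silently accepts it,
# the required form rejects it.
optional_accepts_missing = parses(OPTIONAL_ACTION, "20")
required_accepts_missing = parses(REQUIRED_ACTION, "20")
required_accepts_valid = parses(REQUIRED_ACTION, "2F0")
```

With the stricter pattern, a malformed entry surfaces as a parse failure instead of being carried forward with an empty action.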
Will Constable	f23e8a8923	[Pipelining] Fix/improve format_pipeline_order (#138142 ) Fix an issue where the format function modified the original data structure. Print an empty string instead of "None" for cleaner visualization of bubbles. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138142 Approved by: https://github.com/H-Huang	2024-10-18 19:52:07 +00:00
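The two changes described in the commit above (do not mutate the caller's data while formatting, and render idle slots as empty cells) can be sketched as follows. The function name, the list-of-lists schedule, and the action strings are assumptions for illustration; this is not the actual `format_pipeline_order` implementation.

```python
from typing import List, Optional


def format_schedule(schedule: List[List[Optional[str]]]) -> str:
    """Render one row per rank; idle slots (None) become empty cells."""
    # Build new lists of strings instead of overwriting entries in `schedule`.
    rows = [
        ["" if action is None else action for action in rank_actions]
        for rank_actions in schedule
    ]
    width = max((len(cell) for row in rows for cell in row), default=0)
    return "\n".join(
        f"rank {rank}: " + " ".join(cell.ljust(width) for cell in row)
        for rank, row in enumerate(rows)
    )


# Hypothetical two-rank schedule; None marks a bubble (idle step).
schedule = [["0F0", "0F1", None], [None, "1F0", "1F1"]]
rendered = format_schedule(schedule)
```

Because the formatter only reads `schedule`, the caller can keep using the same structure for execution after printing it.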
Chong Gu	d512d0e227	Always use aten.constant_pad_nd for mm padding (#137820 ) Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc. Test Plan: ``` buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR ``` ``` buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR ``` Differential Revision: D64271583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820 Approved by: https://github.com/eellison	2024-10-18 19:35:03 +00:00
David Berard	2f61b69603	[user triton] typing triton_kernel_wrap.py (#138230 ) Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-10-18 19:29:31 +00:00
Tugsbayasgalan Manlaibaatar	1f32a1fb80	Replace torch.export default decomp table to be lazily populated (#137650 ) In this PR, we implement a lazy dictionary for the export decomp behaviour for the following reason: custom op loading can happen after import time, and as a result the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible. I intentionally separated out core_aten_decomp so that it does not have any custom CIA ops in this PR, to mitigate the risk of getting reverted, but in the future core_aten_decomp under torch/_decomp will exist as an alias to the official export table (torch.export.default_decompositions). Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650 Approved by: https://github.com/justinchuby, https://github.com/bdhirsh	2024-10-18 19:28:52 +00:00
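The idea of delaying materialization can be sketched with a small lazily populated mapping. `LazyTable`, `registry`, and the string entries are hypothetical names used only for illustration; the sketch does not reproduce the torch.export implementation.

```python
from collections.abc import Mapping


class LazyTable(Mapping):
    """Read-only mapping whose contents are computed on first access."""

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable returning a dict
        self._table = None       # not materialized yet

    def _materialize(self):
        if self._table is None:
            self._table = dict(self._factory())
        return self._table

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())


# Hypothetical registry standing in for ops that may be registered after import time.
registry = {"op_a": "decomp_a"}
table = LazyTable(lambda: registry)          # nothing is read from `registry` yet
registry["custom_op"] = "custom_decomp"      # a late registration
late_entry_visible = "custom_op" in table    # first access materializes the table
```

An eagerly built table would have snapshotted `registry` at construction time and missed the late entry; the trade-off is that entries registered after the first access are not picked up in this simple sketch.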
Nikita Shulga	ea8ea2f33f	Improve build_with_deb_info (#138290 ) Skip over commands that do not have an output file specified. Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for deb info rebuilds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290 Approved by: https://github.com/Skylion007	2024-10-18 18:50:12 +00:00
Sam Larsen	4e9273c84e	[pt2] Log is_forward field to dynamo_compile scuba table (#138097 ) Summary: ^^ Test Plan: Ran a test script out of fbcode: D64350202. Then: ``` (pytorch-3.10_4) devvm2296:~/fbcode $ scuba -e="select time,co_filename,is_forward from \`dynamo_compile/sandbox\` where is_forward is not null" +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ \| time \| co_filename \| is_forward \| +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ \| 1729032583 \| /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py \| 1 \| \| 1729032583 \| null \| 0 \| \| 1729032650 \| /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py \| 1 \| \| 1729032650 \| null \| 0 \| +------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ 4 row(s) in set (0 warnings, 131 errors, 0.80 sec) ``` Reviewed By: ezyang Differential Revision: D64438144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138097 Approved by: https://github.com/ezyang	2024-10-18 18:48:52 +00:00
Aaron Gokaslan	195d0a666b	[BE][Ez]: Use interned hardcoded string FURB156 (#138330 ) Uses string constants from string module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330 Approved by: https://github.com/albanD	2024-10-18 18:26:16 +00:00
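As a small illustration of the kind of change the commit above describes, the standard `string` module already exposes the common character sets, so a hardcoded literal can be replaced by the named constant. The helper `is_hex` is a made-up example, not code from the PR.

```python
import string

# Hardcoded literal of the kind such a lint rule flags.
HEX_DIGITS_LITERAL = "0123456789abcdefABCDEF"


def is_hex(text: str) -> bool:
    """Check membership against the stdlib constant instead of a literal."""
    return bool(text) and all(ch in string.hexdigits for ch in text)


# The literal and the stdlib constant contain exactly the same characters.
same_characters = HEX_DIGITS_LITERAL == string.hexdigits
```

Using the named constant avoids subtle typos in long literals and makes the intent readable at the call site.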
Svetlana Karslioglu	9c2a80322a	Add Programmable Google Search (#137716 ) - Adding the code for the programmable Google search - Adding the CSS overrides. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137716 Approved by: https://github.com/seemethere, https://github.com/albanD Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-10-18 18:18:16 +00:00
Huy Do	8d869c9ec7	Skip test_circular_dependencies on ROCm (#138312 ) The test is flaky on ROCm and has been disabled for quite a while https://github.com/pytorch/pytorch/issues/110040. The disabled issue was opened and then closed several times, so it's better to close that issue and skip the test here. (This does not really fix the issue; I just want the test to be skipped on PRs instead of being disabled, and then close the issue.) Fixes https://github.com/pytorch/pytorch/issues/110040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138312 Approved by: https://github.com/jithunnair-amd, https://github.com/clee2000	2024-10-18 18:17:48 +00:00
Jason Ansel	620039c38c	[inductor] Respect ir_dataclass(frozen=...) in Python 3.9 (#138247 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138247 Approved by: https://github.com/Skylion007, https://github.com/Chillee	2024-10-18 17:55:12 +00:00
PyTorch MergeBot	ada7a8c217	Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178 )" This reverts commit 8cb91109061648497ca09d6f1f9b9e13a2f5557e. Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/yf225 due to because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2422961292))	2024-10-18 17:51:54 +00:00
Ryan Guo	59158f640c	[dynamo] Support equality comparison between Tensor and `None` (#138289 ) This patch updates the `wrap_fx_proxy_cls` function to allow boolean output when the operation is one of `supported_const_comparison_op_values`. Fixes #120907. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138289 Approved by: https://github.com/williamwen42	2024-10-18 17:49:26 +00:00
Aaron Orenstein	9ea271d40b	Expand doc for bundled autotune cache (#138298 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138298 Approved by: https://github.com/ezyang, https://github.com/oulgen	2024-10-18 17:43:47 +00:00
intellinjun	4bba038b2f	Add diagonal_copy to torch/_decomp/__init__.py (#136730 ) Fixes https://github.com/pytorch/pytorch/issues/117349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136730 Approved by: https://github.com/masnesral	2024-10-18 17:39:17 +00:00
Catherine Lee	666572d819	Update viable strict workflow (#138262 ) Corresponds to https://github.com/pytorch/test-infra/pull/5775 Tested in https://github.com/pytorch/pytorch/actions/runs/11393196544/job/31700963325?pr=138262 by adding my branch to the environment and pointing the workflow at my test-infra branch and commenting out the parts that did the push + upload record to s3 Versioning would have been good for this... Pull Request resolved: https://github.com/pytorch/pytorch/pull/138262 Approved by: https://github.com/huydhn	2024-10-18 17:28:55 +00:00
atalman	912ea5601b	Move manywheel binary scripts to pytorch (#138103 ) PR to remove Manywheel Scripts: https://github.com/pytorch/builder/pull/2017 Test PR : https://github.com/pytorch/pytorch/pull/138325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138103 Approved by: https://github.com/malfet	2024-10-18 17:11:28 +00:00
Li, Xingyuan	358ff3b731	[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 1) (#136069 ) [Inductor UT] Generalize Newly introduced inductor UTs for intel GPU reuse `test/inductor/test_autoheuristic.py` reuse `test/inductor/test_b2b_gemm.py` reuse `test/inductor/test_custom_lowering.py` reuse `test/inductor/test_efficient_conv_bn_eval.py` reuse `test/inductor/test_group_batch_fusion.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136069 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel	2024-10-18 16:58:09 +00:00
Richard Barnes	8dd575faf6	[BE] Modernize `C10_UNUSED` (#138102 ) [`[[maybe_unused]]`](https://en.cppreference.com/w/cpp/language/attributes/maybe_unused) is part of C++17 standard Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/138102 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/eqy	2024-10-18 16:33:01 +00:00
Wu, Chunyuan	de51ed8610	[AOTI] Add C shim for _mkl_linear (#137880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137880 Approved by: https://github.com/desertfire	2024-10-18 16:26:19 +00:00
PyTorch MergeBot	26ac5671dc	Revert "Fix CompiledDDP failure when the gradient is not contiguous (#138174 )" This reverts commit 0ecafda6024f50734118dd794ac71b86c6e6d569. Reverted https://github.com/pytorch/pytorch/pull/138174 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but I think it fails test_compute_comm_reordering in trunk for rocm and multigpu setup ([comment](https://github.com/pytorch/pytorch/pull/138174#issuecomment-2422818971))	2024-10-18 16:17:54 +00:00
Jean Schmidt	98856f7ea1	Increase max runners available for linux.12xlarge and windows.8xlarge.nvidia.gpu.nonephemeral (#138332 ) Related PR on test-infra: https://github.com/pytorch/test-infra/pull/5785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138332 Approved by: https://github.com/clee2000, https://github.com/huydhn	2024-10-18 16:17:36 +00:00
PyTorch MergeBot	af306a392c	Revert "Dont decompose aten.baddmm in inductor (#137904 )" This reverts commit 7a117f3b3eea4cfeef21da2e3a8a1e39c30fa07d. Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to unfortunately the failures on the previous import are still present on the current one D64568703 ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2422789143))	2024-10-18 16:01:01 +00:00
ErezYosef	5a81475884	Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321 ) ### Description: This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](https://github.com/pytorch/pytorch/pull/134107#discussion_r1800833948). Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict. You can find the related docs here: [Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict). @janeyx99 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138321 Approved by: https://github.com/janeyx99	2024-10-18 15:41:43 +00:00
Aaron Orenstein	86aefa9405	typing subproc_pool.py (#138032 ) Added type annotations to subproc_pool.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138032 Approved by: https://github.com/Skylion007	2024-10-18 15:31:05 +00:00
Joona Havukainen	aa3ae50c07	Fixing MPS conv1d error message for output 2**16 (#134770 ) Fixes #134416 by removing the misleading message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134770 Approved by: https://github.com/malfet	2024-10-18 14:13:20 +00:00
albanD	c4ed03cea1	Add proper handling for view and factory function for csan (#138236 ) In particular, properly handle that some functions only read/write metadata on the Tensor and thus should not be detected as read/write by csan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138236 Approved by: https://github.com/ngimel	2024-10-18 14:04:18 +00:00
PyTorch MergeBot	0ff6f7a040	Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 )" This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af. Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))	2024-10-18 13:21:17 +00:00
Xuan Zhang	e027403dea	ILP for Auto SAC (Selective Activation Checkpointing) (#137908 ) This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to apply activation checkpointing (AC) and the amount of activation memory that should be discarded for each module. The MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs: * https://github.com/pytorch/pytorch/pull/124688 * https://github.com/pytorch/pytorch/pull/134243 * https://github.com/pytorch/pytorch/pull/135208 End-to-end example and its sample output: ``` import copy from typing import Tuple import torch from torch._subclasses.fake_tensor import FakeTensorMode from torch.distributed._tools.ilp_utils import ( aggregate_stats, get_peak_memory_runtime_baseline, parse_module_info, ) from torch.distributed._tools.mem_tracker import _ModState, MemTracker from torch.distributed._tools.runtime_estimator import RuntimeEstimator from torch.distributed._tools.sac_estimator import SACEstimator from torch.distributed._tools.sac_ilp import sac_milp from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, ) def _init_model_input_optimizer() -> Tuple[ torch.nn.Module, torch.optim.Optimizer, torch.Tensor ]: bsz = 8 model_args = ModelArgs( n_layers=4, n_heads=12, vocab_size=8192, max_seq_len=1024, dim=768, dropout_p=0.1, ) with torch.device(torch.cuda.current_device()): model = Transformer(model_args) optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True) inp = torch.randint( 0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=torch.cuda.current_device(), ) return (model, optimizer, inp) def _run_and_get_mem_tracker( model: torch.nn.Module, optimizer: torch.optim.Optimizer, inp: torch.Tensor, ) -> MemTracker: mem_tracker = MemTracker() mem_tracker.track_external(model, optimizer) with mem_tracker as mt: for iter_idx in 
range(2): # running twice to initialize optimizer output = model(inp) output.sum().backward() if iter_idx == 1: last_snapshot = mt.get_tracker_snapshot("current") optimizer.step() optimizer.zero_grad() if iter_idx == 0: mt.reset_mod_stats() assert last_snapshot is not None for mod_stats in mem_tracker.memory_tracking.values(): if _ModState.POST_BW not in mod_stats.snapshots.keys(): mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append( copy.deepcopy(last_snapshot) ) return mem_tracker def _run_and_get_runtime_estimator( model: torch.nn.Module, optimizer: torch.optim.Optimizer, inp: torch.Tensor, ) -> RuntimeEstimator: def _run_one_step() -> None: output = model(inp) output.sum().backward() optimizer.step() optimizer.zero_grad() # Initializing optimizer states and warm-up _run_one_step() runtime_estimator = RuntimeEstimator() with runtime_estimator(estimate_mode_type="operator-level-cost-model"): _run_one_step() # We use only one iteration for estimation return runtime_estimator def _run_and_get_sac_estimator( model: torch.nn.Module, inp: torch.Tensor, ) -> SACEstimator: sac_estimator = SACEstimator() with sac_estimator(estimate_mode_type="operator-level-cost-model"): loss = model(inp).sum() loss.backward() return sac_estimator def main(): with FakeTensorMode(): model, optimizer, inp = _init_model_input_optimizer() mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp) runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp) sac_estimator = _run_and_get_sac_estimator(model, inp) mod_info = aggregate_stats( model, mem_tracker, runtime_estimator, sac_estimator, torch.device(torch.cuda.current_device()), ) g = parse_module_info(mod_info) peak_mem, compute_time = get_peak_memory_runtime_baseline(g) print("=== WITHOUT AC ===") print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB") print(f"compute_time: {round(compute_time, 2)} ms") ac_decisions, recomputation_time, peak_mem = sac_milp(g, memory_budget=1.75) print("=== WITH AC ===") 
print(f"ac_decisions: {ac_decisions}") print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB") print(f"recomputation_time: {recomputation_time} ms") if __name__ == "__main__": main() ``` ``` === WITHOUT AC === peak_mem: 2.41 GiB compute_time: 97.97 ms === WITH AC === ac_decisions: {'Transformer.layers.0': 0.5232, 'Transformer.layers.1': 0.5232, 'Transformer.layers.2': 0.6849, 'Transformer.layers.3': 0.5232} peak_mem: 1.75 GiB recomputation_time: 5.92 ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137908 Approved by: https://github.com/weifengpy	2024-10-18 12:45:37 +00:00
zeshengzong	7b863230ea	[Docs] Optimize parameter description to declare allowed type (2/N) (#138152 ) Inspired by issues #137422 and #103847. Optimize method parameter types in the docs to give users a clearer idea of what they are expected to pass to methods. Previous PR: - [x] https://github.com/pytorch/pytorch/pull/137956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138152 Approved by: https://github.com/albanD	2024-10-18 11:18:19 +00:00
Tom Ritchford	354bc3ac11	[dynamo] Remove an unused variable in repro.after_aot (#138094 ) * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138094 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-10-18 09:37:10 +00:00
Tom Ritchford	e1c4548441	[dynamo] Simplify creation of VariableTrackers (#135714 ) ## `VariableTracker::build()` hides the Builders ### The problem In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](`083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)`). Variations on this code are repeated in many places. More, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the whole import process to avoid circular imports, so they end up being repeatedly imported at local scope. ### The solution In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import. This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports. CHANGES: Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714 Approved by: https://github.com/jansel	2024-10-18 09:36:46 +00:00
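The pattern described in the commit above, a single static factory that hides the choice between two builder classes, can be sketched as below. All class names and return values here are hypothetical placeholders; in particular, the real builders live in a separate module and are imported at call time, which this self-contained sketch only mentions in a comment.

```python
class Tracker:
    """Hypothetical base class with a single static factory entry point."""

    @staticmethod
    def build(value, source=None):
        # In a real code base the builder classes would be imported here,
        # at call time, to avoid circular imports. They are defined below
        # in the same file purely to keep this sketch self-contained.
        if source is not None:
            # Constructor that creates an object which is immediately called.
            return SourcedBuilder(source)(value)
        # Plain method call on the other builder.
        return SourcelessBuilder.create(value)


class SourcedBuilder:
    def __init__(self, source):
        self.source = source

    def __call__(self, value):
        return ("sourced", self.source, value)


class SourcelessBuilder:
    @staticmethod
    def create(value):
        return ("sourceless", value)


with_source = Tracker.build(3, source="L['x']")
without_source = Tracker.build(3)
```

Call sites then depend only on `Tracker.build`, so the repeated local imports and the builder-selection logic live in one place.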
Ke Wen	1581a93e87	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 09:10:01 +00:00
Will Feng	1a8b4c65ac	Fix scatter and gather shape check error message (#138310 ) The error message seems incorrect based on the surrounding code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138310 Approved by: https://github.com/Microve, https://github.com/fegin	2024-10-18 07:49:07 +00:00
Tugsbayasgalan Manlaibaatar	517012058d	Move test_db to training IR (#138251 ) Differential Revision: [D64560792](https://our.internmc.facebook.com/intern/diff/D64560792) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138251 Approved by: https://github.com/yushangdi ghstack dependencies: #138249	2024-10-18 07:42:13 +00:00
Tugsbayasgalan Manlaibaatar	29264fcbef	Move test_verifier to training IR (#138249 ) Differential Revision: [D64560351](https://our.internmc.facebook.com/intern/diff/D64560351) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138249 Approved by: https://github.com/yushangdi	2024-10-18 07:36:29 +00:00
Avik Chaudhuri	5d01126616	preserve module signature with multiple calls (#137999 ) Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places. Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export. Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.) Differential Revision: D64406945 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999 Approved by: https://github.com/tugsbayasgalan	2024-10-18 07:30:22 +00:00
Jing Xu	14e6624473	Update wmic command used in collect_env.py to its counterpart in powershell due to its deprecation (#138297 ) As title. `wmic` is deprecated in Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138297 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-18 07:03:17 +00:00
Adnan Akhundov	d116d007ee	Add host-side Triton TMA support to Inductor (#137950 ) This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes: - Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of the time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024. - Due to Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as an additional argument. - Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR. - `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel). - Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code). - New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA. - AOT Inductor support will be implemented in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950 Approved by: https://github.com/eellison ghstack dependencies: #137677	2024-10-18 06:27:24 +00:00
zeshengzong	82443798aa	[Distributed] Refactor compress hook to remove duplicated code (#138182 ) Fix TODO in code ```python # TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress. ``` 1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method 2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with different `dtype` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182 Approved by: https://github.com/awgu	2024-10-18 06:01:15 +00:00
Huy Do	80a58b7207	Use fresh cache directory in test_cudacodecache (#138243 ) This test frequently times out flakily, for example, https://github.com/pytorch/pytorch/actions/runs/11377972115/job/31654107609#step:22:2376. I still couldn't reproduce this behavior locally running this multiple times and in parallel. ~~So, I suspect that the error only shows up when other tests are run in parallel.~~ ~~I attempt to run this serially in this PR, once landed, I can monitor trunk to see if this helps.~~ Running serially still ends up timing out https://github.com/pytorch/pytorch/actions/runs/11391445912/job/31697603438, another try with fresh cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138243 Approved by: https://github.com/clee2000	2024-10-18 05:45:39 +00:00
ur4t	0b168ceb6d	Collect Nvidia libraries with collect_env.py (#138076 ) Collect Nvidia libraries to diagnose issues like #133548. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138076 Approved by: https://github.com/malfet	2024-10-18 05:05:00 +00:00
Will Feng	8cb9110906	[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178 ) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178 Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501	2024-10-18 04:58:58 +00:00
Nikita Shulga	a9014d2287	[BE][MPS] Compile without warnings on MacOS15 (#138238 ) By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP` and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238 Approved by: https://github.com/atalman	2024-10-18 04:20:15 +00:00
Xingyuan Li	cc6c248919	[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856 ) [Inductor UT] Generalize Newly introduced inductor UTs for intel GPU reuse `test/inductor/test_inductor_freezing.py` reuse `test/inductor/test_layout_optim.py` reuse `test/inductor/test_loop_ordering.py` reuse `test/inductor/test_memory_planning.py` reuse `test/inductor/test_padding.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856 Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel	2024-10-18 03:58:00 +00:00
Nikita Lutsenko	c3cd9939fc	aten \| Deduplicate and silence set but unused variable warning. (#138270 ) Summary: Turns out we have two functions called slightly differently but they do exactly the same thing. Also silence the warning if the message is stripped out. Test Plan: Sandcastle, no behavior change. Differential Revision: D64566719 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138270 Approved by: https://github.com/boguscoder, https://github.com/cyyever	2024-10-18 03:09:46 +00:00
William Wen	73a153b931	[dynamo] add compiler.set_stance raw function call test and doc example (#138276 ) Followup to https://github.com/pytorch/pytorch/pull/137504#issuecomment-2420107198 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138276 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-10-18 02:54:22 +00:00
Animesh Jain	8b426d80dc	[hops][refactor] Refactor the aliasing/mutation detection functions (#138234 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138234 Approved by: https://github.com/ydwu4 ghstack dependencies: #138231	2024-10-18 02:35:00 +00:00
Animesh Jain	e714ebf664	[dynamo][testing] Update AOTEagerandRecordGraphs backend (#138231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138231 Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/aakhundov	2024-10-18 02:35:00 +00:00
Matt Pitkin	8a5dd7f59b	Allow SequentialLR to include ChainedScheduler (#133450 ) This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450 Approved by: https://github.com/janeyx99	2024-10-18 02:29:38 +00:00
Yu, Guangye	8cda774a03	Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773 ) # Motivation Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-10-18 02:28:08 +00:00
Jerry Zhang	6d8c9be54b	[reland] Add int1 to int7 dtypes (#137928 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases for weight quantization Test Plan: python test/test_quantization.py -k test_uint4_int4_dtype Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D64344944](https://our.internmc.facebook.com/intern/diff/D64344944) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137928 Approved by: https://github.com/malfet	2024-10-18 02:02:08 +00:00
Mengwei Liu	7365a57dc0	[BC] Add check for core ATen opset schema BC (#137664 ) Summary: Based on core ATen opset BC policy: https://dev-discuss.pytorch.org/t/core-aten-opset-backward-forward-compatibility-policy/1772 Enforcing this policy in `check_forward_backward_compatibility.py`. Basically the script will error out if any BC breaking schema changes occur to core ATen operators. Test Plan: Run `python test/forward_backward_compatibility/dump_all_function_schemas.py --filename nightly_schemas.txt` Manually added an argument to `nightly_schemas.txt`, `convolution` schema, see the following error: ``` [WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:329] Can NOT find backward compatible schemas after changes for schema aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor from the following candidates: [ aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups) -> Tensor aten::convolution.out(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, *, Tensor(a!) out) -> Tensor(a!) ]. Please contact PyTorch team to confirm if this BC breaking change is safe or not. ... [WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:342] The PR is introducing backward incompatible changes to core ATen operators. Please contact PyTorch team to confirm whether this change is wanted or not. Broken ops: [ aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor ] ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137664 Approved by: https://github.com/albanD	2024-10-18 01:58:33 +00:00
Shuqiang Zhang	21a9c06ca9	[c10d] differentiate timeout errors from nccl errors (#138240 ) Summary: Our watchdog does not clearly differentiate timeouts from NCCL errors in terms of either logs or code paths. It's important for c10d to differentiate the reasons for watchdog failures, e.g., timeouts vs. NCCL errors, and possibly to let users handle the errors differently depending on the type of error. Test Plan: UT Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240 Approved by: https://github.com/Skylion007	2024-10-18 01:36:32 +00:00
Pian Pawakapan	95f869c3d7	[pytorch_operator_stats] log if using torchscript runtime (#137986 ) Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()` Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner Reviewed By: zou3519 Differential Revision: D64200781 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986 Approved by: https://github.com/angelayi	2024-10-18 00:55:22 +00:00
FFFrog	ad28565ed7	Use C++17 Convention Methods in PyTorch (#137958 ) Detailed Descriptions: - `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>` - `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>` - and so on Pull Request resolved: https://github.com/pytorch/pytorch/pull/137958 Approved by: https://github.com/janeyx99	2024-10-18 00:52:51 +00:00
Nikita Lutsenko	b7cf8fb800	c10 \| Silence 'deprecated-dynamic-exception-spec' warning when importing cxxabi. (#138219 ) Summary: cxxabi header specifically from llvm violates this, ignore the warning when including it. Test Plan: No runtime behavior change, sandcastle only Differential Revision: D64540217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138219 Approved by: https://github.com/boguscoder	2024-10-18 00:42:45 +00:00
Will Feng	2f91d7c63f	[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113 ) Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2. In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager". Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113 Approved by: https://github.com/xmfan	2024-10-18 00:13:00 +00:00
Sahan Paliskara	6d473e0dda	[autolint] move to use a label (#138263 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138263 Approved by: https://github.com/huydhn	2024-10-18 00:12:52 +00:00
Nikita Shulga	a3172809a1	[EZ] Fix typo in Normalization.mm (#138283 ) Introduced by `6b76a21ebd` One likely has to wait for 125 years to MacOS-150 release :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138283 Approved by: https://github.com/kit1980	2024-10-18 00:01:21 +00:00
Xiaodong Wang	b14c9b7250	[AMD] Hipify torchaudio_decoder (#138181 ) Summary: X-link: https://github.com/pytorch/audio/pull/3843 Continue to hipify more torchaudio targets. Test Plan: CI buck build mode/opt-amd-gpu pytorch/audio/src/... Differential Revision: D64298970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138181 Approved by: https://github.com/houseroad	2024-10-17 23:37:37 +00:00
Will Feng	0ecafda602	Fix CompiledDDP failure when the gradient is not contiguous (#138174 ) Summary: As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174 Approved by: https://github.com/yf225, https://github.com/kwen2501 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-17 23:08:24 +00:00
Benjamin Glass	2fc6c32b4c	Ensure version file is regenerated at change (#138237 ) Fixes observed error where `version.py` would not be regenerated by CMake without deleting the file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138237 Approved by: https://github.com/Skylion007	2024-10-17 22:46:05 +00:00
Xinya Zhang	770fcaf2ab	Fix the Rank of logsumexp Tensor and mGPU support. (#137717 ) The logsumexp tensor was considered for internal use only but apparently exposed to unit tests and inductors. The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture. Fixes #131316 #137414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717 Approved by: https://github.com/drisspg Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>	2024-10-17 21:58:14 +00:00
Tom Ritchford	9f81270d75	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-17 21:27:35 +00:00
albanD	69ba89da11	Fix cuda sanitizer and as_subclass calls (#138218 ) This fixes 4 main issues: - The way the cuda sanitizer handles its state is weird. In particular, because the lifetime of the Mode is linked to the submodule, this might outlive the python runtime and other modules loaded. On my current version, this even outlives the "sys" module. Given that I'm not sure of the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying and there is thus no point in cleaning up. - Adds a "disable" method to be able to test after the mode is enabled. - Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it. - ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is not an obvious bugfix and is expected behavior. Reverted that change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218 Approved by: https://github.com/ngimel	2024-10-17 21:18:32 +00:00
Edward Yang	b14269dcfb	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) (#138155 ) Summary: - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Original pull request: https://github.com/pytorch/pytorch/pull/136519 Test Plan: contbuild & OSS CI, see `4a8e49389c` Reviewed By: malfet Differential Revision: D64471142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155 Approved by: https://github.com/malfet, https://github.com/bobrenjc93	2024-10-17 20:58:56 +00:00
eellison	7a117f3b3e	Dont decompose aten.baddmm in inductor (#137904 ) Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager which would run in fp16. We also tried keeping the bmm in fp16, and the upcasting for the epilogue but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator. Fix for https://github.com/pytorch/pytorch/issues/137897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904 Approved by: https://github.com/ngimel	2024-10-17 19:24:54 +00:00
Jane Xu	54839781ed	Update lint failure msg to encourage lintrunner -a locally (#138232 ) This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click the "approve all workflows" just for lint to fail again cuz the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a swirl once to start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232 Approved by: https://github.com/kit1980, https://github.com/Skylion007	2024-10-17 19:13:55 +00:00
Shivam Raikundalia	dfb5ac05cc	[Record Function] Add Kwargs only USER_SCOPE Macro (#138020 ) Summary: Add a macro such that users can easily add a USER annotation with kwargs only Test Plan: Will use D63801503 to test this E2E. Added unit test as well that makes sure that the kwargs get recorded correctly Differential Revision: D64420328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138020 Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi	2024-10-17 18:48:49 +00:00
Will Feng	0c76c68d7d	[tlparse][AOTAutograd] Rename to aot_inference_graph in tlparse output (#137803 ) Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which causes it to not be easily differentiable from normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph" which makes it easier to tell which tlparse graph block is from Compiled Autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803 Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang	2024-10-17 18:44:37 +00:00
zeshengzong	d531bd509e	[Docs] Fix description in `torch.save` docs to show default for pickle_protocol instead of variable name (#138153 ) Fixes #138013 Replace `DEFAULT_PROTOCOL` with actual default value `2` in `torch.save` method document Before ![image](https://github.com/user-attachments/assets/cdd77d14-c009-4848-8538-9256bf22c32a) After ![image](https://github.com/user-attachments/assets/f6b1063d-c955-478a-8d42-702b988426aa) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138153 Approved by: https://github.com/mikaylagawarecki	2024-10-17 18:13:05 +00:00
Richard Barnes	8abbd1c7c7	Modernize C10_NODISCARD to [[nodiscard]] (#138151 ) PyTorch is C++17 now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138151 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-10-17 18:07:39 +00:00
chilli	6752e7dc3e	Moved some of Inductor IR nodes to be frozen (#137859 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137859 Approved by: https://github.com/ezyang	2024-10-17 18:04:45 +00:00
Michael Lazos	0b2c12cb4d	Support more foreach ops for tensor beta support (#134170 ) Add more foreach ops so we don't have fallbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170 Approved by: https://github.com/eellison	2024-10-17 17:51:31 +00:00
William Wen	92fdea8a39	remove skips due to https://github.com/pytorch/torchdynamo/issues/1991 (#138133 ) Closes https://github.com/pytorch/pytorch/issues/93479. A bunch of other dynamo-wrapped tests also exhibit "torch.* returned non-Tensor output unimplemented" making the issue seem less relevant to me. Some tests are marked as xfail as they fail for other reasons. If these tests are indeed important, we should create a new issue to track them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138133 Approved by: https://github.com/ezyang	2024-10-17 17:42:46 +00:00
Scott Wolchok	6b76a21ebd	[PyTorch] Fix incorrect macOS 15.0 gating in MPS backend (#138022 ) The ifdef as written just checks if the macOS 15.0-capable SDK is being used. You also need a runtime gate to make sure macOS 15 is in use. Differential Revision: [D64429453](https://our.internmc.facebook.com/intern/diff/D64429453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138022 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137722, #138014	2024-10-17 17:35:34 +00:00
PyTorch MergeBot	d2a6c73235	Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178 )" This reverts commit 20af56d4359c3f5fed2e8f94e111a8502f2ebeb3. Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new tests are failing inductor distributed jobs ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2420109501))	2024-10-17 17:32:06 +00:00
Tugsbayasgalan Manlaibaatar	2a50d77823	Move test_experimental.py to training IR (#138140 ) Differential Revision: [D64510938](https://our.internmc.facebook.com/intern/diff/D64510938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138140 Approved by: https://github.com/avikchaudhuri	2024-10-17 17:30:10 +00:00
Joel Schlosser	ecc5e05854	Refactor NJT min / max seqlen handling for convenience (#138130 ) There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoid redundant checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130 Approved by: https://github.com/soulitzer	2024-10-17 17:28:39 +00:00
PyTorch MergeBot	66478d0cf7	Revert "[compiled autograd] directly use python Logger class in cpp (#137953 )" This reverts commit af916613687d3bcc1d15362ba2fdf9312378c500. Reverted https://github.com/pytorch/pytorch/pull/137953 on behalf of https://github.com/clee2000 due to breaking builds internally D64479234, I think it makes the build size of a package too large? The logs link to a wiki with instructions of what to do ([comment](https://github.com/pytorch/pytorch/pull/137953#issuecomment-2420086928))	2024-10-17 17:19:36 +00:00
PyTorch MergeBot	3b0f3059f6	Revert "[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113 )" This reverts commit ebe37b23f11e150cd3afa5464193ee036e15277f. Reverted https://github.com/pytorch/pytorch/pull/138113 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert https://github.com/pytorch/pytorch/pull/137953, please rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/138113#issuecomment-2420079703))	2024-10-17 17:16:44 +00:00
PyTorch MergeBot	375dcb960f	Revert "Avoid some dangling reference warnings (#132535 )" This reverts commit f3d7a02716d8725dcedff86094bd7e20f73155f1. Reverted https://github.com/pytorch/pytorch/pull/132535 on behalf of https://github.com/clee2000 due to broke some internal builds D64479234 ([comment](https://github.com/pytorch/pytorch/pull/132535#issuecomment-2419983509))	2024-10-17 16:23:36 +00:00
Shangdi Yu	348f208504	Autocast re-tracibility (#138082 ) Summary: Support autocast re-tracing by giving it the same treatment as set_grad. In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast ``` Differential Revision: D63856081 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082 Approved by: https://github.com/ydwu4	2024-10-17 16:09:11 +00:00
Yidi Wu	3087b5e431	[cond] support lifted symint inputs in subgraph (#137519 ) As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137519 Approved by: https://github.com/eellison	2024-10-17 16:09:06 +00:00
Zhuoran Zhao	2414c3f534	AOTI fixes for MI300 lowering (#137939 ) Summary: 1) Add sleef back to enable SIMD on AMD 2) adding kpack to triton compute_meta for AMD triton, since there will be user-defined triton kernels using this for k-dim packing Test Plan: ``` HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1 --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 \| tee local_benchmark_log.txt ``` Differential Revision: D64262924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939 Approved by: https://github.com/frank-wei	2024-10-17 16:09:04 +00:00
Sungmin Cho	502c6183e0	Prevent tuple instances from being weak-referenced. (#137838 ) Summary: Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__` ``` if hasattr(guarded_object.__class__, "__weakref__") and not isinstance( guarded_object, enum.Enum ): obj_ref = weakref.ref(guarded_object) ``` However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh : ``` class QuantizedWeights: # TODO: Ugly trick so torch allows us to replace parameters # with our custom weights. Do this properly. @property def __class__(self) -> Type[nn.parameter.Parameter]: return nn.Parameter @property def grad_fn(self) -> None: return None ``` For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, does not actually have a `__weakref__` attribute, but its "class" will say it does. I think the easiest change is to use instance-level checking rather than class-level ``` if hasattr(guarded_object, "__weakref__") ... ``` But I'm wondering if this will harm any of the existing behaviors. I'd appreciate reviews from the experts (I just added all recommended reviewers since I'm not sure who is the best person to consult...) Test Plan: CI? Reviewed By: YJYJLee Differential Revision: D64140537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-10-17 16:08:32 +00:00
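The difference between the class-level and instance-level checks discussed in #137838 can be reproduced without PyTorch. The following is a minimal sketch, not the code from the PR: `Pretender`, `TupleBacked`, and the three helper functions are hypothetical names introduced here for illustration. It relies on the CPython behavior that tuple subclasses do not support weak references.

```python
import weakref


class Pretender:
    """Ordinary class (hypothetical); its instances support weak references."""


class TupleBacked(tuple):
    """Hypothetical tuple subclass that reports another class via ``__class__``."""

    @property
    def __class__(self):
        # Same trick as in the commit message: pretend to be a different class.
        return Pretender


def class_level_check(obj) -> bool:
    # Shape of the check quoted in the commit message: inspects obj.__class__,
    # which here is the pretended class, not the real type.
    return hasattr(obj.__class__, "__weakref__")


def instance_level_check(obj) -> bool:
    # Shape of the proposed check: attribute lookup on the instance uses the
    # real type, so the pretended class has no influence.
    return hasattr(obj, "__weakref__")


def can_weakref(obj) -> bool:
    # Ground truth: try to create the weak reference.
    try:
        weakref.ref(obj)
        return True
    except TypeError:
        return False


obj = TupleBacked((1, 2))
print(class_level_check(obj), instance_level_check(obj), can_weakref(obj))
```

In this sketch only the instance-level check agrees with what `weakref.ref` actually accepts, which is the motivation given in the commit message.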
Laith Sakka	7e16c9d5f2	include bw_compiler in strobelight profile (#138060 ) Summary: title + tlparse will have the phase name. Test Plan: {F1933087525} Differential Revision: D64450315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138060 Approved by: https://github.com/ezyang	2024-10-17 16:08:28 +00:00
Will Feng	20af56d435	[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178 ) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178 Approved by: https://github.com/xmfan	2024-10-17 10:51:07 +00:00
CaoE	8cfe28e4e3	[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514 ) It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-10-17 09:06:57 +00:00
Tom Ritchford	47077bfcb5	Remove an unused variable in _subclasses.fake_tensor (#138086 ) ---- * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138086 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-10-17 09:05:25 +00:00
Laith Sakka	ba10259115	Increase default COMPILE_STROBELIGHT_MAX_STACK_LENGTH to 500 (#138006 ) Summary: pt2 call stacks are long, this reduces truncated stack <img width="1363" alt="Screenshot 2024-10-15 at 11 35 11 AM" src="https://github.com/user-attachments/assets/d09a8fb5-eafc-4440-ab58-464889dc6df8"> <img width="1373" alt="Screenshot 2024-10-15 at 11 35 26 AM" src="https://github.com/user-attachments/assets/c4c9c245-54d1-4e35-b16f-029ece335e03"> Differential Revision: D64414746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138006 Approved by: https://github.com/bobrenjc93	2024-10-17 07:31:32 +00:00
William Wen	5b7f4767ff	Fix https://github.com/pytorch/pytorch/issues/138062 (#138137 ) Fixes https://github.com/pytorch/pytorch/issues/138062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138137 Approved by: https://github.com/mlazos	2024-10-17 07:12:15 +00:00
Tugsbayasgalan Manlaibaatar	f3c3f3a3c3	Fix assigning tensor with requires_grad as constant in export (#137997 ) When we insert constants into the unlifted graph, we need to detach them if they require grad, BUT when we detach we need to preserve the original aliasing information. Differential Revision: [D64406859](https://our.internmc.facebook.com/intern/diff/D64406859/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137997 Approved by: https://github.com/avikchaudhuri	2024-10-17 06:41:10 +00:00
Edward Z. Yang	38d9924bfc	Disable lint suggestions on my PRs (#138054 ) The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC	2024-10-17 05:28:37 +00:00
cyy	af8bd323e8	Remove legacy Caffe2 pthreadpool from CMake (#134936 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936 Approved by: https://github.com/ezyang	2024-10-17 05:22:08 +00:00
Josh Fromm	9c084cccfd	[Pytorch][ATEN] Enable FP8 concatenate (#138046 ) Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward since memory operations don't vary based on the input dtype, but can be quite helpful for FP8 based models. Test Plan: ``` buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices ``` Differential Revision: D64443965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046 Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh	2024-10-17 04:58:54 +00:00
Jing Xu	ebd60f4074	update CMAKE_PREFIX_PATH setting command (#134934 ) Current setting command of the `CMAKE_PREFIX_PATH` environment variable will overwrite values if it had already been set with some values. Changing it to `:` appends the conda env search path to its values to avoid library not found issues. `export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934 Approved by: https://github.com/malfet, https://github.com/EikanWang	2024-10-17 04:19:18 +00:00
Edward Z. Yang	7db1f0b7b5	Minor assert error message improvement (#138053 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138053 Approved by: https://github.com/Skylion007	2024-10-17 03:54:15 +00:00
Will Feng	ebe37b23f1	[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113 ) Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2. In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager". Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113 Approved by: https://github.com/xmfan ghstack dependencies: #138105	2024-10-17 03:45:10 +00:00
Bin Bao	fe43f72be7	[AOTI] Remove the non-ABI-compatible mode (part 2) (#138047 ) Summary: Continue to clean up non-ABI-compatible mode related code. Differential Revision: [D64444327](https://our.internmc.facebook.com/intern/diff/D64444327) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138047 Approved by: https://github.com/chenyang78 ghstack dependencies: #137982, #138016, #138009	2024-10-17 02:54:24 +00:00
Bin Bao	2e67d7cc35	[AOTI] Remove the non-ABI-compatible mode (part 1) (#138009 ) Summary: The ABI-compatible mode has been turned on as default in https://github.com/pytorch/pytorch/pull/136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic. Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138009 Approved by: https://github.com/chenyang78 ghstack dependencies: #137982, #138016	2024-10-17 02:48:26 +00:00
Nikita Shulga	7711f00553	[BE] Delete unused operator!= from the test (#138122 ) If method is unused, why not delete it altogether? Pull Request resolved: https://github.com/pytorch/pytorch/pull/138122 Approved by: https://github.com/swolchok	2024-10-17 02:24:48 +00:00
Joel Schlosser	906fe05895	Naive impls for NJT matmul (#138121 ) Our matmul support is abysmal - let's at least get this working and do it performantly later. Bonus: implements `bmm` as well. jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :( Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121 Approved by: https://github.com/cpuhrsch	2024-10-17 01:31:46 +00:00
zeshengzong	b4f7f4bf49	[Docs] Optimize parameter description to declare allowed type (1/N) (#137956 ) Inspired by issues #137422 and #103847. Optimize method parameter types in docs to give users a clearer idea of what they are expected to pass to methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137956 Approved by: https://github.com/albanD	2024-10-17 01:19:55 +00:00
Yifu Wang	c69f4518ec	[SymmetricMemory] fix a race condition in _pipelined_produce_and_all2all that can cause correctness issues for very small `chunk_producer`s (#138126 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138126 Approved by: https://github.com/lessw2020	2024-10-17 01:05:41 +00:00
Benjamin Glass	69e125a7e9	AOTInductor: fixup test (follow-up to #137401 ) (#137692 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137692 Approved by: https://github.com/desertfire	2024-10-17 00:40:21 +00:00
Jane Xu	94537e70b5	Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100 Approved by: https://github.com/Skylion007	2024-10-17 00:34:56 +00:00
Will Feng	504904c9c6	[Traceable FSDP2] Add compiled_autograd_enabled helper function (#138105 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138105 Approved by: https://github.com/awgu, https://github.com/xmfan	2024-10-17 00:04:06 +00:00
Avik Chaudhuri	0e9708f907	tensor constant with wrapped method (#138091 ) Summary: Tensor constants can show up through wrapped methods, so that they may not always be found in constant attributes. They need to be fakified and their meta vals need to be found to create graph signatures nevertheless. Otherwise non-strict barfs. Longer term maybe we should pull this fakification up in non-strict. Test Plan: added test Differential Revision: D64480272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091 Approved by: https://github.com/tugsbayasgalan	2024-10-17 00:00:04 +00:00
PyTorch MergeBot	4b3035f2fe	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))	2024-10-16 23:18:53 +00:00
PyTorch MergeBot	5254a0d383	Revert "Dont decompose aten.baddmm in inductor (#137904 )" This reverts commit cef6c3dcb07aafe25d62427e55442a46d7af3500. Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to failing internal tests D64418200, some results not within tolerance? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2418122735))	2024-10-16 23:16:44 +00:00
Brian Hirsh	ea2726452a	add myself as codeowner in aot_autograd (#138075 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138075 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #136670	2024-10-16 22:41:39 +00:00
Brian Hirsh	a682194a11	inductor: use previous guards to know if a size is 1 for broadcasting (#136670 ) Fixes https://github.com/pytorch/pytorch/issues/136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. 
Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670 Approved by: https://github.com/ezyang	2024-10-16 22:41:39 +00:00
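The guard lookup described in this commit message, together with the caching idea from its last paragraph, can be sketched as follows. This is a simplified model and not Inductor's code: guards are represented as hypothetical `(op, lhs, rhs)` tuples with string expressions, whereas the real shape env holds sympy expressions.

```python
def size_is_one(size_expr, guards):
    """Linear scan: True if some guard asserts Eq(size_expr, 1)."""
    return any(
        op == "Eq" and lhs == size_expr and rhs == 1
        for op, lhs, rhs in guards
    )


def build_eq_one_index(guards):
    """Cached variant: collect every LHS that is guarded to equal 1."""
    return {lhs for op, lhs, rhs in guards if op == "Eq" and rhs == 1}


# Hypothetical guards, written as (operator, left-hand side, right-hand side).
guards = [("Eq", "s0 // s1", 1), ("Ne", "s2", 1), ("Eq", "s3", 4)]
eq_one = build_eq_one_index(guards)
```

With the index built once, each later query is a set membership test instead of a pass over all guards.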
Tom Ritchford	56379e2c17	Remove an unused variable in _subclasses.fake_impls (#138085 ) * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138085 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-10-16 22:41:04 +00:00
Yidi Wu	0bfa1bf21d	[scan] support closure (#135602 ) This PR adds an additional_inputs argument to support closures similar to what we've done for while_loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135602 Approved by: https://github.com/zou3519 ghstack dependencies: #135600, #135601	2024-10-16 22:28:03 +00:00
Yidi Wu	819d6b139c	[scan] flatten subgraph output and make subgraph inputs to be a slice (#135601 ) This PR introduces two changes: 1. Before this PR, the subgraph's output is ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators. 2. Before this PR, the combine_fn of scan takes a sliced input but keeps the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0, the combine_fn looks like: ``` # x.shape = (1, 4, 5) instead of (4, 5) def combine_fn(carry, x): ... ``` In this PR, we fix this and also simplify some of the slicing logic. 3. This diff also makes sure we always stack ys on the first dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601 Approved by: https://github.com/zou3519 ghstack dependencies: #135600	2024-10-16 22:28:03 +00:00
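The slicing semantics in item 2 and the stacking in item 3 can be illustrated with a pure-Python scan over nested lists. This is a sketch of the idea only, not the `scan` operator from the PR; `scan` and `running_sum` here are hypothetical stand-ins.

```python
def scan(combine_fn, init, xs):
    """Fold combine_fn over the first dimension of xs, collecting outputs."""
    carry, ys = init, []
    for x in xs:
        # x is xs[i], not xs[i:i+1]: the scanned dimension is dropped.
        carry, y = combine_fn(carry, x)
        # Per-step outputs are stacked along the first dimension.
        ys.append(y)
    return carry, ys


def running_sum(carry, x):
    """Elementwise running sum; returns (new carry, per-step output)."""
    new_carry = [c + v for c, v in zip(carry, x)]
    return new_carry, new_carry


xs = [[1, 2], [3, 4], [5, 6]]  # "shape" (3, 2), scanned over dim 0
final, ys = scan(running_sum, [0, 0], xs)
```

Each call to `running_sum` receives a row of "shape" `(2,)` rather than `(1, 2)`, and `ys` has one entry per step along its first dimension.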
Yidi Wu	0437a22d43	[scan] fix typo in signature and remove wrapper (#135600 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135600 Approved by: https://github.com/zou3519	2024-10-16 22:27:59 +00:00
Bin Bao	443472b1ca	[AOTI] Remove explicit abi_compatible setting in tests (#138016 ) Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138016 Approved by: https://github.com/malfet ghstack dependencies: #137982	2024-10-16 21:35:46 +00:00
Bin Bao	6bc57549f9	[AOTI] Remove non-ABI-compatible tests (#137982 ) Summary: Remove non-ABI-compatible mode tests since ABI-compatible has been turned on as default. Also clean up tests that explicitly set ABI-compatible to True. Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982 Approved by: https://github.com/malfet	2024-10-16 21:35:46 +00:00
homorunner	a040c4a260	Use std::move on stringstream to prevent unnecessary copy. (#138065 ) - Takes advantage of C++20's improved handling of move semantics for std::basic_stringbuf. - Reduces unnecessary copying and improves memory efficiency, especially for long formatted strings. Benchmark(proof of concept): https://quick-bench.com/q/qohAu0ARH3vSDyKVsoKEfXOO6BI Pull Request resolved: https://github.com/pytorch/pytorch/pull/138065 Approved by: https://github.com/Skylion007	2024-10-16 21:35:10 +00:00
fduwjj	b72ff35f22	[c10d][ez] Add more inline comments to CUDAEventCache code (#138079 ) Address @kwen2501 's feedback in https://github.com/pytorch/pytorch/pull/138048, add more inline comments to the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138079 Approved by: https://github.com/kwen2501 ghstack dependencies: #138040, #138048, #138059	2024-10-16 20:43:28 +00:00
Shangdi Yu	f2c96f5d87	Add AOTI test (#138043 ) Summary: add back the test that's removed in D63916320. It should work now as D64361273 added back the workspace change. Test Plan: CI Differential Revision: D64442054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138043 Approved by: https://github.com/ColinPeppler, https://github.com/desertfire	2024-10-16 20:41:07 +00:00
Chirag Pandya	f95ddf0b31	[c10d] record world size in log (#138044 ) Summary: Record the world size in log and scuba table. This helps us quickly figure out if there are missing flight recorder files from ranks. Test Plan: Ran locally and noted that size was logged to scuba Differential Revision: D64442949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044 Approved by: https://github.com/Skylion007	2024-10-16 20:14:02 +00:00
PyTorch MergeBot	24ee4af86b	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))	2024-10-16 20:05:38 +00:00
Henry Tsang	a0a978ce23	[aoti config] add raise_error_on_ignored_optimization (#138035 ) Summary: Unfortunately this means adding another config. Test Plan: ci Differential Revision: D64437699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138035 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-10-16 18:38:47 +00:00
angelayi	f1c741dbe9	Fixes GuardOnDataDependentSymNode error in masked_fill (#137060 ) Fixes [P1621441513](https://www.internalfb.com/phabricator/paste/view/P1621441513) ([ref to internal post](https://fb.workplace.com/groups/6829516587176185/posts/1051474609896021/?comment_id=1055262166183932&reply_comment_id=1056583932718422)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137060 Approved by: https://github.com/ezyang	2024-10-16 18:16:33 +00:00
Catherine Lee	f173623bb2	[td] try catch exception, do not run td if not results (#138087 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138087 Approved by: https://github.com/wdvr	2024-10-16 18:04:25 +00:00
Li Yu (ads)	dabe2a3c3b	[Torch] Support meta device in random.fork_rng (#137715 ) Summary: ## Why random.fork_rng doesn't support meta device: ``` [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size [rank0]: losses.sum().backward() [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward [rank0]: return handle_torch_function( [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function [rank0]: result = mode.__torch_function__(public_api, types, args, kwargs) [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__ [rank0]: return func(args, kwargs) [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward [rank0]: torch.autograd.backward( [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward [rank0]: _engine_run_backward( [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 
825, in _engine_run_backward [rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook [rank0]: frame.recompute_fn(args) [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn [rank0]: with torch.random.fork_rng( [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__ [rank0]: return next(self.gen) [rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng [rank0]: raise RuntimeError( [rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`. ``` This blocks us from running backward() on model with checkpoint enabled in meta mode. ## What This diff handles the case of meta device in random.fork_rng. Test Plan: Tested with toy model which has checkpoint on its module: P1641201046 Differential Revision: D64161410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715 Approved by: https://github.com/kit1980	2024-10-16 18:00:39 +00:00
Shangdi Yu	a47bb4a393	Fix autocast for non-strict export (#137495 ) Summary: Add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad nodes into a HOP, but we should still have the set_grad_enabled/autocast nodes. Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we used `strict=False`. - In autocast's enter and exit function, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`. If we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806). ``` - torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function - torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__ - torch.amp._enter_autocast()'s torch.overrides.handle_torch_function - PreDispatchTorchFunctionMode.__torch_function__ ``` - In `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes. - To match the strict mode behavior, we let the input node to the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_set_grad ``` Differential Revision: D64016023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495 Approved by: https://github.com/bdhirsh	2024-10-16 17:39:00 +00:00
Zheng, Zhaoqiong	7ba706c74e	update get start xpu (#137479 ) 1. respect the comment from the community, downgrade the "Beta" to "Prototype" for the first xpu release with wheel 2. add wheels installation of torchaudio & torchvision for nightly on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/137479 Approved by: https://github.com/atalman, https://github.com/malfet	2024-10-16 17:36:29 +00:00
fduwjj	7e704c2073	[c10d] Add unit test for CUDAEventCache to ensure caching is working (#138059 ) We created a simple test to validate the cache is indeed working and when the cache is indeed used up. I revert the fix in (https://github.com/pytorch/pytorch/pull/138040) and the test indeed failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138059 Approved by: https://github.com/kwen2501 ghstack dependencies: #138040, #138048	2024-10-16 17:34:57 +00:00
PyTorch MergeBot	dd32a32cb6	Revert "Expose option to disable CRC-32 computation during `torch.save` (#137735 )" This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819. Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))	2024-10-16 17:03:06 +00:00
Ke Wen	2b7c7a20b9	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 16:42:57 +00:00
Tugsbayasgalan Manlaibaatar	0a6c40faba	Fix constant returning (#137993 ) When the constants are used twice in the exported graph (second one is returned as output), the lifting constant pass doesn't account for the second one being the output. This PR fixes that. Differential Revision: [D64406108](https://our.internmc.facebook.com/intern/diff/D64406108/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137993 Approved by: https://github.com/avikchaudhuri	2024-10-16 16:42:09 +00:00
Scott Wolchok	189c95457d	[PyTorch] Don't hardcode 4 * Vec::size() in vectorized_reduction (#138014 ) This will break once we support 128-bit vectors, and there's no reason to do it. Differential Revision: [D64421982](https://our.internmc.facebook.com/intern/diff/D64421982/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138014 Approved by: https://github.com/malfet, https://github.com/Skylion007 ghstack dependencies: #137722	2024-10-16 16:41:59 +00:00
Scott Wolchok	a12c859b00	[PyTorch] Check `defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256)` instead of `defined(CPU_CAPABILITY_NEON)` (#137722 ) The CPU_CAPABILITY system is for rebuilding kernels multiple times with different vector ISA targets. CPU_CAPABILITY_NEON was not being used for that, just as an extra flag for inductor. As a result, CPU_CAPABILITY_NEON-gated code was unnecessarily unavailable outside inductor. Fixes #137704 Differential Revision: [D64197046](https://our.internmc.facebook.com/intern/diff/D64197046/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137722 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-10-16 16:41:59 +00:00
PyTorch MergeBot	361f42bc42	Revert "[compiled autograd] Compiled autograd configs in TLS (#137821 )" This reverts commit 9aba0b91c8df4a15654f9ccc02abca31bdd81650. Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))	2024-10-16 16:38:29 +00:00
Tom Ritchford	af27f7888b	[dynamo] Remove an unused variable in AOTDispatchAutograd (#137989 ) * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137989 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-10-16 16:37:19 +00:00
Nikita Shulga	753ba5d30a	Move basic dependencies install to requirements-ci (#138024 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138024 Approved by: https://github.com/huydhn ghstack dependencies: #137991, #137992, #138023	2024-10-16 16:21:33 +00:00
William Wen	4c8718d8e7	[dynamo] add torch.compiler.set_stance (#137504 ) Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771. Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly. See added tests for usage examples. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504 Approved by: https://github.com/yf225, https://github.com/jansel	2024-10-16 16:18:25 +00:00
fduwjj	960c3bff98	[c10d] Refactor CUDAEventCache Create to use deque rather than stack (#138048 ) We used a LIFO stack to store the CudaEvent in the cache. We prefer a FIFO deque, so besides improving the readability of the code, we switch to a deque. As @wconstab pointed out, both methods are equally correct because the moment we put the event into the stack/deque, the event is already ready for reuse; this change is mostly a matter of preference and is not trying to fix anything. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048 Approved by: https://github.com/kwen2501 ghstack dependencies: #138040	2024-10-16 14:44:39 +00:00
Tom Ritchford	932ae131fb	Remove an unused variable in _inductor/codegen/simd.py (#138000 ) * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138000 Approved by: https://github.com/Skylion007	2024-10-16 13:54:21 +00:00
Isuru Fernando	f3d7a02716	Avoid some dangling reference warnings (#132535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132535 Approved by: https://github.com/aaronenyeshi	2024-10-16 13:41:12 +00:00
Tom Ritchford	0c63de9755	[dynamo] Remove an unused variable in AutogradFunctionApplyVariable (#137985 ) ---- * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137985 Approved by: https://github.com/zou3519	2024-10-16 13:08:45 +00:00
Tom Ritchford	15722debfb	Remove two unused variables in _functorch/partitioners.py (#137998 ) * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137998 Approved by: https://github.com/Skylion007	2024-10-16 10:58:31 +00:00
Simon Fan	9aba0b91c8	[compiled autograd] Compiled autograd configs in TLS (#137821 ) Multithreaded doesn't work yet, this adds python side TLS only for the python side state Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821 Approved by: https://github.com/jansel, https://github.com/yf225 ghstack dependencies: #137953	2024-10-16 09:28:32 +00:00
Simon Fan	af91661368	[compiled autograd] directly use python Logger class in cpp (#137953 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953 Approved by: https://github.com/jansel, https://github.com/yf225	2024-10-16 09:28:32 +00:00
amathewc	7f88bf96f9	test_execution_trace.py: Use instantiate_device_type_tests to run GPU tests on HPU as well (#133975 ) MOTIVATION We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices. CHANGES - Add support for HPU devices within the payload function. - Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances. - Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU. - Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975 Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi	2024-10-16 07:53:06 +00:00
cyyever	deaf0418b2	[2/N] Fix clang-tidy warnings in torch/csrc/api/ (#136998 ) Follows #134545 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136998 Approved by: https://github.com/ezyang	2024-10-16 07:50:59 +00:00
Shuqiang Zhang	f4158558aa	[c10d] disable watchdog thread in blockingWait mode (#138001 ) Summary: Blocking wait mode is not widely used, though it is probably useful for debugging. In blockingWait mode, we don't need to enable the watchdog thread to check for timeouts or NCCL errors, because the main thread throws an exception if an error happens; it is obvious to the user which work failed, and it is the user's responsibility to handle the exception. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001 Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o ghstack dependencies: #137799	2024-10-16 07:42:22 +00:00
PyTorch MergeBot	78632b97b1	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))	2024-10-16 07:26:34 +00:00
Jason Ansel	7480e6938d	[inductor] Add LoopBody.op_counts (#137945 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137945 Approved by: https://github.com/eellison ghstack dependencies: #137946	2024-10-16 06:35:10 +00:00
Jason Ansel	0d7b2118ed	[inductor] Refactor triton dtype helpers (#137946 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137946 Approved by: https://github.com/eellison	2024-10-16 06:35:10 +00:00
Huy Do	97f7fc1d31	Support retry when building Docker images (#138012 ) Similar to https://github.com/pytorch/test-infra/pull/5759, I'm seeing flaky network error from time to time when building Docker images, for example https://github.com/pytorch/pytorch/actions/runs/11352439248/job/31575206417. So, adding retrying to mitigate this class of flaky failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138012 Approved by: https://github.com/atalman	2024-10-16 06:10:41 +00:00
fduwjj	084657e012	[c10d] Fix data corruption bug after CUDAEventCache is enabled (#138040 ) Here is why using `CUDAEventCache` causes crashes and data corruption. 1. The deleter does its job and appends the event to the stack. 2. In create, instead of getting a reference, we get a copy of eventsArray_[i] (which is a std::vector). This is bad because we never really remove the element from the stack. We thought we had popped the last event off the stack, but it is still there, so we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, it eventually grows until a crash happens. The fix is easy: just get a reference. A local torchtitan run sees a non-NaN loss. We also want to use a deque instead of a stack, and refactor the code a bit to make it more readable (in a separate PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040 Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang	2024-10-16 05:20:29 +00:00
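The copy-versus-reference mistake described above is a C++ issue, but the same failure mode can be reproduced in Python as an analogy. The function names and the string "events" below are hypothetical; only the pattern (popping from a copy of the per-slot container) mirrors the commit message.

```python
def take_event_buggy(events_array, i):
    """Pops from a copy, so the cached container never shrinks."""
    events = list(events_array[i])  # copy of the per-slot stack
    return events.pop() if events else None


def take_event_fixed(events_array, i):
    """Pops from the cached container itself."""
    events = events_array[i]  # reference to the per-slot stack
    return events.pop() if events else None


cache_buggy = [["ev0", "ev1"]]
cache_fixed = [["ev0", "ev1"]]

# The buggy version hands out the same event twice and removes nothing.
first_buggy = take_event_buggy(cache_buggy, 0)
second_buggy = take_event_buggy(cache_buggy, 0)

# The fixed version hands out distinct events and drains the cache.
first_fixed = take_event_fixed(cache_fixed, 0)
second_fixed = take_event_fixed(cache_fixed, 0)
```

In the buggy variant two callers share one event while the cache keeps growing as events are returned to it, which matches the reuse and unbounded-growth symptoms in the message.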
Ke Wen	f43c4d28b8	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 05:03:08 +00:00
Nikita Shulga	60b4858977	[BE][Docker] Don't update scikit-learn (#138023 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138023 Approved by: https://github.com/huydhn ghstack dependencies: #137991, #137992	2024-10-16 05:01:40 +00:00
Nikita Shulga	7f6e85bb93	[BE] Move numpy installation logic to `requirements-ci.txt` (#137992 ) And slightly adjust versioning logic, as current one seems to exist to hide version conflicts: - 1.21.2 for Python-3.9 - 1.24.2 for Python-3.10 (to resolve conflict with numba-0.55.2) - 1.26.2 for Python-3.11 or 3.12 - 2.1.2 for Python-3.13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137992 Approved by: https://github.com/Skylion007, https://github.com/huydhn ghstack dependencies: #137991	2024-10-16 04:30:29 +00:00
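The version table in this commit message can be written as a small lookup. This is an illustrative sketch only: `NUMPY_PINS` and `numpy_pin` are hypothetical names, and the code is not the syntax used in `requirements-ci.txt`; the version numbers are the ones listed in the message.

```python
import sys

# NumPy pin per (major, minor) Python version, as listed in the commit message.
NUMPY_PINS = {
    (3, 9): "1.21.2",
    (3, 10): "1.24.2",  # chosen to resolve the conflict with numba-0.55.2
    (3, 11): "1.26.2",
    (3, 12): "1.26.2",
    (3, 13): "2.1.2",
}


def numpy_pin(version=None):
    """Return the pinned NumPy version for a Python version, or None if unlisted."""
    major_minor = tuple((version or sys.version_info)[:2])
    return NUMPY_PINS.get(major_minor)
```

Calling `numpy_pin()` with no argument looks up the running interpreter; passing a tuple such as `(3, 10)` queries a specific version.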
Nikita Shulga	12f4d91e84	Enable Python-3.13 builds on MacOS (#138037 ) All logic changes happen in builder repo, namely: - `a01e87535b` - `bcd0972459` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138037 Approved by: https://github.com/huydhn ghstack dependencies: #138041	2024-10-16 04:24:12 +00:00
Yu, Guangye	66b39fd474	refactor KERNEL_MPS via reusing KERNEL (#137831 ) # Motivation Reuse `KERNEL` to simplify `KERNEL_MPS` for mps autocast code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137831 Approved by: https://github.com/malfet	2024-10-16 03:54:13 +00:00
Yu, Guangye	2c94c54f10	Export XPU libs to be public (#136974 ) # Motivation Export XPU-related libs to be public. Now they are included in `TORCH_LIBRARIES` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136974 Approved by: https://github.com/EikanWang, https://github.com/malfet	2024-10-16 03:41:01 +00:00
Yifu Wang	80f3ee41dc	[SymmetricMemory] fix incorrect numel calculations that are using int as std::accumulate's accumulator (#138038 ) Fixes https://github.com/pytorch/pytorch/pull/137567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138038 Approved by: https://github.com/weifengpy	2024-10-16 03:34:26 +00:00
Howard Huang	75109682b6	[Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783 ) NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`, let me know if there are any concerns. `ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors the implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have both schedules with similar names. This also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783 Approved by: https://github.com/wconstab	2024-10-16 03:05:14 +00:00
Adnan Akhundov	809ff3b274	Add host-side Triton TMA support to Dynamo (#137677 ) This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes: - Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024. - To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created. - The newly introduced variables have `reconstruct` methods used in case of graph breaks. - The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like. - In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors. - In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required. - JIT Inductor and AOT Inductor support will be implemented in follow-up PRs. 
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677 Approved by: https://github.com/zou3519	2024-10-16 02:18:48 +00:00
Nikita Shulga	dd2ae7d0c9	[BE] Use `x in [foo, bar]` (#138041 ) As shorthand for `x == foo or x == bar` And `x not in [foo, bar]` as shorthand for `x != foo and x != bar` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138041 Approved by: https://github.com/huydhn	2024-10-16 01:57:37 +00:00
Simon Fan	64ccebd2e0	update labeler for module: compiled autograd (#137954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137954 Approved by: https://github.com/yf225	2024-10-16 01:56:21 +00:00
Nichols A. Romero	aa28062169	[ROCm] TunableOp more unit test follow-up - Part 2 (#134517 ) More unit tests to cover TunableOp functionality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517 Approved by: https://github.com/jeffdaily	2024-10-16 01:49:47 +00:00
zeshengzong	7fa7333299	[Distributed][Test] Fix todo in distributed test files (#136836 ) Refactor distributed test code: - Fix TODO: (rohan-varma): remove model - Fix TODO: add comments for TestTraverse - Migrate deprecated method call `load_state_dict` and `save_state_dict` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136836 Approved by: https://github.com/kwen2501	2024-10-16 01:15:12 +00:00
Shuqiang Zhang	a1b22e369b	[c10d] add an API to get the future result(success or failure) of a collective and customize error handling (#137799 ) Summary: This PR lets users know exactly which collective call from the Python thread is failing, and lets them customize their own error handling function, instead of having the watchdog thread crash everything. This is potentially very useful in fault-tolerant training, in which we can have in-process restart. E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms, go back to the previous checkpointed step and try again, instead of crashing the whole job. This allows users to check the status of each collective call, using the ivalue::future libs in PT core. It also allows users to attach their customized failure handling functions by: work.get_future_result().then(error_handling_func) Note that the above call is also non-blocking for the CPU thread. Test Plan: Added a new test: test_get_future_result to verify the workResult is correctly propagated to the users Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799 Approved by: https://github.com/fduwjj, https://github.com/wconstab	2024-10-16 00:20:09 +00:00
Nikita Lutsenko	8d9c9727c0	aten \| Fix set but unused variables warning in release builds. (#138008 ) Summary: Fixing a warning that happens only in release builds. Test Plan: Sandcastle + dependent diffs Reviewed By: boguscoder Differential Revision: D64415854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138008 Approved by: https://github.com/boguscoder, https://github.com/Skylion007	2024-10-16 00:05:39 +00:00
Edward Z. Yang	46ec4ad021	Add code pointer to internal Meta implementation (#137984 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137984 Approved by: https://github.com/albanD	2024-10-15 23:35:22 +00:00
PyTorch MergeBot	4557f6e339	Revert "[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669 )" This reverts commit bf0b67059882933574f71a3b11b2f0127915ee5b. Reverted https://github.com/pytorch/pytorch/pull/137669 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing test_public_bindings in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/137669#issuecomment-2415331274))	2024-10-15 23:22:58 +00:00
Animesh Jain	19665f4619	[fake_tensor][cache] Supports ops with tuple of output tensors (#137935 ) This is needed for invoke_subgraph work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137935 Approved by: https://github.com/masnesral	2024-10-15 22:15:07 +00:00
Yifu Wang	5d5783a263	Improve the scheduling of _pipelined_multi_all_gather_and_consume (#137850 ) ``` Parallelization strategy: after each rank copies its shard into its local p2p buffer, every rank issues independent p2p copy -> shard_consumer sequences to two streams. In addition to computation/communication overlapping, the strategy allows for computation/computation overlapping, greatly reducing quantization inefficiency. Notation: - "mv" for the copy to local buffer - "cp" for p2p copies - "b" for barriers Constraints: - The GPU scheduler may or may not overlap "mv" with the first shard_consumer. - "cp" from different streams cannot overlap. Ideal scenario 0 - "mv" overlaps with the first shard_consumer: stream 0: [ shard_consumer ][ cp ][ shard_consumer ] stream 1: [ mv ][b][ cp ][ shard_consumer ] Ideal scenario 1 - "mv" is scheduled before the first shard_consumer: stream 0: [ shard_consumer ][ cp ][ shard_consumer ] stream 1: [ mv ][b][ cp ][ shard_consumer ] Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer: stream 0: [ shard_consumer ] [ cp ][ shard_consumer ] stream 1: [ mv ][b][ cp ][ shard_consumer ] Suboptimal scenario 0 - "b" is scheduled after the first shard_consumer: stream 0: [ shard_consumer ] [ cp ][ shard_consumer ] stream 1: [ mv ] [b][ cp ][ shard_consumer ] We haven't yet figured out a way to ensure "mv" and "b" are either overlapped with or scheduled before the first shard_consumer. Thus, to prevent suboptimal scenarios, we are giving up the chance to overlap "mv" and "b" with the first shard_consumer for now. ``` This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. 
The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel: Before this PR: <img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2"> With this PR: <img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850 Approved by: https://github.com/weifengpy ghstack dependencies: #137643, #137738, #137805, #137836	2024-10-15 21:35:14 +00:00
Yifu Wang	2ae1a4caa1	Improve the scheduling of _pipelined_produce_and_all2all (#137836 ) ``` Parallelization strategy: every rank issues independent compute -> barrier -> p2p copy sequences on two streams. In addition to computation/communication overlapping, the strategy allows for computation/computation overlapping, greatly reducing quantization inefficiency. Ideally, stream activities would look like this ("b" for barriers, "cp" for p2p copies): [rank 0] stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ] stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ] [rank 1] stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ] stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ] Note that the barriers synchronize streams with the same ID across ranks. They don't synchronize streams on the same rank. Since the work on both streams is independent, there's no guarantee that the chunk_producer from stream 0 or stream 1 will be scheduled first. If there is a scheduling mismatch across ranks, the barrier forces all ranks to wait for the slowest. When scheduling mismatches occur among ranks, the stream activities might look like this (note that p2p copies from different streams cannot overlap with each other): [rank 0] stream 0: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ] stream 1: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ] [rank 1] stream 0: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ] stream 1: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ] To prevent this, we need to ensure that the chunk_producer on stream 1 gets scheduled first on every rank. Without access to the underlying kernels, CUDA offers no API to control the scheduling order of two independent, overlapping kernels. Our solution is to issue a small sleep kernel in stream 0. 
The sleep duration is insignificant, but having an extra task in stream 0 will almost guarantee that the chunk_producer on stream 1 gets scheduled first. Once the first chunk_producer is scheduled in the correct order, there's very little room for the scheduling order of subsequent kernels to be inconsistent across ranks. ``` Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks. Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank: <img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d"> With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR): <img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23"> With this PR, we get both consistent scheduling order across ranks and compute/compute overlap: <img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836 Approved by: https://github.com/weifengpy ghstack dependencies: #137643, #137738, #137805	2024-10-15 21:35:14 +00:00
Yifu Wang	ef541c1a65	[fused_all_gather_scaled_matmul] support rowwise scaling (#137805 ) This PR adds support for `A_scale` to be a row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op all-gathers the scale in a pipelined fashion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805 Approved by: https://github.com/weifengpy ghstack dependencies: #137643, #137738	2024-10-15 21:35:14 +00:00
Yifu Wang	05edaeaded	[fused_scaled_matmul_reduce_scatter] support rowwise scaling (#137738 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137738 Approved by: https://github.com/Chillee, https://github.com/weifengpy ghstack dependencies: #137643	2024-10-15 21:35:14 +00:00
Yifu Wang	91bc9dc2c9	[SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643 ) Suggested by @lw for better safety/reliability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137643 Approved by: https://github.com/weifengpy, https://github.com/lw	2024-10-15 21:35:14 +00:00
Jane Xu	eaec72d1e6	Link directly to new Custom Ops Landing Page (#137933 ) e.g., click on first link in https://docs-preview.pytorch.org/pytorch/pytorch/137933/library.html#testing-custom-ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/137933 Approved by: https://github.com/zou3519	2024-10-15 21:18:21 +00:00
Tristan Rice	aef4317ec8	[c10d] socket: retry connection timeout failures (#138003 ) This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remainder of the connection timeout. The connection timeout here is not the same as the c10d timeout, which appears to be higher. We could adjust the Linux timeout directly, but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc. Example failure: ``` socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out). socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known). ... repeats ipv4 connection failure ``` From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html ``` ETIMEDOUT Timeout while attempting connection. The server may be too busy to accept new connections. Note that for IP sockets the timeout may be very long when syncookies are enabled on the server. ``` Test plan: CI for backwards compatibility Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro	2024-10-15 21:17:05 +00:00
Michael Lazos	bf0b670598	[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669 ) Fixes https://github.com/pytorch/pytorch/issues/114369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669 Approved by: https://github.com/anijain2305	2024-10-15 20:52:58 +00:00
Matthew Levy	28a521e29a	[fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow (size 4) in c10::IValue::IValue() (#137924 ) Summary: Calling `pop()` on empty stack Test Plan: CI Differential Revision: D64332420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137924 Approved by: https://github.com/Skylion007	2024-10-15 20:42:47 +00:00
Xu Han	3ecec0c90c	skip lintrunner install on Windows. (#137981 ) `lintrunner` does not support Windows x64. Ref: https://pypi.org/project/lintrunner/#files When we install Python dependencies with `pip install -r requirements.txt` on Windows x64, it fails on `lintrunner`. <img width="887" alt="image" src="https://github.com/user-attachments/assets/e3815177-e893-41ae-96af-8b39d12f74a7"> Solution: skip installing `lintrunner` on Windows. Reference doc: https://peps.python.org/pep-0508/#environment-markers Pull Request resolved: https://github.com/pytorch/pytorch/pull/137981 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2024-10-15 20:37:26 +00:00
Ke Wen	35fc24fbed	[PGNCCL] Fix bugs in non-blocking mode (#137741 ) ### Fix 1: Throw async error during init wait Previously we just busy-waited for `ncclSuccess`; if the non-blocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // https://github.com/NVIDIA/nccl/issues/1472 ``` This wait does not mean the child comm is ready for use, nor does it block until that point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741 Approved by: https://github.com/shuqiangzhang	2024-10-15 20:35:39 +00:00
Nikita Lutsenko	370d66d7dd	aten/buck \| Appropriately convert clang => msvc compiler_flags. (#137944 ) Summary: fPIC is not available in clang on Windows - filter it out. Also configure the flags appropriately for MSVC. Reviewed By: rameshviswanathan Differential Revision: D64365660 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137944 Approved by: https://github.com/mwdavis84, https://github.com/ChristianK275, https://github.com/boguscoder	2024-10-15 20:21:01 +00:00
Alex Baden	487873f7ca	[Inductor]: Support updated Triton `AttrsDescriptor` (#137757 ) The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialized of the descriptor parameters, and a creation via a static method instead of the class constructor. Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments. Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757 Approved by: https://github.com/jansel	2024-10-15 19:34:59 +00:00
Mikayla Gawarecki	534fa96f2d	Expose option to disable CRC-32 computation during `torch.save` (#137735 ) Option only works in open source, not internal Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735 Approved by: https://github.com/albanD	2024-10-15 19:30:02 +00:00
Andrew Gu	3cc8c8b944	[FSDP2] Add `set_unshard_in_backward(bool)` (#137922 ) For some expert use cases, the user knows some parameters are not required for backward, so we can skip the unshard in backward. One example is the embedding weight. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137922 Approved by: https://github.com/weifengpy	2024-10-15 19:11:14 +00:00
Laith Sakka	60cf72e028	enable auto functionalize v2 by default (#136685 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136685 Approved by: https://github.com/zou3519 ghstack dependencies: #137760	2024-10-15 19:04:42 +00:00
Laith Sakka	05b6200ccd	Do not compute base in export mode (#137760 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137760 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-10-15 19:04:42 +00:00
drisspg	f5e38f65c5	[FlexAttention] Support training bias for eager (#136910 ) (#137526 ) This PR is Part 2 of the implementation started in https://github.com/pytorch/pytorch/pull/136910, with the updates from https://github.com/pytorch/pytorch/pull/137451 rolled in. The original was reverted due to calls to #@torch.library at `import torch` time, so this adds a call to register at the first call to `ModIndex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137526 Approved by: https://github.com/Chillee, https://github.com/zou3519	2024-10-15 18:55:22 +00:00
PyTorch MergeBot	cd292908e5	Revert "Make c10::string_view an alias of std::string_view (#130417 )" This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b. Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))	2024-10-15 18:55:09 +00:00
Siddhartha Menon	e1e6417d4c	Add SVE implementation of embedding_lookup_idx (#133995 ) Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-10-15 18:52:44 +00:00
Nikita Shulga	b09d6f3a7d	[EZ][BE] Delete 3.8 specific checks (#137991 ) As we no longer support 3.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137991 Approved by: https://github.com/Skylion007	2024-10-15 18:45:49 +00:00
Aaron Orenstein	524fe784ec	BundledAutotuneCache (take 2) (#137902 ) Summary: Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later. Attempt 2 of #134959 (D60677499). Various configs: env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE config: bundled_autotune_remote_cache jk: pytorch/remote_cache:bundled_autotune_remote_cache_version Test Plan: unit tests Manually tested w/ EMU: ``` cd fbcode/accelerators/workloads/models/emu_flash/v1p4 make build_benchmark_model && make save_model_to_path make test_pt2_latency ``` - on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss. - perf seems a little better - for 8 runs: - no bundled cache averaged 14m11s - bundled cache averaged 14m6s - 125ms saved per cache entry seems reasonable Cache Metrics for an sample run: no bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0} FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0} backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0} backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0} ``` bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0} FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<< FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0} backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, 
exception: 0} backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0} ``` Differential Revision: D64336043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902 Approved by: https://github.com/oulgen	2024-10-15 18:39:47 +00:00
albanD	bf77f52895	Fix memory leak on masked Tensor (#137890 ) Note that this also reverts the change from https://github.com/pytorch/pytorch/pull/137815, which is not needed anymore! Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle is through the autograd graph which we cannot traverse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137890 Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007	2024-10-15 18:37:55 +00:00
Huy Do	0b7ef196cd	Use filelock to build extension_device backend one at a time (#137930 ) Fixes https://github.com/pytorch/pytorch/issues/136125 Fixes https://github.com/pytorch/pytorch/issues/137026 Fixes https://github.com/pytorch/pytorch/issues/137027 The compilation fails during `setUpClass`, so disabling the test doesn't do anything. The theory I have for this flaky issue is that `test_open_device_registration` from both `TritonExtensionBackendTests` and `ExtensionBackendTests` are run in parallel and one is cleaned up while the other is still in flight, causing flaky failures. Here is an example failure https://github.com/pytorch/pytorch/actions/runs/11331105492/job/31512603585#step:22:1710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137930 Approved by: https://github.com/malfet	2024-10-15 17:46:28 +00:00
PyTorch MergeBot	60eb3fccfa	Revert "[ONNX] Remove ExportTypes (#137789 )" This reverts commit 3e0b83ad1f0a998ef8a72c5e82d9250ab800cce5. Reverted https://github.com/pytorch/pytorch/pull/137789 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))	2024-10-15 17:40:06 +00:00
PyTorch MergeBot	2831af39c4	Revert "[ONNX] Remove deprecated export_to_pretty_string (#137790 )" This reverts commit d0628a7e3921639f62d6a6fec9f9f1871e087533. Reverted https://github.com/pytorch/pytorch/pull/137790 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))	2024-10-15 17:40:06 +00:00
PyTorch MergeBot	dac0b4e62b	Revert "Add SVE implementation of embedding_lookup_idx (#133995 )" This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62. Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))	2024-10-15 17:23:50 +00:00
PyTorch MergeBot	d4d687ffb2	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))	2024-10-15 17:19:16 +00:00
PyTorch MergeBot	9af4e0d2aa	Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526 )" This reverts commit a6eb0205225fce7ba7a75d200566613b84aff4e9. Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))	2024-10-15 17:19:15 +00:00
Pian Pawakapan	44653895cc	override bool(), is_nonzero for real tensor tracing (#136788 ) Fixes bool() and is_nonzero() calls for real tensor tracing, non-strict export Differential Revision: D63482693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136788 Approved by: https://github.com/ezyang	2024-10-15 17:13:44 +00:00
Haifeng Jin	bdbe0cfffa	Fix test_binary_ufuncs.py for NumPy 2 (#137937 ) Related to #107302 The following tests failed in test_binary_ufuncs.py when testing with NumPy 2. ``` FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8 ``` This PR fixes them. More details: * `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2. * `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937 Approved by: https://github.com/albanD	2024-10-15 17:04:24 +00:00
Nikita Shulga	e4d7676c1b	[CPU] Expand `torch.special.i1` to Half and BF16 (#137899 ) To match behavior of `torch.special.i0` Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849 Also, add explicit high-precision template specialization for `calc_i0` and `calc_i1` for `BFloat16` and `Half` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899 Approved by: https://github.com/Skylion007	2024-10-15 17:00:58 +00:00
Daniel Velkov	4abe38bc94	RMSprop docs: add missing input "epsilon" (#137854 ) Adding a missing input argument in the docs for RMSprop. Like in the doc for AdamW https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854 Approved by: https://github.com/janeyx99	2024-10-15 16:40:42 +00:00
Haifeng Jin	822aa588bc	Fix torch_np/test_basic for NumPy 2 (#137814 ) Related to #107302 `TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2. The test is checking if all methods under `torch._numpy` exist in `numpy`. However, some of them are removed in NumPy 2. This PR fixes the issue by not checking the removed methods when running with NumPy 2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814 Approved by: https://github.com/albanD	2024-10-15 16:40:28 +00:00
Isuru Fernando	120fbe9caa	Update inductor benchmark time to avoid flakiness (#137900 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900 Approved by: https://github.com/laithsakka	2024-10-15 16:17:04 +00:00
Jack Taylor	966a1a971e	[ROCm] Add AMDSMI support for UUID input (#129741 ) Adds support for using UUIDs with AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2024-10-15 15:56:30 +00:00
Prachi Gupta	17ed403644	[ROCm] Enable test_triton* in test_sparse_csr suite (#137712 ) All test_triton* UTs are now passing on ROCm within test_sparse_csr suite. See logs here: https://ossci-raw-job-status.s3.amazonaws.com/log/31376189926 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137712 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-10-15 15:41:21 +00:00
Wang, Eikan	5689e33cfe	[Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794 ) The Intel GPU ATen library (libtorch_xpu) uses `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, and `TORCH_API` is in turn controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro serves more than the `TORCH_API` semantics. This means that the semantics of `TORCH_API` here are `hidden` symbols. https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99 Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794 Approved by: https://github.com/atalman ghstack dependencies: #137873	2024-10-15 15:31:37 +00:00
yintong-lu	3361908fc5	torch/ao/quantization/utils.py: Moving eps to targeted device to avoid device mismatch issue (#135204 ) MOTIVATION We recently verified some quantization tests on devices other than cpu (eg. CUDA and Intel Gaudi devices identified as 'hpu'). We noticed a device mismatch error as eps is a tensor created on cpu but other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_. CHANGES Move eps to _device_ of other tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-10-15 14:58:55 +00:00
eellison	cef6c3dcb0	Dont decompose aten.baddmm in inductor (#137904 ) Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator. Fix for https://github.com/pytorch/pytorch/issues/137897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904 Approved by: https://github.com/ngimel	2024-10-15 14:54:56 +00:00
Richard Barnes	b7f798caa4	Use C10_UNUSED instead of (void)X (#137239 ) Summary: Auto-generated with ``` buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\sfor\s$)(const.\n)\s\(void$[A-Za-z]+;\s//\sSuppress.\s\n(.)' --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".\.$cpp\\|h$"` ``` Differential Revision: D33432600 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239 Approved by: https://github.com/Skylion007	2024-10-15 14:32:59 +00:00
Tom Ritchford	e7a4ad3b40	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-15 13:51:20 +00:00
Xiaodong Wang	5141ade8e3	[AMD] Do not skip 0-byte send/recv (#137952 ) Summary: With https://github.com/ROCm/rccl/pull/1376, we can remove this hack now and we have verified that we no longer run into hang Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-xdwang-900def406a?job_attempt=0&version=1&env=PRODUCTION Differential Revision: D64370817 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137952 Approved by: https://github.com/eqy	2024-10-15 09:35:03 +00:00
Xiaodong Wang	b7be4b1e48	[AMD] Turn on fast path for index_put (#136136 ) Summary: This slow path is bad because it has a sync point which makes CPU really slow. I'm not very sure if AMD actually needs this with the newer ROCm version. Test Plan: CI Differential Revision: D62731130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136136 Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy	2024-10-15 08:39:17 +00:00
Wang, Eikan	f42d1b6fa1	Fix Intel GPU test failure due to unsupport bool for unfold (#137873 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137873 Approved by: https://github.com/etaf, https://github.com/desertfire	2024-10-15 07:58:51 +00:00
cyy	8c860aef0d	[Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942 ) Reland of #137328, which was reverted due to reverting a dependent PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942 Approved by: https://github.com/eqy	2024-10-15 07:47:24 +00:00
Ke Wen	56cc22eb01	[CI][Distributed] Not to test distributed_test.py with UCC (#137932 ) Some UCC tests became unstable recently, with or without the M60 to T4 upgrade. See for example: #137855 (without upgrade), #137161 (with upgrade). So I am extracting the disablement from #137161 here. Failure signature: ``` RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0 ``` Earlier discussed here: https://github.com/pytorch/pytorch/pull/137161/files#r1797353294 Cc: @Aidyn-A @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/137932 Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy	2024-10-15 07:22:57 +00:00
Edward Z. Yang	5b442e8e92	Time torch_key computation in overall Dynamo stats (#137877 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137877 Approved by: https://github.com/oulgen, https://github.com/masnesral	2024-10-15 05:47:19 +00:00
Edward Z. Yang	5c3ba6faff	Add fbscribelogger to Dynamo benchmark runner (#137867 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867 Approved by: https://github.com/bobrenjc93	2024-10-15 04:36:41 +00:00
Brian Hirsh	ed94725b8c	log ViewAndMutationMeta to trace_structured (#133784 ) I ended up bundling it into the existing tlparse logs for the AOT forward graph, since it looked like registering it as a separate artifact requires changes to tlparse itself (maybe that is wrong though?) Example new fw AOT graph tlparse output for the below code: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp70zKiO/0_0_0/aot_forward_graph_2.txt ``` import torch @torch.compile def f(x): out1 = torch.view_as_complex(x) out2 = torch.view_as_complex(x) return out1, out2, x * 2 x_ = torch.randn(4, 2, requires_grad=True, dtype=torch.float64) out = f(x_) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133784 Approved by: https://github.com/ezyang	2024-10-15 02:49:02 +00:00
cyy	70206499f1	[3/N] Fix extra warnings brought by clang-tidy-17 (#137552 ) Follows #137459 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137552 Approved by: https://github.com/ezyang	2024-10-15 02:33:44 +00:00
FFFrog	a6eb020522	Make Context to be Device-agnostic Step by Step (2/N) (#136526 ) ---- - add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526 Approved by: https://github.com/ezyang, https://github.com/EikanWang	2024-10-15 01:53:28 +00:00
Bob Ren	b34db401f2	Add support for div in tensorify_python_scalars fx pass (#137623 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137623 Approved by: https://github.com/ezyang	2024-10-15 01:49:46 +00:00
Michael Lazos	8316f9b2a0	Fix autograd function calls without context arg (#137809 ) Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error. The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809 Approved by: https://github.com/drisspg	2024-10-15 01:25:47 +00:00
Ryan Guo	a89cf2b59a	[dynamo] Don't codegen temporary cells for pre-existing cells (#137907 ) This patch removes tempvar codegen for the `NewCellVariable` that has `AttributeMutationExisting`, because these tempvars will never be used. Note that tempvar codegen for other objects also follows this pattern, i.e., it only fires on `AttributeMutationNew`. To visualize, in the following program, we'll see the modified bytecode contains redundant `make_cell` calls, and stores the result to a local `tmp_2` which is never used again. ```python import torch def test_write_cell(): count = torch.ones(1) def inc(): nonlocal count count = count + 1 @torch.compile() def fn(): inc() fn() test_write_cell() ``` ``` $ TORCH_LOGS="bytecode" TORCH_LOGS_FORMAT="short" python test.py ...... 0 COPY_FREE_VARS 1 2 RESUME 0 4 LOAD_GLOBAL 9 (NULL + __compiled_fn_2) 14 LOAD_DEREF 3 (inc) 16 LOAD_ATTR 6 (__closure__) 36 LOAD_CONST 1 (0) 38 BINARY_SUBSCR 42 LOAD_ATTR 4 (cell_contents) 62 CALL 1 70 STORE_FAST 0 (graph_out_0) 72 LOAD_GLOBAL 0 (__import_torch_dot__dynamo_dot_utils) 82 LOAD_ATTR 3 (NULL\|self + make_cell) 102 CALL 0 110 STORE_FAST 2 (tmp_2) 112 LOAD_FAST 0 (graph_out_0) 114 LOAD_CONST 1 (0) 116 BINARY_SUBSCR 120 LOAD_DEREF 3 (inc) 122 LOAD_ATTR 6 (__closure__) 142 LOAD_CONST 1 (0) 144 BINARY_SUBSCR 148 STORE_ATTR 2 (cell_contents) 158 DELETE_FAST 0 (graph_out_0) 160 RETURN_CONST 0 (None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137907 Approved by: https://github.com/anijain2305	2024-10-15 00:49:45 +00:00
chilli	1cf78bbf62	Refactored debug_extra to be on ChoiceCaller (and called description) (#137857 ) Before: <img width="644" alt="image" src="https://github.com/user-attachments/assets/17b0fa8a-37c8-494b-8914-9d42c3db4bef"> After: <img width="1292" alt="image" src="https://github.com/user-attachments/assets/5ee59747-a34f-4dd6-b943-cb5a53d52080"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137857 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/masnesral ghstack dependencies: #137768	2024-10-15 00:48:14 +00:00
Edward Z. Yang	3630398509	Move symbolic_shapes create_env back to INFO (#137926 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137926 Approved by: https://github.com/Skylion007	2024-10-15 00:37:01 +00:00
cyyever	406db6a73d	Improve ASAN path detection (#137865 ) Follows #137335, for better adoption of latest clang to ASAN jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137865 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-14 23:54:46 +00:00
Shivam Raikundalia	aef3591998	[Profiler] Add Test for Clear on Fork (#137511 ) Summary: Tests Fix Clear On Fork by forking a process after a profile has already been done. Afterwards we check that all the PID/TID are as expected. Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)' Differential Revision: D63992036 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi	2024-10-14 23:20:33 +00:00
Nikita Shulga	0786b37260	[MPS] Add i0 op (#137849 ) More-or-less verbatim copy of `47c8aa8090/aten/src/ATen/native/Math.h (L101)` Plus a bit of a MPS boilerplate code Update test_mps.py to mark kaiser_window and i0 as passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849 Approved by: https://github.com/Skylion007	2024-10-14 22:50:01 +00:00
Nikita Shulga	18587f2427	[BE] Use `std::enable_if_t` in Math.h (#137920 ) PyTorch is C++17 project, so let's use some C++17 convenience methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/137920 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-14 22:20:09 +00:00
Catherine Lee	8ac06467d4	Forward fix test (#137910 ) Summary: Add back in a deleted file to fix test It was removed in https://github.com/pytorch/pytorch/pull/137404 Test Plan: `buck2 build --flagfile fbcode//mode/opt fbcode//caffe2/test/cpp/c10d:ProcessGroupGlooAsyncTest` succeeded Differential Revision: D64341074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137910 Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/kit1980	2024-10-14 22:07:29 +00:00
Jerry Zhang	ad134fe038	Skip doc test internally (#137813 ) Summary: there are some path issues when we run the doc tests internally https://www.internalfb.com/intern/test/281475143872621 Test Plan: sandcastle Reviewed By: drisspg, msaroufim Differential Revision: D64255824 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137813 Approved by: https://github.com/HDCharles	2024-10-14 21:29:15 +00:00
Eddie Yan	7911bf591d	[CUDA][Inductor] Fix some `bfloat16` tests for SM70 (#137675 ) Unsure about the `runtime_checks` changes as that's a pure pattern-match and guess Pull Request resolved: https://github.com/pytorch/pytorch/pull/137675 Approved by: https://github.com/eellison, https://github.com/jansel	2024-10-14 20:42:48 +00:00
atalman	6016b8a9be	Remove CI/CD python 3.8 requirements (#137893 ) Python 3.8 is deprecated from CI/CD. There is no reason to keep these pins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137893 Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD, https://github.com/kit1980	2024-10-14 20:28:48 +00:00
PyTorch MergeBot	3b7710316c	Revert "cublaslt autotuning support for TunableOp (#133896 )" This reverts commit 19bbbef79da8ed32f72d6e76517cb639d5db6c00. Reverted https://github.com/pytorch/pytorch/pull/133896 on behalf of https://github.com/clee2000 due to this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt? ([comment](https://github.com/pytorch/pytorch/pull/133896#issuecomment-2412180893))	2024-10-14 20:28:09 +00:00
PyTorch MergeBot	df0c2f5cae	Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328 )" This reverts commit 25ac5652d003c5526f496bd1e2cdfbe697c58ba4. Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))	2024-10-14 20:22:26 +00:00
Jagadish Krishnamoorthy	674d59359d	[ROCm] Enable dist sharded_tensor test suites (#137724 ) Following test suites are enabled on ROCm test_sharded_tensor test_sharded_tensor_reshard test_sharding_plan Pull Request resolved: https://github.com/pytorch/pytorch/pull/137724 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2024-10-14 20:20:57 +00:00
Alex Baden	39d21ed803	[Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458 ) The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](`72c9833927`)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes. Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively). With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458 Approved by: https://github.com/jansel	2024-10-14 20:20:29 +00:00
rzou	11e4232b42	Revert "[Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872 )" (#137891 ) This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4. We're reverting this because: 1) The original PR (#134872) fixed a bug but caused another one. The assessment is that the bug it caused is worse than the bug it fixed. 2) it was reverted on the release 2.5 branch, so we want to prevent divergence 3) The original author is out-of-office for a while so we don't want the divergence to wait until they're back Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891 Approved by: https://github.com/Skylion007	2024-10-14 20:12:58 +00:00
Will Constable	41c4aa9f7a	[pipelining] rename prev_/next_stage vars to clarify (#137739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137739 Approved by: https://github.com/H-Huang	2024-10-14 20:12:18 +00:00
drisspg	78299d75b7	[ScaledMM] More Large shape tuning (#137832 ) Fixes a bug in the previous PR's check. After some more performance tuning at very large sizes, we also found that it is valuable to transpose when N > M; otherwise performance is better untransposed. If you look at the absolute TFLOPs, I think we still have some room for improvement! ### Perf Here are some TFLOP deltas at larger sizes, where green is the positive gain in TFLOPs at different values of K ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_delta_heatmap](https://github.com/user-attachments/assets/dcd009a5-1e4f-449c-b852-a92bb7db66e3) <details> <summary>### Different Values of K</summary> ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K24576_tflops_delta_heatmap](https://github.com/user-attachments/assets/8c043f6c-b8aa-48a9-bd5d-3ec6f39818cd) ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K16384_tflops_delta_heatmap](https://github.com/user-attachments/assets/41a4b9f4-2749-4a84-b9c7-bddc2c2334c0) ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K12288_tflops_delta_heatmap](https://github.com/user-attachments/assets/68d42421-cfa9-4a0a-a5a5-9f6db80bf609) ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K8192_tflops_delta_heatmap](https://github.com/user-attachments/assets/c03906a0-5de7-463e-96a8-85f1774b3af6) ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K6144_tflops_delta_heatmap](https://github.com/user-attachments/assets/d697b2d0-efc9-4ea8-9002-d517f3abaf50) ![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K4096_tflops_delta_heatmap](https://github.com/user-attachments/assets/06f8ef5c-277f-45ca-a44f-ed2e54d4133a) </details> <details> <summary>### Absolute Tflops</summary> ## Old ![large_shape_old_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/8872506b-0ff1-400e-8d11-71eff6d8d59a) ## New 
![update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/9fc9ec24-ff1a-4b47-8934-72d181677d14) </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137832 Approved by: https://github.com/vkuzo	2024-10-14 20:02:52 +00:00
Edward Z. Yang	d64492e4cb	Increase verbosity of inductor cache hit/miss to INFO level (#137876 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137876 Approved by: https://github.com/Skylion007	2024-10-14 19:59:31 +00:00
eqy	914c90dcea	[NCCL][CUDA] Set `PYTORCH_C10_DRIVER_API_SUPPORTED` in `ProcessGroupNCCL.cpp` compilation (#137828 ) Otherwise `expandable_segments()` is hardcoded to false in `CUDAAllocatorConfig.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137828 Approved by: https://github.com/yifuwang, https://github.com/Skylion007	2024-10-14 19:38:23 +00:00
Joel Schlosser	19918a1863	Fix autograd.Function + NJT when an output grad is None (#136875 ) For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws: ``` RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers ``` This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875 Approved by: https://github.com/soulitzer, https://github.com/huydhn	2024-10-14 19:31:50 +00:00
ErezYosef	197601eeea	Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107 ) A proposal addressing Issue #1489: Optimizer should track parameter names and not id. (also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552)) ## Summary This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id. Optimizers can be initialized with `named_parameters()` as: ```python optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9) ``` This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as: ``` state_dict = { 'state': { 0: {'momentum_buffer': tensor(...), ...}, 1: {'momentum_buffer': tensor(...), ...}, }, 'param_groups': [ { 'lr': 0.01, 'weight_decay': 0, ... 'params': [0, 1], 'param_names': ['layer.weight', 'layer.bias'] (optional) } ] } ``` Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored. ## Key Features #### Named Parameters in Optimizer Initialization: Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly. #### Parameter Names in `state_dict`: The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters. ## Backward Compatibility #### No Breaking Changes: This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer. #### Customization with Hooks: For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs. 
## Documentation Updates Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively. ## Solution Example: A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order. The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict : ```python def adapt_state_dict_ids(optimizer, state_dict): # assuming a single param group. current_state_group = optimizer.state_dict()['param_groups'][0] loaded_state_group = state_dict['param_groups'][0] # same number of params, same names, only different ordering current_state_name_to_id_mapping = {} # mapping -- param_name: id for i, name in enumerate(current_state_group['param_names']): current_state_name_to_id_mapping[name] = current_state_group['params'][i] # changing the ids of the loaded state dict to match the order of the given state dict. for i, name in enumerate(loaded_state_group['param_names']): loaded_state_group['params'][i] = current_state_name_to_id_mapping[name] return state_dict ``` In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`. Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the 'param_names' key in the dict. ### Note This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-10-14 19:24:44 +00:00
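The id remapping described in the commit message above can be sketched with plain dictionaries, without constructing a real optimizer. This is only an illustration under stated assumptions: `remap_param_ids` is a hypothetical helper, not part of `torch.optim`, and the two dicts stand in for a single entry of `param_groups` that carries the optional `param_names` key.

```python
import copy


def remap_param_ids(current_group, loaded_group):
    """Hypothetical helper: return a copy of loaded_group whose 'params' ids
    are rewritten to the ids that current_group assigns to the same names."""
    # Map each parameter name to the id used by the current optimizer.
    name_to_id = dict(zip(current_group["param_names"], current_group["params"]))
    remapped = copy.deepcopy(loaded_group)
    # Walk the *loaded* names so that position i keeps describing the same
    # parameter while pointing at the id the current optimizer uses for it.
    remapped["params"] = [name_to_id[name] for name in loaded_group["param_names"]]
    return remapped


# Same two parameters, registered in a different order.
current = {"params": [0, 1], "param_names": ["layer.weight", "layer.bias"]}
loaded = {"params": [0, 1], "param_names": ["layer.bias", "layer.weight"]}
remapped = remap_param_ids(current, loaded)
```

Iterating over the loaded names matters here: iterating over the current names would assign every position its own id again and leave the loaded group unchanged.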
Tom Ritchford	4470339fbb	[dynamo] Fix an error in _dynamo.compiled_autograd.reset() (#137889 ) ---- * From https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137889 Approved by: https://github.com/Skylion007	2024-10-14 18:21:18 +00:00
Huy Do	929797dedb	Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835 ) The test is failing (flakily?) on periodic Windows CUDA jobs with the following error: ``` __________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________ Traceback (most recent call last): File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop os.remove(filename) PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv' ``` For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097 The test tried to catch and ignore this, but this is Windows. So, the fix is to: 1. Ignore if these files couldn't be removed 2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835 Approved by: https://github.com/atalman	2024-10-14 17:28:33 +00:00
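The two-part fix described in the commit message above (write output files to a temp directory, and tolerate files that cannot be removed) can be sketched with the standard library alone. This is a minimal sketch, not the test's actual code: `write_results` is a hypothetical stand-in for whatever produces the CSV files, and only the file name is taken from the error message.

```python
import contextlib
import os
import tempfile


def write_results(directory, names):
    """Hypothetical stand-in for code that writes result files into directory."""
    paths = []
    for name in names:
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write("placeholder\n")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    # Writing under tmp_dir keeps the working tree clean.
    paths = write_results(tmp_dir, ["tunableop_untuned0.csv"])
    created_inside_tmp = all(os.path.dirname(p) == tmp_dir for p in paths)
    for p in paths:
        # PermissionError (as in the Windows failure) is a subclass of OSError.
        with contextlib.suppress(OSError):
            os.remove(p)
    # Removing a file that is already gone is ignored in the same way.
    with contextlib.suppress(OSError):
        os.remove(paths[0])

leftover = os.path.exists(tmp_dir)
```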
PyTorch MergeBot	f8a5b7170a	Revert "Fix autograd.Function + NJT when an output grad is None (#136875 )" This reverts commit 76ab1ab66560213701943ecde368aedcd5de08e5. Reverted https://github.com/pytorch/pytorch/pull/136875 on behalf of https://github.com/jbschlosser due to Caused memory leak ([comment](https://github.com/pytorch/pytorch/pull/136875#issuecomment-2411665776))	2024-10-14 16:00:44 +00:00
Bob Ren	47bb494e49	Add support for sub in tensorify_python_scalars fx pass (#137622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622 Approved by: https://github.com/ezyang ghstack dependencies: #137620	2024-10-14 15:37:29 +00:00
Bob Ren	f246507f28	Add support for add in tensorify_python_scalars fx pass (#137620 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620 Approved by: https://github.com/ezyang	2024-10-14 15:10:27 +00:00
Sanket Purandare	a77145ae2f	Selective Activation Checkpointing (SAC) Estimator for estimating memory and recomputation time trade-offs. (#135208 ) This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs. It provides a `TorchDispatchMode` based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Modules` for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata information for operators of each module, including greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage. It is inspired by [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa End-to-end example: ``` import torch from torch._subclasses.fake_tensor import FakeTensorMode from torch.distributed._tools.sac_estimator import SACEstimator from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, ) if __name__ == "__main__": dev = torch.cuda.current_device() vocab_size = 8192 bsz, seq_len = 8, 1024 model_args = ModelArgs( n_layers=4, n_heads=12, vocab_size=vocab_size, max_seq_len=seq_len, dim=768, dropout_p=0.1, ) with FakeTensorMode(): with torch.device(dev): model = Transformer(model_args) inp = torch.randint( 0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev ) sace = SACEstimator() with sace(estimate_mode_type='operator-level-cost-model'): loss = model(inp).sum() loss.backward() sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True) sace.display_modulewise_sac_stats(depth=4, print_tabular=True) ``` 
Example AC Stats for one of the transformer layers: ![Screenshot 2024-10-11 at 10 09 13 PM](https://github.com/user-attachments/assets/1cf85564-4319-4732-bba1-89d505cda6ab) Example AC Trade-off for one of the transformer layers: ![Screenshot 2024-10-11 at 10 09 58 PM](https://github.com/user-attachments/assets/5b2f343c-7e73-4c7d-bfea-3dcef2caa362) Example AC Trade-Off graph one of the transformer layers: ![Transformer layers 3](https://github.com/user-attachments/assets/490d4b37-a916-4298-a14c-f78ffecbbde2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208 Approved by: https://github.com/weifengpy	2024-10-14 13:56:40 +00:00
chilli	0e4d42634e	Port Inductor dataclasses to be kw_only (#137768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768 Approved by: https://github.com/ezyang	2024-10-14 10:33:43 +00:00
Siddhartha Menon	770c134998	Add SVE implementation of embedding_lookup_idx (#133995 ) Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-10-14 10:17:27 +00:00
cyy	c48fe89011	Make c10::string_view an alias of std::string_view (#130417 ) In order to facilitate the migration from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417 Approved by: https://github.com/ezyang	2024-10-14 09:28:04 +00:00
PyTorch MergeBot	41977a0531	Revert "Port Inductor dataclasses to be kw_only (#137768 )" This reverts commit 65d665bae5b82a54b819c0c4527e7ccf88d19427. Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seem to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))	2024-10-13 22:25:19 +00:00
Isuru Fernando	08ce3aac62	Cache some ValueRanges (#137438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438 Approved by: https://github.com/ezyang	2024-10-13 19:23:34 +00:00
GarfieldHan	b361cd01f1	profiler: Fix undefined reference to `unwind_c` in `unwind_entry` while LTO is enabled (#137862 ) With LTO (Link Time Optimization) enabled in CFLAGS, some compilers optimize away and strip the unwind_c function because they cannot resolve the reference correctly, which breaks the build with an undefined reference in unwind_entry. Add an attribute to prevent this. Fixes #121282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-13 19:04:58 +00:00
iupaikov-amd	c09b567a91	Fixed error string assertion in test_invalid_devices (#137772 ) The ROCm distribution returns a different error string for this operation, so the test fails this assertion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772 Approved by: https://github.com/Skylion007	2024-10-13 18:10:07 +00:00
chilli	65d665bae5	Port Inductor dataclasses to be kw_only (#137768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768 Approved by: https://github.com/ezyang	2024-10-13 14:55:45 +00:00
Bin Bao	cfc5d18aad	[AOTI] Turn on the ABI-compatible mode as default (#136534 ) Summary: Make AOTI generate ABI-compatible code as default for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534 Approved by: https://github.com/chenyang78 ghstack dependencies: #137660	2024-10-13 14:42:58 +00:00
Bin Bao	b181652f3d	[AOTI] Handle inplace output in ProxyExecutor (#137660 ) Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660 Approved by: https://github.com/chenyang78	2024-10-13 14:42:58 +00:00
cyy	a90b920284	Install llvm18 packages for ASAN workflows (#137335 ) Follows #128763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137335 Approved by: https://github.com/ezyang	2024-10-13 13:49:38 +00:00
FFFrog	4a8e49389c	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) ---- - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-13 12:38:02 +00:00
PyTorch MergeBot	563e9f99c3	Revert "Add device agnostic API for accelerator hooks (#137480 )" This reverts commit 858c91c3d8d9a71c66d0357e51a4bd805f95599f. Reverted https://github.com/pytorch/pytorch/pull/137480 on behalf of https://github.com/albanD due to break all builds on trunk ([comment](https://github.com/pytorch/pytorch/pull/137480#issuecomment-2408954802))	2024-10-13 12:12:37 +00:00
Yuxin Wu	08576b254b	Fix logging in socket.cpp (#137745 ) Formatter shall avoid throwing exceptions as much as possible. Fixes https://github.com/pytorch/pytorch/pull/128673#discussion_r1796226656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137745 Approved by: https://github.com/d4l3k, https://github.com/Skylion007	2024-10-13 10:38:10 +00:00
xangma	fe8d66d9a6	Faster Faster BatchSampler (#137423 ) Builds upon #76951. Benchmarking code is the same as in #76950. AMD Ryzen Threadripper PRO 3995WX: ``` batch_size drop_last origin new speedup ------------ ----------- -------- ------ --------- 4 True 0.94 0.5706 64.74% 4 False 0.9745 0.9468 2.93% 8 True 0.7423 0.3715 99.82% 8 False 0.7974 0.5666 40.73% 64 True 0.5394 0.2085 158.76% 64 False 0.6083 0.2697 125.51% 640 True 0.5448 0.1985 174.41% 640 False 0.7085 0.2308 206.91% 6400 True 0.5554 0.2028 173.88% 6400 False 0.7711 0.2109 265.60% 64000 True 0.556 0.2091 165.82% 64000 False 0.7803 0.2078 275.58% ``` When `drop_last == True`, it uses `zip` to speed things up. When `drop_last == False`, it uses `itertools` to speed things up. `itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now. Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423 Approved by: https://github.com/Skylion007	2024-10-13 09:36:03 +00:00
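The two batching strategies described in the commit above can be illustrated outside PyTorch. The following is a minimal standalone sketch, not the code merged in the PR: the function name `iter_batches` and its parameters are hypothetical, and it only assumes the sampler is an ordinary iterable of indices.

```python
import itertools
from typing import Iterable, Iterator, List


def iter_batches(
    indices: Iterable[int], batch_size: int, drop_last: bool
) -> Iterator[List[int]]:
    """Hypothetical helper that groups indices into batches."""
    it = iter(indices)
    if drop_last:
        # zip over batch_size references to the same iterator: each tuple
        # pulls batch_size consecutive items, and an incomplete final
        # group is silently discarded by zip.
        for group in zip(*[it] * batch_size):
            yield list(group)
    else:
        # islice takes up to batch_size items per step, so the last batch
        # may be shorter than batch_size; an empty list ends the loop.
        batch = list(itertools.islice(it, batch_size))
        while batch:
            yield batch
            batch = list(itertools.islice(it, batch_size))


full_only = list(iter_batches(range(7), 3, drop_last=True))
with_tail = list(iter_batches(range(7), 3, drop_last=False))
```

Both branches consume the underlying iterator exactly once, which is why neither needs to know the sampler length in advance.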
Michael Au-Yeung	b3af359cba	Log WorkNCCL exception string to C10dLogger (#137736 ) Summary: In WorkNCCL::handleException, log to c10d logger with `strings["work_nccl_exception"]`. Test Plan: Test run job to verify NCCL exception is logged. Differential Revision: D62603322 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137736 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-10-13 07:33:05 +00:00
zeshengzong	858c91c3d8	Add device agnostic API for accelerator hooks (#137480 ) Make `AcceleratorHooksInterface` consistent for multiple accelerators - Add `showConfig` and `deviceSynchronize` method declaration in `AcceleratorHooksInterface` - Remove unreachable lines of code Pull Request resolved: https://github.com/pytorch/pytorch/pull/137480 Approved by: https://github.com/albanD, https://github.com/FFFrog	2024-10-13 07:19:32 +00:00
Xiaodong Wang	7642f6d047	[AMD] Unify cublaslt and hipblaslt path (#137604 ) Differential Revision: D63967918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137604 Approved by: https://github.com/eqy	2024-10-13 07:11:12 +00:00
Wang, Eikan	fa08e924ad	Skip test export with fake tensor inputs on cuda devices for Intel GPU (#137847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137847 Approved by: https://github.com/etaf, https://github.com/jansel	2024-10-13 07:07:48 +00:00
FFFrog	e3df636580	Fix -Wsign-compare warning spam in Indexing.cu (#137842 ) Detailed Descriptions: Fix for warning spam like ``` warning: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare] ``` ![image](https://github.com/user-attachments/assets/7be3cfff-c33b-4a6e-b52d-04085e6e1bec) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137842 Approved by: https://github.com/ezyang	2024-10-13 07:03:12 +00:00
Xuehai Pan	1d6932937e	[dynamo] fix `NamedTupleVariable` for PyStructSequence (`torch.return_types.`) support (#137776 ) PyStructSequence is the C API equivalent for `collections.namedtuple` in Python. But they have different constructors: ```python tuple = NamedTupleType(*args) tuple = NamedTupleType._make(args) tuple = StructSequenceType(args) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776 Approved by: https://github.com/jansel	2024-10-13 06:46:41 +00:00
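The constructor difference mentioned in the commit above can be seen with standard-library types alone. This is a small sketch that uses `time.struct_time` as a stand-in struct sequence type (an assumption made for illustration; the commit itself concerns `torch.return_types`), with a hypothetical `Point` namedtuple.

```python
import time
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
args = (1, 2)

# namedtuple: fields are passed as separate positional arguments ...
p_unpacked = Point(*args)
# ... or as one iterable through the _make classmethod.
p_made = Point._make(args)

# A struct sequence type takes a single sequence holding all fields.
fields = (2024, 10, 13, 6, 46, 41, 6, 287, 0)
st = time.struct_time(fields)


def raises_type_error(fn) -> bool:
    """Return True if calling fn() raises TypeError."""
    try:
        fn()
    except TypeError:
        return True
    return False


# Swapping the calling conventions fails in both directions.
namedtuple_rejects_sequence = raises_type_error(lambda: Point(args))
struct_seq_rejects_unpacked = raises_type_error(lambda: time.struct_time(*fields))
```

Code that reconstructs such objects generically therefore has to choose the calling convention based on which kind of type it is dealing with.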
Animesh Jain	3050f2e5dd	[dynamo] Check nn modules parameters are not overwritten before taking tracing shortcut (#137824 ) Fixes https://github.com/pytorch/pytorch/issues/136257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137824 Approved by: https://github.com/jansel	2024-10-13 05:04:28 +00:00
abhishek-fujitsu	09e2a0d7bc	fix PyTorch build with Address Sanitizer enabled (#137446 ) Problem Building PyTorch with Address Sanitizer (ASAN) enabled was failing due to a static assertion in KernelFunction_impl.h. The compiler was unable to evaluate FuncPtr::func_ptr() as a constant expression when ASAN was enabled, causing a build error. ``` FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o /usr/bin/ccache /usr/bin/g++-11 -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D_GLIBCXX_SANITIZE_STD_ALLOCATOR -D_GLIBCXX_SANITIZE_VECTOR -Dtorch_cpu_EXPORTS -I/home/abhishekk/stantize/venv/pytorch/build/aten/src -I/home/abhishekk/stantize/venv/pytorch/aten/src -I/home/abhishekk/stantize/venv/pytorch/build -I/home/abhishekk/stantize/venv/pytorch -I/home/abhishekk/stantize/venv/pytorch/cmake/../third_party/benchmark/include -I/home/abhishekk/stantize/venv/pytorch/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/build/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/nlohmann -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api/include -I/home/abhishekk/stantize/venv/pytorch/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/../aten/src -I/home/abhishekk/stantize/venv/pytorch/torch/csrc 
-I/home/abhishekk/stantize/venv/pytorch/third_party/miniz-2.1.0 -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/include -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/src -I/home/abhishekk/stantize/venv/pytorch/third_party/cpp-httplib -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/.. -I/home/abhishekk/stantize/venv/pytorch/third_party/FXdiv/include -I/home/abhishekk/stantize/venv/pytorch/c10/.. -I/home/abhishekk/stantize/venv/pytorch/third_party/pthreadpool/include -I/home/abhishekk/stantize/venv/pytorch/third_party/cpuinfo/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/home/abhishekk/stantize/venv/pytorch/third_party/NNPACK/include -I/home/abhishekk/stantize/venv/pytorch/third_party/FP16/include -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/build/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/abhishekk/stantize/venv/pytorch/third_party/fmt/include -I/home/abhishekk/stantize/venv/pytorch/third_party/flatbuffers/include -isystem /home/abhishekk/stantize/venv/pytorch/build/third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/abhishekk/stantize/venv/pytorch/third_party/protobuf/src -isystem /home/abhishekk/stantize/venv/pytorch/third_party/XNNPACK/include -isystem 
/home/abhishekk/stantize/venv/pytorch/cmake/../third_party/eigen -isystem /home/abhishekk/stantize/venv/pytorch/INTERFACE -isystem /home/abhishekk/stantize/venv/pytorch/third_party/nlohmann/include -isystem /home/abhishekk/stantize/venv/pytorch/build/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include/openmpi -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_PYTORCH_QNNPACK -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_SVE_CPU_DEFINITION -DHAVE_SVE256_CPU_DEFINITION -g -fno-omit-frame-pointer -Og -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -c 
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp In file included from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction.h:260, from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:4, from /home/abhishekk/stantize/venv/pytorch/torch/library.h:63, from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:3: /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h: In instantiation of ‘static c10::KernelFunction c10::KernelFunction::makeFromUnboxedFunction(FuncPtr) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; bool AllowLegacyTypes = false]’: /home/abhishekk/stantize/venv/pytorch/torch/library.h:133:59: required from ‘torch::CppFunction::CppFunction(FuncPtr, std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t>) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t> = std::nullptr_t]’ /home/abhishekk/stantize/venv/pytorch/torch/library.h:691:17: required from ‘torch::Library& torch::Library::impl(Name, Func&&, torch::_RegisterOrVerify) & [with Name = const char; Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), 
at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’ /home/abhishekk/stantize/venv/pytorch/torch/library.h:782:16: required from ‘torch::Library& torch::Library::impl(torch::detail::SelectiveStr<true>, Func&&) & [with Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’ /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:87:9: required from here /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:177:39: error: non-constant condition for static assertion 177 \| static_assert(FuncPtr::func_ptr() != nullptr, "Kernel function cannot be nullptr"); \| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~ ``` Testing - Verified that PyTorch builds successfully with USE_ASAN=ON - Ran the PyTorch test suite to ensure no regressions were introduced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137446 Approved by: https://github.com/ezyang, https://github.com/jgong5	2024-10-13 03:31:54 +00:00
PyTorch MergeBot	70bd58c35f	Revert "Add support for add in tensorify_python_scalars fx pass (#137620 )" This reverts commit 0430e72e755d2c1953917ffb78f00c516eb4bbd5. Reverted https://github.com/pytorch/pytorch/pull/137620 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk `0430e72e75` ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))	2024-10-13 02:05:37 +00:00
PyTorch MergeBot	279052ab86	Revert "Add support for sub in tensorify_python_scalars fx pass (#137622 )" This reverts commit b7924610a0c20f72657548acef7743801189444a. Reverted https://github.com/pytorch/pytorch/pull/137622 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk `0430e72e75` ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))	2024-10-13 02:05:37 +00:00
Jason Ansel	5fee1ee3f4	[inductor] Refactor generate_workspace_allocation (#137673 ) And some other small changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/137673 Approved by: https://github.com/Chillee ghstack dependencies: #137754	2024-10-13 01:25:14 +00:00
Jason Ansel	5146e6a96d	[inductor] Fix reduction_hint sum to single element (#137754 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137754 Approved by: https://github.com/Chillee, https://github.com/malfet	2024-10-13 01:08:23 +00:00
Bob Ren	b7924610a0	Add support for sub in tensorify_python_scalars fx pass (#137622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622 Approved by: https://github.com/ezyang ghstack dependencies: #137620	2024-10-13 00:30:02 +00:00
Nichols A. Romero	bd63ec4f45	[ROCm] LoadHIP CMake cleanup (#137112 ) Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313 While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined. This PR does not fix any bugs or provide any new functionality. Strictly clean-up. The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2024-10-13 00:06:41 +00:00
zeshengzong	47c8aa8090	Refactor make device agnostic in accelerator hooks (#137558 ) Make `AcceleratorHooksInterface` consistent for multiple accelerators - Add `getDeviceFromPtr` method declaration in `AcceleratorHooksInterface` - Fix clangtidy warning Pull Request resolved: https://github.com/pytorch/pytorch/pull/137558 Approved by: https://github.com/FFFrog, https://github.com/ezyang	2024-10-12 18:13:54 +00:00
Bob Ren	0430e72e75	Add support for add in tensorify_python_scalars fx pass (#137620 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620 Approved by: https://github.com/ezyang ghstack dependencies: #136674, #137588	2024-10-12 17:18:27 +00:00
Wei Wang	e89fe0bd6e	Updating cuda binary build to get cusparselt from PYPI (#137653 ) Fixes #137374 Update 1: this PR requires Meta to upload the PyPI package to download.pytorch.org. See: ERROR: Could not find a version that satisfies the requirement nvidia-cusparselt-cu12==0.6.2; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: none) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137653 Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/atalman	2024-10-12 16:40:37 +00:00
Avik Chaudhuri	ed55d356de	[alt] fix unroll in successive unflatten (#137646 ) We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call. Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.) Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls. Test Plan: Like D64014936 Differential Revision: D64136277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646 Approved by: https://github.com/angelayi	2024-10-12 15:53:52 +00:00
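The `key@call_index` naming described in the commit above can be illustrated with a tiny parser. This is a sketch under stated assumptions only: the helper `split_call_index`, the example keys, and the choice to treat a key without a suffix as call 0 are all hypothetical and are not taken from the PyTorch source.

```python
from typing import Tuple


def split_call_index(stack_key: str) -> Tuple[str, int]:
    """Split a hypothetical 'key@call_index' string into (key, call_index).

    Assumption for illustration: a key with no '@' suffix is call 0.
    """
    base, sep, index = stack_key.rpartition("@")
    if not sep:
        return stack_key, 0
    return base, int(index)


# Two successive calls to the same module now carry distinct keys, so the
# boundary between them stays visible even when they are adjacent.
first_call = split_call_index("L__self___layer")
second_call = split_call_index("L__self___layer@1")
```

With distinct keys per call, adjacent calls to one module no longer look like a single call.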
yanbing-j	561f07fae7	Warn users of mkldnn device usage (#137553 ) In https://github.com/pytorch/pytorch/issues/136831, a user creates a tensor on the mkldnn device, but mkldnn is no longer used as a device type; only the mkldnn layout is used. We plan to remove mkldnn device related code in a future release. This PR warns users not to use the mkldnn device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137553 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-10-12 13:42:12 +00:00
Li, Xingyuan	0dbbcfa7ae	[Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947 ) [Inductor UT] Generalize Newly introduced inductor UTs for intel GPU reuse `test/inductor/test_pattern_matcher.py` reuse `test/inductor/test_snode_runtime.py` reuse `test/inductor/test_unbacked_symints.py` fix `test/inductor/test_triton_kernels.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel	2024-10-12 13:21:20 +00:00
Yukio Siraichi	030ba03681	Add meta functions for `lerp`, `addcmul`, and `addcdiv`. (#136909 ) This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their respective inplace versions). These functions only had refs implementations, which was the root cause of a significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA backend. Running the meta functions resulted in the following improvements: - `lerp` calls: 1,550ms to 140ms (10x) - `addcdiv` calls: 640ms to 350ms (1.8x) - `addcmul` calls: 620ms to 300ms (2.05x) [1]: https://github.com/pytorch/xla/issues/7923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909 Approved by: https://github.com/jansel	2024-10-12 12:40:46 +00:00
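For readers unfamiliar with the three operators named in the commit above, their elementwise arithmetic can be written for plain Python floats. This is only a scalar sketch of the math; it is not the meta functions added by the PR, which compute output metadata rather than values, and the parameter names here are illustrative.

```python
def lerp(start: float, end: float, weight: float) -> float:
    """Linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)


def addcmul(inp: float, t1: float, t2: float, value: float = 1.0) -> float:
    """inp + value * t1 * t2."""
    return inp + value * t1 * t2


def addcdiv(inp: float, t1: float, t2: float, value: float = 1.0) -> float:
    """inp + value * t1 / t2."""
    return inp + value * t1 / t2


interpolated = lerp(0.0, 10.0, 0.25)
fused_mul = addcmul(1.0, 2.0, 3.0, value=0.5)
fused_div = addcdiv(1.0, 6.0, 3.0, value=0.5)
```

An Adam-style optimizer step is built from exactly these fused update patterns, which is consistent with the commit's observation that their overhead showed up in the `AdamW` step.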
Jovian Anthony Jaison	6001b16597	Add entire _dynamo.config as a json for logging (#137216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137216 Approved by: https://github.com/ezyang Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-12 11:48:59 +00:00
Angel Yang	a777dea3b3	Remove dtype check on meta device (#136774 ) Summary: # Latest Update This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190. --------------------------------- # Background T176105639 \| case \| embedding bag weight \| per_sample_weight \| fbgemm lookup \| forward in meta \| \| A \| fp32 \| fp32 \| good \| good \| \| B \| fp16 \| fp32 \| good\| failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype \| \| C \| fp16 \| fp16 \| P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" \| good \| \| D \| fp32 \| fp16 \| N/A \| N/A \| Currently we are in case A. Users need to add `use_fp32_embedding` in training to force embedding bag dtype to be fp32. However, users actually hope for case B to use fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they would fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype ` in meta_registration. The check is actually not necessary, because the backend fbgemm does support case B. Additionally, later on in the `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be in the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`. # This diff Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in meta forward. With this change, users are able to use fp16 as the emb bag dtype without the need to force per_sample_weights to the same dtype in meta forward (see Test Plan). # Reference diffs to resolve this issue Diff 1: D52591217 This passes embedding bag dtype to feature_processor to make per_sample_weights same dtype as emb bag weight. However, `is_meta` also needs to be passed because of case C. 
fbgemm still does not support per_sample_weights = fp16 (see the above table). Therefore users are forced to only make per_sample_weights fp16 when it is on meta. The solution requires too many hacks. Diff 2: D53232739 Basically doing the same thing in diff 1 D52591217, except that the hack is added in TorchRec library. This adds an if in EBC and PEA for: when emb bag weight is fp16, it forces per_sample_weight fp16 too. However, it would then result in fbgemm issue too and has broken a bunch of prod models. Test Plan: # APS The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device: ``` buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false ``` Result: {F1461463993} Reviewed By: ezyang Differential Revision: D54175438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774 Approved by: https://github.com/ezyang	2024-10-12 05:45:21 +00:00
Huy Do	92cc319120	Fix masked tensor test_stack memory leak (#137815 ) This test is currently failing in trunk when memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know much about the internal details here to go further. So, let's just fix the test first. ### Testing ``` PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda ``` passes and doesn't warn about memory leak anymore. The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815 Approved by: https://github.com/kit1980	2024-10-12 04:30:57 +00:00
Jez Ng	c8609cf4b0	[inductor] Update Triton CPU pin (#137778 ) This incorporates the fix in https://github.com/triton-lang/triton/pull/4871. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137778 Approved by: https://github.com/Skylion007	2024-10-12 03:09:09 +00:00
Eddie Yan	d52b2cf92f	[CUDA][SDPA] Fix TF32 handling and bump threshold for multiheadattention test (#137752 ) For sm90, main issue was that `torch.testing.assert_close` bypasses the `tf32_on_and_off` tolerance switch decorator Pull Request resolved: https://github.com/pytorch/pytorch/pull/137752 Approved by: https://github.com/ezyang	2024-10-12 03:05:21 +00:00
Haifeng Jin	2db3f85894	Fixes NumPy 2 test failures in test_torch.py (#137740 ) Related to #107302 The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2. This PR fixes all the corresponding test failures in `test_torch.py`. 1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor. NumPy 1: Return value of `np.float64`. NumPy 2: Return value of `np.float32`. Solution: Enforce it with `.astype(np.float64)`. 2. The type of the return value of `np.gradient()` when returning multiple arrays. NumPy 1: A list of arrays. NumPy 2: A tuple of arrays. Solution: Cast the tuple to a list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740 Approved by: https://github.com/ezyang	2024-10-12 02:40:17 +00:00
eqy	6be53d52c5	[CUDA][SDPA] Bump tolerances for `grad_query` in mem_eff test (#137750 ) (for sm80) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137750 Approved by: https://github.com/drisspg	2024-10-12 02:15:14 +00:00
Valentine233	67883e70c0	change GPT2ForSequenceClassification inference accuracy tolerance (#136749 ) Fixes https://github.com/pytorch/pytorch/issues/123503. https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit the SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance from 4e-3 to 5e-3 to make the check pass. Note that the issue is due to small implementation differences. For example, the sdpa math backend scales q, k before matmul for stability; the flash attention backend differs more because it is a new algorithm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-10-12 01:12:28 +00:00
Gufan Yin	fba2c0a23a	Fix comment in ProcessGroupGloo (#137746 ) Summary: Algorithm caching was removed in 2018 D13111781 Test Plan: CI Differential Revision: D64214527 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137746 Approved by: https://github.com/Skylion007, https://github.com/wz337	2024-10-12 01:04:41 +00:00
Jean Schmidt	69bcf1035e	Updates reference to _runner-determinator.yml workflow, from current version to main version. (#137791 ) Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version. This enables the team to push updates to this workflow, like fixing bugs or pushing improvements, and have them immediately reflected on all open PRs. This avoids potentially breaking situations, lets the team move fast, and makes recovery from bugs fast and simple. From: ``` jobs: get-label-type: uses: ./.github/workflows/_runner-determinator.yml ``` To: ``` jobs: get-label-type: uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791 Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro	2024-10-12 00:18:50 +00:00
Andrew Gu	e269a5cb09	[TCPStore] Throw value error if passing `world_size=0` to TCPStore (#137792 ) This fixes https://github.com/pytorch/pytorch/issues/137577. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137792 Approved by: https://github.com/fegin, https://github.com/H-Huang ghstack dependencies: #137713, #137721	2024-10-11 23:42:57 +00:00
cyyever	25ac5652d0	[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328 ) Follows #124485 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328 Approved by: https://github.com/eqy	2024-10-11 23:23:57 +00:00
Shivam Raikundalia	8486d3df69	[Profiler] Hide ProfilerStep Alignment behind Experimental Config (#137668 ) Summary: Aligning the ProfilerStep# annotation can be useful for visual purposes, but it causes downstream tools like HTA to misreport how long each step took. For this reason, let's give users the option to turn on this alignment manually, while keeping it off by default. Test Plan: Alignment off: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_11_48.2543945.pt.trace.json.gz&bucket=gpu_traces Alignment on: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_08_27.2518391.pt.trace.json.gz&bucket=gpu_traces Differential Revision: D64146115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137668 Approved by: https://github.com/aaronenyeshi	2024-10-11 22:57:05 +00:00
PyTorch MergeBot	0121d64aa9	Revert "[AOTI] Handle inplace output in ProxyExecutor (#137660 )" This reverts commit 573101aac3b1addc0a0b945ae09fe9be9034d3a9. Reverted https://github.com/pytorch/pytorch/pull/137660 on behalf of https://github.com/desertfire due to Fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/137660#issuecomment-2408213485))	2024-10-11 22:54:39 +00:00
PyTorch MergeBot	c58e5c4efa	Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534 )" This reverts commit b0da076f0cd5957c7fe55a58876f3b74babfc1b7. Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))	2024-10-11 22:50:58 +00:00
Will Constable	e3173d8725	[pipelining] Shape Inference (#136912 ) Performs shape inference at runtime using user-provided real tensors. - avoids the need for users to precompute shapes, which is difficult and error prone - lets us remove args from the PipelineStage ctor (in a later PR) - deprecates the existing inference helper in the PipelineStage constructor for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference. The current state as of this PR: - Users should not pass any input or output shapes into the PipelineStage ctor, and shape inference will run automatically - To override shape inference, they can continue to pass input/output args as previously. Currently, this does not add a barrier after shape inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add in a barrier (it comes at a cost, but only during the first step). Testing: - Removed input args from all PP test cases, thus exposing them all to shape inference. - Verified visually (nvidia-smi) that the torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912 Approved by: https://github.com/kwen2501, https://github.com/H-Huang	2024-10-11 22:49:00 +00:00
Shangdi Yu	432c3fe5af	Default to use training IR (#137804 ) Summary: Since capture_pre_autograd_graph is deprecated and will be deleted soon, we default this option to true. Test Plan: CI Reviewed By: tugsbayasgalan Differential Revision: D64254236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137804 Approved by: https://github.com/tugsbayasgalan	2024-10-11 22:34:28 +00:00
Jez Ng	c254901bdb	Have Triton custom extension test use privateuseone device (#137611 ) The original PR #122396 used the CPU device since at that point in time there was no actual Triton CPU backend. After #133408, this is no longer the case, so we now have multiple backends getting registered for the CPU. The test still works in OSS but fails internally due to different test runners initializing the backends in a different order. This PR doesn't actually end up fixing the test internally because cpp_extension -- needed to implement the privateuseone device -- isn't supported there, so we simply skip it instead. However, it still makes the OSS test independent of initialization order, which is good. Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611 Approved by: https://github.com/henrylhtsang	2024-10-11 21:27:29 +00:00
Bilal Khan	19bbbef79d	cublaslt autotuning support for TunableOp (#133896 ) Adds support for cublaslt autotuning to TunableOp. Todo: - [x] Add and test `ScaledGemmTunableOp` - [x] Benchmarking numbers Pull Request resolved: https://github.com/pytorch/pytorch/pull/133896 Approved by: https://github.com/eqy, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-10-11 21:16:36 +00:00
PyTorch MergeBot	1358969fa1	Revert "BundledAutotuneCache (#134959 )" This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a. Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))	2024-10-11 20:43:56 +00:00
Artemiy Bulavin	74e871355b	Add hooks to Scheduler nodes for generating device-specific debug strings (#135015 ) Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015 Approved by: https://github.com/jansel	2024-10-11 20:30:49 +00:00
eellison	8543000c27	Search through config changes in compiler bisector (#137346 ) Follow up to https://github.com/pytorch/pytorch/pull/131936. In the original bisector you'd have to test inline whether we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as initial ones. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346 Approved by: https://github.com/zou3519	2024-10-11 20:24:54 +00:00
Ryan Landay	513563eb09	Fix stack named "queue" in Util::ComputePostOrder (#130526 ) This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack. This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`. Note: this also changes the backing store from an `std::vector` to an `std::deque`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526 Approved by: https://github.com/alanwaketan, https://github.com/malfet	2024-10-11 20:21:07 +00:00
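As a rough illustration of the technique described in the commit above (#130526), the following Python sketch computes a DFS post-order with an explicit stack instead of recursion. It is not the C++ code from `Util::ComputePostOrder`; the function name `compute_post_order` and the adjacency-dict graph representation are assumptions made for this example.

```python
def compute_post_order(root, successors):
    """Return the nodes reachable from `root` in DFS post-order.

    `successors` maps each node to an iterable of its children
    (a hypothetical representation chosen for this sketch).
    """
    post_order = []
    emitted = set()   # nodes already appended to post_order
    expanded = set()  # nodes whose children have been pushed but not yet emitted
    stack = [root]    # used strictly LIFO, i.e. a stack and not a queue

    while stack:
        node = stack[-1]
        if node in emitted:
            # Duplicate stack entry for a node that was finished earlier.
            stack.pop()
        elif node in expanded:
            # All children are done, so the node itself can be emitted.
            stack.pop()
            expanded.discard(node)
            emitted.add(node)
            post_order.append(node)
        else:
            expanded.add(node)
            for child in successors.get(node, ()):
                if child in expanded:
                    raise ValueError(f"cycle detected at {child!r}")
                if child not in emitted:
                    stack.append(child)
    return post_order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order = compute_post_order("a", graph)
```

Reversing `order` gives a topological order of the reachable nodes, since every node is emitted only after all of its successors.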
Justin Chu	d0628a7e39	[ONNX] Remove deprecated export_to_pretty_string (#137790 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790 Approved by: https://github.com/titaiwangms ghstack dependencies: #137789	2024-10-11 20:10:04 +00:00
Tugsbayasgalan Manlaibaatar	5fca2fd365	Try unify training and inference (#136888 ) Previously inference -> inference IR was going through a separate flow from train -> inference decomposition. This diff unifies them so that we always retrace when decomposing. Joint IR decomp is still going through the old flow (inference -> inference) but that seems ok for now since it is still in an experimental stage. Differential Revision: [D63062521](https://our.internmc.facebook.com/intern/diff/D63062521/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136888 Approved by: https://github.com/avikchaudhuri	2024-10-11 20:09:58 +00:00
Justin Chu	3e0b83ad1f	[ONNX] Remove ExportTypes (#137789 ) Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789 Approved by: https://github.com/titaiwangms	2024-10-11 19:29:52 +00:00
Sergii Dymchenko	460358a20f	Run lint-autoformat only on PRs to main (#137802 ) This is mostly to prevent showing up on ghstack PRs, with which code suggestions are not compatible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137802 Approved by: https://github.com/huydhn	2024-10-11 19:25:34 +00:00
Jean Schmidt	2cb983ab97	[CI] Adds support for selecting experiments for workflows on runner determinator (#137614 ) adds a `default` tag to experiment configurations, allowing to remove some experiments by default on the random draw: ``` experiments: lf: rollout_perc: 25 otherExp: rollout_perc: 25 default: false --- ``` and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated): ``` get-test-label-type: name: get-test-label-type uses: ./.github/workflows/_runner-determinator.yml with: ... check_experiments: "awsa100" ``` The end goal, is to enable us to run multiple experiments, that are independent from one another. For example, while we still runs the LF infra experiment, we want to migrate other runners leveraging the current solution. A immediate UC is for the A100 instances, where we want to migrate to AWS. Those new instances will during the migration period be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first confusing one. ``` jobs: get-build-label-type: name: get-build-label-type uses: ./.github/workflows/_runner-determinator.yml with: ... get-test-label-type: name: get-test-label-type uses: ./.github/workflows/_runner-determinator.yml with: ... check_experiments: "awsa100" linux-focal-cuda12_1-py3_10-gcc9-inductor-build: name: cuda12.1-py3.10-gcc9-sm80 uses: ./.github/workflows/_linux-build.yml needs: - get-build-label-type - get-test-label-type with: runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}" ... test-matrix: \| { include: [ { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" }, ... ]} ... ``` ``` experiments: lf: rollout_perc: 50 awsa100: rollout_perc: 50 default: false ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614 Approved by: https://github.com/malfet	2024-10-11 19:20:02 +00:00
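The selection rules described in the commit above (#137614) can be sketched roughly as follows. This is a simplified illustration and not the actual runner determinator code: the function name `select_experiments`, the dict layout of the settings, and the exact comparison used for the random draw are assumptions. Only the `rollout_perc`, `default`, and `check_experiments` names come from the commit message.

```python
import random


def select_experiments(experiments, check_experiments="", rng=random):
    """Pick the experiments that are enabled for one workflow run.

    Assumed rules for this sketch:
    - if `check_experiments` (comma separated) is given, only those
      experiments are considered;
    - otherwise experiments tagged `default: false` are skipped;
    - each remaining experiment is enabled with probability rollout_perc %.
    """
    requested = [name.strip() for name in check_experiments.split(",") if name.strip()]
    enabled = []
    for name, settings in experiments.items():
        if requested:
            if name not in requested:
                continue
        elif not settings.get("default", True):
            continue
        if rng.uniform(0, 100) < settings.get("rollout_perc", 0):
            enabled.append(name)
    return enabled


# Rollout set to 100% so the outcome of this example is deterministic.
config = {
    "lf": {"rollout_perc": 100},
    "awsa100": {"rollout_perc": 100, "default": False},
}
default_draw = select_experiments(config)
a100_draw = select_experiments(config, check_experiments="awsa100")
```

Under these assumptions a workflow that does not set `check_experiments` never sees `awsa100`, while a workflow that asks for it explicitly is evaluated against that experiment only.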
Aaron Orenstein	709021143d	BundledAutotuneCache (#134959 ) Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later. Various related configs: env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE config: bundled_autotune_remote_cache jk: pytorch/remote_cache:bundled_autotune_remote_cache_version Testing: Manually tested w/ EMU: ``` cd fbcode/accelerators/workloads/models/emu_flash/v1p4 make build_benchmark_model && make save_model_to_path make test_pt2_latency ``` - on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss. - perf seems a little better - for 8 runs: - no bundled cache averaged 14m11s - bundled cache averaged 14m6s - 125ms saved per cache entry seems reasonable Cache Metrics for an sample run: no bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0} FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0} backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0} backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0} ``` bundled cache: ``` INFO: Cache Metrics: FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0} FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0} FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0} LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0} backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0} backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0} backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, 
exception: 0} ``` Differential Revision: D60677499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959 Approved by: https://github.com/oulgen	2024-10-11 19:12:41 +00:00
chilli	b82000c1b3	Removed _compile workaround for create_block_mask (#137477 ) I also put in a change for supporting `create_block_mask` to properly handle non-multiples of BLOCK_SIZE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137477 Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng	2024-10-11 19:04:23 +00:00
Jason Ansel	2dcd69da50	[inductor] Delete dead code and lints (#137753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137753 Approved by: https://github.com/Chillee	2024-10-11 18:55:08 +00:00
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
Animesh Jain	04adb74d08	[inductor][cond] Remove redundant prefix (#137718 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137718 Approved by: https://github.com/eellison ghstack dependencies: #137200	2024-10-11 18:13:18 +00:00
Animesh Jain	cd02c85ba4	[inductor][subgraph][python-wrapper] Lift subgraph code into functions (#137200 ) Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code. This is the output code for test `test_cond_reintepret_view_inputs_outputs` Before this PR - https://www.internalfb.com/intern/paste/P1632948905/ With this PR - https://www.internalfb.com/intern/paste/P1632946348/ A relevant snippet from the above paste is ~~~ def false_graph_0(args): false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args args.clear() s0 = s0 with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32) false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32) # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add] triton_poi_fused_add_sub_1_xnumel = (-20) + (20s0) stream0 = get_raw_stream(0) triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0) del false_graph_0_arg0_1 del false_graph_0_arg1_1 return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), ) async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1 = args args.clear() s0 = arg0_1 assert_size_stride(arg1_1, (s0, 20), (20, 1)) assert_size_stride(arg2_1, (s0, 20), (20, 1)) assert_size_stride(arg3_1, (), ()) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) buf0 = [None] 2 buf0 = [None] * 2 if arg3_1.item(): # subgraph: true_graph_0 true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0) true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0) (true_graph_0_buf0, 
true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0]) buf0[0] = true_graph_0_buf0 buf0[1] = true_graph_0_buf1 else: # subgraph: false_graph_0 false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0) false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0) (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0]) buf0[0] = false_graph_0_buf0 buf0[1] = false_graph_0_buf1 del arg1_1 del arg2_1 del arg3_1 buf1 = buf0[0] buf2 = buf0[1] del buf0 return (buf1, buf2, ) ~~~ The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper. Note that this PR only works for python wrapper. For cpp wrapper, we need a lot of refactor to ensure that we don't duplicate the global variables in the outpute_code. So, for now, I fallback to the old way of inlining for cpp wrapper. I am hoping someone with more familiarity with cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov). This work will unblock hierarchical compilation (or cold start compile time work). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200 Approved by: https://github.com/desertfire, https://github.com/eellison	2024-10-11 17:57:10 +00:00
Nikita Shulga	68272ab596	Extend cuda_flip to unsigned types (#137781 ) Using AT_DISPATCH_V2 Test plan: `python3 -c "import torch;print(torch.randint(0, 100, (4, 4), dtype=torch.uint16, device='cuda').flip(0))"` Fixes https://github.com/pytorch/pytorch/issues/137770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137781 Approved by: https://github.com/Skylion007	2024-10-11 17:02:53 +00:00
Nichols A. Romero	4fa46d3bda	TunableOp: Performance Improvement (#135371 ) This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures. Instead we use the fmt library. TunableOp overhead on the CPU is reduced by around 40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371 Approved by: https://github.com/jeffdaily	2024-10-11 16:52:40 +00:00
Jeff Daily	da578495ca	[ROCm] enable gfx110x for hipblaslt (#137317 ) Fixes #136347. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137317 Approved by: https://github.com/Skylion007, https://github.com/jithunnair-amd Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-11 16:51:31 +00:00
James Wu	41ccfc8752	Log chromium event for automatic dynamic reasons (#137491 ) Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally. Run following code: ``` import torch @torch.compile(backend="eager") def foo(t, x): return t.sin() + x torch._dynamo.config.automatic_dynamic_shapes = True torch._dynamo.config.assume_static_by_default = True # Change size x = torch.randn([1,2]) foo(x, 2) x = torch.randn([2,2]) foo(x, 2) torch._dynamo.reset() # Change dimensionality x = torch.randn([1,2]) foo(x, 2) x = torch.randn([1,2,3]) foo(x, 2) torch._dynamo.reset() # Change stride x = torch.randn([3,3]) foo(x, 2) x = torch.as_strided(x, [3,3], [2,2]) foo(x, 2) torch._dynamo.reset() # Change scalar x = torch.randn([1,2]) foo(x, 2) foo(x, 3) ``` Internal link to perfetto: https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key The events look like this: <img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab"> <img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656"> <img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba"> <img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc"> Differential Revision: [D64184625](https://our.internmc.facebook.com/intern/diff/D64184625) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491 Approved by: https://github.com/Skylion007, https://github.com/oulgen	2024-10-11 16:50:25 +00:00
Laith Sakka	a06d49a9f9	bump up add_loop_inductor_gpu expected instruction count. (#137672 ) The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu, but not enough to fail in that diff; the test is now somewhat flaky. It failed on a recent merge: ![Screenshot 2024-10-09 at 5 25 57 PM](https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611) and here is the history: ![Screenshot 2024-10-09 at 5 26 07 PM](https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09) ![Screenshot 2024-10-09 at 5 30 19 PM](https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672 Approved by: https://github.com/masnesral	2024-10-11 16:46:38 +00:00
Aaron Gokaslan	d41558f8d7	[BE][Ez]: Better error message for CUDNN attention attn_bias (#137702 ) Follow up to #136885. Fixes an edge case in the error condition: it should exit early so that expand doesn't ever run into trouble with unusual cases (attn_bias with 0, 1, or > 5 dims). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137702 Approved by: https://github.com/eqy	2024-10-11 16:44:46 +00:00
Andrew Gu	5835b1af10	[FSDP2] Gated dynamo import for torch deploy (#137203 ) Differential Revision: [D63777335](https://our.internmc.facebook.com/intern/diff/D63777335) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137203 Approved by: https://github.com/wz337	2024-10-11 16:38:19 +00:00
Andrew Gu	bdb42e7c94	[PGNCCL] Added some missing spaces in barrier msg (#137721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137721 Approved by: https://github.com/kwen2501 ghstack dependencies: #137713	2024-10-11 15:17:25 +00:00
Andrew Gu	39c5048549	[DeviceMesh] Fixed `from_group` when passing tensor `mesh` (#137713 ) This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713 Approved by: https://github.com/wz337	2024-10-11 14:53:51 +00:00
Jiong Gong	e30c55ee52	Update maintainers for inductor and x86 CPU (#136839 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136839 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet	2024-10-11 07:24:07 +00:00
drisspg	1c71de5b2c	[ScaleMM] Add a shape dependent max_swizzle size (#137681 ) # Summary I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling. There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36 It clearly showed that there was some room for improvement on larger problem sizes compared to triton's performance. Note that the triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape which we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes. This led to seeing if we can improve our triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517). In the meantime @Chillee suggested I make sure swizzling is set for the large matmul shapes. This PR makes sure that we increase the max_swizzle_size for the large matmuls. ## Performance Note: red means the triton-based TMA kernel beats _scaled_mm; blue means _scaled_mm is faster. On Nightly w/ Triton at (2ef33c6c4c3) ![swizzle_tst_8_full_nightly_heatmaps](https://github.com/user-attachments/assets/e92af19b-4e79-4126-b9d0-da039da5363b) You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA. After this PR: ![swizzle_tst_8_full_heatmaps](https://github.com/user-attachments/assets/472068b3-45c2-43f8-84d3-b116da7898d5) For example w/ this change (power limited gpu) M=16384 K=16384 N=16384 TFlops Before: `985.49` TFlops After: `1304.69` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681 Approved by: https://github.com/eqy	2024-10-11 06:44:31 +00:00
Xia, Weiwen	4e309899c7	[Quant] Check stride > 0 for QConv and QConvTranspose (#136739 ) Fixes #136722 Fixes #136718 By default, it goes to onednn. So this PR adds a check to ensure stride > 0. Now program will quit with an error message if stride is 0. FBGEMM and QNNPACK can create modules with stride=0 without error but program crashes when calling forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739 Approved by: https://github.com/jgong5	2024-10-11 05:50:37 +00:00
Ke Wen	fe148024fe	[c10d][experimental] Add _abort_process_group (#132291 ) Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recover from them). One can find an example from: https://github.com/NVIDIA/nccl/issues/1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It adds support for "group abort" semantics, which can abort multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). The "group abort" semantics were added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment with the "group abort" feature through a manual API first and then extend it to the automatic mode (watchdog). Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291 Approved by: https://github.com/eqy	2024-10-11 05:04:17 +00:00
Tugsbayasgalan Manlaibaatar	bc232e3c08	Fix custom op bug of clearing dir (#137655 ) Previously when we delete a custom op out of context manager, we weren't clearing the dir field of the op namespace. As a result, it was polluting other tests. Differential Revision: [D64141465](https://our.internmc.facebook.com/intern/diff/D64141465/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137655 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2024-10-11 04:32:40 +00:00
Alexander Zinoviev	ee713f80ed	Enable channels_last format for FSDP (#137382 ) Enable FSDP to deal with channels_last memory formatted tensors. Preserving channels_last memory format makes FSDP compatible with the best kernels CUDNN offers. Summary of changes: 1) Store strides information along with shapes 2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening 3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382 Approved by: https://github.com/awgu	2024-10-11 03:47:16 +00:00
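To make the flattening idea in the commit above (#137382) concrete, here is a small pure-Python model of it. It does not use PyTorch and is not the FSDP implementation; storage is modelled as a plain list and the helper names (`channels_last_strides`, `offset`) are assumptions for this sketch. It shows why a stride-1 view over the underlying storage, together with the saved sizes and strides, preserves a channels_last layout, whereas a logical row-major flatten reorders the elements.

```python
from itertools import product


def channels_last_strides(shape):
    """Strides of an (N, C, H, W) tensor stored in channels_last (NHWC) order."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Storage offset of a logical index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


shape = (1, 2, 2, 2)
strides = channels_last_strides(shape)  # (8, 1, 4, 2)
indices = list(product(*(range(d) for d in shape)))

# Logical tensor contents, keyed by (n, c, h, w).
logical = {idx: 10 * idx[1] + 2 * idx[2] + idx[3] for idx in indices}

# Lay the values out in storage according to the channels_last strides.
storage = [None] * len(indices)
for idx, value in logical.items():
    storage[offset(idx, strides)] = value

# "Flatten" as a stride-1 view of length numel: this is just the storage order.
flat_strided = [storage[i] for i in range(len(storage))]

# "Unflatten" with the stored sizes and strides: the logical tensor comes back
# unchanged and still refers to the same channels_last layout.
restored = {idx: flat_strided[offset(idx, strides)] for idx in indices}

# A logical (row-major) flatten visits elements in NCHW order instead.
flat_row_major = [logical[idx] for idx in indices]
```

In this model `flat_strided` is `[0, 10, 1, 11, 2, 12, 3, 13]` (channels interleaved), while `flat_row_major` is `[0, 1, 2, 3, 10, 11, 12, 13]`; only the former round-trips without changing the memory format.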
Avik Chaudhuri	8ee361ed13	fix test_retrace_pre_autograd (#137733 ) Test Plan: fixed Differential Revision: D64200918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137733 Approved by: https://github.com/pianpwk, https://github.com/tugsbayasgalan	2024-10-11 03:46:22 +00:00
xinan.lin	8321eec009	[Inductor UT] Generalize device bias code in test_triton_kernels.py (#137585 ) [Inductor UT] Generalize device bias code in test_triton_kernels.py introduced by #137020 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137585 Approved by: https://github.com/eellison, https://github.com/jansel	2024-10-11 02:00:01 +00:00
Avik Chaudhuri	8262f6d271	fix test_lazy_module_kwargs (#137705 ) Test Plan: fixed Differential Revision: D64185644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137705 Approved by: https://github.com/tugsbayasgalan	2024-10-11 01:53:10 +00:00
Shangdi Yu	9d4cb0d3eb	Fix param and buffer mapping for state_dict when there are state_dict hooks (#137609 ) Resolve #137540 Summary: We might get different state_dict and named_parameters result when the module has registered custom state_dict_hooks. For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks. To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict(). Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically. nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy. One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". In runtime, we have mod.layers[3].layer.sa_norm.scale. But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"]. In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy. Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly. 
In internal Ads and RecSys model deployment, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict. The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks. The new requirement is that the FQNs match between the exported program’s state dict and the state_dict obtained from the `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state. If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so the change is backward compatible (BC). Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_for_training_with_state_dict_hooks ``` Differential Revision: D64080561 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609 Approved by: https://github.com/angelayi, https://github.com/pianpwk	2024-10-11 01:33:50 +00:00
Richard Barnes	a919742149	c10::optional -> std::optional in PyTorch (#137333 ) Test Plan: Sandcastle Differential Revision: D63876535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137333 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-10-11 00:16:10 +00:00
PyTorch MergeBot	4fb1fd8a51	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit b6a64dce072240c0b06d2fb03ac81b3ed1b73d92. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))	2024-10-10 23:47:25 +00:00
PyTorch MergeBot	b55ff476bd	Revert "[Distributed] Fix extra context on device 0 (#135273 )" This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104. Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))	2024-10-10 23:47:25 +00:00
Bin Bao	b0da076f0c	[AOTI] Turn on the ABI-compatible mode as default (#136534 ) Summary: Make AOTI generate ABI-compatible code as default for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534 Approved by: https://github.com/chenyang78 ghstack dependencies: #137660	2024-10-10 23:44:57 +00:00
Nikita Shulga	ad38bad766	[MPS] Add `tri[lu]_indices` (#137648 ) Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980 Copy-n-paste kernel implementation from `13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)` though use `float` instead of `double` for square root computation Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #137601, #137647	2024-10-10 23:41:06 +00:00
Bin Bao	573101aac3	[AOTI] Handle inplace output in ProxyExecutor (#137660 ) Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660 Approved by: https://github.com/chenyang78	2024-10-10 23:12:46 +00:00
Justin Chu	c37bb492da	[ONNX] Create an `optimize` method in ONNXProgram (#137667 ) Move optimization from the export call to the `optimize()` method in ONNXProgram. Users can call `optimize()` before calling `save()` to save the model. Right now if users set `optimize=True` in `torch.onnx.export` it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target aware etc.) Example ```python onnx_program = torch.onnx.export(..., dynamo=True) onnx_program.optimize() onnx_program.save("model.onnx") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667 Approved by: https://github.com/titaiwangms ghstack dependencies: #137666	2024-10-10 22:44:19 +00:00
Justin Chu	e75984cd31	[ONNX] Use torch_2_6 apis from onnxscript (#137666 ) Create an `optimize=False` option in `torch.onnx.export` for model optimization Pull Request resolved: https://github.com/pytorch/pytorch/pull/137666 Approved by: https://github.com/titaiwangms	2024-10-10 22:23:15 +00:00
William Wen	93bbc8abcc	[dynamo, 3.13] use 3.13 multiline traceback in get_instruction_source_311 (#137617 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137617 Approved by: https://github.com/jansel	2024-10-10 20:19:27 +00:00
William Wen	4551a1ee79	[dynamo, 3.13] merge 3.13 FORMAT_* and <=3.12 FORMAT_VALUE (#137656 ) This was causing some 3.13 failures locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137656 Approved by: https://github.com/jansel, https://github.com/Skylion007 ghstack dependencies: #137652	2024-10-10 19:53:42 +00:00
William Wen	6b2c3508f8	[dynamo, 3.13] fix typo in remove_fused_load_store (#137652 ) Whoops! Pull Request resolved: https://github.com/pytorch/pytorch/pull/137652 Approved by: https://github.com/jansel, https://github.com/Skylion007	2024-10-10 19:53:42 +00:00
Scott Wolchok	9c12198137	[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel, Try #2 (#137377 ) ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was https://github.com/pytorch/pytorch/pull/136331 . Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137377 Approved by: https://github.com/malfet	2024-10-10 19:44:22 +00:00
Richeek Das	080f02ac7a	[dynamo] do not raise an unimplemented error with boolean masking setitem (#134902 ) Cudagraph breaks on boolean masking setitem, however the code runs fine. There is no need to raise an unimplemented error here, since it already warns that it's an incompatible op. Fixes #134241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134902 Approved by: https://github.com/jansel, https://github.com/henrylhtsang	2024-10-10 19:11:40 +00:00
PyTorch MergeBot	079f909263	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit be0b75256a7e516217b059ef273901b95c022fe7. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))	2024-10-10 18:32:17 +00:00
PyTorch MergeBot	33e5921e6b	Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526 )" This reverts commit 72ad1b8c6c7c364c1974b82a914876dcdf73af44. Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))	2024-10-10 18:32:16 +00:00
eellison	881a18f25f	Set Cuda context in inductor and dont initialize wrong cuda device in fake_tensor (#137603 ) Previously we would construct tensors with "cuda" device, which defaults to device:0 if no cuda context is set. Fix for https://github.com/pytorch/pytorch/issues/124854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137603 Approved by: https://github.com/jansel	2024-10-10 18:25:22 +00:00
Ryan Guo	dd7c2899bd	[dynamo] Properly prune dead cell local variables (#136891 ) This patch updates the `prune_dead_locals` logic to do slightly more aggressive pruning for cell local variables, in absence of side-effects, e.g., a cell variable can be pruned when its user function(s) will never be used again. See added tests for examples; note that a few tests in `test/dynamo/test_higher_order_ops.py` also got updated because we are no longer returning the unnecessary graph output. Fixes #127350, #124653 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136891 Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519	2024-10-10 18:21:24 +00:00
Haifeng Jin	bcfdb72547	Fix dtype test for NumPy 2 (#137532 ) Related to #107302 The following test fails with NumPy 2. ``` _________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________ Traceback (most recent call last): File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface wrapped_x = np.array([1, -2, 3, -4], dtype=dtype) OverflowError: Python integer -2 out of bounds for uint8 To execute this test, run the following from the base repo dir: python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` According to the official warning from NumPy 1, assigning a negative value to a `uint8` is deprecated. The recommended way is to use `np.array([1, -2, 3, -4]).astype(np.uint8)`. See the following for details. ``` >>> np.array([1, -2, 3, -4], dtype=np.uint8) <stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -2 to uint8 will fail in the future. For the old behavior, usually: np.array(value).astype(dtype) will give the desired result (the cast overflows). <stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -4 to uint8 will fail in the future. For the old behavior, usually: np.array(value).astype(dtype) will give the desired result (the cast overflows). array([ 1, 254, 3, 252], dtype=uint8) ``` This PR fixes the test failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532 Approved by: https://github.com/soulitzer	2024-10-10 18:12:25 +00:00
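The wrapped values in the NumPy output quoted in #137532 (254 and 252) follow from modulo-256 arithmetic. The sketch below reproduces them with the standard library only; `wrap_to_uint8` is a hypothetical helper that mirrors what the overflowing cast yields for these inputs, not a NumPy function.

```python
def wrap_to_uint8(values):
    """Reduce each integer modulo 2**8, as an overflowing cast to uint8 does."""
    return [value % 256 for value in values]


wrapped = wrap_to_uint8([1, -2, 3, -4])
print(wrapped)  # [1, 254, 3, 252]
```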
Menglu Yu	5e73f2d7c0	[PT2][Dynamo][Optimus] Add batch detach, clamp and nan_to_num in pre grad (#137415 ) Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=4 OC_CAUSE=1 buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_math_op_fusion ``` Buck UI: https://www.internalfb.com/buck2/185799e1-6ea8-4bd1-b2e1-0c1a8dd92f89 Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275044114335 Network: Up: 14KiB Down: 287B (reSessionID-d24cee56-2a22-4a90-b4c6-1d0c3ab256c1) Jobs completed: 8. Time elapsed: 48.8s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # local reproduce ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt scripts/shuaiyang:test -- --optimus --flow_id 648108097 2>&1 \| tee ~/local_run_shuai_interformer_cmf.txt ``` Counter({'pattern_matcher_nodes': 6626, 'pattern_matcher_count': 6396, 'extern_calls': 5340, 'benchmarking.TritonBenchmarker.benchmark_gpu': 2710, 'normalization_pass': 44, 'fxgraph_cache_miss': 37, 'scmerge_split_removed': 16, 'scmerge_cat_removed': 16, 'unbind_stack_pass': 16, 'batch_aten_mul': 15, 'batch_linear_post_grad': 12, 'batch_linear': 5, 'batch_detach': 4, 'batch_nan_to_num': 4, 'batch_clamp': 4, 'batch_aten_add': 4, 'batch_layernorm': 2, 'scmerge_cat_added': 2, 'batch_sigmoid': 1, 'scmerge_split_sections_removed': 1, 'unbind_stack_to_slices_pass': 1, 'benchmarking.TritonBenchmarker.triton_do_bench': 1, 'scmerge_split_added': 1, 'fxgraph_cache_hit': 1, 'batch_aten_sub': 1}) https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-10-06-20-53-01/trace.json.gz&bucket=gpu_traces # e2e baseline: f650336422 proposal: f650336607 ### QPS and NE results {F1914975940} {F1914975938} {F1914975939} {F1914975945} > 0.7% QPS gain with NE neutral ### trace analysis Before {F1914990600} After {F1914990015} We reduced green part in the trace introduced by small nan_to_num kernels 
Differential Revision: D63962711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137415 Approved by: https://github.com/Yuzhen11	2024-10-10 18:11:08 +00:00
cyy	94e12f97dc	[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 ) Follows #137072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404 Approved by: https://github.com/Skylion007	2024-10-10 18:05:34 +00:00
hjhee	20815c7cb9	Intel GPU: mode: add XPU to supported devices list (#137575 ) Kernel for `mode` Op is being ported to https://github.com/intel/torch-xpu-ops/pull/770, this requires adding XPU to supported device type. Additional context: https://github.com/intel/torch-xpu-ops/issues/327 @fengyuan14 @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/137575 Approved by: https://github.com/EikanWang, https://github.com/malfet Co-authored-by: Feng Yuan <feng1.yuan@intel.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-10 17:44:40 +00:00
Ke Wen	cdd8fa98c7	[Distributed] Fix extra context on device 0 (#135273 ) This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: `1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)` When there is no "preset" CUDA context (which is the case for python garbage collector), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- that's where rank 1, 2, ... can create extra context on device 0! ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161	2024-10-10 17:16:34 +00:00
Colin Peppler	9690cacd61	[aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537 ) ### Context Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride. In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128). ``` # buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1)) buf812 = generate_example_value( (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0 ) ``` The fix is to apply the fallback on each symbol. ### Test ``` PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda ========= Invalid __global__ write of size 2 bytes ``` Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537 Approved by: https://github.com/jingsh	2024-10-10 17:11:24 +00:00
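The difference between applying the fallback to the whole expression and to each symbol in #137537 can be sketched in plain Python. This is a simplified model, not Inductor's implementation: an expression is a tuple of factors, a string is an unbacked symbol, and `FALLBACK`, `hint_whole_expr` and `hint_per_symbol` are hypothetical names. The value 8192 is taken from the example tensor in the commit message.

```python
from math import prod

FALLBACK = 8192  # assumed size hint substituted for an unbacked symbol


def hint_whole_expr(factors, fallback=FALLBACK):
    """Old behaviour in this model: any unbacked symbol collapses the whole product."""
    if any(isinstance(factor, str) for factor in factors):
        return fallback
    return prod(factors)


def hint_per_symbol(factors, fallback=FALLBACK):
    """Fixed behaviour in this model: substitute each symbol, then evaluate."""
    return prod(fallback if isinstance(factor, str) else factor for factor in factors)


stride0 = ("u0", 128)  # models the outermost stride u0 * 128

print(hint_whole_expr(stride0))  # 8192, too small for a (64, 8192, 128) tensor
print(hint_per_symbol(stride0))  # 1048576 == 8192 * 128
```

With the per-symbol hint, the outermost stride matches a contiguous layout for the example size `(64, 8192, 128)`, which is what avoids the out-of-bounds write.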
Ke Wen	b6a64dce07	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere	2024-10-10 17:11:21 +00:00
Oguz Ulgen	034af88c2d	Add a microbenchmark for cache read path (#137607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607 Approved by: https://github.com/jamesjwu	2024-10-10 16:36:18 +00:00
Nikita Shulga	dae60075e0	[BE][MPS] Use `Tensor`->`TensorBase` in OperationUtils.h (#137647 ) As for the most part those helper methods need access to only base class methods. Also replace spurious `at::` namespace prefixes, i.e. `at::Tensor`->`Tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137647 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #137601	2024-10-10 16:11:17 +00:00
Max Podkorytov	bcf15d1bb4	[AOTI] Add error check for parsing error string from error code (#137626 ) Currently, there are compilation warnings as below, which are resolved after the fix ``` /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’: /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result] 482 \| hipDrvGetErrorString(code, &msg); \ \| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~ /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’ 519 \| CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str())); \| ^~~~~~~~~~~~~~~~~ In file included from /opt/rocm/include/hip/hip_runtime.h:70, from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14, from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17, from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13, from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4: /opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char)’, declared here 2369 \| hipError_t hipDrvGetErrorString(hipError_t hipError, const char errorString); \| ^~~~~~~~~~~~~~~~~~~~ In file included from /opt/rocm/include/hip/hip_runtime.h:70, from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14, from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17, from 
/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13, from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4: /opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here 399 \| } hipError_t; \| ^~~~~~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2024-10-10 15:14:39 +00:00
Aditya Tewari	575f260229	Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672 ) Motivation Enable SVE vectorization with `torch.compile` Extends PR: #119571 * This PR enables vectorization for codegen part using SVE-256 (vec length) * The changes can be extended to other SVE vec lengths I've done some comparisons against existing NEON implementation with SVE vectorization enabled route for `torch.compile` Test results are for 8 cores on ARM Neoverse_V1 <img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2"> It's worth mentioning, for standalone `SiLU op` there's a `~1.8x` speedup with `torch.compile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-10-10 13:20:40 +00:00
Thanh Ha	479bd1f300	Hardlock frequent periodic jobs to Meta runners (#137616 ) The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners however we saw a sudden large spike in cost once that happened last week that would have caused us to over use our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616 Approved by: https://github.com/Skylion007, https://github.com/seemethere	2024-10-10 12:32:16 +00:00
PyTorch MergeBot	f69bf005f7	Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097 )" This reverts commit 4304c68a4c4d742a3ec5266b81f64a85922509c9. Reverted https://github.com/pytorch/pytorch/pull/137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](https://github.com/pytorch/pytorch/pull/137097#issuecomment-2404573266))	2024-10-10 09:29:05 +00:00
Xiaodong Wang	eea1f79a1d	[AMD] use rccl.h instead of rccl/rccl.h (#135472 ) Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the rocm rpm suite (the header is in include/rccl/rccl.h), however the source code is just src/rccl.h. Using the rccl/rccl.h will make us find the rpm's header but not the src code's header. Test Plan: buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10 w/o this diff, it'll show 2.18 nccl version Differential Revision: D62371434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472 Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa	2024-10-10 08:55:57 +00:00
Robert Hardwick	eaab5cf0f9	Fix torch.compile correctness bug on aarch64+sve due to gcc bug (#137606 ) Some unit tests were failing relating to argmin_vec/argmax_vec due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 Fixes #137597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137606 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-10 08:44:53 +00:00
Avik Chaudhuri	365722f606	fix test_constant_output (#137547 ) Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants. Test Plan: fixed test Differential Revision: D64081786 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547 Approved by: https://github.com/tugsbayasgalan	2024-10-10 07:48:15 +00:00
Jason Ansel	4e8997744c	[inductor] Enable coordinate descent tuning with max-autotune (#136867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136867 Approved by: https://github.com/Chillee	2024-10-10 07:29:52 +00:00
Kurt Mohler	383eba5229	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Fixes #75240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy	2024-10-10 06:59:08 +00:00
leslie-fang-intel	71010bf097	[Inductor][CPP] generalize the wgt tensor delete (#135101 ) Summary Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-10-10 06:01:09 +00:00
Yifu Wang	ea83c78174	[SymmetricMemory] set the storage_offset of tensors returned by get_buffer() to 0 (#137569 ) It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor `storage_offset`. This seems to be causing tensors returned by get_buffer() with non-0 offset to report wrong storage size. Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569 Approved by: https://github.com/weifengpy ghstack dependencies: #137567	2024-10-10 05:05:58 +00:00
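The unit mismatch suspected in #137569 can be shown with simple arithmetic. The sketch below is an assumption about the shape of the bug, not `TensorMaker`'s actual code: both functions and the example numbers (a 2-byte element type, 64 elements, an offset of 16 elements) are hypothetical.

```python
def storage_bytes_offset_as_elements(numel, storage_offset, itemsize):
    """Consistent accounting: the offset counts elements, like the tensor's own field."""
    return (storage_offset + numel) * itemsize


def storage_bytes_offset_as_bytes(numel, storage_offset, itemsize):
    """Mismatched accounting: the same number is added as if it were bytes."""
    return storage_offset + numel * itemsize


numel, storage_offset, itemsize = 64, 16, 2

consistent = storage_bytes_offset_as_elements(numel, storage_offset, itemsize)
mismatched = storage_bytes_offset_as_bytes(numel, storage_offset, itemsize)
print(consistent, mismatched)  # 160 144

# With a zero offset (the approach chosen for get_buffer()) the two agree.
print(storage_bytes_offset_as_elements(numel, 0, itemsize)
      == storage_bytes_offset_as_bytes(numel, 0, itemsize))
```

Folding the offset into the data pointer and reporting `storage_offset` as 0 sidesteps the disagreement in this model, which is consistent with the reasoning in the commit message.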
Nikita Lutsenko	96bab021c0	ATen \| Fix header namespace pollution. (#137619 ) Summary: Fixing a warning, so we can enable it globally. Test Plan: Sandcastle-only, no runtime changes. Differential Revision: D64122115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137619 Approved by: https://github.com/Skylion007	2024-10-10 05:04:54 +00:00
Laith Sakka	1aa130e80c	Avoid generating as_strided for aliasing views in auto_functionalize_v2 (#137149 ) During auto_functionalize_v2, if we encounter a view whose size(), stride() and storage_offset() match the base, we create a view that is regenerated by calling aten.alias instead of as_strided, for better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149 Approved by: https://github.com/zou3519	2024-10-10 05:00:41 +00:00
Valentine233	b5284a01a4	[CPU] remove keyword static for exp_u20 (#137571 ) Remove all the keyword static for constants of vec registers in exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR 56 threads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571 Approved by: https://github.com/jgong5	2024-10-10 04:52:04 +00:00
Lu Fang	d170c410f2	Clean up op BC check list (#137634 ) Summary: Remove some stale items Test Plan: CI Differential Revision: D64133246 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634 Approved by: https://github.com/hl475	2024-10-10 04:29:21 +00:00
sanshang	249152475d	fix sequence number for group (#134578 ) Summary: Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay. `ProcessGroupNCCL` has 2 sequence number counters, `seqCollective_` and `seqP2P_`. `b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)` However, `WorkNCCL` only has one sequence number member `seq_`. `b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)` We need to match collective and p2p with wait separately. `29b5a462dc` Depends on: https://github.com/pytorch/pytorch/pull/135132 Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test Differential Revision: Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578 Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o	2024-10-10 04:24:06 +00:00
Finlay Sanders	5aa9f2b660	Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654 ) Fixed issue where nn.Transformer().generate_square_subsequent_mask() doesn't respect set_default_device() and set_default_dtype(). Fixes #137186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654 Approved by: https://github.com/mikaylagawarecki	2024-10-10 03:10:01 +00:00
Nichols A. Romero	b9c9f7f0fa	Document ROCm environment variables and improve CMake messaging to user (#137308 ) Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about. The PR improves the documentation and CMake's behavior for ROCM builds. - Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`. - CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308 Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily	2024-10-10 03:08:08 +00:00
Laith Sakka	f394fb554b	Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541 ) Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky with 8% detected variance, so I set it up with a 20% threshold (8*2)++; the others are stable within ±1.5% <img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf"> <img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541 Approved by: https://github.com/aorenste	2024-10-10 02:47:38 +00:00
Jane Xu	f9ed39c989	Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637 Approved by: https://github.com/albanD	2024-10-10 01:23:30 +00:00
Michael Lazos	d50d5df2fb	Add warning for non static grads in optimizer variable (#137554 ) Fixes https://github.com/pytorch/pytorch/issues/112548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554 Approved by: https://github.com/williamwen42	2024-10-10 01:23:21 +00:00
Miles	f301f6544b	fix bug for fill_empty_deterministic_ not support complex half (#137488 ) Fixes #133157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488 Approved by: https://github.com/ezyang	2024-10-10 01:21:32 +00:00
Laith Sakka	361046718d	Generate new expected results file when there are failures in diff time benchmarks (#137551 ) This change also adds a signpost log for the benchmarks that pass. To test, I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv results ``` WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results. REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results. PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00% MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it. You can use the new reference expected result stored at path: out.csv. a,instruction count,90,0.01 b,memory,200,0.1 c,something,100,0.1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551 Approved by: https://github.com/aorenste	2024-10-10 01:09:15 +00:00
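The WIN / REGRESSION / PASS outcomes in the sample output of #137551 are consistent with a simple band check around the expected value. The sketch below is a guess at that logic, not the contents of `check_results.py`: `classify` and `relative_change` are hypothetical helpers, and the inputs are the three rows from the sample output.

```python
def relative_change(actual, expected):
    """Signed change of the actual result relative to the expected one."""
    return (actual - expected) / expected


def classify(actual, expected, noise_margin):
    """Compare a result against expected * (1 ± noise_margin)."""
    low = expected * (1 - noise_margin)
    high = expected * (1 + noise_margin)
    if actual < low:
        return "WIN"         # better than expected: expected results need updating
    if actual > high:
        return "REGRESSION"  # worse than expected
    return "PASS"


rows = [
    ("a", 90, 110, 0.01),
    ("b", 200, 100, 0.10),
    ("c", 107, 100, 0.10),
]
for name, actual, expected, margin in rows:
    change = relative_change(actual, expected)
    print(f"{name}: {classify(actual, expected, margin)} ({change:+.2%})")
```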
Edward Z. Yang	d9f4a7d3f9	Simplify find_localzeros (#133325 ) Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325 Approved by: https://github.com/albanD	2024-10-10 00:52:50 +00:00
Ke Wen	4f45c76806	[PGNCCL] Limit access to ncclComm_ (#137573 ) When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it. `NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over directly using `ncclComm_`. To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also make it a private member of the `NCCLComm` class -- the external-facing wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573 Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o ghstack dependencies: #137572	2024-10-10 00:34:05 +00:00
cyy	0739efbd1f	Remove reference of gcc7 from CI scripts (#137339 ) Because gcc7 can't be used to build Pytorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-10-10 00:29:29 +00:00
Shuqiang Zhang	47a515d260	[c10d] simplify barrier implementation and further decouple CPU/GPU (#137516 ) synchronization Summary: Barrier is essentially intended to block CPU thread (instead of GPU streams). Before we used 2 stream synchronizations (1. current stream blocked by nccl stream end event, 2. CPU thread blocked on current stream). This is unnecessary as we already have CPU thread blocking logic in wait(). Also, adding barrier specific code block in the general GPU synchronize() API is intrusive and confusing. This PR cleans this. Test Plan: CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516 Approved by: https://github.com/fduwjj, https://github.com/kwen2501	2024-10-09 23:55:28 +00:00
Huy Do	51c33c0b72	Increase the runner size of AVX* jobs to 4xlarge (#137633 ) The failing test was recently moved back from slow, and it requires more RAM than is available on a 2xlarge runner. It looks OK to bump the instance size to 4xlarge instead. I missed the periodic jobs in https://github.com/pytorch/pytorch/pull/137447 Example periodic failures `de4c2a3b4e` (test_cpu_repro) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-10-09 23:43:49 +00:00
Edward Z. Yang	4304c68a4c	In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097 Approved by: https://github.com/angelayi ghstack dependencies: #137091	2024-10-09 23:34:35 +00:00
Edward Z. Yang	6908d8d450	Enable python dispatcher for reinplacing pass (#137091 ) Arguably this should be put somewhere higher up in the stack? Not sure. Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/ There is a repro but I need to fix more bugs before it can be checked in Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091 Approved by: https://github.com/bdhirsh	2024-10-09 23:34:35 +00:00
Felix Janda	31e334ad9e	[unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043 ) This fixes a compilation error on systems with the musl c library. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043 Approved by: https://github.com/aaronenyeshi	2024-10-09 23:34:16 +00:00
Yifu Wang	aafa02506e	[CudaDMAConnectivityDetector] improve the detection robustness (#137530 ) - Previously the detection would fail if it ran before the user called APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which is idempotent). - Previously any nvml issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529	2024-10-09 23:30:16 +00:00
Yifu Wang	fbaf9b62de	[SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529 ) This provides better accuracy without additional cost. Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472, #137473, #137474, #137475	2024-10-09 23:30:16 +00:00
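The accuracy argument in #137529 can be illustrated without a GPU. The following is a minimal sketch, assuming bfloat16 can be emulated by rounding a float32 to its upper 16 bits with round-to-nearest-even. The helper `to_bf16` and the sample values are hypothetical, and the code does not model the actual `multimem.ld_reduce` instruction; it only contrasts rounding after every addition with rounding once at the end.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical contributions from 8 ranks; each is exactly representable in bfloat16.
values = [256.0] + [1.0] * 7

# bfloat16 accumulator: the partial sum is rounded after every addition.
acc_bf16 = 0.0
for v in values:
    acc_bf16 = to_bf16(acc_bf16 + v)

# float32 accumulator: round to bfloat16 only once, when storing the result.
acc_fp32 = 0.0
for v in values:
    acc_fp32 = to_fp32(acc_fp32 + v)
result_fp32_acc = to_bf16(acc_fp32)

print(sum(values), acc_bf16, result_fp32_acc)  # 263.0 256.0 264.0
```

With the bfloat16 accumulator every `+ 1.0` is lost because 257 is not representable and rounds back to 256, whereas the float32 accumulator lands on the representable value closest to the true sum.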
Yifu Wang	39c5122a4f	[IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR - Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out` - Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_` - Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks. - Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`). - Removes methods that were made for the python binding. - Replaces nvlink detection logic with `DMAConnectivityDetector`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472, #137473, #137474	2024-10-09 23:30:16 +00:00
Yifu Wang	e6edfe3928	[SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472, #137473	2024-10-09 23:30:16 +00:00
Bob Ren	b22749712c	type _inductor/optimize_indexing.py (#137599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-10-09 23:29:47 +00:00
Bob Ren	d67b4f9e5f	type _inductor/quantized_lowerings.py (#137598 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598 Approved by: https://github.com/Skylion007	2024-10-09 23:29:26 +00:00
Bob Ren	9b01d17b8d	Use MetaProxy more pervasively (#137588 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588 Approved by: https://github.com/ezyang ghstack dependencies: #136674	2024-10-09 23:22:03 +00:00
Nikita Shulga	13cf8360d8	[MPS] Fix testing for generator operators (#137601 ) Before this change, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same. Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`. Discovered that: - `torch.randn` and `torch.randint` fall into the same undefined category - `torch.logspace` is not implemented for MPS - Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem) - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601 Approved by: https://github.com/albanD	2024-10-09 23:17:11 +00:00
Bob Ren	48fe0d56d6	Type _inductor/exc.py (#137595 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595 Approved by: https://github.com/Skylion007	2024-10-09 23:15:06 +00:00
Edward Z. Yang	7408742b67	Make ignore_fresh_unbacked_symbols reentrant (#137605 ) I have a test but it requires some other feature work that isn't fully baked. Maybe this will fix an xfail. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605 Approved by: https://github.com/albanD	2024-10-09 23:08:05 +00:00
Jin Zhou	5516ac5c21	[ROCm] Tunableop record untuned (#128813 ) When TunableOp is enabled, it is easy to run out of memory, since applications such as LLM inference usually need a large amount of video memory. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp: - record untuned GEMMs to a file. - a Python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in it Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813 Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-09 21:59:03 +00:00
Simon Fan	839d3568b0	[compiled autograd] fix -Wuninitialized (#137539 ) https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539 Approved by: https://github.com/isuruf, https://github.com/Skylion007	2024-10-09 21:16:26 +00:00
Yifu Wang	38027b9b47	[SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567 ) Fixes https://github.com/pytorch/pytorch/issues/137145 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567 Approved by: https://github.com/Chillee, https://github.com/weifengpy	2024-10-09 20:45:57 +00:00
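The kind of failure fixed in #137567 can be sketched in plain Python. This is an illustration only: the commit message does not say which integer width was involved, so the signed 32-bit accumulator below is an assumption, and `wrap_int32` and `numel_int32` are hypothetical helpers rather than PyTorch functions.

```python
from math import prod


def wrap_int32(n: int) -> int:
    """Interpret n modulo 2**32 as a signed 32-bit integer (assumed width)."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


def numel_int32(shape) -> int:
    """Multiply the dimensions the way a 32-bit accumulator would."""
    total = 1
    for dim in shape:
        total = wrap_int32(total * dim)
    return total


shape = (65536, 32768)        # 2**31 elements in total
exact = prod(shape)           # Python integers do not overflow
wrapped = numel_int32(shape)  # the narrow accumulator wraps around

print(exact, wrapped)  # 2147483648 -2147483648
```

Computing the element count in a type wide enough for the largest supported tensor avoids the wraparound.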
Andrew Gu	a93ea617b5	[FSDP2] Required `mesh_dim_names` for HSDP (#137436 ) Two changes: 1. Require `mesh_dim_names` if using HSDP 2. Pass only the shard mesh to `fsdp_pre_all_gather` Change 1 is technically BC breaking, but it should not be hard to fix on the user side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-10-09 20:35:09 +00:00
eellison	47af7cc962	Add compiler bisector (#131936 ) This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good \| bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for https://github.com/pytorch/pytorch/issues/126546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936 Approved by: https://github.com/ezyang	2024-10-09 20:34:11 +00:00
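The last step described in #131936, bisecting over the number of times a subsystem is applied, can be sketched as an ordinary binary search. This is a hypothetical illustration, not the code in `torch._inductor.bisect_helper`: the function name, the `passes_with_limit` callback, and the assumption that results are monotonic in the limit are all made up for the example.

```python
from typing import Callable


def first_bad_application(total: int, passes_with_limit: Callable[[int], bool]) -> int:
    """Return the smallest limit in 1..total at which the check starts failing.

    Assumes that allowing 0 applications passes, allowing all `total`
    applications fails, and that once a limit fails every larger limit fails too.
    """
    lo, hi = 0, total  # invariant: limit=lo passes, limit=hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_with_limit(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Toy stand-in for "run the model with at most `limit` applications enabled":
# the 37th application is the one that breaks the result.
CULPRIT = 37


def passes(limit: int) -> bool:
    return limit < CULPRIT


print(first_bad_application(100, passes))  # 37
```

The search needs about log2(total) runs of the check instead of one run per application.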
Jane Xu	cfe970260a	Clarify opt-einsum usage, fix #127109 (#137596 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596 Approved by: https://github.com/albanD	2024-10-09 20:31:24 +00:00
PyTorch MergeBot	c73d2634b9	Revert "Log chromium event for automatic dynamic reasons (#137491 )" This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92. Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))	2024-10-09 20:24:12 +00:00
PyTorch MergeBot	16a2c2cfd4	Revert "Introduce torch.sym_sum (#136429 )" This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29. Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))	2024-10-09 20:08:01 +00:00
Ke Wen	572f506f9c	[c10d] Improve split_group test (#137572 ) Fix 1: `backend1 = pg._get_backend`, here `pg` should be `ng1`. Fix 2: `dist.broadcast` should be called by ranks of subgroup `ng1` only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572 Approved by: https://github.com/Skylion007	2024-10-09 19:43:57 +00:00
Mikayla Gawarecki	70288c3c2d	Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444 ) Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263 Follow ups: Do the same for maia and mtia ## Motivation With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`. However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy. ## Why is this ok? The following comment on why numpy is necessary for serialization is legacy `c87c9f0a01/torch/_tensor.py (L303-L312)` We no longer do the following, though it was the case 5 years ago in the PR that added this > CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content Instead what now happens is that CPU storage is constructed with data from the file and then moved onto backend device. Old behavior (`legacy_load`): `67adda891a/torch/serialization.py (L620)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444 Approved by: https://github.com/albanD	2024-10-09 19:35:55 +00:00
Andrew Gu	aa61e251d4	[FSDP2] Added `shard_placement_fn` arg (#137496 ) ## Overview This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size. ``` # Example: def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]: largest_dim = largest_dim_size = -1 for dim, dim_size in enumerate(param.shape): if dim_size > largest_dim_size: largest_dim = dim largest_dim_size = dim_size return Shard(largest_dim) fully_shard(module, shard_placement_fn=shard_placement_fn) ``` ## Follow-Ups - Copy kernels: For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496 Approved by: https://github.com/weifengpy ghstack dependencies: #137593	2024-10-09 19:13:32 +00:00
Bob Ren	36133f39db	Tensorify compute on Python scalars (#136674 ) Signed-off-by: Bob Ren <bobrenfb.com> Commandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang w/ shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack on top support for other operators in subsequent PRs to keep this scoped to the mechanics of the fx pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674 Approved by: https://github.com/ezyang	2024-10-09 18:51:41 +00:00
Bob Ren	f15edb291a	type _dynamo/trace_wrapped_higher_order_op.py (#137354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354 Approved by: https://github.com/Skylion007, https://github.com/jansel	2024-10-09 18:35:28 +00:00
Zhiyong Wang	9a957e2842	[NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523 ) Summary: NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events. The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin. Test Plan: ``` tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev) # Create a list with same number of elements as world size (aka no. of ranks) # During allgather this list is going to be populated with tensors from all ranks (aka all gather) gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)] # get collective from all ranks if i <= 10 or RANK != 0: dist.all_gather(gathered_tensors, tensor) ``` My script triggers the flight recorder. ``` trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0 trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0 ``` Content from /tmp/nccl_trace_rank_0: P1614645283 Differential Revision: D61929401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523 Approved by: https://github.com/c-p-i-o	2024-10-09 18:19:33 +00:00
Ryan Guo	394c143e4e	[dynamo] Fix error when inlining certain nested closure returned by another function (#137510 ) See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context. In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as a variable holding the cell's actual value, rather than a `NewCellVariable`. Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` needs to look further up in the parent frames (#106491) to find these cells' values. This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`. This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in `NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that is not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into the `NestedUserFunctionVariable::closure` upon the function creation, at which point Dynamo always has the cell value in `symbolic_locals` for lookup. Fixes #136814. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510 Approved by: https://github.com/williamwen42	2024-10-09 18:13:57 +00:00
Justin Chu	018dabff20	[ONNX] Implement patch for jit.isinstance (#137592 ) Patch torch.jit.isinstance for users for models to be dynamo exportable. Replaces https://github.com/pytorch/pytorch/pull/137487. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592 Approved by: https://github.com/titaiwangms, https://github.com/xadupre	2024-10-09 18:06:52 +00:00
Andrew Gu	ceb2fcc5db	[FSDP2] Fixed incorrect tensor meta after `.to(dtype)` (#137593 ) This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes to module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta`'s dtype may have changed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593 Approved by: https://github.com/Skylion007	2024-10-09 17:57:11 +00:00
Huanyu He	bae8d5853e	[TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798 ) Summary: # context * enable the `_get_user_embeddings` function * run failed at P1610151892 ``` torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0). (Size-like symbols: u22) ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False. Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance. Potential framework code culprit (scroll up for full backtrace): File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward if M <= 0 or N <= 0: ``` ``` N = prod(inner_dims) # type: ignore[arg-type] M = prod(outer_dims) # type: ignore[arg-type] if M <= 0 or N <= 0: return ( input.new_zeros(input_shape) if output_mask[0] else None, input.new_zeros(input_shape[axis:]) if output_mask[1] else None, input.new_zeros(input_shape[axis:]) if output_mask[2] else None, ) ``` # changes * use guard_size_oblivious since the new_zeros return is kind of optimization, shouldn't impact the correctness of the follow up code logic. * the size `ret[i][j]` could be zero, so the change in V1 isn't valid * for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/) ``` from torch.fx.experimental.symbolic_shapes import guard_size_oblivious if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0): ``` # past * found `u22` was introduced at ``` def _wait_impl(self) -> List[List[int]]: # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks. 
if isinstance(self._splits_awaitable, dist.Work): self._splits_awaitable.wait() ret = self._output_tensor.view(self.num_workers, -1).T.tolist() # <------ u22 introduced here if not torch.jit.is_scripting() and is_torchdynamo_compiling(): for i in range(len(ret)): for j in range(len(ret[i])): torch._check_is_size(ret[i][j]) # <---------- my question: why the _check_is_size isn't enough?? torch._check(ret[i][j] > 0) # <------ added by diff V1 ``` Test Plan: # run command ``` TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 \| tee -a `tagT`.`tagH`.log ``` # results * before without enabling `_get_user_embeddings` [14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html) log: P1610151892 {F1889387940} * V1 enable `_get_user_embeddings` with `torch._check(ret[i][j] > 0)` [13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html) {F1889388378} * V2 enable `_get_user_embeddings` with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):` [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html) if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0): Differential Revision: D63424929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798 Approved by: https://github.com/ezyang	2024-10-09 17:19:45 +00:00
James Wu	4d45536e92	Save aot graph code in AOTAutogradCache for logging purposes (#137432 ) Save the string graph code from print_readable Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432 Approved by: https://github.com/bdhirsh ghstack dependencies: #137431	2024-10-09 16:59:08 +00:00
Masaki Kozuki	b71d0ac3b1	remove unused variable (#137565 ) per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565 Approved by: https://github.com/Skylion007	2024-10-09 16:31:43 +00:00
Oguz Ulgen	ae03c0cff3	Add microbenchmark for FxGraphHashDetails.debug_lines (#137506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506 Approved by: https://github.com/jamesjwu	2024-10-09 16:15:05 +00:00
albanD	e945b6600d	Support 3.8 compile again (#137587 ) This is not going to be very reliable since we don't have CI though... Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587 Approved by: https://github.com/Skylion007	2024-10-09 15:54:52 +00:00
Xinran / Allan Rui	1d15dd7891	Fix triton_reshape to properly expand `Min` keyword in triton codegen (#137357 ) Summary: Previously triton_reshape will generate code with `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand `Min` keyword to `<`. Test Plan: ``` buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape ``` Differential Revision: D63850158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357 Approved by: https://github.com/blaine-rister, https://github.com/eellison	2024-10-09 15:53:45 +00:00
Will Feng	de4c2a3b4e	Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253 ) This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253 Approved by: https://github.com/bdhirsh	2024-10-09 08:06:16 +00:00
Valentine233	ac8954d1ca	[pattern match][SDPA] remove contiguous in sdpa replacement (#136930 ) Fixes a perf issue which is found internally. In the case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in model code. However before entering SDPA, it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous to inputs in replacement. This is not necessary as the contiguous doesn't exist in pattern. Furthermore, it could sometimes cause perf issues. Anyway, we can do the additional contiguous in the kernel implementation if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5	2024-10-09 07:52:38 +00:00
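The two stride tuples quoted in #136930 can be reproduced with plain arithmetic. The following is a minimal sketch, assuming the model's query is a `transpose(1, 2)` view of a contiguous `[1, 384, 16, 64]` tensor (the commit message gives only the sizes and strides, not how the view was created); `contiguous_strides` and `transpose` are hypothetical helpers that mimic row-major stride bookkeeping without importing torch.

```python
def contiguous_strides(size):
    """Row-major strides (in elements) for a tensor of the given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def transpose(size, stride, d0, d1):
    """Swap two dimensions of a strided view; no data is moved."""
    size, stride = list(size), list(stride)
    size[d0], size[d1] = size[d1], size[d0]
    stride[d0], stride[d1] = stride[d1], stride[d0]
    return size, stride


# Assumed origin of the query: [batch, seq, heads, head_dim], stored contiguously.
base_size = [1, 384, 16, 64]
base_stride = contiguous_strides(base_size)

# View as [batch, heads, seq, head_dim] without copying.
q_size, q_stride = transpose(base_size, base_stride, 1, 2)
print(q_size, q_stride)  # [1, 16, 384, 64] [393216, 64, 1024, 1]

# Calling contiguous on that view materializes a copy with row-major strides.
print(contiguous_strides(q_size))  # [393216, 24576, 64, 1]
```

The second layout is only reachable through a copy of the data, which is the extra work the commit removes from the replacement.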
FFFrog	72ad1b8c6c	Make Context to be Device-agnostic Step by Step (2/N) (#136526 ) - add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526 Approved by: https://github.com/ezyang, https://github.com/EikanWang ghstack dependencies: #136519	2024-10-09 07:34:30 +00:00
Avik Chaudhuri	a02093e824	fix test_export_constraints_error_not_in_range (#137500 ) Test Plan: fixed Differential Revision: D64052011 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500 Approved by: https://github.com/tugsbayasgalan	2024-10-09 05:48:14 +00:00
zeshengzong	abb00efc14	Add torch.squeeze parameter description to declare allowed type (#137485 ) Fixes #137422 Add a parameter type definition to the API docs to clarify the allowed value types, so that users do not pass `None` as the `dim` value directly. ```python >>> import torch >>> x = torch.randn(3,1,2) >>> x.squeeze(dim=None) Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: Please look up dimensions by name, got: name = None. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485 Approved by: https://github.com/albanD	2024-10-09 05:29:13 +00:00
Huy Do	df114a447e	Parametrize test_lstm_packed (#137447 ) The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and no longer need to be marked as slow. Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447 Approved by: https://github.com/albanD, https://github.com/malfet	2024-10-09 05:13:53 +00:00
PyTorch MergeBot	2fff990c16	Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314 )" This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed. Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))	2024-10-09 04:53:30 +00:00
Jane Xu	972822dea1	Minorly reorder optim kwargs in docs, fixes #137391 (#137531 ) Closes #137391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531 Approved by: https://github.com/albanD	2024-10-09 04:14:45 +00:00
Benjamin Glass	4628fcf41a	Fix ir._WaitKernel (#137401 ) In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to an incorrect outputs list. Add the correct set of outputs, as done in ir._CollectiveKernel.create_out_of_place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401 Approved by: https://github.com/desertfire ghstack dependencies: #136924	2024-10-09 04:02:30 +00:00
Benjamin Glass	0414aeacd9	AOTInductor: silence linker warnings about executable stacks (#136924 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924 Approved by: https://github.com/desertfire	2024-10-09 04:02:30 +00:00
Jane Xu	ddc7b6d0b4	Removes confusing note, addresses #38006 (#137535 ) Fixes #38006 The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535 Approved by: https://github.com/albanD	2024-10-09 04:00:38 +00:00
Yifu Wang	d3edf4ebf4	[SymmetricMemoryOps] implement two-shot all-reduce (#137473 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472	2024-10-09 03:49:42 +00:00
Yifu Wang	82e55b624f	[SymmetricMemoryOps] implement one_shot_all_reduce (#137472 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472 Approved by: https://github.com/Chillee, https://github.com/weifengpy ghstack dependencies: #137471	2024-10-09 03:49:42 +00:00
Yifu Wang	5d83ee3e32	[SymmetricMemoryOps] refine cross-device barriers (#137471 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471 Approved by: https://github.com/Chillee, https://github.com/weifengpy	2024-10-09 03:49:42 +00:00
Michael Lazos	5f1759a025	[Dynamo] add flex attention mode test (#137121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121 Approved by: https://github.com/yanboliang, https://github.com/anijain2305 ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119	2024-10-09 02:29:40 +00:00
Michael Lazos	d5785d4295	[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119 Approved by: https://github.com/williamwen42, https://github.com/anijain2305 ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227	2024-10-09 02:29:40 +00:00
Michael Lazos	0a304d9048	[Dynamo] Handle extracted unbound tensor methods (#137227 ) fixes2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227 Approved by: https://github.com/williamwen42, https://github.com/anijain2305 ghstack dependencies: #137114, #137115, #137116, #137117, #137120	2024-10-09 02:29:40 +00:00
Michael Lazos	b3f30c9bc3	[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120 ) Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default) Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue) Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120 Approved by: https://github.com/yanboliang, https://github.com/malfet ghstack dependencies: #137114, #137115, #137116, #137117	2024-10-09 02:29:40 +00:00
Michael Lazos	27dee935af	[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117 Approved by: https://github.com/yanboliang, https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116	2024-10-09 02:29:40 +00:00
Michael Lazos	38afac2917	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) (#137116 ) Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116 Approved by: https://github.com/yanboliang ghstack dependencies: #137114, #137115	2024-10-09 02:29:40 +00:00
Michael Lazos	108b469f78	[Dynamo] Remove ignored modes workaround (#135502 ) (#137115 ) Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115 Approved by: https://github.com/yanboliang ghstack dependencies: #137114	2024-10-09 02:29:40 +00:00
Michael Lazos	e41dffbedd	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) (#137114 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114 Approved by: https://github.com/yanboliang	2024-10-09 02:29:40 +00:00
leslie-fang-intel	0b8048c78a	Fix AOTI CPP GEMM Template issue without freezing (#136421 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight ``` ReinterpretView( StorageBox( ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1])) ), FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]), origins=OrderedSet([addmm]) ) ``` In the post-processing step of the GEMM template, the weight used was the one before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack. Test Plan ``` python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-10-09 02:19:07 +00:00
FFFrog	be0b75256a	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-09 02:13:36 +00:00
Chirag Pandya	384ddab294	[c10d] fix sequence numbers for coalesced operations (#135132 ) Summary: We were erroneously incrementing seq_collective for p2p operations. Fixes issue #134833 Test Plan: Unit tests. TODO: add more unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132 Approved by: https://github.com/fduwjj	2024-10-09 01:38:12 +00:00
Sam Larsen	8cbb58cff6	[inductor] Limit cpu copies in autotuning to CUDA devices (#137509 ) Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509 Approved by: https://github.com/int3, https://github.com/eellison	2024-10-09 01:31:58 +00:00
Parikshit Shah	932b9945c0	[AutoAC] Backward Pass Aware AC - changes to partitioner to accommodate SOLVER as a callable (#137314 ) Summary: making it so that the config can pass `config.activation_memory_budget_solver` as a callable method and then that callable is invoked to determine the set of saved/recomputed nodes. Test Plan: tbd Reviewed By: Chillee, basilwong Differential Revision: D63714905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314 Approved by: https://github.com/eellison, https://github.com/basilwong Co-authored-by: Parikshit Shah <parikshit@meta.com>	2024-10-09 00:39:29 +00:00
Ke Wen	23c531b3e9	Allow parallelize_module to get device_mesh from ambient context (#134247 ) This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one. Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model. (The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.) For example: ``` class FeedForward(nn.Module): def __init__(self, config: TransformerArgs) -> None: super().__init__() w1 = nn.Linear(config.dim, config.hidden_dim, bias=False) w2 = nn.Linear(config.hidden_dim, config.dim, bias=False) w3 = nn.Linear(config.dim, config.hidden_dim, bias=False) self.w1 = parallelize_module(w1, Colwise) self.w2 = parallelize_module(w2, Rowwise) self.w3 = parallelize_module(w3, Colwise) def forward(self, x: Tensor) -> Tensor: y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x)) # y is a DTensor with Partial placement; we can return it as is. return y # Or we can convert it to Replicate -- there is modeling flexibility here. return y.redistribute(Replicate()) with device_mesh: model = FeedForward(config) # Now model is a model parallelized onto device_mesh y = model(x) ``` The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context. Calling `parallelize_module` from within model hierarchy also saves the use of FQNs as in the out-of-model annotation case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247 Approved by: https://github.com/tianyu-l	2024-10-09 00:19:03 +00:00
Zhenbin Lin	de7f32a205	openreg add pin_memory (#135339 ) According to `Next steps` in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and HostAllocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339 Approved by: https://github.com/albanD	2024-10-09 00:07:59 +00:00
eellison	8893881867	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang Co-authored-by: eellison <elias.ellison@gmail.com>	2024-10-09 00:05:52 +00:00
eqy	cba3f4f5e3	[CUDA] Clean up asserts in `test_cuda.py` (#137034 ) Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034 Approved by: https://github.com/Skylion007	2024-10-08 23:16:19 +00:00
Jane Xu	b16167874d	Minor SGD docs clarification fixing #137356 , #137352 (#137528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528 Approved by: https://github.com/albanD	2024-10-08 23:05:08 +00:00
Duygu Altinok	2a1829d728	Error message for allow_in_graph decorator and arbitrary function combo (#135972 ) Fixes #103615 Quick error message for non-allowed allow_in_graph decorator and arbitrary function combo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972 Approved by: https://github.com/anijain2305	2024-10-08 22:48:38 +00:00
eellison	4aed81c0db	Add support for cat memory planning mms with max autotune (#132554 ) When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat. Discussion for reviewers: It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](`bcac71517c/torch/_inductor/kernel/mm.py (L156)`). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control ``` class AllocatedFixedLayout(FixedLayout) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554 Approved by: https://github.com/jansel	2024-10-08 22:36:46 +00:00
Tugsbayasgalan Manlaibaatar	02013da038	Lift restriction on training IR for unflatten (#137470 ) Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470 Approved by: https://github.com/avikchaudhuri	2024-10-08 22:30:24 +00:00
Justin Chu	81c8a8ada6	[ONNX] Bump onnxscript in CI (#137497 ) To 0.1.0.dev20241008 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497 Approved by: https://github.com/titaiwangms	2024-10-08 21:56:30 +00:00
Joel Schlosser	76ab1ab665	Fix autograd.Function + NJT when an output grad is None (#136875 ) For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws: ``` RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers ``` This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875 Approved by: https://github.com/soulitzer	2024-10-08 21:01:36 +00:00
PyTorch MergeBot	5e3e1c0151	Revert "[FSDP2] Required `mesh_dim_names` for HSDP (#137436 )" This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279. Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))	2024-10-08 20:50:49 +00:00
Edward Z. Yang	b499083a91	Get rid of quadratic tests to has_same_metadata (#136857 ) Fixes https://github.com/pytorch/pytorch/issues/136852 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857 Approved by: https://github.com/isuruf, https://github.com/bdhirsh	2024-10-08 20:49:23 +00:00
PyTorch MergeBot	d34b617bb9	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) (#137114 )" This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480. Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))	2024-10-08 20:33:17 +00:00
PyTorch MergeBot	8c937445ee	Revert "[Dynamo] Remove ignored modes workaround (#135502 ) (#137115 )" This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803. Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))	2024-10-08 20:33:17 +00:00
PyTorch MergeBot	e5f9131327	Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) (#137116 )" This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d. Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))	2024-10-08 20:33:17 +00:00
PyTorch MergeBot	2d18c2d5e7	Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 )" This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075. Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))	2024-10-08 20:33:17 +00:00
Edward Z. Yang	cc75ac084f	Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-10-08 20:17:03 +00:00
PyTorch MergeBot	5349ee2934	Revert "Parametrize test_lstm_packed (#137447 )" This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8. Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))	2024-10-08 20:15:24 +00:00
James Wu	3c1ab93678	Log chromium event for automatic dynamic reasons (#137491 ) Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally. Run following code: ``` import torch @torch.compile(backend="eager") def foo(t, x): return t.sin() + x torch._dynamo.config.automatic_dynamic_shapes = True torch._dynamo.config.assume_static_by_default = True # Change size x = torch.randn([1,2]) foo(x, 2) x = torch.randn([2,2]) foo(x, 2) torch._dynamo.reset() # Change dimensionality x = torch.randn([1,2]) foo(x, 2) x = torch.randn([1,2,3]) foo(x, 2) torch._dynamo.reset() # Change stride x = torch.randn([3,3]) foo(x, 2) x = torch.as_strided(x, [3,3], [2,2]) foo(x, 2) torch._dynamo.reset() # Change scalar x = torch.randn([1,2]) foo(x, 2) foo(x, 3) ``` Internal link to perfetto: https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key The events look like this: <img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab"> <img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656"> <img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba"> <img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491 Approved by: https://github.com/Skylion007, https://github.com/oulgen	2024-10-08 19:53:12 +00:00
cyy	a2396b2dd8	[2/N] Fix extra warnings brought by clang-tidy-17 (#137459 ) Follows #137407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459 Approved by: https://github.com/Skylion007	2024-10-08 19:05:02 +00:00
Brian Hirsh	b41fc14072	compile time benchmarks for AOTDispatcher (partitioner) (#136760 ) compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because: (1) it consists of a single input + many weights that are used sequentially (2) contains a mix of recompute vs non-recomputed ops (matmul + sin) (3) it is relatively simple from running locally: ``` collecting compile time instruction count for aotdispatcher_partitioner_cpu compile time instruction count for iteration 0 is 21764219181 compile time instruction count for iteration 1 is 12475020009 compile time instruction count for iteration 2 is 12463710140 compile time instruction count for iteration 3 is 12455676489 compile time instruction count for iteration 4 is 12451344330 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760 Approved by: https://github.com/ezyang ghstack dependencies: #136759	2024-10-08 18:44:13 +00:00
Brian Hirsh	48b8f818b2	compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 ) this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher: (1) inference vs training code paths (2) "subclasses" vs "no subclasses" codepaths Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely) I ran locally, and got these numbers on the 4 paths: ``` collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu compile time instruction count for iteration 0 is 11692348671 compile time instruction count for iteration 1 is 3026287204 compile time instruction count for iteration 2 is 3011467318 compile time instruction count for iteration 3 is 3004485935 compile time instruction count for iteration 4 is 3003087410 collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu compile time instruction count for iteration 0 is 6068003223 compile time instruction count for iteration 1 is 5585418102 compile time instruction count for iteration 2 is 5581856618 compile time instruction count for iteration 3 is 5581651794 compile time instruction count for iteration 4 is 5578742619 collecting compile time instruction count for aotdispatcher_inference_subclass_cpu compile time instruction count for iteration 0 is 8634984264 compile time instruction count for iteration 1 is 8633467573 compile time instruction count for iteration 2 is 8632182092 compile time instruction count for iteration 3 is 8632056925 compile time instruction count for iteration 4 is 8632543871 collecting compile time instruction count for aotdispatcher_training_subclass_cpu compile time instruction count for iteration 0 is 14737239311 compile time instruction count for iteration 1 is 14734346427 compile time instruction count for iteration 2 is 14736493730 compile time instruction count for iteration 3 is 14734121272 compile time instruction 
count for iteration 4 is 14733852882 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759 Approved by: https://github.com/laithsakka	2024-10-08 18:44:13 +00:00
Brian Hirsh	53af729a66	add meta for _segment_reduce_backward (#137442 ) reland of https://github.com/pytorch/pytorch/pull/124988 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442 Approved by: https://github.com/albanD	2024-10-08 18:40:06 +00:00
Edward Z. Yang	1aac1ffce1	Don't generate implicit value ranges for missing symbols. (#136667 ) Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute. ``` TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667 Approved by: https://github.com/isuruf ghstack dependencies: #136429	2024-10-08 18:12:57 +00:00
Edward Z. Yang	90bed32b98	Introduce torch.sym_sum (#136429 ) Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. update_hint_regression benchmark, before and after: ``` update_hint_regression,compile_time_instruction_count,2648328980 update_hint_regression,compile_time_instruction_count,2563748678 ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429 Approved by: https://github.com/isuruf	2024-10-08 18:12:57 +00:00
James Wu	3bf6594d13	Log compile ids to pt2_remote_cache and pt2_compile_events (#137431 ) Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile. Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431 Approved by: https://github.com/oulgen	2024-10-08 18:04:48 +00:00
Yuanhao Ji	758dbac308	Add type check for `ord` in `torch.linalg.vector_norm()` and `torch.linalg.matrix_norm()` (#137463 ) fixes #137424, fixes #137460 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463 Approved by: https://github.com/lezcano	2024-10-08 17:53:56 +00:00
Shivam Raikundalia	d87835ac32	[Profiler] Clear Out Dangling AppendOnlyLists (#137450 ) Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting. Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully Differential Revision: D64010911 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450 Approved by: https://github.com/aaronenyeshi	2024-10-08 17:48:59 +00:00
PyTorch MergeBot	7e8dace0de	Revert "[ROCm] remove caffe2 from hipify (#137157 )" This reverts commit 40d826074546558f6665a4c118335a7725503cac. Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))	2024-10-08 17:45:45 +00:00
PyTorch MergeBot	a8047564ff	Revert "[FlexAttention] Support training bias for eager (#136910 )" This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f. Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))	2024-10-08 17:29:02 +00:00
PyTorch MergeBot	0b5ade8a12	Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120 )" This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3. Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))	2024-10-08 17:26:19 +00:00
PyTorch MergeBot	2570d77a26	Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354 )" This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26. Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))	2024-10-08 17:22:40 +00:00
PyTorch MergeBot	76c5bdd2cc	Revert "[Dynamo] Handle extracted unbound tensor methods (#137227 )" This reverts commit 14eabd69152e31d059444310979625542db2aece. Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))	2024-10-08 17:12:41 +00:00
PyTorch MergeBot	c88c0e6c65	Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119 )" This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc. Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))	2024-10-08 17:09:26 +00:00
PyTorch MergeBot	cc10ef4645	Revert "[Dynamo] add flex attention mode test (#137121 )" This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a. Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))	2024-10-08 17:03:34 +00:00
PyTorch MergeBot	11192ceca4	Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451 )" This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae. Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))	2024-10-08 17:00:59 +00:00
eellison	8184e202d8	Update mutation checking in pattern matcher (#137448 ) Fix for https://github.com/pytorch/pytorch/issues/137229 The current mutation checking is complicated because it works for pre grad IR. When pre grad ir has been traced to OpOverloads checking is much easier. I am also special casing auto functional hop although I discussed with @zou3519 it would be nice to have a way of querying HOPs that mimic schemas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448 Approved by: https://github.com/zou3519	2024-10-08 16:56:40 +00:00
Avik Chaudhuri	28493efe6e	fix silly mapping issue with torch.Size (#137465 ) Test Plan: added test Differential Revision: D64022949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465 Approved by: https://github.com/yushangdi, https://github.com/angelayi	2024-10-08 16:53:15 +00:00
xadupre	7267363844	[ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340 ) Works around #136543. This fix solves the issue only in the context of the ONNX exporter, but the issue also happens in other contexts. The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-10-08 16:45:59 +00:00
Andrew Gu	5fb30df7d6	[FSDP2] Required `mesh_dim_names` for HSDP (#137436 ) Two changes: 1. Require `mesh_dim_names` if using HSDP 2. Pass only the shard mesh to `fsdp_pre_all_gather` Change 1 is technically BC breaking, but it should not be hard to fix on the user side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-10-08 16:31:18 +00:00
Shangdi Yu	0bfedb13e7	Remove aoti_torch_zero_ codegen (#137371 ) Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798. Test Plan: CI Differential Revision: D63916320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371 Approved by: https://github.com/jingsh	2024-10-08 15:57:41 +00:00
Bin Bao	c04b35a5ae	[AOTI] Add standalone version of TORCH_CHECK (#136873 ) Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained. Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873 Approved by: https://github.com/albanD, https://github.com/chenyang78	2024-10-08 15:30:01 +00:00
Huy Do	d5493ed579	Parametrize test_lstm_packed (#137447 ) The test runs all 512 of its combinations sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow. Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. This change might fix that issue as well (pending CI testing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447 Approved by: https://github.com/albanD, https://github.com/malfet	2024-10-08 15:26:27 +00:00
Joel Schlosser	3e2f276a14	Fix to() on non-contiguous NJTs (#137124 ) Called out via torchrec integration: `lengths` is not handled properly. Future work (not related to non-contiguous NJTs): #137275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124 Approved by: https://github.com/soulitzer ghstack dependencies: #137030, #137031	2024-10-08 15:11:05 +00:00
Edward Z. Yang	a77bb8527c	Make index check in applySelect support deferred runtime assert (#137046 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046 Approved by: https://github.com/albanD	2024-10-08 14:31:47 +00:00
Thanh Ha	9b2e453e24	Migrate ARM64 Linux binary jobs to runner determinator (#136666 ) Updates ARM64 Linux binary jobs to use the runner determinator. Issue: pytorch/ci-infra#265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666 Approved by: https://github.com/ZainRizvi	2024-10-08 12:14:06 +00:00
Shuqiang Zhang	76dca1fef3	[c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295 ) Summary: This PR should not change the existing behavior of work.wait(); it just separates the stream synchronization code from the CPU busy-wait code. It also removes the need for a private synchronization function. In the longer term, we would like to give users the flexibility to bypass the watchdog thread and handle collective errors themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295 Approved by: https://github.com/kwen2501	2024-10-08 08:53:47 +00:00
drisspg	9f9d252971	[FlexAttention] only calculate grads for buffers that require_grad (#137451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451 Approved by: https://github.com/Chillee	2024-10-08 07:36:38 +00:00
Xuehai Pan	59cdd8ddf1	Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396 Approved by: https://github.com/albanD	2024-10-08 06:49:04 +00:00
PyTorch MergeBot	493d0eeef3	Revert "Add support for cat memory planning mms with max autotune (#132554 )" This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176. Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))	2024-10-08 06:21:06 +00:00
Michael Lazos	8ca15e87f5	Update torchbind expecttest from landrace (#137453 ) Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly) Attempted to revert the stack but there were conflicts Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453 Approved by: https://github.com/huydhn	2024-10-08 06:01:29 +00:00
Tugsbayasgalan Manlaibaatar	bb31e3f57e	Add original forward names to schema so that prettify pass works (#136887 ) When we run_decomp, we retrace if it is training IR. As a result, we need to reliably store the original forward names when we run decomp. Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887 Approved by: https://github.com/angelayi	2024-10-08 04:21:02 +00:00
Zhenbin Lin	46525abb71	OpenReg: support multiple executors (#136249 ) In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into driver/executor; however, the current executor stands for all devices and allocates memory for all of them together. To better simulate device behavior, we now support multiple executors, where each executor stands for one device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249 Approved by: https://github.com/FFFrog, https://github.com/albanD	2024-10-08 01:37:08 +00:00
Bob Ren	395e098209	type _dynamo/mutation_guard.py (#137350 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350 Approved by: https://github.com/Skylion007	2024-10-08 00:04:34 +00:00
Max Podkorytov	52ba40c6f6	[ROCm][AOTI] add CK backend (#135641 ) Companion to #134379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78 Co-authored-by: Colin Peppler <colinpeppler@meta.com>	2024-10-07 23:53:58 +00:00
Colin Peppler	2c0b11c79b	forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435 ) Summary: It seems like the import path is different from FBCode & OSS. Wondering how to consolidate them. Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0 ``` Differential Revision: D63991961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435 Approved by: https://github.com/jovianjaison	2024-10-07 23:44:04 +00:00
Yuanhao Ji	812f286d4a	Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711 ) This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; another definition is here: `5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711 Approved by: https://github.com/soulitzer	2024-10-07 23:19:06 +00:00
eellison	d558ec0730	Add support for cat memory planning mms with max autotune (#132554 ) When we are autotuning matmuls, the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat. Discussion for reviewers: It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](`bcac71517c/torch/_inductor/kernel/mm.py (L156)`). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride for which we control allocation ``` class AllocatedFixedLayout(FixedLayout) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554 Approved by: https://github.com/jansel	2024-10-07 22:49:29 +00:00
Ludvig Bergenstråhle	01bf350967	Fix bmm_sparse_cuda illegal memory access (#131977 ) This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example: ```py import torch ind = torch.tensor([[1], [0], [0]], device="cuda") val = torch.tensor([1.], device="cuda") A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1)) B = torch.zeros((2, 1, 1), device="cuda") C = torch.bmm(A, B) ``` ## Details In the previous code, we may for example end up with the following situation: ``` i : indices_1D[i] ------------------------------------------ 0 : 1 <- start_idx, mid_idx 1 : 1 <- end_idx ... ``` When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977 Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved	2024-10-07 22:47:34 +00:00
William Wen	a6707a7303	[dynamo] log all graph breaks to graph_breaks logging artifact (#137244 ) We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244 Approved by: https://github.com/jansel	2024-10-07 22:34:27 +00:00
Bob Ren	a9f7b905de	type _dynamo/trace_wrapped_higher_order_op.py (#137354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354 Approved by: https://github.com/Skylion007, https://github.com/jansel	2024-10-07 21:57:06 +00:00
PyTorch MergeBot	796c3c3415	Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221 )" This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c. Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))	2024-10-07 21:46:13 +00:00
Sam Larsen	319eda9dfd	[inductor] Add API to make post_grad_custom passes cache-able (#137298 ) Summary: See https://github.com/pytorch/pytorch/issues/130772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-10-07 21:11:54 +00:00
Peter Y. Yeh	8aa110cb00	[ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999 ) This pull request enables the int_mm_error tests for ROCm 6.0+, since #122431 has landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999 Approved by: https://github.com/jeffdaily, https://github.com/malfet	2024-10-07 21:10:18 +00:00
Huy Do	46abaa3b0f	Increase parallelnative shards to 4 (#137433 ) The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433 Approved by: https://github.com/wdvr, https://github.com/malfet	2024-10-07 21:06:34 +00:00
Sam Larsen	c87c9f0a01	[inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701 Approved by: https://github.com/eellison	2024-10-07 19:47:04 +00:00
Ryan Guo	900f57216f	[dynamo] Log a summary of frames Dynamo traced (#137297 ) This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function. Example: ```python import torch @torch.compile def foo(): x = torch.ones([10]) def bar(): y = x + x torch._dynamo.graph_break() z = y * x return z return bar(), bar foo() foo() ``` Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end. ``` ...... I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [ I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * foo /Users/ryanguo99/Documents/work/scratch/test.py:4 I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * bar /Users/ryanguo99/Documents/work/scratch/test.py:7 I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ] I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [] ...... ``` Fixes #118262. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297 Approved by: https://github.com/williamwen42	2024-10-07 19:44:41 +00:00
Pian Pawakapan	f33ffd01f2	[export] fix joint graph metadata (#136011 ) Differential Revision: D62652832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011 Approved by: https://github.com/tugsbayasgalan	2024-10-07 19:36:44 +00:00
Jason Ansel	08b84afda9	[inductor] Fix alignment hint for WorkspaceArg (#137429 ) Alignment hints refer to the base ptr, not the size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429 Approved by: https://github.com/eellison	2024-10-07 19:32:33 +00:00
PyTorch MergeBot	fe44b6a67f	Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835 )" This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1. Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))	2024-10-07 18:59:41 +00:00
Michael Lazos	144665d772	[Dynamo] add flex attention mode test (#137121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121 Approved by: https://github.com/yanboliang ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119	2024-10-07 18:55:26 +00:00
Michael Lazos	d255b34c0a	[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119 Approved by: https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227	2024-10-07 18:55:26 +00:00
Michael Lazos	14eabd6915	[Dynamo] Handle extracted unbound tensor methods (#137227 ) fixes2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227 Approved by: https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116, #137117, #137120	2024-10-07 18:55:26 +00:00
Michael Lazos	68151fd288	[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120 ) Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default) Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue) Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120 Approved by: https://github.com/yanboliang ghstack dependencies: #137114, #137115, #137116, #137117	2024-10-07 18:55:26 +00:00
Michael Lazos	941be418d8	[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117 Approved by: https://github.com/yanboliang, https://github.com/williamwen42 ghstack dependencies: #137114, #137115, #137116	2024-10-07 18:55:26 +00:00
Michael Lazos	f9d69cde88	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) (#137116 ) Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116 Approved by: https://github.com/yanboliang ghstack dependencies: #137114, #137115	2024-10-07 18:55:26 +00:00
Michael Lazos	b1fd7708bd	[Dynamo] Remove ignored modes workaround (#135502 ) (#137115 ) Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115 Approved by: https://github.com/yanboliang ghstack dependencies: #137114	2024-10-07 18:55:26 +00:00
Michael Lazos	51bc839b94	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) (#137114 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114 Approved by: https://github.com/yanboliang	2024-10-07 18:55:26 +00:00
Bob Ren	ff95ff5d38	type _dynamo/profiler.py (#137351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351 Approved by: https://github.com/Skylion007	2024-10-07 18:54:33 +00:00
Andrew Gu	aa145dead8	[FSDP2] Fixed mistargeted backward prefetch (#137348 ) If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348 Approved by: https://github.com/weifengpy	2024-10-07 18:10:09 +00:00
PyTorch MergeBot	01c07e7864	Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920 )" This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da. Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))	2024-10-07 17:56:57 +00:00
cyy	0c0d8c8ff0	[1/N] Fix extra warnings brought by clang-tidy-17 (#137407 ) Before we can use clang-tidy-17 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407 Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi	2024-10-07 17:53:59 +00:00
Rachel Guo	ceb4ed8450	[AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323 ) Summary: - Further added more types for debug value dumping. - Add a test case for symint inputs for debug printer. in real prod model use cases, "unbacked symints" (those 'u0', 's0', etc.) can be helpful if we can examine their value. Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda ``` Differential Revision: D63864708 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323 Approved by: https://github.com/chenyang78	2024-10-07 17:41:40 +00:00
Animesh Jain	04e48ac562	[inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198 Approved by: https://github.com/jansel ghstack dependencies: #137191, #137193	2024-10-07 17:20:58 +00:00
Animesh Jain	e2b72348d0	[inductor] Reuse the subgraph if accessed via same get_attr node (#137193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193 Approved by: https://github.com/jansel ghstack dependencies: #137191	2024-10-07 17:20:58 +00:00
Animesh Jain	7a5eaecd92	[inductor] Correctly keep track of the graph_input_names (#137191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191 Approved by: https://github.com/jansel	2024-10-07 17:20:53 +00:00
Wei Feng	14b4099521	[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955 ) this PR unblocks unit test with single Float8Linear module. It fixes following error ``` torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs) [rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn' ``` Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955 Approved by: https://github.com/vkuzo, https://github.com/eqy	2024-10-07 16:36:31 +00:00
Oguz Ulgen	33461592e2	[TLParse] Include cache hit/miss/bypass in the report name (#137282 ) Makes it easier to tell at a glance https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282 Approved by: https://github.com/jamesjwu	2024-10-07 16:00:00 +00:00
James Wu	4db199f15f	Implement Remote AOTAutogradCache (#137278 ) Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name. Test Plan: Run benchmark: TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency See that it cache hits even with local cache removed. Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj New unit tests Reviewed By: oulgen Differential Revision: D63323958 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278 Approved by: https://github.com/oulgen	2024-10-07 15:38:54 +00:00
Angela Yi	f80ed0b831	[export] Custom op meta kernel generation (two pass) (#137277 ) Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi Test Plan: followup diff (D63837739) Differential Revision: D63837740 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277 Approved by: https://github.com/zou3519	2024-10-07 15:34:19 +00:00
Joshua Rosenkranz	e20e7a8c38	Fixed developer setup issue in open_registration_extension (#137355 ) This PR fixes an issue where when running `python setup.py develop`, the `open_registration_extension` self contained example will not build due to the following: ``` error: 'synchronizeStream' overrides a member function but is not marked 'override' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355 Approved by: https://github.com/albanD, https://github.com/spzala	2024-10-07 15:25:37 +00:00
Yuxin Wu	8c3ab21490	multiprocessing.spawn: allow a grace period when shutdown (#131278 ) When one process fails, the others are immediately killed. This prevents the other processes from doing necessary cleanup or dumping debug information (in particular, the NCCL flight recorder). This PR adds a grace period. Default behavior is unchanged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278 Approved by: https://github.com/albanD	2024-10-07 12:37:34 +00:00
vasiliy	a063a82c8b	[redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341 ) Summary: Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341 Approved by: https://github.com/eqy	2024-10-07 00:58:51 +00:00
Nikita Shulga	d1b87e26e5	[BE] Delete empty files (#137376 ) Discovered by running ``` % find aten -type f -size 0 aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/dummy.c aten/src/ATen/native/vulkan/api/StringUtil.cpp aten/src/ATen/native/LegacyBridge.cpp aten/src/ATen/function_wrapper.py aten/src/ATen/cudnn/Exceptions.h ``` Most of them were added by `b774ce54f8` Remove reference to LegacyBridge.cpp from `aten_native_source_non_codegen_list`: `f42f63ee86/build_variables.bzl (L1317)` And reference to `native/quantized/cpu/qnnpack/wrappers/dummy.c` from `f42f63ee86/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl (L440)` Which seems to be a bug from some ancient Android toolchain Pull Request resolved: https://github.com/pytorch/pytorch/pull/137376 Approved by: https://github.com/kit1980, https://github.com/eqy, https://github.com/seemethere, https://github.com/jianyuh, https://github.com/Skylion007	2024-10-06 18:59:04 +00:00
Menglu Yu	0eba7e5451	Revert runtime numeric check in inductor due to increased compilation time (#137324 ) Summary: This diff reverts D63438718, which caused a latency regression on multiple models. Test Plan: NA Differential Revision: D63872515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137324 Approved by: https://github.com/xuzhao9	2024-10-06 05:23:24 +00:00
angelayi	1dc1b85714	[export] Move swap to a different file (#137134 ) Refactor so that unflattener doesn't become too messy Differential Revision: [D63719648](https://our.internmc.facebook.com/intern/diff/D63719648/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137134 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #136191, #137102	2024-10-06 04:28:18 +00:00
angelayi	fa9cd46d12	[export] Update swap's forward function (#137102 ) Downstream APS code was failing to run the previously swapped module because of some fx.GraphModule forward function weirdness (P1594789677). So to fix this, I just attached a custom forward function which matches the unflattened module's forward function. Differential Revision: [D63683422](https://our.internmc.facebook.com/intern/diff/D63683422/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137102 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #136191	2024-10-06 04:25:36 +00:00
angelayi	52d7704b32	[export] Add optimization passes (#136191 ) Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other). Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten. Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191 Approved by: https://github.com/avikchaudhuri	2024-10-06 04:22:42 +00:00
Jeeja	ad4e91acfe	[fsdp2] based on device, use stream and Event (#136843 ) Currently FSDP2 supports only CUDA; it won't work for other backends that need FSDP2 because its streams and events are based on CUDA. To support other backends, use _get_device_handle with the device type to get the class, and use that for streams and events. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843 Approved by: https://github.com/awgu	2024-10-06 04:17:47 +00:00
Jez Ng	4061910ba2	Have Triton CPU backend respect max_autotune setting (#137276 ) We would previously do it regardless of the setting's value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-10-06 03:03:33 +00:00
Yanbo Liang	711dacf984	[FlexAttention] Support training bias for eager (#136910 ) Add training bias eager implementation, take over the original POC from https://github.com/pytorch/pytorch/pull/136076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136910 Approved by: https://github.com/Chillee	2024-10-05 19:34:57 +00:00
Bob Ren	d073223663	turn CompilationCallbackHandler into dataclass (#137312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137312 Approved by: https://github.com/Skylion007 ghstack dependencies: #137181	2024-10-05 19:03:28 +00:00
Catherine Lee	f54e142c58	Remove references to Rockset in trymerge (#137207 ) For the migration to ClickHouse But also Rockset is not used in trymerge anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/137207 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-10-05 12:53:22 +00:00
Jeff Daily	40d8260745	[ROCm] remove caffe2 from hipify (#137157 ) - Remove all "MasqueradingAsCUDA" files and classes. - Do not rename "CUDA" classes to "HIP". Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157 Approved by: https://github.com/eqy	2024-10-05 12:48:54 +00:00
Yanbo Liang	ca38f28543	[FlexAttention] Adjust BlockMask if reusing the one created at larger seqlen (#137255 ) Fixes #136232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137255 Approved by: https://github.com/Chillee	2024-10-05 07:31:32 +00:00
Nikita Shulga	4830bd0dd4	[Doc] Clarify that NaNs are not equal to each other (#137386 ) Fixes https://github.com/pytorch/pytorch/issues/137337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137386 Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/kit1980	2024-10-05 06:19:12 +00:00
Avik Chaudhuri	17718209ea	fix specialization bug in unflatten + preserve_module_call_signature (#137363 ) Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash. Test Plan: added a test Reviewed By: angelayi Differential Revision: D63913750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363 Approved by: https://github.com/angelayi	2024-10-05 04:26:02 +00:00
Nikita Shulga	6d0d7b6e37	[CI][BE] Restore cuda memory allocator setting (#137383 ) By adding `finally:` clause at the end of the test Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383 Approved by: https://github.com/ngimel	2024-10-05 04:16:38 +00:00
PyTorch UpdateBot	0067f586ba	[audio hash update] update the pinned audio hash (#136968 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136968 Approved by: https://github.com/pytorchbot	2024-10-05 04:08:59 +00:00
Yuanhao Ji	4d8b845797	Fix overflow error when `torch.bincount()` handles a large tensor (#136745 ) Fixes #136720 the result in this case says: ``` Traceback (most recent call last): File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module> result = torch.bincount(input) ^^^^^^^^^^^^^^^^^^^^^ RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745 Approved by: https://github.com/Skylion007	2024-10-05 04:04:48 +00:00
soulitzer	d6f340f66c	Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633 ) Thanks @awgu for raising this issue and the small repro. From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push to based on whichever input buffer we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633 Approved by: https://github.com/albanD	2024-10-04 23:59:46 +00:00
Michal Gallus	79562f3af8	[ROCm] Modify hipify script to work with Windows paths (#135360 ) This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during the hipification process on Windows, by changing the path syntax convention to a UNIX-like one. Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. By doing so, the number of changes to the file is limited. Moreover, this early unification allows the rest of the code to have a battle-tested Linux-like behaviour. Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, it would impact a broader share of the code and would hence require a more meticulous evaluation in terms of non-altered logic and edge cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd	2024-10-04 23:43:43 +00:00
albanD	8b6774d381	Clarify comment for error handling of dict getattr (#137381 ) Just a small nit Pull Request resolved: https://github.com/pytorch/pytorch/pull/137381 Approved by: https://github.com/malfet	2024-10-04 23:40:21 +00:00
Tarun Karuturi	f42f63ee86	Add option to disable operator profiling (#136838 ) Summary: X-link: https://github.com/pytorch/executorch/pull/5720 For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time. To disable operator profiling users need to do: ``` etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling); ``` Test Plan: Added test case. Differential Revision: D61883224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838 Approved by: https://github.com/dbort	2024-10-04 22:56:00 +00:00
Andrew Ho	f2d174c051	Update CODEOWNERS (#136278 ) Swap @gokulavasan for @divyanshk as codeowner of torch/utils/data/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136278 Approved by: https://github.com/divyanshk, https://github.com/janeyx99, https://github.com/jansel	2024-10-04 22:42:05 +00:00
albanD	88e54de219	More nogil unsafe API fix (#137142 ) Covers the PyDict APIs and confirms no update is needed for the PyModule one. The rest was already covered in https://github.com/pytorch/pytorch/pull/136899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142 Approved by: https://github.com/eqy, https://github.com/Skylion007	2024-10-04 21:56:34 +00:00
Siddharth Kotapati	e27c0048db	Enable additional tests for MPS CI runs (#134356 ) As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn	2024-10-04 21:52:38 +00:00
Laith Sakka	7c1d93944e	Proper handling of arguments passed by in kwargs inside zip_schema (#137311 ) If the function is ```func(a, b, c)``` and is called as ```func(a=1, b=.., c=..)```, then before this change we did not iterate over a, b, c, since they appear in kwargs. This diff fixes that issue. This function is used in _inductor/ir.py to iterate over custom op arguments; when a custom pass made changes and passed arguments as kwargs, we did not process them. ``` for info, arg in torch._library.utils.zip_schema(schema, args, kwargs): handle_aliasing_and_mutation(info, arg) ``` Fix https://github.com/pytorch/pytorch/issues/137057 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311 Approved by: https://github.com/zou3519	2024-10-04 21:50:31 +00:00
albanD	c0deec120f	Fix resurrection logic to trigger early enough (#137267 ) Fixes https://github.com/pytorch/pytorch/issues/136358 The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++. Before this PR, they have the following gc methods: Tensor: - tp_clear subtype_clear - tp_traverse THPVariable_subclass_traverse - tp_dealloc THPVariable_subclass_dealloc TensorBase: - tp_clear THPVariable_clear - tp_traverse THPFunction_traverse (fake function that is just an error) - tp_dealloc object_dealloc The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267 Approved by: https://github.com/ezyang, https://github.com/kshitij12345	2024-10-04 21:13:54 +00:00
Nikita Shulga	bd48933323	Run docker builds on Meta account for now (#137358 ) To fix ``` arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-096a3e2616140518b is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/pytorch-linux-jammy-py3-clang18-asan because no resource-based policy allows the ecr:InitiateLayerUpload action ``` Which seems to be doing the trick see https://github.com/pytorch/pytorch/actions/runs/11185419440/job/31098258344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137358 Approved by: https://github.com/huydhn	2024-10-04 20:39:56 +00:00
Andrew Gu	7b3378a39a	[FSDP2] Relaxed even sharding requirement for all-gather extensions (#137005 ) This PR relaxes the even sharding requirement for the all-gather extensions. The `fsdp_pre_all_gather` now expects signature: ```diff def fsdp_pre_all_gather( self, mesh: DeviceMesh, + outer_size: torch.Size, + outer_stride: Tuple[int, ...], module: nn.Module, mp_policy: MixedPrecisionPolicy, ) -> Tuple[Tuple[torch.Tensor, ...], Any]: ``` - Since no one is using this new signature yet, we should be safe to change it. - Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now. - For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005 Approved by: https://github.com/weifengpy	2024-10-04 20:34:20 +00:00
Bob Ren	f4b415da11	type _dynamo/replay_record.py (#137183 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137183 Approved by: https://github.com/Skylion007	2024-10-04 20:29:24 +00:00
Avik Chaudhuri	6a6a8b17b8	handle state tensors in training ir path (#137240 ) Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed. Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export` Differential Revision: D63802576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240 Approved by: https://github.com/tugsbayasgalan	2024-10-04 20:23:48 +00:00
Bob Ren	f0ef7fddde	Add ignored/unmaintained comment for capture_autograd_function flag (#137309 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309 Approved by: https://github.com/jansel ghstack dependencies: #136961	2024-10-04 20:02:37 +00:00
Bin Bao	0878739b11	[AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269 ) Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269 Approved by: https://github.com/chenyang78, https://github.com/hl475	2024-10-04 19:42:05 +00:00
Benjamin Glass	a968576777	Add lowering for aten.searchsorted (#135701 ) Adds lowering for `aten.searchsorted`. This entails: 1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`. 2. Adding support for striding to `ops.bucketize`. 3. Adding support for sorting tensors to `ops.bucketize`. 4. Adding a lowering for `aten.searchsorted.Tensor`. 5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors. 6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions. Closes #135873 Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701 Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98	2024-10-04 19:26:05 +00:00
eellison	58ec6a360c	force contiguity for all reduce (#137345 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137345 Approved by: https://github.com/xmfan	2024-10-04 19:16:38 +00:00
Shangdi Yu	c0a930b104	Change to export_for_training in quantize_pt2e tests (#137233 ) Summary: as title also change it in `prepare_pt2e()` docstring Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization ``` Differential Revision: D63345059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233 Approved by: https://github.com/tugsbayasgalan	2024-10-04 18:33:02 +00:00
Michael Lazos	22e19bd2d7	Add link to torch.compile the missing manual in troubleshooting (#137301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-10-04 18:19:30 +00:00
Henry Tsang	79195b9453	[inductor] Add kwargs to bypass unexpected keyword argument error (#137329 ) Summary: I tried `TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/fbcode/profile.txt`. TypeError: DebugAutotuner.run() got an unexpected keyword argument 'benchmark_run' Test Plan: ci Differential Revision: D63876103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137329 Approved by: https://github.com/muchulee8	2024-10-04 18:17:56 +00:00
Tugsbayasgalan Manlaibaatar	d2d14d14e3	[RELAND] Fix unlift to preserve aliased constants (#137310 ) Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137310 Approved by: https://github.com/avikchaudhuri	2024-10-04 18:15:52 +00:00
Laith Sakka	8b9cbf22c2	Enable regression test for add loop benchmarks (#136573 ) The red dotted line is 1.5 <img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517"> expected taken from the average. <img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573 Approved by: https://github.com/ezyang	2024-10-04 18:12:08 +00:00
Menglu Yu	ad240018f2	[PT2][Inductor][Reliability] Add back unit test for pad_mm with BF16 (#137231 ) Summary: We added the unit test for the recently added pad_mm pattern in customized optimus D63040455, which resolves the long computation kernel issue for BF16 on A100. Test Plan: ``` buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16 ``` Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136 Network: Up: 100KiB Down: 4.3GiB (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6) Jobs completed: 7079. Time elapsed: 3:34.3s. Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18) Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D63794727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137231 Approved by: https://github.com/henrylhtsang	2024-10-04 17:49:55 +00:00
Shangdi Yu	b2979f4382	Allow autocast in training ir export (#137287 ) Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check. Test Plan: CI Differential Revision: D63345767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137287 Approved by: https://github.com/angelayi	2024-10-04 17:38:51 +00:00
Colin Peppler	42adadf2f2	[aotinductor] enable CUTLASS backend (#134379 ) ### Context This PR allows CUTLASS kernels usage in AOTI. It does this by: * For any CUTLASS kernels that win during autotuning, compile them as a .so & .o * When creating the final model .so, link all the CUTLASS kernels .o files * Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel) ### Example https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549 ``` TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379 Approved by: https://github.com/tenpercent, https://github.com/chenyang78	2024-10-04 17:32:41 +00:00
Jeff Daily	c7b0d4b148	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-04 15:36:29 +00:00
cyy	67908e9111	Enable clang-tidy on torch/csrc/distributed/rpc (#137320 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137320 Approved by: https://github.com/Skylion007	2024-10-04 15:34:05 +00:00
Bin Bao	15c3479db7	[AOTI] Fix _scaled_mm ABI-compatible codegen (#137132 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode. Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132 Approved by: https://github.com/chenyang78 ghstack dependencies: #137008	2024-10-04 14:05:18 +00:00
Bin Bao	5d24ea81d3	[AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work. Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137008 Approved by: https://github.com/hl475	2024-10-04 14:02:46 +00:00
PyTorch MergeBot	f56f7476d3	Revert "Add meta functions for `lerp`, `addcmul`, and `addcdiv`. (#136909 )" This reverts commit e4b98b11493914769d15ca8b124c0b5fa1fdd364. Reverted https://github.com/pytorch/pytorch/pull/136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/136909#issuecomment-2393774694))	2024-10-04 14:01:54 +00:00
PyTorch MergeBot	cd17b2645c	Revert "[Distributed] Fix extra context on device 0 (#135273 )" This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a. Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))	2024-10-04 13:58:57 +00:00
PyTorch MergeBot	5509207543	Revert "[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331 )" This reverts commit 592e3a3d4069029946ec4c8d103a313806c53a88. Reverted https://github.com/pytorch/pytorch/pull/136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](https://github.com/pytorch/pytorch/pull/136331#issuecomment-2393760135))	2024-10-04 13:52:37 +00:00
Adnan Akhundov	e80f47fb4d	Pass special arguments to user-defined Triton kernels if required (#137236 ) Summary: Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](`762a7d197c/python/triton/runtime/autotuner.py (L106)`) in the Triton code includes those special configs (names and values) in the potential arguments to the kernel. [Here](`762a7d197c/python/triton/runtime/jit.py (L613)`) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters. This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves #136550. Test Plan: ``` $ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] .inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] .inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] . 
---------------------------------------------------------------------- Ran 6 tests in 6.283s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236 Approved by: https://github.com/zou3519	2024-10-04 07:36:55 +00:00
cyy	6327a71880	[Environment Variable][2/N] Use thread-safe setenv wrapper (#124485 ) This follows #119449 to make setenv thread-safe. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485 Approved by: https://github.com/eqy	2024-10-04 07:30:51 +00:00
Pian Pawakapan	6dcd773c57	[export] clean up dynamic markers from tensors (#137230 ) Summary: When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`. I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call. We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call. Test Plan: test_export Differential Revision: D63773534 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230 Approved by: https://github.com/avikchaudhuri	2024-10-04 06:50:45 +00:00
Yanbo Liang	a408cfcbf1	[torch.compile] torch.vmap supports dynamic shapes + enable flex attention create_block_mask dynamic shapes (#137163 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137163 Approved by: https://github.com/Chillee	2024-10-04 05:16:04 +00:00
Mauricio Villegas	40b09edd87	Add back DistributedDataParallel types that were lost when pyi was removed (#136835 ) When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back. Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835 Approved by: https://github.com/kwen2501	2024-10-04 04:44:20 +00:00
Tugsbayasgalan Manlaibaatar	97634e4f82	Rollout infra for executorch migration to training IR (#132703 ) Title Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132703 Approved by: https://github.com/tarun292	2024-10-04 04:33:08 +00:00
rzou	f500cb43bb	Fix torch.library.register_vmap (#137306 ) We didn't support multiple levels of vmap. The main problem is, during the batching rule, we need to exclude the vmap dispatch key (FuncTorchBatched) like how our C++ batching rules do it. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306 Approved by: https://github.com/Chillee	2024-10-04 03:46:35 +00:00
Bob Ren	cfc51c858a	type _dynamo/callback.py (#137181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137181 Approved by: https://github.com/Skylion007	2024-10-04 03:28:52 +00:00
PyTorch MergeBot	9670e9e5b0	Revert "Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899 )" This reverts commit 4f93de895138cc3cb8c4383b480a2d0ecf407e1b. Reverted https://github.com/pytorch/pytorch/pull/136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136899#issuecomment-2392721534))	2024-10-04 03:28:31 +00:00
Yukio Siraichi	e4b98b1149	Add meta functions for `lerp`, `addcmul`, and `addcdiv`. (#136909 ) This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their respective inplace versions). These functions only had refs implementations, which was the root cause of a significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA backend. Running the meta functions resulted in the following improvements: - `lerp` calls: 1,550ms to 140ms (10x) - `addcdiv` calls: 640ms to 350ms (1.8x) - `addcmul` calls: 620ms to 300ms (2.05x) [1]: https://github.com/pytorch/xla/issues/7923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909 Approved by: https://github.com/jansel	2024-10-04 02:47:25 +00:00
Bob Ren	a1f1f585ab	clean up error_on_nested_jit_trace flag (#136961 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961 Approved by: https://github.com/jansel	2024-10-04 02:07:54 +00:00
Yifu Wang	d32696249a	[IntraNodeComm] fix a race condition in one-shot all-reduce (#137257 ) One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective. Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257 Approved by: https://github.com/Chillee	2024-10-04 01:41:14 +00:00
Trung Truong	3d3b394e94	[MTIA](3/n) Implement CPU pins functions for MTIA hooks (#137283 ) Summary: Link CPU pins function in MTIA hooks to the host allocator implementation Test Plan: signals unit test in D63709424 Differential Revision: D63352770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137283 Approved by: https://github.com/egienvalue	2024-10-04 01:26:21 +00:00
jakeharmon8	15e127bc3b	[numpy2.0 compat] Fix test_parse_numpy_int_overflow for NumPy 2.0 (#137135 ) NumPy now throws an OverflowError when trying to create np.uint64(-1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137135 Approved by: https://github.com/Skylion007	2024-10-04 01:21:12 +00:00
Bob Ren	13ec343afe	clean up capture_func_transforms flag (#136960 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960 Approved by: https://github.com/guilhermeleobas, https://github.com/jansel	2024-10-04 01:10:52 +00:00
cyyever	6b9b2a126e	Build clang18 image for ASAN tests (#128763 ) Use the latest clang. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128763 Approved by: https://github.com/malfet	2024-10-04 00:53:56 +00:00
Ke Wen	a93d3873e9	[Distributed] Fix extra context on device 0 (#135273 ) This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: `1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)` When there is no "preset" CUDA context (which is the case for python garbage collector), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- that's where rank 1, 2, ... can create extra context on device 0! ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy	2024-10-04 00:44:02 +00:00
Bin Bao	88e338f4dd	[AOTI] Add C shim for MKLDNN _linear_pointwise (#136999 ) Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136999 Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475	2024-10-04 00:35:10 +00:00
Nikita Shulga	57c02e5a00	[BE] Use helper functions in mps_extension (#137313 ) This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313 Approved by: https://github.com/atalman	2024-10-04 00:26:38 +00:00
Colin Peppler	bc916a5537	[easy] for test_ck_backend enable RE & activate remaining tests for FBCode (#137305 ) Differential Revision: D63859208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137305 Approved by: https://github.com/muchulee8, https://github.com/chenyang78	2024-10-04 00:22:35 +00:00
cyy	60d19cb59e	Enable clang-tidy on torch/csrc/distributed/autograd/* (#137180 ) Enable clang-tidy on `torch/csrc/distributed/autograd/*` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137180 Approved by: https://github.com/Skylion007	2024-10-03 23:49:23 +00:00
rzou	7e13e7dd7e	Disallow FakeTensor.data_ptr access in eager mode (#137221 ) Previously we raised a deprecation warning (beginning PyTorch 2.4). Now that we are on 2.6, we're completing the deprecation and disallowing this behavior. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221 Approved by: https://github.com/albanD, https://github.com/eellison	2024-10-03 23:47:55 +00:00
Justin Chu	cfcd0e1fe9	[ONNX] Update the faketensor documentation (#137292 ) Update the faketensor documentation to reflect current usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137292 Approved by: https://github.com/shubhambhokare1, https://github.com/sdpython	2024-10-03 23:27:11 +00:00
Shangdi Yu	4096ed7dc2	Migrate to training ir in quantization_pt2e_qat unittests (#137232 ) Summary: Change capture_pre_autograd_graph to export_for_training in unit tests. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat ``` Reviewed By: tugsbayasgalan Differential Revision: D63336660 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137232 Approved by: https://github.com/angelayi	2024-10-03 22:57:04 +00:00
Nikita Shulga	b44f25e1ba	[CI] Move s390 binary build to its own workflow (#137304 ) It was added by https://github.com/pytorch/pytorch/pull/125399 and takes 3 hours to finish. Considering the limited number of runners, it often causes queueing, see: <img width="402" alt="image" src="https://github.com/user-attachments/assets/5c67c1d6-af4c-4453-a089-aa1174513bfa"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137304 Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/atalman	2024-10-03 22:31:36 +00:00
David Berard	54094c0c26	[inductor][user triton] Check size hints to determine indexing dtype (#137234 ) Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64. This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64. Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234 Approved by: https://github.com/eellison	2024-10-03 22:07:26 +00:00
Shangdi Yu	c83178d894	Change to export_for_training in XNNPACK tests (#137238 ) Summary: as title Test Plan: CI Differential Revision: D63344674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238 Approved by: https://github.com/tugsbayasgalan	2024-10-03 21:28:05 +00:00
angelayi	ce14f1f0c9	[aoti] Accept constant inputs (#137197 ) Fixes https://fb.workplace.com/groups/1028545332188949/posts/1056788036031345/?comment_id=1056790162697799&reply_comment_id=1057501845959964 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137197 Approved by: https://github.com/henrylhtsang, https://github.com/desertfire, https://github.com/hl475	2024-10-03 20:59:33 +00:00
eqy	46f158bfbc	[cuDNN] Check shapes during graph capture in cuDNN CTCLoss (#130071 ) Found out from #125952 about the existence of `_assert_async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130071 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-10-03 20:10:28 +00:00
Scott Wolchok	592e3a3d40	[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331 ) ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes https://github.com/pytorch/pytorch/pull/127488 . Includes https://github.com/pytorch/executorch/pull/5444 . Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136331 Approved by: https://github.com/malfet, https://github.com/albanD ghstack dependencies: #136445	2024-10-03 18:18:37 +00:00
Scott Wolchok	c8a7da305b	[PyTorch] Add attribute version of C10_ALWAYS_INLINE (#136445 ) Sometimes (such as on a lambda), you need `__attribute__((always_inline))` but not `inline`. Differential Revision: [D63266917](https://our.internmc.facebook.com/intern/diff/D63266917/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136445 Approved by: https://github.com/malfet	2024-10-03 18:18:37 +00:00
PyTorch MergeBot	525f6715bc	Revert "Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162 )" This reverts commit f96020c246aec8514b945d530879635a03294f70. Reverted https://github.com/pytorch/pytorch/pull/137162 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but many jobs are failing with NameError: name _recursive_getattr is not defined + a Lint job fails ([comment](https://github.com/pytorch/pytorch/pull/137162#issuecomment-2392036062))	2024-10-03 18:17:56 +00:00
fduwjj	c7714b8d8d	[FR] Fix duplicate output for the case when not all ranks join on collective (#137256 ) As title, when testing on an internal case, we found that we have very similar output for the error when certain ranks do not join one collective. This is because we didn't put all ranks into `candidate_ranks`, so they didn't get wiped out from entries and got checked again. Ideally for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries. In the specific case, ranks 0 and 1 have a collective of PG 7 on entry 1130 with seq ID = 1130. However, the other ranks have a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use entry idx to do the match because if we later consider p2p, this assumption will collapse, so for now we still defer it to users or further down the debugging stream to figure it out. To make the message clearer, I also include both seqID and record_id (aka, entry index) in the message. (That does not mean this is not possible to implement in the code; for example, we can subtract from every record_id the maximum p2p seq id before it. But users will easily see the wrong order, so we don't think it's necessary to have that logic now.) P1626755348 Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256 Approved by: https://github.com/c-p-i-o	2024-10-03 18:06:45 +00:00
albanD	adc48a5b52	Python CAPI cleanup (#137266 ) This is unrelated to anything else, but as I was going through the code, fixing bad patterns and a refcount bug (which is unlikely to cause any real issue tbh) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137266 Approved by: https://github.com/Skylion007	2024-10-03 17:55:48 +00:00
Sam Larsen	8bb8c3997b	[inductor] parallel compile: add import of thread_safe_fork for internal (#137155 ) Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version"`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork which installs some functionality to destroy some singletons before forking and re-enable them after. This approach works for the failing workload. Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion. Differential Revision: D63736829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155 Approved by: https://github.com/xmfan, https://github.com/eellison	2024-10-03 17:37:21 +00:00
Tugsbayasgalan Manlaibaatar	f96020c246	Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162 ) When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants. Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162 Approved by: https://github.com/pianpwk	2024-10-03 17:28:53 +00:00
James Wu	4d3c0fc061	[AOTAutogradCache] add config for AOTAutograd remote cache (#137011 ) Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff. This PR also changes the command for turning on the local AOTAutogradCache to be more consistent to that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE Test Plan: Existing tests should pass and should build Reviewed By: oulgen Differential Revision: D63321965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011 Approved by: https://github.com/oulgen	2024-10-03 16:03:47 +00:00
Bob Ren	a569a8ac4c	type _dynamo/external_utils.py (#137185 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137185 Approved by: https://github.com/Skylion007	2024-10-03 15:18:53 +00:00
Mikayla Gawarecki	b6cb174816	Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137184 Approved by: https://github.com/albanD	2024-10-03 14:56:11 +00:00
Yuanhao Ji	89b7a5d128	Implement `AcceleratorHooksInterface`'s virtual functions `deviceCount()` and `getCurrentDevice()` for CUDA and XPU (#136752 ) Fixes #136751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136752 Approved by: https://github.com/albanD	2024-10-03 14:44:58 +00:00
atalman	63bbf712d8	Add py3.13t linux wheel build (#137127 ) Builder PR required: https://github.com/pytorch/builder/pull/2001 Test PR: https://github.com/pytorch/pytorch/pull/136490/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127 Approved by: https://github.com/albanD	2024-10-03 13:13:48 +00:00
Yifu Wang	38114ec860	[async-tp] fix a race condition that can cause silent correctness issue (#137199 ) Details described in https://github.com/pytorch/pytorch/issues/137171: ![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b) Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`: - Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream. - After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream. NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137199 Approved by: https://github.com/weifengpy	2024-10-03 10:42:37 +00:00
Vincent Moens	f166d62764	Avoid `__ne__` weakref comparison and use identity instead in cache_size.py (#135000 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135000 Approved by: https://github.com/anijain2305	2024-10-03 07:43:58 +00:00
Vincent Moens	bd9517c1ee	cond_batch_rule with boolean pred (#135009 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135009 Approved by: https://github.com/guilhermeleobas, https://github.com/jansel, https://github.com/zou3519	2024-10-03 07:43:30 +00:00
PyTorch MergeBot	0d1701f310	Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 )" This reverts commit 70019074806920f95976fedad775d7570294f635. Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))	2024-10-03 06:22:55 +00:00
Simon Fan	87bf2a8428	[compiled autograd] initialize cudagraph tls from context manager (#136735 ) FIXES https://github.com/pytorch/pytorch/issues/126934. Cudagraphs TLS is initialized on module import, but compiled autograd codepaths might not import it. This causes problems because autograd/compiled autograd will restore TLS state, and in this case will restore the TLS to an uninitialized state Should fix flaky cudagraph tests: https://github.com/pytorch/pytorch/issues/131663, https://github.com/pytorch/pytorch/issues/132108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136735 Approved by: https://github.com/BoyuanFeng, https://github.com/eellison ghstack dependencies: #136059	2024-10-03 06:22:11 +00:00
Simon Fan	b86269fab5	Unify cpp_extension build directory removal (#136059 ) Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-10-03 06:22:11 +00:00
wz337	55c343fa3a	[DTensor] Register replication strategy for a few upsampling interpolate ops (#137201 ) To unblock Llama 3.2 vision's use case to resize positional embeddings for fine-tuning. Context in [workplace post](https://fb.workplace.com/groups/319878845696681/permalink/1271172040567352/). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137201 Approved by: https://github.com/XilunWu	2024-10-03 03:45:39 +00:00
drisspg	84cac3585d	Move _is_static_problem to mm_common (#137150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137150 Approved by: https://github.com/eellison	2024-10-03 02:55:43 +00:00
drisspg	5c0ce8d0a6	Skip Flaky Test: for #134602 (#137226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137226 Approved by: https://github.com/cpuhrsch	2024-10-03 01:53:59 +00:00
Jez Ng	b3953ff25e	[inductor] Reduce block sizes when using Triton CPU backend (#136612 ) This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612 Approved by: https://github.com/desertfire ghstack dependencies: #135342	2024-10-03 01:48:32 +00:00
Bin Bao	4513fb5c53	[Inductor] Use parametrize to break down some unit tests (#137156 ) Summary: To address the issue that some tests are marked as slow, see https://github.com/pytorch/pytorch/issues/136940#issuecomment-2387227598 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137156 Approved by: https://github.com/eellison	2024-10-03 01:43:36 +00:00
Ke Wen	7631a04081	[c10d] Fix the device query story of ProcessGroup (#136790 ) Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`. A concern is that people may not be aware of what it actually does, due to bad naming of this function: `Return the device to use with ``group`` for control flow usage (object collectives, barrier).` The remediation is as follows: - Added a deprecation warning to `_get_pg_default_device`; - Added a private function `_get_object_coll_device` to undertake what it does; - Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790 Approved by: https://github.com/H-Huang	2024-10-03 01:36:22 +00:00
Avik Chaudhuri	cd5d1fe015	unflatten with specialized graphs per submodule call (#137013 ) Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example. ``` class N(torch.nn.Module): def forward(self, x, b): if b: return x + 1 else: return x + 2 class M(torch.nn.Module): def __init__(self): super().__init__() self.n = N() def forward(self, x): x0 = x + 3 x1 = self.n(x0, True) x2 = x1 + 4 x3 = self.n(x2, False) return x3 + 5 m = M() inp = (torch.ones(1),) print(m(inp)) # tensor([16.]) ep = torch.export.export(m, inp) print(ep.module()(inp)) # tensor([16.]) unflattened = torch.export.unflatten(ep) print(unflattened(inp)) # tensor([15.]) ``` However, this goes against the spirit of specializing graphs when exporting: we should expect* that for every call to a submodule we might generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule. The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected. However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. 
Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter. A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case. Differential Revision: D63642479 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013 Approved by: https://github.com/pianpwk	2024-10-03 00:55:44 +00:00
atalman	6241006c28	Fix dependency on filesystem on Linux (#137209 ) Similar to: https://github.com/pytorch/pytorch/pull/134494 We are seeing a recurrence of https://github.com/pytorch/pytorch/issues/133437 due to use of filesystem on Linux Pull Request resolved: https://github.com/pytorch/pytorch/pull/137209 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-10-03 00:18:28 +00:00
Catherine Lee	235f7e06f4	[CI] upload_metrics function to upload to s3 instead of dynamo (#136799 ) * Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo * Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier * Removes the dynamo_key field since it is no longer needed * Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore * Moves s3 resource initialization behind a function so import is faster --- Tested by emitting a metric during run_test and seeing that documents got added to s3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799 Approved by: https://github.com/ZainRizvi	2024-10-02 23:19:28 +00:00
PyTorch MergeBot	2c9e194e23	Revert "[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955 )" This reverts commit b50b3b32191e7192a28c54a417891f24df4e4dda. Reverted https://github.com/pytorch/pytorch/pull/135955 on behalf of https://github.com/PaliC due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/135955#issuecomment-2389810936))	2024-10-02 22:46:31 +00:00
drisspg	bb03ef7aca	[FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204 ) ## Summary Originally reported in https://github.com/pytorch-labs/attention-gym/issues/45 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137204 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng	2024-10-02 22:19:33 +00:00
Shivam Raikundalia	759cd73adb	[Profiler] Update Kineto Submodule (#137137 ) Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024 Test Plan: Phabricator + OSS CI Differential Revision: D63723255 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137 Approved by: https://github.com/aaronenyeshi	2024-10-02 22:19:28 +00:00
Dan Zimmerman	e9e5d767b6	[inductor] Fix build_paths usage in config.py (#137187 ) Summary: In https://github.com/pytorch/pytorch/pull/136234 we accidentally used the old version of `build_paths`, but in https://github.com/pytorch/pytorch/pull/136952 the API slightly changed. This diff addresses that issue by updating the API usage. Test Plan: CI Reviewed By: ColinPeppler Differential Revision: D63764809 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137187 Approved by: https://github.com/ColinPeppler	2024-10-02 22:06:02 +00:00
Joel Schlosser	e95b230fd8	Fix NJT serialization (#137031 ) Fixes #129366 Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed. This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031 Approved by: https://github.com/soulitzer ghstack dependencies: #137030	2024-10-02 21:41:35 +00:00
eqy	be423a8480	[SDPA] Bump `grad_query` fudge factor for Flash Attention (#135711 ) Tolerance issue for small GPUs e.g., (A16, A2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135711 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-10-02 21:35:00 +00:00
Gabriel Ferns	36fb342ffd	Check for fused kernel before inplace update (#137042 ) Summary: Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so. Test Plan: New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers. Fixes #120217 Here's a diagram of the issue: ![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042 Approved by: https://github.com/eellison	2024-10-02 21:14:34 +00:00
Shangdi Yu	a3f3773477	Make PT2E work with both IR simultaneously (#135769 ) Summary: as title Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat ``` Differential Revision: D62449830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135769 Approved by: https://github.com/angelayi	2024-10-02 21:05:22 +00:00
Howard Huang	4a9225fa1f	improve get_schedule_class() (#137103 ) Small change to make `get_schedule_class()` take case insensitive schedule names Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103 Approved by: https://github.com/kwen2501	2024-10-02 20:08:25 +00:00
Jane Xu	2d465e4d1d	[non ghstack] Init threadpool with user defined num_threads before default (#137051 ) Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back `pool->set_thread_count` call as it is still necessary (I am guessing due to the mutex) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051 Approved by: https://github.com/albanD	2024-10-02 20:02:30 +00:00
Jovian Anthony Jaison	59d7cf7342	Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137139 Approved by: https://github.com/ezyang	2024-10-02 19:58:38 +00:00
chilli	2b329d3bf1	Fix typo in _normalize ref (#137079 ) I think this should basically make no difference numerically, but it does have some ramifications on things like CSE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137079 Approved by: https://github.com/Skylion007 ghstack dependencies: #136826, #137043, #137049, #137065	2024-10-02 19:06:48 +00:00
Joel Schlosser	6374a19a6e	Fix wrapper subclass serialization with custom sizes / strides (#137030 ) Fixes #130154 This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic. The PyCapsule issue also affects `deepcopy`, so that's fixed here as well. Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030 Approved by: https://github.com/albanD	2024-10-02 18:55:03 +00:00
Xuehai Pan	8962610247	[BE][clang-format] make macro `PyObject_HEAD_INIT(type)` and `PyVarObject_HEAD_INIT(type, size)` have its own line (#136949 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949 Approved by: https://github.com/albanD, https://github.com/eqy ghstack dependencies: #136945	2024-10-02 18:39:22 +00:00
Xuehai Pan	89c37be6b7	[BE][clang-format] make macro `PyObject_HEAD` have its own line (#136945 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945 Approved by: https://github.com/albanD	2024-10-02 18:39:21 +00:00
Xilun Wu	54f50f19eb	[dtensor][experimental] expose DTensor Context Parallel API (#137038 ) Summary expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038 Approved by: https://github.com/wz337, https://github.com/fegin	2024-10-02 18:00:23 +00:00
PyTorch MergeBot	4559cddaf9	Revert "Add py3.13t linux wheel build (#137127 )" This reverts commit 6b7adc12140d3073c5700cc1c48998556489857e. Reverted https://github.com/pytorch/pytorch/pull/137127 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but 2 jobs are failing ([comment](https://github.com/pytorch/pytorch/pull/137127#issuecomment-2389250791))	2024-10-02 17:44:42 +00:00
Wei Feng	b50b3b3219	[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955 ) this PR unblocks unit test with single Float8Linear module. It fixes following error ``` torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs) [rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955 Approved by: https://github.com/vkuzo, https://github.com/eqy	2024-10-02 17:26:45 +00:00
Henry Tsang	c318bafe9c	[inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153 ) Summary: Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`. I guess this means this test file has some refactoring opportunities as well. The next step would be to parametrize the add functions. Differential Revision: D63723925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153 Approved by: https://github.com/desertfire	2024-10-02 17:20:27 +00:00
Jean Schmidt	466623fb51	[CI] Support for CI GPU test and benchmark on containers (#137169 ) Renames the arc references to container, and add changes required so CI that requires GPU can run on containers Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169 Approved by: https://github.com/huydhn	2024-10-02 17:10:59 +00:00
Jean Schmidt	e3fd4d796f	[CI] Skip sccache for nvcc builds when building for A100 (#137170 ) There is a unknown issue with nvcc builds and sccache, it crashes with: ``` /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o 
CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o sccache: error: failed to execute compile sccache: caused by: error reading compile response from server sccache: caused by: Failed to read response header sccache: caused by: failed to fill whole buffer ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170 Approved by: https://github.com/huydhn	2024-10-02 17:07:24 +00:00
Jean Schmidt	d4cf90d282	[BE] [CI] Skip clean gha workspace if CI is running in a container for checkout-pytorch (#137168 ) For the reusable action checkout-pytorch, skips cleaning workspace when running from a container environment. The motivation for this change is twofold: * There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated; * In the specific case of GITHUB_WORKSPACE, to enable sharing this between multiple containers, it needs to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path; Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168 Approved by: https://github.com/huydhn	2024-10-02 17:04:50 +00:00
Bob Ren	af3e25fea7	remove capture_autograd_function flag (#136959 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136959 Approved by: https://github.com/jansel	2024-10-02 16:59:19 +00:00
Bin Bao	bcaa0f5ee9	[CI] Remove nanogpt from perf smoke test (#137176 ) Summary: nanogpt's performance is not stable. Remove it from the perf smoke test. We may want to use another test instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137176 Approved by: https://github.com/eellison	2024-10-02 16:35:04 +00:00
Jeff Daily	7001907480	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-02 16:27:15 +00:00
Max Hu	a954a9ea75	[Inductor] External callable registration API for Matmul tuning candidates (#130774 ) Fixes #[130769](https://github.com/pytorch/pytorch/issues/130769) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130774 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2024-10-02 15:38:10 +00:00
Animesh Jain	af86a6fdba	[dynamo][user-defined-class] Fallback when object.__new__ fails (#137044 ) Seen in https://github.com/vllm-project/vllm/pull/8949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137044 Approved by: https://github.com/jansel	2024-10-02 14:15:36 +00:00
Yu, Guangye	d29094888b	Use torch.Stream&torch.Event for Dynamo capture (#134850 ) # Motivation This PR aims to solve the multiple inheritance problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134850 Approved by: https://github.com/yf225, https://github.com/EikanWang	2024-10-02 14:15:33 +00:00
Brian Hirsh	bf73af4b4e	dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878 ) Previously if we had a graph like: ``` triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...) getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr'] getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr'] sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1) mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid) ``` The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning: (1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel) (2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent) Reviewed By: embg Differential Revision: D63551393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878 Approved by: https://github.com/zou3519	2024-10-02 13:52:44 +00:00
Bin Bao	5c2c3ca10b	[Inductor] Fix test_conv2d_unary_cpu_cpp_wrapper failure (#137158 ) Summary: test_conv2d_unary_cpu_cpp_wrapper is failing on ciflow/slow because of mis-handling of inf. This PR fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137158 Approved by: https://github.com/chenyang78	2024-10-02 13:21:35 +00:00
Colin Peppler	d117ec1d6e	[3/3][Inductor] Make CK work in FBCode (#136234 ) Summary: # Context Goal: Enable CK for Inductor in FBCode We split this stack into three diffs to help with review & in case we need to revert anything. # This Diff * Gets us to have CK kernels as an option for GEMM autotuning in Inductor. Reviewed By: zjing14 Differential Revision: D62662705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234 Approved by: https://github.com/tenpercent, https://github.com/chenyang78	2024-10-02 12:17:38 +00:00
atalman	6b7adc1214	Add py3.13t linux wheel build (#137127 ) Builder PR required: https://github.com/pytorch/builder/pull/2001 Test PR: https://github.com/pytorch/pytorch/pull/136490/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127 Approved by: https://github.com/albanD	2024-10-02 11:59:33 +00:00
Will Constable	8c29a0dd0e	[pipelining] Clean up dead code (#136804 ) The 'set_requires_grad' dict appears to be always full of "False" values, and we always set requires_grad based on the value of 'has_backward'. Setting of the requires_grad field was being done repeatedly during get_fwd_recv_ops, but it should be done just once, so move it to the function that creates recv buffers in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804 Approved by: https://github.com/kwen2501	2024-10-02 11:26:31 +00:00
cyy	862029a1ef	[Distributed] [15/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#137072 ) Follows #136848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137072 Approved by: https://github.com/kwen2501	2024-10-02 10:56:15 +00:00
Bob Ren	ed02309232	type _dynamo/create_parameter_op.py (#136958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136958 Approved by: https://github.com/jansel	2024-10-02 10:23:37 +00:00
Mu-Chu Lee	52d29a2b94	[reland #136389 ] Skip kernel saving if already existed (#137073 ) Summary: We skip the save_gpu_kernel if kernel is being saved already. This would give us a more accurate Triton profiling result. The following trace shows before/after the change for a benchmark of a trivial addmm: Before: <img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a"> After: <img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118"> We can see that before the change, the benchmarking includes two parts: (1) The overhead of our triton_heuristic call, which includes the save/get, and the (expensive) hash computation. (2) The exact computation of the Triton kernel. We see that (1) accounts for >50% of the time, which makes kernel selection during profiling choose aten kernels over Triton kernels. Test Plan: Existing OSS CI python test/inductor/test_cuda_cpp_wrapper.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073 Approved by: https://github.com/desertfire	2024-10-02 09:27:08 +00:00
zeshengzong	e374d6850a	[distributed][test] Remove unused variable and fix doc typo (#136943 ) Refactor distributed test code: - Fix TODO: Remove unused variable - Fix doc typo - Migrate deprecated method call `load_state_dict` and `save_state_dict` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136943 Approved by: https://github.com/H-Huang	2024-10-02 08:31:53 +00:00
Jason Ansel	e9a55b43a1	[inductor] Support lists of tensors in operatorbench (#136911 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911 Approved by: https://github.com/eellison	2024-10-02 06:41:06 +00:00
Will Feng	a89e3c2490	Add compiled_autograd_kwargs_override Dynamo config (#136967 ) For Traceable FSDP2, the most common use case is to have `fullgraph=False` for the forward pass (to allow user-level graph breaks), and `fullgraph=True` for the compiled autograd backward pass (required for queue_callback support). With `torch._dynamo.compiled_autograd=True`, previously we were not able to set a different `fullgraph` config value for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing. With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967 Approved by: https://github.com/xmfan	2024-10-02 06:23:59 +00:00
Nikita Shulga	b51d22b8bb	[BE] [NEON] Use `vshlq_n_u32` instead of `vshlq_u32` (#137122 ) As compiler optimizes it away anyway Pull Request resolved: https://github.com/pytorch/pytorch/pull/137122 Approved by: https://github.com/kit1980	2024-10-02 06:18:11 +00:00
chilli	2854d157de	Add type annotations for higher order ops/flex_attention (#137065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137065 Approved by: https://github.com/drisspg, https://github.com/Skylion007 ghstack dependencies: #136826, #137043, #137049	2024-10-02 04:39:25 +00:00
atalman	3b8511dadf	Remove python 3.8 from triton builds (#137141 ) All jobs have switched to Python 3.9. These 3.8 builds are no longer necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137141 Approved by: https://github.com/albanD	2024-10-02 03:36:54 +00:00
Bin Bao	8e39f2a4a5	[Inductor] Enable a SDPA pattern matching for CUDA (#137085 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/122429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137085 Approved by: https://github.com/eellison	2024-10-02 03:22:11 +00:00
alenawang	18525e185e	Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056 ) Fixes #132950 This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch. This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950). We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056 Approved by: https://github.com/fduwjj Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>	2024-10-02 01:45:00 +00:00
Ruben Rodriguez Buchillon	f108f88c40	[logging/debugging] handle None (constant) args in debug log (#137032 ) Summary: # Why The arguments are filtered out as they are just const in the compiled graph, but the logger still expects a non-None type # What When passing a filtered out arg (None) to the debug logger, just log that it's a filtered out argument, instead of throwing a Type error # Background https://github.com/pytorch/pytorch/pull/131594 Test Plan: - execute repro from https://github.com/pytorch/pytorch/issues/135584#issue-2516944089 with and without the edits Differential Revision: D63652564 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137032 Approved by: https://github.com/angelayi	2024-10-02 01:43:22 +00:00
Benjamin Glass	f984b88718	Ensure noncontiguous tensor creation tests offsetting (#136396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: #136055	2024-10-02 00:40:43 +00:00
Benjamin Glass	c7638da558	Lowerings: remove restriction on TensorBox keyword arguments (#136055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136055 Approved by: https://github.com/eellison	2024-10-02 00:40:43 +00:00
abhishek-fujitsu	63d6908da0	fix build error with gcc 12+ (#137092 ) Fixes #127920 This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target. Issue: When building with GCC 12+, the following error occurs: ``` error: argument 1 null where non-null expected [-Werror=nonnull] 431 \| __builtin_memmove(__result, __first, sizeof(_Tp) * _Num); \| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This change ensures that: 1. The flag is only added for GCC 12 or higher 2. The flag is only added if it's supported by the compiler 3. The flag is added specifically to the test_api target, not globally By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings. Test Plan: - Verified successful build with GCC 12 and above - Ensured no regression in builds with earlier GCC versions and other compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092 Approved by: https://github.com/malfet	2024-10-02 00:37:15 +00:00
Angela Yi	d725758210	[ts_converter] Fix prim::If buffer names (#136648 ) Summary: We previously incorrectly handled the following graph, specifically for the node `w.3` in `block0`: ``` graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu), %y.1 : int): %2 : __torch__.___torch_mangle_1.M = prim::CreateObject() %3 : int = prim::Constant[value=20](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:34 %4 : int = prim::Constant[value=10](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:34 %5 : int = prim::Constant[value=1](), scope: M:: %w.1 : int = prim::GetAttr[name="w"](%2), scope: M:: %7 : int = aten::mul(%w.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:25 = prim::SetAttr[name="w"](%2, %7), scope: M:: %h.1 : int = prim::GetAttr[name="h"](%2), scope: M:: %9 : int = aten::mul(%h.1, %3), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:25 = prim::SetAttr[name="h"](%2, %9), scope: M:: %10 : bool = aten::gt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:19 %res.37 : Tensor = prim::If(%10), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:16 block0(): %w.3 : int = prim::GetAttr[name="w"](%2), scope: M:: %res.1 : Tensor = aten::add(%x.1, %w.3, %5), scope: M:: # <string>:5:9 -> (%res.1) block1(): %h.3 : int = prim::GetAttr[name="h"](%2), scope: M:: %res.3 : Tensor = aten::add(%x.1, %h.3, %5), scope: M:: # <string>:5:9 -> (%res.3) %16 : bool = aten::lt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:19 %res : Tensor = prim::If(%16), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:16 block0(): %w : int = prim::GetAttr[name="w"](%2), scope: M:: %res.15 : Tensor = aten::add(%res.37, %w, %5), scope: M:: # <string>:5:9 -> (%res.15) block1(): %h : int = prim::GetAttr[name="h"](%2), scope: M:: %res.21 : Tensor = 
aten::add(%res.37, %h, %5), scope: M:: # <string>:5:9 -> (%res.21) return (%res) ``` Test Plan: CI Differential Revision: D63399064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136648 Approved by: https://github.com/SherlockNoMad	2024-10-02 00:07:47 +00:00
Sahan Paliskara	8765804542	Continue on error for pytorch autolint (#137104 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137104 Approved by: https://github.com/huydhn, https://github.com/atalman	2024-10-01 22:30:36 +00:00
Zain Rizvi	f0fa460c60	[BE] Add script to keep the runner-determinator scripts in sync (#136794 ) Whenever we update runner_determinator.py it needs to be copied over into _runner-determinator.yml. This is a quick utility script to make that process less tedious. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136794 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-10-01 22:26:28 +00:00
albanD	4f93de8951	Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899 ) PyList_GetItem are audited but not other APIs yet (they will be done in a follow up PR to keep this one small enough). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136899 Approved by: https://github.com/colesbury, https://github.com/atalman	2024-10-01 22:05:35 +00:00
Catherine Lee	6baee60e3c	upload test stats: remove nan/inf when uploading (#136877 ) `json.dumps(float("inf"))` returns `Infinity`, which is technically invalid JSON. This is fine if you json.load, but ClickHouse cannot handle it. Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877 Approved by: https://github.com/huydhn	2024-10-01 21:47:46 +00:00
eellison	0788d016d6	Update incompatible cudagraph ops skip message (#137015 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015 Approved by: https://github.com/BoyuanFeng	2024-10-01 21:23:36 +00:00
drisspg	34c18887ad	[FlexAttention] Remove restriction on QK headdim > V headdim (#135884 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135884 Approved by: https://github.com/Chillee	2024-10-01 21:17:54 +00:00
Jez Ng	99eb47fb6d	Add CI for Triton CPU backend (#135342 ) Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips. Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet	2024-10-01 20:43:10 +00:00
PyTorch MergeBot	86b715c5f6	Revert "Skip kernel saving if already existed. (#136389 )" This reverts commit 2521cd387482a70d30e4ea922fa4fe3b488c9f6d. Reverted https://github.com/pytorch/pytorch/pull/136389 on behalf of https://github.com/muchulee8 due to Issue #136940 ([comment](https://github.com/pytorch/pytorch/pull/136389#issuecomment-2386950623))	2024-10-01 20:04:43 +00:00
PyTorch MergeBot	b53ab8b86a	Revert "[dtensor][experimental] expose DTensor Context Parallel API (#137038 )" This reverts commit e23e766cc089b568aa4c0ebf0747ff9b504b8915. Reverted https://github.com/pytorch/pytorch/pull/137038 on behalf of https://github.com/huydhn due to Sorry for reverting your changes but the doc build failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/137038#issuecomment-2386902253))	2024-10-01 19:49:28 +00:00
Menglu Yu	a00f0d5db8	[PT2][Inductor] Add runtime numeric check for the post grad pass (#136724 ) Summary: Similar to D51838043, we further add post grad runtime numeric check since some graph passes are performed at aten-level. Differential Revision: D63438718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136724 Approved by: https://github.com/Yuzhen11	2024-10-01 18:56:01 +00:00
Edward Z. Yang	d61e45283e	Properly interpolate sloc here (#137088 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137088 Approved by: https://github.com/Skylion007	2024-10-01 18:33:03 +00:00
Jun Luo	c2dee8ea9c	enable lazy init for MTIA (#136902 ) Summary: As title. Test Plan: OSS and Internal CIs Reviewed By: nautsimon, hanzlfs Differential Revision: D63434511 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136902 Approved by: https://github.com/nautsimon	2024-10-01 18:30:56 +00:00
Nikita Shulga	1f3a793790	Fix PyTorch builds on MacOS-13 (#137095 ) By including SonomaOps header Fixes https://github.com/pytorch/pytorch/issues/137094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137095 Approved by: https://github.com/atalman, https://github.com/ZainRizvi	2024-10-01 17:56:35 +00:00
Xilun Wu	e23e766cc0	[dtensor][experimental] expose DTensor Context Parallel API (#137038 ) Summary expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038 Approved by: https://github.com/wz337, https://github.com/fegin	2024-10-01 17:41:28 +00:00
Tugsbayasgalan Manlaibaatar	73b07df042	Preserve custom ops via run_decomps (#136882 ) This is a re-apply of https://github.com/pytorch/pytorch/pull/136773?fbclid=IwZXh0bgNhZW0CMTEAAR3SmginkvZcILVY7G2XDa_KosnV4DPmq1l6pkjPIM255QgJLKVAR90rGAU_aem_ZWpcVdUsmAGzOGiwbjtBDg. Note that this doesn't completely remove the _preserve_ops list from export, mainly because we want to have a small change to address failing executorch tests. All the complications included in this PR are deleted in the next PR. Differential Revision: [D63553086](https://our.internmc.facebook.com/intern/diff/D63553086/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136882 Approved by: https://github.com/bdhirsh	2024-10-01 17:38:00 +00:00
Ruben Rodriguez Buchillon	b1b6816e05	[testing] reenable kernel_benchmark.py tests (#136876 ) Summary: # Why We want this to run internally # What - fix python path issue on the test - reenable the test # Background (copied from similar issue resolved earlier) It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark Differential Revision: D63498897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876 Approved by: https://github.com/henrylhtsang	2024-10-01 17:16:21 +00:00
Nikita Shulga	3d0cb81594	[MPS] Enable bfloat16 testing (#136987 ) By reducing the precision of imprecise FP16 ops even further, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`. I guess the low-precision results arise because MPSGraph, unlike the rest of PyTorch, does not do accumulation over fp32 for reduction operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987 Approved by: https://github.com/albanD ghstack dependencies: #137070	2024-10-01 17:10:07 +00:00
Pian Pawakapan	cc2a66c55e	[export] hook up mark_dynamic to export Dims (#137029 ) Adds Dim.DYNAMIC which calls torch._dynamo.mark_dynamic() in the backend. Similar to Dim.AUTO in that it does automatic inference for ranges & relations, but errors out for specializations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137029 Approved by: https://github.com/avikchaudhuri	2024-10-01 17:05:09 +00:00
Isuru Fernando	ef6fd3d780	Fix adaptive_max_pool2d fallback (#136367 ) Fixes #136332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-01 16:20:34 +00:00
Nikita Shulga	8f4f7bed5d	[MPS] Fix bfloat to complex casts (#137070 ) For Metal cast ops to complex, one needs to explicitly cast to/from `bfloat`, unlike for other dtypes. Tested in https://github.com/pytorch/pytorch/pull/136987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137070 Approved by: https://github.com/Skylion007	2024-10-01 15:47:29 +00:00
PyTorch MergeBot	696d01aef3	Revert "inductor: use previous guards to know if a size is 1 for broadcasting (#136670 )" This reverts commit dfdda2f6a603ae9245f38a3e8f6365c3cb6d49ac. Reverted https://github.com/pytorch/pytorch/pull/136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](`c010c6099b`) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))	2024-10-01 15:23:55 +00:00
PyTorch MergeBot	951107e8c2	Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 )" This reverts commit b17cd264d38ca3381391c449bdaf9f03381caf35. Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](`c010c6099b`) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))	2024-10-01 15:23:55 +00:00
PyTorch MergeBot	923410193b	Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760 )" This reverts commit c010c6099bf304bbb681af534b9f3996c33ce582. Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](`c010c6099b`) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))	2024-10-01 15:23:55 +00:00
Bob Ren	8f5c2b5f17	type _dynamo/test_case.py (#136957 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136957 Approved by: https://github.com/Skylion007	2024-10-01 14:36:22 +00:00
Bob Ren	d4cc2aaf1e	type _dynamo/logging.py (#136956 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136956 Approved by: https://github.com/Skylion007	2024-10-01 14:35:54 +00:00
PyTorch MergeBot	7303716005	Revert "Simplify find_localzeros (#133325 )" This reverts commit 99f90c379ed214ab30882a87bdb3924ed6d6c899. Reverted https://github.com/pytorch/pytorch/pull/133325 on behalf of https://github.com/ezyang due to https://fb.workplace.com/groups/gpuinference/permalink/2921405651341417/ ([comment](https://github.com/pytorch/pytorch/pull/133325#issuecomment-2385832600))	2024-10-01 13:25:03 +00:00
Edward Z. Yang	6bd9d37266	Remove allow-untyped-defs from torch.fx.experimental.symbolic_shapes (#137019 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/137019 Approved by: https://github.com/Skylion007 ghstack dependencies: #136934, #136935, #136972	2024-10-01 13:22:10 +00:00
Edward Z. Yang	cc8f1cddd4	Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972 Approved by: https://github.com/Skylion007 ghstack dependencies: #136934, #136935	2024-10-01 13:22:10 +00:00
Tom Ritchford	b85f21fc1d	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: #136653	2024-10-01 10:23:22 +00:00
chilli	083921852b	set FlexAttention devices properly during tracing (#137049 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137049 Approved by: https://github.com/zou3519, https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #136826, #137043	2024-10-01 09:08:08 +00:00
chilli	34cef1eaa7	Allow automatic dynamic shapes for closures and set current node properly in flexattention subgraph lowering (#137043 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137043 Approved by: https://github.com/drisspg ghstack dependencies: #136826	2024-10-01 09:08:08 +00:00
Haifeng Jin	37dd924c2d	Fix test/test_linalg.py for NumPy 2 (#136800 ) Related to #107302. When built and tested with NumPy 2 the following unit tests failed. ``` =========================================================== short test summary info ============================================================ FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor) FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor) FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor) FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor) FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested. FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]). FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]). FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]). FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]). =========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================ ``` This PR fixes them. The test is now compatible with both NumPy 1 & 2. 
Some more details: 1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep its behavior the same whether it is NumPy 1 or NumPy 2. 2. The cause of the failure is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2. 3. NumPy 2 does not allow `np.array(obj, copy=False)`, and recommends using `np.asarray(obj)` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800 Approved by: https://github.com/lezcano	2024-10-01 07:53:24 +00:00
Yu, Guangye	df5bbc09d1	Make device-specific event inherit from torch.Event (#134845 ) # Motivation This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event. The next PR intends to remove the previous useless base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845 Approved by: https://github.com/albanD, https://github.com/EikanWang	2024-10-01 06:28:41 +00:00
cyy	47a78daf91	[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 ) This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449 Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy	2024-10-01 06:24:30 +00:00
Yuanhao Ji	be169f743b	[Dynamo] Mark `config.dead_code_elimination` as deprecated (#136933 ) part of #136862 For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933 Approved by: https://github.com/williamwen42, https://github.com/anijain2305	2024-10-01 03:51:59 +00:00
Simon Fan	6e10f7d8c1	[compiled autograd] undo view_to_reshape inductor fx pass in node name matching (#136741 ) inductor mutates the aot backward graph. a solution could be to copy the graph, but since we don't know if compiled autograd is applied or not, it would be expensive to always clone it Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741 Approved by: https://github.com/jansel ghstack dependencies: #135663	2024-10-01 03:22:49 +00:00
Simon Fan	40157db5a7	[compiled autograd] log placeholder origin in verbose (#135663 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135663 Approved by: https://github.com/jansel	2024-10-01 03:22:49 +00:00
Henry Tsang	6966811da6	[test] skip not omit big gpu tests for cuda_cpp_wrapper (#137055 ) Summary: Problem is, when gpu is not big, we will omit the test cases in the test class. We expect the test to be skipped, but due to fbcode ci it can throw an error. This causes the test to be flaky. Test Plan: ci Differential Revision: D62037908 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055 Approved by: https://github.com/masnesral	2024-10-01 03:03:27 +00:00
cyy	17455695d6	[Distributed] [14/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136848 ) Follows #136713 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136848 Approved by: https://github.com/H-Huang	2024-10-01 02:01:13 +00:00
Edward Z. Yang	951af3d3d8	Format torch.fx.experimental.validator (#136935 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935 Approved by: https://github.com/Skylion007 ghstack dependencies: #136934	2024-10-01 01:47:17 +00:00
Edward Z. Yang	33c2d3232f	Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934 Approved by: https://github.com/Skylion007	2024-10-01 01:47:16 +00:00
chilli	d9c400bd9f	Added some tests to prevent regressions in partitioning and flexattention (#136826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136826 Approved by: https://github.com/yanboliang, https://github.com/drisspg	2024-10-01 01:08:44 +00:00
niklasz	3f457ee1f6	Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513 ) …inductor codegen. Fixes #136260 Note: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow. Regarding the bug, re-running the issue's reproduction code: ``` import torch def fn(x): return x.to(device="cuda", non_blocking=True) inp = torch.randn(3, 4) torch.compile(fn)(inp) ``` We now have the non_blocking being passed on to codegen properly: ``` V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] ===== pre insert_deferred_runtime_asserts __compiled_fn_1 ===== V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] <eval_with_key>.0 class GraphModule(torch.nn.Module): V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] def forward(self, L_x_: "f32[3, 4]"): V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] l_x_ = L_x_ V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True) V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True); l_x_ = None V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] return (to,) V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED 
GRAPH V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] ===== __compiled_fn_1 ===== V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] def forward(self, L_x_: "f32[3, 4][4, 1]cpu"): V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] l_x_ = L_x_ V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True) V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True); l_x_ = None V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] return (to,) V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, 
keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] ===== Forward graph 0 ===== I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"): I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True) I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True); arg0_1 = None I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32); device_put = None I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] return (convert_element_type,) I0922 20:33:25.409000 679839 
torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code: V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference'] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from 
torch._inductor.async_compile import AsyncCompile V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile() V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 
torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals()) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args): V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] arg0_1, = args V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] args.clear() V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride(arg0_1, (3, 4), (4, 1)) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] with torch.cuda._DeviceGuard(0): V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] torch.cuda.set_device(0) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] buf0.copy_(arg0_1, True) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del arg0_1 V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] return (buf0, ) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10): V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._dynamo.testing import rand_strided V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import print_performance V0922 20:33:25.983000 
679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] fn = lambda: call([arg0_1]) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] return print_performance(fn, times=times, repeat=repeat) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__": V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] compiled_module_main('None', benchmark_compiled_module) V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] ``` See above line `buf0.copy_(arg0_1, True)`. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513 Approved by: https://github.com/eellison	2024-10-01 00:32:47 +00:00
Shen Xu	19a4d68224	Add missing mappings to support torch.uint16 in quantization and export (#136547 ) Test Plan: CI. Differential Revision: D63142844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136547 Approved by: https://github.com/angelayi	2024-10-01 00:01:01 +00:00
eellison	18e707645c	Substitute unbacked symints in expressions (#137020 ) Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020 Approved by: https://github.com/ezyang	2024-09-30 23:07:22 +00:00
PyTorch MergeBot	af64c44b56	Revert "Don't uselessly recompute axiom dict every static eval call (#135429 )" This reverts commit 1d6e0412f5205b1cd709e034526d7f21d6f2d56f. Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/ezyang due to try again ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2384288879))	2024-09-30 22:29:13 +00:00
Dan Zimmerman	c07ebaf430	[triton] Try to use triton.language.extra.libdevice when possible (#136997 ) Summary: X-link: https://github.com/facebookresearch/generative-recommenders/pull/90 In view of https://github.com/triton-lang/triton/pull/3825 we should try to use `triton.language.extra.libdevice` instead of `triton.language.extra.cuda.libdevice`. Test Plan: CI Reviewed By: bertmaher, karthik-man Differential Revision: D63583965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136997 Approved by: https://github.com/bertmaher	2024-09-30 21:52:44 +00:00
Dan Zimmerman	b3972ee19a	[triton] Unify build_paths.py for NV & AMD, fix typing (#136952 ) Summary: Some build improvements. Test Plan: CI Differential Revision: D63583959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136952 Approved by: https://github.com/bertmaher	2024-09-30 21:51:45 +00:00
PyTorch MergeBot	66a269afe8	Revert "Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934 )" This reverts commit cf1a7eab250ea37ca8fda0327e8e38693c3c5c1a. Reverted https://github.com/pytorch/pytorch/pull/136934 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))	2024-09-30 21:44:44 +00:00
PyTorch MergeBot	c94536ae74	Revert "Format torch.fx.experimental.validator (#136935 )" This reverts commit 377e4bc877a3ac4cd6d073aa513a309159ade991. Reverted https://github.com/pytorch/pytorch/pull/136935 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))	2024-09-30 21:44:44 +00:00
PyTorch MergeBot	8982906502	Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972 )" This reverts commit 3ff2d93d9f72fd26503ef0cf5c5956edad4c52e6. Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))	2024-09-30 21:35:08 +00:00
abhishek-fujitsu	b825848d85	Fix aarch64 debug build with GCC (#136990 ) Fixes #136440 Issue: When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations. Fix: Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations. The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt: - For aarch64 GCC debug builds: use -Og - For all other debug builds: retain -O0 This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged. Testing: Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-30 21:11:50 +00:00
Andrew Gu	866a64ce9a	[FSDP2] Added check for contiguous parameters (#137000 ) Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous. We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000 Approved by: https://github.com/weifengpy	2024-09-30 21:10:47 +00:00
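The contiguity requirement described in the FSDP2 commit above can be illustrated without PyTorch. The following is a minimal sketch, assuming a row-major (C-order) layout; `is_row_major_contiguous` and `check_param_contiguous` are hypothetical names for illustration, not FSDP2's actual check, whose code and error message are not shown in the commit message.

```python
from typing import Sequence


def is_row_major_contiguous(shape: Sequence[int], stride: Sequence[int]) -> bool:
    """Return True if `stride` describes a dense row-major layout for `shape`."""
    if len(shape) != len(stride):
        raise ValueError("shape and stride must have the same number of dims")
    if any(size == 0 for size in shape):
        return True  # an empty tensor has no elements to lay out
    expected = 1
    # Walk from the innermost dimension outwards.
    for size, step in zip(reversed(shape), reversed(stride)):
        # The stride of a size-1 dimension is never used for addressing.
        if size != 1 and step != expected:
            return False
        expected *= size
    return True


def check_param_contiguous(name: str, shape: Sequence[int], stride: Sequence[int]) -> None:
    """Hypothetical construction-time check: fail early on non-contiguous params."""
    if not is_row_major_contiguous(shape, stride):
        raise ValueError(
            f"parameter {name!r} is not contiguous: "
            f"shape={tuple(shape)}, stride={tuple(stride)}"
        )
```

Under these assumptions, a (3, 4) parameter with strides (4, 1) passes, while its transpose, shape (4, 3) with strides (1, 4), is rejected when the check runs.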
PyTorch MergeBot	66e3186a48	Revert "Init threadpool with user defined num_threads before default (#136793 )" This reverts commit adbcaee950afa6697c04962096344bf0962a542f. Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))	2024-09-30 21:10:12 +00:00
Nikita Shulga	bc6adb9596	[EZ][BE] Delete `ISSUE_TEMPLATE.md` (#137040 ) As it has been superseded by the [ISSUE_TEMPLATE](https://github.com/pytorch/pytorch/tree/main/.github/ISSUE_TEMPLATE) folder, per https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository#creating-issue-forms Pull Request resolved: https://github.com/pytorch/pytorch/pull/137040 Approved by: https://github.com/ZainRizvi	2024-09-30 21:04:32 +00:00
Zain Rizvi	d46ebcb31b	Enable experiments for protected branches (#136785 ) This allows the protected branches (like `main` and `nightly`) to also run on the LF fleet, now that we've migrated over. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136785 Approved by: https://github.com/jeanschmidt	2024-09-30 20:58:28 +00:00
PyTorch MergeBot	2ef1454189	Revert "Add int1 to int7 dtypes (#136301 )" This reverts commit bfa16a161d5089a9ba008f5e665f29b58dc16526. Reverted https://github.com/pytorch/pytorch/pull/136301 on behalf of https://github.com/PaliC due to causing internal failures ([comment](https://github.com/pytorch/pytorch/pull/136301#issuecomment-2384119600))	2024-09-30 20:50:49 +00:00
Howard Huang	0ccd39a64b	Fix prefix store seg fault (#136872 ) fixes https://github.com/pytorch/pytorch/issues/136723 Do not allow `None` to be passed into `PrefixStore` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136872 Approved by: https://github.com/kwen2501	2024-09-30 20:43:08 +00:00
Tom Ritchford	7b96f3c75d	Fix six broken tests in test_ops.py (#136653 ) ## The problem. [A commit from three weeks ago](`82d00acfee`) appears to have broken five tests but was not caught by CI. [A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below). They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32` Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy` ## The solution The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices). This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area. 
## To reproduce To reproduce the six cases in your own client: ``` PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32 PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32 PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32 PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32 PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32 PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32 ``` @amjames Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653 Approved by: https://github.com/zou3519	2024-09-30 20:32:55 +00:00
Jez Ng	71aac59e93	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-30 20:24:52 +00:00
Laith Sakka	dfe1d45332	Enable tracing through auto_functionalized_v2 in compiled autograd (#136806 ) auto_functionalize_v2 will be the same as auto_functionalize except that args will have some more constants, or symints, and tensors are in one of the input list args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136806 Approved by: https://github.com/zou3519	2024-09-30 19:16:13 +00:00
PyTorch MergeBot	1ef5d4cdde	Revert "Allow parallelize_module to get device_mesh from ambient context (#134247 )" This reverts commit 80e7478cc84919a48770ad85d6118294776fca73. Reverted https://github.com/pytorch/pytorch/pull/134247 on behalf of https://github.com/malfet due to Broke lint, which one can clearly see in PR CI https://github.com/pytorch/pytorch/actions/runs/11112138513/job/30873604386 ([comment](https://github.com/pytorch/pytorch/pull/134247#issuecomment-2383952449))	2024-09-30 19:07:01 +00:00
Nikita Shulga	4af03e54b7	[MPS][BE] Use `None` as alias for all types (#137004 ) Tests like `new_` and `empty_` fail the current implementation, see Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003	2024-09-30 19:06:13 +00:00
Nikita Shulga	c610aa80dc	Testing: Unblock `new_*` testing on MPS (#137003 ) By changing `other_dtype` to `torch.half` rather than `double` in `sample_inputs_new_fns` if MPS is available Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986	2024-09-30 19:06:12 +00:00
Ke Wen	80e7478cc8	Allow parallelize_module to get device_mesh from ambient context (#134247 ) This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one. Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model. (The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.) For example: ``` class FeedForward(nn.Module): def __init__(self, config: TransformerArgs) -> None: super().__init__() w1 = nn.Linear(config.dim, config.hidden_dim, bias=False) w2 = nn.Linear(config.hidden_dim, config.dim, bias=False) w3 = nn.Linear(config.dim, config.hidden_dim, bias=False) self.w1 = parallelize_module(w1, Colwise) self.w2 = parallelize_module(w2, Rowwise) self.w3 = parallelize_module(w3, Colwise) def forward(self, x: Tensor) -> Tensor: y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x)) # y is a DTensor with Partial placement; we can return it as is. return y # Or we can convert it to Replicate -- there is modeling flexibility here. return y.redistribute(Replicate()) with device_mesh: model = FeedForward(config) # Now model is a model parallelized onto device_mesh y = model(x) ``` The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context. Calling `parallelize_module` from within model hierarchy also saves the use of FQNs as in the out-of-model annotation case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247 Approved by: https://github.com/tianyu-l	2024-09-30 18:42:06 +00:00
Nikita Shulga	40f80a70fa	Fix lint (#137023 ) By migrating some of the workflows to Python-3.9 as 3.8 has been deprecated by https://github.com/pytorch/pytorch/pull/132138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137023 Approved by: https://github.com/ZainRizvi, https://github.com/janeyx99, https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007	2024-09-30 18:29:02 +00:00
Quinn Zhu	d33638588e	[aoti][inplace] Support skipping model buffers (#136770 ) Summary: Some AOTI tensor constants may be model buffers that never need to be updated. Differential Revision: D62777502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770 Approved by: https://github.com/muchulee8	2024-09-30 18:28:42 +00:00
Edward Z. Yang	3ff2d93d9f	Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972 Approved by: https://github.com/Skylion007 ghstack dependencies: #136917, #136934, #136935	2024-09-30 18:04:36 +00:00
Sahan Paliskara	475a8a4e0c	Update ci-sev.md to make merge blocking not the default	2024-09-30 10:53:31 -07:00
Nikita Shulga	76a57568de	Update windows maintainers (#136901 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136901 Approved by: https://github.com/albanD	2024-09-30 16:12:49 +00:00
Nikita Shulga	ae3d5ed589	[MPS] Enable `nan_to_num` for bfloat16 (#136986 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136986 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985	2024-09-30 16:09:44 +00:00
Nikita Shulga	d8d3aeae59	[MPS] Enable Renorm for bfloat16 (#136985 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136985 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984	2024-09-30 16:09:44 +00:00
Nikita Shulga	538fcd7579	[MPS] Enable `torch.linalg.cross` for bfloat16 (#136984 ) By adding explicit instantiation. Tested in https://github.com/pytorch/pytorch/pull/136987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136984 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983	2024-09-30 16:09:40 +00:00
PyTorch MergeBot	c13c7e11c5	Revert "[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514 )" This reverts commit 6931c1644afdba53e63ce5671455e4e1b7265dd9. Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk `6931c1644a` ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2383563919))	2024-09-30 15:47:04 +00:00
Nikita Shulga	33d3d6e42a	[MPS] Enable bucketization for bfloat16 (#136983 ) By simply adding explicit instantiation Tested in https://github.com/pytorch/pytorch/pull/136987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136983 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982	2024-09-30 14:45:57 +00:00
Nikita Shulga	3ed2969889	[MPS] Extend `fmin`/`fmax`/`copysign` and `nextafter` to bfloat (#136982 ) Just adds instantiation of the kernels and, in some cases, an explicit cast. Tested in https://github.com/pytorch/pytorch/pull/136987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136982 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981	2024-09-30 14:45:57 +00:00
Nikita Shulga	797092b263	[MPS] Fix Gamma for bfloat16 dtypes (#136981 ) Before this change, test failed with unable to compile errors, as `bfloat16` requires explicit cast. Tested in https://github.com/pytorch/pytorch/pull/136987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136981 Approved by: https://github.com/Skylion007	2024-09-30 14:45:52 +00:00
Bin Bao	a15f3f51bc	[AOTI] Update sam_fast from timeout to fail_to_run (#136996 ) Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996 Approved by: https://github.com/ezyang	2024-09-30 14:05:49 +00:00
Brian Hirsh	c010c6099b	compile time benchmarks for AOTDispatcher (partitioner) (#136760 ) compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because: (1) it consists of a single input + many weights that are used sequentially (2) contains a mix of recompute vs non-recomputed ops (matmul + sin) (3) it is relatively simple from running locally: ``` collecting compile time instruction count for aotdispatcher_partitioner_cpu compile time instruction count for iteration 0 is 21764219181 compile time instruction count for iteration 1 is 12475020009 compile time instruction count for iteration 2 is 12463710140 compile time instruction count for iteration 3 is 12455676489 compile time instruction count for iteration 4 is 12451344330 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760 Approved by: https://github.com/ezyang ghstack dependencies: #136670, #136759	2024-09-30 13:25:02 +00:00
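The instruction counts quoted in the partitioner benchmark commit above show a much more expensive first iteration followed by a stable steady state. A small sketch of that comparison, using only the five numbers from the commit message; the helper names `warmup_ratio` and `steady_state_spread`, and the choice to treat iteration 0 as warm-up, are assumptions for illustration and not part of the benchmark itself.

```python
from statistics import mean

# Instruction counts for aotdispatcher_partitioner_cpu, iterations 0..4,
# copied from the commit message.
counts = [
    21764219181,
    12475020009,
    12463710140,
    12455676489,
    12451344330,
]


def warmup_ratio(samples):
    """Ratio of the first iteration to the mean of the remaining ones."""
    return samples[0] / mean(samples[1:])


def steady_state_spread(samples):
    """Relative spread (max - min) / mean over iterations 1 and later."""
    steady = samples[1:]
    return (max(steady) - min(steady)) / mean(steady)


if __name__ == "__main__":
    print(f"warm-up ratio: {warmup_ratio(counts):.2f}")
    print(f"steady-state spread: {steady_state_spread(counts):.4%}")
```

Treating iteration 0 separately keeps one-time costs out of the steady-state figure, which is the number a regression check would most plausibly track.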
Brian Hirsh	b17cd264d3	compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759 ) this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher: (1) inference vs training code paths (2) "subclasses" vs "no subclasses" codepaths Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely) I ran locally, and got these numbers on the 4 paths: ``` collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu compile time instruction count for iteration 0 is 11692348671 compile time instruction count for iteration 1 is 3026287204 compile time instruction count for iteration 2 is 3011467318 compile time instruction count for iteration 3 is 3004485935 compile time instruction count for iteration 4 is 3003087410 collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu compile time instruction count for iteration 0 is 6068003223 compile time instruction count for iteration 1 is 5585418102 compile time instruction count for iteration 2 is 5581856618 compile time instruction count for iteration 3 is 5581651794 compile time instruction count for iteration 4 is 5578742619 collecting compile time instruction count for aotdispatcher_inference_subclass_cpu compile time instruction count for iteration 0 is 8634984264 compile time instruction count for iteration 1 is 8633467573 compile time instruction count for iteration 2 is 8632182092 compile time instruction count for iteration 3 is 8632056925 compile time instruction count for iteration 4 is 8632543871 collecting compile time instruction count for aotdispatcher_training_subclass_cpu compile time instruction count for iteration 0 is 14737239311 compile time instruction count for iteration 1 is 14734346427 compile time instruction count for iteration 2 is 14736493730 compile time instruction count for iteration 3 is 14734121272 compile time instruction 
count for iteration 4 is 14733852882 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759 Approved by: https://github.com/laithsakka ghstack dependencies: #136670	2024-09-30 13:25:02 +00:00
Brian Hirsh	dfdda2f6a6	inductor: use previous guards to know if a size is 1 for broadcasting (#136670 ) Fixes https://github.com/pytorch/pytorch/issues/136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. 
Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670 Approved by: https://github.com/ezyang	2024-09-30 13:24:57 +00:00
cyy	05b15dba7e	[1/N] Fix clang-tidy warnings in torch/csrc/api/ (#134545 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134545 Approved by: https://github.com/ezyang	2024-09-30 09:06:30 +00:00
Bin Bao	d6d9183456	[Inductor] Switch cpp_wrapper tests to ABI-compatible (#136904 ) Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136904 Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78	2024-09-30 05:44:52 +00:00
Bin Bao	ad8fae2aa9	[AOTI] Support test_open_device_registration in ABI-compatible (#136906 ) Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136906 Approved by: https://github.com/chenyang78	2024-09-30 05:08:16 +00:00
Aaron Gokaslan	8dddd45679	[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920 ) Updates cudnn frontend submodule to v1.7.0 which has some bugfixes and a couple new features. https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.7.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136920 Approved by: https://github.com/ezyang	2024-09-30 02:50:16 +00:00
Thomas	80393c90b3	docs: clarify alias usage for `x` parameter in vector_norm function (#136921 ) - Added a note in the documentation specifying that the `input` parameter can be used as an alias for `x`. Fixes #136560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136921 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-09-30 02:50:06 +00:00
Edward Z. Yang	377e4bc877	Format torch.fx.experimental.validator (#136935 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935 Approved by: https://github.com/Skylion007 ghstack dependencies: #136917, #136934	2024-09-30 02:20:40 +00:00
Edward Z. Yang	cf1a7eab25	Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934 Approved by: https://github.com/Skylion007 ghstack dependencies: #136917	2024-09-30 02:20:40 +00:00
xinan.lin	0a26851601	[Inductor] Handle device property `warp_size` is None but used on XPU. (#136834 ) Fix #136820 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136834 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-09-30 02:08:45 +00:00
CaoE	6931c1644a	[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514 ) It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-30 00:53:18 +00:00
Edward Z. Yang	9dbc6bacff	Propagate detailed location information of shape guards to guards/recompiles output (#136917 ) To see the payoff, look at test/dynamo/test_logging.py The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations. The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917 Approved by: https://github.com/anijain2305	2024-09-30 00:43:12 +00:00
Laith Sakka	e205193e1c	Enable failing diffs on regression (#136551 ) 1. example of failing diff https://github.com/pytorch/pytorch/pull/136740 2. test this by running python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv results ``` WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results. REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results. MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it ``` MISSING REGRESSION TEST does not fail, but it is logged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551 Approved by: https://github.com/ezyang ghstack dependencies: #136383	2024-09-29 22:31:26 +00:00
Jeff Daily	d33a5e2a57	[ROCm] fastSpecializedAtomicAdd for MI300 (#135770 ) MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135770 Approved by: https://github.com/xw285cornell, https://github.com/jianyuh	2024-09-29 21:52:09 +00:00
fduwjj	c9653bf2ca	[Elastic][fix] Use the right env variable TORCH_ELASTIC_WORKER_IDENTICAL for unit test (#136916 ) As the title says, this is an easy fix for a unit test. Differential Revision: [D63577774](https://our.internmc.facebook.com/intern/diff/D63577774/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136916 Approved by: https://github.com/wz337 ghstack dependencies: #136865	2024-09-29 03:55:10 +00:00
cyy	156ca01e51	Enable clang-tidy on torch/csrc/lazy (#136851 ) Enable clang-tidy on torch/csrc/lazy Pull Request resolved: https://github.com/pytorch/pytorch/pull/136851 Approved by: https://github.com/Skylion007	2024-09-28 21:16:40 +00:00
Aaron Gokaslan	d3c2123ea6	[BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686 ) * This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still better to set the CMake variable properly. * Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazels. * Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time. * These kernels and settings will be needed for Flash Attention 3 whenever we add that too. Fixes #133695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686 Approved by: https://github.com/ezyang	2024-09-28 21:11:15 +00:00
Edward Z. Yang	1d6e0412f5	Don't uselessly recompute axiom dict every static eval call (#135429 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429 Approved by: https://github.com/isuruf	2024-09-28 20:59:59 +00:00
FFFrog	6ecb73bafd	Limit the option value of TORCH_SHOW_DISPATCH_TRACE (#136510 ) It's more convenient for users to enable or disable the dispatch trace by setting TORCH_SHOW_DISPATCH_TRACE to 1 or 0, especially when debugging in an IDE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136510 Approved by: https://github.com/shink, https://github.com/ezyang	2024-09-28 20:59:05 +00:00
Boyuan Feng	28224329ad	[Flex Attention] fix block size order (#136657 ) `create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`. This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-09-28 19:56:53 +00:00
Jason Ansel	cf53ab95dc	[halide-backend] Fix ops.fma codegen (#136810 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136810 Approved by: https://github.com/eellison ghstack dependencies: #136808, #136809	2024-09-28 19:26:04 +00:00
Jason Ansel	8da9c4178c	[inductor] Benchmark Halide in operatorbench.py (#136809 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809 Approved by: https://github.com/eellison ghstack dependencies: #136808	2024-09-28 19:26:04 +00:00
atalman	a54b69279b	Bump triton pin to latest 3.1.x release branch (#136874 ) Moves pin to latest in release/3.1.x Pull Request resolved: https://github.com/pytorch/pytorch/pull/136874 Approved by: https://github.com/bertmaher, https://github.com/drisspg, https://github.com/kit1980, https://github.com/malfet	2024-09-28 13:47:07 +00:00
Ivan Zaitsev	b35f70da05	[ez] fixup the export of D62879819 (#136900 ) a line from D62879819 (#136190) went missing somehow Pull Request resolved: https://github.com/pytorch/pytorch/pull/136900 Approved by: https://github.com/atalman	2024-09-28 13:46:17 +00:00
Banit Agrawal	c4ae45104f	[PyTorch Pinned Allocator] Move background thread init from constructor to allocate function (#136879 ) Differential Revision: D63553138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136879 Approved by: https://github.com/zyan0	2024-09-28 07:24:44 +00:00
Jason Ansel	375921b755	[inductor] Improve operatorbench.py (#136808 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808 Approved by: https://github.com/eellison	2024-09-28 06:22:02 +00:00
James Wu	96104db132	[easy] fix typo in debug logs for fx graph cache (#136889 ) Summary: Accidentally messed up the debug logging here, fixing typo (scuba + tlparse logging is unaffected) Test Plan: existing tests Differential Revision: D63555766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136889 Approved by: https://github.com/oulgen	2024-09-28 03:56:09 +00:00
Shivam Raikundalia	9e4f24f8e5	Fix PT2 Source Code Annotations (#136460 ) Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives inaccurate IDs because we are not getting the context that is being looked up within the eval_frame. For this reason, we decided to revert that change, and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation, thus bypassing any need for kwargs. Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph. Differential Revision: D63138786 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460 Approved by: https://github.com/ezyang	2024-09-28 03:54:43 +00:00
Nitin Jain	8df97d78c2	[QAT] Make Fused modules torchscriptable (#136285 ) Summary: Same as title. Inspired by: https://pytorch.org/tutorials/recipes/script_optimized.html#fix-common-errors-when-using-the-script-method Test Plan: CI Differential Revision: D62980019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136285 Approved by: https://github.com/jerryzh168	2024-09-28 03:46:19 +00:00
wz337	93dcb92bae	[DeviceMesh][EZ] Add group description to new group (#136558 ) Add group description to new_group in device_mesh to help with debuggability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558 Approved by: https://github.com/kwen2501, https://github.com/fduwjj	2024-09-28 03:09:41 +00:00
Edward Z. Yang	99f90c379e	Simplify find_localzeros (#133325 ) Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325 Approved by: https://github.com/albanD	2024-09-28 02:38:31 +00:00
Jerry Zhang	bfa16a161d	Add int1 to int7 dtypes (#136301 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases for weight quantization (https://www.internalfb.com/diff/D62464487) Test Plan: python test/test_quantization.py -k test_uint4_int4_dtype Pull Request resolved: https://github.com/pytorch/pytorch/pull/136301 Approved by: https://github.com/ezyang	2024-09-28 02:08:33 +00:00
albanD	e4571e7025	Add abi flags to cpp_extension cache folder (#136890 ) This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues. This is rare today for people to run both normal and pydebug on the same machine, but we expect quite a few people will run normal and nogil on the same machine going forward. This is tested locally by running each version alternatively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890 Approved by: https://github.com/colesbury	2024-09-28 00:49:56 +00:00
fduwjj	f42e88fea5	[reland][Elastic] Skip store barrier and store get in host assign (#136865 ) As title this is to reland https://github.com/pytorch/pytorch/pull/136579 as it broke some OSS CI Differential Revision: [D63542918](https://our.internmc.facebook.com/intern/diff/D63542918/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136865 Approved by: https://github.com/atalman	2024-09-27 23:40:42 +00:00
David Berard	ef3142d2a0	[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686 ) In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time. However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871). This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686 Approved by: https://github.com/zou3519	2024-09-27 23:02:46 +00:00
Kunal Bhalla	9d67c31758	Cast device index to int before logging (#135405 ) int8_t = DeviceIndex is interpreted by cout as a char, which then shows up as a control character in logs (e.g. ^A). Explicitly casting to int to have the numbers printed out correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405 Approved by: https://github.com/wconstab	2024-09-27 23:01:12 +00:00
angelayi	fe158cfb47	[aoti] Add warning to ask users to switch to new API (#135893 ) Instead of the following: ``` so_path = torch._export.aot_compile(...) torch._export.aot_load(so_path) ``` The recommended path is to: ``` ep = torch.export.export(...) pt2_path = torch._inductor.aoti_compile_and_package(ep, ...) torch._inductor.package.load_package(pt2_path) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135893 Approved by: https://github.com/desertfire	2024-09-27 22:38:11 +00:00
Jane Xu	adbcaee950	Init threadpool with user defined num_threads before default (#136793 ) Fixes #134714 (or attempts to, idk how to test yet) For posterity, how one can test: 1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary 2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python 3. import torch 4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793 Approved by: https://github.com/albanD	2024-09-27 22:22:37 +00:00
Jesse Cai	bc21689136	[sparse][semi-structured] Add float8 dtype support to 2:4 sparsity (#136397 ) Summary: This PR adds `torch.float8_e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`. This will let users run fp8 + 2:4 sparse matmuls on Hopper GPUs with cusparselt >= 0.6.2, via the `scaled_mm` API. ``` A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16) B = torch.rand(dense_input_shape, device=device).to(torch.float16).t() A_fp8, A_scale = to_float8(A) B_fp8, B_scale = to_float8(B) dense_result = torch._scaled_mm( A_fp8, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype ) A_fp8_sparse = to_sparse_semi_structured(A_fp8) sparse_result = torch._scaled_mm( A_fp8_sparse, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype ) ``` Note that to keep this consistent with normal torch behavior, calling `torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError. I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the backend to make the tests a bit cleaner. Test Plan: ``` python test/test_sparse_semi_structured -k scaled_mm python test/test_sparse_semi_structured -k fp8 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397 Approved by: https://github.com/drisspg	2024-09-27 21:37:34 +00:00
Oguz Ulgen	a28b40fa74	Improve is_fbcode functionality (#136871 ) Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error-prone. Let's make it fool-proof. Test Plan: unit tests Reviewed By: masnesral Differential Revision: D63545169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871 Approved by: https://github.com/masnesral	2024-09-27 21:19:01 +00:00
Nikita Shulga	283bda01aa	[MPS] Error checking/bf16 support for `torch.normal` (#136863 ) Before this change, attempting to run something like ``` % python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))" ``` resulted in a hard error ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<xf32> /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification' ``` After the change, it raises a nice type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755, #136821, #136822	2024-09-27 21:11:59 +00:00
PyTorch MergeBot	f7ab0e9989	Revert "[Flex Attention] fix block size order (#136657 )" This reverts commit b42f1e3641314c8dc369255b850450acddf3477c. Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/ZainRizvi due to Sorry, this seems to break ROCm builds. inductor/test_flex_attention.py::TestFlexAttention::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_float16_score_mod1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/11069782242/job/30759299713) [HUD commit link](`b42f1e3641`) ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2380031525))	2024-09-27 20:47:54 +00:00
Yifu Wang	6e70ec9aa5	[SymmetricMemory] expose the multicast_ptr (#136840 ) This allows writing triton kernels using the `multimem` ptx instructions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136840 Approved by: https://github.com/Chillee	2024-09-27 20:41:33 +00:00
PyTorch MergeBot	f21b471978	Revert "Fix numerical instability for norm (#129352 )" This reverts commit 66340e67515cd3592bda6bdd9bfe2ffa22fe7413. Reverted https://github.com/pytorch/pytorch/pull/129352 on behalf of https://github.com/atalman due to Breaks Internal CI ([comment](https://github.com/pytorch/pytorch/pull/129352#issuecomment-2379989485))	2024-09-27 20:18:47 +00:00
Yifu Wang	d55eef5c59	[SymmetricMemory] improve multicast initialization/fallback logic (#136577 ) Fixes https://github.com/pytorch/pytorch/issues/136494 Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fallback gracefully. - In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device. - Before initializing multicast for a block, ensure all ranks/devices have multicast support. - This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block. - Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577 Approved by: https://github.com/Chillee, https://github.com/eqy	2024-09-27 20:04:21 +00:00
Wei Wang	e512eac410	Companion PR to https://github.com/pytorch/pytorch/pull/134022 (#136818 ) Note:[ cusparselt 0.6.0](https://docs.nvidia.com/cuda/cusparselt/release_notes.html#cusparselt-v0-6-0)+ supports SM90 (Hopper). Thanks @xwang233 for catching this bug while testing upstream binaries! Fixes the issues like: ``` A_compressed = torch._cslt_compress(A) RuntimeError: CUDA error: architecture mismatch when calling `cusparseLtInit(&handle)` ``` @kit1980 Could we get this cherry-picked to 2.5.0 please? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136818 Approved by: https://github.com/eqy, https://github.com/jcaip, https://github.com/malfet	2024-09-27 19:57:15 +00:00
Kimish Patel	e5a57932f0	[Pytorch][AO] Update choose_qparams_per_token op to output correct shape for scales and zp (#136807 ) - also makes scales and zp dtype reconcile with meta impl as well as other quantized ops' representation of scales and zero point - make sure quantize_per_token's output_dtype is respected There are a few places where we need to reconcile on scale and zero point dtype but that will come later. These fixes are mainly being done to enable quantized kv cache through the ET stack Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807 Approved by: https://github.com/jerryzh168	2024-09-27 18:46:17 +00:00
Pian Pawakapan	6075f566cc	[export] simplify automatic dynamic shapes processing (#136591 ) Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620. This reverts back to `assume_static_by_default=True` with the use of dynamo decorators (e.g. `maybe_mark_dynamic, mark_static`) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't require reasoning about "inverting" the dynamic_shapes specs, and also opens up usage of other decorators (`mark_dynamic, mark_unbacked`). On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. `7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)` One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked. Differential Revision: D63358628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591 Approved by: https://github.com/avikchaudhuri	2024-09-27 18:28:51 +00:00
Bob Ren	a8b5adcdd5	add types to _dynamo/code_context.py (#136665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136665 Approved by: https://github.com/williamwen42	2024-09-27 18:27:42 +00:00
PyTorch MergeBot	287dc36395	Revert "[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686 )" This reverts commit 9f5b97a0065dfc4a7978a0fdf3fac2df8aef9519. Reverted https://github.com/pytorch/pytorch/pull/136686 on behalf of https://github.com/davidberard98 due to breaks lint on main. Please rebase to see and fix the error ([comment](https://github.com/pytorch/pytorch/pull/136686#issuecomment-2379830921))	2024-09-27 18:25:49 +00:00
Mikayla Gawarecki	2208ff64ba	Fix RMSNorm doc per #136597 (#136727 ) Fixes #136597 (remove incorrect sqrt around `RMS(x)`) <img width="857" alt="Screenshot 2024-09-26 at 11 46 32 AM" src="https://github.com/user-attachments/assets/21ea26ad-bd9f-4b9b-8b60-f52a1dc16da6"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136727 Approved by: https://github.com/albanD	2024-09-27 18:21:48 +00:00
William Wen	2157e396a3	[dynamo] attempt run only mode when dynamo cache limit is hit (#136655 ) Implement https://github.com/pytorch/pytorch/issues/135458. Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655 Approved by: https://github.com/jansel	2024-09-27 17:15:05 +00:00
PyTorch MergeBot	36428f91e9	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))	2024-09-27 16:54:27 +00:00
Davis Rollman	17f396b0b4	Delete project.default_flavors_mode buckconfig (#136772 ) Summary: Buck1 only buckconfig Test Plan: CI Reviewed By: JakobDegen Differential Revision: D63430482 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136772 Approved by: https://github.com/malfet	2024-09-27 16:24:50 +00:00
cyy	cbc182d2e0	Remove problematic constructor (#136708 ) Since it calls a pure virtual function and it is not used elsewhere. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136708 Approved by: https://github.com/ezyang	2024-09-27 16:16:58 +00:00
James Wu	dc8c0aaf4d	[AOTAutogradCache] Log time taken_ns (#136529 ) Summary: This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry. This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events. Test Plan: Run benchmark, see that the times are in the chromium events. Reviewed By: aorenste Differential Revision: D62590077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529 Approved by: https://github.com/oulgen	2024-09-27 16:14:09 +00:00
David Berard	9f5b97a006	[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686 ) In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time. However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871). This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686 Approved by: https://github.com/zou3519	2024-09-27 16:11:02 +00:00
bhack	ad51995468	Add a nightly hotpatch utils for python only PR (#136535 ) I think this could help many teams, especially the compile/export teams (/cc @ezyang), by letting end users and bug reporters quickly test a WIP PR when reporting a related bug. It can be run quickly in an official nightly Docker container or in a nightly venv/conda env. Let me know what you think. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535 Approved by: https://github.com/ezyang	2024-09-27 15:58:48 +00:00
Nikita Shulga	9d72f7481b	[MPS] Fix AvgPool2d for float16 (#136822 ) This was a stupid cast error that caused MPSGraph to crash with the following exception ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<xf32> /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755, #136821	2024-09-27 15:32:18 +00:00
Nikita Shulga	2b6f4e9e24	[BE][MPS] Delete MacOS12 low-precision ops (#136821 ) `norm` and `masked.normalize` still have to stay in the list Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755	2024-09-27 15:32:18 +00:00
Sam Larsen	45a8b5682e	[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136858 ) This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing. Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this: `tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))(libdevice.sqrt((1 + ((ks0 // 3278)(ks0 // 3278)) + ((-2)(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])` https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want? Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858 Approved by: https://github.com/atalman	2024-09-27 15:14:12 +00:00
IvanKobzarev	34d788ffb0	[aotd] Do not force contiguous() for channels_last (#135225 ) Original Issue: https://github.com/pytorch/pytorch/issues/134644 We assume trace_tangents to have the same memory_format as inputs, outputs, and intermediates during first tracing. => Tracing time: - Store trace_tangents_memory_formats in metadata - Coerce tangents to deduced memory_format Runtime: - Coerce tangents to tracing memory format from metadata Subclasses logic: - Previously the tangent-coercing logic did not handle the nested subclasses case; this fixes it. For subclasses we deduce the memory format for subclass_tensor first, then for each element of the subclass: [subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ] If a subclass element (__tensor_flatten__[0] tensors) is also a subclass => in its place we will have a nested list of the same structure. The recursive traversal of the subclass tree is expensive, so we do memory format deduction and coercing at the same time, to keep only one traversal. With this approach there is no regression in comparison with the previous logic, which also does one traversal. (`coerce_tangent_and_suggest_memory_format` method). Other small change: Remove a duplicated, unrelated comment. Testing ``` python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous ``` Benchmarking: After change: ``` └─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms ``` Before change: ``` BEFORE_CHANGE SUBCLASS 4.1194 ``` No significant changes in processing time. (We do a single traversal of the subclass tree for collecting memory_formats and coercing during tracing.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225 Approved by: https://github.com/bdhirsh	2024-09-27 15:01:20 +00:00
PyTorch MergeBot	de159f0c8d	Revert "Deal with size oblivious before going into worker (#135137 )" This reverts commit 285fa03b5e1540a52b354664f609f8576c5b5431. Reverted https://github.com/pytorch/pytorch/pull/135137 on behalf of https://github.com/ezyang due to this is the one that actually broke main ([comment](https://github.com/pytorch/pytorch/pull/135137#issuecomment-2379438566))	2024-09-27 14:41:27 +00:00
Justin Chu	1be3d62237	[ONNX] Remove unused functions (#136609 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136609 Approved by: https://github.com/Skylion007	2024-09-27 14:34:05 +00:00
PyTorch MergeBot	e5228a7771	Revert "Don't uselessly recompute axiom dict every static eval call (#135429 )" This reverts commit 507c69e20f645fdb0fbf43b05be0c5117971464e. Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/malfet due to It(or it's parent) broke trunk CI, see `507c69e20f` ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2379422971))	2024-09-27 14:33:25 +00:00
Crefeda Rodrigues	a55aa71b04	Limit number of cores to 16 when benchmarking Inductor on ARM (#136424 ) Sets OMP_NUM_THREADS to 16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136424 Approved by: https://github.com/malfet	2024-09-27 14:22:49 +00:00
PyTorch MergeBot	e9d2765ec8	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit d1bb8e828f280d1c66fff193c043d5bc36154577. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226))	2024-09-27 12:54:47 +00:00
Wu, Chunyuan	c2637a7b26	[inductor] [cpp] fix gemm_output_name conflict (#136419 ) Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case: ```log File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call C_ptr = f"&({kernel.index(C, [0, 0])})" File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index else self.args.input(node.get_name()) File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input assert name not in V.graph.removed_buffers, name AssertionError: buf_GemmOut ``` The 1st and 2nd linears do not need to use a local buffer, while the 3rd linear does. The 3rd linear, which uses a local buffer, adds its global buffer (named `buf_GemmOut`) to `V.graph.removed_buffers`. When scheduling the nodes, the 1st linear (which won't use a local buffer) gets its output buffer (also named `buf_GemmOut`) from the input, finds that it is in `V.graph.removed_buffers`, and raises an AssertionError. The issue is that the output buffers of all these linears are named `buf_GemmOut`, which causes a conflict. Rename these buffers by adding the name of the `template_buffer` as the prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #136418, #136518	2024-09-27 12:23:17 +00:00
Boyuan Feng	b42f1e3641	[Flex Attention] fix block size order (#136657 ) `create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`. This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-09-27 11:26:47 +00:00
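The block-size ordering described in the commit above can be illustrated with a minimal, pure-Python sketch. The helper `block_mask_shape` is hypothetical (it is not part of the Flex Attention API), and the ceiling-division rule for the number of blocks is an assumption made for illustration. It only shows why the order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)` matters once the two sizes differ.

```python
def block_mask_shape(q_len, kv_len, block_size):
    """Hypothetical helper: number of (query, key/value) blocks.

    Assumes block_size is ordered (Q_BLOCK_SIZE, KV_BLOCK_SIZE) and that
    each dimension is split with ceiling division.
    """
    q_block, kv_block = block_size  # query block size first, then key/value
    num_q_blocks = -(-q_len // q_block)     # ceiling division
    num_kv_blocks = -(-kv_len // kv_block)  # ceiling division
    return num_q_blocks, num_kv_blocks


# With equal block sizes the order is irrelevant ...
print(block_mask_shape(1024, 2048, (128, 128)))
# ... but with unequal sizes, swapping the two values changes the shape.
print(block_mask_shape(1024, 2048, (256, 64)))
print(block_mask_shape(1024, 2048, (64, 256)))
```

A caller that passes the sizes in the opposite order gets a differently shaped block grid, which is the kind of mismatch the commit describes.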
IvanKobzarev	9581508383	[aotd] Cleanup on subclasses in inductor freezing (#136549 ) Cleanup: 1/ We do not need to unwrap_subclasses() in the freezing wrapper, as it will be wrapped by AOTD wrappers, which include SubclassesWrapper 2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat. Verified fbcode tests compiled_optimizers Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549 Approved by: https://github.com/bdhirsh	2024-09-27 11:20:03 +00:00
cyy	bbff667e32	[Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713 ) Follows #136528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713 Approved by: https://github.com/kwen2501	2024-09-27 10:11:53 +00:00
Salman Mohammadi	48c18ff850	[dynamo] Added support for tensor's `is_inference` method (#136450 ) Fixes #135439 This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks: ```python def fn_simple(x): if x.is_inference(): return x.sum() else: return x.min() ``` I've also tried to add guards on the tensor to guard against `is_inference`. I wasn't 100% sure where these should go so please don't hesitate to correct me. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450 Approved by: https://github.com/ezyang	2024-09-27 09:15:07 +00:00
FFFrog	e14b58ffbd	Using device-agnostic autocast api (#136613 ) - using torch.autocast(device_type="cuda") instead of torch.cuda.amp.autocast() - using torch.autocast(device_type="cpu") instead of torch.cpu.amp.autocast() Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613 Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501	2024-09-27 07:16:24 +00:00
Howard Huang	ad6c70b656	[PP] Remove modifications to autograd nodes in ZB (#136678 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136678 Approved by: https://github.com/wconstab, https://github.com/kwen2501 ghstack dependencies: #136507, #136584	2024-09-27 07:07:58 +00:00
hippocookie	9529d018e9	Refactor offset logic and work for nD (#135861 ) Optimize TODO task in code in distributed test files. - TODO: make this test cleaner and work for nD - TODO: add comments for create_plan/TestDedupTensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/135861 Approved by: https://github.com/wz337	2024-09-27 06:13:06 +00:00
Nikita Shulga	69bd13d12e	[EZ][BE] Add `torch.complex` to MPS_DTYPES (#136755 ) As the minimal supported OS has been raised to MacOS 13, some basic complex operations should be supported, and the rest could be `xfailed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754	2024-09-27 05:01:40 +00:00
Laith Sakka	73f038c5b3	Log total miss inplaced bytes (#136684 ) Summary: title. Test Plan: add tests. run existing tests. Differential Revision: D63411459 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136684 Approved by: https://github.com/zou3519	2024-09-27 04:57:57 +00:00
Oguz Ulgen	0200bea562	Delete grid reduction optimization that is causing specialization (#136783 ) Summary: https://fb.workplace.com/groups/1075192433118967/posts/1510513706253502 Creating a set is causing symexpr to specialize Test Plan: CI Reviewed By: ezyang Differential Revision: D63432357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136783 Approved by: https://github.com/ezyang, https://github.com/zou3519	2024-09-27 04:39:39 +00:00
Bob Ren	a63d7cb54c	add typing to _dynamo/current_scope_id.py (#136676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136676 Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/Skylion007	2024-09-27 04:09:15 +00:00
PyTorch MergeBot	5eb68d565f	Revert "[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594 )" This reverts commit 2c5f5e303a8d6fd55b6651f4d965fafaa6a540a7. Reverted https://github.com/pytorch/pytorch/pull/136594 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136594#issuecomment-2378358302))	2024-09-27 04:06:05 +00:00
Edward Z. Yang	507c69e20f	Don't uselessly recompute axiom dict every static eval call (#135429 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429 Approved by: https://github.com/isuruf ghstack dependencies: #135137	2024-09-27 04:03:25 +00:00
Edward Z. Yang	285fa03b5e	Deal with size oblivious before going into worker (#135137 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135137 Approved by: https://github.com/isuruf	2024-09-27 04:03:25 +00:00
Blaine Burton Rister	86631eccda	[Inductor] Remove stride-0 dimensions from more complex block pointers (#135557 ) Related issue: #125077 ### Feature Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading. This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`. Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`). 
``` @triton.jit def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 64 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x2 = xindex x1 = (xindex // 8) tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0]) tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK]) tmp2 = tmp0 + tmp1 tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0]) ''', device_str='cuda') ``` Before this PR, we would have stride-0 dimensions: ``` @triton.jit def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 64 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x2 = xindex x1 = (xindex // 8) tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0]) tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK]) tmp2 = tmp0 + tmp1 tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, 
[XBLOCK]).to(tl.float32), boundary_check=[0]) ''', device_str='cuda') ``` Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts. ``` @triton.jit def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr): ynumel = 8 xnumel = 8 yoffset = tl.program_id(1) * YBLOCK yindex = yoffset + tl.arange(0, YBLOCK)[None, :] ymask = yindex < ynumel xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel x1 = xindex y0 = yindex tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1]) tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :] tmp2 = tmp0 + tmp1 tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1]) ''', device_str='cuda') ``` ### Test Plan Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.) This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly. Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557 Approved by: https://github.com/shunting314, https://github.com/jansel Co-authored-by: Yueming Hao <yhao@meta.com>	2024-09-27 04:01:40 +00:00
Sam Larsen	2c5f5e303a	[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594 ) Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this: `tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))(libdevice.sqrt((1 + ((ks0 // 3278)(ks0 // 3278)) + ((-2)(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK]) ` https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want? Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594 Approved by: https://github.com/mengluy0125, https://github.com/jansel	2024-09-27 04:01:09 +00:00
Edward Z. Yang	a2d2a30311	Add torch._dynamo.config.fail_on_cache_limit_hit (#136767 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136767 Approved by: https://github.com/albanD, https://github.com/jansel ghstack dependencies: #136533	2024-09-27 03:58:00 +00:00
Mu-Chu Lee	2521cd3874	Skip kernel saving if already existed. (#136389 ) Summary: We skip the save_gpu_kernel if kernel is being saved already. This would give us a more accurate Triton profiling result. The following trace shows before/after the change for a benchmarking of a trivial addmm: Before: <img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a"> After: <img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118"> We can see that before the change, the benchmarking includes two parts, (1) The overhead of our triton_heuristic call, which includes the save/get, and the (expensive) hash computation. (2) The exact computation of Triton kernel. We see that (1) accounts >50% of time, which makes kernel selection for profiling often choose aten kernels over Triton kernels. Test Plan: Existing OSS CI [Redacted, Some internal model results in D63441430] Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389 Approved by: https://github.com/desertfire	2024-09-27 03:03:28 +00:00
Fuzzkatt	d1382aaf3d	skip test_out_of_memory for jetson (#133270 ) Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere	2024-09-27 02:36:48 +00:00
Bin Bao	26869d38e1	[Inductor] Further solve missing aoti_torch_check symbol issue (#136775 ) Summary: https://github.com/pytorch/pytorch/pull/136669 didn't resolve all the internal test failures. Add more tests to OSS CI to catch the remaining issues, and fix some internal TARGETS dependency. Differential Revision: D63473744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136775 Approved by: https://github.com/henrylhtsang	2024-09-27 02:26:49 +00:00
CaoE	66340e6751	Fix numerical instability for norm (#129352 ) Fixes #123645 When the reduce size is large, reducing directly may exceed the range that FP32 can represent, resulting in incorrect results. Reducing in groups and using double as the intermediate accumulation type avoids exceeding the representable range. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129352 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-27 00:51:31 +00:00
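The grouped reduction described in the commit above can be sketched in plain Python. This is an illustration of the structure only, not the PyTorch kernel: the function name `grouped_l2_norm` and the `group_size` parameter are hypothetical, and Python floats are already double precision, so the sketch shows where the double-precision accumulator sits rather than reproducing an FP32 overflow.

```python
import math


def grouped_l2_norm(values, group_size=4):
    """Hypothetical sketch: L2 norm via per-group partial sums.

    Each group's sum of squares is accumulated separately and then added
    to a double-precision running total (Python floats are doubles).
    """
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        partial = sum(v * v for v in group)  # partial sum for one group
        total += partial                     # combine partials in double
    return math.sqrt(total)


# Each square here is 1e40, which is above the largest finite FP32 value
# (about 3.4e38) but well within the range of a double.
print(grouped_l2_norm([1e20] * 8))
```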
Sahan Paliskara	adc77a9b7f	[lintrunner] auto apply formatting changes as suggestions (#136239 ) (Remove spurious cc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136239 Approved by: https://github.com/huydhn, https://github.com/eqy Co-authored-by: Huy Do <huydhn@gmail.com>	2024-09-27 00:51:21 +00:00
Ruben Rodriguez Buchillon	faedee12fa	[test] enable test_triton_wrapper again (#136721 ) Summary: Reenable the `test_triton_wrapper.py` test again # Why We want this to run internally # What - fix python path issue on the test - reenable the test # Background It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper Differential Revision: D63438186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721 Approved by: https://github.com/henrylhtsang	2024-09-27 00:44:40 +00:00
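The approach described in the commit above, passing the parent's effective `sys.path` to a child process through the `PYTHONPATH` environment variable, can be sketched with the standard library alone. The helper name `child_sees` and the temporary directory are assumptions made for illustration and are not taken from the actual test.

```python
import os
import subprocess
import sys
import tempfile


def child_sees(path):
    """Hypothetical helper: spawn a child interpreter that inherits the
    parent's effective sys.path via PYTHONPATH, and report whether `path`
    shows up in the child's sys.path."""
    code = (
        "import os, sys; "
        "target = os.path.realpath(sys.argv[1]); "
        "print(any(os.path.realpath(p) == target for p in sys.path))"
    )
    env = dict(os.environ)
    # Join the parent's sys.path entries, skipping empty strings.
    env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    result = subprocess.run(
        [sys.executable, "-c", code, path],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# Simulate a setup step that changes sys.path only inside the parent.
extra_dir = tempfile.mkdtemp()
sys.path.insert(0, extra_dir)
print(child_sees(extra_dir))
```

Without the `PYTHONPATH` assignment, the child would only see whatever the inherited environment already provides, so a path added to `sys.path` at runtime in the parent would be missing.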
ankurneog	22a4129a76	Generalization of FSDP common for non-cuda execution (#133209 ) ## Motivation The FSDP common code for FSDP UT execution is mostly written with the cuda device in mind. However, other devices such as the Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209 Approved by: https://github.com/kwen2501	2024-09-27 00:38:10 +00:00
Sergii Dymchenko	a619ced5ed	Revert "Update run_test.py" This reverts commit 193073b4914a7f80758541d391eacbe21194ecdf.	2024-09-26 17:34:52 -07:00
Sergii Dymchenko	193073b491	Update run_test.py	2024-09-26 16:56:29 -07:00
eellison	aa56f80ec1	Dont pairwise check unfusable nodes in scheduler (#136682 ) Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314	2024-09-26 23:46:52 +00:00
Nikita Shulga	0b62ebfeaa	[CI] Populate `JOB_ID` for MPS tests (#136791 ) Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099 Should fix the following warning during MPS test run: ``` /Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value. warn(f"Not emitting metrics for {metric_name}. {e}") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791 Approved by: https://github.com/albanD, https://github.com/izaitsevfb	2024-09-26 23:00:52 +00:00
Bin Bao	da5c7b6f4e	[AOTI] Set CUDA device for torch._export.aot_load (#136715 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with index is specified when calling torch._export.aot_load, we need to specify the CUDA device when running model.so. Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715 Approved by: https://github.com/angelayi	2024-09-26 22:20:12 +00:00
Joel Schlosser	991f8f8ec3	Bias gradient calculation for NJT linear backward (#136660 ) Previously NYI - @mikaylagawarecki needs it for Transformers. Fixes #136652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660 Approved by: https://github.com/soulitzer	2024-09-26 21:38:10 +00:00
eqy	c0e98a485b	[FP8][CUDA] Fix stale expected error message (#136581 ) CC @nWEIdia as I think we have seen internal failures on this Pull Request resolved: https://github.com/pytorch/pytorch/pull/136581 Approved by: https://github.com/mikaylagawarecki	2024-09-26 20:57:38 +00:00
Roy Hvaara	5789f8d5dc	[MPS] Add regression test for large inputs to `F.linear` (#136084 ) This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13. ~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-26 20:46:14 +00:00
Sergii Dymchenko	9656a603b2	Fix lint (#136781 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136781 Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet	2024-09-26 19:13:56 +00:00
Sergii Dymchenko	c878ea2c4e	Add info about "release tracker" label for cherry-picking bot (#136777 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136777 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-09-26 18:45:45 +00:00
Jithun Nair	851b9732aa	Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603 ) PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](`a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)`) docker images. Building aotriton from source can take ~45 minutes. This PR fixes the issue by downloading the aotriton tarball in such scenarios, unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603 Approved by: https://github.com/atalman Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>	2024-09-26 18:05:51 +00:00
Pian Pawakapan	f0a92541fe	[export] fix lifted constants order for 0-input graphs (#136658 ) Summary: With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted. This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion). Test Plan: test_export Differential Revision: D63403514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658 Approved by: https://github.com/avikchaudhuri	2024-09-26 17:44:24 +00:00
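The ordering problem described in the commit above can be reproduced with a toy model that uses a plain Python list in place of a graph. This is not `torch.fx` code: the helpers `insert_all_after` and `insert_all_before` and the node names are hypothetical. It shows that repeatedly inserting directly after a fixed anchor reverses the order of the inserted items, while inserting before a fixed anchor preserves it.

```python
def insert_all_after(nodes, anchor, new_nodes):
    """Insert each new node directly after a fixed anchor (toy model)."""
    out = list(nodes)
    for node in new_nodes:
        out.insert(out.index(anchor) + 1, node)
    return out


def insert_all_before(nodes, anchor, new_nodes):
    """Insert each new node directly before a fixed anchor (toy model)."""
    out = list(nodes)
    for node in new_nodes:
        out.insert(out.index(anchor), node)
    return out


graph = ["root", "output"]  # a graph with no user inputs
constants = ["c0", "c1", "c2"]

# Falling back to "insert after root" inverts the constants' order.
print(insert_all_after(graph, "root", constants))
# Anchoring on the first real node keeps them in their original order.
print(insert_all_before(graph, "output", constants))
```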
fduwjj	40c825d773	[reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768 Approved by: https://github.com/kwen2501, https://github.com/atalman	2024-09-26 17:37:07 +00:00
Rachel Guo	da09984c0d	[AOTI][Tooling][9/n] Add debug printer support for cpp kernel type (#136465 ) Summary: As title. Cpp kernel has a different codegen path: https://www.internalfb.com/code/fbsource/[6df946858879dd9bcefa18710dd79095a957f0dd]/fbcode/caffe2/torch/_inductor/codegen/cpp.py?lines=4643 Previously it is not wrapped/supported by the debug printer manager. This diff adds this support. It can be useful for cpu models. See this for a use case: https://www.internalfb.com/phabricator/paste/view/P1598561051?lines=927 Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run 'fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn -- aot --batch-size 1 ``` Differential Revision: D63053101 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136465 Approved by: https://github.com/hl475	2024-09-26 17:30:43 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	e4e83a4ac4	Remove aten.item hack (#136663 ) Summary: Title Test Plan: CI Differential Revision: D63404353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136663 Approved by: https://github.com/bdhirsh	2024-09-26 17:14:48 +00:00
albanD	2421344d8f	Update current maintainers (#136672 ) This file hasn't had an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info. The main rules we followed: - No code contributor is removed, they're all placed as emeritus - Break down categories that are too big, to make this document useful for knowing who to ping - No category where the code is still in the codebase is removed - We did not rework the categories (for example to be closer to module: labels) and leave that for later - All non-emeritus names are ordered by their number of comments on issues related to their topic Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672 Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet	2024-09-26 17:13:16 +00:00
Edward Z. Yang	beb46de342	Correctly convert Python float to float64 when passing argument as Tensor (#136413 ) I can't actually test the Dynamo codegen fix as it is impossible to directly use the Tensor at the moment. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #136599	2024-09-26 16:50:13 +00:00
Edward Z. Yang	11fd55827d	Make CLOSURE_VARS construction lazy (#136599 ) This makes us less likely to hit import cycle problems with torch Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136599 Approved by: https://github.com/anijain2305	2024-09-26 16:50:13 +00:00
drisspg	ff2360c733	[FlexAttention] Reduce expensive test time by 10x (#136677 ) Now that we support sequence lengths that are not divisible by 128, this cuts the time of the expensive tests by roughly 10x. Before: ```Shell 46.32s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1 45.61s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2 44.45s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3 43.61s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0 ``` After: ```Shell 4.25s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod5 4.20s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod4 4.19s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1 4.04s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2 3.99s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0 3.98s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136677 Approved by: https://github.com/Chillee ghstack dependencies: #136673	2024-09-26 16:40:21 +00:00
drisspg	840c6b7a68	[FlexAttention] Add Better error message for cpu tensors (#136673 ) Partially address: #136525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673 Approved by: https://github.com/Chillee	2024-09-26 16:40:21 +00:00
Thanh Ha	ddab704b28	Use wildcard for portion of AMI version number (#136764 ) Rather than specifying a specific version number for the AMIs, use wildcards for the date section. Issue: pytorch/pytorch#136762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136764 Approved by: https://github.com/ZainRizvi	2024-09-26 16:39:25 +00:00
cyy	59e8f8228f	[3/N] Fix clang-tidy warnings in torch/csrc/lazy (#136705 ) Follows #136634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136705 Approved by: https://github.com/Skylion007	2024-09-26 16:29:43 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
Nikita Shulga	68579ef665	[EZ][MPS] Extend `arange` to bfloat16 (#136754 ) RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES` Fixes https://github.com/pytorch/pytorch/issues/136624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754 Approved by: https://github.com/Skylion007	2024-09-26 15:33:45 +00:00
Nikita Shulga	73ec76ed50	[MPS] Implement `isposinf` and `isneginf` (#136689 ) Not sure why `isinf` is a composite op, but those need to be implemented by hand. Implementation is a trivial call to ```objc [mpsGraph equalWithPrimaryTensor:input secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity() dataType:input.dataType]] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689 Approved by: https://github.com/Skylion007	2024-09-26 15:33:20 +00:00
drisspg	d05645841e	Update get_device_properties to take in optional device (#136683 ) Aligns behavior with the rest of cuda's device info query methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683 Approved by: https://github.com/eqy	2024-09-26 15:07:31 +00:00
PyTorch MergeBot	d5e4a20c17	Revert "Introduce _ArglessActivation base class for parameterless activation functions (#136296 )" This reverts commit dda0e4de32b29098f25f9b2889423c9446680cc1. Reverted https://github.com/pytorch/pytorch/pull/136296 on behalf of https://github.com/atalman due to Breaks Internal CI. Error: Too many arguments [19]: Call `nn.modules.activation._ArglessActivation.__init__` expects 0 positional arguments, 1 was provided. ([comment](https://github.com/pytorch/pytorch/pull/136296#issuecomment-2377091280))	2024-09-26 14:12:12 +00:00
Joel Schlosser	4150ab44a4	Fix composite op redispatch for NJT in inference mode (#134683 ) Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT. This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683 Approved by: https://github.com/soulitzer, https://github.com/albanD ghstack dependencies: #136566	2024-09-26 14:10:53 +00:00
Joel Schlosser	f8debd5d83	Fix wrapper subclass reentrant dispatch + TorchDispatchMode (#136566 ) Fixes #136565 This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled. This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566 Approved by: https://github.com/bdhirsh	2024-09-26 14:06:51 +00:00
leslie-fang-intel	963e793e1b	[Inductor][CPP] Optimize WOQ INT8 wgt dequant in AMX GEMM template (#136630 ) Summary Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion. Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements. With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently. Performance before ``` AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096) cpp_packed_gemm_0 38.0439 ms 100.0% _weight_int8pack_mm 50.2524 ms 75.7% SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008) cpp_packed_gemm_4 78.2038 ms 100.0% _weight_int8pack_mm 119.1962 ms 65.6% SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096) cpp_packed_gemm_6 79.2368 ms 100.0% _weight_int8pack_mm 118.3212 ms 67.0% SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000) cpp_packed_gemm_224 225.7201 ms 100.0% _weight_int8pack_mm 388.5588 ms 58.1% ``` Performance after this PR ``` AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096) cpp_packed_gemm_0 11.0086 ms 100.0% _weight_int8pack_mm 50.2918 ms 21.9% SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008) cpp_packed_gemm_4 24.3528 ms 100.0% _weight_int8pack_mm 119.8492 ms 20.3% SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096) cpp_packed_gemm_6 24.6148 ms 100.0% _weight_int8pack_mm 119.1908 ms 20.7% SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000) 
cpp_packed_gemm_224 78.1369 ms 100.0% _weight_int8pack_mm 387.6289 ms 20.2% SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630 Approved by: https://github.com/jgong5 ghstack dependencies: #136353	2024-09-26 08:41:58 +00:00
Menglu Yu	77fba0c407	[PT2][Optimus] Fix a group batch fusion corner case (#136650 ) Summary: We have a user report on BA model that it raised "AttributeError: 'SymFloat' object has no attribute 'shape'", thus we add type check for the meta node. See more context in the post https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/ Test Plan: # local reproduce ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196 ``` P1609807876 # E2E before fix f646303196 after fix Differential Revision: D63399959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650 Approved by: https://github.com/ezyang	2024-09-26 06:35:11 +00:00
Kurt Mohler	d1bb8e828f	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-26 04:52:05 +00:00
PyTorch MergeBot	b408591b53	Revert "[Flex Attention] fix block size order (#136657 )" This reverts commit 529b6ab0bb9f8800ed795ec8e4fa1f0e8042bb0a. Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some test_flex_attention is failing in trunk after this change `529b6ab0bb` ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2375824802))	2024-09-26 04:06:41 +00:00
cyy	3c542ce831	[Reland] Check function declarations of COREML code (#136070 ) Reland of #135467 by fixing periodic workflows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136070 Approved by: https://github.com/ezyang	2024-09-26 03:52:06 +00:00
Roy Hvaara	042af7ec53	[BE] [MPS] Use validation helper for input tensors (#134609 ) Small refactor to use already existing helper with equivalent behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134609 Approved by: https://github.com/malfet	2024-09-26 03:47:30 +00:00
rzou	e4d32d2194	Improve data-dependent-output meta kernel error message (#136671 ) Test Plan: - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/136671 Approved by: https://github.com/williamwen42	2024-09-26 03:46:04 +00:00
xinan.lin	190e09d8b6	[Inductor UT] Generalize device-bias code introduced from #134874 and (#136596 ) [Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases. Fix #136595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596 Approved by: https://github.com/EikanWang, https://github.com/jansel Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2024-09-26 02:56:59 +00:00
eugenekoran	dda0e4de32	Introduce _ArglessActivation base class for parameterless activation functions (#136296 ) Fixes #133683 Fixes #133684 Fixes #133688 This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()` Key changes and considerations: 1. Added new class `_ArglessActivation`: 2. Refactored the following classes to inherit from `_ArglessActivation`: - Sigmoid - Tanh - Softsign - Tanhshrink - Softmax2d 3. Performance consideration: - This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation - The impact is expected to be minimal in most use cases Docs view before: <img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe"> Docs view after: <img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296 Approved by: https://github.com/mikaylagawarecki	2024-09-26 02:45:05 +00:00
rzou	d0456b4274	noop on torch.library APIs under torch::deploy (multipy) (#136645 ) Fixes https://github.com/pytorch/pytorch/issues/136177 The motivation is that torch::deploy doesn't handle this well. The workaround for users is to use C++ custom ops. All torch.library APIs ultimately go through the torch.library.Library object, so we add checks to noop for torch::deploy there. Test Plan: - new test - going to test this internally and hope nothing breaks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645 Approved by: https://github.com/ezyang	2024-09-26 02:34:34 +00:00
Bin Bao	5c78c6b05a	[CI] Switch aarch64 dashboard run back to nightly (#136643 ) Summary: Reduce the frequency of the aarch64 dashboard CI run since we don't need to monitor its instability anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136643 Approved by: https://github.com/huydhn	2024-09-26 01:26:05 +00:00
Howard Huang	141cae2eb8	[pipelining] Fix more leaks and check leaks in tests (#136584 ) Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details). This time, also add a test-time check that helped to discover new leaks and ensure we won't accidentally regress. Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles. Uses objgraph for a nice debug utility when a leak is found. Credit to @H-Huang for pointing out objdump and helping debug the `param_group["intermediates"]` leak. I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker. Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py, and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`: ``` warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle? warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes) Graph viewer (xdot) not found, generating a png instead Image generated as /tmp/objgraph-ztz642h3.png ``` Rendering of `/tmp/objgraph-ztz642h3.png`: <img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584 Approved by: https://github.com/kwen2501, https://github.com/H-Huang ghstack dependencies: #136507 Co-authored-by: Howard Huang <howardhuang@fb.com>	2024-09-26 01:10:40 +00:00
Nichols A. Romero	e8f1dd6ba0	Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (#136283 ) Fixes #131701 Use CMake imported targets more consistently to eliminate hardcode paths. Here is the new relevant sections of Caffe2Targets.cmake: ``` set_target_properties(c10_hip PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64" ) ``` ``` set_target_properties(torch_hip PROPERTIES INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL" INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS" INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver" ) ``` HIPCUB dependency was not actually used; which is why it is removed here as the imported target had undesirable side effects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman	2024-09-26 00:34:43 +00:00
Zheng, Zhaoqiong	f3dd1721f4	[Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946 ) Remove the hardware and software prerequisites and the environment setup part. Keep the prerequisites section and link to the PyTorch prerequisites for Intel GPUs page for driver install, Intel support package install, and environment setup: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html Update the support for Intel Client GPU MTL-H. Update inference & training examples. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946 Approved by: https://github.com/seemethere	2024-09-26 00:22:05 +00:00
PyTorch MergeBot	9223c16208	Revert "Fix constant propagation in builtins and UserClasses (#131354 )" This reverts commit dd4a51b39aa02cba23b3a387b41c5026770d9220. Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/atalman due to Breaks torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2375417145))	2024-09-25 23:01:03 +00:00
Bin Bao	ecc15c4f89	[AOTI] Fix a missing aoti_torch_check symbol issue (#136669 ) Summary: When Inductor generates cpp kernels, they should be pure cpp loops which are independent to libtorch as much as possible. Differential Revision: D63403473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136669 Approved by: https://github.com/henrylhtsang	2024-09-25 22:56:10 +00:00
Huy Do	b7a5c7d331	Do not XFAIL test_segfault in fbcode (#136661 ) https://github.com/pytorch/pytorch/pull/136252 silence the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661 Approved by: https://github.com/malfet	2024-09-25 22:26:24 +00:00
ratnampa	8d65d9f11b	Constraint setuptools to 72.1.0 or older in requirements.txt (#136489 ) FIXES: https://github.com/pytorch/pytorch/issues/136541 Setuptools>=74.0.0 has deprecated support for some functions in distutils, and so the builds run into error such as ```AttributeError: module 'distutils' has no attribute '_msvccompiler'```. Also, the pytorch builds have setuptools pin to 72.1.0 according to these PRs: https://github.com/pytorch/builder/pull/1995 and `89d9a8cf6f`. So, until there is a fix to change the function usage in accordance with latest setuptools, the 72.1.0 version works fine. Also observed in CI jobs: https://github.com/pytorch/pytorch/actions/runs/10979326524 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136489 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-25 22:06:05 +00:00
Xuan Zhang	c9d12f6360	[inductor][memory] add signpost event for memory pass (#136538 ) Add logging to scuba table for internal models. For verification, I triggered a sample workflow internally and checked the scuba table logging to make sure the `Paramaters` column has the expected loggings, see [here](https://fburl.com/scuba/workflow_signpost/39h7qo9s). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136538 Approved by: https://github.com/yf225	2024-09-25 21:47:46 +00:00
rzou	b5c2a657ae	Add zou3519 to CODEOWNERS for HOPs (#136679 ) There are some tricky things that I want to guard against Pull Request resolved: https://github.com/pytorch/pytorch/pull/136679 Approved by: https://github.com/Chillee	2024-09-25 21:29:48 +00:00
Animesh Jain	289df45cee	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" (#136590 ) This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266. Reverts * https://github.com/pytorch/pytorch/pull/135503 * https://github.com/pytorch/pytorch/pull/135502 * https://github.com/pytorch/pytorch/pull/135422 With the revert, the test below passes. Earlier, the getitem would stay a getitem in the FX graph. But now fake tensor propagation fails, saying that .item is called. It seems that torch function is not getting triggered during fake tensor propagation. ``` import torch from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention from torch._inductor.lowering import make_pointwise, register_lowering from torch._inductor.virtualized import ops from torch.nn.attention.flex_attention import create_block_mask torch.set_default_device('cuda') flex_attention = torch.compile(flex_attention, dynamic=False) prefix_lengths = torch.arange(8) def prefix_lm(b, h, q, kv): return prefix_lengths[b] >= kv mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590 Approved by: https://github.com/Chillee	2024-09-25 21:10:43 +00:00
Boyuan Feng	529b6ab0bb	[Flex Attention] fix block size order (#136657 ) `create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`. This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-09-25 21:08:40 +00:00
Edward Yang	76b044d7cb	Don't actually import module when checking if its valid (#136548 ) Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet. Test Plan: sandcastle and ossci ``` TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench ``` Differential Revision: D63330224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548 Approved by: https://github.com/Skylion007	2024-09-25 20:47:32 +00:00
atalman	11c5f9ac3b	Use amazon linux 2023 runners for Docker builds (#136544 ) Migrate these builds to linux 2023. We want to build and test the Docker images in CD. Looks like we are hitting this issue: https://github.com/docker/buildx/issues/379 when trying to build Docker on Amazon Linux 2023. Conda Docker build is timing out. While Manywheel is executing but failing because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544 Proposed Solution is to fix it in user_data . Please see: https://github.com/pytorch/test-infra/issues/5712 I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544 Workaround timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564 ) by configuring number of open files per container to 1048576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136544 Approved by: https://github.com/ZainRizvi Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-25 20:39:56 +00:00
Xinran / Allan Rui	13b0baf2a1	[FX] Update _inline_module util function to work with both args and kwargs (#136631 ) Summary: Previously `_inline_module ` helper function only works with submodules that have args specified. This diff updates the util function to look for input arguments from submodule kwargs first using placeholder node names, then fallback to list of args if node name not found. Test Plan: ``` buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions ``` Differential Revision: D63347675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631 Approved by: https://github.com/jfix71	2024-09-25 20:20:57 +00:00
Sunishchal Dev	a8ed873ba2	Add missing input "eps" to adam docs (#135191 ) Minor fix for missing input argument in the Adam optimizer docs page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191 Approved by: https://github.com/janeyx99	2024-09-25 20:17:23 +00:00
cyy	6aa6bd4ca5	[Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528 ) Follows #136439. A dangling reference to qualifiedName was found and fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528 Approved by: https://github.com/kwen2501	2024-09-25 20:12:08 +00:00
Xiaozhu Meng	5a29a06aa3	[AMD][inductor] do not use float64 on AMD internally (#136441 ) Summary: Internal AMD triton seems to have issue with float64 constant: ``` ### Most recent error lines found on the logs: E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] ^ E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp8 = tl.broadcast_to((libdevice.llrint((tl.full([1], 1.00000000000000, tl.float64))(ks3.to(tl.float64)))) / ks1, [XBLOCK, RBLOCK]) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp7 = tmp5 + tmp6 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp6 = 0.5 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp5 = tmp4.to(tl.float32) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp4 = (((r3 + (x0((17 + (16ks0ks1)) // 18))) % ks2) // ks0) % ks1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp3 = tmp2.to(tl.int1) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp2 = tmp0 < tmp1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp1 = 16ks0ks1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp0 = r3 + (x0((17 + (16ks0*ks1)) // 18)) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] r3 = rindex E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] rmask = rindex < rnumel E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] rindex = roffset + rbase E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] triton.compiler.errors.CompilationError: at 26:15: E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] return ast_to_ttir(self.fn, self, context=context, options=options, 
codegen_fns=codegen_fns) ``` Bisecting shows this error was introduced by D62465575. This diff tries to not convert constants to float64 on AMD, and the emu1.4 predictor can now run on AMD with rocm6.0. Test Plan: rocm6.0 can work ``` TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE=1 HIP_FORCE_DEV_KERNARG=1 HIP_GRAPH=--use-cuda-graph PYTORCH_MIOPEN_SUGGEST_NHWC=1 TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 CUDA_VISIBLE_DEVICES="2" TORCH_LOGS="recompiles,cudagraphs" buck2 run @//mode/opt-amd-gpu -c fbcode.rocm_ck_rtz=true -m rocm60 fblearner/predictor/py/applications/photogen:ip_python_predictor_photogen_cm -- --model=photogen_v1p4_9b --thrift_server_port=15008 --max_predict_calls=1 --enable_tunable_op --load_from_torch_package=genai:937233660_1 ``` emu1.4 predictor on AMD fails with rocm6.2 with some other triton errors (https://www.internalfb.com/phabricator/paste/view/P1603842354) Differential Revision: D63263806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136441 Approved by: https://github.com/houseroad	2024-09-25 19:13:17 +00:00
Zain Rizvi	37f340c1e5	[EZ] Remove remaining amz2023 runner variant references (#136540 ) Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it Explicit references to the amz2023 runner type variants were removed in the following PRs: - https://github.com/pytorch/ignite/pull/3285 - https://github.com/pytorch/ao/pull/887 - https://github.com/pytorch/fbscribelogger/pull/1 - https://github.com/pytorch/pytorch/pull/134355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-25 19:01:00 +00:00
David Berard	9c2c61d2dd	[inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472 ) AMD devices have 64 threads per warp; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding the warp size as 32. It also renames the enum value. Added a unit test for this. Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated, but some model could use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472 Approved by: https://github.com/jansel	2024-09-25 18:21:09 +00:00
cyy	a259fbf72c	[2/N] Fix clang-tidy warnings in torch/csrc/lazy (#136634 ) Follows #134655 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136634 Approved by: https://github.com/Skylion007	2024-09-25 18:08:29 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	0b38fa154a	Fix meta registry in export (#136492 ) Summary: Title Test Plan: CI This fixes some breaking tests in executorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register the meta implementation from the C++ side. It seems like the C++ kernel doesn't work well with a mix of FakeTensor and real tensors. This PR sidesteps this problem by always preferring the Python CIA decomp over the C++ CIA decomp Differential Revision: D63297050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492 Approved by: https://github.com/bdhirsh	2024-09-25 17:53:02 +00:00
Justin Chu	8582835499	[ONNX] Remove the operators test (#136335 ) The tests are obsolete and hard to maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335 Approved by: https://github.com/xadupre, https://github.com/cyyever Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-09-25 17:44:18 +00:00
Edward Z. Yang	7cb6d31567	Dump partially traced make_fx graph in event of error to tlparse (#136508 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136508 Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/malfet ghstack dependencies: #136533	2024-09-25 17:44:15 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	9409274bc1	Fix bug in functional tensor decomp (#136600 ) Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA. This never mattered before because we never had to actually push CIA decomp to the Python key level in export. Test Plan: CI Differential Revision: D63363749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600 Approved by: https://github.com/bdhirsh	2024-09-25 17:37:50 +00:00
David Berard	5d7ed02f52	[user-written triton kernels] specialize exprs if they are expected to be tl.constexpr (#136512 ) Fixes #136504 If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'): ``` File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module> triton_meta={'signature': {0: 'fp32', 1: 'fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]}, torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: NameError: name 's0' is not defined ``` To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen. Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512 Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison	2024-09-25 17:12:11 +00:00
Pian Pawakapan	7c6d543a5b	[export] fix _get_non_persistent_buffers for duplicates (#136552 ) Summary: Export's method _get_non_persistent_buffers doesn't check duplicate submodules, so we run into state_dict related issues if non-persistent buffers exist on shared submodules. Test Plan: test_export Differential Revision: D63332976 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136552 Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan	2024-09-25 16:46:31 +00:00
Sahan Paliskara	aa80b82cea	[hygiene] Delete dead alerting code (#136583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136583 Approved by: https://github.com/clee2000	2024-09-25 15:44:46 +00:00
Sergii Dymchenko	0232278b33	Fix comment posting permissions for check-labels.yml (#136610 ) Currently it fails with Error fetching https://api.github.com/repos/pytorch/pytorch/issues/136607/comments HTTP Error 403: Forbidden (see https://github.com/pytorch/pytorch/actions/runs/11026434368/job/30622960113?pr=136607) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136610 Approved by: https://github.com/malfet	2024-09-25 15:43:19 +00:00
Huy Do	34711fe8c9	Fix test_skip_data_serialization pickle exception match (#136617 ) The test is failing in trunk atm with the following error: ``` test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'" ``` for example, `36f0e61166` This comes from this cpython commit `a3076c734d`, and manifests in python 3.12.5 currently used in CI. The failure doesn't happen when I try it out with 3.12.3 and 3.12.4. Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.	2024-09-25 08:35:46 -07:00
Catherine Lee	deb820602a	viable/strict update: log push to s3 (#136470 ) As stated in https://github.com/pytorch/test-infra/pull/5686, I cannot figure out a way to determine the push time from webhooks (other than when the webhook was sent, but that isn't super accurate either). Instead, manually save a json file to s3 that contains information for the sha and date so that we can still get this information. Relies on https://github.com/pytorch/test-infra/pull/5690 tested in https://github.com/pytorch/pytorch/pull/136387 (but I squashed so it's kinda hard to find now) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136470 Approved by: https://github.com/huydhn	2024-09-25 15:28:53 +00:00
PyTorch MergeBot	e3b89ca124	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))	2024-09-25 14:11:01 +00:00
Bin Bao	20a855bf01	[AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463 ) Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463 Approved by: https://github.com/chenyang78 ghstack dependencies: #136062, #136461, #136462	2024-09-25 14:06:33 +00:00
PyTorch MergeBot	5171b0e3c6	Revert "[ONNX] Remove the operators test (#136335 )" This reverts commit 9629835b1ccce8e72fc93bf95be13e3d53cb4871. Reverted https://github.com/pytorch/pytorch/pull/136335 on behalf of https://github.com/ezyang due to I'll reland this, bear with me ([comment](https://github.com/pytorch/pytorch/pull/136335#issuecomment-2374183435))	2024-09-25 14:06:03 +00:00
Bin Bao	070952aca5	[AOTI] Move stack_allocation logic from CppWrapperCpu (#136462 ) Summary: Move stack_allocation logic from CppWrapperCpu to CppWrapperCpuArrayRef Differential Revision: [D63300359](https://our.internmc.facebook.com/intern/diff/D63300359) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136462 Approved by: https://github.com/chenyang78 ghstack dependencies: #136062, #136461	2024-09-25 14:03:03 +00:00
Bin Bao	5ad5f40283	[AOTI][reland] Create another wrapper class to handle ArrayRef (#136461 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #136062	2024-09-25 14:00:09 +00:00
Edward Z. Yang	25ab87c09b	Add lint rule META_NO_CREATE_UNBACKED (#135870 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135870 Approved by: https://github.com/albanD	2024-09-25 13:33:56 +00:00
Tom Ritchford	dd4a51b39a	Fix constant propagation in builtins and UserClasses (#131354 ) * Fixes https://github.com/pytorch/pytorch/issues/118675 * Replaces https://github.com/pytorch/pytorch/pull/118994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-25 13:03:40 +00:00
Jez Ng	a0c76ea853	Make test_skip_data_serialization regex more flexible (#136580 ) Some CI machines seem to throw "Can't get local object" rather than "Can't pickle local object". Pull Request resolved: https://github.com/pytorch/pytorch/pull/136580 Approved by: https://github.com/mikaylagawarecki	2024-09-25 11:27:23 +00:00
IvanKobzarev	370c1c4297	[aotd] Fix rrelu compilation (#136008 ) Issues: https://github.com/pytorch/pytorch/issues/135083 https://github.com/pytorch/pytorch/issues/120292 The rrelu decomposition contains a mutation, copy_. Decompositions are executed below Functionalization; as a result, AOT produces a non-functional graph. That decomposition is also registered as a python_dispatch kernel for AutogradCUDA. Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization run after it. Testing: ``` python test/functorch/test_aotdispatch.py -k test_rrelu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008 Approved by: https://github.com/bdhirsh	2024-09-25 11:26:19 +00:00
Wu, Chunyuan	c3fdf587b5	[inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518 ) The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than these epilogue nodes. When there are no epilogue nodes, the function can return `False` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #136418	2024-09-25 10:25:07 +00:00
Jokeren	cabfbef6cf	[pytorch][PR] [inductor] More fixes on the keys of `constants` and `signature` dictionaries (#136514 ) Summary: The previous PR forgot to change two other places that also create `constants` and `signature`. Test Plan: Imported from GitHub, without a `Test Plan:` line. {F1884584338} Differential Revision: D63027728 Pulled By: Myrthan Co-authored-by: Jokeren <robinho364@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514 Approved by: https://github.com/jansel Co-authored-by: Jokeren <robinho364@gmail.com>	2024-09-25 09:34:14 +00:00
Wu, Chunyuan	2e30c160ef	[inductor] [cpp] fix max-autotune for single-thread dynamic shapes (#136418 ) Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case: ``` /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16, const bfloat16, const bfloat16, bfloat16, int64_t)’: /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression 279 \| constexpr int64_t m_block_end = Mr_blocks; \| ^~~~~~~~~ /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression 237 \| const int64_t Mr_blocks = (M + Mr - 1) / Mr; \| ^~~~~~~~~ ``` The PR also updates the UT to add a test for `BS`=512 in single thread. The previous case has `BS`=1024 equal to the `K` and `N` value. The generated code does not have symbolic shapes thus fails to capture the above issue. By adding a case of `BS`=512, the generated code will have symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418 Approved by: https://github.com/jgong5	2024-09-25 09:24:05 +00:00
Anatoly Myachev	a0a1873148	[Inductor] Fix Triton tests after updating pybind11 to 2.13.6 (#136280 ) https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that release adds a feature exposed through [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024) `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and by PyTorch itself. Errors observed due to this change: <details> <summary> the first error </summary> ```bash _________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________ Traceback (most recent call last): File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order eager_out = f(x) File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f arange_out(x, y) File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out kernel[grid](x, out, n_elements, BLOCK_SIZE=4) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda> return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run kernel = self.compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename, File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps return cls( File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) TypeError: vars() argument must have __dict__ attribute ``` </details> <details> <summary> the second error </summary> ```bash ________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________ Traceback (most recent call last): File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed out = f(x, y) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn return fn(args, *kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__ return self._torchdynamo_orig_callable( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__ result = self._inner_convert( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__ return _compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner return _compile_inner(code, one_graph, hooks, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function return function(args, *kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner out_code = 
transform_code_object(code, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object transformations(instructions, code_options) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn return fn(args, kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform tracer.run() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run super().run() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run while self.step(): File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step self.dispatch_table[inst.opcode](self, inst) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE self._return(inst) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return self.output.compile_subgraph( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph compiled_fn = self.call_user_compiler(gm) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler return self._call_user_compiler(gm) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in 
_call_user_compiler raise BackendCompilerFailed(self.compiler_fn, e).with_traceback( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__ compiled_gm = compiler_fn(gm, example_inputs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx return aot_autograd( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__ cg = aot_module_simplified(gm, example_inputs, self.kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified compiled_fn = dispatch_and_compile() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile compiled_fn, _ = create_aot_dispatcher_function( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function return _create_aot_dispatcher_function( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function compiled_fn, fw_metadata = compiler_fn( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base compiled_fw = compiler(fw_module, updated_flat_args) File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base return _fw_compiler_base(model, example_inputs, is_inference) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base return inner_compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner compiled_graph = FxGraphCache.load( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load compiled_graph = compile_fx_fn( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile compiled_fn = graph.compile_to_fn() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn return self.compile_to_module().call File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module return self._compile_to_module() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module mod = PyCodeCache.load_by_key_path( File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path mod = _reload_python_module(key, path) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module> File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton kernel = TritonCodeCache.load(kernel_name, source_code) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load return cls.load_by_key_path(key, path, linemap, attrs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path mod = _reload_python_module(key, path) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module raise RuntimeError( torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14) ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280 Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>	2024-09-25 08:09:46 +00:00
Pei-Hsuan Wu	1cb265fafa	[AILab][attempt2] Add TryExcept when decoding healthcheck port (#136574 ) Summary: ## Context The first attempt has lint error in OSS https://hud.pytorch.org/pr/pytorch/pytorch/136438#30553902641 {F1886895223} ## This Diff Fix error message with try catch Error Message: ``` File "/packages/aps_models.examples.dlrm.lite/dlrm_train_app-inplace#link-tree/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 224, in _setup_healthcheck port=int(healthcheck_port), ValueError: invalid literal for int() with base 10: \'%port.thrift%\' ``` Test Plan: ``` arc lint ``` Reviewed By: felixsu2006 Differential Revision: D63343041 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136574 Approved by: https://github.com/atalman	2024-09-25 04:43:51 +00:00
Nikita Shulga	561cd5a0a6	[BE] Use C++17 convention methods in CUDA kernels (#136575 ) - `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>` - `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>` And so on Pull Request resolved: https://github.com/pytorch/pytorch/pull/136575 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-25 04:30:01 +00:00
Nikita Shulga	5340feb8aa	Disable iOS workflow (#136571 ) See https://github.com/pytorch/pytorch/issues/136284 It's been broken for more than a week and it does not seem like anyone cares about fixing it. Once it's landed I'll reassign the issue to `oncall: mobile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136571 Approved by: https://github.com/huydhn, https://github.com/kit1980	2024-09-25 04:29:34 +00:00
Bin Bao	1c9a1a2a19	[AOTI] Support MKL linear ops in cpp wrapper (#134974 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor. Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974 Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-09-25 03:53:11 +00:00
chilli	0200ad3457	Turn on unique kernel names (#136503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136503 Approved by: https://github.com/ezyang, https://github.com/eellison ghstack dependencies: #136509	2024-09-25 03:39:45 +00:00
Nichols A. Romero	482fe186b9	Add ROCm documentation to libtorch (C++) reST. (#136378 ) Fixes #126640 Added ROCm support section to libtorch (C++) reST. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136378 Approved by: https://github.com/ezyang	2024-09-25 02:30:56 +00:00
leslie-fang-intel	3c7edf1ec0	[Inductor][CPP] Fix int8 cvt half (#136353 ) Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows: ``` #include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h" extern "C" void kernel(const int8_t* in_ptr0, half* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L)) { auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32)); auto tmp1 = at::vec::convert<half>(tmp0); tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32)); } } } ``` In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16. Test Plan * AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api` * `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec` * Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case. * `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"` `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-09-25 02:23:43 +00:00
eqy	8225e7706e	[CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496 ) It seems that some other tests hold onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests, while working in isolation, fail when run as, e.g., `python test/test_cuda.py -k able` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496 Approved by: https://github.com/ezyang	2024-09-25 01:14:45 +00:00
Will Cromar	5233b5a448	Update PyTorch/XLA CI image to Python 3.10 (#135278 ) The old image used Python 3.8. Corresponding XLA PR: https://github.com/pytorch/xla/pull/7953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135278 Approved by: https://github.com/JackCaoG, https://github.com/atalman	2024-09-25 00:53:39 +00:00
eqy	670d64a802	[SDPA][Nested Tensor] Bump `grad_query` fudge factor for small GPUs (#135715 ) Similar to #135711, here we see a ~1/1000 mismatch with absolute value ~0.0016 when 0.001 is allowed Pull Request resolved: https://github.com/pytorch/pytorch/pull/135715 Approved by: https://github.com/drisspg	2024-09-25 00:36:10 +00:00
Pearu Peterson	8f2a4cc4b1	Tune bsr_dense_addmm for int8 inputs on A100 (#136088 ) As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088 Approved by: https://github.com/cpuhrsch	2024-09-25 00:24:12 +00:00
Justin Chu	9629835b1c	[ONNX] Remove the operators test (#136335 ) The tests are obsolete and hard to maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335 Approved by: https://github.com/xadupre	2024-09-24 23:08:48 +00:00
Edward Z. Yang	b57d67e263	Add isuruf to core reviewers (#136554 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136554 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-09-24 23:06:46 +00:00
angelayi	210b136c07	[export] Add experimental swap API (#136190 ) Prototyped the following API, which takes in an ExportedProgram and a dictionary mapping FQNs to the modules to swap in, and returns an (unlifted) GraphModule ``` _swap_modules( ep: ExportedProgram, modules_to_swap: Dict[str, torch.nn.Module] ) -> torch.fx.GraphModule: ``` Differential Revision: [D62879819](https://our.internmc.facebook.com/intern/diff/D62879819) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136190 Approved by: https://github.com/avikchaudhuri	2024-09-24 22:50:44 +00:00
PyTorch MergeBot	706eda5cd8	Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957 )" This reverts commit 5033a1ca0dd22dae34a8939add33dbebfe0fd31d. Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))	2024-09-24 22:24:26 +00:00
William Wen	ae80bce496	[dynamo] refactor resume_execution.py to use bytecode templates (#136483 ) Use bytecode from template instead of hardcoding bytecode in resume_execution.py. Gets rid of a lot of Python-version dependent bytecode generation. Also makes resume_execution.py easier to support in future Python version updates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136483 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-24 22:20:26 +00:00
Nikita Shulga	36f0e61166	[BE] Use nested namespace in ATen/native/cuda (#136570 ) It's a nice C++17 feature Pull Request resolved: https://github.com/pytorch/pytorch/pull/136570 Approved by: https://github.com/Skylion007	2024-09-24 22:19:10 +00:00
Jeff Daily	1d3af68202	[ROCm] install_miopen.sh exit for ROCm >= 6.3 (#136436 ) Follow up to #132555. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136436 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/atalman	2024-09-24 22:15:26 +00:00
Justin Chu	780f4debdb	[ONNX] Remove _optimize_graph from public init (#136279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136279 Approved by: https://github.com/xadupre ghstack dependencies: #136281	2024-09-24 22:00:55 +00:00
Edward Z. Yang	00bc17555a	Don't try to evaluate sympy.Eq in replacement; we knew this wouldn't simplify since we are here (#136533 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136533 Approved by: https://github.com/isuruf, https://github.com/pianpwk	2024-09-24 21:52:25 +00:00
Kurt Mohler	b1a02bf708	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-24 21:34:43 +00:00
PyTorch MergeBot	0133fbcfe7	Revert "Correctly convert Python float to float64 when passing argument as Tensor (#136413 )" This reverts commit f0f79dd8f1df6cf6342c9c23ae3a9be0f74eb9f5. Reverted https://github.com/pytorch/pytorch/pull/136413 on behalf of https://github.com/ezyang due to forward fix is stuck, revert this ([comment](https://github.com/pytorch/pytorch/pull/136413#issuecomment-2372404873))	2024-09-24 21:20:37 +00:00
Bin Bao	95c0f7493f	[Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062 ) Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit. Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062 Approved by: https://github.com/angelayi, https://github.com/chenyang78	2024-09-24 21:02:51 +00:00
Yifu Wang	da1560c49f	[SymmetricMemory] add support for cuStreamWriteValue32 (#136488 ) cuStreamWriteValue efficiently combines the issuing of a system-level fence with the update of a single memory location. It is highly suitable for inter-stream progress sharing (e.g., all_gather_with_progress). Exposing it via SymmetricMemory allows users to more easily implement efficient progress-aware matmuls in triton ([xformers example](https://github.com/facebookresearch/xformers/blob/main/xformers/ops/_triton/sequence_parallel_fused_kernels.py)). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136488 Approved by: https://github.com/eqy, https://github.com/Chillee	2024-09-24 20:56:29 +00:00
Justin Chu	7c777dd587	[ONNX] Unify ONNXProgram and remove the old one (#136281 ) ## Note `test_fx_to_onnx_with_onnxruntime.py` is removed for now (it has a lot of xfails anyways). A better version will be added back. Fixes https://github.com/pytorch/pytorch/issues/136274 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136281 Approved by: https://github.com/xadupre, https://github.com/albanD	2024-09-24 20:52:19 +00:00
Will Constable	dbc3356655	[pipelining] fix py ref cycle in stage_backward (#136507 ) TLDR; found forward activation tensors were being kept alive "forever" (or until GC ran), and tracked it down to a cycle involving `stage_backward.<locals>.extract_tensors_with_grads`. The reference cycle in question is below. (constructed using gc.get_referrers after doing a gc.collect in gc debug mode) tensor is kept alive by `[(<class 'cell'>, '0x7f7360234400')]` tuple of cell objects `(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)` is kept alive by `[(<class 'function'>, '0x7f734fff0ee0')]` `<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>` is kept alive by `[(<class 'cell'>, '0x7f73602343d0')]` Put into more plain terms, ``` def stage_backward(...): ... stage_output_tensors = [] # a cell object will exist that contains the variables defined in stage_backward and used by # both stage_backward and nested functions # in this case, the cell object contains 'stage_output_tensors' but # this function object will hold a reference to a 'cell' that contains any vars from # the parent scope not explicitly passed into the function as args. def extract_tensors_with_grads(...): ... # extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors # is in the cell stage_output_tensors.append(output_val) ... # but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads), # so `extract_tensors_with_grads` will be in the cell extract_tensors_with_grads(...) 
``` More debug details: https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing In pdb: ``` gc.collect() g = gc.garbage g[-1] [rank0]:(Pdb) [rank0]:<function stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0> g[-2] [rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at 0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at 0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1 d6340>) g[-3] [rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06, 2.6226e-06, ..., 6.4969e-06, [rank0]: -4.4405e-06, -4.7684e-06], ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507 Approved by: https://github.com/awgu, https://github.com/kwen2501	2024-09-24 20:46:37 +00:00
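The closure cycle described in the commit above can be reproduced in plain Python. The following is a minimal sketch, not the code from `stage_backward`: the names `Payload`, `with_self_reference`, and `without_self_reference` are hypothetical, and the behaviour shown assumes CPython's reference counting. A nested function that calls itself by name ends up inside its own closure cell, so anything else it captures stays alive until the cycle collector runs.

```python
import gc
import weakref


class Payload:
    """Stand-in for an activation tensor (hypothetical helper)."""


def with_self_reference():
    held = [Payload()]

    def walk(depth):
        held.append(depth)
        if depth > 0:
            walk(depth - 1)  # refers to itself, so `walk` lives in its own closure

    walk(1)
    return weakref.ref(held[0])


def without_self_reference():
    held = [Payload()]

    def walk(depth, fn):
        held.append(depth)
        if depth > 0:
            fn(depth - 1, fn)  # recursion goes through an argument, no closure cell

    walk(1, walk)
    return weakref.ref(held[0])


gc.disable()  # keep the cycle collector from running on its own
try:
    ref = with_self_reference()
    alive_before_gc = ref() is not None  # cycle keeps `held` (and Payload) alive
    gc.collect()
    alive_after_gc = ref() is not None   # cycle collector frees it

    alive_without_cycle = without_self_reference()() is not None
finally:
    gc.enable()

print(alive_before_gc, alive_after_gc, alive_without_cycle)
```

Passing the function to itself as an argument is only one way to avoid the cycle; it is shown here because it keeps the example short, not because the PR necessarily chose it.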
chilli	7ff8e66140	Fix flexattention sympy expr printer issue (#136509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136509 Approved by: https://github.com/yanboliang	2024-09-24 20:10:29 +00:00
Henry Tsang	02ef5dd327	[inductor][test] Check if mkl dnn bf16 is supported when using bf16 (#136290 ) Sometimes the test is run on an older CPU, e.g. Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect the flags in its `lscpu` output, we don't see an `avx512_bf16` flag. That probably means bf16 is not supported on that hardware, and hence the unit test can fail. So we add the check in the code. Context: https://github.com/pytorch/pytorch/pull/135038 Differential Revision: D62984129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136290 Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78	2024-09-24 19:32:48 +00:00
Joel Schlosser	888744bd36	NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021 ) Related: #132695 This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes: * `(B, j0, D) + (1, 1, 1)` * `(B, j0, D) + (B, 1, 1)` * `(B, j0, D) + (B, 1, D)` * etc. This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021 Approved by: https://github.com/cpuhrsch	2024-09-24 19:11:49 +00:00
David Berard	8ecc5f1a8f	[TorchScript][tensorexpr] imbue locale for IRPrinter (#136458 ) We had an internal report where the NNC-generated CUDA code had thousands separators in integer literals. Although I wasn't able to cleanly repro, I did come up with a hacky repro and verified that this fix works (see #136459). Differential Revision: [D63278771](https://our.internmc.facebook.com/intern/diff/D63278771) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136458 Approved by: https://github.com/eellison	2024-09-24 19:00:57 +00:00
Nikita Shulga	c6192f32f1	[MPS] Add upsample_bicubic2d as Metal op (#136123 ) More or less literal copy-n-paste of `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)` and `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)` Missing `uint8` implementation mimics CUDA behavior Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk Later refinements: - Switch from 2D dispatch to 1D one (to match CUDA behavior) - Added batch + channel loops - Fixed scale computation to match align corners behavior - Added backward implementation The backward implementation, again, mimics CUDA, so it has precision issues for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e. ```metal template <typename T> static inline void atomic_add_helper( device atomic<int>* data, long offset, float value) { auto ptr = data + (offset >> 1); auto old = atomic_load_explicit(ptr, memory_order_relaxed); union { int i; T t[2]; } val; do { val.i = old; val.t[offset & 1] += static_cast<T>(value); } while (!atomic_compare_exchange_weak_explicit( ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed)); } ``` Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123 Approved by: https://github.com/albanD	2024-09-24 18:58:11 +00:00
Animesh Jain	dacf0c4884	[dynamo] Do not treat user defined nn module attributes static for dynamic shape infra (#136516 ) Fixes https://github.com/pytorch/pytorch/issues/136254 The regression was introduced in https://github.com/pytorch/pytorch/pull/132736 where originally we were trying to fix another regression. This PR and the offending PR together say - "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoids recompilations. This can lead to a cudagraph recording, which is ok. This also maintains the state before the inline_inbuilt_nn_modules flag was introduced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516 Approved by: https://github.com/williamwen42	2024-09-24 18:26:12 +00:00
Sam Larsen	1028cedf71	[inductor] Enable parallel compile by default in fbcode (#136246 ) Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import. Test Plan: Ran codecache tests with JK on, then canaried locally with JK off Differential Revision: D62913998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246 Approved by: https://github.com/eellison	2024-09-24 18:10:01 +00:00
Oguz Ulgen	9abdc62065	Allow fx graph caching higher order operators (opt-in) (#135877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877 Approved by: https://github.com/zou3519	2024-09-24 17:23:09 +00:00
ankurneog	efed357ef5	Add dtypes support in opinfo for Intel Gaudi (#132840 ) ## Motivation This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584 we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840 Approved by: https://github.com/albanD	2024-09-24 17:17:15 +00:00
PyTorch MergeBot	064093a4d6	Revert "Increase update_hint_regression problem size to 1000 (#136434 )" This reverts commit 3116fbda0fcf9af0c3dfe1280fb7e05e30e6ad5f. Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))	2024-09-24 17:05:20 +00:00
Shangdi Yu	ebfcbe0822	Move print_export_warning so lru_cache works (#136491 ) Summary: as title move print_export_warning() out of the function so `lru_cache` actually works Test Plan: CI Differential Revision: D63297083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136491 Approved by: https://github.com/pianpwk	2024-09-24 16:52:22 +00:00
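Why the placement of an `lru_cache`-decorated function matters can be sketched as follows. This is an illustration only, assuming the warning helper was previously defined inside another function; the names `export_nested`, `export_module_level`, and `_warn_once` are hypothetical and are not taken from the PR. A decorated function defined inside another function is rebuilt, with a fresh empty cache, on every outer call.

```python
from functools import lru_cache

calls = {"nested": 0, "module": 0}


def export_nested():
    # A new function object (and a new, empty cache) is created on each call,
    # so the body below runs every time.
    @lru_cache(maxsize=None)
    def warn_once():
        calls["nested"] += 1

    warn_once()


@lru_cache(maxsize=None)
def _warn_once():
    # Defined once at module level, so the cache persists across calls.
    calls["module"] += 1


def export_module_level():
    _warn_once()


for _ in range(3):
    export_nested()
    export_module_level()

print(calls)
```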
Fuzzkatt	44ec706789	add tolerance changes for test_sdpa_autocast in test_nestedtensor.py (#136485 ) Upstreaming minor unit test fix from nvidia internal CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/136485 Approved by: https://github.com/soulitzer	2024-09-24 16:31:32 +00:00
Robert Hardwick	eac04fe72a	Increase bf32 tolerances for some cdist tests in test_torch (#136315 ) - Set the new tolerances ~= N * eps(bfloat16), where N is the inner dimension of the matmul; this should be a comfortable upper bound for tolerances. Logic behind choice of tolerance: The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`. I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less. Fixes test failures on Arm® Neoverse™ V1 (not raised as an issue as this hardware type is not currently covered by the linux-aarch64 workflow) ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large self.assertEqual(expected, actual) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual raise error_metas.pop()[0].to_error( AssertionError: Tensor-likes are not close! Mismatched elements: 134118 / 1000000 (13.4%) Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed) Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed) ``` @malfet @jondea Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315 Approved by: https://github.com/albanD	2024-09-24 16:10:11 +00:00
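The `N * eps(bfloat16)` bound from the commit above is simple arithmetic and can be sketched without PyTorch. The helper name `bf16_sum_tolerance` and the example value of N are hypothetical; the only fact relied on is that bfloat16 keeps 8 significant bits (7 stored), which gives a machine epsilon of 2^-7.

```python
# bfloat16 machine epsilon: 2**-7 == 0.0078125
BF16_EPS = 2.0 ** -7


def bf16_sum_tolerance(n_terms: int) -> float:
    """Upper-bound tolerance for summing n_terms values in bfloat16,
    following the N * eps rule quoted in the commit message."""
    return n_terms * BF16_EPS


# Hypothetical inner dimension, chosen only for illustration.
example_tolerance = bf16_sum_tolerance(5)
print(example_tolerance)
```

The bound scales with the magnitude of the summed terms, so this sketch implicitly assumes values of order one.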
Ma Jian	0b667c073e	Disable compiled autograd for re-entrant autograd (#135795 ) Fixes #135298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795 Approved by: https://github.com/xmfan	2024-09-24 15:09:16 +00:00
gaopengff	33e10803c8	Fix ut in internal distributed_test.py (#136251 ) The test case test_new_subgroups_by_enumeration_input_rank_exceeds_world_size failed for me and passes with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to the [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251 Approved by: https://github.com/kwen2501	2024-09-24 15:06:20 +00:00
Justin Chu	58274e4655	Remove onnx imports in dynamo (#136334 ) Remove imports of the ``torch.onnx.operators`` module in dynamo. Since ONNX depends on dynamo, this import line causes a circular dependency. Judging from the source they are not actually needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136334 Approved by: https://github.com/xadupre, https://github.com/jansel, https://github.com/titaiwangms	2024-09-24 14:54:23 +00:00
Isuru Fernando	2a178a6982	Avoid changing FTZ/DAZ flags in CPP builder (#136466 ) Fixes https://github.com/pytorch/pytorch/issues/136273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136466 Approved by: https://github.com/ezyang	2024-09-24 14:39:17 +00:00
Fuzzkatt	6300eb1dc7	tf32 off for test_noncontiguous_samples in test_ops.py (#136484 ) Upstreaming minor unit test fix from nvidia internal CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/136484 Approved by: https://github.com/soulitzer	2024-09-24 14:26:47 +00:00
Amadeusz Skrzypczak	47ebb5856e	Make avoid_device_init() aware of hpu device (#136194 ) Added hpu to devices handled by avoid_device_init() in FakeTensorMode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136194 Approved by: https://github.com/eellison	2024-09-24 14:13:45 +00:00
enkilee	54fc4f56ff	[Docs fix] fix syntax error in docs :torch.blackman_window (#136354 ) Fixes #ISSUE_NUMBER The page https://pytorch.org/docs/stable/generated/torch.blackman_window.html contains a syntax error in "equal to torch.blackman_window(L + 1, periodic=False)[:-1])": the trailing ")" should be deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136354 Approved by: https://github.com/soulitzer	2024-09-24 14:00:26 +00:00
Aaron Orenstein	9fc721d22b	Add cache logs + other minor caching cleanup (#136456 ) Summary: - Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache. - Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other) - Prepare `_ManifoldCache` for use with other subpath keys - Move create_cache to be more public and use it in codecache - Add _InductorMetaTy alias (still just a dict) - Cleaned up some common cached_autotune calls in triton_heuristics Test Plan: unit tests Reviewed By: oulgen Differential Revision: D62648249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456 Approved by: https://github.com/oulgen	2024-09-24 14:00:23 +00:00
IvanKobzarev	342c031f0e	[aotd] Fix freezing API for subclasses (#136265 ) Original issue: https://github.com/pytorch/ao/issues/890 The problem: TracingContext.flat_params contains the original params, with Subclasses that are not desugared, while the inductor.freezing API works on aot graphs, in which Subclasses are already desugared. flat_params are used only for this logic, and storing desugared subclasses in them fixes the issue. Testing: ``` python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses ``` Torch AO original failure: ``` python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265 Approved by: https://github.com/bdhirsh	2024-09-24 13:15:01 +00:00
cyy	f048569c24	[Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439 ) Follows #131671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439 Approved by: https://github.com/kwen2501	2024-09-24 13:05:15 +00:00
PyTorch MergeBot	538ee7bf60	Revert "Fix tensor.data_ptr() representation overflow (#135567 )" This reverts commit 2e8d431a8fbfdbdb07448195f16afa9e101188ac. Reverted https://github.com/pytorch/pytorch/pull/135567 on behalf of https://github.com/etaf due to Block XPU, let's re-land with triton update. ([comment](https://github.com/pytorch/pytorch/pull/135567#issuecomment-2371200549))	2024-09-24 12:59:14 +00:00
Bob Ren	32727b9859	Add types to _dynamo/testing.py (#136402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136402 Approved by: https://github.com/jansel	2024-09-24 10:23:54 +00:00
Xuehai Pan	73c10a04f6	[dynamo][easy] support `sys.intern` (#136081 ) Closes #134023 - #134023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136081 Approved by: https://github.com/anijain2305	2024-09-24 09:12:34 +00:00
Amin Alam	1266be21f4	deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141 ) Fix to #136140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141 Approved by: https://github.com/kwen2501	2024-09-24 07:26:10 +00:00
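For context on the `datetime.utcnow()` deprecation mentioned above: the standard-library replacement is a timezone-aware call. The snippet below is a generic sketch of that replacement and is not taken from the PR; whether the PR used exactly this form is an assumption.

```python
from datetime import datetime, timedelta, timezone

# datetime.utcnow() returns a naive datetime and is deprecated since Python 3.12.
# The timezone-aware equivalent carries its UTC offset explicitly.
now = datetime.now(timezone.utc)

# If existing code compares against naive UTC timestamps, the tzinfo can be
# dropped explicitly instead of relying on utcnow().
naive_now = now.replace(tzinfo=None)

print(now.isoformat(), naive_now.isoformat())
```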
Jianyu Huang	0a35986cdb	Add option to configure reduced precision math backend for SDPA (#135964 ) Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA. Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels Differential Revision: D62625515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964 Approved by: https://github.com/jbschlosser	2024-09-24 07:11:38 +00:00
Wu, Chunyuan	44c871c34b	[inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661 ) ## Description Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune. The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same). If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`. This PR disabled the epilogue fusion with GEMM template when the above condition is not satisfied. ### Unsupported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: 401408 * d0 + 100352 * d1 + *7168 d2 + 1792 * d3** + 128 * d4 + d5 The load of `buf1` in the epilogue node: 401408 * d0 + 100352 * d1 + *1792 d2 + 25088 * d3** + 128 * d4 + d5 The above two indexes are different. ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise( 'cpu', torch.float32, def inner_fn(index): i0, i1, i2, i3, i4, i5 = index tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp2 = tmp0 + tmp1 tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp4 = tmp2 + tmp3 return tmp4 , ranges=[8, 4, 14, 4, 14, 128], origin_node=clone, origins=OrderedSet([clone]) )) ``` ### Supported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: d0 + 576 * d1 + 32 * d2 The load of `buf1` in the epilogue node: d0 + 576 * d1 + 32 * d2 The above two indexes are the same. 
The layout of `buf2` and `buf1` are different though which is handled by the reindexer: `buf1`: `size=[324, 32], stride=[32, 1]` `buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]` ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise( 'cpu', torch.bfloat16, def inner_fn(index): _, i1, i2, i3 = index tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(_frozen_param4, i1) tmp3 = tmp1 * tmp2 tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2) tmp5 = tmp3 + tmp4 tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32) return tmp6 , ranges=[1, 32, 18, 18], origin_node=convert_element_type_4, origins=OrderedSet([add, mul, convert_element_type_4]) )) ``` ## TODO Add the support for fusions when the indexes are different in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-09-24 05:25:28 +00:00
Max Podkorytov	7283530db2	[ROCm][Inductor][CK] FP8 gemm (#136337 ) At the moment, this lowers torch._scaled_mm with tensorwise scaling and rowwise scaling for both A and B. We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337 Approved by: https://github.com/chenyang78	2024-09-24 05:19:45 +00:00
Aaron Orenstein	7f98781f84	Fix autodeps from D62049222 that pyfmt broke (#136455 ) Summary: `arc lint` changed the formatting which then caused autodeps to be confused. Test Plan: this passes: ``` arc lint --skip AUTODEPS fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py ``` Differential Revision: D63277059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455 Approved by: https://github.com/bobrenjc93, https://github.com/oulgen	2024-09-24 05:06:12 +00:00
blzheng	797c7e2802	[Quant][PT2E]change flatten recipe for X86InductorQuantizer (#136298 ) This PR modifies the flatten recipe: if none of the users of the flatten node are quantizable ops, int8 flatten will be disabled to avoid unnecessary dtype conversions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136298 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-09-24 04:30:12 +00:00
Riley Dulin	3be150653c	[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 ) Summary: Add a customizable loss function callback to NodeAccuracySummary to allow users to pass in their own loss function. Also, fix some type errors and propagate better exception messages when unexpected tensor comparisons occur. Finally, enhance the robustness of `generate_numeric_debug_handle` in the case where it is called multiple times on the same model, by avoiding reuse of the same IDs. Test Plan: Added a test for this case in `test_numeric_debugger`. Reviewed By: jerryzh168 Differential Revision: D62898297 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282 Approved by: https://github.com/jerryzh168	2024-09-24 03:28:12 +00:00
Guilherme Leobas	e09c5b6046	Remove `vt` argument in `raise_observed_exception` (#136037 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037 Approved by: https://github.com/zou3519	2024-09-24 02:36:57 +00:00
fduwjj	9372692c7b	[FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473 ) Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473 Approved by: https://github.com/c-p-i-o	2024-09-24 02:34:43 +00:00
Nikita Shulga	4fd16dd8aa	Clarify that `libtorch` API is C++17 compatible (#136471 ) As it relies on some common C++17 primitives, such as `std::optional` Replace all docs references from C++14 to C++17 Fixes https://github.com/pytorch/pytorch/issues/133205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-09-24 02:03:33 +00:00
Jez Ng	e4d294221b	[inductor] Log precompilation time (#136395 ) This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395 Approved by: https://github.com/eellison	2024-09-24 01:47:54 +00:00
Edward Z. Yang	802ba79121	Inherit all secrets to inductor workflow (#135354 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354 Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet	2024-09-24 01:30:40 +00:00
Aaron Orenstein	06909803cc	Existing mypy issues (#136236 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2024-09-24 01:02:07 +00:00
Xuan Zhang	a14f57b126	fix the inductor tests (#136474 ) Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474 Approved by: https://github.com/malfet	2024-09-24 00:59:22 +00:00
Nikita Shulga	9d9bc65b5e	Make `FlashAttentionKernel.cpp` compilable for SVE with GCC-11 (#136477 ) Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they are all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528 Hat tip to @abhishek-iitmadras for the investigation Fixes https://github.com/pytorch/pytorch/issues/136432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477 Approved by: https://github.com/atalman, https://github.com/kit1980	2024-09-24 00:54:26 +00:00
Ke Wen	e0f84f40f7	[Pipelining] Allow non-0 stages to accept kwargs (#136416 ) To support a use case in torchchat: all non-0 stages require `input_pos` and `cache_lane`. ``` kwargs = {"input_pos": input_pos, "cache_lane": lane} if pp_rank == first_pp_rank: output = decorder.step(new_token, kwargs) elif pp_rank == last_pp_rank: output = decorder.step(kwargs) else: # middle pp ranks decorder.step(**kwargs) ``` The `forward_one_chunk` code today hard-codes `{}` as the kwargs for non-0 stages, and hence cannot support the above use case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416 Approved by: https://github.com/wconstab	2024-09-23 23:50:59 +00:00
Guilherme Leobas	52c917b0ba	Optimize dict reconstruct to not codegen untouched values (#134876 ) PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow: (1) codegen(...) each pair of key/value (2) create a new dictionary to hold the new items (3) clear the original dictionary (4) update the original dict with the one created in (2) We do a micro optimization in the generated bytecode to: - Only codegen the items that changed. - Only clear the original dictionary if a key was removed. Fixes: #133487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876 Approved by: https://github.com/zou3519	2024-09-23 21:45:44 +00:00
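The optimization described in the commit above (write back only the changed items, and clear the original dict only when a key was removed) can be illustrated at the level of ordinary Python dicts. This is a sketch of the idea, not Dynamo's bytecode generation: the function `apply_minimal_update` and its arguments are hypothetical, and identity comparison is used here as a stand-in for "the value was touched".

```python
_MISSING = object()


def apply_minimal_update(target, before, after):
    """Bring `target` (currently equal to `before`) in line with `after`.

    Returns the list of keys that had to be written.
    """
    removed = [k for k in before if k not in after]
    if removed:
        # A key disappeared: rebuild from scratch so contents and order match.
        target.clear()
        target.update(after)
        return list(after)

    # No removals: only write items that are new or whose value object changed.
    changed = [k for k, v in after.items() if before.get(k, _MISSING) is not v]
    for k in changed:
        target[k] = after[k]
    return changed


original = {"a": 1, "b": 2}
snapshot = dict(original)
updated = dict(snapshot)
updated["b"] = 3      # modified value
updated["c"] = 4      # newly added key

written = apply_minimal_update(original, snapshot, updated)
print(written, original)
```

In this run only `b` and `c` are written; the untouched `a` entry is left alone, which is the saving the commit message describes.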
fduwjj	5033a1ca0d	[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957 ) 1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we create a new TCPStore server first so that we don't need to append the attempt count as a prefix, and we avoid an eventual TCPStore sync failure. (This is only for the TCPStore sharing enabled case) 2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned to a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer. 3. The port is then broadcast for dynamic_rendezvous. Only one more question: what do we do about the store created from (_create_tcp_store) torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we ok with creating a duplicate TCPStore server? Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957 Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o	2024-09-23 20:32:24 +00:00
PyTorch MergeBot	fd182b90a7	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))	2024-09-23 19:57:13 +00:00
Nikita Shulga	08dba25775	[BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449 ) - `Tensor::type()` -> `Tensor::scalar_type()` - `Tensor::data<T>()` -> `Tensor::data_ptr<T>()` Should fix following warnings during the compilation: ``` caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o[0m /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 739 \| int* rowOffsets = crow.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 740 \| int* colIndices = col.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. 
If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here 225 \| DeprecatedTypeProperties & type() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. 
[-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. 
[-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449 Approved by: https://github.com/huydhn	2024-09-23 19:20:34 +00:00
Xiaodong Wang	9a1dc41de7	[AMD] Skipping 0 byte send/recv for AMD GPU (#136362 ) Summary: We found jobs getting stuck by send/recv zero bytes with RDMA on AMD GPUs. So just skipping them. Reviewed By: danzimm Differential Revision: D63075000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362 Approved by: https://github.com/malfet, https://github.com/houseroad	2024-09-23 19:14:12 +00:00
Edward Z. Yang	3116fbda0f	Increase update_hint_regression problem size to 1000 (#136434 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434 Approved by: https://github.com/laithsakka	2024-09-23 18:51:44 +00:00
PyTorch MergeBot	274883083d	Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318 )" This reverts commit d21841d077b00350d5e621e7b74dace71849c701. Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))	2024-09-23 17:47:49 +00:00
Aleksei Nikiforov	d859fcbc61	s390x: build s390x binaries on each pull request (#125399 ) Ensure that s390x keeps building for each PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399 Approved by: https://github.com/huydhn	2024-09-23 17:39:48 +00:00
Joel Schlosser	83a3ee0699	Support embedding_bag() with NJT input (#135888 ) Fixes #93843 `EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888 Approved by: https://github.com/cpuhrsch	2024-09-23 17:35:19 +00:00
James Wu	4649aeaebf	Make AOTAutogradCache support remote FXGraphCache (#136173 ) Summary: After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache. This does not implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache. Test Plan: (Meta only tests) Reviewed By: aorenste Differential Revision: D62384944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173 Approved by: https://github.com/oulgen	2024-09-23 17:24:27 +00:00
Nikita Shulga	c3e678382b	Fix addmm silent correctness on aarch64 (#136371 ) Do not dispatch to fast gemmv functions when alpha is not equal to 1 Add regression test to address the problem Fixes https://github.com/pytorch/pytorch/issues/136299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371 Approved by: https://github.com/swolchok	2024-09-23 17:10:34 +00:00
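To make the failure mode in the addmm fix above concrete, here is a minimal pure-Python sketch of the documented `addmm` semantics, `beta * input + alpha * (mat1 @ mat2)`. The function `addmm_ignoring_alpha` is a hypothetical stand-in for a fast path that drops `alpha`; it is not the actual aarch64 kernel, only an illustration of why such a path is safe only when `alpha == 1`.

```python
def addmm_reference(inp, mat1, mat2, beta=1.0, alpha=1.0):
    """Reference semantics: beta * inp + alpha * (mat1 @ mat2), on nested lists."""
    rows, inner, cols = len(mat1), len(mat2), len(mat2[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = sum(mat1[i][k] * mat2[k][j] for k in range(inner))
            row.append(beta * inp[i][j] + alpha * acc)
        out.append(row)
    return out


def addmm_ignoring_alpha(inp, mat1, mat2, beta=1.0, alpha=1.0):
    """Hypothetical fast path that silently treats alpha as 1 (illustration only)."""
    return addmm_reference(inp, mat1, mat2, beta=beta, alpha=1.0)


inp = [[1.0]]
mat1 = [[2.0, 3.0]]
mat2 = [[4.0], [5.0]]

correct = addmm_reference(inp, mat1, mat2, alpha=0.5)    # 1 + 0.5 * 23
wrong = addmm_ignoring_alpha(inp, mat1, mat2, alpha=0.5)  # 1 + 1.0 * 23
print(correct, wrong)
```

The two results agree only when `alpha == 1`, which is the condition the commit uses to gate the fast path.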
Edward Z. Yang	f0f79dd8f1	Correctly convert Python float to float64 when passing argument as Tensor (#136413 ) I can't actually test the Dynamo codegen fix as it is impossible to directly use the Tensor at the moment. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413 Approved by: https://github.com/bobrenjc93	2024-09-23 16:48:08 +00:00
wz337	637d5c4b7e	[DSD] Fix loading uneven full tensor into sharded state dict (#136365 ) Fix #136228. This is a follow up on https://github.com/pytorch/pytorch/pull/135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365 Approved by: https://github.com/fegin	2024-09-23 16:35:58 +00:00
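The uneven-sharding problem described above can be sketched in plain Python. This assumes ceil-division chunking along the sharded dimension (the behaviour of `torch.chunk`); `chunk_sizes` is a hypothetical helper, not a DTensor API. It shows that multiplying a local shard size by the number of ranks does not recover the global size when the split is uneven, which is why the original shape has to be passed along.

```python
import math


def chunk_sizes(dim_size, num_shards):
    """Per-rank sizes along one dimension under ceil-division chunking."""
    chunk = math.ceil(dim_size / num_shards)
    sizes = []
    remaining = dim_size
    for _ in range(num_shards):
        size = max(0, min(chunk, remaining))
        sizes.append(size)
        remaining -= size
    return sizes


world_size = 4
global_size = 10

local_sizes = chunk_sizes(global_size, world_size)
# What each rank would infer if it assumed every rank holds the same amount.
naive_global = [size * world_size for size in local_sizes]

print(local_sizes)   # per-rank shard sizes
print(naive_global)  # none of these equals the true global size of 10
```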
fduwjj	da51fe1c42	[FR] Fix errors in all2all check, improve some log output (#136399 ) We found that we show the hashed pg name in our script output, which is not user-friendly. We also found a bug in our all2all check, and we made a number of changes to error messages to make them more readable. Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399 Approved by: https://github.com/c-p-i-o	2024-09-23 16:31:31 +00:00
PyTorch MergeBot	df6a8fa1eb	Revert "[aotd] Fix freezing API for subclasses (#136265 )" This reverts commit cdef760560049ebda5fb7e30b1703f345fe05cfa. Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))	2024-09-23 16:25:05 +00:00
Andrew Gu	9992084f38	[FSDP2] Fixed `test_all_gather_extensions_monkey_patch` (#136130 ) I messed up the test before. The extensions were not running :/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130 Approved by: https://github.com/weifengpy ghstack dependencies: #136129	2024-09-23 15:12:44 +00:00
Andrew Gu	b9f53c0dce	[FSDP2] Added module, mp policy to `fsdp_pre_all_gather` (#136129 ) - Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR. - Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example. The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check if the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant: a user who needs it can now save it from the `mp_policy` passed in the pre-all-gather. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129 Approved by: https://github.com/weifengpy	2024-09-23 15:12:36 +00:00
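The backward-compatibility check mentioned above can be sketched with the standard library's `inspect.signature`. The two classes and their argument lists are simplified, hypothetical stand-ins (the real FSDP2 hook takes more arguments); the point is only how a caller can count the parameters of a bound method and dispatch to the old or new calling convention.

```python
import inspect


class OldStyleParam:
    # Hypothetical old-style hook: only receives the mesh.
    def fsdp_pre_all_gather(self, mesh):
        return ("old", mesh)


class NewStyleParam:
    # Hypothetical new-style hook: also receives the module and mp policy.
    def fsdp_pre_all_gather(self, mesh, module, mp_policy):
        return ("new", mesh, module, mp_policy)


def call_pre_all_gather(param, mesh, module, mp_policy):
    """Call the hook with whichever signature the user subclass implements."""
    hook = param.fsdp_pre_all_gather
    # For a bound method, `self` is excluded from the reported parameters.
    num_params = len(inspect.signature(hook).parameters)
    if num_params == 1:
        return hook(mesh)
    return hook(mesh, module, mp_policy)


old_result = call_pre_all_gather(OldStyleParam(), "mesh", "module", "policy")
new_result = call_pre_all_gather(NewStyleParam(), "mesh", "module", "policy")
print(old_result, new_result)
```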
Bin Bao	d21841d077	[AOTI] Create another wrapper class to handle ArrayRef (#136318 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: D62961885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318 Approved by: https://github.com/frank-wei	2024-09-23 15:10:27 +00:00
PyTorch MergeBot	0e19522122	Revert "Adds support for accelerated sorting with x86-simd-sort (#127936 )" This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86. Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](`239a9ad65e`) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))	2024-09-23 14:52:23 +00:00
Edward Z. Yang	bae427e4b1	Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107 ) By refactoring this way, I can put a non-expiring LRU cache here. Splitting also will make it easier for me to tell who is using up all the time. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107 Approved by: https://github.com/aorenste	2024-09-23 14:39:20 +00:00
PyTorch MergeBot	e9bfbf78d5	Revert "Allow fx graph caching higher order operators (opt-in) (#135877 )" This reverts commit 66d5eb64e0be91680a8531ccb24f098554610d46. Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))	2024-09-23 09:04:24 +00:00
cyy	75f141be62	Avoid unnecessary CMake warnings on Windows (#136393 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393 Approved by: https://github.com/ezyang	2024-09-23 06:42:59 +00:00
Yuxin Wu	663e760065	add unittest for OOM message (#129671 ) Add unittest for the bug in #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671 Approved by: https://github.com/eqy	2024-09-23 04:48:01 +00:00
Yiming Zhou	068fdd602f	[export] enable custom tag metadata re-export test (#136048 ) Improves and enables a commented out test originally introduced in #131912 In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048 Approved by: https://github.com/zhxchen17	2024-09-23 04:37:58 +00:00
Oguz Ulgen	66d5eb64e0	Allow fx graph caching higher order operators (opt-in) (#135877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877 Approved by: https://github.com/zou3519	2024-09-23 04:33:27 +00:00
cyy	a38e4c5e1e	Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547 Approved by: https://github.com/ezyang	2024-09-23 03:44:55 +00:00
Isuru Fernando	f276da7f98	Remove prims.slice_in_dim and prims.slice (#136150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150 Approved by: https://github.com/ezyang	2024-09-23 01:27:22 +00:00
Xilun Wu	3406ac24d9	[BE] fix circular import in torch/distributed/utils.py (#136286 ) Summary Fix circular import in `torch/distributed/utils.py` found when running internal test, see D62901023. Curious why this wasn't causing any issue. Is this relevant code deprecated and no longer used? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286 Approved by: https://github.com/Skylion007	2024-09-22 20:54:12 +00:00
Shangdi Yu	3bc073d728	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. 
```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-22 04:51:37 +00:00
Zhou, Lingzhi	35532fc477	[Partitioner] Reuse partition to check whether nodes exist (#135317 ) Checking whether a node is in a NodeList takes O(n) time. Reuse the partition to speed this up, since partition.nodes is a hash table that holds the same elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317 Approved by: https://github.com/ezyang	2024-09-21 23:52:02 +00:00
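The complexity argument above can be illustrated with plain Python containers. The names `node_list` and `partition_nodes` are hypothetical and only mirror the idea of the commit: a list scan compares elements one by one, while a hash-based container answers the same membership question with a single lookup on average.

```python
def in_list(nodes, target):
    """Linear scan; also returns how many comparisons were needed."""
    comparisons = 0
    for node in nodes:
        comparisons += 1
        if node == target:
            return True, comparisons
    return False, comparisons


node_list = [f"node_{i}" for i in range(1000)]
# A dict used as an insertion-ordered hash set holding the same elements.
partition_nodes = dict.fromkeys(node_list)

found_in_list, comparisons = in_list(node_list, "node_999")
found_in_hash = "node_999" in partition_nodes  # average O(1) lookup

print(found_in_list, comparisons, found_in_hash)
```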
cyy	e4cdc31227	[14/N] Fix clang-tidy warnings in aten/src/ATen (#133988 ) Follows #133807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988 Approved by: https://github.com/ezyang	2024-09-21 22:41:40 +00:00
Bob Ren	9731ccb9e0	Type _dynamo/variables/lazy.py (#136376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376 Approved by: https://github.com/Skylion007	2024-09-21 22:18:02 +00:00
Jovian Anthony Jaison	09715638ab	Add _dynamo.config.suppress_errors logging (#136379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379 Approved by: https://github.com/ezyang	2024-09-21 21:00:26 +00:00
Aaron Orenstein	3176966732	update cache tests (#136215 ) Summary: - Clean up cache test code a bit. - Removed patch_fbcode() - it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded a module for the first time which had a top-level fbcode check). Test Plan: unit tests Reviewed By: oulgen Differential Revision: D62648248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215 Approved by: https://github.com/bobrenjc93	2024-09-21 20:36:22 +00:00
Ramana Sundararaman	be4b7e8131	Param fixes in docstring (#136097 ) Fixes wrong param names in docstrings. cc: @kit1980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097 Approved by: https://github.com/ezyang	2024-09-21 18:56:34 +00:00
Aaron Gokaslan	b6ffa381e1	[BE]: Add half CUDA support nextafter (#136373 ) Making CUDA support match CPU support for nextafter Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373 Approved by: https://github.com/ezyang	2024-09-21 17:13:45 +00:00
PyTorch MergeBot	cc17d58809	Revert "S390x update builder image (#132983 )" This reverts commit 080a249fc2290602402e01bf5864d9d9a416e5b6. Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))	2024-09-21 16:46:54 +00:00
Xuan Zhang	03957efa5d	[inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874 ) Motivations: A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf). Solutions: 1. implement a peak memory estimator via liveness analysis 2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory Results: On some models we can reduce the peak memory significantly: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:-----------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| alexnet \| 128 \| 1.17 \| 0.99 \| 1.19 \| \| vgg16 \| 64 \| 4.10 \| 3.57 \| 1.15 \| \| DebertaV2ForQuestionAnswering \| 1 \| 11.60 \| 10.56 \| 1.10 \| In the presence of compiler based AC, peak memory can be further reduced: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:------------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| AlbertForMaskedLM \| 4 \| 6.87 \| 6.43 \| 1.07 \| \| AlbertForQuestionAnswering \| 4 \| 8.69 \| 7.76 \| 1.12 \| \| MobileBertForQuestionAnswering \| 128 \| 4.67 \| 3.90 \| 1.20 \| [Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case. Other infos: * neutral model runtime, because the reordering happens after fusion. So memory saving is _for free_. * minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all Hugging Face benchmark models, the additional compile time is less than 1 second. 
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874 Approved by: https://github.com/yf225	2024-09-21 16:28:38 +00:00
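The two steps named in the commit above (a liveness-based peak memory estimator, then picking the candidate order with the lowest estimate) can be sketched as follows. This is a toy model under stated assumptions, not Inductor's implementation: each node produces one buffer of a known size, a buffer is freed right after its last consumer runs, and `estimate_peak_memory` and `pick_order` are hypothetical names.

```python
def estimate_peak_memory(order, out_size, inputs):
    """Peak of live buffer sizes when nodes run in `order` (a topological order)."""
    # A buffer stays live from the step that creates it to its last consumer.
    last_use = {}
    for step, node in enumerate(order):
        last_use[node] = step
        for dep in inputs[node]:
            last_use[dep] = step
    freed_at = {}
    for node, step in last_use.items():
        freed_at.setdefault(step, []).append(node)

    live = peak = 0
    for step, node in enumerate(order):
        live += out_size[node]
        peak = max(peak, live)
        for dead in freed_at.get(step, []):
            live -= out_size[dead]
    return peak


def pick_order(baseline, candidates, out_size, inputs):
    """Adopt a candidate only if the estimator says it lowers peak memory."""
    best = min(candidates, key=lambda o: estimate_peak_memory(o, out_size, inputs))
    if estimate_peak_memory(best, out_size, inputs) < estimate_peak_memory(
        baseline, out_size, inputs
    ):
        return best
    return baseline


# Two independent chains (a -> b, c -> d) joined by e; a and c are large.
inputs = {"a": [], "b": ["a"], "c": [], "d": ["c"], "e": ["b", "d"]}
out_size = {"a": 4, "b": 1, "c": 4, "d": 1, "e": 1}

baseline = ["a", "c", "b", "d", "e"]      # both large buffers live at once
candidates = [["a", "b", "c", "d", "e"]]  # finish one chain before the other

baseline_peak = estimate_peak_memory(baseline, out_size, inputs)
chosen = pick_order(baseline, candidates, out_size, inputs)
chosen_peak = estimate_peak_memory(chosen, out_size, inputs)
print(baseline_peak, chosen, chosen_peak)
```

Falling back to the baseline when no candidate improves the estimate mirrors the "no peak memory regression" point in the message, within the limits of what the estimator models.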
DavidGu-Datong	fb4670a1f9	fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174 ) Fixes #134848 For BF16/FP16, when a tensor is specified in `out` parameter of mean, the mean kernel should use its storage for output, but that doesn't happen, since an `at::to` in the current code causes storage to be allocated again, but the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174 Approved by: https://github.com/soulitzer	2024-09-21 14:21:42 +00:00
Will Constable	ea737e4e5d	[Pipelining] Make PipelineStage support meta initialization (#136243 ) Avoid allocating memory or dry-running the submodule during stage init. Save user-provided input/output metadata during stage init, to allow lazily initializing the buffers before the first step call. Later, we plan to build on top of this to add lazy shape inference (#130856) so that no input/output shapes are required at stage init. For now, we require input/output tensors for stage init, but these should be on meta device and stage should not allocate any real memory. Note: this needs more thorough testing and review, but it worked on the torchtitan 3d test. TODO: - delete 'device' arg from PipelineStage ctor? (infer it from the arg tensors passed to the first step call?) Separate PR. - delete 'output_args' from PipelineStage ctor? we don't actually need it, but we use it to do shape validation, which is why I didn't remove it in this PR. Proposal: leave it until we add lazy shape inference? Fixes #136225, #136226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243 Approved by: https://github.com/H-Huang, https://github.com/kwen2501	2024-09-21 09:47:22 +00:00
cyy	c459430558	Pass Werror to CUDA host compiler (#130213 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213 Approved by: https://github.com/ezyang	2024-09-21 08:01:06 +00:00
Menglu Yu	e18439113e	[PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349 ) Summary: see D62220158 Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pad_mm -- --exact 'caffe2/test/inductor:pad_mm - test_pad_mm_bf16 (caffe2.test.inductor.test_pad_mm.PadMMTest)' --run-disabled ``` ### H100 Buck UI: https://www.internalfb.com/buck2/e5d85802-cab7-41a5-aacc-95f541796a99 Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149258587374 Network: Up: 9.1KiB Down: 0B (reSessionID-b339b51b-6a0e-4347-9414-1ba38f26a5d0) Jobs completed: 9. Time elapsed: 1:15.7s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 1. Build failure 0 ### A100 Buck UI: https://www.internalfb.com/buck2/1082ad6e-56b0-4eb5-8092-ce507ca9a70e Test UI: https://www.internalfb.com/intern/testinfra/testrun/8444249533824784 Network: Up: 9.2KiB Down: 0B (reSessionID-2b3056ac-f29e-4de4-b6f5-9d994acf566b) Jobs completed: 9. Time elapsed: 1:36.9s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E see D62220158 Differential Revision: D63040455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136349 Approved by: https://github.com/dshi7	2024-09-21 06:35:50 +00:00
cyy	02871461f7	Fix clang-tidy warnings in torch/csrc/lazy (#134655 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134655 Approved by: https://github.com/ezyang	2024-09-21 02:59:35 +00:00
Laith Sakka	0b91e7e2dc	Remove duplicate line (#136383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-09-21 01:35:13 +00:00
eqy	29f7b8d483	[TF32] Account for TF32 in `test_conv_double_backward` (#135716 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135716 Approved by: https://github.com/Skylion007	2024-09-21 01:06:22 +00:00
Nikita Shulga	7936584a88	Fix `Vectorized<double>::next_after` SVE compilation (#136388 ) Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values This fixes a compilation issue introduced by https://github.com/pytorch/pytorch/pull/119571 and exposed by https://github.com/pytorch/pytorch/pull/133339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136388 Approved by: https://github.com/kit1980	2024-09-20 23:54:17 +00:00
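Why the single-precision and double-precision variants of `nextafter` are not interchangeable can be seen with a small standard-library sketch. This does not call Sleef; `nextafter_f32_up` is a hypothetical helper that steps a positive finite float32 by incrementing its bit pattern, and `math.nextafter` (Python 3.9+) provides the double-precision step.

```python
import math
import struct


def nextafter_f32_up(x):
    """Next representable float32 above a positive finite x (bit increment)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


# Distance from 1.0 to the next representable value in each precision.
step_f64 = math.nextafter(1.0, 2.0) - 1.0
step_f32 = nextafter_f32_up(1.0) - 1.0

print(step_f64, step_f32)
```

The float32 step is 2**29 times larger than the float64 step, so the two precisions give different "next" values for the same input.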
albanD	067d203b22	Upgrade pybind11 API calls for 3.13t (#136370 ) This is a modified version of https://github.com/pytorch/pytorch/pull/130341 that preserve support for older pybind version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136370 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-09-20 23:09:55 +00:00
Colin Peppler	1a10751731	[AOTI][Tooling] Filter out kernels based off lowercase names (#135395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135395 Approved by: https://github.com/YUNQIUGUO	2024-09-20 21:56:08 +00:00
Isuru Fernando	0c936c3ecb	Add decomps for max_unpool (#133146 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-20 21:35:25 +00:00
侯奇	293fccf86d	add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012 ) `torch::cuda::nccl` is an option for developers to depend only on torch but not nccl. To use `torch::cuda::nccl::send`/`torch::cuda::nccl::recv`, `ncclGroupStart()`/`ncclGroupEnd()` is needed, for which `torch::cuda::nccl::AutoNcclGroup` can be used. However, `torch::cuda::nccl::AutoNcclGroup` is not exported and is a LOCAL symbol, so it can't be used from outside of libtorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130012 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-09-20 21:20:25 +00:00
Matthew Sterrett	239a9ad65e	Adds support for accelerated sorting with x86-simd-sort (#127936 ) Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads. <details> <summary><b>Contiguous Benchmarks</b></summary> ``` float32, normally distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214 128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697 1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241 10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577 100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766 1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416 10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323 int32_t, uniformly distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779 128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698 1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285 10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997 100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472 1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611 10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952 ``` </details> Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction. 
<details> <summary><b>Discontiguous Benchmarks</b></summary> ``` float, normal distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243 128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923 1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952 10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974 100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743 1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133 10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621 int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908 128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862 1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408 10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889 100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902 1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381 10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742 ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-20 21:19:33 +00:00
cyy	d2455b99fb	Use cpython declaration of _PyWeakref_ClearRef (#136300 ) To avoid the DLL inconsistency warning by MSVC: ``` torch/csrc/utils/python_compat.h(38): warning C4273: '_PyWeakref_ClearRef': inconsistent dll linkage ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136300 Approved by: https://github.com/Skylion007	2024-09-20 18:58:58 +00:00
Bob Ren	7f9c06462f	fix mypi in utils/_sympy/functions.py (#136339 ) Signed-off-by: Bob Ren <bobren@fb.com> Turns out older versions of Python, in particular 3.8, show errors that 3.12 doesn't. For posterity, these are the steps I took to reproduce:
```
conda create -n py38 python=3.8
conda activate py38
pip install -r requirements.txt
lintrunner init
dmypy restart && lintrunner --all-files --take MYPY
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136339 Approved by: https://github.com/Skylion007 ghstack dependencies: #136205	2024-09-20 18:39:16 +00:00
Bin Bao	f53a0f9cc1	[Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356 ) Summary: Internal profiler behaves differently after turning on triton.autotune_at_compile_time. Needs more investigation but turning it off for this test for now. Reviewed By: henrylhtsang Differential Revision: D63035855 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356 Approved by: https://github.com/henrylhtsang	2024-09-20 18:35:09 +00:00
Xu Song	5997354151	Add more distributed examples (#130427 ) 1. Add `gather` example 2. Add device to `scatter` example Pull Request resolved: https://github.com/pytorch/pytorch/pull/130427 Approved by: https://github.com/kwen2501	2024-09-20 18:27:27 +00:00
PyTorch MergeBot	df1eef9779	Revert "[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 )" This reverts commit f3c54ccf8f6139807f4623037c0174964a286652. Reverted https://github.com/pytorch/pytorch/pull/136282 on behalf of https://github.com/huydhn due to This breaks OSS, let revert it and land the revert internally then ([comment](https://github.com/pytorch/pytorch/pull/136282#issuecomment-2364219252))	2024-09-20 17:49:06 +00:00
Jeff Daily	15dba021bb	[ROCm][CI] upgrade CI to ROCm 6.2 (#132555 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2024-09-20 17:39:31 +00:00
Chirag Pandya	29affa6b95	return instead of using skipTest (#136244 ) Summary: Return from functions instead of using `skipTest`. This is mostly to make our test report happier. Skipped tests still show up in our Broken test report. ``` OK (skipped=1) I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters Skipped: Store doesn't support extended APIs ``` Test Plan: Tested locally. Test shows up as passed instead of skipped. ``` Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D62912065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244 Approved by: https://github.com/XilunWu	2024-09-20 17:36:28 +00:00
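The difference between the two approaches in #136244 can be sketched with the standard `unittest` module. This is a minimal illustration, not code from the PR: the class `StoreCase` and the flag `supports_extended_api` are hypothetical stand-ins.

```python
import unittest


class StoreCase(unittest.TestCase):
    # Hypothetical flag standing in for "the store supports extended APIs".
    supports_extended_api = False

    def test_with_skip(self):
        if not self.supports_extended_api:
            # Recorded as a skip in the test result.
            self.skipTest("Store doesn't support extended APIs")

    def test_with_return(self):
        if not self.supports_extended_api:
            # Returning early: the test is recorded as a plain pass.
            return


def run_single(name):
    """Run one test method and return its TestResult."""
    result = unittest.TestResult()
    StoreCase(name).run(result)
    return result


skip_result = run_single("test_with_skip")
return_result = run_single("test_with_return")
print(len(skip_result.skipped), len(return_result.skipped))
```

The trade-off is visible in the result objects: the early return hides the fact that nothing was actually checked, which is what makes the report "happier".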
David Berard	d7a6980078	[inductor] Make DtypeView work with cpp_wrapper without abi_compatible (#136233 ) Fixes #136159 Prior to this PR, using cpp_wrapper without abi_compatible could result in incorrect dtypes. The following block of code implements cpp_wrapper codegen for reinterpret_view for abi_compatible mode, but not for non-abi_compatible mode. `f6f1504d39/torch/_inductor/codegen/cpp_wrapper_cpu.py (L1678-L1814)` Added a test that verifies that we keep the view behavior, but returned tensors also have correct dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136233 Approved by: https://github.com/FindHao, https://github.com/eellison, https://github.com/jansel	2024-09-20 17:30:35 +00:00
Aleksei Nikiforov	080a249fc2	S390x update builder image (#132983 ) S390x update builder image Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-20 17:26:26 +00:00
PyTorch MergeBot	783c5ba80a	Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765 )" This reverts commit 0b81f700aa7eb20d4b9f20e9627dd1208e50ea58. Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))	2024-09-20 17:10:27 +00:00
IvanKobzarev	cdef760560	[aotd] Fix freezing API for subclasses (#136265 ) Original issue: https://github.com/pytorch/ao/issues/890 The problem: TracingContext.flat_params contains the original params, with Subclasses that are not desugared, while the inductor.freezing API works on AOT graphs, in which Subclasses are already desugared. flat_params is used only for this logic, and storing desugared subclasses in it fixes the issue. Testing: ``` python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses ``` Torch AO original failure: ``` python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265 Approved by: https://github.com/bdhirsh	2024-09-20 16:32:49 +00:00
Aditya Tewari	4842f0fac6	Enable torch build with SLEEF on ARM by default (#133339 ) Scope: Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform. Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend. On models like `hf_Bert_Large` , `hf_GPT_fast`, we're seeing a ~1.2x speedup (with 16 threads). The below results are run with `Batch_Size=1` and `Cores=8, 16` ![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339 Approved by: https://github.com/malfet, https://github.com/kimishpatel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-20 16:02:32 +00:00
Riley Dulin	f3c54ccf8f	[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 ) Summary: Add a customizable loss function callback to NodeAccuracySummary to allow users to pass in their own loss function. Also, fix some type errors and propagate better exception messages when unexpected tensor comparisons occur. Finally, enhance the robustness of `generate_numeric_debug_handle` in the case where it is called multiple times on the same model, by avoiding reuse of the same IDs. Test Plan: Added a test for this case in `test_numeric_debugger`. Reviewed By: jerryzh168 Differential Revision: D62898297 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282 Approved by: https://github.com/jerryzh168	2024-09-20 07:34:52 +00:00
Sun, Jiayi	687e5cf8c5	[inductor] Relax the conditions for loop split (#135335 ) Summary This PR relaxes the conditions for loop split to support dynamic shape cases. Now the conditions that need to be met to apply loop split optimization are as follows: 1. No reduction and no modular index for all nodes. 2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs. Example:
```
import torch
import torch.nn as nn

class GN(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m, dynamic=True)

with torch.no_grad():
    compiled_m(input)
```
Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`.
After loop split, `z3` is split into `30*z3 + z4`, so the node's var_ranges change to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and the indexing_exprs change to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}` Generated code: - Before: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec =
welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))]; auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp11 = in_ptr1[static_cast<int64_t>(x3)]; auto tmp13 = in_ptr2[static_cast<int64_t>(x3)]; auto tmp2 = decltype(tmp0)(tmp0 - tmp1); auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp5 = c10::convert<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 =
decltype(tmp2)(tmp2 * tmp9); auto tmp12 = decltype(tmp10)(tmp10 * tmp11); auto tmp14 = decltype(tmp12)(tmp12 + tmp13); out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14; } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` After: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec =
Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L)) { for(int64_t
x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))); } for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)),
static_cast<int64_t>(14L)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L)); } } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-09-20 05:42:52 +00:00
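The index arithmetic behind the loop split in #135335 can be checked in plain Python. This is a sketch of the idea only, not Inductor code: the helper names are hypothetical, and the constants 960, 30 and 32 are taken from the GroupNorm example in the commit message.

```python
CHANNELS = 960   # range of z3 before the split
DIVISOR = 30     # integer divisor in the term z3 // 30
GROUPS = CHANNELS // DIVISOR  # 32, range of the new outer variable


def group_index_before(z3):
    # Before the split, the group index needs a floor division per element.
    return z3 // DIVISOR


def group_index_after(z3_outer, z4):
    # After substituting z3 -> 30*z3_outer + z4 with 0 <= z4 < 30,
    # (30*z3_outer + z4) // 30 simplifies to z3_outer: no division left.
    return z3_outer


def split_is_equivalent():
    """Check that the split loop nest visits the same elements in the same order."""
    flat = []
    for z3_outer in range(GROUPS):
        for z4 in range(DIVISOR):
            z3 = DIVISOR * z3_outer + z4
            if group_index_before(z3) != group_index_after(z3_outer, z4):
                return False
            flat.append(z3)
    return flat == list(range(CHANNELS))


print(split_is_equivalent())
```

Because the inner variable `z4` is contiguous and division-free, the inner loop is the part that can be vectorized, which matches the `x4` loops in the "After" kernel.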
albanD	cf31724db7	Fix and improvements to toward 3.13t (#136319 ) Small part of https://github.com/pytorch/pytorch/pull/130689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319 Approved by: https://github.com/malfet, https://github.com/Skylion007	2024-09-20 04:22:18 +00:00
Tom Ritchford	e3ea5429f2	Implement GetAttrVariable.as_python_constant() (#134216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134216 Approved by: https://github.com/amjames, https://github.com/williamwen42	2024-09-20 03:44:43 +00:00
Sergii Dymchenko	d9aca9914b	Remove duplicated words in library.rst (#136340 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340 Approved by: https://github.com/svekars	2024-09-20 03:30:54 +00:00
Huy Do	fe0e9fb385	Fix flaky SIGSEGV crash in test_profile_memory (#136304 ) Fixes https://github.com/pytorch/pytorch/issues/132331 We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash). I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`. ### Testing `pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304 Approved by: https://github.com/briancoutinho	2024-09-20 02:56:49 +00:00
Kurt Mohler	d45b0151e5	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-20 02:41:56 +00:00
Felix Su	1dfa07e885	passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913 ) Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied. This change is made to provide a better debugging experience for the user. Test Plan: unit tests Reviewed By: gag1jain Differential Revision: D62408767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913 Approved by: https://github.com/gag1jain	2024-09-20 00:54:02 +00:00
Tristan Rice	bebf5302ba	TCPStoreLibUvBackend: trace operations (#136320 ) Summary: This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values. This changes the C10D_* macros to be much more efficient: previously we would always format the log string even if it would never be printed, which is very wasteful for detailed tracing. The macros are now gated with an if statement to achieve the same behavior with no formatting overhead when the log level is disabled. Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). 
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu	2024-09-20 00:53:21 +00:00
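The gating described in #136320 (check the level first, format only if the message will be emitted) can be illustrated with Python's standard `logging` module. This is an analogy, not the C10D_* macro implementation, which is C++; the logger name, `trace` helper and `expensive_format` function are hypothetical.

```python
import logging

logger = logging.getLogger("c10d_sketch")  # hypothetical logger name
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.INFO)

format_calls = 0


def expensive_format(key):
    """Stand-in for building a detailed trace string."""
    global format_calls
    format_calls += 1
    return f"set key:{key}"


def trace(make_message):
    # Gate on the level before doing any formatting work,
    # so disabled trace calls cost only this check.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(make_message())


trace(lambda: expensive_format("init/"))
calls_when_disabled = format_calls  # formatting was skipped

logger.setLevel(logging.DEBUG)
trace(lambda: expensive_format("init/"))
calls_when_enabled = format_calls  # formatting ran exactly once

print(calls_when_disabled, calls_when_enabled)
```

Passing a callable rather than a pre-built string is what keeps the disabled path cheap; a macro achieves the same effect in C++ because its arguments are not evaluated unless the `if` branch is taken.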
Wei Wang	9b424aac1d	[CI][CUSPARSELT] Extend cusparselt installation script to support cuda 12.6 (#136321 ) To prepare for future cuda updates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136321 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-19 23:45:57 +00:00
Brian Hirsh	172ecf78b7	DTensor: dont hash symint tensor input in propagate_tensor_meta (#136266 ) This fixes a subset of issues for dynamic shapes + DTensor. It's pretty easy to run into other issues - it's likely that we need https://github.com/pytorch/pytorch/pull/125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136266 Approved by: https://github.com/ezyang, https://github.com/yf225	2024-09-19 20:39:36 +00:00
cyy	7bbdf87517	[22/N] Fix clang-tidy warnings in jit (#134829 ) Follows #134537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829 Approved by: https://github.com/ezyang	2024-09-19 19:24:42 +00:00
Laith Sakka	b71802fa79	add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175 Approved by: https://github.com/ezyang	2024-09-19 19:15:50 +00:00
Rachel Guo	8cba0ec958	[AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug printer (#136182 ) Summary: Add a third mode where we only print kernel names without dumping any intermediate tensor values. It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues. Thanks to ColinPeppler and henrylhtsang for this "feature request". Test Plan: The output can look like this if you set `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`: {F1871629091} Differential Revision: D62791371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182 Approved by: https://github.com/henrylhtsang	2024-09-19 18:51:57 +00:00
Shan19900305	49723a8ff3	fix stride compare failed when size value equal to one in ForeachUtils.h (#134546 ) When a size value is equal to one, the stride value for that dimension should be skipped in the comparison. @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546 Approved by: https://github.com/janeyx99	2024-09-19 18:43:41 +00:00
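The rule in #134546 (ignore strides of size-1 dimensions when comparing layouts) can be sketched as follows. This is a minimal Python illustration, not the C++ code in ForeachUtils.h; the function name `strides_match` is hypothetical.

```python
def strides_match(sizes, strides_a, strides_b):
    """Compare two stride tuples for tensors of the same shape.

    A dimension of size 1 is only ever indexed at 0, so its stride never
    contributes to an element's address and is skipped in the comparison.
    """
    return all(
        stride_a == stride_b
        for size, stride_a, stride_b in zip(sizes, strides_a, strides_b)
        if size != 1
    )


# Shape (4, 1, 3): the middle stride differs but is irrelevant.
same_layout = strides_match((4, 1, 3), (3, 3, 1), (3, 1, 1))
# Shape (4, 2, 3): the middle stride now matters.
different_layout = strides_match((4, 2, 3), (6, 3, 1), (6, 1, 1))
print(same_layout, different_layout)
```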
Jerry Mannil	ccca3de0cd	[ROCm] Enable Flex attention tests on AMD gpus (#136245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136245 Approved by: https://github.com/malfet	2024-09-19 18:02:41 +00:00
Bob Ren	8d9c42735a	Type _sympy/functions.py [1/n] (#136205 ) Signed-off-by: Bob Ren <bobren@fb.com> I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/136205 Approved by: https://github.com/Skylion007, https://github.com/jamesjwu	2024-09-19 17:15:53 +00:00
James Wu	803ce507f1	Log structured logging overhead to dynamo compile (kinda) (#136142 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2454 This adds structured logging overhead on a per-compile basis to compilation metrics. To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table. Implementation notes: - If there are times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis. - We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number in compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small. - I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though. Test Plan: Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 ≈ 6.7%, which seems reasonable as the overhead for a small compilation like this. You can also look at samples for a more detailed log of this. 
Reviewed By: oulgen Differential Revision: D62643611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142 Approved by: https://github.com/bobrenjc93	2024-09-19 16:11:38 +00:00
Andrew Gu	65df26f615	[FSDP2] Fixed 2D mismatched grad placements (#136237 ) ``` CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer ``` Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237 Approved by: https://github.com/weifengpy	2024-09-19 14:35:15 +00:00
PyTorch MergeBot	4ea741d24f	Revert "Reland D62220158 (#136213 )" This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629. Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))	2024-09-19 12:44:54 +00:00
Igor Sugak	bce52d0b60	[CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288 ) Summary: To facilitate PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop. In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error: ```counterexample Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters. ``` Test Plan: Sandcastle plus visual inspection Differential Revision: D62977370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288 Approved by: https://github.com/kit1980	2024-09-19 12:40:36 +00:00
Jan Wieczorek	908a5689eb	Return unsafe_view instead of view from matmul when folding occurs (#134568 ) When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and the view is then returned as output: the output cannot be modified in place, which causes errors. This is especially problematic when an in-place allreduce is performed after such a function. The issue is resolved by returning unsafe_view from matmul instead. This aligns the matmul decomposition with the eager implementation, which returns a non-view tensor. The test included in this PR reproduces the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568 Approved by: https://github.com/zou3519	2024-09-19 11:52:16 +00:00
Huy Do	db80b98ec4	XFAIL test_segfault (#136252 ) Fixes https://github.com/pytorch/pytorch/issues/128551 As this has been failing in trunk for a while and there is no owner yet to fix it properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252 Approved by: https://github.com/andrewkho	2024-09-19 04:17:06 +00:00
Duygu Altinok	775517693a	Add type checks for Tensor.add_ (#135864 ) Fixes #127049 There's already a meta func in `meta_registrations.py` for `add_` and `sub_` methods. I added a second meta function for error checking, i.e `int.add/sub_(float)` and `bool.add/sub_(other types)` . Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864 Approved by: https://github.com/williamwen42	2024-09-19 03:09:36 +00:00
William Wen	e037bb326f	[dynamo] fix crash in InspectSignatureVariable (#136010 ) Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010 Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel	2024-09-19 00:23:00 +00:00
Jerry Zhang	f2b0fc89f2	Add uint16 support for observer (#136238 ) Summary: att Test Plan: python test/test_quantization.py -k TestObserver Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238 Approved by: https://github.com/tarun292	2024-09-18 23:52:18 +00:00
Nikita Shulga	068c80e6b6	[BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292 ) [reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc) Without it, following warnings are generated if compiled on recently released MacOS Sequoia: ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 720 \| rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ \| reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph , CachedGraph >' requested here 341 \| decltype(std::declval<_Fp>()(std::declval<_Args>()...)) \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph , CachedGraph >] 351 \| static decltype(std::__invoke(std::declval<_XFp>(), 
std::declval<_XArgs>()...)) __try_call(int); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)] 357 \| using _Result = decltype(__try_call<_Fp, _Args...>(0)); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph , CachedGraph >' requested here 27 \| __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper' 38 \| using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0)); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all) 828 \| bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value> \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here 841 \| using 
_EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>; \| ^~~~~~~~~~~~~~~~ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here 851 \| template <class _Fp, class = _EnableIfLValueCallable<_Fp>> \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here 852 \| _LIBCPP_HIDE_FROM_ABI function(_Fp); \| ^~~~~~~~~~~~~ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)] 623 \| auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { \| ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph] 623 \| auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 \| -(MPSGraphTensor ) reverseSquareRootWithTensor:(MPSGraphTensor ) tensor \| ^ 
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 745 \| rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ \| reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 \| -(MPSGraphTensor ) reverseSquareRootWithTensor:(MPSGraphTensor ) tensor \| ^ 2 warnings generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292 Approved by: https://github.com/kit1980	2024-09-18 23:38:31 +00:00
Nikita Shulga	b9a197df77	[BE][MPS] Delete duplicated code in `View.mm` (#136295 ) After https://github.com/pytorch/pytorch/pull/135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString` , so delete the function and call `scalarToMetalTypeString` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295 Approved by: https://github.com/kit1980	2024-09-18 22:44:43 +00:00
Siju Samuel	f1ad680818	[dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763 ) Fixes #ISSUE_NUMBER Recent change from PR#123487 used torch.cuda.Stream directly and this causes failure for other backends. This PR will generalize the stream handling for all backends like cuda/hpu/xpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763 Approved by: https://github.com/yanboliang, https://github.com/yf225	2024-09-18 22:32:34 +00:00
Will Feng	bc9597b7d8	[Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219 ) Changes in this PR: - Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda. - Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes error. This PR splits them into two separate unit tests. - The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219 Approved by: https://github.com/yifuwang	2024-09-18 22:30:23 +00:00
Isuru Fernando	1a86d8aa29	Fix calling Add._from_args and Mul._from_args (#136143 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143 Approved by: https://github.com/ezyang	2024-09-18 20:51:04 +00:00
Atul Jangra	aae68e2976	Add wait counter for nccl abort (#136067 ) Summary: Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack. This will help us measure how much time we take the NCCL abort. Test Plan: Unit tests Reviewed By: c-p-i-o Differential Revision: D62675010 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067 Approved by: https://github.com/fduwjj	2024-09-18 20:14:10 +00:00
eqy	68a7246f13	[cuDNN][conv][A100] Bump tolerances for `vmap_autograd_grad` `conv2d` on A100 (#136178 ) Likely due to a cuDNN heuristics update Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178 Approved by: https://github.com/Skylion007	2024-09-18 19:42:13 +00:00
maajidkhann	5a6ddbcc3b	Extending the Pytorch vec backend for SVE (ARM) (#119571 ) Motivation: In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of the Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256-bit and 512-bit registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes. Reference Link: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec This PR: * Our goal with this contribution is to add an SVE backend for Vec in the ATen vectorization for the CPU backend, which can benefit any ARM CPU that supports SVE. * More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions) * We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec. * Currently we are adding support only for the SVE ISA with a vector length of 256 bits (SVE 256). In the future, we plan to extend this SVE support to other vector lengths as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571 Approved by: https://github.com/malfet, https://github.com/snadampal Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>	2024-09-18 18:59:10 +00:00
Jack Taylor	bad69044d8	[ROCm] upgrade ROCm CI builds to py3.10 (#134108 ) Upgrade ROCm CI builds to py3.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman	2024-09-18 17:39:34 +00:00
fduwjj	3efaa016b1	[c10d] Make test compatible for new pytest (#136158 ) Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517. Short-term fix following CPython: `51aefc5bf9/Lib/unittest/case.py (L419-L426)` Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158 Approved by: https://github.com/fegin	2024-09-18 17:10:55 +00:00
Scott Wolchok	605f2d802a	[PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202 ) Manually audited and can't figure out why this would be needed. Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202 Approved by: https://github.com/malfet	2024-09-18 16:57:15 +00:00
CaoE	6a6f5b20c5	Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936 ) Fixes #132613. Add `_addmm_activation` to lower precision cast policy on AutocastCPU. `_addmm_activation` https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39 of `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, as `_native_multi_head_attention` is put in lower data type cast policy https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may encounter mixed data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-09-18 16:31:27 +00:00
Isuru Fernando	c8d152cb0e	Fix fast_expand recursion error (#136163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163 Approved by: https://github.com/ezyang	2024-09-18 13:58:45 +00:00
Sun, Jiayi	701ba5203f	[Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932 ) Fix https://github.com/pytorch/pytorch/issues/135657. Aligned with AMP BF16, using multiplier 3 for Inductor AMP FP16 benchmark correctness check Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932 Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel	2024-09-18 13:03:45 +00:00
Prachi Gupta	b5be4d8c05	Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161 ) skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs. To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161 Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin	2024-09-18 11:01:23 +00:00
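The difference between the two skip styles in #136161 can be sketched without PyTorch. This is an illustration under stated assumptions, not the code from `torch.testing`: `ON_ROCM`, `SKIP_EXIT_CODE`, and `skip_if_rocm_single_process` are hypothetical names, and the exit code value is made up. The multiprocess variant ends the worker process with a skip code that a parent harness is assumed to interpret; the single-process variant raises `unittest.SkipTest` so the runner records a skip.

```python
import sys
import unittest
from functools import wraps

ON_ROCM = True       # hypothetical platform flag, hard-coded for the demo
SKIP_EXIT_CODE = 75  # hypothetical exit code a parent harness treats as "skipped"


def skip_if_rocm_multiprocess(fn):
    """For tests that run in a spawned worker: exit with a skip code."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if ON_ROCM:
            sys.exit(SKIP_EXIT_CODE)
        return fn(*args, **kwargs)
    return wrapper


def skip_if_rocm_single_process(fn):
    """For tests that run in the main process: raise SkipTest instead."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if ON_ROCM:
            raise unittest.SkipTest("not supported on ROCm")
        return fn(*args, **kwargs)
    return wrapper


@skip_if_rocm_multiprocess
def worker_test():
    return "ran"


@skip_if_rocm_single_process
def plain_test():
    return "ran"


# In a single process, the exit-based skip surfaces as SystemExit with a
# non-zero code, which a test runner reports as a failure, not a skip.
try:
    worker_test()
except SystemExit as exc:
    print("multiprocess-style skip exited with code", exc.code)

try:
    plain_test()
except unittest.SkipTest as exc:
    print("single-process skip:", exc)
```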
Menglu Yu	083c9149b7	Reland D62220158 (#136213 ) Summary: We fix the unit test test_pad_mm and reland the diff Test Plan: See in D62220158 Differential Revision: D62891584 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213 Approved by: https://github.com/dshi7	2024-09-18 07:33:41 +00:00
Jason Ansel	a0207c8471	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-18 04:47:51 +00:00
Nikita Shulga	9aa22eabe7	[CI] Make linux-aarch64 shards actually running different tests (#136208 ) Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255 but each shard in that case was running the same tests... Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman	2024-09-18 03:10:21 +00:00
Kiuk Chung	8895f69d12	[torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152 ) Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0. Changes in this PR: 1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x. 2. Do the same for `numpy.exceptions.VisibleDeprecationWarning` 3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0) 4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152 Approved by: https://github.com/atalman	2024-09-18 02:11:22 +00:00
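The namespace fallback in item 1 of #136152 can be sketched as a small lookup helper. To keep the snippet runnable without NumPy installed, it uses `types.SimpleNamespace` objects as stand-ins for the old and new module layouts; `resolve_warning` and the stand-in class are hypothetical names, not part of NumPy or PyTorch.

```python
from types import SimpleNamespace


def resolve_warning(np_module, name):
    """Prefer np_module.exceptions.<name>; fall back to np_module.<name>."""
    exceptions_ns = getattr(np_module, "exceptions", None)
    if exceptions_ns is not None and hasattr(exceptions_ns, name):
        return getattr(exceptions_ns, name)
    return getattr(np_module, name)


class ComplexWarning(RuntimeWarning):
    """Stand-in for the real warning class."""


# Layout before numpy-1.25: warning lives at the top level only.
old_layout = SimpleNamespace(ComplexWarning=ComplexWarning)
# Layout in numpy-2.0: warning lives only under the `exceptions` namespace.
new_layout = SimpleNamespace(
    exceptions=SimpleNamespace(ComplexWarning=ComplexWarning)
)

print(resolve_warning(old_layout, "ComplexWarning") is ComplexWarning)
print(resolve_warning(new_layout, "ComplexWarning") is ComplexWarning)
```

With NumPy available, the same helper could be called as `resolve_warning(np, "ComplexWarning")` after `import numpy as np`.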
Nikita Shulga	6682327c75	[BE] Make `NestedTensorTransformerFunctions.cu` compilable without warnings (#136222 ) Before the change compilation produced following warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare] 584 \| TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims); \| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare] 1224 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare] 1224 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, 
c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare] 1336 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare] 1336 \| AT_DISPATCH_INDEX_TYPES( \| ^ ``` after it compiled without a warning Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222 Approved by: https://github.com/PaliC, https://github.com/kit1980	2024-09-18 01:24:05 +00:00
leslie-fang-intel	b18ba9419e	[AO][Inductor] Enable WOQ fusion pattern with permute (#135928 ) Summary Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO. Test Plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928 Approved by: https://github.com/jgong5, https://github.com/eellison	2024-09-18 00:56:16 +00:00
Chirag Pandya	cccf500193	[c10d] remove sleep from watchdogHandler (#135760 ) Summary: Remove sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout. Flight recorder is configured to take a minute, at most, to dump out its buffer. This sleep ends up waiting for `8` minutes before destroy is called. Test Plan: Unit tests. Differential Revision: D62529875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760 Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang	2024-09-18 00:55:01 +00:00
Nikita Shulga	f6f1504d39	[MPS] Fix 5D+ reductions over negative dimentions (#136198 ) This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856, which attempts to speed up reduction for 5D+ tensors if the trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions. Added a regression test case to `TestMPS.test_sum`. Fixes https://github.com/pytorch/pytorch/issues/136132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198 Approved by: https://github.com/albanD	2024-09-17 21:53:31 +00:00
Banit Agrawal	a575ce0dc6	[PyTorch Pinned Allocator] Add support of background thread to process events (#135524 ) Summary: Currently we process events in the regular allocation path, where we call cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It is not strictly necessary to process events in the allocation path; we could move this to a background thread that processes events regularly and puts freed blocks on the free list. Differential Revision: D62396585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524 Approved by: https://github.com/zyan0	2024-09-17 21:08:10 +00:00
Banit Agrawal	48d18fbd4c	[PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174 ) Summary: This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing a lot of fragmentation for large memory segments. For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-2-division, that's 512MB, and add the default kLargeBuffer of 20MB, which gives 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. In this diff, we make the kLargeBuffer used for rounding a configurable option: we compute 512MB + max_non_split_rounding_size, and if that's greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks and reuse them as much as possible without stalling on calls to cudaMalloc. Differential Revision: D62758758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174 Approved by: https://github.com/zyan0	2024-09-17 19:08:44 +00:00
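The reuse decision walked through in #136174 can be modeled in a few lines. This is a simplified Python model of the arithmetic in the commit message, not the C++ allocator code: `can_reuse_oversized_block` and its `slack` parameter are hypothetical names, with `slack` standing in for either the default kLargeBuffer or the new max_non_split_rounding_size option. The request size is taken as already rounded to 512MB.

```python
MB = 1024 * 1024
K_LARGE_BUFFER = 20 * MB  # default slack quoted in the commit message


def can_reuse_oversized_block(block_size, rounded_request, slack=K_LARGE_BUFFER):
    """Return True if a cached non-split block may serve the request.

    Simplified rule: the block must be large enough, and it must be
    smaller than rounded_request + slack, so that a very large block is
    not consumed by a much smaller request.
    """
    return rounded_request <= block_size < rounded_request + slack


cached_block = 1024 * MB
request = 512 * MB  # a 500MB request after power-2-division rounding

# Default slack: 512MB + 20MB = 532MB is below 1024MB, so no reuse.
print(can_reuse_oversized_block(cached_block, request))

# Larger configured slack: 512MB + 600MB = 1112MB exceeds 1024MB, so reuse.
print(can_reuse_oversized_block(cached_block, request, slack=600 * MB))
```

The trade-off is explicit in the model: a larger slack wastes more of each cached block per allocation but avoids a fresh cudaMalloc call.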
eqy	e3aa5e2f64	[NCCL] Don't override `waitUntilInitialized`'s setting of `comm->initialized_` (#136155 ) #133630 sets `initialized_` to `true` which causes previous wait codepaths to skip necessary waits, see also https://github.com/pytorch/pytorch/issues/136151 CC @shuqiangzhang @wconstab Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155 Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang	2024-09-17 18:50:12 +00:00
Huanyu He	a4e9a1c90b	[TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045 ) Summary: # context * for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/) * the basic idea of this diff is to short circuit the pytree flatten-unflatten function pairs between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict. NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545} * short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most cases due to the EBC key-order issue with distributed table lookup. * hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users. # details * The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module. It retrieves their fqns and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC. * a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns. WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added if a `KTRegroupAsDict` module can't be found, or `finalize_interpreter_modules` is not `True`. # additional changes * absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`. 
* set `graph.owning_module` in export.unflatten as required by the graph modification * add one more layer of `sparse_module` for closely mimicing the APF model structure. Test Plan: # run test * serializer ``` buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer ``` * apf ``` buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir' ``` * local mp run ``` ==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ==== finished test_mtml_instagram_model_562438350_single_gpu_with_ir Imports took: 6.0s! Profile with --import-profiler. --_ \|""---__ Executed 1 example in 203.1s: \|'.\| \|\| . """\| Successful: 1 \| \|\| \|\| /\|\""-. \| Failed: 0 \| \|\| \|\| \| \| \| Skipped: 0 \| \|\| \|\| \| \\|/ \| Not executed: 8 \|."\| \|\| --"" '__\| https://testslide.readthedocs.io/ --" \|__---""" ``` Differential Revision: D62606738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045 Approved by: https://github.com/angelayi	2024-09-17 18:42:56 +00:00
angelayi	ea10c072f3	[export] Deserialize args with python keyword names (#136036 ) Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in python because `from` is a python keyword. So the solution here is to not deserialize it as a kwarg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036 Approved by: https://github.com/zhxchen17	2024-09-17 18:13:14 +00:00
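The keyword problem in #136036 can be shown with the standard `keyword` module. The sketch below is one possible way to split arguments, not the logic from the PR: `split_call_args` is a hypothetical helper, and it assumes parameters are listed in signature order with required ones first. Every parameter up to and including the last keyword-named one is passed positionally; later defaulted parameters stay as kwargs.

```python
import keyword


def split_call_args(params, values):
    """Split arguments into (positional, kwargs).

    params: list of (name, has_default) pairs in signature order.
    values: dict mapping each parameter name to its value.
    """
    # Index of the last parameter whose name is a Python keyword, or -1.
    last_keyword_index = max(
        (i for i, (name, _) in enumerate(params) if keyword.iskeyword(name)),
        default=-1,
    )
    positional, kwargs = [], {}
    for i, (name, has_default) in enumerate(params):
        if not has_default or i <= last_keyword_index:
            positional.append(values[name])
        else:
            kwargs[name] = values[name]
    return positional, kwargs


# Parameter list modeled on the `uniform` signature quoted above.
uniform_params = [("self", False), ("from", True), ("to", True)]
args, kwargs = split_call_args(uniform_params, {"self": "x", "from": 0, "to": 1})

print(keyword.iskeyword("from"))
print(args, kwargs)
```

Passing `from` positionally sidesteps the syntax error, since `f(x, from=0)` cannot even be parsed as Python source.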
Joel Schlosser	a8382847f4	Support rms_norm() for NJT (#135872 ) `rms_norm()` is a nice-to-have for ViT :) This PR: * SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp. * Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side. * Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #125947	2024-09-17 18:09:20 +00:00
Nikita Shulga	785e98783b	Delete links to non-existing `run_plan_mpi.cc` (#136204 ) That were deleted by https://github.com/pytorch/pytorch/pull/125092 Fixes https://github.com/pytorch/pytorch/issues/136199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204 Approved by: https://github.com/albanD, https://github.com/seemethere	2024-09-17 17:51:56 +00:00
Trung Truong	cc365fdd7b	[MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889 ) Summary: Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn At the moment, both the major and minor version are just 0 Test Plan: Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/ Differential Revision: D62595296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889 Approved by: https://github.com/egienvalue	2024-09-17 17:42:56 +00:00
Xintong Hu	8e5bb356e0	[PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527 ) Summary: as title Test Plan: new UT Differential Revision: D62398390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527 Approved by: https://github.com/frank-wei	2024-09-17 17:26:53 +00:00
Nikhil Gupta	63dc5dff10	[Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857 ) Regression PR : https://github.com/pytorch/cpuinfo/pull/255 Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-09-17 16:50:17 +00:00
Justin Chu	67b14ce8bd	[ONNX] Fix numpy method to return the correct type (#136162 ) Previous implementation of the `numpy()` method returns `fp64` when the tensor is `fp32`. This is unexpected but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior. This needs to be cherry-picked into torch 2.5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162 Approved by: https://github.com/gramalingam, https://github.com/xadupre	2024-09-17 15:51:00 +00:00
Mauricio Villegas	ece8267d2c	Add back optim type hints that were lost when .pyi files were removed (#136185 ) When stub files (`.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back. Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185 Approved by: https://github.com/janeyx99	2024-09-17 15:45:15 +00:00
Edward Z. Yang	913f97e878	Don't run reshape pattern match on dynamic shape size tensor (#136100 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100 Approved by: https://github.com/mengluy0125	2024-09-17 15:08:55 +00:00
PyTorch MergeBot	462b727d1e	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))	2024-09-17 13:42:55 +00:00
PyTorch MergeBot	2c4ae81494	Revert "Add decomposition for squeeze_copy (#130941 )" This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34. Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))	2024-09-17 13:39:07 +00:00
PyTorch MergeBot	3b5e2689a1	Revert "Optimize dict reconstruct to not codegen untouched values (#134876 )" This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c. Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))	2024-09-17 13:00:01 +00:00
ankurneog	e248c1d7eb	Update real device in FSDP state_dict_utils (#134994 ) ## Motivation The default device for tensor.device, for both sharded and non-sharded tensors, is set to cuda. Hence, while checking the FSDP UTs, we see the following errors. This change updates the actual device type based on the created tensor. ``` [rank3] File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical [rank3] sharded_tensor_sd = ref_model.state_dict() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict [rank3] hook_result = hook(self, destination, prefix, local_metadata) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context [rank3] return func(args, kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook [rank3] tensor.device, [rank3] File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper [rank3] return arg(args, **kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__ [rank3] return dispatch(st_instance, func) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch [rank3] return _SHARDED_OPS[func](types, args, kwargs, st._process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper [rank3] return wrapped_func(types, args, kwargs, process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device [rank3] dev = torch.device(torch.cuda.current_device()) [rank3] File 
"/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device [rank3] _lazy_init() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init [rank3] raise AssertionError("Torch not compiled with CUDA enabled") [rank3] AssertionError: Torch not compiled with CUDA enabled ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994 Approved by: https://github.com/fegin	2024-09-17 04:39:08 +00:00
wz337	408fe41a45	[DSD][EZ] Minor update in _state_dict_utils.py (#136165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165 Approved by: https://github.com/kwen2501 ghstack dependencies: #135725, #135763	2024-09-17 04:32:43 +00:00
Brian Hirsh	dc82d274e6	make view.dtype always return an alias (#136074 ) Fixes https://github.com/pytorch/pytorch/issues/136064 In the linked repro, the issue was that there was some code like this: ``` # x has dtype torch.float32 def f(x): y = x.view(torch.float32) y.copy_(...) ``` Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input. Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`). This does not happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set. This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a `__torch_dispatch__` call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input. I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074 Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD ghstack dependencies: #136041	2024-09-17 03:40:54 +00:00
Brian Hirsh	d463a81c27	inductor: dont use default_dtype during rng functionalization (#136041 ) Fixes https://github.com/pytorch/pytorch/issues/119162 See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041 Approved by: https://github.com/eellison	2024-09-17 03:40:54 +00:00
Zhijing Li (Accelerator Enablement)	3f74310784	Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581 )" (#136160 ) Test Plan: make train-hstu-cint-publish-bf16-tgif-local Differential Revision: D62766335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160 Approved by: https://github.com/muchulee8	2024-09-17 01:06:10 +00:00
PyTorch MergeBot	37a08b33bb	Revert "fix compiled_autograd deadlock throw (#135795 )" This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98. Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))	2024-09-16 23:59:56 +00:00
Laith Sakka	071da87cd7	use csv extension for test report in order for it to be uploaded to s3 (#136128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128 Approved by: https://github.com/clee2000	2024-09-16 21:47:46 +00:00
Justin Chu	c12536b3c0	[ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153 ) Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed. ``` <OpOverload(op='aten.__and__', overload='Scalar')>, <OpOverload(op='aten.__and__', overload='Tensor')>, <OpOverload(op='aten.__or__', overload='Scalar')>, <OpOverload(op='aten.__or__', overload='Tensor')>, <OpOverload(op='aten.__xor__', overload='Scalar')>, <OpOverload(op='aten.__xor__', overload='Tensor')>, <OpOverload(op='aten._add_batch_dim', overload='default')>, <OpOverload(op='aten._assert_tensor_metadata', overload='default')>, <OpOverload(op='aten._backward', overload='default')>, <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>, <OpOverload(op='aten._cast_Byte', overload='default')>, <OpOverload(op='aten._cast_Char', overload='default')>, <OpOverload(op='aten._cast_Double', overload='default')>, <OpOverload(op='aten._cast_Float', overload='default')>, <OpOverload(op='aten._cast_Half', overload='default')>, <OpOverload(op='aten._cast_Int', overload='default')>, <OpOverload(op='aten._cast_Long', overload='default')>, <OpOverload(op='aten._cast_Short', overload='default')>, <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>, <OpOverload(op='aten._convolution', overload='deprecated')>, <OpOverload(op='aten._convolution_double_backward', overload='default')>, <OpOverload(op='aten._convolution_mode', overload='default')>, <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>, <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>, <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>, 
<OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>, <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>, <OpOverload(op='aten._dim_arange', overload='default')>, <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>, <OpOverload(op='aten._gather_sparse_backward', overload='default')>, <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>, <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>, <OpOverload(op='aten._is_zerotensor', overload='default')>, <OpOverload(op='aten._lu_with_info', overload='default')>, <OpOverload(op='aten._nnpack_available', overload='default')>, <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>, <OpOverload(op='aten._pad_circular', overload='default')>, <OpOverload(op='aten._pad_enum', overload='default')>, <OpOverload(op='aten._pad_packed_sequence', overload='default')>, <OpOverload(op='aten._propagate_xla_data', overload='default')>, <OpOverload(op='aten._remove_batch_dim', overload='default')>, <OpOverload(op='aten._reshape_from_tensor', overload='default')>, <OpOverload(op='aten._rowwise_prune', overload='default')>, <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>, <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>, <OpOverload(op='aten._shape_as_tensor', overload='default')>, <OpOverload(op='aten._sobol_engine_draw', overload='default')>, <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>, 
<OpOverload(op='aten._sparse_log_softmax', overload='int')>, <OpOverload(op='aten._sparse_mm', overload='default')>, <OpOverload(op='aten._sparse_mm', overload='reduce')>, <OpOverload(op='aten._sparse_softmax', overload='Dimname')>, <OpOverload(op='aten._sparse_softmax', overload='int')>, <OpOverload(op='aten._sparse_sum', overload='default')>, <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>, <OpOverload(op='aten._sparse_sum', overload='dtype')>, <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>, <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>, <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>, <OpOverload(op='aten._test_check_tensor', overload='default')>, <OpOverload(op='aten._test_serialization_subcmul', overload='default')>, <OpOverload(op='aten._test_string_default', overload='default')>, <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>, <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>, <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>, <OpOverload(op='aten._to_cpu', overload='default')>, <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>, <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>, <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>, <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>, 
<OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>, <OpOverload(op='aten._version', overload='default')>, <OpOverload(op='aten._weight_norm', overload='default')>, <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>, <OpOverload(op='aten.absolute', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>, <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>, <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>, <OpOverload(op='aten.align_as', overload='default')>, <OpOverload(op='aten.align_tensors', overload='default')>, <OpOverload(op='aten.all', overload='dimname')>, <OpOverload(op='aten.any', overload='dimname')>, <OpOverload(op='aten.arccos', overload='default')>, <OpOverload(op='aten.arccosh', overload='default')>, <OpOverload(op='aten.arcsin', overload='default')>, <OpOverload(op='aten.arcsinh', overload='default')>, <OpOverload(op='aten.arctan', overload='default')>, <OpOverload(op='aten.arctan2', overload='default')>, <OpOverload(op='aten.arctanh', overload='default')>, <OpOverload(op='aten.argsort', overload='default')>, <OpOverload(op='aten.argsort', overload='dimname')>, <OpOverload(op='aten.argsort', overload='stable')>, <OpOverload(op='aten.argwhere', overload='default')>, <OpOverload(op='aten.atleast_1d', overload='Sequence')>, <OpOverload(op='aten.atleast_2d', overload='Sequence')>, <OpOverload(op='aten.atleast_3d', overload='Sequence')>, <OpOverload(op='aten.avg_pool1d', overload='default')>, <OpOverload(op='aten.bilinear', overload='default')>, 
<OpOverload(op='aten.broadcast_tensors', overload='default')>, <OpOverload(op='aten.can_cast', overload='default')>, <OpOverload(op='aten.cat', overload='names')>, <OpOverload(op='aten.cdist', overload='default')>, <OpOverload(op='aten.chain_matmul', overload='default')>, <OpOverload(op='aten.chalf', overload='default')>, <OpOverload(op='aten.choose_qparams_optimized', overload='default')>, <OpOverload(op='aten.clip', overload='Tensor')>, <OpOverload(op='aten.clip', overload='default')>, <OpOverload(op='aten.column_stack', overload='default')>, <OpOverload(op='aten.combinations', overload='default')>, <OpOverload(op='aten.concat', overload='default')>, <OpOverload(op='aten.concat', overload='names')>, <OpOverload(op='aten.concatenate', overload='default')>, <OpOverload(op='aten.concatenate', overload='names')>, <OpOverload(op='aten.conv1d', overload='default')>, <OpOverload(op='aten.conv1d', overload='padding')>, <OpOverload(op='aten.conv2d', overload='default')>, <OpOverload(op='aten.conv2d', overload='padding')>, <OpOverload(op='aten.conv3d', overload='default')>, <OpOverload(op='aten.conv3d', overload='padding')>, <OpOverload(op='aten.conv_tbc_backward', overload='default')>, <OpOverload(op='aten.conv_transpose1d', overload='default')>, <OpOverload(op='aten.conv_transpose2d', overload='input')>, <OpOverload(op='aten.conv_transpose3d', overload='input')>, <OpOverload(op='aten.corrcoef', overload='default')>, <OpOverload(op='aten.cosine_embedding_loss', overload='default')>, <OpOverload(op='aten.cosine_similarity', overload='default')>, <OpOverload(op='aten.cov', overload='default')>, <OpOverload(op='aten.cross', overload='default')>, <OpOverload(op='aten.cross_entropy_loss', overload='default')>, <OpOverload(op='aten.ctc_loss', overload='IntList')>, <OpOverload(op='aten.ctc_loss', overload='Tensor')>, <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>, <OpOverload(op='aten.cummax', overload='dimname')>, <OpOverload(op='aten.cummaxmin_backward', 
overload='default')>, <OpOverload(op='aten.cummin', overload='dimname')>, <OpOverload(op='aten.cumprod', overload='dimname')>, <OpOverload(op='aten.cumprod_backward', overload='default')>, <OpOverload(op='aten.cumsum', overload='dimname')>, <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>, <OpOverload(op='aten.cumulative_trapezoid', overload='x')>, <OpOverload(op='aten.data', overload='default')>, <OpOverload(op='aten.det', overload='default')>, <OpOverload(op='aten.diag', overload='default')>, <OpOverload(op='aten.diagflat', overload='default')>, <OpOverload(op='aten.diff', overload='default')>, <OpOverload(op='aten.divide', overload='Scalar')>, <OpOverload(op='aten.divide', overload='Scalar_mode')>, <OpOverload(op='aten.divide', overload='Tensor')>, <OpOverload(op='aten.divide', overload='Tensor_mode')>, <OpOverload(op='aten.dstack', overload='default')>, <OpOverload(op='aten.einsum', overload='default')>, <OpOverload(op='aten.embedding_backward', overload='default')>, <OpOverload(op='aten.embedding_bag', overload='default')>, <OpOverload(op='aten.embedding_bag', overload='padding_idx')>, <OpOverload(op='aten.embedding_sparse_backward', overload='default')>, <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>, <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>, <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>, <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>, <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>, <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>, <OpOverload(op='aten.fbgemm_linear_quantize_weight', 
overload='default')>, <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>, <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>, <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>, <OpOverload(op='aten.fft_fft', overload='default')>, <OpOverload(op='aten.fft_fft2', overload='default')>, <OpOverload(op='aten.fft_fftn', overload='default')>, <OpOverload(op='aten.fft_fftshift', overload='default')>, <OpOverload(op='aten.fft_hfft', overload='default')>, <OpOverload(op='aten.fft_hfft2', overload='default')>, <OpOverload(op='aten.fft_hfftn', overload='default')>, <OpOverload(op='aten.fft_ifft', overload='default')>, <OpOverload(op='aten.fft_ifft2', overload='default')>, <OpOverload(op='aten.fft_ifftn', overload='default')>, <OpOverload(op='aten.fft_ifftshift', overload='default')>, <OpOverload(op='aten.fft_ihfft', overload='default')>, <OpOverload(op='aten.fft_ihfft2', overload='default')>, <OpOverload(op='aten.fft_ihfftn', overload='default')>, <OpOverload(op='aten.fft_irfft', overload='default')>, <OpOverload(op='aten.fft_irfft2', overload='default')>, <OpOverload(op='aten.fft_irfftn', overload='default')>, <OpOverload(op='aten.fft_rfft', overload='default')>, <OpOverload(op='aten.fft_rfft2', overload='default')>, <OpOverload(op='aten.fft_rfftn', overload='default')>, <OpOverload(op='aten.fix', overload='default')>, <OpOverload(op='aten.flatten_dense_tensors', overload='default')>, <OpOverload(op='aten.fliplr', overload='default')>, <OpOverload(op='aten.flipud', overload='default')>, <OpOverload(op='aten.float_power', overload='Scalar')>, <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>, <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>, <OpOverload(op='aten.frobenius_norm', overload='dim')>, <OpOverload(op='aten.gather', overload='dimname')>, <OpOverload(op='aten.gather_backward', overload='default')>, <OpOverload(op='aten.ger', overload='default')>, <OpOverload(op='aten.gradient', 
overload='array')>, <OpOverload(op='aten.gradient', overload='scalararray')>, <OpOverload(op='aten.gradient', overload='scalarint')>, <OpOverload(op='aten.gradient', overload='scalarrayarray')>, <OpOverload(op='aten.gradient', overload='scalarrayint')>, <OpOverload(op='aten.gradient', overload='tensorarray')>, <OpOverload(op='aten.gradient', overload='tensorarrayint')>, <OpOverload(op='aten.greater', overload='Scalar')>, <OpOverload(op='aten.greater', overload='Tensor')>, <OpOverload(op='aten.greater_equal', overload='Scalar')>, <OpOverload(op='aten.greater_equal', overload='Tensor')>, <OpOverload(op='aten.grid_sampler', overload='default')>, <OpOverload(op='aten.group_norm', overload='default')>, <OpOverload(op='aten.gru', overload='data')>, <OpOverload(op='aten.gru', overload='input')>, <OpOverload(op='aten.gru_cell', overload='default')>, <OpOverload(op='aten.hinge_embedding_loss', overload='default')>, <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>, <OpOverload(op='aten.histogramdd', overload='default')>, <OpOverload(op='aten.histogramdd', overload='int_bins')>, <OpOverload(op='aten.hstack', overload='default')>, <OpOverload(op='aten.index_add', overload='dimname')>, <OpOverload(op='aten.index_copy', overload='dimname')>, <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>, <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>, <OpOverload(op='aten.index_select', overload='dimname')>, <OpOverload(op='aten.index_select_backward', overload='default')>, <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>, <OpOverload(op='aten.inner', overload='default')>, <OpOverload(op='aten.instance_norm', overload='default')>, <OpOverload(op='aten.inverse', overload='default')>, <OpOverload(op='aten.is_complex', overload='default')>, <OpOverload(op='aten.is_conj', overload='default')>, <OpOverload(op='aten.is_distributed', overload='default')>, <OpOverload(op='aten.is_floating_point', overload='default')>, 
<OpOverload(op='aten.is_inference', overload='default')>, <OpOverload(op='aten.is_leaf', overload='default')>, <OpOverload(op='aten.is_neg', overload='default')>, <OpOverload(op='aten.is_nonzero', overload='default')>, <OpOverload(op='aten.is_signed', overload='default')>, <OpOverload(op='aten.is_vulkan_available', overload='default')>, <OpOverload(op='aten.isclose', overload='default')>, <OpOverload(op='aten.isfinite', overload='default')>, <OpOverload(op='aten.isreal', overload='default')>, <OpOverload(op='aten.istft', overload='default')>, <OpOverload(op='aten.item', overload='default')>, <OpOverload(op='aten.kl_div', overload='default')>, <OpOverload(op='aten.kron', overload='default')>, <OpOverload(op='aten.kthvalue', overload='dimname')>, <OpOverload(op='aten.l1_loss', overload='default')>, <OpOverload(op='aten.layer_norm', overload='default')>, <OpOverload(op='aten.ldexp', overload='Tensor')>, <OpOverload(op='aten.less', overload='Scalar')>, <OpOverload(op='aten.less', overload='Tensor')>, <OpOverload(op='aten.less_equal', overload='Scalar')>, <OpOverload(op='aten.less_equal', overload='Tensor')>, <OpOverload(op='aten.linalg_cholesky', overload='default')>, <OpOverload(op='aten.linalg_cond', overload='default')>, <OpOverload(op='aten.linalg_cond', overload='p_str')>, <OpOverload(op='aten.linalg_det', overload='default')>, <OpOverload(op='aten.linalg_eigh', overload='default')>, <OpOverload(op='aten.linalg_eigvals', overload='default')>, <OpOverload(op='aten.linalg_eigvalsh', overload='default')>, <OpOverload(op='aten.linalg_inv', overload='default')>, <OpOverload(op='aten.linalg_ldl_factor', overload='default')>, <OpOverload(op='aten.linalg_lu_factor', overload='default')>, <OpOverload(op='aten.linalg_matmul', overload='default')>, <OpOverload(op='aten.linalg_matrix_norm', overload='default')>, <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>, <OpOverload(op='aten.linalg_matrix_power', overload='default')>, 
<OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>, <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>, <OpOverload(op='aten.linalg_matrix_rank', overload='default')>, <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>, <OpOverload(op='aten.linalg_multi_dot', overload='default')>, <OpOverload(op='aten.linalg_norm', overload='default')>, <OpOverload(op='aten.linalg_norm', overload='ord_str')>, <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>, <OpOverload(op='aten.linalg_pinv', overload='default')>, <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>, <OpOverload(op='aten.linalg_slogdet', overload='default')>, <OpOverload(op='aten.linalg_solve', overload='default')>, <OpOverload(op='aten.linalg_solve_ex', overload='default')>, <OpOverload(op='aten.linalg_svd', overload='default')>, <OpOverload(op='aten.linalg_svdvals', overload='default')>, <OpOverload(op='aten.linalg_tensorinv', overload='default')>, <OpOverload(op='aten.linalg_tensorsolve', overload='default')>, <OpOverload(op='aten.linalg_vander', overload='default')>, <OpOverload(op='aten.linalg_vecdot', overload='default')>, <OpOverload(op='aten.linear', overload='default')>, <OpOverload(op='aten.log_sigmoid', overload='default')>, <OpOverload(op='aten.log_softmax', overload='Dimname')>, <OpOverload(op='aten.log_softmax', overload='int')>, <OpOverload(op='aten.logcumsumexp', overload='dimname')>, <OpOverload(op='aten.logdet', overload='default')>, <OpOverload(op='aten.logsumexp', overload='names')>, <OpOverload(op='aten.lstm', overload='data')>, <OpOverload(op='aten.lstm', overload='input')>, <OpOverload(op='aten.lstm_cell', overload='default')>, <OpOverload(op='aten.lu_solve', overload='default')>, <OpOverload(op='aten.margin_ranking_loss', overload='default')>, <OpOverload(op='aten.masked_select_backward', overload='default')>, <OpOverload(op='aten.matmul', overload='default')>, <OpOverload(op='aten.matrix_exp', 
overload='default')>, <OpOverload(op='aten.matrix_exp_backward', overload='default')>, <OpOverload(op='aten.matrix_power', overload='default')>, <OpOverload(op='aten.max', overload='names_dim')>, <OpOverload(op='aten.max', overload='other')>, <OpOverload(op='aten.max_pool1d', overload='default')>, <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>, <OpOverload(op='aten.max_pool2d', overload='default')>, <OpOverload(op='aten.max_pool3d', overload='default')>, <OpOverload(op='aten.mean', overload='names_dim')>, <OpOverload(op='aten.median', overload='names_dim')>, <OpOverload(op='aten.meshgrid', overload='default')>, <OpOverload(op='aten.meshgrid', overload='indexing')>, <OpOverload(op='aten.min', overload='names_dim')>, <OpOverload(op='aten.min', overload='other')>, <OpOverload(op='aten.mish_backward', overload='default')>, <OpOverload(op='aten.mode', overload='dimname')>, <OpOverload(op='aten.msort', overload='default')>, <OpOverload(op='aten.multilabel_margin_loss', overload='default')>, <OpOverload(op='aten.multiply', overload='Scalar')>, <OpOverload(op='aten.multiply', overload='Tensor')>, <OpOverload(op='aten.nanmean', overload='default')>, <OpOverload(op='aten.nanmedian', overload='names_dim')>, <OpOverload(op='aten.nanquantile', overload='default')>, <OpOverload(op='aten.nanquantile', overload='scalar')>, <OpOverload(op='aten.native_channel_shuffle', overload='default')>, <OpOverload(op='aten.negative', overload='default')>, <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>, <OpOverload(op='aten.nll_loss', overload='default')>, <OpOverload(op='aten.nll_loss2d', overload='default')>, <OpOverload(op='aten.nll_loss_nd', overload='default')>, <OpOverload(op='aten.nonzero_numpy', overload='default')>, <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>, <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>, <OpOverload(op='aten.norm_except_dim', overload='default')>, <OpOverload(op='aten.not_equal', 
overload='Scalar')>, <OpOverload(op='aten.not_equal', overload='Tensor')>, <OpOverload(op='aten.nuclear_norm', overload='default')>, <OpOverload(op='aten.nuclear_norm', overload='dim')>, <OpOverload(op='aten.one_hot', overload='default')>, <OpOverload(op='aten.orgqr', overload='default')>, <OpOverload(op='aten.outer', overload='default')>, <OpOverload(op='aten.output_nr', overload='default')>, <OpOverload(op='aten.pad', overload='default')>, <OpOverload(op='aten.pad_sequence', overload='default')>, <OpOverload(op='aten.pairwise_distance', overload='default')>, <OpOverload(op='aten.pdist', overload='default')>, <OpOverload(op='aten.pinverse', overload='default')>, <OpOverload(op='aten.poisson_nll_loss', overload='default')>, <OpOverload(op='aten.prelu', overload='default')>, <OpOverload(op='aten.prod', overload='dim_Dimname')>, <OpOverload(op='aten.promote_types', overload='default')>, <OpOverload(op='aten.qr', overload='default')>, <OpOverload(op='aten.quantile', overload='default')>, <OpOverload(op='aten.quantile', overload='scalar')>, <OpOverload(op='aten.quantized_gru_cell', overload='default')>, <OpOverload(op='aten.quantized_lstm_cell', overload='default')>, <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>, <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>, <OpOverload(op='aten.relu6', overload='default')>, <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>, <OpOverload(op='aten.repeat_interleave', overload='self_int')>, <OpOverload(op='aten.result_type', overload='Scalar')>, <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>, <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>, <OpOverload(op='aten.result_type', overload='Tensor')>, <OpOverload(op='aten.retains_grad', overload='default')>, <OpOverload(op='aten.rms_norm', overload='default')>, <OpOverload(op='aten.rnn_relu', overload='data')>, <OpOverload(op='aten.rnn_relu', overload='input')>, <OpOverload(op='aten.rnn_relu_cell', 
overload='default')>, <OpOverload(op='aten.rnn_tanh', overload='data')>, <OpOverload(op='aten.rnn_tanh', overload='input')>, <OpOverload(op='aten.rnn_tanh_cell', overload='default')>, <OpOverload(op='aten.row_stack', overload='default')>, <OpOverload(op='aten.rrelu', overload='default')>, <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>, <OpOverload(op='aten.scatter', overload='dimname_src')>, <OpOverload(op='aten.scatter', overload='dimname_value')>, <OpOverload(op='aten.scatter_add', overload='dimname')>, <OpOverload(op='aten.selu', overload='default')>, <OpOverload(op='aten.silu_backward', overload='default')>, <OpOverload(op='aten.size', overload='Dimname')>, <OpOverload(op='aten.size', overload='int')>, <OpOverload(op='aten.slogdet', overload='default')>, <OpOverload(op='aten.slow_conv3d', overload='default')>, <OpOverload(op='aten.smm', overload='default')>, <OpOverload(op='aten.softmax', overload='Dimname')>, <OpOverload(op='aten.softmax', overload='int')>, <OpOverload(op='aten.sort', overload='dimname')>, <OpOverload(op='aten.sort', overload='dimname_stable')>, <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>, <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>, <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>, <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>, <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>, <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>, <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>, <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>, <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>, <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>, <OpOverload(op='aten.special_digamma', overload='default')>, <OpOverload(op='aten.special_erf', overload='default')>, <OpOverload(op='aten.special_erfc', overload='default')>, 
<OpOverload(op='aten.special_erfinv', overload='default')>, <OpOverload(op='aten.special_exp2', overload='default')>, <OpOverload(op='aten.special_expit', overload='default')>, <OpOverload(op='aten.special_expm1', overload='default')>, <OpOverload(op='aten.special_gammainc', overload='default')>, <OpOverload(op='aten.special_gammaincc', overload='default')>, <OpOverload(op='aten.special_gammaln', overload='default')>, <OpOverload(op='aten.special_i0', overload='default')>, <OpOverload(op='aten.special_log1p', overload='default')>, <OpOverload(op='aten.special_log_softmax', overload='default')>, <OpOverload(op='aten.special_logit', overload='default')>, <OpOverload(op='aten.special_logsumexp', overload='default')>, <OpOverload(op='aten.special_multigammaln', overload='default')>, <OpOverload(op='aten.special_ndtr', overload='default')>, <OpOverload(op='aten.special_polygamma', overload='default')>, <OpOverload(op='aten.special_psi', overload='default')>, <OpOverload(op='aten.special_round', overload='default')>, <OpOverload(op='aten.special_sinc', overload='default')>, <OpOverload(op='aten.special_softmax', overload='default')>, <OpOverload(op='aten.special_xlogy', overload='default')>, <OpOverload(op='aten.special_xlogy', overload='other_scalar')>, <OpOverload(op='aten.special_xlogy', overload='self_scalar')>, <OpOverload(op='aten.square', overload='default')>, <OpOverload(op='aten.sspaddmm', overload='default')>, <OpOverload(op='aten.std', overload='correction_names')>, <OpOverload(op='aten.std', overload='default')>, <OpOverload(op='aten.std', overload='dim')>, <OpOverload(op='aten.std', overload='names_dim')>, <OpOverload(op='aten.std_mean', overload='correction_names')>, <OpOverload(op='aten.std_mean', overload='default')>, <OpOverload(op='aten.std_mean', overload='dim')>, <OpOverload(op='aten.std_mean', overload='names_dim')>, <OpOverload(op='aten.stft', overload='center')>, <OpOverload(op='aten.stft', overload='default')>, <OpOverload(op='aten.stride', 
overload='Dimname')>, <OpOverload(op='aten.stride', overload='int')>, <OpOverload(op='aten.subtract', overload='Scalar')>, <OpOverload(op='aten.subtract', overload='Tensor')>, <OpOverload(op='aten.sum', overload='dim_DimnameList')>, <OpOverload(op='aten.sum_to_size', overload='default')>, <OpOverload(op='aten.svd', overload='default')>, <OpOverload(op='aten.sym_size', overload='int')>, <OpOverload(op='aten.sym_stride', overload='int')>, <OpOverload(op='aten.take_along_dim', overload='default')>, <OpOverload(op='aten.tensordot', overload='default')>, <OpOverload(op='aten.thnn_conv2d', overload='default')>, <OpOverload(op='aten.tile', overload='default')>, <OpOverload(op='aten.to_dense', overload='default')>, <OpOverload(op='aten.to_dense_backward', overload='default')>, <OpOverload(op='aten.to_mkldnn_backward', overload='default')>, <OpOverload(op='aten.to_sparse', overload='default')>, <OpOverload(op='aten.to_sparse', overload='sparse_dim')>, <OpOverload(op='aten.to_sparse_bsc', overload='default')>, <OpOverload(op='aten.to_sparse_bsr', overload='default')>, <OpOverload(op='aten.to_sparse_csc', overload='default')>, <OpOverload(op='aten.to_sparse_csr', overload='default')>, <OpOverload(op='aten.trace_backward', overload='default')>, <OpOverload(op='aten.trapezoid', overload='dx')>, <OpOverload(op='aten.trapezoid', overload='x')>, <OpOverload(op='aten.trapz', overload='dx')>, <OpOverload(op='aten.trapz', overload='x')>, <OpOverload(op='aten.triplet_margin_loss', overload='default')>, <OpOverload(op='aten.true_divide', overload='Scalar')>, <OpOverload(op='aten.true_divide', overload='Tensor')>, <OpOverload(op='aten.type_as', overload='default')>, <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>, <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>, <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>, <OpOverload(op='aten.upsample_linear1d', overload='vec')>, <OpOverload(op='aten.upsample_nearest1d', overload='default')>, 
<OpOverload(op='aten.upsample_nearest1d', overload='vec')>, <OpOverload(op='aten.upsample_nearest2d', overload='default')>, <OpOverload(op='aten.upsample_nearest2d', overload='vec')>, <OpOverload(op='aten.upsample_nearest3d', overload='default')>, <OpOverload(op='aten.upsample_nearest3d', overload='vec')>, <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>, <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>, <OpOverload(op='aten.vander', overload='default')>, <OpOverload(op='aten.var', overload='correction_names')>, <OpOverload(op='aten.var', overload='default')>, <OpOverload(op='aten.var', overload='dim')>, <OpOverload(op='aten.var', overload='names_dim')>, <OpOverload(op='aten.var_mean', overload='correction_names')>, <OpOverload(op='aten.var_mean', overload='default')>, <OpOverload(op='aten.var_mean', overload='dim')>, <OpOverload(op='aten.var_mean', overload='names_dim')>, <OpOverload(op='aten.vstack', overload='default')>, <OpOverload(op='aten.where', overload='Scalar')>, <OpOverload(op='aten.where', overload='ScalarOther')>, <OpOverload(op='aten.where', overload='ScalarSelf')>, <OpOverload(op='aten.where', overload='default')>, <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>, <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153 Approved by: https://github.com/xadupre, https://github.com/gramalingam	2024-09-16 21:28:54 +00:00
Pearu Peterson	b76d1b79e6	Add scaling arguments to bsr_dense_addmm (#136104 ) As in the title. Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413 The PR assumes that the existing tuning parameters are also good when using scaling arguments. This needs to be verified as a follow-up task. Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered a `contiguous` call although the underlying memory buffer was contiguous. Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to the code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. At the moment, the performance increase in ViT-H benchmarks (that involve using 0 strides) is evidence that allowing zero strides does not lead to slow-downs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104 Approved by: https://github.com/cpuhrsch	2024-09-16 20:26:54 +00:00
PyTorch MergeBot	bfbcdf4967	Revert "[dynamo] Fix support for classmethod(property(...)) (#134968 )" This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba. Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))	2024-09-16 20:26:35 +00:00
Dan Johnson	3c97b0ab00	Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499 ) NCCL does not have an api for ncclAllToAll and ncclAllToAllv, so PyTorch does point to point send/recv. Expose this API if it is supported. Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499 Approved by: https://github.com/shuqiangzhang, https://github.com/eqy	2024-09-16 20:08:06 +00:00
Kiuk Chung	abd16a8c64	[torch/multiprocessing] Use multiprocessing.reduction.register ForkingPickler.register to register custom tensor and storage reductions (#135030 ) Right now `multiprocessing.reduction.register()` is simply an alias to `multiprocessing.reduction.ForkingPickler.register()` https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56, but the top-level `register()` function exposes less of the internal details of `multiprocessing.reduction` module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030 Approved by: https://github.com/albanD	2024-09-16 20:07:29 +00:00
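The registration path named in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the torch code: `Handle` and `reduce_handle` are hypothetical names, and the object is rebuilt as a plain `dict` only to keep the sketch self-contained.

```python
import pickle
from multiprocessing import reduction


class Handle:
    """Hypothetical stand-in for an object that needs a custom reduction."""

    def __init__(self, name):
        self.name = name


def reduce_handle(handle):
    # A reducer returns (callable, args). Here the object is rebuilt as a
    # plain dict so the example does not depend on any user-defined rebuild
    # function being importable on the receiving side.
    return dict, ({"name": handle.name, "via_custom_reducer": True},)


# Top-level helper; the commit message describes it as an alias of
# reduction.ForkingPickler.register.
reduction.register(Handle, reduce_handle)

# ForkingPickler consults the registered reducers when serializing.
payload = reduction.ForkingPickler.dumps(Handle("shm-0"))
restored = pickle.loads(payload)
```

Calling the module-level `register()` keeps the caller from naming `ForkingPickler` directly, which is the motivation the commit message gives.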
fduwjj	a0c7029a75	[c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) (#135653 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the option of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG. Also, PGNCCL is using the option class from ProcessGroup, but we actually should use Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature for the public API, so they still use args named "pg_options". We need to change the tests to align them with the change. This is an attempt to reland D62008954 by fixing internal errors. Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653 Approved by: https://github.com/wz337, https://github.com/H-Huang	2024-09-16 19:56:42 +00:00
James Wu	7537f74277	Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491 ) Summary: We refactor FxGraphCache.load into three phases: - prepare_key, which checks that an inductor input is cacheable and bypasses otherwise - load_with_key, which tries to lookup the key in the cache - post compile, where we do some logging and run post compile steps Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc. Differential Revision: D62314862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491 Approved by: https://github.com/oulgen	2024-09-16 19:48:08 +00:00
Aaron Gokaslan	31715be72a	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-16 19:44:11 +00:00
Nikita Shulga	38caf10411	[EZ] Fix spelling typo (#136157 ) s/toosl/tools/ (spotted by @louie-tsai) Also, capitalize CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157 Approved by: https://github.com/kit1980	2024-09-16 19:30:30 +00:00
Ke Wen	c977bb7d03	[Distributed] fix FileSystemWriter __init__ (#136135 ) Fixes #135608. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135 Approved by: https://github.com/Skylion007	2024-09-16 19:11:08 +00:00
eugenekoran	717fca2cac	Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146 ) Fixes #125920 [Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146 Approved by: https://github.com/janeyx99	2024-09-16 19:02:21 +00:00
Alexander Kurakin	f89ce4dfbb	`torch.nn.MultiheadAttention`: docs: improvement (#136111 ) `torch.nn.MultiheadAttention`: docs: improvement Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111 Approved by: https://github.com/janeyx99	2024-09-16 18:52:20 +00:00
Nikita Shulga	d3647d15e6	Remove accidentally committed code (#136154 ) Accidentally left out during rebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154 Approved by: https://github.com/kit1980, https://github.com/albanD	2024-09-16 18:34:20 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
PyTorch MergeBot	7fe004f7cf	Revert "Add CI for Triton CPU backend (#135342 )" This reverts commit 426580a67db15ec17b2b861a09667bf59927e033. Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Aaron Gokaslan	23c0d2689e	[BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091 ) Testing if op info coverage has issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091 Approved by: https://github.com/ezyang	2024-09-16 18:22:16 +00:00
Suresh Babu Kolla	5193f23469	[Pytorch] Cleanup Strobelight URL and shorten for readability (#136102 ) Summary: - Converted strobelight URL prefix to more readable and editable json - Dump shortened URLs when possible for easier readability Test Plan: ``` python ./torch/_strobelight/examples/compile_time_profile_example.py python torch/_strobelight/examples/cli_function_profiler_example.py ``` Differential Revision: D62690292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102 Approved by: https://github.com/laithsakka	2024-09-16 18:10:33 +00:00
PyTorch MergeBot	0199fd4d7e	Revert "[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 )" This reverts commit e54b559e8860e343692bb5534777b2384a57a613. Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))	2024-09-16 17:58:02 +00:00
Aaron Gokaslan	b491e2974c	[BE][Ez]: Add full half/bfloat16 dtype for `unique` and `isin` (#136114 ) Fixes #136090 * Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches). * Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique. * This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114 Approved by: https://github.com/malfet	2024-09-16 17:49:12 +00:00
Justin Chu	0aa41eb52f	[ONNX] Run type promotion test in CI and update the table (#135915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915 Approved by: https://github.com/gramalingam, https://github.com/xadupre	2024-09-16 16:46:13 +00:00
IvanKobzarev	090046b936	[effects] Turn off dtype promotion for with_effects lowering (#136039 ) By default inductor promotes arguments to the common highest dtype. Having empty token with dtype=torch.float32 results in dtype promotion for effectful ops during lowering of with_effects. Disabling dtype promotion for this lowering. Removing previous workaround making token dtype torch.bool. Testing: ``` python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039 Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519	2024-09-16 16:14:05 +00:00
Tom Ritchford	c33b0580e6	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-16 15:46:57 +00:00
Jon Janzen	13bd1256f9	Delete stable prototype (#135911 ) This project ended up going in an entirely different direction, so we can close out all this Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911 Approved by: https://github.com/izaitsevfb, https://github.com/malfet	2024-09-16 15:32:17 +00:00
Bin Bao	d833f49602	[reland][Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#136046 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues Test Plan: CI Differential Revision: D62658837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046 Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel	2024-09-16 14:35:19 +00:00
Bin Bao	a803cb0531	[AOTI] Refactor how cpp_wrapper specific options are set (#136035 ) Summary: 1) When cpp-wrapper is turned on, certain triton specific options need to be set, both for forward and backward. This PR consolidates the settings in one place. 2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by the user, default it to True for cpp-wrapper. Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035 Approved by: https://github.com/chenyang78	2024-09-16 14:32:13 +00:00
atalman	bbc3fdbbde	Add python 3.13.0t build to Docker images (#136001 ) Adds 3.13t python to Docker images Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001 Approved by: https://github.com/albanD	2024-09-16 12:49:36 +00:00
PyTorch MergeBot	3117f2cf67	Revert "[BE]: Update mypy to 1.11.2 (#133816 )" This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1. Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))	2024-09-16 09:11:16 +00:00
Xuehai Pan	951c21d679	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #133778	2024-09-16 04:53:06 +00:00
Xuehai Pan	9961aaa601	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-16 04:53:06 +00:00
Ke Wen	d2207c57f7	[Distributed] add pack-check method for float8_e5m2 (#136115 ) Add support for Float8_e5m2, following similar algorithm used for Float8_e4m3fn (i.e. overflow check). Made `HasNanFP8x8` a template so that it is extendable based on dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115 Approved by: https://github.com/Skylion007 ghstack dependencies: #135891, #135961	2024-09-15 21:37:43 +00:00
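For readers unfamiliar with the two float8 encodings mentioned in the commit above, the NaN bit patterns can be sketched in scalar Python. This is only an illustration of the encodings, not the kernel-side `HasNanFP8x8` implementation; all function names here are hypothetical.

```python
def is_nan_e5m2(byte: int) -> bool:
    # e5m2 layout: 1 sign bit, 5 exponent bits, 2 mantissa bits.
    # Exponent all ones with zero mantissa (0x7C) is infinity;
    # exponent all ones with a non-zero mantissa is NaN.
    return (byte & 0x7F) > 0x7C


def is_nan_e4m3fn(byte: int) -> bool:
    # e4m3fn has no infinities; the only NaN pattern is S.1111.111.
    return (byte & 0x7F) == 0x7F


def has_nan_fp8x8(word: int, is_nan) -> bool:
    # Check each of the eight bytes packed into a 64-bit word.
    return any(is_nan((word >> (8 * i)) & 0xFF) for i in range(8))
```

Passing the per-dtype predicate as an argument loosely mirrors the idea of making the check a template over the dtype.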
Howard Huang	e501ed71d4	Update link in distributed.tensor.parallel.rst (#136103 ) dtensor folder was moved Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103 Approved by: https://github.com/kwen2501, https://github.com/fegin	2024-09-15 19:36:29 +00:00
Tom Ritchford	ab9a7eadd3	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-15 19:35:14 +00:00
Andrii Grynenko	a141c6bb0d	[pytorch][monitoring] Dynamic backend for WaitCounter (#135967 ) Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application. Differential Revision: D62549295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967 Approved by: https://github.com/c-p-i-o	2024-09-15 18:07:49 +00:00
Tugsbayasgalan Manlaibaatar	dec3403b24	Add some doc for export_for_training (#135918 ) Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080, #135912	2024-09-15 17:08:12 +00:00
Tugsbayasgalan Manlaibaatar	1904b09e61	Create export_for_inference API and expose core_aten as public facing API (#135912 ) Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080	2024-09-15 17:05:07 +00:00
Tugsbayasgalan Manlaibaatar	382fad58b3	Deprecate _preserve_ops and consolidate with decomp_table (#135080 ) In this PR, we deprecate the _preserve_ops feature in the run_decomposition API. We can't kill this API completely because the Executorch team depends on it. As syncing between the two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, I will immediately remove it. After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. The old code path is protected under the IS_FBCODE flag. Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080 Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh	2024-09-15 17:01:58 +00:00
PyTorch MergeBot	357b7fb579	Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953 )" This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03. Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))	2024-09-15 05:32:38 +00:00
cyy	31e42a45dd	Fix redundant move warnings by g++ (#134987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987 Approved by: https://github.com/ezyang	2024-09-15 05:28:19 +00:00
PyTorch UpdateBot	e1abd346a3	[audio hash update] update the pinned audio hash (#136106 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106 Approved by: https://github.com/pytorchbot	2024-09-15 04:31:35 +00:00
Will Feng	386884e553	[Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997 ) > Ignore FSDP2 forward hook side-effects in AC Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job: `451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)` So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about FSDP2 forward hook's side effects and can safely ignore it, because we are not and we don't expect to re-run the FSDP2 forward hook during backward recomputation. ---- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997 Approved by: https://github.com/zou3519 ghstack dependencies: #135727	2024-09-15 02:00:17 +00:00
leslie-fang-intel	8072ebc36c	SKIP llama for dynamic size testing (#135960 ) Running Torchbench llama with dynamic size failed with ``` File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards raise ConstraintViolationError( torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic". - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32). ``` Skip this model for marking dynamic dim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960 Approved by: https://github.com/ezyang	2024-09-15 00:06:49 +00:00
Guilherme Leobas	a1a57a424d	Optimize dict reconstruct to not codegen untouched values (#134876 ) This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows: (1) codegen(...) each pair of key/value (2) create a new dictionary to hold the new items (3) clear the original dictionary (4) update the original dict with the one created in (2) We do a micro-optimization in the generated bytecode to: - Only codegen the items that changed. - Only clear the original dictionary if a key was removed. Fixes: #133487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876 Approved by: https://github.com/zou3519	2024-09-14 23:25:28 +00:00
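The difference between the two reconstruction strategies described in the commit above can be sketched in plain Python. This is a hedged illustration of the idea only: Dynamo emits bytecode rather than calling functions like these, both function names are hypothetical, and refilling every item after a clear is an assumption, since the message does not spell out that case.

```python
def reconstruct_naive(original: dict, traced: dict) -> None:
    # Steps (1)-(4): rebuild every item, clear, then refill.
    new_items = dict(traced)
    original.clear()
    original.update(new_items)


def reconstruct_minimal(original: dict, traced: dict) -> None:
    # `original` still holds the pre-trace contents; `traced` is the
    # state the dict should have after replaying mutations.
    if any(key not in traced for key in original):
        # A key was removed: clear and refill (assumed behavior).
        original.clear()
        original.update(traced)
        return
    for key, value in traced.items():
        # Only write items that are new or whose value object changed.
        if key not in original or original[key] is not value:
            original[key] = value


d = {"a": 1, "b": 2}
reconstruct_minimal(d, {"a": 1, "b": 3, "c": 4})
```

Both strategies leave the same dict object in place, so existing references to it observe the updated contents.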
Bob Ren	a5eb43d8b4	Add TensorReferenceAnalysis and some tests (#135886 ) Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs eg. sometimes we need to use torch.ops.aten.{operator}.Tensor vs other times using torch.ops.aten.{operator}.default. Or in the case of pow we need to use Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis and add some tests and do the actual integration in a separate diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886 Approved by: https://github.com/ezyang	2024-09-14 23:09:40 +00:00
Isuru Fernando	391f2d6d50	use a fast expand algorithm (#135999 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999 Approved by: https://github.com/ezyang	2024-09-14 23:09:34 +00:00
Isuru Fernando	5b21d91197	Fix dividing Mul by factor (#136079 ) Fixes https://github.com/pytorch/pytorch/issues/136032 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079 Approved by: https://github.com/ezyang	2024-09-14 22:14:27 +00:00
Jez Ng	426580a67d	Add CI for Triton CPU backend (#135342 ) Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips. Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342 Approved by: https://github.com/jansel ghstack dependencies: #133408	2024-09-14 21:45:19 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
Aaron Gokaslan	55299cfc22	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-14 21:40:36 +00:00
Jason Ansel	c64ae601ba	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-14 21:00:41 +00:00
Aaron Gokaslan	7f5abb44af	[BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087 ) Updates pybind11 submodule. The major patchnote is an experimental new function that is added to all pybind11 objects that will make them more compatible across pybind11 version, settings, and frameworks (such as nanobind) called cpp_conduit. No code changes needed on our end except to update Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087 Approved by: https://github.com/malfet	2024-09-14 20:48:44 +00:00
Michael Lazos	8df01c8258	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-14 18:52:22 +00:00
Michael Lazos	860838e9be	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-14 18:52:22 +00:00
Michael Lazos	1b9daeb240	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-14 18:52:22 +00:00
Michael Lazos	06caa2d560	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-14 18:52:22 +00:00
Michael Lazos	14cabdf626	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-14 18:52:22 +00:00
Michael Lazos	5c5c33ac32	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-14 18:52:22 +00:00
Michael Lazos	228760b945	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-14 18:52:22 +00:00
Bin Bao	b4c84c3167	[AOTI] Fix a fallback op returning None issue (#135997 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor. Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997 Approved by: https://github.com/chenyang78	2024-09-14 18:12:06 +00:00
Laith Sakka	b82122beef	Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730 ) All of the previous benchmarks are similar, ListOfLinears should be representative enough. I copied the previous benchmarks from unit tests without an intention, was just trying to create a large number of benchmarks to better observe noise. This PR keeps only one, we can add more as we see value and regressions in the future. Also this diff adds a GPU version. ``` collecting compile time instruction count for basic_modules_ListOfLinears_eager compile time instruction count for iteration 0 is 6479525851 compile time instruction count for iteration 1 is 1024432680 compile time instruction count for iteration 2 is 1019417317 compile time instruction count for iteration 3 is 1013603566 compile time instruction count for iteration 4 is 1008853980 compile time instruction count for iteration 5 is 1009541481 compile time instruction count for iteration 6 is 1005025533 compile time instruction count for iteration 7 is 1004116323 compile time instruction count for iteration 8 is 1000828633 compile time instruction count for iteration 9 is 999788323 collecting compile time instruction count for basic_modules_ListOfLinears_inductor compile time instruction count for iteration 0 is 40837529730 compile time instruction count for iteration 1 is 18411921909 compile time instruction count for iteration 2 is 18383665161 compile time instruction count for iteration 3 is 18348983522 compile time instruction count for iteration 4 is 18349276590 compile time instruction count for iteration 5 is 18353046274 compile time instruction count for iteration 6 is 18346818581 compile time instruction count for iteration 7 is 18340057998 compile time instruction count for iteration 8 is 18331267320 compile time instruction count for iteration 9 is 18328381338 collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu compile time instruction count for 
iteration 0 is 15408870979 compile time instruction count for iteration 1 is 10949520859 compile time instruction count for iteration 2 is 11058786167 compile time instruction count for iteration 3 is 11003606719 compile time instruction count for iteration 4 is 10896406770 compile time instruction count for iteration 5 is 10982875189 compile time instruction count for iteration 6 is 10931848275 compile time instruction count for iteration 7 is 10956345008 compile time instruction count for iteration 8 is 11045384499 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-14 16:45:52 +00:00
Suresh Babu Kolla	b8637503c0	[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953 ) Summary: Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that. - Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based) - Both OSS and Fbcode now use one compile time profiler in torch/_strobelight Test Plan: Tested OSS with following commands: ``` python torch/_strobelight/examples/compile_time_profile_example.py python torch/_strobelight/examples/cli_function_profiler_example.py TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` See test commands for fbcode in comments. Differential Revision: D62444551 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953 Approved by: https://github.com/laithsakka	2024-09-14 16:35:22 +00:00
William Wen	f97cccf62a	[3.13] fix 3.13 pickle error in torch/package (#136049 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049 Approved by: https://github.com/albanD ghstack dependencies: #136034	2024-09-14 14:28:09 +00:00
CaoE	db393fb95e	Add Half support for reflection and replication padding on CPU (#135931 ) Fixes #135680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931 Approved by: https://github.com/Skylion007	2024-09-14 14:18:55 +00:00
PyTorch MergeBot	23dec79cef	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	8c8a3086a7	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit 4528777e034b157a8329d1879daf52290eea199a. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	46f5037007	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 149d0b716173787df4543186ff74b605aca54e3e. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	7975ec3a29	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	f3180f0088	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266. Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	838c912502	Revert "[Dynamo] Remove ignored modes workaround (#135502 )" This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450. Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	72b868d034	Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 )" This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855. Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:54 +00:00
Zhenbin Lin	41b58a1bec	OpenReg: Fix issue when copying on the same device (#135956 ) Current copy gets wrong value when src and dst are both openreg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956 Approved by: https://github.com/albanD	2024-09-14 09:57:45 +00:00
CaoE	f96a073c9d	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (https://github.com/pytorch/pytorch/pull/109281). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 09:53:17 +00:00
Will Feng	a815611db9	[Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727 ) If node is AC region output and has a backward hook on it, we intentionally choose to save it. This is to work around circular dependencies in Traceable FSDP2+AC. Example: ``` out = fully_shard(utils.checkpoint(module))(x) norm_out = layer_norm(out) ``` and there is a circular dependency: 1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`. 2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed. 3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad` -> circular dependency with (1)! Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency. ---- Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727 Approved by: https://github.com/Chillee	2024-09-14 08:45:58 +00:00
Oguz Ulgen	3352c9ac94	Add higher order operator name to the cache bypass exception (#135876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876 Approved by: https://github.com/jamesjwu, https://github.com/zou3519	2024-09-14 07:05:29 +00:00
Will Feng	5a2be192d1	[Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824 ) During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to detect this usage pattern and raise an error, so that the user can fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824 Approved by: https://github.com/awgu	2024-09-14 06:30:12 +00:00
Nikita Shulga	a9bef85263	[CI] Increase open file handles limit to 16K on MacOS (#136061 ) Maybe it will help with the flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi	2024-09-14 06:16:12 +00:00
Laith Sakka	44dd218a61	Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768 ) When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so this disables GC by default. If it is needed, we can add an option to allow it, or someone can use the regular total instruction count instead of the compile time instruction count. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-14 06:15:28 +00:00
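A minimal sketch of the idea behind this change, not the benchmark base's actual code: the collector is switched off around the measured region and restored afterwards. The names `gc_disabled` and `run_measured` are hypothetical and exist only for this illustration.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn the garbage collector off for the duration of the block.

    The previous state is restored on exit, so a caller that had
    already disabled GC is left untouched.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def run_measured(fn):
    """Run fn with GC disabled (hypothetical helper for illustration)."""
    with gc_disabled():
        return fn()
```

Restoring the earlier state rather than unconditionally calling `gc.enable()` keeps the helper safe to nest.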
Nikita Shulga	1a67e2b680	[MPS] Add native im2col (#135706 ) It's called from `torch.unfold` and one of the few remaining vestiges in `MPSFallback.mm` Strongly inspired by CUDA implementation from `09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706 Approved by: https://github.com/albanD	2024-09-14 06:09:36 +00:00
Jack Taylor	b9b6094793	[ROCm] Skip pointwise associative scan tests due to regression (#135995 ) https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail ``` ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda ``` Skipping temporarily while triage is underway. Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445 ``` File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function out = lowerings[target](args, kwargs) # type: ignore[index] File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped out = decomp_fn(args, **kwargs) File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan raise RuntimeError("Unable to generate code for associative_scan op") torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op ``` NOTE: even "eager" backend fails ``` File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense raise NotImplementedError("associative_scan is not implemented for eager") NotImplementedError: associative_scan is not implemented for eager ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995 Approved by: https://github.com/malfet	2024-09-14 05:40:10 +00:00
fduwjj	911a43f930	[TCPStore] Remove deprecated constructor (#136004 ) While looking at the TCPStore code again, I found it confusing that we still keep the deprecated constructor for TCPStore in cpp even though we no longer expose it in python via pybind. I checked both internal and external usage: all use cases in cpp (aside from the unit test fixed in this PR) have already moved to using options. So let's remove this legacy constructor to avoid confusion. Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004 Approved by: https://github.com/Skylion007, https://github.com/XilunWu	2024-09-14 04:25:47 +00:00
Michael Lazos	e77bd0ebd2	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-14 02:41:16 +00:00
Michael Lazos	5c67cf180e	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-14 02:41:16 +00:00
Michael Lazos	7743149b2b	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-14 02:41:08 +00:00
Michael Lazos	ce3c74f274	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-14 02:40:59 +00:00
Michael Lazos	149d0b7161	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-14 02:40:52 +00:00
Michael Lazos	4528777e03	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-14 02:40:43 +00:00
Michael Lazos	731b178b56	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-14 02:40:32 +00:00
PyTorch MergeBot	1786a17fed	Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 )" This reverts commit 51c52061339069a2162e921e5b464fad5a411522. Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))	2024-09-14 02:31:06 +00:00
CaoE	51c5206133	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (https://github.com/pytorch/pytorch/pull/109281). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 02:20:58 +00:00
Yu, Guangye	2e8d431a8f	Fix tensor.data_ptr() representation overflow (#135567 ) # Motivation fix https://github.com/pytorch/pytorch/issues/135550 In PyTorch, [`tensor.data_ptr()`](`e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)`) is reinterpreted by a [signed int64](`e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)`) data type, which could result in an overflow issue, like below: ```python import torch a = torch.randn(2).to('xpu') a.data_ptr() # one possible output is -23453392437248 # this is inconsistent with storage.data_ptr() a.untyped_storage().data_ptr() # one possible output is 18446720620317114368 ``` This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](`c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)`). With this PR, the output will become: ```python import torch a = torch.randn(2).to('xpu') a.data_ptr() # one possible output is 18446720620317114368 # this is consistent with storage.data_ptr() a.untyped_storage().data_ptr() # one possible output is 18446720620317114368 ``` # Solution Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantic of `wrap`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD	2024-09-14 01:52:04 +00:00
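The two values quoted in this commit message are the same 64-bit pattern read as signed versus unsigned. The sketch below shows that arithmetic in plain Python, without PyTorch; the helper names `as_signed64` and `as_unsigned64` are made up for this illustration.

```python
MASK64 = (1 << 64) - 1


def as_unsigned64(value: int) -> int:
    """Interpret the low 64 bits of value as an unsigned integer."""
    return value & MASK64


def as_signed64(value: int) -> int:
    """Interpret the low 64 bits of value as a two's-complement int64."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


# The address reported by storage.data_ptr() in the commit message.
address = 18446720620317114368

# Read through a signed int64, it becomes the negative value that
# tensor.data_ptr() reported before the fix.
print(as_signed64(address))            # -23453392437248
print(as_unsigned64(-23453392437248))  # 18446720620317114368
```

Any address at or above 2**63 goes negative when read as signed, which is why the problem only shows up for pointers in the upper half of the address space.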
Nikita Shulga	95496e4855	[CI] Check that PyTorch is built with OpenMP (#136060 ) Restriction for x86 only builds should have been removed a long time ago Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi	2024-09-14 01:51:36 +00:00
Li, Xingyuan	5de4cb8cd8	[Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_compiled_autograd.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827 Approved by: https://github.com/etaf, https://github.com/desertfire	2024-09-14 01:43:05 +00:00
Joel Schlosser	06bc717410	Fix sum() forward for NJT (#131945 ) This PR solves two problems with `sum()` support in NJT: * `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519. * Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails). Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945 Approved by: https://github.com/davidberard98, https://github.com/jananisriram	2024-09-14 00:58:03 +00:00
Nikita Shulga	081c4a966d	[BE] Use squeeze/unsqueeze in im2col (#136006 ) And move unsqueeze out of the dispatch, as it's dtype agnostic Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-14 00:35:37 +00:00
Ke Wen	4237592b8f	[Distributed] add pack-check method for float8_e4m3fn (#135961 ) We check 8 x FP8 simultaneously, at size of 8 bytes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961 Approved by: https://github.com/yifuwang, https://github.com/Skylion007 ghstack dependencies: #135891	2024-09-14 00:32:27 +00:00
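A sketch of how eight FP8 values can be examined in one 8-byte word. It assumes the check in question is a NaN check, which the commit message does not state, and it is written in Python for illustration rather than taken from the PR. In `float8_e4m3fn` a byte is NaN when its low seven bits are all ones (`0x7F` or `0xFF`). The function name `any_e4m3fn_nan` is hypothetical.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F  # low seven bits of every byte
ONES = 0x0101010101010101
HIGH = 0x8080808080808080  # top bit of every byte


def any_e4m3fn_nan(packed: int) -> bool:
    """Return True if any of the 8 bytes in packed encodes an e4m3fn NaN.

    packed is a 64-bit integer holding eight FP8 values.
    """
    # After the XOR, a byte is zero exactly when its payload was 0x7F.
    v = (packed & LOW7) ^ LOW7
    # Classic "has a zero byte" test: the result is nonzero iff some
    # byte of v is zero. ANDing with HIGH keeps it within 64 bits.
    return ((v - ONES) & ~v & HIGH) != 0


def pack8(values: bytes) -> int:
    """Pack exactly eight raw FP8 bytes into one little-endian word."""
    assert len(values) == 8
    return int.from_bytes(values, "little")
```

One masked comparison replaces eight per-element tests, which is the appeal of checking at 8-byte granularity.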
William Wen	a00faf4408	[3.13] fix 3.13 pickle error in serialization.py (#136034 ) Error encountered when adding dynamo 3.13 support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034 Approved by: https://github.com/albanD	2024-09-14 00:02:40 +00:00
eellison	b608ff3bea	[Easy] Dont match to mm_plus_mm if not in max autotune (#135929 ) It's only an optimization when we tune the triton template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929 Approved by: https://github.com/FindHao	2024-09-13 23:38:02 +00:00
Jerry Zhang	b8eef500a6	Fix attr check for quantization spec (#135736 ) Summary: Previously we only checked dtype and is_dynamic to decide if two quantization specs are equivalent. This may not work in some cases, e.g. when people use a different qscheme or quant_min/quant_max. This PR adds checks for the other fields as well. Test Plan: regression tests Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736 Approved by: https://github.com/sxu	2024-09-13 23:01:22 +00:00
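To illustrate why comparing only two fields is not enough, here is a hedged sketch using a stand-in dataclass. `Spec` is hypothetical and is not the real quantization spec class; the field names are the ones mentioned in the commit message, and the string values are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a quantization spec."""
    dtype: str
    is_dynamic: bool = False
    qscheme: Optional[str] = None
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None


def equivalent_narrow(a: Spec, b: Spec) -> bool:
    """Old-style check: only dtype and is_dynamic are compared."""
    return a.dtype == b.dtype and a.is_dynamic == b.is_dynamic


def equivalent_full(a: Spec, b: Spec) -> bool:
    """Compare every declared field, so new fields are covered too."""
    return all(getattr(a, f.name) == getattr(b, f.name) for f in fields(Spec))


a = Spec("int8", quant_min=-128, quant_max=127)
b = Spec("int8", quant_min=-127, quant_max=127)
print(equivalent_narrow(a, b))  # True: the difference goes unnoticed
print(equivalent_full(a, b))    # False: quant_min differs
```

Iterating over `fields(Spec)` rather than listing attributes by hand means a field added later is compared automatically.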
Menglu Yu	aad556a0b5	[PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962 ) Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/ Test Plan: # local reproduce ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776 ``` P1586356950 # e2e before fix f642153776 after fix Differential Revision: D62625318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962 Approved by: https://github.com/jackiexu1992	2024-09-13 22:53:08 +00:00
Zain Rizvi	3c5d44dda5	Cleanup unused runner variants (#136058 ) Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows For more details see description in the PR that generated these code changes: - https://github.com/pytorch/test-infra/pull/5665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058 Approved by: https://github.com/wdvr, https://github.com/malfet	2024-09-13 22:50:07 +00:00
Justin Chu	e2d3af405f	[ONNX] Remove logging apis from public (#133825 ) Remove - torch.onnx.enable_log - torch.onnx.disable_log - torch.onnx.set_log_stream - torch.onnx.log Because they are not meant for public consumption and have been marked for deprecation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825 Approved by: https://github.com/titaiwangms	2024-09-13 22:19:52 +00:00
Jessica Vandebon	baff86dafb	[MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871 ) Reviewed By: egienvalue, hanzlfs Differential Revision: D61662214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871 Approved by: https://github.com/egienvalue, https://github.com/nautsimon	2024-09-13 22:13:58 +00:00
Huy Do	db5e1b44d2	Fix inductor-micro-benchmark results upload (take 2) (#136052 ) I had a brain freeze when I wrote the original fix. The parameters were in the wrong order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet	2024-09-13 22:05:10 +00:00
Nikita Shulga	a30d5ba16c	Fix bug in split-build workflows codegen (#136043 ) By just deleting a few rogue lines left behind in https://github.com/pytorch/pytorch/pull/135510 If a file in the workflows folder does not have a `.yml` extension it will not be launched at all, will it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-09-13 21:29:06 +00:00
Laith Sakka	46935c8241	Reduce default iterations to 5. (#135773 ) Running all benchmarks takes around 15 minutes right now; this is the data: https://www.internalfb.com/phabricator/paste/view/P1583590240 The data looks mostly stable, and 5 iterations should be enough, especially with our 1.5% threshold. That said, the diff also adds a way to increase the number of iterations for a specific benchmark. Results after the change: https://www.internalfb.com/phabricator/paste/view/P1583618969 Time is down to half (7 minutes). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-13 21:16:38 +00:00
Laith Sakka	4f407c1884	Only measure compile time instruction count for sum_floordiv benchmark (#135785 ) There was recent unexplained noise of +5%/-5%. Measuring only compile time 1) avoids GC time and 2) avoids other operations that are not what we are trying to measure, so noise is less likely. ``` collecting compile time instruction count for sum_floordiv_regression compile time instruction count for iteration 0 is 8899290248 compile time instruction count for iteration 1 is 1188830489 compile time instruction count for iteration 2 is 1180579615 compile time instruction count for iteration 3 is 1176263131 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785 Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305	2024-09-13 21:14:10 +00:00
Laith Sakka	2e461e54e8	Add gpu and gpu_dynamic versions of add_loop (#135809 ) I am thinking maybe 3 iterations are enough for this one? - I am keeping eager and inductor since inductor is 2X eager time. - Eager dynamic is 2X eager, so keeping this as well. - Inductor has three tests (dynamic gpu, gpu and cpu). I am unsure if I am over-profiling here; happy to trim if anyone has suggestions. ``` collecting compile time instruction count for add_loop_eager compile time instruction count for iteration 0 is 8213664211 compile time instruction count for iteration 1 is 2798628246 compile time instruction count for iteration 2 is 2796811362 compile time instruction count for iteration 3 is 2794438188 compile time instruction count for iteration 4 is 2794634117 collecting compile time instruction count for add_loop_eager_dynamic compile time instruction count for iteration 0 is 5724108021 compile time instruction count for iteration 1 is 5499908609 compile time instruction count for iteration 2 is 5569101366 compile time instruction count for iteration 3 is 5493806364 compile time instruction count for iteration 4 is 5493169851 collecting compile time instruction count for add_loop_inductor compile time instruction count for iteration 0 is 49789381222 compile time instruction count for iteration 1 is 25769347393 compile time instruction count for iteration 2 is 25772594322 compile time instruction count for iteration 3 is 25768695952 compile time instruction count for iteration 4 is 25768032314 collecting compile time instruction count for add_loop_inductor_gpu compile time instruction count for iteration 0 is 23966942581 compile time instruction count for iteration 1 is 23771950919 compile time instruction count for iteration 2 is 23770784286 compile time instruction count for iteration 3 is 23780160875 compile time instruction count for iteration 4 is 23774634465 collecting compile time instruction count for add_loop_inductor_dynamic_gpu compile time instruction count for iteration 0 is 41505055086 compile time instruction count for iteration 1 is 41293654089 compile time instruction count for iteration 2 is 41301016100 compile time instruction count for iteration 3 is 41306056207 compile time instruction count for iteration 4 is 41308171566 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-13 20:42:31 +00:00
atalman	a3d827a28c	Use python 3.11 for Large Wheel build (#136042 ) Use Python 3.11 in nightly Large wheel builds. Required for Colab testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042 Approved by: https://github.com/kit1980, https://github.com/malfet Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>	2024-09-13 20:27:11 +00:00
Yiming Zhou	4312794b92	[reland][export] fix re-export custom metadata (#135720 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/134778 The previous D62304294 broke some executorch tests. It has already been reverted. In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents overwriting. Test Plan: ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle ``` Differential Revision: D62514208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720 Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168	2024-09-13 20:15:15 +00:00
Sergii Dymchenko	b856f3539b	Fix script name in the comments (#135507 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507 Approved by: https://github.com/atalman	2024-09-13 19:59:47 +00:00
Jing Xu	835e7bb077	fix requirements.txt installation failure issue on Windows (#134567 ) Fixes #134564 Root cause: The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on windows, the `lintrunner` has to be compiled from source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, Visual Studio environment is needed for linking libraries.. ![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683) Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below. ```bash >python -m pip install lintrunner Collecting lintrunner Downloading lintrunner-0.12.5.tar.gz (62 kB) Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Building wheels for collected packages: lintrunner Building wheel for lintrunner (pyproject.toml) ... error error: subprocess-exited-with-error × Building wheel for lintrunner (pyproject.toml) did not run successfully. 
│ exit code: 1 ╰─> [137 lines of output] Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off` 📡 Using build options bindings from pyproject.toml Compiling proc-macro2 v1.0.79 Compiling unicode-ident v1.0.12 Compiling version_check v0.9.4 Compiling windows_x86_64_msvc v0.52.4 Compiling winapi v0.3.9 Compiling serde v1.0.197 Compiling autocfg v1.2.0 Compiling syn v1.0.109 Compiling lazy_static v1.4.0 Compiling libc v0.2.153 Compiling equivalent v1.0.1 Compiling hashbrown v0.14.3 Compiling memchr v2.7.2 Compiling yansi v1.0.1 Compiling unicode-width v0.1.11 Compiling regex-syntax v0.8.3 Compiling encode_unicode v0.3.6 Compiling cfg-if v1.0.0 Compiling winnow v0.6.5 Compiling cc v1.0.92 error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors warning: build failed, waiting for other jobs to finish... error: could not compile `serde` (build script) due to 2 previous errors error: could not compile `proc-macro2` (build script) due to 2 previous errors error: could not compile `syn` (build script) due to 2 previous errors error: could not compile `libc` (build script) due to 2 previous errors error: could not compile `winapi` (build script) due to 2 previous errors 💥 maturin failed Caused by: Failed to build a native library through cargo Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --` 📦 Including license file "LICENSE" 🔗 Found bin bindings error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. 
error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. 
error: aborting due to 1 previous error Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1 [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for lintrunner Failed to build lintrunner ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567 Approved by: https://github.com/malfet	2024-09-13 18:43:55 +00:00
PyTorch MergeBot	b6d6aa49b8	Revert "Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 )" This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0. Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))	2024-09-13 18:06:56 +00:00
PyTorch MergeBot	deee21cb78	Revert "[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 )" This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac. Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))	2024-09-13 17:53:21 +00:00
Daohang Shi	3f69410976	[gpu-profiler] Expose active and repeat in os env var (#135757 ) Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/ Test Plan: `buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api ` eyes Differential Revision: D62529249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757 Approved by: https://github.com/Yuzhen11	2024-09-13 17:48:27 +00:00
PyTorch MergeBot	18f9331e5d	Revert "[aoti] Fix workspace generation for triton (#135552 )" This reverts commit d3833253928f29ed760b2dccac2b730028a868ca. Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))	2024-09-13 17:47:36 +00:00
Catherine Lee	bc0f330169	[trymerge] Manually close merged PR when Github fails (#135890 ) Manually close merged PR when Github fails to do it. Consequences of the current design: sleeping for 1 min uses up the machine, might result in race conditions, results in the merging label being removed a bit later, and the PR is still left open if this API fails too (i.e. there is no async clean up job). Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num" Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-09-13 17:29:24 +00:00
Rachel Guo	7834c0bb2c	[AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887 ) Summary: As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well. The inductor python wrapper code level printing would look something like this: {F1859224287} Test Plan: CI Reviewed By: chenyang78 Differential Revision: D62415575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887 Approved by: https://github.com/chenyang78	2024-09-13 17:19:25 +00:00
PyTorch MergeBot	6ef49fe8f1	Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058 )" This reverts commit 3d2431380999252d5401f83d5010b398a32e7597. Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))	2024-09-13 17:09:45 +00:00
Jack Taylor	a15774563b	[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 ) As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm. ``` >>> torch.cuda.get_device_properties(0) _CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104) >>> torch.cuda.get_device_properties(0).regs_per_multiprocessor 65536 ``` With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094 Leaving this in draft until following PRs have landed: - https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin - https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-09-13 16:45:39 +00:00
PyTorch MergeBot	564d00f364	Revert "Fix clang-tidy warnings in Caffe2 code (#134935 )" This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d. Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))	2024-09-13 16:42:37 +00:00
drisspg	ae02d663cd	[FlexAttention] Fix output layout (#135882 ) We previously only supported the same v_head dim and qk_head dim. When we allowed different head dims, I accidentally kept the query strides for the output. This PR fixes this bug and also ensures that we always produce output in the same stride order as the input query. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882 Approved by: https://github.com/yanboliang, https://github.com/Chillee	2024-09-13 16:36:05 +00:00
James Wu	ad2f0e9f81	Add remote cache time saved to compilation metrics (#135490 ) Summary: Record remote cache time saved via frame_phase_timing We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved. Test Plan: Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized. Show that column exists in table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff Note that an earlier version of D62105258 had the column as a string so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float. Reviewed By: aorenste Differential Revision: D62106921 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490 Approved by: https://github.com/aorenste	2024-09-13 16:35:51 +00:00
Edward Z. Yang	21ffa18ad1	Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933 Approved by: https://github.com/angelayi	2024-09-13 15:23:42 +00:00
eqy	2519e5a8de	[CUDA][FP8] Skip rowwise scaling test on sm89 (#135718 ) Same reason as https://github.com/pytorch/pytorch/pull/133612: the rowwise scaling implementation is sm90+ specific (e.g., uses TMA) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718 Approved by: https://github.com/Skylion007	2024-09-13 15:07:20 +00:00
Laith Sakka	ba6e0f31ab	Remove cycle dependency by localizing the import. (#135926 ) Summary: Since https://www.internalfb.com/diff/D62215095 landed there have been many silent errors due to the dependency between functional_tensor and config. ``` File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module> ``` https://fburl.com/logarithm/ol5kx0ee complains about a cycle dependency; this fixes it. Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages Reviewed By: aorenste Differential Revision: D62616765 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926 Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007	2024-09-13 15:05:41 +00:00
PyTorch MergeBot	7ed0563cad	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	eb7dd91dd1	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	3f30360d05	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	4734e356d6	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	ac169795a9	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2. Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	fca58bfda1	Revert "[Dynamo] Remove ignored modes workaround (#135502 )" This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e. Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	dc71e7a7d4	Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 )" This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d. Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	1cdf658f4a	Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167 )" This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68. Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))	2024-09-13 12:35:05 +00:00
PyTorch MergeBot	b5c52e96e8	Revert "[dynamo] Fix support for classmethod(property(...)) (#134968 )" This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c. Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))	2024-09-13 12:29:03 +00:00
Bin Bao	ea2ecab15b	[AOTI][reland] Fix assert_function call in cpu autotune template (#135920 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Test Plan: CI Differential Revision: D62500592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920 Approved by: https://github.com/chenyang78	2024-09-13 12:21:57 +00:00
CaoE	2f53d570fe	Update document for autocast on CPU (#135299 ) Update document for autocast on CPU due to the support of float16 and changes in the operator list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars	2024-09-13 09:11:47 +00:00
Ke Wen	31007cf200	[Distributed] add FP8 support to NaN checker (#135891 ) Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891 Approved by: https://github.com/wconstab	2024-09-13 08:43:54 +00:00
Michael Lazos	c56728b643	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-13 08:41:32 +00:00
Michael Lazos	7d5e0dd4b1	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-13 08:41:32 +00:00
Michael Lazos	2af3b8ffd8	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-13 08:41:24 +00:00
Michael Lazos	0c080cb2c7	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-13 08:41:17 +00:00
Michael Lazos	30b007bea3	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-13 08:41:07 +00:00
Michael Lazos	fafdd588f2	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-13 08:41:00 +00:00
Michael Lazos	e504fb7069	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-13 08:40:50 +00:00
Jez Ng	b346e99376	remove fast_flush arguments (#135387 ) I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value. Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-09-13 08:13:46 +00:00
Animesh Jain	7dc1788396	[inductor] Remove the batch fusion passes from being a default (#135922 ) Ads team do a search internally to figure out which fusion passes to use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922 Approved by: https://github.com/eellison, https://github.com/yanboliang ghstack dependencies: #135819	2024-09-13 06:07:33 +00:00
xinan.lin	9fd54d787d	[Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656 Approved by: https://github.com/EikanWang, https://github.com/zou3519	2024-09-13 05:27:56 +00:00
xingyuan li	b38be727eb	[Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_torchinductor_opinfo.py` Reuse `test/inductor/test_minifier_isolate.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556 Approved by: https://github.com/etaf, https://github.com/eellison	2024-09-13 05:16:28 +00:00
Jokeren	e54b559e88	[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 ) The previous PR (https://github.com/pytorch/pytorch/pull/135170) forgot to change two other places that also create `constants` and `signature`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406 Approved by: https://github.com/jansel	2024-09-13 04:10:41 +00:00
wz337	eea5e6ff0f	[DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763 ) Fix https://github.com/pytorch/pytorch/issues/134095 This is a workaround for loading full state dict into a FSDP1+TP 2D model. Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2d model. In order to load a full state dict in FSDP1+TP 2D model, we need to do: - load the full state dict into a 1D FSDP model - dcp.save the full/shard state dict into storage - initialize a 2D FSDP1+TP model - get the default sharded state dict for the 2D model (full_state_dict=False) - dcp.load the state dict from storage - load the state dict into the 2D model Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763 Approved by: https://github.com/fegin ghstack dependencies: #135725	2024-09-13 03:51:14 +00:00
Pian Pawakapan	6df91b5917	real tensor prop for composite ops (#135717 ) Fixes #135632 Adds real tensor propagation for decompositions, checking any symbols on their outputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717 Approved by: https://github.com/ezyang	2024-09-13 03:35:16 +00:00
wz337	0cdc6a8dcd	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-13 03:26:36 +00:00
Prachi Gupta	6cdc70bccd	[ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917 Approved by: https://github.com/malfet	2024-09-13 02:46:48 +00:00
Yu, Guangye	e6b68359d7	Fix xpu memory stats error (#135818 ) # Motivation fix https://github.com/pytorch/pytorch/issues/135726 After merging two free blocks, I made a stupid mistake of ignoring the correct size to decrease the active memory size, which should be the original block size instead of the merged block size. # Additional Context Add a UT to guard this scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818 Approved by: https://github.com/EikanWang	2024-09-13 02:41:21 +00:00
Nikita Shulga	1c04cbfba6	[BE] Use `C10_UNUSED` (#135914 ) Instead of `(void)foo; // Suppress unused variable` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914 Approved by: https://github.com/huydhn, https://github.com/eqy	2024-09-13 02:27:07 +00:00
Shivam Raikundalia	062681a0ed	[Profiler] Torch Profiler distributed info is not JSON serializable (#135548 ) Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON. Test Plan: Added unit test to check that numpy values can be serialized Differential Revision: D62411619 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548 Approved by: https://github.com/aaronenyeshi, https://github.com/albanD	2024-09-13 02:22:33 +00:00
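The encoder idea in the commit above can be sketched as follows. This is a minimal illustration, not the profiler's actual code: the class name `ArrayLikeEncoder` and the stand-in `FakeScalar` are hypothetical, and duck-typing on `tolist()` / `item()` is an assumption made so the sketch runs without NumPy installed.

```python
import json


class ArrayLikeEncoder(json.JSONEncoder):
    """Hypothetical encoder that converts NumPy-style values to plain Python."""

    def default(self, o):
        # NumPy arrays expose tolist(); NumPy scalars expose item().
        # Duck-typing keeps this sketch free of a hard NumPy dependency.
        if hasattr(o, "tolist"):
            return o.tolist()
        if hasattr(o, "item"):
            return o.item()
        # Anything else falls back to the standard TypeError.
        return super().default(o)


class FakeScalar:
    """Stand-in for a NumPy scalar such as numpy.int64 (illustration only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


# Metadata with a non-JSON-native value serializes once the encoder is supplied.
encoded = json.dumps({"rank": FakeScalar(3), "world_size": 8}, cls=ArrayLikeEncoder)
```

Without `cls=ArrayLikeEncoder`, the same `json.dumps` call raises `TypeError` because the standard encoder does not know the scalar type.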
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. 
See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". 
Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
Jason Ansel	bf68e16e94	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-13 01:14:18 +00:00
eqy	d732df7e56	[Inductor] Disable TF32 in `test_slice_scatter_reinplace` (#135709 ) TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709 Approved by: https://github.com/eellison	2024-09-13 00:30:45 +00:00
Sahan Paliskara	c9de2efde6	[Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894 ) Addresses https://github.com/pytorch/pytorch/issues/135880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2024-09-13 00:19:42 +00:00
Jason Ansel	1f15c0c7a5	[fx] Replace _snake_case with a regexp (#135822 ) ~2x speedup on this function, though saves <0.5s overall Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820, #135821	2024-09-13 00:18:41 +00:00
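A regexp-based snake-case conversion of the kind the commit above describes might look like the sketch below. It is an assumption about the general technique, not the function in `torch.fx`; the name `to_snake_case` and the exact pattern are hypothetical.

```python
import re

# Zero-width match at each boundary between a lowercase and an uppercase letter.
_LOWER_UPPER_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Insert "_" at lowercase-to-uppercase boundaries, then lowercase everything."""
    return _LOWER_UPPER_BOUNDARY.sub("_", name).lower()
```

Compiling the pattern once at import time moves the per-character work into the regex engine, which is the usual reason such a rewrite is faster than a Python-level loop over characters.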
Jason Ansel	a72124add9	[fx] Minor optimization in create_arg (#135821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820	2024-09-13 00:18:41 +00:00
Jason Ansel	10ca4c0564	[inductor] Use TracerBase directly in LoopBody (#135820 ) This skips some unneeded work in the subclass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788	2024-09-13 00:18:41 +00:00
Jason Ansel	d3aab9642b	[inductor] Optimize can_fuse_vertical() (#135788 ) An O(n^2) to O(n) improvement by not comparing all pairs of deps. Before: ![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9) After: ![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788 Approved by: https://github.com/oulgen ghstack dependencies: #135787	2024-09-13 00:18:41 +00:00
Jason Ansel	67a929eea8	[inductor] Remove unused check (#135787 ) I think this is unreachable code because mode is always None on reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787 Approved by: https://github.com/oulgen	2024-09-13 00:18:41 +00:00
Isuru Fernando	f576960bbc	do not expand in replace/simplify if no changes (#135863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863 Approved by: https://github.com/ezyang	2024-09-13 00:12:01 +00:00
Nikita Shulga	1aba224cfd	Update nightly PyTorch version to 2.6.0 (#135916 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916 Approved by: https://github.com/kit1980	2024-09-13 00:08:52 +00:00
Shangdi Yu	d383325392	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for `AtenTensorHandle`. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. 
```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-12 23:53:09 +00:00
Ma Jian	00dc7d4356	fix compiled_autograd deadlock throw (#135795 ) Fixes #135298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795 Approved by: https://github.com/xmfan	2024-09-12 23:24:57 +00:00
Yanbo Liang	1760bbc259	[FlexAttention] Ensure q/k/v and block_mask are on exactly the same device (#135823 ) Fixes #134739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823 Approved by: https://github.com/BoyuanFeng	2024-09-12 23:11:01 +00:00
Jack Taylor	fb9d8e3248	[ROCm] Use ieee precision for fp32 in flex attention (#135702 ) `3bebc09be9` brought in a change to flex_attention to allow TF32 precision. This largely lacks support on the ROCm side, so we should use ieee. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702 Approved by: https://github.com/jeffdaily, https://github.com/drisspg	2024-09-12 23:00:48 +00:00
eellison	aaabfc8930	[Easy] Check if quant registered in constant folding (#135875 ) Belated fix for https://github.com/pytorch/pytorch/issues/110904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875 Approved by: https://github.com/shunting314	2024-09-12 22:16:39 +00:00
William Wen	63d6cd351a	[dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404 ) Fixes https://github.com/pytorch/pytorch/issues/134608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-09-12 22:04:48 +00:00
PyTorch MergeBot	3de9e474df	Revert "Check function declarations of Core ML code (#135467 )" This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63. Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))	2024-09-12 22:04:35 +00:00
PyTorch MergeBot	3e1a4ea132	Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 )" This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d. Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](`83c594ebd6`) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))	2024-09-12 21:47:38 +00:00
Sanskar Modi	e157ce3ebb	Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 ) Adding validation checks to check the input types and display better error messages for the same. Fixes https://github.com/pytorch/pytorch/issues/135463 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596 Approved by: https://github.com/malfet	2024-09-12 21:28:37 +00:00
Pian Pawakapan	b897ab0540	[export] ignore mark_dynamic() in export (#135536 ) Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`. Note: there are 4 decorators relevant to export: `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is that these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536 Approved by: https://github.com/avikchaudhuri	2024-09-12 21:22:19 +00:00
Fadi Arafeh	3d24313809	Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058 ) Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897. This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel. Example: ```python import torch DIM = 4096 INPUT_SIZE1 = 32 INPUT_SIZE2 = 16 class LinearNet(torch.nn.Module): def __init__(self): super().__init__() self.fc1 = torch.nn.Linear(DIM, DIM, bias=False) def forward(self, x): x = self.fc1(x) return x input1 = torch.randn(size=(INPUT_SIZE1, DIM)) input2 = torch.randn(size=(INPUT_SIZE2, DIM)) with torch.no_grad(): model = LinearNet() model = torch.ao.quantization.quantize_dynamic(model,{torch.nn.Linear}) model(input1) # this goes to ACL lowp_gemm print("=" * 50) model(input2) # this goes to gemm:jit without this PR, and to ACL with this PR ``` In the code snippet above: - The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR) - The matmul from `model(input2)`: Without this PR: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, with this PR the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058 Approved by: https://github.com/jondea, https://github.com/malfet	2024-09-12 20:30:20 +00:00
Riley Dulin	cd472bb1e3	[torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553 ) Summary: Sometimes we only want to generate a replacement for a matched pattern once we know some information about the nodes in the pattern. So far, we have found this the most useful to do matches based on specific shapes of tensors flowing into functions. Use a callback function similar to `match_filters`. By default this isn't used. Had to make `replacement` a None-able parameter because Callable was already used to detect a case where a graph needed to be traced. Differential Revision: D62412628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553 Approved by: https://github.com/SherlockNoMad	2024-09-12 18:52:14 +00:00
Guilherme Leobas	f032135bbf	Add batching rule for torch.scatter_reduce (#135547 ) Fixes #134797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547 Approved by: https://github.com/zou3519	2024-09-12 18:51:21 +00:00
Joel Schlosser	525bec804c	NJT <-> padded dense conversions (#125947 ) This PR: * Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values) * Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics * Note: there is currently no public API for this; design booted to a future PR TODO: * ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~ * ~~Verify that Inductor does computation fusion via test logic~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947 Approved by: https://github.com/soulitzer	2024-09-12 17:54:25 +00:00
wz337	83c594ebd6	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-12 17:43:57 +00:00
Rachel Guo	c1277945d3	[AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731 ) Summary: As title. Effect after merging this diff would look something like this: ``` print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0) triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0) print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0) buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32) # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm] print('inductor: before_launch - extern_kernels.addmm - buf0', buf0) extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1) print('inductor: after_launch - extern_kernels.addmm - buf0', buf0) ``` Context: D62272588 only support major triton kernel jit inductor debug printing codegen Test Plan: CI & OSS CI Reviewed By: chenyang78, ColinPeppler Differential Revision: D62397017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731 Approved by: https://github.com/ColinPeppler	2024-09-12 17:31:10 +00:00
Isuru Fernando	dab7d646d5	Use a better decomposition for split_with_sizes (#135728 ) This decomposition has less checks and improves the performance of torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728 Approved by: https://github.com/ezyang	2024-09-12 16:38:51 +00:00
whywhy-rtx3090	7647c398ff	Allow optional positional arguments for `torch.func.functional_call` (#134643 ) This PR resolves #134408. Add an additional test and have passed the local test. Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs. This PR does not include any such post-check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643 Approved by: https://github.com/zou3519	2024-09-12 15:22:06 +00:00
Justin Chu	d67cc58181	[ONNX] Fix symbolic values and numpy implementation (#135786 ) 1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that 2. Update the `__array__` method so that it works for tensor on GPU Fixes https://github.com/pytorch/pytorch/issues/135700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786 Approved by: https://github.com/titaiwangms	2024-09-12 14:24:43 +00:00
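The first item in the commit above relies on a general Python rule: a class that defines `__eq__` without also defining `__hash__` has its `__hash__` set to `None`, so its instances are unhashable. A small sketch, with hypothetical class names that are unrelated to the real `SymbolicTensor`:

```python
class WithEq:
    """Defines __eq__ only, so Python sets __hash__ to None for this class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, WithEq) and self.name == other.name


class WithoutEq:
    """Inherits identity-based __eq__ and __hash__ from object."""

    def __init__(self, name):
        self.name = name


def is_hashable(obj) -> bool:
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

Instances of `WithoutEq` can be used as dictionary keys or set members; instances of `WithEq` cannot unless the class also defines `__hash__`.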
Animesh Jain	dddaadac6c	[dynamo] Dont graph break on inner torch.compile (#135819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819 Approved by: https://github.com/jansel	2024-09-12 11:39:09 +00:00
Jason Ansel	02169364e1	[inductor] Split reduction loops when there is no shared reads (#134307 ) Fixes #129102 ![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307 Approved by: https://github.com/shunting314	2024-09-12 09:45:08 +00:00
Yanbo Liang	c30042fbeb	[GPT-fast] Update compilation time target for Llama & Mixtral (#135817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817 Approved by: https://github.com/xmfan, https://github.com/huydhn	2024-09-12 07:13:44 +00:00
Sun, Jiayi	6700175531	[Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574 ) This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-12 06:56:34 +00:00
Xilun Wu	de8a8653c0	[dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554 ) Summary 1. This PR removes the public API `compute_local_shape` and replaces its uses with the more general API `compute_local_shape_and_global_offset`. 2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards, which is more aligned with DTensor's semantics on non-participating ranks. Test `pytest test/distributed/_tensor/test_dtensor.py` `pytest test/distributed/_tensor/test_init.py` `pytest test/distributed/_tensor/test_tensor_ops.py` Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-09-12 06:30:09 +00:00
Jason Ansel	86335e9135	[reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735 ) Relands #135079, which was reverted by #135562. I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735 Approved by: https://github.com/oulgen	2024-09-12 05:50:39 +00:00
angelayi	14e3f3c062	[aoti] Remove nlohmann/json.hpp from header (#135765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765 Approved by: https://github.com/malfet	2024-09-12 05:38:51 +00:00
Dmitry Rogozhkin	9852c6d236	xpu: fix 3rd party builds on systems with cmake<3.25 (#135767 ) The CMake LINUX variable is only available starting from CMake 3.25. It is better to use CMAKE_SYSTEM_NAME instead to relax the CMake version requirement. See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html Fixes: #135766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767 Approved by: https://github.com/malfet, https://github.com/guangyey	2024-09-12 05:31:01 +00:00
Jason Ansel	6354271178	[inductor] Skip unused call to get_estimated_runtime() (#135776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776 Approved by: https://github.com/oulgen ghstack dependencies: #135445, #135446	2024-09-12 05:22:23 +00:00
Jason Ansel	12902f6ecf	[inductor] Cache get_operation_names/get_buffer_names (#135446 ) Before: ![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311) After: ![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446 Approved by: https://github.com/oulgen ghstack dependencies: #135445	2024-09-12 05:22:23 +00:00
Jason Ansel	3decb676aa	[inductor] Optimize cache_on_self (#135445 ) This is a small compile time win, but also makes profiles more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445 Approved by: https://github.com/oulgen	2024-09-12 05:22:23 +00:00
Zhenbin Lin	8d68a02905	OpenReg: Split the daemon into driver/executor (#135646 ) Split the daemon into a proper user-process driver vs device-process executor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646 Approved by: https://github.com/albanD	2024-09-12 05:03:46 +00:00
Jason Ansel	28330a8a39	[reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733 ) Relands #135079, which was reverted by #135562. I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733 Approved by: https://github.com/oulgen	2024-09-12 04:29:37 +00:00
Animesh Jain	eaba287adb	[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612 Approved by: https://github.com/drisspg	2024-09-12 04:05:08 +00:00
cyy	f5f1d0a753	Fix build warnings for torch_python (#134981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981 Approved by: https://github.com/ezyang	2024-09-12 03:59:34 +00:00
Adam J. Stewart	5bc238c73e	torch.hub: add get_dir/set_dir type hints (#134906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906 Approved by: https://github.com/Skylion007	2024-09-12 03:53:29 +00:00
He Kai	79223114db	Avoid inserting extra transpose when the input to group norm is NHWC (#135575 ) When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575 Approved by: https://github.com/ezyang	2024-09-12 03:36:05 +00:00
cyy	7cfd23636c	Fix clang-tidy warnings in Caffe2 code (#134935 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935 Approved by: https://github.com/ezyang	2024-09-12 03:27:09 +00:00
Feng Yuan	0d1d69fd25	Update torch-xpu-ops pin (ATen XPU implementation) (#135647 ) Release cycle for PyTorch 2.5 1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647 Approved by: https://github.com/EikanWang	2024-09-12 03:16:08 +00:00
Aaron Orenstein	21a64d57b1	[BE] typing for decorators - masked/_ops (#135108 ) Differential Revision: D62184735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108 Approved by: https://github.com/Skylion007	2024-09-12 01:34:09 +00:00
Shangdi Yu	1a74952925	"Remove BLOCK_LIST" (#135729 ) Summary: Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir. Remove BLOCK_LIST since it's empty. Now all internal unittests will use training ir. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder buck2 run 'fbcode//mode/dev-nosan' caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder ``` Differential Revision: D62387987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729 Approved by: https://github.com/tugsbayasgalan	2024-09-12 01:22:06 +00:00
Huy Do	a130ed828a	Fix the upload of x86 micro benchmark results (#135780 ) Upload stats workflow currently skips this https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639, this is a miss from https://github.com/pytorch/pytorch/pull/135042. So, the workflow is running but nothing has been uploaded yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780 Approved by: https://github.com/atalman	2024-09-12 01:16:38 +00:00
Menglu Yu	eb0fe02933	[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167 ) Summary: We observed another long computation issue for the OBA_AFOC pyper model, so we add a pattern to avoid the perf regression - Only happens on A100 - We do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to customize the GEMM shape and do the corresponding padding - To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}}) Test Plan: # unit test ``` buck2 test mode/opt //caffe2/test/inductor:pad_mm ``` Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841 Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651 Network: Up: 9.0KiB Down: 142B (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e) Jobs completed: 9. Time elapsed: 3:18.0s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0 # e2e test see [D62388582](https://www.internalfb.com/diff/D62388582) Differential Revision: D62220158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167 Approved by: https://github.com/jackiexu1992	2024-09-12 00:51:34 +00:00
Wei Feng	d270e2d240	[FSDP2] better error msg for cpu offloading (#135156 ) When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a less obvious error during backward ``` RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device ``` This PR throws a more explicit error that specifies which parameters should be moved because of CPU offloading ``` FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight'] ``` `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156 Approved by: https://github.com/awgu	2024-09-12 00:05:07 +00:00
xinan.lin	16b37b309f	[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313 Approved by: https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #135312	2024-09-11 23:59:54 +00:00
xinan.lin	13ee85ca5e	[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312 ) [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison	2024-09-11 23:59:54 +00:00
Will Feng	94d2471d1f	[Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730 ) Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good). This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern. ------ Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients` - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching` - `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager` - `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32` - `python test/distributed/test_inductor_collectives.py 
TestCollectivesInductor.test_backwards` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730 Approved by: https://github.com/bdhirsh	2024-09-11 23:01:05 +00:00
Alexander Jipa	5ca46be15e	Fix/torch cat doc attr (#135698 ) The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698 Approved by: https://github.com/albanD Co-authored-by: Alexander Jipa <azzhipa@amazon.com>	2024-09-11 22:32:55 +00:00
Mayank Mishra	9a04cfbeff	fix for fp16 (#134106 ) This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm. The original author is @kkontny Previous PR summary: Since FP16 has a quite small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real world computation. I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good results in FP32. I figured out this happens due to overflow while computing the square of the input tensor. The original `LlamaRMSNorm` implementation upcasts the input to fp32 to prevent this and give better numerical stability. ``` class LlamaRMSNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ LlamaRMSNorm is equivalent to T5LayerNorm """ super().__init__() self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps def forward(self, hidden_states): input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) variance = hidden_states.pow(2).mean(-1, keepdim=True) hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) return self.weight * hidden_states.to(input_dtype) ``` The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real world implementations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106 Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy	2024-09-11 22:02:07 +00:00
Shubham Bhokare	66db61f0d1	[ONNX] Update fake mode usage in onnx docs (#135512 ) Update fake mode usage in onnx docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-11 21:29:04 +00:00
PyTorch MergeBot	c025f7becc	Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317 )" This reverts commit e004d539da3335d97a8134c9081245628f18eb67. Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))	2024-09-11 21:27:53 +00:00
FFFrog	8c4e1148b8	Refactoring byte_order (#135558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558 Approved by: https://github.com/mikaylagawarecki	2024-09-11 21:06:43 +00:00
Nikita Shulga	e20ee39558	Expand bitwise ops to unsigned types (#135525 ) Fixes https://github.com/pytorch/pytorch/issues/135436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525 Approved by: https://github.com/ezyang	2024-09-11 20:48:52 +00:00
Xinya Zhang	74fd1bf965	[ROCm] Update to AOTriton 0.7b (#134498 ) Notable changes: 1. Enable CudaGraph related tests 2. Fix UT problems 3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Known Problem: 1. `test/test_transformers.py` will have massive failures and/or NaN outputs with `--use-pytest` + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest` Note: AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it. Fixes #133540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet	2024-09-11 20:34:01 +00:00
Sidney Tsang	5d964a5eb7	[Export] Fix SDPA decomposition (#135297 ) Summary: Update SDPA decomposition to match updated stride from D62009189 which aligns strides with the `aten._scaled_dot_product_attention_math.default`, which makes `t.permute().contiguous().permute()` no longer necessary. Test Plan: CI Differential Revision: D62278378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297 Approved by: https://github.com/drisspg	2024-09-11 20:21:59 +00:00
Bin Bao	118d7e1480	[Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694 ) Summary: test_cat_slice_cat_cuda runs inductor multiple times and checks counters["inductor"] in between, and thus we need to reset properly. Differential Revision: D62500331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694 Approved by: https://github.com/masnesral	2024-09-11 20:07:11 +00:00
Bob Ren	dd47f6f623	Simplify expr before getting implications in _maybe_evaluate_static (#135499 ) Fixes #134268 Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499 Approved by: https://github.com/ezyang	2024-09-11 19:48:29 +00:00
Tom Ritchford	e05ea2b179	Add decomposition for transpose_copy (#130943 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-11 19:45:22 +00:00
Shangdi Yu	ad75b09d89	Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623 ) Summary: as title Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86 ``` CI Differential Revision: D62448302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623 Approved by: https://github.com/tugsbayasgalan	2024-09-11 19:23:08 +00:00
rzou	a2cb9b7331	Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581 ) This is to match the default layout constraint for custom operators. By default, Inductor should match the stride order of inputs to a triton kernel. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581 Approved by: https://github.com/eellison ghstack dependencies: #135530	2024-09-11 18:43:18 +00:00
Edward Z. Yang	451eaf0ff2	Log full exception trace when error raised in Dynamo (#135697 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697 Approved by: https://github.com/Skylion007	2024-09-11 18:14:33 +00:00
Zain Rizvi	09519eb195	Support rolling over a percentage of workflows (#134816 ) In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py. Details of the new format are in the comments up top. On the plus side, this now includes some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816 Approved by: https://github.com/PaliC, https://github.com/zxiiro	2024-09-11 18:01:26 +00:00
Bob Ren	5314ae2660	Don't use exception chaining for BackendCompilerFailed (#135545 ) Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545 Approved by: https://github.com/ezyang	2024-09-11 17:49:18 +00:00
Jack Taylor	da587de9cb	[ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852 ) Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: `if torch.version.hip is not None:` Which was incorrectly replaced by: `if self.device_props.type != "hip":` Another occurrence of https://github.com/pytorch/pytorch/pull/130617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852 Approved by: https://github.com/masnesral, https://github.com/malfet	2024-09-11 17:21:40 +00:00
Jithun Nair	82a4df2d5f	[CI] [ROCm] Run rocm workflow on every push to main branch (#135644 ) Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644 Approved by: https://github.com/huydhn	2024-09-11 17:21:05 +00:00
Catherine Lee	18a9030952	[CI] Fix update slow tests (#135390 ) * Add pytorchbot to list of approvers for file * Add labels to the auto created PR The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal. I don't know if this has much value; clearly we've been managing without the update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390 Approved by: https://github.com/ZainRizvi	2024-09-11 17:02:17 +00:00
Isuru Fernando	03f23d07b4	Optimize ShapeEnv.replace (#135652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652 Approved by: https://github.com/ezyang ghstack dependencies: #135621, #135622	2024-09-11 16:50:59 +00:00
Isuru Fernando	8c738c9270	Improve performance of sympy_generic_le (#135622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622 Approved by: https://github.com/ezyang ghstack dependencies: #135621	2024-09-11 16:20:03 +00:00
Isuru Fernando	7ddacaf40a	Improve performance of canonicalize_bool_expr (#135621 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621 Approved by: https://github.com/ezyang	2024-09-11 16:20:03 +00:00
PyTorch MergeBot	183c32fd3b	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit 0d15122092c27fec1143b800bab7c996d126b547. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))	2024-09-11 15:57:00 +00:00
PyTorch MergeBot	3ab12e2596	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))	2024-09-11 15:53:55 +00:00
PyTorch MergeBot	596e93b506	Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 )" This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1. Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](`5c3d0a2ded`), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))	2024-09-11 15:51:12 +00:00
PyTorch MergeBot	f96e8041b1	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))	2024-09-11 15:48:27 +00:00
PyTorch MergeBot	7cf9c81918	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))	2024-09-11 15:39:21 +00:00
Sam Larsen	49e0b88aab	Fix test_triton_kernel_float64_constant (#135583 ) Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon and the test in that PR doesn't do exactly what I tested (actually test different dtypes). Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583 Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007	2024-09-11 15:16:23 +00:00
Pushpak Raj Gautam	ee8c5cc1cc	For S444023: Back out "deprecate `search_autotune_cache` (#133628 )" (#135186 ) Summary: For S444023 Test Plan: Revert prevented the NaN errors - f639391901 Training job ran for 7767 iterations. NaN errors show up within the first 1k. Reviewed By: nmacchioni Differential Revision: D62224747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186 Approved by: https://github.com/kit1980	2024-09-11 14:08:40 +00:00
Nikita Lutsenko	ce4d146f56	ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595 ) Summary: These are still utilized directly when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd However, on Mac Catalyst we always were returning `nil`, as such in most cases yielding the entire graph completely useless and most often just stray `MPSTemporaryImage` references that were never written into. This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed. Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745} Reviewed By: MichaelTay Differential Revision: D62430010 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595 Approved by: https://github.com/MichaelTay	2024-09-11 11:12:23 +00:00
Amadeusz Skrzypczak	0226fcaacf	Disable cuda specific restrictions in _scaled_mm for other devices (#135579 ) Fixes #135576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579 Approved by: https://github.com/drisspg	2024-09-11 11:05:38 +00:00
Yanbo Liang	4cde5096c4	[Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629 ) Fixes #134560 and #135206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629 Approved by: https://github.com/drisspg	2024-09-11 08:10:50 +00:00
Ke Wen	443c015393	[Distributed] Improve efficiency of NaN checker (#135414 ) Some customers would like to run the NaN checks on the fly, so we are improving its efficiency. ## Benchmarking Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1` Red kernel: ncclAllreduce Blue kernel: Nan check <img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3"> ## Comparison with torch ops: Let's say a user manually checks for NaNs with the following torch ops before all-reduce: ``` torch.any(torch.isnan(x)) ``` <img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b"> So our perf is on par with torch ops. ## Changes - Load from vidmem using "big packs" of 16 bytes - Bump `blockDim.x` from 256 to 512 - Separate loads and checks into two loops, each of 8 iterations - Unroll the loops - Templated functions for checking NaN in a "big pack" based on dtype Special thanks to @jbachan from NCCL! Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414 Approved by: https://github.com/wconstab	2024-09-11 07:53:42 +00:00
Yiming Zhou	4ae6d7c18f	Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634 ) Summary: Broke some tests. Revert this diff Test Plan: CI Differential Revision: D62474337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634 Approved by: https://github.com/tugsbayasgalan	2024-09-11 06:16:26 +00:00
Eddie Yan	3084b7b5c0	[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 ) CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482 Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-11 05:59:25 +00:00
Animesh Jain	5c3d0a2ded	[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612 Approved by: https://github.com/drisspg ghstack dependencies: #135588	2024-09-11 05:23:42 +00:00
fduwjj	c608b17f60	[PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496 ) While designing something else that needs TCPStore, I spent some time digging into the TCPStore codebase and found that the code is a little bit challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road. Also for libuv, we need to name private variables with a "_", so it's a pure renaming of private variables such as `tcpServer`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496 Approved by: https://github.com/wconstab	2024-09-11 04:42:25 +00:00
Michael Lazos	444b52ff40	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-11 04:18:22 +00:00
Michael Lazos	160c228a4b	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-11 04:18:22 +00:00
Michael Lazos	0d15122092	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-11 04:18:22 +00:00
Michael Lazos	6a3edfcc1e	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-11 04:18:22 +00:00
penguin-wwy	356f14e7b7	Fix the output of FileCheck when not run and add unit tests (#135345 ) When FileCheck is destructed without execution, it should output all rules. For example: ``` >>> fc = FileCheck().check("test") >>> del fc You have not run this instance of FileCheck! FileCheck checks: CHECK: test ``` Additionally, unit tests for the Python interface of FileCheck will be added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345 Approved by: https://github.com/eellison	2024-09-11 04:13:24 +00:00
Sathyanarayanan Saravanamuthu	34dc8f69a1	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows: ``` plugin_root |_ pyproject.toml |_ src |_ redis |_ __init__.py |_ redis_store.py |_ redis_backend.py ``` The contents of the `pyproject.toml` should indicate that this exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>` and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/fduwjj	2024-09-11 03:35:02 +00:00
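A minimal sketch of how packages that register under the `torchrun.plugins` entry-point group could be discovered with the standard library. The helper name `discover_plugins` is hypothetical, and the commit message does not show how torchrun itself performs the lookup, so this only illustrates the entry-point mechanism described above.

```python
from importlib.metadata import entry_points


def discover_plugins(group="torchrun.plugins"):
    """Return {entry-point name: EntryPoint} for installed packages in `group`.

    Hypothetical helper for illustration; not torchrun's own code.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


plugins = discover_plugins()
for name, ep in plugins.items():
    # ep.load() would import the object named in pyproject.toml (e.g. 'redis').
    print(f"found rendezvous plugin {name!r} -> {ep.value}")
```

With no plugin installed the loop prints nothing; after `pip3 install` of a package like the redis example, its entry point would appear in the returned mapping.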
angelayi	cd9ee49a69	[aoti] Add cpp loader (#135374 ) * Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python... * Added a new config, `aot_inductor.package_cpp_only` which will not package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users. * Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config. * Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`. * `load_package` will load a singular model, given the model name. * The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows? 
Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374 Approved by: https://github.com/desertfire, https://github.com/malfet	2024-09-11 03:00:01 +00:00
chuanqiw	26e5572dd2	Bump triton xpu pin and release version (#135638 ) Similar to https://github.com/pytorch/pytorch/pull/135627 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638 Approved by: https://github.com/atalman	2024-09-11 00:56:15 +00:00
Animesh Jain	693897df42	[dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041 ) Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041 Approved by: https://github.com/ezyang	2024-09-11 00:43:26 +00:00
Nikita Shulga	3bf6be457d	[MPS] Add missing dispatch to rshift.Tensor (#135607 ) Missed it while working on https://github.com/pytorch/pytorch/pull/131813 Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607 Approved by: https://github.com/manuelcandales	2024-09-11 00:20:53 +00:00
titaiwangms	492f064f15	[ONNX] Add assertion nodes to ignoring list (#135591 ) Fixes #135419 PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591 Approved by: https://github.com/justinchuby	2024-09-11 00:18:17 +00:00
rzou	29408ea81a	Add option to tweak inductor stride settings for user-defined triton kernels (#135530 ) Previously, Inductor was allowed to modify the stride/storage_offset (layout) for inputs to user-defined triton kernels. This can cause silent incorrectness because most triton kernels are written for a specific striding pattern (usually contiguous). This PR adds a config to allow the user to choose Inductor's behavior on this. The options are: - "flexible_layout" (default): Inductor can modify the layout for inputs to user-defined triton kernels as much as it wants. - "needs_fixed_stride_order": Inductor must preserve the stride order (when compared to tracing) for inputs to user-defined triton kernels. This matches our handling for custom operators. In the future, we'll want a "needs_exact_strides" option (this is the safest option). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530 Approved by: https://github.com/FindHao, https://github.com/oulgen	2024-09-11 00:11:17 +00:00
Haoming Lu	02dcb07765	Add boolean support in pack segments ops for both cpu and cuda impls (#132897 ) (#135620 ) Summary: Same as int types, forward only. bypass-github-export-checks diff has been synced to github Test Plan: buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/ Reviewed By: garroud Differential Revision: D60785563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620 Approved by: https://github.com/kit1980 Co-authored-by: Haoming Lu <haominglu@meta.com>	2024-09-11 00:03:17 +00:00
Animesh Jain	5c38aa72c0	[dynamo][dicts][nv-embed] Support update with kwargs (#135588 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588 Approved by: https://github.com/yanboliang	2024-09-10 23:50:23 +00:00
atalman	5134ba7458	Bump triton pin and release version (#135627 ) Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627 Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet	2024-09-10 23:46:36 +00:00
titaiwangms	e48ee2cf50	[ONNX] Fix scaled_dot_product_attention with float scale (#135594 ) Fixes #125158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594 Approved by: https://github.com/justinchuby	2024-09-10 23:04:02 +00:00
hongxyan	eb38ee21ba	[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397 ) Fixes #132964 This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform. With a larger value, the reduction uses fewer threadblocks, which improves performance. Test: Tested on MI300x and H100. MI300x performance improved from ~1690GByte/s to 3205GByte/s for the test case below and is slightly better than H100 (3136GByte/s). Other tensor sizes also show a performance improvement. ```python import torch from triton.testing import do_bench x = torch.randn(2**30, device='cuda') ms = do_bench(lambda: x.sum(dim=-1)) bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9) time_s = ms / 1000 bw_per_second = bandwidth_gbyte / time_s print(bw_per_second) ``` Co-author: @carlobertolli Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397 Approved by: https://github.com/eqy, https://github.com/malfet	2024-09-10 21:03:01 +00:00
Shunting Zhang	8057b72763	[ez][inductor] don't benchmark cloning if there are no mutated args (#135533 ) When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel. Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533 Approved by: https://github.com/jansel ghstack dependencies: #135531	2024-09-10 20:54:31 +00:00
Shunting Zhang	7b17918dc9	[inductor] fix a device sync issue for benchmarking fusion (#135531 ) Fix https://github.com/pytorch/pytorch/issues/134768 . When we benchmark the latency for a fused node set, we benchmark twice: 1. benchmark the latency of the kernel including cloning mutated args 2. benchmark the latency of cloning mutated args without running the kernel We subtract result 2 from result 1 to get the latency of the kernel itself. But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench` the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0). The fix is to set the correct current device in our benchmarking code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531 Approved by: https://github.com/jansel	2024-09-10 20:54:31 +00:00
Yiming Zhou	66c45f3ed9	[export] fix re-export custom metadata (#135282 ) Fixes #134778 When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`). This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`. Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282 Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan	2024-09-10 20:15:02 +00:00
PyTorch MergeBot	0a9d55d2ee	Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086 )" This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd. Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))	2024-09-10 19:51:16 +00:00
Catherine Lee	4ca65d3323	[CI] Increase sharding for jobs that are timing out (#135582 ) Increase sharding for * slow grad check * slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test * avx Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-10 19:45:13 +00:00
Andrew Gu	c932b39739	[FSDP2] Added `_set_unshard_async_op` (#135523 ) This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation. If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute. Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523 Approved by: https://github.com/weifengpy	2024-09-10 19:28:02 +00:00
Rachel Guo	1f15973657	[AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285 ) Summary: 1. Add the debug printer call to a level lower for triton kernel python wrapper codegen path 2. Add `torch.save()` for jit inductor as well 3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing) Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda ``` Differential Revision: D62272588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285 Approved by: https://github.com/chenyang78	2024-09-10 19:24:58 +00:00
Dan Zimmerman	fc88ba260f	[amdsmi][torch] Update amdsmi API usages (#135504 ) Summary: In ROCm 6.2.0 there were API name changes-- we check if the new APIs exist and use them in this diff; see `7b2463abe0` for the changes Test Plan: CI Differential Revision: D62325661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504 Approved by: https://github.com/eqy, https://github.com/houseroad	2024-09-10 19:15:39 +00:00
Sam Larsen	bf8d0e3107	[inductor] Enable subprocess parallel compile internally with killswitch (#132467 ) Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467 Approved by: https://github.com/eellison	2024-09-10 19:05:46 +00:00
Shivam Raikundalia	3a1239a248	[Profiler] Harden Record Function Kwargs (#135365 ) Summary: In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post processing of said inputs as to not break JSON schema as well as downstream tools. For this reason, this diff does the following. 1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future. 2. Make sure that any boolean is lowercase when a string so that the JSON does not break when parsing it 3. Force stream parameter to be an int Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only. Differential Revision: D62304843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365 Approved by: https://github.com/aaronenyeshi	2024-09-10 18:44:05 +00:00
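The three rules listed in the Profiler commit above can be sketched in plain Python. This is an illustration only: the actual change lives in the profiler's own code, the function name `sanitize_kwinputs` is hypothetical, and dropping a `stream` value that cannot be converted is an assumption not stated in the commit message.

```python
import json


def sanitize_kwinputs(kwargs):
    """Keep only JSON-safe kwargs, following the three rules in the commit."""
    clean = {}
    for key, value in kwargs.items():
        if key == "stream":
            # Rule 3: force the stream parameter to be an int.
            try:
                clean[key] = int(value)
            except (TypeError, ValueError):
                continue  # assumption: unusable stream values are dropped
        elif isinstance(value, bool):
            # Rule 2: booleans are emitted in lowercase form.
            clean[key] = "true" if value else "false"
        elif isinstance(value, (int, float, str)):
            # Rule 1: only int, double, bool and string values are processed.
            clean[key] = value
        # Lists and other types are skipped for now.
    return clean


example = {"stream": "7", "flag": True, "shape": [1, 2], "name": "x", "lr": 0.1}
print(json.dumps(sanitize_kwinputs(example)))
```

The `bool` check comes before the `int` check because `bool` is a subclass of `int` in Python.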
Sam Larsen	4f9f1775d8	Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370 ) Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern. Test Plan: ``` python test/inductor/test_combo_kernels.py python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370 Approved by: https://github.com/jansel	2024-09-10 18:43:14 +00:00
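The `contextlib.ExitStack` pattern mentioned in the commit above can be sketched as follows. The `config` object and its `combo_kernels` attribute are hypothetical stand-ins for the real Inductor config, so this shows the isolation idea only, not the contents of `test/inductor/test_combo_kernels.py`.

```python
import contextlib
import types
import unittest
from unittest import mock

# Hypothetical stand-in for a global config module.
config = types.SimpleNamespace(combo_kernels=False)


class ComboKernelTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Patch config options through an ExitStack instead of assigning globally.
        self._stack = contextlib.ExitStack()
        self._stack.enter_context(mock.patch.object(config, "combo_kernels", True))

    def tearDown(self):
        # Closing the stack undoes every patch, so later tests see the defaults.
        self._stack.close()
        super().tearDown()

    def test_flag_enabled(self):
        self.assertTrue(config.combo_kernels)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ComboKernelTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful(), config.combo_kernels)
```

After the suite runs, `config.combo_kernels` is back to its original value, which is the property that keeps subsequent tests from inheriting the patched options.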
Thanh Ha	5e0788befb	Migrate remaining jobs to use runner determinator (#134867 ) At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over. Issue: https://lf-pytorch.atlassian.net/browse/PC-25 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867 Approved by: https://github.com/ZainRizvi	2024-09-10 18:14:00 +00:00
Ivan Zaitsev	440f8f57af	Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079 )" (#135562 ) This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8. #135079 breaks internal tests and needs to be reverted. Revert with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562 Approved by: https://github.com/jansel, https://github.com/seemethere	2024-09-10 18:07:11 +00:00
Zhou, Lingzhi	e004d539da	[Partitioner] Reuse partition to check whether nodes exist (#135317 ) Checking whether a node is in a NodeList takes O(n) time. Reuse the partition for this check instead: partition.nodes is a hash table containing the same elements, so the lookup is faster. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317 Approved by: https://github.com/ezyang	2024-09-10 17:45:29 +00:00
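A small sketch of the idea in the Partitioner commit above: membership tests against a hash-based container avoid a linear scan per lookup. The `Partition` class and `missing_nodes` helper here are hypothetical stand-ins, not the classes from the partitioner.

```python
class Partition:
    """Hypothetical stand-in whose nodes live in a dict, i.e. a hash table."""

    def __init__(self, nodes):
        # dict.fromkeys keeps insertion order and gives hash-based lookup.
        self.nodes = dict.fromkeys(nodes)


def missing_nodes(partition, candidates):
    """Return the candidates that are not part of the partition."""
    # `in` on a dict is a hash lookup; on a list it would scan every element.
    return [node for node in candidates if node not in partition.nodes]


partition = Partition(["a", "b", "c"])
print(missing_nodes(partition, ["a", "d"]))
```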
Zixi Qi	c4b84a46a9	Add more logging to TunableOp validators (#135396 ) Summary: Add more logging to TunableOp validators Test Plan: Verified additional logging when loading kernel selections: ``` ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack- HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178 ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39 PT_VERSION validation: expect 2.5.0 to match 2.5.0 ``` ``` [qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d\|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning File changed: fbcode//hipblas_tuning_pt_llama0.csv Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022 Network: Up: 0B Down: 0B Jobs completed: 4189. Time elapsed: 0.2s. BUILD SUCCEEDED Enabled tuning - Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16 INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. 
Roctracer: 4.1; Runtime: 60032830; Driver: 60032830 INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0 reading tuning results from hipblas_tuning_pt_llama0.csv Validator PT_VERSION=2.5.0 Validator ROCM_VERSION=6.0.0.0-12969-1544e39 Validator HIPBLASLT_VERSION=800-a15e4178 Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack- Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack- HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178 ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39 PT_VERSION validation: expect 2.5.0 to match 2.5.0 Loading results Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s - Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16 Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s - Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16 Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s - Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16 Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s 2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412 2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075 2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985 2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748 ``` Reviewed By: leitian Differential Revision: D62322830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396 Approved by: https://github.com/eqy	2024-09-10 17:20:59 +00:00
cyy	bc1b8f094d	Check function declarations of Core ML code (#135467 ) Relax the restrictions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467 Approved by: https://github.com/ezyang	2024-09-10 16:05:22 +00:00
rzou	f65a564fa2	[inductor] Flip custom_op_default_layout_constraint (#135239 ) By default, Inductor should respect the stride order of input Tensors to custom operators. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239 Approved by: https://github.com/albanD ghstack dependencies: #135391	2024-09-10 14:27:43 +00:00
Edward Z. Yang	386b313028	Handle KeyError for compiler collective in scalars too (#135385 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385 Approved by: https://github.com/jansel	2024-09-10 12:33:04 +00:00
torotoki	6d7cbc20d2	Add dynamo itertools.pairwise support (#135416 ) Fixes #133766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416 Approved by: https://github.com/XuehaiPan, https://github.com/jansel Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2024-09-10 11:37:59 +00:00
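For reference, the semantics of `itertools.pairwise` (the function the commit above teaches Dynamo to handle) can be written out in plain Python. `pairwise_reference` is a hypothetical name for this sketch; it shows what the function yields, not how Dynamo traces it.

```python
def pairwise_reference(iterable):
    """Yield overlapping pairs (s0, s1), (s1, s2), ... like itertools.pairwise."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # an empty input yields no pairs
    for current in iterator:
        yield previous, current
        previous = current


print(list(pairwise_reference([1, 2, 3, 4])))
```

An input with fewer than two elements produces no pairs.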
xinan.lin	ca16956b20	[Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761 Approved by: https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #134693	2024-09-10 10:11:52 +00:00
xinan.lin	67735d1ee8	[Inductor] Generalize `is_cuda` to specific device_type to make cpp_wrapper mode be extensible (#134693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel	2024-09-10 10:11:13 +00:00
Boyuan Feng	6e13f5eb38	[FlexAttention] Add broadcast support for kv batch dimension (#135505 ) This PR adds broadcast support for KV batch dimension. ## Details Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we require `Bq == Bkv`. However, for some use cases, we may have Bkv < Bq. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximal number of tokens supported by paged attention. This PR relaxes this requirement to be `Bq == Bkv or (Bq > 1 and Bkv == 1)`. This support covers flex decoding as well as flex attention forward and backward. ## Benchmark GPU: H100 We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`. ``` python benchmarks/transformer/score_mod.py --calculate-bwd ``` ### Perf before this PR FWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|---------------\|------------\|----------------\|------------------------------\| \| Average \| 0.743 \| \| \| \| \| \| Max \| 0.955 \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| \| Min \| 0.548 \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| BWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|-----------------------------\| \| Average \| 0.834 \| \| \| \| \| \| Max \| 1.261 \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| \| Min \| 0.456 \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| <details> <summary> Full performance sweep </summary> \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| 
\|---------------\|------------\|----------------\|-------------------------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.264 \| 17.184 \| 107.040 \| 140.800 \| 0.888 \| 0.760 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.840 \| 19.744 \| 112.576 \| 140.064 \| 0.802 \| 0.804 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.232 \| 17.344 \| 87.744 \| 142.496 \| 0.878 \| 0.616 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.264 \| 17.184 \| 108.192 \| 143.328 \| 0.888 \| 0.755 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.904 \| 22.400 \| 106.432 \| 136.512 \| 0.889 \| 0.780 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.424 \| 26.752 \| 91.712 \| 106.688 \| 0.726 \| 0.860 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.808 \| 22.432 \| 89.024 \| 101.920 \| 0.883 \| 0.873 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.840 \| 22.272 \| 88.896 \| 102.592 \| 0.891 \| 0.867 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.240 \| 32.416 \| 116.768 \| 112.256 \| 0.933 \| 1.040 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 29.536 \| 37.024 \| 113.664 \| 102.688 \| 0.798 \| 1.107 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.656 \| 32.800 \| 116.992 \| 127.008 \| 0.935 \| 0.921 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.592 \| 32.480 \| 116.928 \| 112.160 \| 0.942 \| 1.043 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.448 \| 61.920 \| 198.656 \| 204.512 \| 0.653 \| 0.971 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 37.760 \| 62.528 \| 
189.536 \| 170.624 \| 0.604 \| 1.111 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.896 \| 62.368 \| 198.304 \| 205.824 \| 0.656 \| 0.963 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.448 \| 61.952 \| 198.432 \| 203.648 \| 0.653 \| 0.974 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 318.528 \| 355.904 \| 947.232 \| 1162.496 \| 0.895 \| 0.815 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 199.776 \| 252.128 \| 677.792 \| 813.184 \| 0.792 \| 0.834 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 316.512 \| 363.328 \| 947.712 \| 1361.984 \| 0.871 \| 0.696 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 317.984 \| 356.864 \| 947.264 \| 1165.024 \| 0.891 \| 0.813 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 446.656 \| 734.656 \| 1664.288 \| 2172.960 \| 0.608 \| 0.766 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 278.688 \| 467.648 \| 1182.624 \| 1339.296 \| 0.596 \| 0.883 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 447.872 \| 744.096 \| 1662.944 \| 2196.544 \| 0.602 \| 0.757 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 448.128 \| 732.928 \| 1663.072 \| 2156.800 \| 0.611 \| 0.771 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.648 \| 16.640 \| 107.520 \| 143.008 \| 0.940 \| 0.752 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.776 \| 18.240 \| 129.056 \| 141.920 \| 0.865 \| 0.909 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.168 \| 16.640 \| 103.616 \| 139.648 \| 0.912 \| 0.742 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.616 \| 16.640 \| 128.608 \| 164.448 \| 0.938 \| 0.782 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.776 \| 21.952 
\| 125.344 \| 170.304 \| 0.901 \| 0.736 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.776 \| 23.712 \| 104.288 \| 196.896 \| 0.834 \| 0.530 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.072 \| 21.952 \| 102.080 \| 177.056 \| 0.869 \| 0.577 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.648 \| 21.920 \| 109.920 \| 170.848 \| 0.896 \| 0.643 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.464 \| 31.936 \| 127.808 \| 228.832 \| 0.954 \| 0.559 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 29.472 \| 33.856 \| 113.152 \| 215.072 \| 0.871 \| 0.526 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.496 \| 32.160 \| 116.576 \| 231.744 \| 0.948 \| 0.503 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.464 \| 31.904 \| 116.320 \| 229.824 \| 0.955 \| 0.506 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.480 \| 61.440 \| 176.448 \| 345.312 \| 0.659 \| 0.511 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 38.304 \| 59.424 \| 169.312 \| 371.360 \| 0.645 \| 0.456 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.960 \| 61.760 \| 176.512 \| 358.912 \| 0.663 \| 0.492 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.352 \| 61.696 \| 176.512 \| 344.928 \| 0.654 \| 0.512 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 316.224 \| 357.728 \| 905.728 \| 1668.448 \| 0.884 \| 0.543 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 199.904 \| 248.416 \| 636.544 \| 1109.088 \| 0.805 \| 0.574 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 314.880 \| 363.616 \| 906.304 \| 1658.176 \| 0.866 \| 0.547 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 316.160 \| 354.368 \| 906.080 \| 
1649.024 \| 0.892 \| 0.549 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.912 \| 739.840 \| 1555.808 \| 2521.952 \| 0.604 \| 0.617 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 279.776 \| 463.904 \| 1068.928 \| 1849.888 \| 0.603 \| 0.578 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.080 \| 748.960 \| 1553.504 \| 2629.888 \| 0.596 \| 0.591 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.208 \| 740.608 \| 1558.880 \| 2524.960 \| 0.602 \| 0.617 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 33.568 \| 41.280 \| 170.016 \| 147.584 \| 0.813 \| 1.152 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 30.688 \| 43.040 \| 159.552 \| 146.720 \| 0.713 \| 1.087 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 34.112 \| 41.504 \| 170.112 \| 152.672 \| 0.822 \| 1.114 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 34.240 \| 41.152 \| 170.272 \| 134.976 \| 0.832 \| 1.261 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.672 \| 76.416 \| 295.296 \| 263.648 \| 0.637 \| 1.120 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 45.088 \| 72.576 \| 281.920 \| 237.664 \| 0.621 \| 1.186 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.032 \| 76.672 \| 295.520 \| 265.248 \| 0.626 \| 1.114 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.096 \| 76.096 \| 295.456 \| 262.112 \| 0.632 \| 1.127 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 93.920 \| 111.232 \| 401.568 \| 382.944 \| 0.844 \| 1.049 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 68.192 \| 95.232 \| 338.752 \| 326.816 \| 0.716 \| 1.037 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 93.984 \| 111.840 \| 401.856 \| 444.224 \| 
0.840 \| 0.905 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 94.176 \| 110.496 \| 401.600 \| 383.136 \| 0.852 \| 1.048 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.488 \| 227.040 \| 727.424 \| 739.712 \| 0.579 \| 0.983 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 95.616 \| 169.760 \| 616.864 \| 574.112 \| 0.563 \| 1.074 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.680 \| 228.672 \| 727.616 \| 746.048 \| 0.576 \| 0.975 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.104 \| 225.696 \| 727.904 \| 735.392 \| 0.581 \| 0.990 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1227.296 \| 1386.656 \| 3720.192 \| 4539.904 \| 0.885 \| 0.819 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 691.360 \| 831.712 \| 2515.872 \| 3067.808 \| 0.831 \| 0.820 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1228.192 \| 1403.136 \| 3715.520 \| 5309.280 \| 0.875 \| 0.700 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1229.024 \| 1384.992 \| 3715.904 \| 4550.368 \| 0.887 \| 0.817 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1784.832 \| 2865.888 \| 6539.840 \| 8460.224 \| 0.623 \| 0.773 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1017.408 \| 1660.480 \| 4369.824 \| 5056.992 \| 0.613 \| 0.864 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1792.448 \| 2904.864 \| 6546.080 \| 8537.024 \| 0.617 \| 0.767 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1795.552 \| 2856.864 \| 6544.672 \| 8400.160 \| 0.629 \| 0.779 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 38.880 \| 148.832 \| 179.936 \| 0.881 \| 0.827 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 31.168 \| 
38.080 \| 138.528 \| 167.552 \| 0.818 \| 0.827 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 39.168 \| 148.512 \| 181.248 \| 0.874 \| 0.819 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 38.784 \| 148.864 \| 180.224 \| 0.883 \| 0.826 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.832 \| 76.352 \| 253.632 \| 295.968 \| 0.640 \| 0.857 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 45.760 \| 65.792 \| 239.040 \| 290.752 \| 0.696 \| 0.822 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.768 \| 76.576 \| 253.312 \| 304.032 \| 0.637 \| 0.833 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.768 \| 76.192 \| 253.600 \| 296.096 \| 0.640 \| 0.856 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.728 \| 109.728 \| 357.696 \| 498.912 \| 0.854 \| 0.717 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 68.704 \| 92.288 \| 295.616 \| 386.240 \| 0.744 \| 0.765 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.632 \| 111.392 \| 357.408 \| 512.448 \| 0.841 \| 0.697 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.280 \| 109.952 \| 357.696 \| 501.440 \| 0.848 \| 0.713 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.392 \| 230.496 \| 612.224 \| 807.552 \| 0.570 \| 0.758 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 96.512 \| 165.184 \| 502.624 \| 672.384 \| 0.584 \| 0.748 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.360 \| 232.608 \| 612.064 \| 832.320 \| 0.565 \| 0.735 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.008 \| 230.528 \| 612.640 \| 804.320 \| 0.568 \| 0.762 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1227.968 \| 1377.408 \| 3477.920 \| 
5324.384 \| 0.892 \| 0.653 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 695.264 \| 824.544 \| 2268.224 \| 3210.208 \| 0.843 \| 0.707 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1228.640 \| 1404.576 \| 3476.832 \| 5463.456 \| 0.875 \| 0.636 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1228.416 \| 1378.752 \| 3478.048 \| 5367.712 \| 0.891 \| 0.648 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1788.736 \| 2867.712 \| 6039.520 \| 8616.256 \| 0.624 \| 0.701 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1021.952 \| 1653.824 \| 3866.208 \| 5306.848 \| 0.618 \| 0.729 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1786.752 \| 2896.352 \| 6044.128 \| 8871.360 \| 0.617 \| 0.681 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1786.080 \| 2868.672 \| 6040.160 \| 8550.144 \| 0.623 \| 0.706 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 57.504 \| 71.552 \| 312.768 \| 255.040 \| 0.804 \| 1.226 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 49.472 \| 71.104 \| 285.696 \| 243.520 \| 0.696 \| 1.173 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 58.112 \| 72.896 \| 312.768 \| 288.256 \| 0.797 \| 1.085 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 57.952 \| 71.680 \| 312.768 \| 255.552 \| 0.808 \| 1.224 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.336 \| 144.256 \| 580.128 \| 500.160 \| 0.571 \| 1.160 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 76.160 \| 123.712 \| 552.544 \| 447.648 \| 0.616 \| 1.234 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.400 \| 145.184 \| 580.032 \| 504.032 \| 0.568 \| 1.151 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.368 \| 
143.904 \| 580.192 \| 499.936 \| 0.572 \| 1.161 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.216 \| 209.568 \| 787.872 \| 747.712 \| 0.846 \| 1.054 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 121.984 \| 168.256 \| 651.968 \| 628.256 \| 0.725 \| 1.038 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.088 \| 211.488 \| 788.320 \| 864.352 \| 0.837 \| 0.912 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.440 \| 208.576 \| 787.424 \| 749.120 \| 0.851 \| 1.051 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 249.472 \| 441.376 \| 1405.440 \| 1431.648 \| 0.565 \| 0.982 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 172.960 \| 312.064 \| 1172.064 \| 1096.448 \| 0.554 \| 1.069 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 249.632 \| 446.336 \| 1405.408 \| 1448.480 \| 0.559 \| 0.970 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 250.944 \| 440.128 \| 1406.624 \| 1421.952 \| 0.570 \| 0.989 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2418.720 \| 2747.936 \| 7330.432 \| 9023.712 \| 0.880 \| 0.812 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 1353.696 \| 1608.480 \| 4941.696 \| 6078.752 \| 0.842 \| 0.813 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2427.456 \| 2746.816 \| 7329.792 \| 10539.968 \| 0.884 \| 0.695 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2426.688 \| 2763.168 \| 7336.256 \| 9057.536 \| 0.878 \| 0.810 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3554.240 \| 5634.400 \| 12919.872 \| 16843.489 \| 0.631 \| 0.767 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 2003.648 \| 3250.784 \| 8610.144 \| 10015.424 \| 0.616 \| 0.860 \| \| relative_bias \| 
None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3582.080 \| 5710.944 \| 12923.328 \| 17011.871 \| 0.627 \| 0.760 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3581.920 \| 5618.144 \| 12934.528 \| 16745.888 \| 0.638 \| 0.772 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.120 \| 71.232 \| 269.760 \| 295.680 \| 0.802 \| 0.912 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 49.408 \| 65.312 \| 242.304 \| 253.952 \| 0.756 \| 0.954 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.504 \| 72.544 \| 269.632 \| 298.976 \| 0.793 \| 0.902 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.760 \| 71.040 \| 269.600 \| 296.640 \| 0.813 \| 0.909 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 82.336 \| 147.168 \| 466.080 \| 487.456 \| 0.559 \| 0.956 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 76.704 \| 115.040 \| 435.392 \| 453.248 \| 0.667 \| 0.961 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 81.856 \| 147.424 \| 465.920 \| 499.552 \| 0.555 \| 0.933 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 81.760 \| 146.656 \| 466.176 \| 485.984 \| 0.557 \| 0.959 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 176.608 \| 206.976 \| 678.080 \| 866.976 \| 0.853 \| 0.782 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 121.664 \| 164.768 \| 538.240 \| 636.160 \| 0.738 \| 0.846 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 176.608 \| 209.664 \| 677.696 \| 883.424 \| 0.842 \| 0.767 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 177.440 \| 207.840 \| 677.248 \| 868.288 \| 0.854 \| 0.780 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 250.272 \| 449.536 \| 1163.424 \| 1420.832 \| 0.557 \| 0.819 \| \| None \| 
causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 173.472 \| 305.376 \| 929.408 \| 1104.544 \| 0.568 \| 0.841 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 249.376 \| 454.976 \| 1163.648 \| 1455.296 \| 0.548 \| 0.800 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 250.368 \| 450.144 \| 1163.520 \| 1409.984 \| 0.556 \| 0.825 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2416.576 \| 2726.208 \| 6835.520 \| 10442.784 \| 0.886 \| 0.655 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 1357.440 \| 1590.752 \| 4433.664 \| 5975.296 \| 0.853 \| 0.742 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2427.360 \| 2747.040 \| 6853.056 \| 10670.784 \| 0.884 \| 0.642 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2441.120 \| 2718.944 \| 6836.640 \| 10433.792 \| 0.898 \| 0.655 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3555.392 \| 5620.960 \| 11944.000 \| 16504.801 \| 0.633 \| 0.724 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 2010.848 \| 3241.152 \| 7636.064 \| 9870.464 \| 0.620 \| 0.774 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3557.440 \| 5688.352 \| 11935.744 \| 17090.496 \| 0.625 \| 0.698 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3562.720 \| 5630.432 \| 11939.168 \| 16392.033 \| 0.633 \| 0.728 \| </details> ### Perf after this PR FWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|---------------\|------------\|----------------\|----------------------------\| \| Average \| 0.776 \| \| \| \| \| \| Max \| 1.006 \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| \| Min \| 0.566 \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| BWD \| Type \| Speedup \| score_mod \| 
mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|-----------------------------\| \| Average \| 0.817 \| \| \| \| \| \| Max \| 1.150 \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| \| Min \| 0.454 \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| <details> <summary> Full performance sweep </summary> \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| \|---------------\|------------\|----------------\|-------------------------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.680 \| 17.056 \| 64.544 \| 73.376 \| 0.919 \| 0.880 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.712 \| 19.872 \| 65.408 \| 72.864 \| 0.791 \| 0.898 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 16.160 \| 17.280 \| 64.896 \| 73.888 \| 0.935 \| 0.878 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 16.192 \| 17.120 \| 64.896 \| 75.424 \| 0.946 \| 0.860 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.648 \| 22.496 \| 89.184 \| 82.592 \| 0.873 \| 1.080 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 20.320 \| 26.816 \| 91.264 \| 82.880 \| 0.758 \| 1.101 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 20.096 \| 22.528 \| 89.184 \| 83.776 \| 0.892 \| 1.065 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.680 \| 22.432 \| 89.184 \| 120.096 \| 0.877 \| 0.743 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.384 \| 32.512 \| 119.232 \| 128.960 \| 0.996 \| 0.925 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 
16, 1024, 64) \| 30.176 \| 37.248 \| 113.664 \| 119.520 \| 0.810 \| 0.951 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.512 \| 32.928 \| 119.264 \| 131.456 \| 0.987 \| 0.907 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.448 \| 32.704 \| 119.200 \| 128.352 \| 0.992 \| 0.929 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 41.952 \| 62.176 \| 199.040 \| 214.304 \| 0.675 \| 0.929 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 39.744 \| 62.880 \| 189.504 \| 179.968 \| 0.632 \| 1.053 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 41.472 \| 62.784 \| 199.136 \| 217.664 \| 0.661 \| 0.915 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 42.048 \| 61.952 \| 199.168 \| 214.496 \| 0.679 \| 0.929 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 341.184 \| 357.632 \| 980.256 \| 1328.896 \| 0.954 \| 0.738 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 212.576 \| 252.960 \| 673.888 \| 824.864 \| 0.840 \| 0.817 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 340.000 \| 363.296 \| 980.768 \| 1375.808 \| 0.936 \| 0.713 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 340.768 \| 356.832 \| 980.960 \| 1326.272 \| 0.955 \| 0.740 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 459.392 \| 737.120 \| 1678.240 \| 2205.248 \| 0.623 \| 0.761 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 292.672 \| 468.096 \| 1178.016 \| 1371.584 \| 0.625 \| 0.859 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 462.144 \| 745.312 \| 1680.000 \| 2252.512 \| 0.620 \| 0.746 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 462.112 \| 736.576 \| 1679.008 \| 2216.480 \| 0.627 \| 0.758 \| \| None \| None \| torch.bfloat16 
\| (2, 16, 512, 2, 512, 64) \| 16.064 \| 16.704 \| 105.120 \| 120.768 \| 0.962 \| 0.870 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.552 \| 18.144 \| 107.136 \| 121.696 \| 0.857 \| 0.880 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 16.096 \| 16.768 \| 102.688 \| 120.864 \| 0.960 \| 0.850 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 16.032 \| 16.576 \| 104.736 \| 124.672 \| 0.967 \| 0.840 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.392 \| 21.952 \| 104.736 \| 174.656 \| 0.883 \| 0.600 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 20.128 \| 23.712 \| 105.216 \| 199.008 \| 0.849 \| 0.529 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.904 \| 21.888 \| 103.744 \| 179.520 \| 0.909 \| 0.578 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.968 \| 21.952 \| 104.640 \| 177.312 \| 0.910 \| 0.590 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.096 \| 31.904 \| 118.720 \| 231.968 \| 1.006 \| 0.512 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.528 \| 33.952 \| 112.480 \| 218.304 \| 0.899 \| 0.515 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.160 \| 32.224 \| 118.752 \| 237.312 \| 0.998 \| 0.500 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.128 \| 32.032 \| 118.240 \| 233.120 \| 1.003 \| 0.507 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 41.312 \| 61.280 \| 177.408 \| 350.688 \| 0.674 \| 0.506 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 39.552 \| 59.360 \| 168.832 \| 371.488 \| 0.666 \| 0.454 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 41.984 \| 61.696 \| 177.376 \| 360.416 \| 0.680 \| 0.492 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 
41.312 \| 61.760 \| 177.184 \| 355.744 \| 0.669 \| 0.498 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 339.744 \| 357.888 \| 939.712 \| 1665.376 \| 0.949 \| 0.564 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 212.608 \| 248.832 \| 633.280 \| 1122.848 \| 0.854 \| 0.564 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 339.712 \| 363.232 \| 940.448 \| 1689.440 \| 0.935 \| 0.557 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 341.056 \| 355.264 \| 940.128 \| 1641.152 \| 0.960 \| 0.573 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.736 \| 741.024 \| 1569.824 \| 2559.552 \| 0.622 \| 0.613 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 293.856 \| 464.192 \| 1066.240 \| 1840.416 \| 0.633 \| 0.579 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.704 \| 753.152 \| 1570.112 \| 2641.088 \| 0.612 \| 0.594 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.832 \| 745.536 \| 1570.144 \| 2602.560 \| 0.618 \| 0.603 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.680 \| 41.280 \| 171.840 \| 158.176 \| 0.864 \| 1.086 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 31.360 \| 42.976 \| 158.912 \| 139.264 \| 0.730 \| 1.141 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.168 \| 41.600 \| 171.648 \| 161.344 \| 0.845 \| 1.064 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.136 \| 41.152 \| 171.808 \| 158.336 \| 0.854 \| 1.085 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.832 \| 76.384 \| 295.680 \| 277.696 \| 0.639 \| 1.065 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 45.632 \| 72.512 \| 281.760 \| 250.752 \| 0.629 \| 1.124 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 49.504 
\| 76.608 \| 295.584 \| 279.712 \| 0.646 \| 1.057 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.864 \| 75.904 \| 295.456 \| 277.568 \| 0.644 \| 1.064 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 99.392 \| 111.232 \| 408.640 \| 442.656 \| 0.894 \| 0.923 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 71.392 \| 95.168 \| 338.784 \| 341.760 \| 0.750 \| 0.991 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 99.808 \| 112.256 \| 408.608 \| 456.160 \| 0.889 \| 0.896 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 100.032 \| 110.816 \| 408.512 \| 444.192 \| 0.903 \| 0.920 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.040 \| 226.112 \| 726.880 \| 774.176 \| 0.597 \| 0.939 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 99.904 \| 169.696 \| 616.448 \| 607.104 \| 0.589 \| 1.015 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.488 \| 228.384 \| 727.776 \| 782.368 \| 0.593 \| 0.930 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.744 \| 225.664 \| 728.000 \| 773.600 \| 0.602 \| 0.941 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1324.192 \| 1387.808 \| 3866.944 \| 5217.184 \| 0.954 \| 0.741 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 738.464 \| 832.608 \| 2507.392 \| 3146.688 \| 0.887 \| 0.797 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1326.016 \| 1404.256 \| 3867.872 \| 5382.624 \| 0.944 \| 0.719 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1326.144 \| 1386.688 \| 3867.552 \| 5203.264 \| 0.956 \| 0.743 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1847.488 \| 2866.336 \| 6612.704 \| 8597.696 \| 0.645 \| 0.769 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 
4096, 128) \| 1066.592 \| 1660.640 \| 4357.696 \| 5174.016 \| 0.642 \| 0.842 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1850.464 \| 2905.408 \| 6616.928 \| 8793.280 \| 0.637 \| 0.752 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1848.896 \| 2834.720 \| 6623.872 \| 8637.920 \| 0.652 \| 0.767 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.384 \| 38.656 \| 150.336 \| 182.624 \| 0.941 \| 0.823 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 31.360 \| 38.112 \| 137.664 \| 171.840 \| 0.823 \| 0.801 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.608 \| 39.040 \| 150.528 \| 183.872 \| 0.938 \| 0.819 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.064 \| 38.656 \| 150.560 \| 183.520 \| 0.933 \| 0.820 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.344 \| 76.352 \| 253.920 \| 301.440 \| 0.646 \| 0.842 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 46.720 \| 65.824 \| 239.424 \| 296.384 \| 0.710 \| 0.808 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.248 \| 76.416 \| 253.728 \| 307.808 \| 0.644 \| 0.824 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.376 \| 76.288 \| 253.728 \| 304.736 \| 0.647 \| 0.833 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.264 \| 110.144 \| 364.960 \| 503.072 \| 0.901 \| 0.725 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 71.136 \| 92.384 \| 294.432 \| 393.056 \| 0.770 \| 0.749 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.200 \| 111.360 \| 365.152 \| 512.640 \| 0.891 \| 0.712 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.264 \| 110.240 \| 365.088 \| 504.224 \| 0.900 \| 0.724 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 
135.680 \| 230.336 \| 613.472 \| 816.896 \| 0.589 \| 0.751 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 100.256 \| 165.088 \| 502.144 \| 676.480 \| 0.607 \| 0.742 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 135.008 \| 232.480 \| 613.184 \| 836.672 \| 0.581 \| 0.733 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 135.232 \| 230.624 \| 613.536 \| 827.136 \| 0.586 \| 0.742 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1324.064 \| 1378.688 \| 3631.808 \| 5308.384 \| 0.960 \| 0.684 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 731.776 \| 826.688 \| 2263.168 \| 3241.344 \| 0.885 \| 0.698 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1316.128 \| 1403.200 \| 3625.088 \| 5550.688 \| 0.938 \| 0.653 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1311.904 \| 1378.880 \| 3616.320 \| 5353.696 \| 0.951 \| 0.675 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1837.856 \| 2887.392 \| 6121.632 \| 8586.656 \| 0.637 \| 0.713 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1066.976 \| 1654.368 \| 3843.136 \| 5291.040 \| 0.645 \| 0.726 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1854.208 \| 2896.832 \| 6130.112 \| 8745.984 \| 0.640 \| 0.701 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1860.512 \| 2889.344 \| 6135.648 \| 8750.592 \| 0.644 \| 0.701 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 60.640 \| 71.552 \| 315.968 \| 296.512 \| 0.847 \| 1.066 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 50.784 \| 71.040 \| 284.288 \| 258.880 \| 0.715 \| 1.098 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 61.312 \| 72.704 \| 315.680 \| 302.016 \| 0.843 \| 1.045 \| \| head_bias \| None \| torch.bfloat16 \| 
(16, 16, 512, 16, 512, 64) \| 60.800 \| 71.776 \| 316.320 \| 297.152 \| 0.847 \| 1.065 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.576 \| 144.416 \| 580.576 \| 535.936 \| 0.586 \| 1.083 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 76.064 \| 123.648 \| 553.344 \| 481.376 \| 0.615 \| 1.150 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.160 \| 145.248 \| 581.024 \| 540.000 \| 0.579 \| 1.076 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.512 \| 143.552 \| 581.088 \| 535.776 \| 0.589 \| 1.085 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.152 \| 209.408 \| 798.400 \| 868.704 \| 0.903 \| 0.919 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 127.552 \| 168.800 \| 650.816 \| 663.328 \| 0.756 \| 0.981 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.376 \| 211.360 \| 798.080 \| 895.552 \| 0.896 \| 0.891 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.440 \| 208.576 \| 797.888 \| 873.152 \| 0.908 \| 0.914 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 257.536 \| 441.760 \| 1408.960 \| 1514.720 \| 0.583 \| 0.930 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 179.328 \| 312.096 \| 1170.368 \| 1177.472 \| 0.575 \| 0.994 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 259.264 \| 446.944 \| 1408.768 \| 1530.400 \| 0.580 \| 0.921 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 258.080 \| 440.480 \| 1408.864 \| 1514.144 \| 0.586 \| 0.930 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2595.808 \| 2771.456 \| 7616.704 \| 10405.248 \| 0.937 \| 0.732 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 1435.744 \| 1610.336 \| 4927.520 \| 6220.000 \| 0.892 \| 0.792 \| \| 
relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2595.264 \| 2745.056 \| 7611.232 \| 10631.392 \| 0.945 \| 0.716 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2576.256 \| 2735.456 \| 7626.400 \| 10346.976 \| 0.942 \| 0.737 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3679.744 \| 5634.816 \| 13077.056 \| 17182.528 \| 0.653 \| 0.761 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 2099.360 \| 3250.176 \| 8589.664 \| 10236.672 \| 0.646 \| 0.839 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3676.800 \| 5716.288 \| 13073.088 \| 17311.071 \| 0.643 \| 0.755 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3679.136 \| 5570.496 \| 13070.720 \| 17192.863 \| 0.660 \| 0.760 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.600 \| 71.008 \| 272.320 \| 300.000 \| 0.868 \| 0.908 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 50.176 \| 65.344 \| 241.568 \| 258.912 \| 0.768 \| 0.933 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.120 \| 72.512 \| 272.672 \| 305.408 \| 0.843 \| 0.893 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.248 \| 71.136 \| 272.640 \| 301.120 \| 0.861 \| 0.905 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.872 \| 146.784 \| 466.912 \| 496.832 \| 0.571 \| 0.940 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 76.704 \| 115.072 \| 435.584 \| 462.112 \| 0.667 \| 0.943 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.392 \| 147.392 \| 466.656 \| 504.448 \| 0.566 \| 0.925 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.360 \| 146.688 \| 466.656 \| 499.040 \| 0.568 \| 0.935 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 189.024 \| 207.584 \| 684.768 \| 
873.568 \| 0.911 \| 0.784 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 126.944 \| 164.288 \| 536.192 \| 645.984 \| 0.773 \| 0.830 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 188.768 \| 209.760 \| 684.096 \| 897.504 \| 0.900 \| 0.762 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 189.408 \| 207.776 \| 685.024 \| 876.384 \| 0.912 \| 0.782 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 259.168 \| 449.536 \| 1167.936 \| 1433.280 \| 0.577 \| 0.815 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 180.000 \| 305.312 \| 928.000 \| 1113.920 \| 0.590 \| 0.833 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 258.464 \| 455.136 \| 1167.808 \| 1462.848 \| 0.568 \| 0.798 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 257.824 \| 450.208 \| 1167.744 \| 1448.000 \| 0.573 \| 0.806 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2598.368 \| 2729.120 \| 7134.400 \| 10381.632 \| 0.952 \| 0.687 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 1435.456 \| 1591.040 \| 4424.768 \| 6035.808 \| 0.902 \| 0.733 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2594.752 \| 2725.952 \| 7128.384 \| 10822.496 \| 0.952 \| 0.659 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2597.888 \| 2716.960 \| 7101.568 \| 10385.440 \| 0.956 \| 0.684 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3647.648 \| 5581.632 \| 12089.952 \| 16667.233 \| 0.654 \| 0.725 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 2093.952 \| 3241.440 \| 7579.392 \| 9847.936 \| 0.646 \| 0.770 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3650.528 \| 5650.688 \| 12105.568 \| 16963.680 \| 0.646 \| 0.714 \| \| head_bias \| None \| torch.bfloat16 \| 
(16, 16, 4096, 2, 4096, 128) \| 3680.064 \| 5585.312 \| 12117.504 \| 16935.040 \| 0.659 \| 0.716 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505 Approved by: https://github.com/Chillee	2024-09-10 09:30:02 +00:00
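The `fwd_speedup` and `bwd_speedup` columns in the sweep above appear to be the eager time divided by the compiled time; that formula is inferred from the numbers and is not stated in the commit message. A minimal sketch that recomputes two rows copied from the "after this PR" table (the `speedup` helper and the `rows` layout are hypothetical, for illustration only):

```python
def speedup(eager_time: float, compiled_time: float) -> float:
    """Ratio of eager to compiled time; values below 1.0 mean compiled took longer."""
    return eager_time / compiled_time


# (shape, fwd_eager, fwd_compiled, bwd_eager, bwd_compiled), copied from the sweep
# rows with score_mod=None and mask_mod=None.
rows = [
    ((2, 16, 512, 16, 512, 64), 15.680, 17.056, 64.544, 73.376),
    ((2, 16, 1024, 2, 1024, 64), 32.096, 31.904, 118.720, 231.968),
]

for shape, fwd_eager, fwd_compiled, bwd_eager, bwd_compiled in rows:
    fwd = speedup(fwd_eager, fwd_compiled)
    bwd = speedup(bwd_eager, bwd_compiled)
    print(f"{shape}: fwd_speedup={fwd:.3f} bwd_speedup={bwd:.3f}")
```

Rounded to three decimals, these ratios match the `fwd_speedup` and `bwd_speedup` values listed for the same rows.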
Roy Hvaara	23b1486185	[MPS] Allow nan mean reduction in `nll_loss` (#135434 ) This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162. Fixes #134431 Ref #64572 #119108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434 Approved by: https://github.com/malfet	2024-09-10 08:37:59 +00:00
Victor Tao	9902b349cb	[Inductor] Make static_input_idxs a set for faster lookup (#135314 ) `static_input_idxs` is only used for lookups. With large models the list is large, and the lookups take over a millisecond in some cases. Profile before change: <img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e"> Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph <img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314 Approved by: https://github.com/oulgen	2024-09-10 07:27:55 +00:00
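The change above relies on membership tests being a linear scan on a list but a hash lookup on a set. A small standalone sketch of that difference follows; `count_static`, the sizes, and the index pattern are made up for illustration and are not taken from Inductor:

```python
import timeit


def count_static(static_input_idxs, n_inputs: int) -> int:
    """Count how many input positions are marked static (one membership test each)."""
    return sum(1 for i in range(n_inputs) if i in static_input_idxs)


n_inputs = 2_000  # hypothetical number of graph inputs
static_list = list(range(0, n_inputs, 2))  # every other input is "static"
static_set = set(static_list)

# Both containers give the same answer; only the lookup cost differs.
assert count_static(static_list, n_inputs) == count_static(static_set, n_inputs)

list_time = timeit.timeit(lambda: count_static(static_list, n_inputs), number=5)
set_time = timeit.timeit(lambda: count_static(static_set, n_inputs), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Absolute timings depend on the machine, so the sketch prints them rather than asserting a particular ratio.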
Tugsbayasgalan Manlaibaatar	5a9ac83e94	Fix doc (#135551 ) Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551 Approved by: https://github.com/yushangdi ghstack dependencies: #135549	2024-09-10 07:18:44 +00:00
Sam Larsen	1adf28a5c0	[inductor] print triton float64 constants correctly (#135260 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260 Approved by: https://github.com/jansel	2024-09-10 07:05:02 +00:00
Tugsbayasgalan Manlaibaatar	c18052da0e	Add some minor doc improvement and ban using training IR for unflattener (#135549 ) Title Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549 Approved by: https://github.com/yushangdi	2024-09-10 06:48:42 +00:00
Yichen Yan	c0d2f991b1	Increase `TRITON_MAX_BLOCK['X']` (#135181 ) Fixes #135028 As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181 Approved by: https://github.com/jansel	2024-09-10 05:54:37 +00:00
Thomas Bohnstingl	e889252493	Implementation of scan (#134102 ) This operation is meant to be the counterpart to `associative_scan`, but it can operate on non-associative functions. @ydwu4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102 Approved by: https://github.com/ydwu4	2024-09-10 04:51:16 +00:00
Avik Chaudhuri	6546c6186d	do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518 ) Test Plan: added test Differential Revision: D62395371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518 Approved by: https://github.com/zhxchen17	2024-09-10 03:47:36 +00:00
Chien-Chin Huang	1d9fefff19	[DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535 ) Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue. Fixes: https://github.com/pytorch/pytorch/issues/133415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535 Approved by: https://github.com/wz337	2024-09-10 03:10:00 +00:00
zengxian	7ec17b49cf	Fix dynamo benchmark skip logic for cpu device (#135193 ) Fixes #132380, adjust torchbench and huggingface skip models list, then we can remove `--no-skip` when running benchmarks on 3 suites. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193 Approved by: https://github.com/chuanqi129, https://github.com/jansel	2024-09-10 03:02:19 +00:00
Wu, Chunyuan	146921007a	[inductor] [cpp] fix the input contiguous check in max-autotune (#134982 ) ## Description Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm. In this PR, we check whether input is contiguous using the following way: If it has `FixedLayout`, we know the accurate strides. For `FlexibleLayout`, if its data is a `ComputedBuffer`, we could get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use GEMM template as we can't infer whether it's contiguous. ## Additional context The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails: `d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)` And it finally runs into this `copy_input` and returns a `FlexibleLayout`. `d14fe3ffed/torch/_inductor/ir.py (L4722)` When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1` but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing accuracy issue in this model. The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](`d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)`) which calls [slice_nd](`d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)`) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](`d14fe3ffed/torch/_inductor/ir.py (L2288)`) invokes [decide_layout](`d14fe3ffed/torch/_inductor/ir.py (L2135)`) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-10 02:47:38 +00:00
Yueming Hao	a71e5509bc	[inductor]Add profiler to operatorbench (#135515 ) Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure. <img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515 Approved by: https://github.com/shunting314	2024-09-10 02:33:30 +00:00
Guilherme Leobas	136e28f616	Enable forward AD in functional.affine_grid (#135494 ) Fixes #121411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494 Approved by: https://github.com/zou3519, https://github.com/soulitzer	2024-09-10 00:07:07 +00:00
Jeff Daily	39a61795e3	remove amax_ptr from scaled_gemm (#135421 ) amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well. This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-09-09 23:04:36 +00:00
Scott Wolchok	b4feec9782	[xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529 ) Building XNNPACK as a static library has some issues because of multiple global params floating around. Let's try to get rid of it in xplat and see how it fares. Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529 Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign	2024-09-09 22:47:01 +00:00
Yanbo Liang	d81731615f	[Dynamo] Adding CallFunctionNoArgsSource and (#135425 ) CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device() Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425 Approved by: https://github.com/anijain2305	2024-09-09 22:46:00 +00:00
shubhambhokare1	e2f9a83b85	[ONNX] Drop final None values as inputs for nodes in exporter graph (#135520 ) When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the onnx graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-09 22:28:41 +00:00
PyTorch MergeBot	70a65a8bd5	Revert "NJT <-> padded dense conversions (#125947 )" This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942. Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test `09a5e88bef`, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))	2024-09-09 22:01:09 +00:00
PyTorch MergeBot	689d278543	Revert "Add `__init__.py` to shape inference folder. (#135461 )" This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715. Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))	2024-09-09 21:55:13 +00:00
atalman	9b764491e3	Use upload-artifact@v4.4.0 for create_release.yml (#135528 ) Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007 Due to broken sync ``` actions/upload-artifact@v2 and actions/download-artifact@v4.1.7 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-09-09 20:48:52 +00:00
Maclyn Brandwein	cbc6b30a24	Fix broken E2E tests on Linux machines (#135394 ) Summary: I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye a super class of `ModuleNotFoundError`s), but on our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught -- `ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job: https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job Test Plan: `arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'` -> https://www.internalfb.com/sandcastle/workflow/256705178764255769 Differential Revision: D62321167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394 Approved by: https://github.com/laithsakka	2024-09-09 20:18:08 +00:00
PyTorch MergeBot	5b368de7f7	Revert "[ONNX] Update fake mode usage in onnx docs (#135512 )" This reverts commit a13c118994b4f118388d97a35abcb91a396cd437. Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))	2024-09-09 20:15:12 +00:00
Joel Schlosser	09a5e88bef	NJT <-> padded dense conversions (#125947 ) This PR: * Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values) * Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics * Note: there is currently no public API for this; design booted to a future PR TODO: * ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~ * ~~Verify that Inductor does computation fusion via test logic~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947 Approved by: https://github.com/soulitzer	2024-09-09 19:37:32 +00:00
Sahan Paliskara	a4e6a0b240	[split build] move periodic split builds into own concurrency group (#135510 ) To avoid nightly workflows cancelling each other Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510 Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-09 19:35:57 +00:00
imShZh	4ab232d0c4	Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433 ) Fixes #135432 In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type match, which is not always the case. In other words, if we try to store a `SymInt`, the current implementation assumes the tensor's dtype is an integer type such as `torch.int32` or `torch.int64`. And if we try to store a `SymFloat`, it assumes the tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store a `SymInt`, in which case that assumption is wrong. This PR stores symbolic numbers by the tensor's scalar type by wrapping `SymInt` and `SymFloat`'s guarded number into a PyObject. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433 Approved by: https://github.com/ezyang	2024-09-09 19:32:18 +00:00
Sergii Dymchenko	2032f107d7	Don't try to tag s390x docker images (#135509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509 Approved by: https://github.com/atalman	2024-09-09 19:07:48 +00:00
rzou	5f7d956362	Fix bugs blocking flipping the default layout constraint for custom ops (#135391 ) Fixes two things: - For regular PyTorch ops, the default layout constraint tag is always flexible_layout. This was a bug with #135238 - Mark the new quantized _wrapped_linear_prepack ops as flexible_layout. The metas for these are incorrect, I didn't want to fix them (and changing the default requires the metas actually be correct). Test Plan: - The next PR up in the stack. The PRs are split because the next one is riskier. foo Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391 Approved by: https://github.com/albanD	2024-09-09 18:24:21 +00:00
shubhambhokare1	a13c118994	[ONNX] Update fake mode usage in onnx docs (#135512 ) Update fake mode usage in onnx docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512 Approved by: https://github.com/justinchuby	2024-09-09 18:10:37 +00:00
Chien-Chin Huang	21241bfeee	[CP] Extend CP to support load-balancing shards (#132442 ) This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442 Approved by: https://github.com/wconstab	2024-09-09 18:04:38 +00:00
PyTorch MergeBot	73a6fc6e30	Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314 )" This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da. Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](`011cae9570`) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))	2024-09-09 17:33:01 +00:00
Roy Hvaara	09287e3af4	[MPS] Add regression test for `fft.fftfreq` (#135440 ) The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it. Fixes #135223 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440 Approved by: https://github.com/ezyang	2024-09-09 17:12:36 +00:00
Bin Bao	16c3b8f87c	[AOTI] Fix assert_function call in cpu autotune template (#135086 ) Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086 Approved by: https://github.com/chenyang78, https://github.com/angelayi ghstack dependencies: #134857	2024-09-09 16:54:12 +00:00
Bin Bao	9c6dff4941	[AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857 ) Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857 Approved by: https://github.com/angelayi	2024-09-09 16:54:12 +00:00
atalman	0eb425a563	[Release] Apply Release changes scripts after release 2.4 (#135495 ) Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495 Approved by: https://github.com/kit1980	2024-09-09 16:49:04 +00:00
Victor Tao	011cae9570	[Inductor] Make static_input_idxs a set for faster lookup (#135314 ) `static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases. Profile before change: <img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e"> Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph <img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314 Approved by: https://github.com/oulgen	2024-09-09 16:24:58 +00:00
CaoE	dfb2b661f7	Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525 ) Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-09 15:31:38 +00:00
Roy Hvaara	5a69e0ebbe	[MPS] Update decorator comments with issue ref (#135448 ) Updating the comments with references to better places for context now that the bugs have been identified. xref #135442 #135447 #134184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448 Approved by: https://github.com/ezyang	2024-09-09 15:18:52 +00:00
Xavier Dupré	5e145861f2	[ONNX] Improves documentation of ONNX exporter (#135372 ) The PR updates the documentation to reflect the changes introduced in pytorch 2.5 and related to onnx exporter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-09 15:09:01 +00:00
Yuxin Wu	c35b953531	Fix wrong error msg (#135423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423 Approved by: https://github.com/ezyang	2024-09-09 13:28:31 +00:00
PHLens	dced0d6d9f	Add `__init__.py` to shape inference folder. (#135461 ) Fixes #135196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461 Approved by: https://github.com/ezyang	2024-09-09 13:27:58 +00:00
Jiong Gong	c0436c5701	[inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686 ) (#135438 ) Fix #134686. PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224: SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling AUTOTUNE linear_unary(12544x3072, 768x3072, 768) cpp_packed_gemm_2 2.9371 ms 100.0% _linear_pointwise 3.1584 ms 93.0% But it is slower than Aten in the e2e run due to different cache behavior. The access to the input data (12544x3072) is LLC latency bound and bottlenecks seen due to the memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of input data from different processors that share the input data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438 Approved by: https://github.com/leslie-fang-intel	2024-09-09 05:16:02 +00:00
cyy	60e8dc4374	Check function declarations in Caffe2 code (#134925 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925 Approved by: https://github.com/ezyang	2024-09-09 05:03:29 +00:00
xingyunjohn1	e6c3f58584	Fix example: Address broadcasting error in the addition of `attn_bias… (#135427 ) …` and `attn_mask`, and correct device assignment for newly created variables in the method. Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method. 1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired. 2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device. This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits. Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> @mikaylagawarecki provided a more elegant implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427 Approved by: https://github.com/ezyang	2024-09-09 03:47:34 +00:00
PhilipMay	90e12cf63d	Fix return type of `nansum` example. (#135435 ) One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435 Approved by: https://github.com/ezyang	2024-09-09 03:34:52 +00:00
Zhou, Lingzhi	44c08f4984	[Partitioner] Query whether nodes exist in graph faster (#135316 ) Checking whether a node exists in graph.nodes (a linked list) takes too long. Use graph._find_nodes_lookup_table (a hash table) instead to speed this up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316 Approved by: https://github.com/ezyang	2024-09-09 03:34:02 +00:00
Rafal Litka	b6186353c6	enable lazy_init for hpu (#135203 ) enables lazy_init for hpu device Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203 Approved by: https://github.com/ezyang	2024-09-09 03:32:20 +00:00
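
The shard assignment described in the [CP] commit 21241bfeee (#132442) above can be sketched in a few lines. This is only an illustration of the arithmetic in that commit message, not the code from the PR: the function names are hypothetical, and the cost model (a causal mask, where query shard `i` attends to `i + 1` key shards) is an assumption added here to show why pairing an early shard with a late one might balance the work.

```python
def load_balanced_shard_ids(rank: int, world_size: int) -> tuple[int, int]:
    """Return the two shard indices owned by `rank` (hypothetical helper).

    The sequence is split into 2 * world_size shards; each rank takes
    shard `rank` and shard `2 * world_size - rank - 1`.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return rank, 2 * world_size - rank - 1


def causal_cost(shard_id: int) -> int:
    # Assumed cost model: under a causal mask, query shard i attends to
    # key shards 0..i, i.e. i + 1 shards of work.
    return shard_id + 1


world_size = 4
assignment = {r: load_balanced_shard_ids(r, world_size) for r in range(world_size)}
work = {r: sum(causal_cost(s) for s in shards) for r, shards in assignment.items()}

for r in range(world_size):
    print(f"rank {r}: shards {assignment[r]}, relative work {work[r]}")
```

Under that assumed cost model every rank ends up with the same relative work (`2 * world_size + 1`), whereas a contiguous split would give the last rank far more than the first.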

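The trailing-`None` stripping described in the [ONNX] commit e2f9a83b85 (#135520) above amounts to trimming a list from the right. The helper below is a hypothetical, standalone sketch of that idea; it does not use the exporter's real data structures, and it keeps `None` values that sit between real inputs because their position still matters.

```python
from typing import Optional, Sequence


def strip_trailing_none(inputs: Sequence[Optional[str]]) -> list:
    """Drop the run of None values at the end of `inputs` (hypothetical helper)."""
    end = len(inputs)
    # Walk backwards until the last entry that is not None.
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return list(inputs[:end])


# Interior None values are preserved; only the trailing ones are removed.
print(strip_trailing_none(["x", None, "w", None, None]))
```
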
3453 changed files with 126229 additions and 69524 deletions

3

.buckconfig.oss


 @ -21,6 +21,3 @@
   cxx = /usr/bin/clang++
   cxxpp = /usr/bin/clang++
   ld = /usr/bin/clang++
 [project]
   default_flavors_mode=all

									
										1

.ci/docker/android/AndroidManifest.xml
									
											
				@ -1 +0,0 @@

				<manifest package="org.pytorch.deps" />

									
										66

.ci/docker/android/build.gradle
									
											
				@ -1,66 +0,0 @@

				buildscript {

				    ext {

				        minSdkVersion = 21

				        targetSdkVersion = 28

				        compileSdkVersion = 28

				        buildToolsVersion = '28.0.3'

				        coreVersion = "1.2.0"

				        extJUnitVersion = "1.1.1"

				        runnerVersion = "1.2.0"

				        rulesVersion = "1.2.0"

				        junitVersion = "4.12"

				    }

				    repositories {

				        google()

				        mavenLocal()

				        mavenCentral()

				        jcenter()

				    }

				    dependencies {

				        classpath 'com.android.tools.build:gradle:4.1.2'

				        classpath 'com.vanniktech:gradle-maven-publish-plugin:0.14.2'

				    }

				}

				repositories {

				    google()

				    jcenter()

				}

				apply plugin: 'com.android.library'

				android {

				    compileSdkVersion rootProject.compileSdkVersion

				    buildToolsVersion rootProject.buildToolsVersion

				    defaultConfig {

				        minSdkVersion minSdkVersion

				        targetSdkVersion targetSdkVersion

				    }

				    sourceSets {

				        main {

				            manifest.srcFile 'AndroidManifest.xml'

				        }

				    }

				}

				dependencies {

				    implementation 'com.android.support:appcompat-v7:28.0.0'

				    implementation 'androidx.appcompat:appcompat:1.0.0'

				    implementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'

				    implementation 'com.google.code.findbugs:jsr305:3.0.1'

				    implementation 'com.facebook.soloader:nativeloader:0.10.5'

				    implementation 'junit:junit:' + rootProject.junitVersion

				    implementation 'androidx.test:core:' + rootProject.coreVersion

				    implementation 'junit:junit:' + rootProject.junitVersion

				    implementation 'androidx.test:core:' + rootProject.coreVersion

				    implementation 'androidx.test.ext:junit:' + rootProject.extJUnitVersion

				    implementation 'androidx.test:rules:' + rootProject.rulesVersion

				    implementation 'androidx.test:runner:' + rootProject.runnerVersion

				}

6

.ci/docker/aotriton_version.txt


 @ -1,5 +1,5 @@
 .6b
 .7b
 manylinux_2_17
 rocm6.2
 f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
 e4ab195d2bd19e939c675a13280c29714c6ef9f2cf420690da150fa0cac043b1
 be04068c3c0857a4cfd17d7e39e71d0423ebac2
 e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291

									
										58

.ci/docker/build.sh
									
												
				@ -244,16 +244,6 @@ case "$image" in

				    CONDA_CMAKE=yes

				    ONNX=yes

				    ;;

				  pytorch-linux-focal-py3-clang9-android-ndk-r21e)

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=9

				    LLVMDEV=yes

				    PROTOBUF=yes

				    ANDROID=yes

				    ANDROID_NDK_VERSION=r21e

				    GRADLE_VERSION=6.8.3

				    NINJA_VERSION=1.9.0

				    ;;

				  pytorch-linux-focal-py3.9-clang10)

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				@ -275,6 +265,7 @@ case "$image" in

				    SWIFTSHADER=yes

				    CONDA_CMAKE=yes

				    TRITON=yes

				    GRAPHVIZ=yes

				    ;;

				  pytorch-linux-focal-py3.9-gcc9)

				    ANACONDA_PYTHON_VERSION=3.9

				@ -286,18 +277,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-1-py3)

				    ANACONDA_PYTHON_VERSION=3.8

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.0

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.8

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				@ -307,6 +287,17 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.2

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-xpu-2024.0-py3)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				@ -355,6 +346,12 @@ case "$image" in

				    CONDA_CMAKE=yes

				    VISION=yes

				    ;;

				  pytorch-linux-jammy-py3-clang18-asan)

				    ANACONDA_PYTHON_VERSION=3.10

				    CLANG_VERSION=18

				    CONDA_CMAKE=yes

				    VISION=yes

				    ;;

				  pytorch-linux-jammy-py3.9-gcc11)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				@ -379,6 +376,14 @@ case "$image" in

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				    HALIDE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.12-triton-cpu)

				    CUDA_VERSION=12.4

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				    TRITON_CPU=yes

				    ;;

				  pytorch-linux-focal-linter)

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

				@ -400,9 +405,6 @@ case "$image" in

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    # snadampal: skipping sccache due to the following issue

				    # https://github.com/pytorch/pytorch/issues/121559

				    SKIP_SCCACHE_INSTALL=yes

				    # snadampal: skipping llvm src build install because the current version

				    # from pytorch/llvm:9.0.1 is x86 specific

				    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				@ -415,9 +417,6 @@ case "$image" in

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    # snadampal: skipping sccache due to the following issue

				    # https://github.com/pytorch/pytorch/issues/121559

				    SKIP_SCCACHE_INSTALL=yes

				    # snadampal: skipping llvm src build install because the current version

				    # from pytorch/llvm:9.0.1 is x86 specific

				    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				@ -494,8 +493,6 @@ docker build \

				       --build-arg "CUDA_VERSION=${CUDA_VERSION}" \

				       --build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \

				       --build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \

				       --build-arg "ANDROID=${ANDROID}" \

				       --build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \

				       --build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \

				       --build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \

				       --build-arg "SWIFTSHADER=${SWIFTSHADER}" \

				@ -509,6 +506,7 @@ docker build \

				       --build-arg "UCC_COMMIT=${UCC_COMMIT}" \

				       --build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \

				       --build-arg "TRITON=${TRITON}" \

				       --build-arg "TRITON_CPU=${TRITON_CPU}" \

				       --build-arg "ONNX=${ONNX}" \

				       --build-arg "DOCS=${DOCS}" \

				       --build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

2

.ci/docker/ci_commit_pins/executorch.txt


 @ -1 +1 @@
 cd1c833b079adb324871dcbbe75b43d42ffc0ade
 c382df0d2b2ef383d57998a61187cfefcb26e3

1

.ci/docker/ci_commit_pins/triton-cpu.txt Normal file


 @ -0,0 +1 @@
 c7711371cace304afe265c1ffa906415ab82fc66

2

.ci/docker/ci_commit_pins/triton-xpu.txt


 @ -1 +1 @@
 cc981feba10a3f4c2e46f3fe368e8fcf5f5643df
 b14bf5593cf58a8541f3e6b9125600a867d4ef

2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 b6a61e7df814ba806f498f8bb3160f84b120c
 cf34004b8a67d290a962da166f5aa2fc66751326

									
										112

.ci/docker/common/install_android.sh
									
											
				@ -1,112 +0,0 @@

				#!/bin/bash

				set -ex

				[ -n "${ANDROID_NDK}" ]

				_https_amazon_aws=https://ossci-android.s3.amazonaws.com

				apt-get update

				apt-get install -y --no-install-recommends autotools-dev autoconf unzip

				apt-get autoclean && apt-get clean

				rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

				pushd /tmp

				curl -Os --retry 3 $_https_amazon_aws/android-ndk-${ANDROID_NDK}-linux-x86_64.zip

				popd

				_ndk_dir=/opt/ndk

				mkdir -p "$_ndk_dir"

				unzip -qo /tmp/android*.zip -d "$_ndk_dir"

				_versioned_dir=$(find "$_ndk_dir/" -mindepth 1 -maxdepth 1 -type d)

				mv "$_versioned_dir"/* "$_ndk_dir"/

				rmdir "$_versioned_dir"

				rm -rf /tmp/*

				# Install OpenJDK

				# https://hub.docker.com/r/picoded/ubuntu-openjdk-8-jdk/dockerfile/

				sudo apt-get update && \

				    apt-get install -y openjdk-8-jdk && \

				    apt-get install -y ant && \

				    apt-get clean && \

				    rm -rf /var/lib/apt/lists/* && \

				    rm -rf /var/cache/oracle-jdk8-installer;

				# Fix certificate issues, found as of

				# https://bugs.launchpad.net/ubuntu/+source/ca-certificates-java/+bug/983302

				sudo apt-get update && \

				    apt-get install -y ca-certificates-java && \

				    apt-get clean && \

				    update-ca-certificates -f && \

				    rm -rf /var/lib/apt/lists/* && \

				    rm -rf /var/cache/oracle-jdk8-installer;

				export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

				# Installing android sdk

				# https://github.com/circleci/circleci-images/blob/staging/android/Dockerfile.m4

				_tmp_sdk_zip=/tmp/android-sdk-linux.zip

				_android_home=/opt/android/sdk

				rm -rf $_android_home

				sudo mkdir -p $_android_home

				curl --silent --show-error --location --fail --retry 3 --output /tmp/android-sdk-linux.zip $_https_amazon_aws/android-sdk-linux-tools3859397-build-tools2803-2902-platforms28-29.zip

				sudo unzip -q $_tmp_sdk_zip -d $_android_home

				rm $_tmp_sdk_zip

				sudo chmod -R 777 $_android_home

				export ANDROID_HOME=$_android_home

				export ADB_INSTALL_TIMEOUT=120

				export PATH="${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"

				echo "PATH:${PATH}"

				# Installing Gradle

				echo "GRADLE_VERSION:${GRADLE_VERSION}"

				_gradle_home=/opt/gradle

				sudo rm -rf $gradle_home

				sudo mkdir -p $_gradle_home

				curl --silent --output /tmp/gradle.zip --retry 3 $_https_amazon_aws/gradle-${GRADLE_VERSION}-bin.zip

				sudo unzip -q /tmp/gradle.zip -d $_gradle_home

				rm /tmp/gradle.zip

				sudo chmod -R 777 $_gradle_home

				export GRADLE_HOME=$_gradle_home/gradle-$GRADLE_VERSION

				alias gradle="${GRADLE_HOME}/bin/gradle"

				export PATH="${GRADLE_HOME}/bin/:${PATH}"

				echo "PATH:${PATH}"

				gradle --version

				mkdir /var/lib/jenkins/gradledeps

				cp build.gradle /var/lib/jenkins/gradledeps

				cp AndroidManifest.xml /var/lib/jenkins/gradledeps

				pushd /var/lib/jenkins

				export GRADLE_LOCAL_PROPERTIES=gradledeps/local.properties

				rm -f $GRADLE_LOCAL_PROPERTIES

				echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES

				echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES

				chown -R jenkins /var/lib/jenkins/gradledeps

				chgrp -R jenkins /var/lib/jenkins/gradledeps

				sudo -H -u jenkins $GRADLE_HOME/bin/gradle -Pandroid.useAndroidX=true -p /var/lib/jenkins/gradledeps -g /var/lib/jenkins/.gradle --refresh-dependencies --debug --stacktrace assemble

				chown -R jenkins /var/lib/jenkins/.gradle

				chgrp -R jenkins /var/lib/jenkins/.gradle

				popd

				rm -rf /var/lib/jenkins/.gradle/daemon

				# Cache vision models used by the test

				source "$(dirname "${BASH_SOURCE[0]}")/cache_vision_models.sh"

									
.ci/docker/common/install_aotriton.sh
				@ -4,12 +4,12 @@ set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				TARBALL='aotriton.tar.bz2'

				TARBALL='aotriton.tar.gz'

				# This read command always returns with exit code 1

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

									
.ci/docker/common/install_cache.sh
				@ -9,7 +9,12 @@ install_ubuntu() {

				  # Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``

				  apt-get install -y cargo

				  echo "Checking out sccache repo"

				  git clone https://github.com/pytorch/sccache

				  if [ -n "$CUDA_VERSION" ]; then

				      # TODO: Remove this

				      git clone https://github.com/pytorch/sccache

				  else

				      git clone https://github.com/mozilla/sccache -b v0.8.2

				  fi

				  cd sccache

				  echo "Building sccache"

				  cargo build --release

				@ -19,6 +24,10 @@ install_ubuntu() {

				  rm -rf sccache

				  apt-get remove -y cargo rustc

				  apt-get autoclean && apt-get clean

				  echo "Downloading old sccache binary from S3 repo for PCH builds"

				  curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache-0.2.14a

				  chmod 755 /opt/cache/bin/sccache-0.2.14a

				}

				install_binary() {

				@ -36,18 +45,46 @@ if [ -n "$ROCM_VERSION" ]; then

				  curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache

				else

				  ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				  # TODO: Install the pre-built binary from S3 as building from source

				  # https://github.com/pytorch/sccache has started failing mysteriously

				  # in which sccache server couldn't start with the following error:

				  #   sccache: error: Invalid argument (os error 22)

				  install_binary

				  if [ -n "$CUDA_VERSION" ]; then

				    # TODO: Install the pre-built binary from S3 as building from source

				    # https://github.com/pytorch/sccache has started failing mysteriously

				    # in which sccache server couldn't start with the following error:

				    #   sccache: error: Invalid argument (os error 22)

				    install_binary

				  else

				    install_ubuntu

				  fi

				fi

				chmod a+x /opt/cache/bin/sccache

				function write_sccache_stub() {

				  # Unset LD_PRELOAD for ps because of asan + ps issues

				  # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589

				  printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n  exec sccache $(which $1) \"\$@\"\nelse\n  exec $(which $1) \"\$@\"\nfi" > "/opt/cache/bin/$1"

				  if [ $1 == "gcc" ]; then

				  # Do not call sccache recursively when dumping preprocessor argument

				  # For some reason it's very important for the first cached nvcc invocation

				    cat > "/opt/cache/bin/$1" <<EOF

				#!/bin/sh

				if [ "\$1" = "-E" ] || [ "\$2" = "-E" ]; then

				  exec $(which $1) "\$@"

				elif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then

				  exec sccache $(which $1) "\$@"

				else

				  exec $(which $1) "\$@"

				fi

				EOF

				  else

				    cat > "/opt/cache/bin/$1" <<EOF

				#!/bin/sh

				if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then

				  exec sccache $(which $1) "\$@"

				else

				  exec $(which $1) "\$@"

				fi

				EOF

				  fi

				  chmod a+x "/opt/cache/bin/$1"

				}

									
.ci/docker/common/install_clang.sh
				@ -13,11 +13,18 @@ if [ -n "$CLANG_VERSION" ]; then

				  elif [[ $UBUNTU_VERSION == 22.04 ]]; then

				    # work around ubuntu apt-get conflicts

				    sudo apt-get -y -f install

				    wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add  -

				    if [[ $CLANG_VERSION == 18 ]]; then

				      apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"

				    fi

				  fi

				  sudo apt-get update

				  apt-get install -y --no-install-recommends clang-"$CLANG_VERSION"

				  apt-get install -y --no-install-recommends llvm-"$CLANG_VERSION"

				  if [[ $CLANG_VERSION -ge 18 ]]; then

				    apt-get install -y libomp-${CLANG_VERSION}-dev libclang-rt-${CLANG_VERSION}-dev clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"

				  else

				    apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"

				  fi

				  # Install dev version of LLVM.

				  if [ -n "$LLVMDEV" ]; then

									
.ci/docker/common/install_conda.sh
				@ -65,23 +65,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then

				  # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README

				  if [[ $(uname -m) == "aarch64" ]]; then

				    CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then

				      NUMPY_VERSION=1.24.4

				    else

				      NUMPY_VERSION=1.26.2

				    fi

				    conda_install "openblas==0.3.25=*openmp*"

				  else

				    CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then

				      NUMPY_VERSION=1.26.0

				    else

				      NUMPY_VERSION=1.21.2

				    fi

				    conda_install "mkl=2021.4.0 mkl-include=2021.4.0"

				  fi

				  conda_install ${CONDA_COMMON_DEPS}

				  # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source

				  # and libpython-static for torch deploy

				@ -103,8 +90,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then

				  # Install some other packages, including those needed for Python test reporting

				  pip_install -r /opt/conda/requirements-ci.txt

				  pip_install numpy=="$NUMPY_VERSION"

				  pip_install -U scikit-learn

				  if [ -n "$DOCS" ]; then

				    apt-get update

									
.ci/docker/common/install_cpython.sh
				@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea

				GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py

				# Python versions to be installed in /opt/$VERSION_NO

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}

				function check_var {

				    if [ -z "$1" ]; then

				@ -22,6 +22,13 @@ function do_cpython_build {

				    check_var $py_ver

				    check_var $py_folder

				    tar -xzf Python-$py_ver.tgz

				    local additional_flags=""

				    if [ "$py_ver" == "3.13.0t" ]; then

				        additional_flags=" --disable-gil"

				        mv cpython-3.13/ cpython-3.13t/

				    fi

				    pushd $py_folder

				    local prefix="/opt/_internal/cpython-${py_ver}"

				@ -37,8 +44,10 @@ function do_cpython_build {

				        local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"

				    fi

				    # -Wformat added for https://bugs.python.org/issue17547 on Python 2.6

				    CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null

				    CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null

				    make -j40 > /dev/null

				    make install > /dev/null

				@ -69,7 +78,14 @@ function build_cpython {

				    check_var $py_ver

				    check_var $PYTHON_DOWNLOAD_URL

				    local py_ver_folder=$py_ver

				    if [ "$py_ver" = "3.13.0" ]; then

				    if [ "$py_ver" = "3.13.0t" ]; then

				        PY_VER_SHORT="3.13"

				        PYT_VER_SHORT="3.13t"

				        check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH

				        wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

				        do_cpython_build $py_ver cpython-$PYT_VER_SHORT

				    elif [ "$py_ver" = "3.13.0" ]; then

				        PY_VER_SHORT="3.13"

				        check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH

				        wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

									
.ci/docker/common/install_cuda.sh
				@ -105,7 +105,7 @@ function install_121 {

				}

				function install_124 {

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run

				@ -137,6 +137,39 @@ function install_124 {

				  ldconfig

				}

				function install_126 {

				  echo "Installing CUDA 12.6.2 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				  rm -rf /usr/local/cuda-12.6 /usr/local/cuda

				  # install CUDA 12.6.2 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run

				  chmod +x cuda_12.6.2_560.35.03_linux.run

				  ./cuda_12.6.2_560.35.03_linux.run --toolkit --silent

				  rm -f cuda_12.6.2_560.35.03_linux.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_062

				  ldconfig

				}

				function prune_118 {

				    echo "Pruning CUDA 11.8 and cuDNN"

				    #####################################################################################

				@ -227,12 +260,46 @@ function prune_124 {

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.1 prune visual tools

				  # CUDA 12.4 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.4/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/

				}

				function prune_126 {

				  echo "Pruning CUDA 12.6"

				  #####################################################################################

				  # CUDA 12.6 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then

				      export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.6 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.6/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				@ -243,6 +310,8 @@ do

				        ;;

				    12.4) install_124; prune_124

				        ;;

				    12.6) install_126; prune_126

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac
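The dispatch added for `12.6` follows the same pattern as the existing `12.4` arm: each positional argument selects an install function followed by its matching prune function. A minimal sketch of that loop is shown here under stated assumptions: the four functions are stand-in stubs that only print a message, the `shift` and `done` lines are assumed because the hunk ends at `esac`, and the wrapper name `handle_cuda_args` is hypothetical and uses `return 1` where the script uses `exit 1`.

```shell
#!/bin/bash
# Stand-in stubs (assumption): the real functions download and prune CUDA.
install_124() { echo "install 12.4"; }
prune_124()   { echo "prune 12.4"; }
install_126() { echo "install 12.6"; }
prune_126()   { echo "prune 12.6"; }

# Hypothetical wrapper around the argument loop so it can be called repeatedly.
handle_cuda_args() {
    while test $# -gt 0; do
        case "$1" in
            12.4) install_124; prune_124 ;;
            12.6) install_126; prune_126 ;;
            *) echo "bad argument $1"; return 1 ;;
        esac
        # Assumed: move on to the next positional argument.
        shift
    done
}

handle_cuda_args 12.6
```

Passing several versions, for example `handle_cuda_args 12.4 12.6`, would run the install and prune pair for each one in order, and an unknown version stops the loop with a non-zero status.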

									
.ci/docker/common/install_cuda_aarch64.sh
				@ -5,19 +5,19 @@ set -ex

				NCCL_VERSION=v2.21.5-1

				function install_cusparselt_052 {

				function install_cusparselt_062 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz

				    tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz

				    cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz

				    tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz

				    cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_124 {

				  echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run

				@ -44,7 +44,7 @@ function install_124 {

				  cd ..

				  rm -rf nccl

				  install_cusparselt_052

				  install_cusparselt_062

				  ldconfig

				}

									
.ci/docker/common/install_cusparselt.sh
				@ -5,7 +5,7 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-4]$ ]]; then

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
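The only change in this hunk widens the character class from `[2-4]` to `[2-6]`, so CUDA 12.5 and 12.6 now take the same branch as 12.2 through 12.4. The sketch below isolates that test for experimentation. The function name `cuda_matches_12_2_to_12_6` is hypothetical and does not appear in the script.

```shell
#!/bin/bash
# Hypothetical helper (not part of install_cusparselt.sh): reports whether a
# CUDA version string is accepted by the widened pattern.
cuda_matches_12_2_to_12_6() {
    local cuda_version="$1"
    # Keep only the first four characters, e.g. "12.6.2" becomes "12.6",
    # then require "12." followed by a single digit from 2 to 6.
    if [[ ${cuda_version:0:4} =~ ^12\.[2-6]$ ]]; then
        echo "match"
    else
        echo "no match"
    fi
}

cuda_matches_12_2_to_12_6 "12.6.2"   # accepted only after this change
cuda_matches_12_2_to_12_6 "12.4"     # accepted before and after
cuda_matches_12_2_to_12_6 "11.8.0"   # rejected
```

Because the comparison uses only the first four characters, the patch component of the version never affects the result.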

									
.ci/docker/common/install_graphviz.sh
				@ -0,0 +1,16 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				if [ -n "${UBUNTU_VERSION}" ]; then

				    apt update

				    apt-get install -y graphviz

				elif [ -n "${CENTOS_VERSION}" ]; then

				    dnf update

				    dnf install -y graphviz

				else

				    echo "Unsupported Linux distribution"

				    exit 1

				fi

									
.ci/docker/common/install_miopen.sh
				@ -10,6 +10,21 @@ if [[ -z $ROCM_VERSION ]]; then

				    exit 1;

				fi

				IS_UBUNTU=0

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  ubuntu)

				    IS_UBUNTU=1

				    ;;

				  centos)

				    IS_UBUNTU=0

				    ;;

				  *)

				    echo "Unable to determine OS..."

				    exit 1

				    ;;

				esac

				# To make version comparison easier, create an integer representation.

				save_IFS="$IFS"

				IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})

				@ -57,9 +72,11 @@ MIOPEN_CMAKE_COMMON_FLAGS="

				-DMIOPEN_BUILD_DRIVER=OFF

				"

				# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version

				if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then

				    echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"

				if [[ $ROCM_INT -ge 60300 ]]; then

				    echo "ROCm 6.3+ MIOpen does not need any patches, do not build from source"

				    exit 0

				elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then

				    MIOPEN_BRANCH="release/rocm-rel-6.2-staging"

				elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then

				    echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"

				    exit 0

				@ -93,12 +110,21 @@ else

				    exit 1

				fi

				yum remove -y miopen-hip

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  apt-get remove -y miopen-hip

				else

				  yum remove -y miopen-hip

				fi

				git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}

				pushd MIOpen

				# remove .git to save disk space since CI runner was running out

				rm -rf .git

				# Don't build CK to save docker build time

				if [[ $ROCM_INT -ge 60200 ]]; then

				    sed -i '/composable_kernel/d' requirements.txt

				fi

				# Don't build MLIR to save docker build time

				# since we are disabling MLIR backend for MIOpen anyway

				if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then

				@ -111,10 +137,15 @@ cmake -P install_deps.cmake --minimum

				# clean up since CI runner was running out of disk space

				rm -rf /tmp/*

				yum clean all

				rm -rf /var/cache/yum

				rm -rf /var/lib/yum/yumdb

				rm -rf /var/lib/yum/history

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  apt-get autoclean && apt-get clean

				  rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

				else

				  yum clean all

				  rm -rf /var/cache/yum

				  rm -rf /var/lib/yum/yumdb

				  rm -rf /var/lib/yum/history

				fi

				## Build MIOpen

				mkdir -p build

				@ -131,7 +162,11 @@ make -j $(nproc) package

				# clean up since CI runner was running out of disk space

				rm -rf /usr/local/cget

				yum install -y miopen-*.rpm

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  sudo dpkg -i miopen-hip*.deb

				else

				  yum install -y miopen-*.rpm

				fi

				popd

				rm -rf MIOpen
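The branch selection in this script compares `ROCM_INT` against thresholds such as `60200` and `60300`, but the lines that compute `ROCM_INT` fall outside the hunks shown. The sketch below is one way such an integer could be derived, assuming the encoding `major * 10000 + minor * 100 + patch`, which is consistent with `60200` standing for ROCm 6.2.0. The function name `rocm_version_to_int` is hypothetical, and the script's own computation may differ.

```shell
#!/bin/bash
# Hypothetical helper: turn a dotted version such as "6.2.4" into an integer.
# Assumed encoding: major * 10000 + minor * 100 + patch.
rocm_version_to_int() {
    local major minor patch _
    # Split on dots; any component beyond the third is discarded.
    IFS=. read -r major minor patch _ <<< "$1"
    # Missing components count as zero, so "6.2" is treated like "6.2.0".
    minor=${minor:-0}
    patch=${patch:-0}
    echo $(( major * 10000 + minor * 100 + patch ))
}

ROCM_INT=$(rocm_version_to_int "6.3.0")
if [[ $ROCM_INT -ge 60300 ]]; then
    echo "ROCm 6.3+ branch"
elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
    echo "ROCm 6.2 branch"
fi
```

With this encoding, numeric comparison orders versions correctly as long as the minor and patch components stay below 100.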

									
.ci/docker/common/install_onnx.sh
				@ -32,7 +32,7 @@ pip_install coloredlogs packaging

				pip_install onnxruntime==1.18.1

				pip_install onnx==1.16.2

				pip_install onnxscript==0.1.0.dev20240831 --no-deps

				pip_install onnxscript==0.1.0.dev20241009 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

									
.ci/docker/common/install_triton.sh
				@ -15,8 +15,11 @@ conda_reinstall() {

				if [ -n "${XPU_VERSION}" ]; then

				  TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"

				  TRITON_TEXT_FILE="triton-xpu"

				elif [ -n "${TRITON_CPU}" ]; then

				  TRITON_REPO="https://github.com/triton-lang/triton-cpu"

				  TRITON_TEXT_FILE="triton-cpu"

				else

				  TRITON_REPO="https://github.com/openai/triton"

				  TRITON_REPO="https://github.com/triton-lang/triton"

				  TRITON_TEXT_FILE="triton"

				fi

				@ -44,9 +47,10 @@ chown -R jenkins /var/lib/jenkins/triton

				chgrp -R jenkins /var/lib/jenkins/triton

				pushd /var/lib/jenkins/

				as_jenkins git clone ${TRITON_REPO} triton

				as_jenkins git clone --recursive ${TRITON_REPO} triton

				cd triton

				as_jenkins git checkout ${TRITON_PINNED_COMMIT}

				as_jenkins git submodule update --init --recursive

				cd python

				# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

									
.ci/docker/common/install_user.sh
				@ -2,6 +2,13 @@

				set -ex

				# Since version 24 the system ships with user 'ubuntu' that has id 1000

				# We need a work-around to enable id 1000 usage for this script

				if [[ $UBUNTU_VERSION == 24.04 ]]; then

				    # touch is used to disable harmless error message

				    touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu

				fi

				# Mirror jenkins user in container

				# jenkins user as ec2-user should have the same user-id

				echo "jenkins:x:1000:1000::/var/lib/jenkins:" >> /etc/passwd

									
.ci/docker/common/install_xpu.sh
				@ -41,13 +41,16 @@ function install_ubuntu() {

				        libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				        libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				        mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				    if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				        apt-get install -y intel-ocloc

				    fi

				    # Development Packages

				    apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    # Install Intel Support Packages

				    if [ -n "$XPU_VERSION" ]; then

				        apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev

				        apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev-0.9

				    else

				        apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev

				        apt-get install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9

				    fi

				    # Cleanup

				@ -97,7 +100,7 @@ EOF

				        intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \

				        level-zero-devel

				    # Install Intel Support Packages

				    yum install -y intel-for-pytorch-gpu-dev intel-pti-dev

				    yum install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9

				    # Cleanup

				    dnf clean all

				@ -131,7 +134,7 @@ function install_sles() {

				    zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel

				    # Install Intel Support Packages

				    zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev

				    zypper install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9

				}

									
.ci/docker/conda/Dockerfile
				@ -70,6 +70,10 @@ FROM cuda as cuda12.4

				RUN bash ./install_cuda.sh 12.4

				ENV DESIRED_CUDA=12.4

				FROM cuda as cuda12.6

				RUN bash ./install_cuda.sh 12.6

				ENV DESIRED_CUDA=12.6

				# Install MNIST test data

				FROM base as mnist

				ADD ./common/install_mnist.sh install_mnist.sh

				@ -79,6 +83,7 @@ FROM base as all_cuda

				COPY --from=cuda11.8  /usr/local/cuda-11.8 /usr/local/cuda-11.8

				COPY --from=cuda12.1  /usr/local/cuda-12.1 /usr/local/cuda-12.1

				COPY --from=cuda12.4  /usr/local/cuda-12.4 /usr/local/cuda-12.4

				COPY --from=cuda12.6  /usr/local/cuda-12.6 /usr/local/cuda-12.6

				# Final step

				FROM ${BASE_TARGET} as final

									
.ci/docker/conda/build.sh
				@ -37,6 +37,12 @@ esac

				(

				  set -x

				  # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712

				  # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.

				  sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

				  sudo systemctl daemon-reload

				  sudo systemctl restart docker

				  docker build \

				    --target final \

				    --progress plain \

									
.ci/docker/libtorch/Dockerfile
				@ -66,6 +66,11 @@ RUN bash ./install_cuda.sh 12.4

				RUN bash ./install_magma.sh 12.4

				RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda

				FROM cuda as cuda12.6

				RUN bash ./install_cuda.sh 12.6

				RUN bash ./install_magma.sh 12.6

				RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda

				FROM cpu as rocm

				ARG PYTORCH_ROCM_ARCH

				ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

									
.ci/docker/manywheel/Dockerfile
				@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				ARG DEVTOOLSET_VERSION=9

				# Note: This patch is required since CentOS has reached EOL

				# otherwise any yum install step will fail

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

									
.ci/docker/manywheel/build.sh
				@ -124,7 +124,14 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then

				fi

				(

				    set -x

				    DOCKER_BUILDKIT=1 docker build \

				    # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712

				    # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.

				    sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

				    sudo systemctl daemon-reload

				    sudo systemctl restart docker

				    DOCKER_BUILDKIT=1 docker build  \

				        ${DOCKER_GPU_BUILD_ARG} \

				        --build-arg "GPU_IMAGE=${GPU_IMAGE}" \

				        --target "${TARGET}" \

									
.ci/docker/manywheel/build_scripts/ssl-check.py
				@ -1,10 +1,12 @@

				# cf. https://github.com/pypa/manylinux/issues/53

				import sys

				from urllib.request import urlopen

				GOOD_SSL = "https://google.com"

				BAD_SSL = "https://self-signed.badssl.com"

				import sys

				print("Testing SSL certificate checking for Python:", sys.version)

				@ -12,14 +14,8 @@ if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):

				    print("This version never checks SSL certs; skipping tests")

				    sys.exit(0)

				if sys.version_info[0] >= 3:

				    from urllib.request import urlopen

				    EXC = OSError

				else:

				    from urllib import urlopen

				    EXC = IOError

				EXC = OSError

				print(f"Connecting to {GOOD_SSL} should work")

				urlopen(GOOD_SSL)

.ci/docker/requirements-ci.txt

 @ -5,7 +5,7 @@
 #Pinned versions: 1.6
 #test that import:
 boto3==1.19.12
 boto3==1.35.42
 #Description: AWS SDK for python
 #Pinned versions: 1.19.12, 1.16.34
 #test that import:
 @ -90,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
 mypy==1.10.0
 mypy==1.11.2
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
 #Pinned versions: 1.10.0
 @ -118,7 +118,7 @@ numba==0.55.2 ; python_version == "3.10"
 #numpy
 #Description: Provides N-dimensional arrays and linear algebra
 #Pinned versions: 1.20
 #Pinned versions: 1.26.2
 #test that import: test_view_ops.py, test_unary_ufuncs.py, test_type_promotion.py,
 #test_type_info.py, test_torch.py, test_tensorexpr_pybind.py, test_tensorexpr.py,
 #test_tensorboard.py, test_tensor_creation_ops.py, test_static_runtime.py,
 @ -128,6 +128,9 @@ numba==0.55.2 ; python_version == "3.10"
 #test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
 #test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
 #test_binary_ufuncs.py
 numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
 numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
 numpy==2.1.2; python_version >= "3.13"
 #onnxruntime
 #Description: scoring engine for Open Neural Network Exchange (ONNX) models
 @ -139,9 +142,9 @@ opt-einsum==3.3
 #Pinned versions: 3.3
 #test that import: test_linalg.py
 optree==0.12.1
 optree==0.13.0
 #Description: A library for tree manipulation
 #Pinned versions: 0.12.1
 #Pinned versions: 0.13.0
 #test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
 #test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
 #common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
 @ -202,6 +205,11 @@ xdoctest==1.1.0
 #Pinned versions: 1.1.0
 #test that import:
 pydot==3.0.1
 #Description: Needed for testing FxGraphDrawer
 #Pinned versions:
 #test that import:
 pygments==2.15.0
 #Description: support doctest highlighting
 #Pinned versions: 2.12.0
 @ -253,7 +261,7 @@ tb-nightly==2.13.0a20230426
 #test that import:
 # needed by torchgen utils
 typing-extensions
 typing-extensions>=4.10.0
 #Description: type hints for python
 #Pinned versions:
 #test that import:
 @ -322,13 +330,12 @@ lxml==5.0.0
 PyGithub==2.3.0
 sympy==1.12.1 ; python_version == "3.8"
 sympy==1.13.1 ; python_version >= "3.9"
 #Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
 #Pinned versions:
 #test that import:
 onnx==1.16.1
 onnx==1.17.0
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 @ -337,3 +344,31 @@ onnxscript==0.1.0.dev20240817
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 parameterized==0.8.1
 #Description: Parameterizes unittests, both the tests themselves and the entire testing class
 #Pinned versions:
 #test that import:
 #Description: required for testing torch/distributed/_tools/sac_estimator.py
 #Pinned versions: 1.24.0
 #test that import: test_sac_estimator.py
 pwlf==2.2.1 ; python_version >= "3.8"
 #Description: required for testing torch/distributed/_tools/sac_estimator.py
 #Pinned versions: 2.2.1
 #test that import: test_sac_estimator.py
 # To build PyTorch itself
 astunparse
 PyYAML
 setuptools
 ninja==1.11.1 ; platform_machine == "aarch64"
 scons==4.5.2 ; platform_machine == "aarch64"
 pulp==2.9.0 ; python_version >= "3.8"
 #Description: required for testing ilp formulation under torch/distributed/_tools
 #Pinned versions: 2.9.0
 #test that import: test_sac_ilp.py

2

.ci/docker/triton_version.txt


 @ -1 +1 @@
 .0.0
 .1.0

									
										5

.ci/docker/ubuntu-rocm/Dockerfile
									
				@ -68,6 +68,8 @@ RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

				ENV ROCM_PATH /opt/rocm

				ENV PATH /opt/rocm/bin:$PATH

				ENV PATH /opt/rocm/hcc/bin:$PATH

				@ -121,5 +123,8 @@ RUN bash ./install_cache.sh && rm install_cache.sh

				ARG BUILD_ENVIRONMENT

				ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

				# Install LLVM dev version (Defined in the pytorch/builder github repository)

				COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				USER jenkins

				CMD ["bash"]

									
										27

.ci/docker/ubuntu/Dockerfile
									
				@ -87,19 +87,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				RUN rm install_vision.sh cache_vision_models.sh common_utils.sh

				ENV INSTALLED_VISION ${VISION}

				# (optional) Install Android NDK

				ARG ANDROID

				ARG ANDROID_NDK

				ARG GRADLE_VERSION

				COPY ./common/install_android.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				COPY ./android/AndroidManifest.xml AndroidManifest.xml

				COPY ./android/build.gradle build.gradle

				RUN if [ -n "${ANDROID}" ]; then bash ./install_android.sh; fi

				RUN rm install_android.sh cache_vision_models.sh common_utils.sh

				RUN rm AndroidManifest.xml

				RUN rm build.gradle

				ENV INSTALLED_ANDROID ${ANDROID}

				# (optional) Install Vulkan SDK

				ARG VULKAN_SDK_VERSION

				COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh

				@ -147,6 +134,13 @@ COPY ci_commit_pins/triton.txt triton.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt

				ARG TRITON_CPU

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt

				RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-cpu.txt

				ARG EXECUTORCH

				# Build and install executorch

				COPY ./common/install_executorch.sh install_executorch.sh

				@ -176,6 +170,13 @@ RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi

				RUN rm install_acl.sh

				ENV INSTALLED_ACL ${ACL}

				# (optional) install graphviz

				ARG GRAPHVIZ

				COPY ./common/install_graphviz.sh install_graphviz.sh

				RUN if [ -n "${GRAPHVIZ}" ]; then bash ./install_graphviz.sh; fi

				RUN rm install_graphviz.sh

				ENV INSTALLED_GRAPHVIZ ${GRAPHVIZ}

				# Install ccache/sccache (do this last, so we get priority in PATH)

				ARG SKIP_SCCACHE_INSTALL

				COPY ./common/install_cache.sh install_cache.sh

									
										10

.ci/libtorch/build.sh
									
										Normal file
									
				@ -0,0 +1,10 @@

				#!/usr/bin/env bash

				# This is mostly just a shim to manywheel/build.sh

				# TODO: Make this a dedicated script to build just libtorch

				set -ex

				SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

				USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh
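
The shim above passes its settings as per-command environment assignments rather than exporting them. A minimal sketch of that shell behavior follows; the `bash -c` child is a hypothetical stand-in for `manywheel/build.sh`, not the real script.

```shell
#!/usr/bin/env bash
# Start from a clean slate so the demonstration does not depend on the caller.
unset BUILD_PYTHONLESS DESIRED_PYTHON

# Assignments written before a command are placed only in that command's
# environment. The child below stands in for manywheel/build.sh (assumption).
seen=$(BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" \
    bash -c 'echo "$BUILD_PYTHONLESS $DESIRED_PYTHON"')

# Back in the calling shell, the variables were never set.
after="${BUILD_PYTHONLESS:-unset}"

echo "child saw: $seen"
echo "caller sees: $after"
```

Because the assignments are scoped to the one invocation, the shim does not need to clean up after itself. Note that `build_common.sh` in this same change exits early when `BUILD_PYTHONLESS` is set and `LIBTORCH_VARIANT` is not.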

21

.ci/manywheel/LICENSE Normal file


 @ -0,0 +1,21 @@
 The MIT License (MIT)
 Copyright (c) 2016 manylinux
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.

									
										25

.ci/manywheel/build.sh
									
										Executable file
									
				@ -0,0 +1,25 @@

				#!/usr/bin/env bash

				set -ex

				SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

				case "${GPU_ARCH_TYPE:-BLANK}" in

				    BLANK)

				        # Legacy behavior for CircleCI

				        bash "${SCRIPTPATH}/build_cuda.sh"

				        ;;

				    cuda)

				        bash "${SCRIPTPATH}/build_cuda.sh"

				        ;;

				    rocm)

				        bash "${SCRIPTPATH}/build_rocm.sh"

				        ;;

				    cpu | cpu-cxx11-abi | cpu-s390x | xpu)

				        bash "${SCRIPTPATH}/build_cpu.sh"

				        ;;

				    *)

				        echo "Un-recognized GPU_ARCH_TYPE '${GPU_ARCH_TYPE}', exiting..."

				        exit 1

				        ;;

				esac
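
The dispatch in this script can be checked without running a build. The sketch below restates the same `case` mapping as a function that only prints the script name it would pick; `select_build_script` is a hypothetical name and is not part of the change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the build script chosen for a GPU_ARCH_TYPE value.
# An empty or missing argument falls back to BLANK, as in the script above.
select_build_script() {
    case "${1:-BLANK}" in
        BLANK | cuda)
            echo "build_cuda.sh"
            ;;
        rocm)
            echo "build_rocm.sh"
            ;;
        cpu | cpu-cxx11-abi | cpu-s390x | xpu)
            echo "build_cpu.sh"
            ;;
        *)
            echo "Un-recognized GPU_ARCH_TYPE '$1'" >&2
            return 1
            ;;
    esac
}

select_build_script xpu
select_build_script ""
```

As the mapping shows, `xpu` is routed to the CPU build script, and an unset `GPU_ARCH_TYPE` behaves like `cuda`.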

									
										505

.ci/manywheel/build_common.sh
									
										Normal file
									
				@ -0,0 +1,505 @@

				#!/usr/bin/env bash

				# meant to be called only from the neighboring build.sh and build_cpu.sh scripts

				set -ex

				SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

				# Require only one python installation

				if [[ -z "$DESIRED_PYTHON" ]]; then

				    echo "Need to set DESIRED_PYTHON env variable"

				    exit 1

				fi

				if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then

				    echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"

				    echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"

				    exit 1

				fi

				# Function to retry functions that sometimes timeout or have flaky failures

				retry () {

				    $*  || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)

				}

				# TODO move this into the Docker images

				OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)

				if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				    retry yum install -q -y zip openssl

				elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then

				    retry yum install -q -y zip openssl

				elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then

				    retry dnf install -q -y zip openssl

				elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    # TODO: Remove this once nvidia package repos are back online

				    # Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968

				    # shellcheck disable=SC2046

				    sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")

				    retry apt-get update

				    retry apt-get -y install zip openssl

				fi

				# We use the package name to test the package by passing this to 'pip install'

				# This is the env variable that setup.py uses to name the package. Note that

				# pip 'normalizes' the name first by changing all - to _

				if [[ -z "$TORCH_PACKAGE_NAME" ]]; then

				    TORCH_PACKAGE_NAME='torch'

				fi

				if [[ -z "$TORCH_NO_PYTHON_PACKAGE_NAME" ]]; then

				    TORCH_NO_PYTHON_PACKAGE_NAME='torch_no_python'

				fi

				TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"

				TORCH_NO_PYTHON_PACKAGE_NAME="$(echo $TORCH_NO_PYTHON_PACKAGE_NAME | tr '-' '_')"

				echo "Expecting the built wheels to all be called '$TORCH_PACKAGE_NAME' or '$TORCH_NO_PYTHON_PACKAGE_NAME'"

				# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if

				# PYTORCH_BUILD_NUMBER > 1

				build_version="$PYTORCH_BUILD_VERSION"

				build_number="$PYTORCH_BUILD_NUMBER"

				if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then

				    # This will be the *exact* version, since build_number<1

				    build_version="$OVERRIDE_PACKAGE_VERSION"

				    build_number=0

				fi

				if [[ -z "$build_version" ]]; then

				    build_version=1.0.0

				fi

				if [[ -z "$build_number" ]]; then

				    build_number=1

				fi

				export PYTORCH_BUILD_VERSION=$build_version

				export PYTORCH_BUILD_NUMBER=$build_number

				export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"

				export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"

				if [[ -e /opt/openssl ]]; then

				    export OPENSSL_ROOT_DIR=/opt/openssl

				    export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH

				fi

				# If given a python version like 3.6m or 2.7mu, convert this to the format we

				# expect. The binary CI jobs pass in python versions like this; they also only

				# ever pass one python version, so we assume that DESIRED_PYTHON is not a list

				# in this case

				if [[ -n "$DESIRED_PYTHON" && $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then

				    python_digits="$(echo $DESIRED_PYTHON | tr -cd [:digit:])"

				    py_majmin="${DESIRED_PYTHON}"

				    DESIRED_PYTHON="cp${python_digits}-cp${python_digits}t"

				elif [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then

				    python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"

				    DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"

				    if [[ ${python_nodot} -ge 310 ]]; then

				        py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:2}"

				    else

				        py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:1}"

				    fi

				fi

				pydir="/opt/python/$DESIRED_PYTHON"

				export PATH="$pydir/bin:$PATH"

				echo "Will build for Python version: ${DESIRED_PYTHON} with ${python_installation}"

				mkdir -p /tmp/$WHEELHOUSE_DIR

				export PATCHELF_BIN=/usr/local/bin/patchelf

				patchelf_version=$($PATCHELF_BIN --version)

				echo "patchelf version: " $patchelf_version

				if [[ "$patchelf_version" == "patchelf 0.9" ]]; then

				    echo "Your patchelf version is too old. Please use version >= 0.10."

				    exit 1

				fi

				########################################################

				# Compile wheels as well as libtorch

				#######################################################

				if [[ -z "$PYTORCH_ROOT" ]]; then

				    echo "Need to set PYTORCH_ROOT env variable"

				    exit 1

				fi

				pushd "$PYTORCH_ROOT"

				python setup.py clean

				retry pip install -qr requirements.txt

				case ${DESIRED_PYTHON} in

				  cp31*)

				    retry pip install -q --pre numpy==2.1.0

				    ;;

				  # Should catch 3.9+

				  *)

				    retry pip install -q --pre numpy==2.0.2

				    ;;

				esac

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    export _GLIBCXX_USE_CXX11_ABI=1

				else

				    export _GLIBCXX_USE_CXX11_ABI=0

				fi

				if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				    echo "Calling build_amd.py at $(date)"

				    python tools/amd_build/build_amd.py

				fi

				# This value comes from binary_linux_build.sh (and should only be set to true

				# for master / release branches)

				BUILD_DEBUG_INFO=${BUILD_DEBUG_INFO:=0}

				if [[ $BUILD_DEBUG_INFO == "1" ]]; then

				    echo "Building wheel and debug info"

				else

				    echo "BUILD_DEBUG_INFO was not set, skipping debug info"

				fi

				if [[ "$DISABLE_RCCL" = 1 ]]; then

				    echo "Disabling NCCL/RCCL in pyTorch"

				    USE_RCCL=0

				    USE_NCCL=0

				    USE_KINETO=0

				else

				    USE_RCCL=1

				    USE_NCCL=1

				    USE_KINETO=1

				fi

				echo "Calling setup.py bdist at $(date)"

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				    echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake

				    echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				else

				    time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				        EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				        BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				        USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				        python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				fi

				echo "Finished setup.py bdist at $(date)"

				# Build libtorch packages

				if [[ -n "$BUILD_PYTHONLESS" ]]; then

				    # Now build pythonless libtorch

				    # Note - just use whichever python we happen to be on

				    python setup.py clean

				    if [[ $LIBTORCH_VARIANT = *"static"* ]]; then

				        STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"

				    fi

				    mkdir -p build

				    pushd build

				    echo "Calling tools/build_libtorch.py at $(date)"

				    time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				         EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \

				         python ../tools/build_libtorch.py

				    echo "Finished tools/build_libtorch.py at $(date)"

				    popd

				    mkdir -p libtorch/{lib,bin,include,share}

				    cp -r build/build/lib libtorch/

				    # for now, the headers for the libtorch package will just be copied in

				    # from one of the wheels (this is from when this script built multiple

				    # wheels at once)

				    ANY_WHEEL=$(ls /tmp/$WHEELHOUSE_DIR/torch*.whl | head -n1)

				    unzip -d any_wheel $ANY_WHEEL

				    if [[ -d any_wheel/torch/include ]]; then

				        cp -r any_wheel/torch/include libtorch/

				    else

				        cp -r any_wheel/torch/lib/include libtorch/

				    fi

				    cp -r any_wheel/torch/share/cmake libtorch/share/

				    rm -rf any_wheel

				    echo $PYTORCH_BUILD_VERSION > libtorch/build-version

				    echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash

				    mkdir -p /tmp/$LIBTORCH_HOUSE_DIR

				    if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				        LIBTORCH_ABI="cxx11-abi-"

				    else

				        LIBTORCH_ABI=

				    fi

				    zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch

				    cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \

				       /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip

				fi

				popd

				#######################################################################

				# ADD DEPENDENCIES INTO THE WHEEL

				#

				# auditwheel repair doesn't work correctly and is buggy

				# so manually do the work of copying dependency libs and patchelfing

				# and fixing RECORDS entries correctly

				######################################################################

				fname_with_sha256() {

				    HASH=$(sha256sum $1 | cut -c1-8)

				    DIRNAME=$(dirname $1)

				    BASENAME=$(basename $1)

				    # Do not rename nvrtc-builtins.so as they are dynamically loaded

				    # by libnvrtc.so

				    # Similarly don't mangle libcudnn and libcublas library names

				    if [[ $BASENAME == "libnvrtc-builtins.s"* || $BASENAME == "libcudnn"* || $BASENAME == "libcublas"*  ]]; then

				        echo $1

				    else

				        INITNAME=$(echo $BASENAME | cut -f1 -d".")

				        ENDNAME=$(echo $BASENAME | cut -f 2- -d".")

				        echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"

				    fi

				}

				fname_without_so_number() {

				    LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')

				    echo "$LINKNAME"

				}

				make_wheel_record() {

				    FPATH=$1

				    if echo $FPATH | grep RECORD >/dev/null 2>&1; then

				        # the RECORD file lists itself with empty hash and size fields

				        echo "$FPATH,,"

				    else

				        HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')

				        FSIZE=$(ls -nl $FPATH | awk '{print $5}')

				        echo "$FPATH,sha256=$HASH,$FSIZE"

				    fi

				}

				replace_needed_sofiles() {

				    find $1 -name '*.so*' | while read sofile; do

				        origname=$2

				        patchedname=$3

				        if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				            set +e

				            origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")

				            ERRCODE=$?

				            set -e

				            if [ "$ERRCODE" -eq "0" ]; then

				                echo "patching $sofile entry $origname to $patchedname"

				                $PATCHELF_BIN --replace-needed $origname $patchedname $sofile

				            fi

				        fi

				    done

				}

				echo 'Built this wheel:'

				ls /tmp/$WHEELHOUSE_DIR

				mkdir -p "/$WHEELHOUSE_DIR"

				mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true

				fi

				if [[ -n "$BUILD_PYTHONLESS" ]]; then

				    mkdir -p /$LIBTORCH_HOUSE_DIR

				    mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR

				    rm -rf /tmp/$LIBTORCH_HOUSE_DIR

				fi

				rm -rf /tmp/$WHEELHOUSE_DIR

				rm -rf /tmp_dir

				mkdir /tmp_dir

				pushd /tmp_dir

				for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.whl /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do

				    # if the glob didn't match anything

				    if [[ ! -e $pkg ]]; then

				        continue

				    fi

				    rm -rf tmp

				    mkdir -p tmp

				    cd tmp

				    cp $pkg .

				    unzip -q $(basename $pkg)

				    rm -f $(basename $pkg)

				    if [[ -d torch ]]; then

				        PREFIX=torch

				    else

				        PREFIX=libtorch

				    fi

				    if [[ $pkg != *"without-deps"* ]]; then

				        # copy over the needed dependent .so files and tag them with their hash

				        patched=()

				        for filepath in "${DEPS_LIST[@]}"; do

				            filename=$(basename $filepath)

				            destpath=$PREFIX/lib/$filename

				            if [[ "$filepath" != "$destpath" ]]; then

				                cp $filepath $destpath

				            fi

				            # ROCm workaround for roctracer dlopens

				            if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				                patchedpath=$(fname_without_so_number $destpath)

				            # Keep the so number for XPU dependencies

				            elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then

				                patchedpath=$destpath

				            else

				                patchedpath=$(fname_with_sha256 $destpath)

				            fi

				            patchedname=$(basename $patchedpath)

				            if [[ "$destpath" != "$patchedpath" ]]; then

				                mv $destpath $patchedpath

				            fi

				            patched+=("$patchedname")

				            echo "Copied $filepath to $patchedpath"

				        done

				        echo "patching to fix the so names to the hashed names"

				        for ((i=0;i<${#DEPS_LIST[@]};++i)); do

				            replace_needed_sofiles $PREFIX ${DEPS_SONAME[i]} ${patched[i]}

				            # do the same for caffe2, if it exists

				            if [[ -d caffe2 ]]; then

				                replace_needed_sofiles caffe2 ${DEPS_SONAME[i]} ${patched[i]}

				            fi

				        done

				        # copy over needed auxiliary files

				        for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do

				            srcpath=${DEPS_AUX_SRCLIST[i]}

				            dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}

				            mkdir -p $(dirname $dstpath)

				            cp $srcpath $dstpath

				        done

				    fi

				    # set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib

				    find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do

				        echo "Setting rpath of $sofile to ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'}"

				        $PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'} ${FORCE_RPATH:-} $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				    done

				    # set RPATH of lib/ files to $ORIGIN

				    find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do

				        echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"

				        $PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				    done

				    # regenerate the RECORD file with new hashes

				    record_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g')

				    if [[ -e $record_file ]]; then

				        echo "Generating new record file $record_file"

				        : > "$record_file"

				        # generate records for folders in wheel

				        find * -type f | while read fname; do

				            make_wheel_record "$fname" >>"$record_file"

				        done

				    fi

				    if [[ $BUILD_DEBUG_INFO == "1" ]]; then

				        pushd "$PREFIX/lib"

				        # Duplicate library into debug lib

				        cp libtorch_cpu.so libtorch_cpu.so.dbg

				        # Keep debug symbols on debug lib

				        strip --only-keep-debug libtorch_cpu.so.dbg

				        # Remove debug info from release lib

				        strip --strip-debug libtorch_cpu.so

				        objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg

				        # Zip up debug info

				        mkdir -p /tmp/debug

				        mv libtorch_cpu.so.dbg /tmp/debug/libtorch_cpu.so.dbg

				        CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch_cpu.so)

				        pushd /tmp

				        PKG_NAME=$(basename "$pkg" | sed 's/\.whl$//g')

				        zip /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip /tmp/debug/libtorch_cpu.so.dbg

				        cp /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip "$PYTORCH_FINAL_PACKAGE_DIR"

				        popd

				        popd

				    fi

				    # zip up the wheel back

				    zip -rq $(basename $pkg) $PREIX*

				    # replace original wheel

				    rm -f $pkg

				    mv $(basename $pkg) $pkg

				    cd ..

				    rm -rf tmp

				done

				# Copy wheels to host machine for persistence before testing

				if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then

				    mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true

				    if [[ -n "$BUILD_PYTHONLESS" ]]; then

				        cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"

				    else

				        cp /$WHEELHOUSE_DIR/torch*.whl "$PYTORCH_FINAL_PACKAGE_DIR"

				    fi

				fi

				# remove stuff before testing

				rm -rf /opt/rh

				if ls /usr/local/cuda* >/dev/null 2>&1; then

				    rm -rf /usr/local/cuda*

				fi

				# Test that all the wheels work

				if [[ -z "$BUILD_PYTHONLESS" ]]; then

				  export OMP_NUM_THREADS=4 # on NUMA machines this takes too long

				  pushd $PYTORCH_ROOT/test

				  # Install the wheel for this Python version

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true

				  fi

				  pip uninstall -y "$TORCH_PACKAGE_NAME"

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  fi

				  pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  # Print info on the libraries installed in this wheel

				  # Rather than adjust find command to skip non-library files with an embedded *.so* in their name,

				  # since this is only for reporting purposes, we add the || true to the ldd command.

				  installed_libraries=($(find "$pydir/lib/python${py_majmin}/site-packages/torch/" -name '*.so*'))

				  echo "The wheel installed all of the libraries: ${installed_libraries[@]}"

				  for installed_lib in "${installed_libraries[@]}"; do

				      ldd "$installed_lib" || true

				  done

				  # Run the tests

				  echo "$(date) :: Running tests"

				  pushd "$PYTORCH_ROOT"

				  #TODO: run_tests.sh and check_binary.sh should be moved to pytorch/pytorch project

				  LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \

				          "/builder/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"

				  popd

				  echo "$(date) :: Finished tests"

				fi
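
The `DESIRED_PYTHON` handling in `build_common.sh` is easier to follow in isolation. The sketch below restates that branch logic as a function that prints the wheel tag and the `py_majmin` value; the wrapper name `normalize_python` and its output format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the normalization logic shown in build_common.sh.
# Prints "<tag> <major.minor>" for a version such as 3.9, 3.12 or 3.13t.
normalize_python() {
    local desired="$1" tag majmin digits nodot
    if [[ -n "$desired" && $desired =~ ([0-9].[0-9]+)t ]]; then
        # Free-threaded builds: keep the trailing "t" in both outputs.
        digits="$(echo "$desired" | tr -cd '[:digit:]')"
        majmin="$desired"
        tag="cp${digits}-cp${digits}t"
    elif [[ -n "$desired" && "$desired" != cp* ]]; then
        # Strip dots and the legacy "m"/"u" ABI suffixes.
        nodot="$(echo "$desired" | tr -d m.u)"
        tag="cp${nodot}-cp${nodot}"
        if [[ "$nodot" -ge 310 ]]; then
            majmin="${tag:2:1}.${tag:3:2}"   # two-digit minor version
        else
            majmin="${tag:2:1}.${tag:3:1}"   # one-digit minor version
        fi
    else
        # Already a cp tag (or empty): the script leaves py_majmin unset here.
        tag="$desired"
        majmin=""
    fi
    echo "$tag $majmin"
}

normalize_python 3.9
normalize_python 3.12
normalize_python 3.13t
```

The `-ge 310` test exists because the minor version is sliced out of the tag by position, so the slice width has to change once the minor version reaches two digits.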

									
										99

.ci/manywheel/build_cpu.sh
									
										Executable file
									
				@ -0,0 +1,99 @@

				#!/usr/bin/env bash

				set -ex

				GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}

				export TH_BINARY_BUILD=1

				export USE_CUDA=0

				# Keep an array of cmake variables to add to

				if [[ -z "$CMAKE_ARGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build()

				    CMAKE_ARGS=()

				fi

				if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build_caffe2()

				    EXTRA_CAFFE2_CMAKE_FLAGS=()

				fi

				DIR_SUFFIX=cpu

				if [[ "$GPU_ARCH_TYPE" == "xpu" ]]; then

				    DIR_SUFFIX=xpu

				    # Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html

				    source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh

				    source /opt/intel/oneapi/pti/latest/env/vars.sh

				    export USE_STATIC_MKL=1

				fi

				WHEELHOUSE_DIR="wheelhouse$DIR_SUFFIX"

				LIBTORCH_HOUSE_DIR="libtorch_house$DIR_SUFFIX"

				if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then

				    if [[ -z "$BUILD_PYTHONLESS" ]]; then

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$DIR_SUFFIX"

				    else

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$DIR_SUFFIX"

				    fi

				fi

				mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true

				OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)

				if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    if [[ "$(uname -m)" == "s390x" ]]; then

				        LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"

				    else

				        LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"

				    fi

				fi

				DEPS_LIST=(

				    "$LIBGOMP_PATH"

				)

				DEPS_SONAME=(

				    "libgomp.so.1"

				)

				if [[ "$GPU_ARCH_TYPE" == "xpu" ]]; then

				    echo "Bundling with xpu support package libs."

				    DEPS_LIST+=(

				        "/opt/intel/oneapi/compiler/latest/lib/libsycl-preview.so.7"

				        "/opt/intel/oneapi/compiler/latest/lib/libOpenCL.so.1"

				        "/opt/intel/oneapi/compiler/latest/lib/libxptifw.so"

				        "/opt/intel/oneapi/compiler/latest/lib/libsvml.so"

				        "/opt/intel/oneapi/compiler/latest/lib/libirng.so"

				        "/opt/intel/oneapi/compiler/latest/lib/libimf.so"

				        "/opt/intel/oneapi/compiler/latest/lib/libintlc.so.5"

				        "/opt/intel/oneapi/compiler/latest/lib/libpi_level_zero.so"

				        "/opt/intel/oneapi/pti/latest/lib/libpti_view.so.0.9"

				        "/opt/intel/oneapi/pti/latest/lib/libpti.so.0.9"

				    )

				    DEPS_SONAME+=(

				        "libsycl-preview.so.7"

				        "libOpenCL.so.1"

				        "libxptifw.so"

				        "libsvml.so"

				        "libirng.so"

				        "libimf.so"

				        "libintlc.so.5"

				        "libpi_level_zero.so"

				        "libpti_view.so.0.9"

				        "libpti.so.0.9"

				    )

				fi

				rm -rf /usr/local/cuda*

				SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

				if [[ -z "$BUILD_PYTHONLESS" ]]; then

				    BUILD_SCRIPT=build_common.sh

				else

				    BUILD_SCRIPT=build_libtorch.sh

				fi

				source ${SOURCE_DIR}/${BUILD_SCRIPT}
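
`DEPS_LIST` and `DEPS_SONAME` are parallel arrays that `build_common.sh` walks by index, so an entry added to one but not the other silently shifts every later pairing. A minimal consistency check might look like the sketch below. It assumes, as holds for the entries in this file, that each path's basename equals its SONAME; `check_deps` is a hypothetical helper and the two sample entries are copied from the lists above.

```shell
#!/usr/bin/env bash
# Sample entries taken from build_cpu.sh; the real lists are longer.
DEPS_LIST=(
    "/usr/lib64/libgomp.so.1"
    "/opt/intel/oneapi/compiler/latest/lib/libOpenCL.so.1"
)
DEPS_SONAME=(
    "libgomp.so.1"
    "libOpenCL.so.1"
)

# Hypothetical helper: verifies the arrays have equal length and that
# entry i of DEPS_LIST has the basename named by entry i of DEPS_SONAME.
check_deps() {
    if [[ ${#DEPS_LIST[@]} -ne ${#DEPS_SONAME[@]} ]]; then
        echo "length mismatch"
        return 1
    fi
    local i
    for ((i = 0; i < ${#DEPS_LIST[@]}; ++i)); do
        if [[ "$(basename "${DEPS_LIST[i]}")" != "${DEPS_SONAME[i]}" ]]; then
            echo "mismatch at index $i"
            return 1
        fi
    done
    echo "ok"
}

check_deps
```

The basename rule is only a heuristic; a build that bundles a library under a different file name than its SONAME would need a looser check.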

									
										290

.ci/manywheel/build_cuda.sh
									
										Normal file
									
				@ -0,0 +1,290 @@

				#!/usr/bin/env bash

				set -ex

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

				export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"

				export NCCL_ROOT_DIR=/usr/local/cuda

				export TH_BINARY_BUILD=1

				export USE_STATIC_CUDNN=1

				export USE_STATIC_NCCL=1

				export ATEN_STATIC_CUDA=1

				export USE_CUDA_STATIC_LINK=1

				export INSTALL_TEST=0 # don't install test binaries into site-packages

				export USE_CUPTI_SO=0

				export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build

				# Keep an array of cmake variables to add to

				if [[ -z "$CMAKE_ARGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build()

				    CMAKE_ARGS=()

				fi

				if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build_caffe2()

				    EXTRA_CAFFE2_CMAKE_FLAGS=()

				fi

				# Determine CUDA version and architectures to build for

				#

				# NOTE: We should first check `DESIRED_CUDA` when determining `CUDA_VERSION`,

				# because in some cases a single Docker image can have multiple CUDA versions

				# on it, and `nvcc --version` might not show the CUDA version we want.

				if [[ -n "$DESIRED_CUDA" ]]; then

				    # If the DESIRED_CUDA already matches the format that we expect

				    if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then

				        CUDA_VERSION=${DESIRED_CUDA}

				    else

				        # cu90, cu92, cu100, cu101

				        if [[ ${#DESIRED_CUDA} -eq 4 ]]; then

				            CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"

				        elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then

				            CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"

				        fi

				    fi

				    echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"

				    # There really has to be a better way to do this - eli

				    # Possibly limiting builds to specific cuda versions by delimiting images would be a choice

				    if [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				        echo "Switching to CUDA version ${DESIRED_CUDA}"

				        /builder/conda/switch_cuda_version.sh "${DESIRED_CUDA}"

				    fi

				else

				    CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")

				    echo "CUDA $CUDA_VERSION Detected"

				fi

				cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

				TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"

				case ${CUDA_VERSION} in

				    12.4)

				        if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then

				            TORCH_CUDA_ARCH_LIST="9.0"

				        else

				            TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"

				        fi

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.1)

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    11.8)

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    11.[67])

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    *)

				        echo "unknown cuda version $CUDA_VERSION"

				        exit 1

				        ;;

				esac

				export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}

				echo "${TORCH_CUDA_ARCH_LIST}"

				# Package directories

				WHEELHOUSE_DIR="wheelhouse$cuda_version_nodot"

				LIBTORCH_HOUSE_DIR="libtorch_house$cuda_version_nodot"

				if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then

				    if [[ -z "$BUILD_PYTHONLESS" ]]; then

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$cuda_version_nodot"

				    else

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$cuda_version_nodot"

				    fi

				fi

				mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true

				OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)

				if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"

				fi

				DEPS_LIST=(

				    "$LIBGOMP_PATH"

				)

				DEPS_SONAME=(

				    "libgomp.so.1"

				)

				if [[ $USE_CUSPARSELT == "1" ]]; then

				        DEPS_SONAME+=(

				            "libcusparseLt.so.0"

				        )

				        DEPS_LIST+=(

				            "/usr/local/cuda/lib64/libcusparseLt.so.0"

				        )

				fi

				if [[ $CUDA_VERSION == "12.1" || $CUDA_VERSION == "12.4" ]]; then

				    export USE_STATIC_CUDNN=0

				    # Try parallelizing nvcc as well

				    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				    if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then

				        echo "Bundling with cudnn and cublas."

				        DEPS_LIST+=(

				            "/usr/local/cuda/lib64/libcudnn_adv.so.9"

				            "/usr/local/cuda/lib64/libcudnn_cnn.so.9"

				            "/usr/local/cuda/lib64/libcudnn_graph.so.9"

				            "/usr/local/cuda/lib64/libcudnn_ops.so.9"

				            "/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"

				            "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"

				            "/usr/local/cuda/lib64/libcudnn_heuristic.so.9"

				            "/usr/local/cuda/lib64/libcudnn.so.9"

				            "/usr/local/cuda/lib64/libcublas.so.12"

				            "/usr/local/cuda/lib64/libcublasLt.so.12"

				            "/usr/local/cuda/lib64/libcudart.so.12"

				            "/usr/local/cuda/lib64/libnvToolsExt.so.1"

				            "/usr/local/cuda/lib64/libnvrtc.so.12"

				            "/usr/local/cuda/lib64/libnvrtc-builtins.so"

				        )

				        DEPS_SONAME+=(

				            "libcudnn_adv.so.9"

				            "libcudnn_cnn.so.9"

				            "libcudnn_graph.so.9"

				            "libcudnn_ops.so.9"

				            "libcudnn_engines_runtime_compiled.so.9"

				            "libcudnn_engines_precompiled.so.9"

				            "libcudnn_heuristic.so.9"

				            "libcudnn.so.9"

				            "libcublas.so.12"

				            "libcublasLt.so.12"

				            "libcudart.so.12"

				            "libnvToolsExt.so.1"

				            "libnvrtc.so.12"

				            "libnvrtc-builtins.so"

				        )

				    else

				        echo "Using nvidia libs from pypi."

				        CUDA_RPATHS=(

				            '$ORIGIN/../../nvidia/cublas/lib'

				            '$ORIGIN/../../nvidia/cuda_cupti/lib'

				            '$ORIGIN/../../nvidia/cuda_nvrtc/lib'

				            '$ORIGIN/../../nvidia/cuda_runtime/lib'

				            '$ORIGIN/../../nvidia/cudnn/lib'

				            '$ORIGIN/../../nvidia/cufft/lib'

				            '$ORIGIN/../../nvidia/curand/lib'

				            '$ORIGIN/../../nvidia/cusolver/lib'

				            '$ORIGIN/../../nvidia/cusparse/lib'

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/nvtx/lib'

				        )

				        CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")

				        export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'

				        export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'

				        export FORCE_RPATH="--force-rpath"

				        export USE_STATIC_NCCL=0

				        export USE_SYSTEM_NCCL=1

				        export ATEN_STATIC_CUDA=0

				        export USE_CUDA_STATIC_LINK=0

				        export USE_CUPTI_SO=1

				        export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"

				        export NCCL_LIB_DIR="/usr/local/cuda/lib64/"

				    fi

				elif [[ $CUDA_VERSION == "11.8" ]]; then

				    export USE_STATIC_CUDNN=0

				    # Try parallelizing nvcc as well

				    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				    # Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750

				    export BUILD_BUNDLE_PTXAS=1

				    if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then

				        echo "Bundling with cudnn and cublas."

				        DEPS_LIST+=(

				            "/usr/local/cuda/lib64/libcudnn_adv.so.9"

				            "/usr/local/cuda/lib64/libcudnn_cnn.so.9"

				            "/usr/local/cuda/lib64/libcudnn_graph.so.9"

				            "/usr/local/cuda/lib64/libcudnn_ops.so.9"

				            "/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"

				            "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"

				            "/usr/local/cuda/lib64/libcudnn_heuristic.so.9"

				            "/usr/local/cuda/lib64/libcudnn.so.9"

				            "/usr/local/cuda/lib64/libcublas.so.11"

				            "/usr/local/cuda/lib64/libcublasLt.so.11"

				            "/usr/local/cuda/lib64/libcudart.so.11.0"

				            "/usr/local/cuda/lib64/libnvToolsExt.so.1"

				            "/usr/local/cuda/lib64/libnvrtc.so.11.2"    # this is not a mistake, it links to more specific cuda version

				            "/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"

				        )

				        DEPS_SONAME+=(

				            "libcudnn_adv.so.9"

				            "libcudnn_cnn.so.9"

				            "libcudnn_graph.so.9"

				            "libcudnn_ops.so.9"

				            "libcudnn_engines_runtime_compiled.so.9"

				            "libcudnn_engines_precompiled.so.9"

				            "libcudnn_heuristic.so.9"

				            "libcudnn.so.9"

				            "libcublas.so.11"

				            "libcublasLt.so.11"

				            "libcudart.so.11.0"

				            "libnvToolsExt.so.1"

				            "libnvrtc.so.11.2"

				            "libnvrtc-builtins.so.11.8"

				        )

				    else

				        echo "Using nvidia libs from pypi."

				        CUDA_RPATHS=(

				            '$ORIGIN/../../nvidia/cublas/lib'

				            '$ORIGIN/../../nvidia/cuda_cupti/lib'

				            '$ORIGIN/../../nvidia/cuda_nvrtc/lib'

				            '$ORIGIN/../../nvidia/cuda_runtime/lib'

				            '$ORIGIN/../../nvidia/cudnn/lib'

				            '$ORIGIN/../../nvidia/cufft/lib'

				            '$ORIGIN/../../nvidia/curand/lib'

				            '$ORIGIN/../../nvidia/cusolver/lib'

				            '$ORIGIN/../../nvidia/cusparse/lib'

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/nvtx/lib'

				        )

				        CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")

				        export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'

				        export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'

				        export FORCE_RPATH="--force-rpath"

				        export USE_STATIC_NCCL=0

				        export USE_SYSTEM_NCCL=1

				        export ATEN_STATIC_CUDA=0

				        export USE_CUDA_STATIC_LINK=0

				        export USE_CUPTI_SO=1

				        export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"

				        export NCCL_LIB_DIR="/usr/local/cuda/lib64/"

				    fi

				else

				    echo "Unknown cuda version $CUDA_VERSION"

				    exit 1

				fi

				# builder/test.sh requires DESIRED_CUDA to know what tests to exclude

				export DESIRED_CUDA="$cuda_version_nodot"

				# Switch `/usr/local/cuda` to the desired CUDA version

				rm -rf /usr/local/cuda || true

				ln -s "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda

				# Switch `/usr/local/magma` to the desired CUDA version

				rm -rf /usr/local/magma || true

				ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma

				export CUDA_VERSION=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev) # 10.0.130

				export CUDA_VERSION_SHORT=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev | cut -f1,2 -d".") # 10.0

				export CUDNN_VERSION=$(ls /usr/local/cuda/lib64/libcudnn.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev)

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

				if [[ -z "$BUILD_PYTHONLESS" ]]; then

				    BUILD_SCRIPT=build_common.sh

				else

				    BUILD_SCRIPT=build_libtorch.sh

				fi

				source $SCRIPTPATH/${BUILD_SCRIPT}
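
The DESIRED_CUDA handling in the script above accepts either a dotted version ("12.4") or a "cuNNN" tag and slices the tag by string length. A minimal sketch of the same normalization as a single function follows. The name `parse_desired_cuda` is hypothetical and not part of the script, and the sketch assumes the minor version is always one digit, as in the script's own cu90/cu92/cu100/cu101 examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of build_cuda.sh): normalize a DESIRED_CUDA
# value such as "cu118", "cu92", or "12.4" to a dotted "major.minor" string.
parse_desired_cuda() {
    local desired="$1"
    if [[ "$desired" =~ ^[0-9]+\.[0-9]+$ ]]; then
        # Already in the dotted form the script expects.
        echo "$desired"
    elif [[ "$desired" =~ ^cu([0-9]+)([0-9])$ ]]; then
        # Assumption: the last digit is the minor version and everything
        # before it is the major version (cu118 -> 11.8, cu92 -> 9.2).
        echo "${BASH_REMATCH[1]}.${BASH_REMATCH[2]}"
    else
        echo "unrecognized DESIRED_CUDA: $desired" >&2
        return 1
    fi
}
```

Unlike the length-based slicing, this variant returns a non-zero status for inputs it does not recognize, so a caller could stop early instead of continuing with an empty CUDA_VERSION.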

									
										353

.ci/manywheel/build_libtorch.sh
									
										Normal file
									
				@ -0,0 +1,353 @@

				#!/usr/bin/env bash

				# meant to be called only from the neighboring build.sh and build_cpu.sh scripts

				set -eo pipefail

				SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

				# Require only one python installation

				if [[ -z "$DESIRED_PYTHON" ]]; then

				    echo "Need to set DESIRED_PYTHON env variable"

				    exit 1

				fi

				if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then

				    echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"

				    echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"

				    exit 1

				fi

				# Function to retry functions that sometimes timeout or have flaky failures

				retry () {

				    "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") || (sleep 4 && "$@") || (sleep 8 && "$@")

				}

				# TODO move this into the Docker images

				OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)

				if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				    retry yum install -q -y zip openssl

				elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then

				    retry yum install -q -y zip openssl

				elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then

				    retry dnf install -q -y zip openssl

				elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    # TODO: Remove this once nvidia package repos are back online

				    # Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968

				    # shellcheck disable=SC2046

				    sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")

				    retry apt-get update

				    retry apt-get -y install zip openssl

				fi

				# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if

				# PYTORCH_BUILD_NUMBER > 1

				build_version="$PYTORCH_BUILD_VERSION"

				build_number="$PYTORCH_BUILD_NUMBER"

				if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then

				    # This will be the *exact* version, since build_number<1

				    build_version="$OVERRIDE_PACKAGE_VERSION"

				    build_number=0

				fi

				if [[ -z "$build_version" ]]; then

				    build_version=1.0.0

				fi

				if [[ -z "$build_number" ]]; then

				    build_number=1

				fi

				export PYTORCH_BUILD_VERSION=$build_version

				export PYTORCH_BUILD_NUMBER=$build_number

				export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"

				export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"

				# set OPENSSL_ROOT_DIR=/opt/openssl if it exists

				if [[ -e /opt/openssl ]]; then

				    export OPENSSL_ROOT_DIR=/opt/openssl

				    export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH

				fi

				# If given a python version like 3.6m or 2.7mu, convert this to the format we

				# expect. The binary CI jobs pass in python versions like this; they also only

				# ever pass one python version, so we assume that DESIRED_PYTHON is not a list

				# in this case

				if [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then

				    python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"

				    DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"

				fi

				pydir="/opt/python/$DESIRED_PYTHON"

				export PATH="$pydir/bin:$PATH"

				export PATCHELF_BIN=/usr/local/bin/patchelf

				patchelf_version=`$PATCHELF_BIN --version`

				echo "patchelf version: " $patchelf_version

				if [[ "$patchelf_version" == "patchelf 0.9" ]]; then

				    echo "Your patchelf version is too old. Please use version >= 0.10."

				    exit 1

				fi

				########################################################

				# Compile wheels as well as libtorch

				#######################################################

				if [[ -z "$PYTORCH_ROOT" ]]; then

				    echo "Need to set PYTORCH_ROOT env variable"

				    exit 1

				fi

				pushd "$PYTORCH_ROOT"

				python setup.py clean

				retry pip install -qr requirements.txt

				retry pip install -q numpy==2.0.1

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    export _GLIBCXX_USE_CXX11_ABI=1

				else

				    export _GLIBCXX_USE_CXX11_ABI=0

				fi

				if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				    echo "Calling build_amd.py at $(date)"

				    python tools/amd_build/build_amd.py

				    # TODO remove this work-around once pytorch sources are updated

				    export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr

				fi

				echo "Calling setup.py install at $(date)"

				if [[ $LIBTORCH_VARIANT = *"static"* ]]; then

				    STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"

				fi

				(

				    set -x

				    mkdir -p build

				    # TODO: Remove the CFLAGS override once https://github.com/pytorch/pytorch/issues/55952 is closed

				    time CMAKE_ARGS="${CMAKE_ARGS[*]}" \

				        EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[*]} $STATIC_CMAKE_FLAG" \

				        CFLAGS='-Wno-deprecated-declarations' \

				        BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \

				        python setup.py install

				    mkdir -p libtorch/{lib,bin,include,share}

				    # Make debug folder separate so it doesn't get zipped up with the rest of

				    # libtorch

				    mkdir debug

				    # Copy over all lib files

				    cp -rv build/lib/*                libtorch/lib/

				    cp -rv build/lib*/torch/lib/*     libtorch/lib/

				    # Copy over all include files

				    cp -rv build/include/*            libtorch/include/

				    cp -rv build/lib*/torch/include/* libtorch/include/

				    # Copy over all of the cmake files

				    cp -rv build/lib*/torch/share/*   libtorch/share/

				    # Split libtorch into debug / release version

				    cp libtorch/lib/libtorch_cpu.so libtorch/lib/libtorch_cpu.so.dbg

				    # Keep debug symbols on debug lib

				    strip --only-keep-debug libtorch/lib/libtorch_cpu.so.dbg

				    # Remove debug info from release lib

				    strip --strip-debug libtorch/lib/libtorch_cpu.so

				    # Add a debug link from the release lib to the debug lib (debuggers will then

				    # search for symbols in a file called libtorch_cpu.so.dbg in some

				    # predetermined locations) and embed a CRC32 of the debug library into the .so

				    cd libtorch/lib

				    objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg

				    cd ../..

				    # Move the debug symbols to their own directory so they don't get processed /

				    # zipped with all the other libraries

				    mv libtorch/lib/libtorch_cpu.so.dbg debug/libtorch_cpu.so.dbg

				    echo "${PYTORCH_BUILD_VERSION}" > libtorch/build-version

				    git -C "$PYTORCH_ROOT" rev-parse HEAD > libtorch/build-hash

				)

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    LIBTORCH_ABI="cxx11-abi-"

				else

				    LIBTORCH_ABI=

				fi

				(

				    set -x

				    mkdir -p /tmp/$LIBTORCH_HOUSE_DIR

				    # objcopy installs a CRC32 into libtorch_cpu.so above, so add that to the name here

				    CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch/lib/libtorch_cpu.so)

				    # Zip debug symbols

				    zip /tmp/$LIBTORCH_HOUSE_DIR/debug-libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION-$CRC32.zip debug/libtorch_cpu.so.dbg

				    # Zip and copy libtorch

				    zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch

				    cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \

				       /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip

				)

				popd

				#######################################################################

				# ADD DEPENDENCIES INTO THE WHEEL

				#

				# auditwheel repair doesn't work correctly and is buggy

				# so manually do the work of copying dependency libs and patchelfing

				# and fixing RECORDS entries correctly

				######################################################################

				fname_with_sha256() {

				    HASH=$(sha256sum $1 | cut -c1-8)

				    DIRNAME=$(dirname $1)

				    BASENAME=$(basename $1)

				    if [[ $BASENAME == "libnvrtc-builtins.so" || $BASENAME == "libcudnn"* ]]; then

				        echo $1

				    else

				        INITNAME=$(echo $BASENAME | cut -f1 -d".")

				        ENDNAME=$(echo $BASENAME | cut -f 2- -d".")

				        echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"

				    fi

				}

				fname_without_so_number() {

				    LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')

				    echo "$LINKNAME"

				}

				make_wheel_record() {

				    FPATH=$1

				    if echo $FPATH | grep RECORD >/dev/null 2>&1; then

				        # The RECORD file itself is listed without a hash or size

				        echo "$FPATH,,"

				    else

				        HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')

				        FSIZE=$(ls -nl $FPATH | awk '{print $5}')

				        echo "$FPATH,sha256=$HASH,$FSIZE"

				    fi

				}

				echo 'Built this package:'

				(

				    set -x

				    mkdir -p /$LIBTORCH_HOUSE_DIR

				    mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR

				    rm -rf /tmp/$LIBTORCH_HOUSE_DIR

				)

				TMP_DIR=$(mktemp -d)

				trap "rm -rf ${TMP_DIR}" EXIT

				pushd "${TMP_DIR}"

				for pkg in /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do

				    # if the glob didn't match anything

				    if [[ ! -e $pkg ]]; then

				        continue

				    fi

				    rm -rf tmp

				    mkdir -p tmp

				    cd tmp

				    cp $pkg .

				    unzip -q $(basename $pkg)

				    rm -f $(basename $pkg)

				    PREFIX=libtorch

				    if [[ $pkg != *"without-deps"* ]]; then

				        # copy the needed dependent .so files over and tag them with their hash

				        patched=()

				        for filepath in "${DEPS_LIST[@]}"; do

				            filename=$(basename $filepath)

				            destpath=$PREFIX/lib/$filename

				            if [[ "$filepath" != "$destpath" ]]; then

				                cp $filepath $destpath

				            fi

				            if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				                patchedpath=$(fname_without_so_number $destpath)

				            else

				                patchedpath=$(fname_with_sha256 $destpath)

				            fi

				            patchedname=$(basename $patchedpath)

				            if [[ "$destpath" != "$patchedpath" ]]; then

				                mv $destpath $patchedpath

				            fi

				            patched+=("$patchedname")

				            echo "Copied $filepath to $patchedpath"

				        done

				        echo "patching to fix the so names to the hashed names"

				        for ((i=0;i<${#DEPS_LIST[@]};++i)); do

				            find $PREFIX -name '*.so*' | while read sofile; do

				                origname=${DEPS_SONAME[i]}

				                patchedname=${patched[i]}

				                if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				                    set +e

				                    origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")

				                    ERRCODE=$?

				                    set -e

				                    if [ "$ERRCODE" -eq "0" ]; then

				                        echo "patching $sofile entry $origname to $patchedname"

				                        $PATCHELF_BIN --replace-needed $origname $patchedname $sofile

				                    fi

				                fi

				            done

				        done

				        # copy over needed auxiliary files

				        for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do

				            srcpath=${DEPS_AUX_SRCLIST[i]}

				            dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}

				            mkdir -p $(dirname $dstpath)

				            cp $srcpath $dstpath

				        done

				    fi

				    # set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib

				    find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do

				        echo "Setting rpath of $sofile to " '$ORIGIN:$ORIGIN/lib'

				        $PATCHELF_BIN --set-rpath '$ORIGIN:$ORIGIN/lib' $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				    done

				    # set RPATH of lib/ files to $ORIGIN

				    find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do

				        echo "Setting rpath of $sofile to " '$ORIGIN'

				        $PATCHELF_BIN --set-rpath '$ORIGIN' $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				    done

				    # regenerate the RECORD file with new hashes

				    record_file=`echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g'`

				    if [[ -e $record_file ]]; then

				        echo "Generating new record file $record_file"

				        rm -f $record_file

				        # generate records for folders in wheel

				        find * -type f | while read fname; do

				            echo $(make_wheel_record $fname) >>$record_file

				        done

				    fi

				    # zip the package back up

				    zip -rq $(basename $pkg) $PREFIX*

				    # replace original wheel

				    rm -f $pkg

				    mv $(basename $pkg) $pkg

				    cd ..

				    rm -rf tmp

				done

				# Copy wheels to host machine for persistence before testing

				if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then

				    cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"

				    cp /$LIBTORCH_HOUSE_DIR/debug-libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"

				fi
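
The `retry` helper in the script above reruns a command after waits of 1, 2, 4 and 8 seconds. A minimal sketch of the same backoff written as a loop follows. The name `retry_backoff` and the `RETRY_ATTEMPTS` and `RETRY_BASE_DELAY` variables are hypothetical and do not appear in the script; they are introduced here only so the attempt count and delay can be adjusted.

```shell
#!/usr/bin/env bash
# Hypothetical variant of retry (not part of build_libtorch.sh).
# Runs the given command up to RETRY_ATTEMPTS times (assumed default: 5),
# doubling the wait between attempts starting from RETRY_BASE_DELAY seconds
# (assumed default: 1), which gives waits of 1, 2, 4, 8 with the defaults.
retry_backoff() {
    local attempts="${RETRY_ATTEMPTS:-5}"
    local delay="${RETRY_BASE_DELAY:-1}"
    local i
    for ((i = 1; i <= attempts; i++)); do
        # "$@" preserves each argument as a separate word, including
        # arguments that contain spaces.
        if "$@"; then
            return 0
        fi
        if ((i < attempts)); then
            sleep "$delay"
            delay=$((delay * 2))
        fi
    done
    return 1
}
```

A call such as `retry_backoff pip install -q numpy` would behave like the original helper under the default settings. Because the command runs inside an `if`, a failing attempt does not abort a script that uses `set -e`.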

									
										263

.ci/manywheel/build_rocm.sh
									
										Executable file
									
												
				@ -0,0 +1,263 @@

				#!/usr/bin/env bash

				set -ex

				export ROCM_HOME=/opt/rocm

				export MAGMA_HOME=$ROCM_HOME/magma

				# TODO: libtorch_cpu.so is broken when building with Debug info

				export BUILD_DEBUG_INFO=0

				# TODO Are these all used/needed?

				export TH_BINARY_BUILD=1

				export USE_STATIC_CUDNN=1

				export USE_STATIC_NCCL=1

				export ATEN_STATIC_CUDA=1

				export USE_CUDA_STATIC_LINK=1

				export INSTALL_TEST=0 # don't install test binaries into site-packages

				# Set RPATH instead of RUNPATH when using patchelf to avoid LD_LIBRARY_PATH override

				export FORCE_RPATH="--force-rpath"

				# Keep an array of cmake variables to add to

				if [[ -z "$CMAKE_ARGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build()

				    CMAKE_ARGS=()

				fi

				if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then

				    # These are passed to tools/build_pytorch_libs.sh::build_caffe2()

				    EXTRA_CAFFE2_CMAKE_FLAGS=()

				fi

				# Determine ROCm version and architectures to build for

				#

				# NOTE: We should first check `DESIRED_CUDA` when determining `ROCM_VERSION`

				if [[ -n "$DESIRED_CUDA" ]]; then

				    if ! echo "${DESIRED_CUDA}"| grep "^rocm" >/dev/null 2>/dev/null; then

				        export DESIRED_CUDA="rocm${DESIRED_CUDA}"

				    fi

				    # rocm3.7, rocm3.5.1

				    ROCM_VERSION="$DESIRED_CUDA"

				    echo "Using $ROCM_VERSION as determined by DESIRED_CUDA"

				else

				    echo "Must set DESIRED_CUDA"

				    exit 1

				fi

				# Package directories

				WHEELHOUSE_DIR="wheelhouse$ROCM_VERSION"

				LIBTORCH_HOUSE_DIR="libtorch_house$ROCM_VERSION"

				if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then

				    if [[ -z "$BUILD_PYTHONLESS" ]]; then

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$ROCM_VERSION"

				    else

				        PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$ROCM_VERSION"

				    fi

				fi

				mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true

				# To make version comparison easier, create an integer representation.

				ROCM_VERSION_CLEAN=$(echo ${ROCM_VERSION} | sed s/rocm//)

				save_IFS="$IFS"

				IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION_CLEAN})

				IFS="$save_IFS"

				if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then

				    ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}

				    ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}

				    ROCM_VERSION_PATCH=0

				elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then

				    ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}

				    ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}

				    ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}

				else

				    echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"

				    exit 1

				fi

				ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))

				# Required ROCm libraries

				ROCM_SO_FILES=(

				    "libMIOpen.so"

				    "libamdhip64.so"

				    "libhipblas.so"

				    "libhipfft.so"

				    "libhiprand.so"

				    "libhipsolver.so"

				    "libhipsparse.so"

				    "libhsa-runtime64.so"

				    "libamd_comgr.so"

				    "libmagma.so"

				    "librccl.so"

				    "librocblas.so"

				    "librocfft.so"

				    "librocm_smi64.so"

				    "librocrand.so"

				    "librocsolver.so"

				    "librocsparse.so"

				    "libroctracer64.so"

				    "libroctx64.so"

				    "libhipblaslt.so"

				    "libhiprtc.so"

				)

				if [[ $ROCM_INT -ge 60100 ]]; then

				    ROCM_SO_FILES+=("librocprofiler-register.so")

				fi

				if [[ $ROCM_INT -ge 60200 ]]; then

				    ROCM_SO_FILES+=("librocm-core.so")

				fi

				OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)

				if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				    LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

				    LIBNUMA_PATH="/usr/lib64/libnuma.so.1"

				    LIBELF_PATH="/usr/lib64/libelf.so.1"

				    LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"

				    LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"

				    LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"

				    if [[ $ROCM_INT -ge 60100 ]]; then

				        # Below libs are direct dependencies of libhipsolver

				        LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"

				        LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"

				        # Below libs are direct dependencies of libcholmod

				        LIBAMD_PATH="/lib64/libamd.so.2"

				        LIBCAMD_PATH="/lib64/libcamd.so.2"

				        LIBCCOLAMD_PATH="/lib64/libccolamd.so.2"

				        LIBCOLAMD_PATH="/lib64/libcolamd.so.2"

				        LIBSATLAS_PATH="/lib64/atlas/libsatlas.so.3"

				        # Below libs are direct dependencies of libsatlas

				        LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"

				        LIBQUADMATH_PATH="/lib64/libquadmath.so.0"

				    fi

				    MAYBE_LIB64=lib64

				elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"

				    LIBNUMA_PATH="/usr/lib/x86_64-linux-gnu/libnuma.so.1"

				    LIBELF_PATH="/usr/lib/x86_64-linux-gnu/libelf.so.1"

				    if [[ $ROCM_INT -ge 50300 ]]; then

				        LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.6"

				    else

				        LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.5"

				    fi

				    LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"

				    LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"

				    if [[ $ROCM_INT -ge 60100 ]]; then

				        # Below libs are direct dependencies of libhipsolver

				        LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"

				        # Below libs are direct dependencies of libcholmod

				        LIBSUITESPARSE_CONFIG_PATH="/lib/x86_64-linux-gnu/libsuitesparseconfig.so.5"

				        LIBAMD_PATH="/lib/x86_64-linux-gnu/libamd.so.2"

				        LIBCAMD_PATH="/lib/x86_64-linux-gnu/libcamd.so.2"

				        LIBCCOLAMD_PATH="/lib/x86_64-linux-gnu/libccolamd.so.2"

				        LIBCOLAMD_PATH="/lib/x86_64-linux-gnu/libcolamd.so.2"

				        LIBMETIS_PATH="/lib/x86_64-linux-gnu/libmetis.so.5"

				        LIBLAPACK_PATH="/lib/x86_64-linux-gnu/liblapack.so.3"

				        LIBBLAS_PATH="/lib/x86_64-linux-gnu/libblas.so.3"

				        # Below libs are direct dependencies of libblas

				        LIBGFORTRAN_PATH="/lib/x86_64-linux-gnu/libgfortran.so.5"

				        LIBQUADMATH_PATH="/lib/x86_64-linux-gnu/libquadmath.so.0"

				    fi

				    MAYBE_LIB64=lib

				fi

				OS_SO_PATHS=($LIBGOMP_PATH $LIBNUMA_PATH\

				             $LIBELF_PATH $LIBTINFO_PATH\

				             $LIBDRM_PATH $LIBDRM_AMDGPU_PATH\

				             $LIBSUITESPARSE_CONFIG_PATH\

				             $LIBCHOLMOD_PATH $LIBAMD_PATH\

				             $LIBCAMD_PATH $LIBCCOLAMD_PATH\

				             $LIBCOLAMD_PATH $LIBSATLAS_PATH\

				             $LIBGFORTRAN_PATH $LIBQUADMATH_PATH\

				             $LIBMETIS_PATH $LIBLAPACK_PATH\

				             $LIBBLAS_PATH)

				OS_SO_FILES=()

				for lib in "${OS_SO_PATHS[@]}"

				do

				    file_name="${lib##*/}" # Substring removal of path to get filename

				    OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array

				done

				# PyTorch-version specific

				# AOTriton dependency only for PyTorch >= 2.4

				if (( $(echo "${PYTORCH_VERSION} 2.4" | awk '{print ($1 >= $2)}') )); then

				    ROCM_SO_FILES+=("libaotriton_v2.so")

				fi

				# rocBLAS library files

				ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library

				ROCBLAS_LIB_DST=lib/rocblas/library

				ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Convert the ;-separated arch list into |-separated alternatives for grep -E

				ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)

				OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)

				ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)

				# hipblaslt library files

				HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library

				HIPBLASLT_LIB_DST=lib/hipblaslt/library

				ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)

				OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)

				HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)

				# ROCm library files

				ROCM_SO_PATHS=()

				for lib in "${ROCM_SO_FILES[@]}"

				do

				    file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib

				    if [[ -z $file_path ]]; then

				        if [ -d "$ROCM_HOME/lib64/" ]; then

				            file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64

				        fi

				    fi

				    if [[ -z $file_path ]]; then

				        file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME

				    fi

				    if [[ -z $file_path ]]; then

				        echo "Error: Library file $lib is not found." >&2

				        exit 1

				    fi

				    ROCM_SO_PATHS[${#ROCM_SO_PATHS[@]}]="$file_path" # Append lib to array

				done

				DEPS_LIST=(

				    ${ROCM_SO_PATHS[*]}

				    ${OS_SO_PATHS[*]}

				)

				DEPS_SONAME=(

				    ${ROCM_SO_FILES[*]}

				    ${OS_SO_FILES[*]}

				)

				DEPS_AUX_SRCLIST=(

				    "${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"

				    "${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"

				    "/opt/amdgpu/share/libdrm/amdgpu.ids"

				)

				DEPS_AUX_DSTLIST=(

				    "${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"

				    "${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"

				    "share/libdrm/amdgpu.ids"

				)

				# MIOpen library files

				MIOPEN_SHARE_SRC=$ROCM_HOME/share/miopen/db

				MIOPEN_SHARE_DST=share/miopen/db

				MIOPEN_SHARE_FILES=($(ls $MIOPEN_SHARE_SRC | grep -E $ARCH))

				DEPS_AUX_SRCLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_SRC/})

				DEPS_AUX_DSTLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_DST/})

				# RCCL library files

				RCCL_SHARE_SRC=$ROCM_HOME/share/rccl/msccl-algorithms

				RCCL_SHARE_DST=share/rccl/msccl-algorithms

				RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))

				DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})

				DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})

				echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

				if [[ -z "$BUILD_PYTHONLESS" ]]; then

				    BUILD_SCRIPT=build_common.sh

				else

				    BUILD_SCRIPT=build_libtorch.sh

				fi

				source $SCRIPTPATH/${BUILD_SCRIPT}

									
										26

.ci/manywheel/test_wheel.sh
									
										Executable file
									
												
				@ -0,0 +1,26 @@

				#!/usr/bin/env bash

				set -e

				yum install -y wget git

				rm -rf /usr/local/cuda*

				# Install Anaconda

				if ! ls /py

				then

				    echo "Miniconda needs to be installed"

				    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh

				    bash ~/miniconda.sh -b -p /py

				else

				    echo "Miniconda is already installed"

				fi

				export PATH="/py/bin:$PATH"

				# Anaconda token

				if ls /remote/token

				then

				   source /remote/token

				fi

				conda install -y conda-build anaconda-client

									
										39

.ci/pytorch/build.sh
									
												
				@ -49,13 +49,8 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				fi

				# Enable LLVM dependency for TensorExpr testing

				if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then

				  export USE_LLVM=/opt/rocm/llvm

				  export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm

				else

				  export USE_LLVM=/opt/llvm

				  export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				fi

				export USE_LLVM=/opt/llvm

				export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then

				  # To build test_edge_op_registration

				@ -183,7 +178,7 @@ fi

				# sccache will fail for CUDA builds if all cores are used for compiling

				# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used

				if [ -z "$MAX_JOBS" ]; then

				  if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then

				  if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then

				    export MAX_JOBS=$(($(nproc) - 1))

				  fi

				fi

				@ -208,10 +203,12 @@ if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then

				fi

				if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then

				  export LDSHARED="clang --shared"

				  export USE_CUDA=0

				  if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then

				    export USE_CUDA=1

				  fi

				  export USE_ASAN=1

				  export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"

				  export REL_WITH_DEB_INFO=1

				  export UBSAN_FLAGS="-fno-sanitize-recover=all"

				  unset USE_LLVM

				fi

				@ -223,10 +220,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then

				    export USE_PRECOMPILED_HEADERS=1

				fi

				if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build*  ]]; then

				  export USE_GLOO_WITH_OPENSSL=ON

				fi

				if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then

				  export BUILD_STATIC_RUNTIME_BENCHMARK=ON

				fi

				@ -237,7 +230,7 @@ fi

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				@ -345,11 +338,11 @@ else

				    CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"

				    CUSTOM_OP_TEST="$PWD/test/custom_operator"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"

				    mkdir -p "$CUSTOM_OP_BUILD"

				    pushd "$CUSTOM_OP_BUILD"

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -359,10 +352,10 @@ else

				    JIT_HOOK_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/jit-hook-build"

				    JIT_HOOK_TEST="$PWD/test/jit_hooks"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"

				    mkdir -p "$JIT_HOOK_BUILD"

				    pushd "$JIT_HOOK_BUILD"

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -374,7 +367,7 @@ else

				    python --version

				    mkdir -p "$CUSTOM_BACKEND_BUILD"

				    pushd "$CUSTOM_BACKEND_BUILD"

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -405,8 +398,6 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];

				  python tools/stats/export_test_times.py

				fi

				# snadampal: skipping it till sccache support added for aarch64

				# https://github.com/pytorch/pytorch/issues/121559

				if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *s390x* ]]; then

				  print_sccache_stats

				fi

									
										6

.ci/pytorch/common-build.sh
									
												
				@ -6,6 +6,12 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then

				    # Save the absolute path in case later we chdir (as occurs in the gpu perf test)

				    script_dir="$( cd "$(dirname "${BASH_SOURCE[0]}")" || exit ; pwd -P )"

				    if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then

				        # This is really weird, but newer sccache somehow produces broken binary

				        # see https://github.com/pytorch/pytorch/issues/139188

				        sudo mv /opt/cache/bin/sccache-0.2.14a /opt/cache/bin/sccache

				    fi

				    if which sccache > /dev/null; then

				        # Save sccache logs to file

				        sccache --stop-server > /dev/null  2>&1 || true

									
										13

.ci/pytorch/common_utils.sh
									
												
				@ -191,9 +191,22 @@ function install_torchrec_and_fbgemm() {

				  pip_uninstall torchrec-nightly

				  pip_uninstall fbgemm-gpu-nightly

				  pip_install setuptools-git-versioning scikit-build pyre-extensions

				  # TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it

				  # seems to be an sccache-related issue

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    unset CMAKE_CUDA_COMPILER_LAUNCHER

				    sudo mv /opt/cache/bin /opt/cache/bin-backup

				  fi

				  # See https://github.com/pytorch/pytorch/issues/106971

				  CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"

				  pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache

				    sudo mv /opt/cache/bin-backup /opt/cache/bin

				  fi

				}

				function clone_pytorch_xla() {

									
										12

.ci/pytorch/create_test_cert.py
									
												
				@ -1,4 +1,4 @@

				from datetime import datetime, timedelta

				from datetime import datetime, timedelta, timezone

				from tempfile import mkdtemp

				from cryptography import x509

				@ -42,11 +42,10 @@ def create_cert(path, C, ST, L, O, key):

				        .issuer_name(issuer)

				        .public_key(key.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.utcnow())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.utcnow()

				            + timedelta(days=10)

				            datetime.now(timezone.utc) + timedelta(days=10)

				        )

				        .add_extension(

				            x509.BasicConstraints(ca=True, path_length=None),

				@ -88,11 +87,10 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):

				        .issuer_name(ca_cert.subject)

				        .public_key(csr_cert.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.utcnow())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.utcnow()

				            + timedelta(days=10)

				            datetime.now(timezone.utc) + timedelta(days=10)

				            # Sign our certificate with our private key

				        )

				        .sign(private_ca_key, hashes.SHA256())

									
										19

.ci/pytorch/macos-test.sh
									
												
				@ -9,15 +9,13 @@ if [[ -n "$CONDA_ENV" ]]; then

				  export PATH="$CONDA_ENV/bin":$PATH

				fi

				# Test that OpenMP is enabled for non-arm64 build

				if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then

				  pushd test

				  if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then

				    echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"

				    exit 1

				  fi

				  popd

				# Test that OpenMP is enabled

				pushd test

				if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then

				  echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"

				  exit 1

				fi

				popd

				setup_test_python() {

				  # The CircleCI worker hostname doesn't resolve to an address.

				@ -27,8 +25,9 @@ setup_test_python() {

				  echo "Ninja version: $(ninja --version)"

				  echo "Python version: $(which python) ($(python --version))"

				  # Increase default limit on open file handles from 256 to 1024

				  ulimit -n 1024

				  # Set the limit on open file handles to 16384

				  # might help with intermittent compiler test failures

				  ulimit -n 16384

				}

				test_python_all() {

									
										153

.ci/pytorch/test.sh
									
												
				@ -49,16 +49,16 @@ NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"

				export VALGRIND=ON

				# export TORCH_INDUCTOR_INSTALL_GXX=ON

				if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then

				  # clang9 appears to miscompile code involving c10::optional<c10::SymInt>,

				  # clang9 appears to miscompile code involving std::optional<c10::SymInt>,

				  # such that valgrind complains along these lines:

				  #

				  # Conditional jump or move depends on uninitialised value(s)

				  #    at 0x40303A: ~optional_base (Optional.h:281)

				  #    by 0x40303A: call (Dispatcher.h:448)

				  #    by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:10)

				  #    by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:10)

				  #    by 0x403700: main (basic.cpp:16)

				  #  Uninitialised value was created by a stack allocation

				  #    at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:6)

				  #    at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:6)

				  #

				  # The problem does not appear with gcc or newer versions of clang (we tested

				  # clang14).  So we suppress valgrind testing for clang9 specifically.

				@ -72,7 +72,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then

				  #

				  # using namespace at;

				  #

				  # Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, c10::optional<c10::SymInt> storage_offset) {

				  # Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, std::optional<c10::SymInt> storage_offset) {

				  #   auto op = c10::Dispatcher::singleton()

				  #       .findSchemaOrThrow(at::_ops::as_strided::name, at::_ops::as_strided::overload_name)

				  #       .typed<at::_ops::as_strided::schema>();

				@ -81,7 +81,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then

				  #

				  # int main(int argv) {

				  #   Tensor b = empty({3, 4});

				  #   auto z = call(b, b.sym_sizes(), b.sym_strides(), c10::nullopt);

				  #   auto z = call(b, b.sym_sizes(), b.sym_strides(), std::nullopt);

				  # }

				  export VALGRIND=OFF

				fi

				@ -196,6 +196,9 @@ install_tlparse

				# ASAN test is not working

				if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then

				    export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=true:strict_init_order=true:detect_odr_violation=1:detect_container_overflow=0:check_initialization_order=true:debug=true

				    if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then

				        export ASAN_OPTIONS="${ASAN_OPTIONS}:protect_shadow_gap=0"

				    fi

				    export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp

				    export PYTORCH_TEST_WITH_ASAN=1

				    export PYTORCH_TEST_WITH_UBSAN=1

				@ -233,8 +236,8 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then

				    # it depends on a ton of dynamic libraries that most programs aren't gonna

				    # have, and it applies to child processes.

				    # TODO: get rid of the hardcoded path

				    export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so

				    LD_PRELOAD=$(clang --print-file-name=libclang_rt.asan-x86_64.so)

				    export LD_PRELOAD

				    # Disable valgrind for asan

				    export VALGRIND=OFF

				@ -281,7 +284,7 @@ test_python_shard() {

				  # modify LD_LIBRARY_PATH to ensure it has the conda env.

				  # This set of tests has been shown to be buggy without it for the split-build

				  time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION

				  time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running

				  assert_git_not_dirty

				}

				@ -293,7 +296,7 @@ test_python() {

				}

				test_dynamo_shard() {

				test_dynamo_wrapped_shard() {

				  if [[ -z "$NUM_TEST_SHARDS" ]]; then

				    echo "NUM_TEST_SHARDS must be defined to run a Python test shard"

				    exit 1

				@ -307,7 +310,8 @@ test_dynamo_shard() {

				    --exclude-distributed-tests \

				    --exclude-torch-export-tests \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				    --verbose \

				    --upload-artifacts-while-running

				  assert_git_not_dirty

				}

				@ -320,6 +324,7 @@ test_inductor_distributed() {

				  python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose

				  python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose

				  python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose

				  python test/run_test.py -i distributed/_composable/test_replicate_with_compiler.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose

				@ -331,11 +336,12 @@ test_inductor_distributed() {

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_compile.py --verbose

				  python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose

				  # this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported

				  # with if required # gpus aren't available

				  python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives --verbose

				  python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_compute_comm_reordering --verbose

				  assert_git_not_dirty

				}

				@ -369,22 +375,39 @@ test_inductor_aoti() {

				  CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				}

				test_inductor_cpp_wrapper_abi_compatible() {

				  export TORCHINDUCTOR_ABI_COMPATIBLE=1

				test_inductor_cpp_wrapper() {

				  export TORCHINDUCTOR_CPP_WRAPPER=1

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"

				  # cpu stack allocation causes segfault and needs more investigation

				  PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  python test/run_test.py --include inductor/test_cuda_cpp_wrapper

				  # Run certain inductor unit tests with cpp wrapper. In the end state, we should be able to run all the inductor

				  # unit tests with cpp wrapper.

				  python test/run_test.py --include inductor/test_torchinductor.py --verbose

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \

				  # Run inductor benchmark tests with cpp wrapper.

				  # Skip benchmark tests if it's in rerun-disabled-mode.

				  if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]]; then

				    echo "skip dynamo benchmark tests for rerun-disabled-test"

				  else

				    echo "run dynamo benchmark tests with cpp wrapper"

				    python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \

				    --training --inductor --disable-cudagraphs --only vit_base_patch16_224 \

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				    python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				      --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				    python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				      --bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				    python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				      --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				  fi

				}

				# "Global" flags for inductor benchmarking controlled by TEST_CONFIG

				@ -401,10 +424,10 @@ pr_time_benchmarks() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"

				  PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"

				  echo "benchmark results on current PR: "

				  cat  "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt"

				  cat  "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"

				  PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "$TEST_REPORTS_DIR/new_expected_results.csv"

				}

				if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then

				@ -512,7 +535,7 @@ test_perf_for_dashboard() {

				              "${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \

				              --output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				        fi

				        TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				@ -567,13 +590,6 @@ test_single_dynamo_benchmark() {

				    test_perf_for_dashboard "$suite" \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"

				  else

				    if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then

				      # Test AOTInductor with the ABI-compatible mode on CI

				      # This can be removed once the ABI-compatible mode becomes default.

				      # For CPU device, we prefer non ABI-compatible mode on CI when testing AOTInductor.

				      export TORCHINDUCTOR_ABI_COMPATIBLE=1

				    fi

				    if [[ "${TEST_CONFIG}" == *_avx2* ]]; then

				      TEST_CONFIG=${TEST_CONFIG//_avx2/}

				    fi

				@ -607,6 +623,11 @@ test_inductor_halide() {

				  assert_git_not_dirty

				}

				test_inductor_triton_cpu() {

				  python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose

				  assert_git_not_dirty

				}

				test_dynamo_benchmark() {

				  # Usage: test_dynamo_benchmark huggingface 0

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				@ -644,32 +665,12 @@ test_inductor_torchbench_smoketest_perf() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  # Test some models in the cpp wrapper mode

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				  python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \

				    --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \

				    --output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"

				  # The threshold value needs to be actively maintained to make this check useful

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \

				    --export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"

				  # The threshold value needs to be actively maintained to make this check useful

				  # The perf number of nanogpt seems not very stable, e.g.

				  # https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,

				  # and thus we lower its threshold to reduce flakiness. If this continues to be a problem,

				  # we switch to use some other model.

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9

				  # Check memory compression ratio for a few models

				  for test in hf_Albert timm_vision_transformer; do

				    python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \

				@ -713,6 +714,10 @@ test_inductor_set_cpu_affinity(){

				    export KMP_BLOCKTIME=1

				  fi

				  cores=$(test_inductor_get_core_number)

				  # Set number of cores to 16 on Aarch64 for performance runs.

				  if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then

				    cores=16

				  fi

				  export OMP_NUM_THREADS=$cores

				  end_core=$((cores-1))

				  export TASKSET="taskset -c 0-$end_core"

				@ -749,19 +754,9 @@ test_inductor_torchbench_cpu_smoketest_perf(){

				    fi

				    cat "$output_name"

				    # The threshold value needs to be actively maintained to make this check useful.

				    python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"

				    # Allow 1% variance for CPU perf to accommodate perf fluctuation

				    python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target" -s 0.99

				  done

				  # Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \

				    --bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \

				    --output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \

				    --bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \

				    --output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"

				}

				test_torchbench_gcp_smoketest(){

				@ -819,7 +814,7 @@ test_without_numpy() {

				  # Regression test for https://github.com/pytorch/pytorch/issues/66353

				  python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))"

				  # Regression test for https://github.com/pytorch/pytorch/issues/109387

				  if [[ "${TEST_CONFIG}" == *dynamo* ]]; then

				  if [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then

				    python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"

				  fi

				  popd

				@ -1372,7 +1367,7 @@ test_executorch() {

				  echo "Run ExecuTorch regression tests for some models"

				  # TODO(huydhn): Add more coverage here using ExecuTorch's gather models script

				  # shellcheck disable=SC1091

				  source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''

				  source .ci/scripts/test_model.sh mv3 cmake xnnpack-quantization-delegation ''

				  popd

				@ -1383,14 +1378,16 @@ test_executorch() {

				  assert_git_not_dirty

				}

				test_linux_aarch64(){

				test_linux_aarch64() {

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				       test_transformers test_multiprocessing test_numpy_interop --verbose

				        test_transformers test_multiprocessing test_numpy_interop \

				        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Dynamo tests

				  python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \

				       dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \

				       dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose

				       dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \

				       --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Inductor tests

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \

				@ -1400,7 +1397,8 @@ test_linux_aarch64(){

				       inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \

				       inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \

				       inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \

				       inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose

				       inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \

				       --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				}

				if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

				@ -1433,6 +1431,8 @@ elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

				  test_inductor_halide

				elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then

				  test_inductor_triton_cpu

				elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then

				  test_inductor_micro_benchmark

				elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then

				@ -1449,14 +1449,13 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  else

				    install_torchaudio cuda

				  fi

				  install_torchtext

				  install_torchvision

				  TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git

				  id=$((SHARD_NUMBER-1))

				  # https://github.com/opencv/opencv-python/issues/885

				  pip_install opencv-python==4.8.0.74

				  if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then

				    checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer

				    checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then

				    checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \

				@ -1475,9 +1474,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				    fi

				    PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"

				  fi

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then

				  install_torchaudio cuda

				  install_torchvision

				  test_inductor_cpp_wrapper_abi_compatible

				  checkout_install_torchbench hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor_shard "${SHARD_NUMBER}"

				@ -1486,9 +1487,9 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				      test_inductor_distributed

				    fi

				  fi

				elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then

				elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then

				  install_torchvision

				  test_dynamo_shard "${SHARD_NUMBER}"

				  test_dynamo_wrapped_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_aten

				  fi

									
										2

.ci/pytorch/win-build.sh
									
												
				@ -26,7 +26,7 @@ fi

				export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers

				set +ex

				grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=eval_frame.c torch/

				grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h  --exclude=pythoncapi_compat.h --exclude=eval_frame.c torch/

				PYLONG_API_CHECK=$?

				if [[ $PYLONG_API_CHECK == 0 ]]; then

				  echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"

									
										3

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -52,7 +52,8 @@ if not errorlevel 0 goto fail

				if "%USE_XPU%"=="1" (

				  :: Activate xpu environment - VS env is required for xpu

				  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

				  call "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\env\vars.bat"

				  call "C:\Program Files (x86)\Intel\oneAPI\ocloc\latest\env\vars.bat"

				  if errorlevel 1 exit /b 1

				  :: Reduce build time. Only have MTL self-hosted runner now

				  SET TORCH_XPU_ARCH_LIST=xe-lpg

									
										6

.ci/pytorch/win-test.sh
									
												
				@ -43,6 +43,12 @@ python -m pip install z3-solver==4.12.2.0

				# Install tlparse for test\dynamo\test_structured_trace.py UTs.

				python -m pip install tlparse==0.3.25

				# Install parameterized

				python -m pip install parameterized==0.8.1

				# Install pulp for testing ilps under torch\distributed\_tools

				python -m pip install pulp==2.9.0

				run_tests() {

				    # Run nvidia-smi if available

				    for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

									
										9

.circleci/scripts/binary_linux_test.sh
									
												
				@ -27,12 +27,11 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				  source activate testenv >/dev/null

				elif [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"

				  # Prior to Python 3.8 paths were suffixed with an 'm'

				  if [[ -d  "\${python_path}/bin" ]]; then

				    export PATH="\${python_path}/bin:\$PATH"

				  elif [[ -d "\${python_path}m/bin" ]]; then

				    export PATH="\${python_path}m/bin:\$PATH"

				  if [[ "\$python_nodot" = *t ]]; then

				    python_digits="\$(echo $DESIRED_PYTHON | tr -cd [:digit:])"

				    python_path="/opt/python/cp\$python_digits-cp\${python_digits}t"

				  fi

				  export PATH="\${python_path}/bin:\$PATH"

				fi

				EXTRA_CONDA_FLAGS=""
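The hunk above special-cases free-threaded interpreters, whose version string ends in `t`. A rough sketch of that path computation outside the heredoc follows. It assumes `python_nodot` is simply `DESIRED_PYTHON` with the dots removed (the derivation is not shown in this hunk), and the function name `manylinux_python_path` is hypothetical.

```shell
# Hypothetical helper, not part of binary_linux_test.sh.
# Assumption: python_nodot is DESIRED_PYTHON with the dots stripped.
manylinux_python_path() {
  desired_python="$1"
  python_nodot="$(printf '%s' "$desired_python" | tr -d '.')"
  # Default layout: /opt/python/cp<ver>-cp<ver>
  python_path="/opt/python/cp${python_nodot}-cp${python_nodot}"
  case "$python_nodot" in
    *t)
      # Free-threaded build: only the ABI tag carries the trailing "t".
      python_digits="$(printf '%s' "$desired_python" | tr -cd '[:digit:]')"
      python_path="/opt/python/cp${python_digits}-cp${python_digits}t"
      ;;
  esac
  printf '%s\n' "$python_path"
}
```

Calling `manylinux_python_path 3.13t` yields a directory whose interpreter tag is `cp313` and whose ABI tag is `cp313t`, which is the shape the new branch in the diff builds.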

									
										8

.circleci/scripts/binary_populate_env.sh
									
												
				@ -114,6 +114,12 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B

				    fi

				fi

				USE_GLOO_WITH_OPENSSL="ON"

				if [[ "$GPU_ARCH_TYPE" =~ .*aarch64.* ]]; then

				  USE_GLOO_WITH_OPENSSL="OFF"

				  USE_GOLD_LINKER="OFF"

				fi

				cat >"$envfile" <<EOL

				# =================== The following code will be executed inside Docker container ===================

				export TZ=UTC

				@ -153,7 +159,7 @@ export DOCKER_IMAGE="$DOCKER_IMAGE"

				export USE_GOLD_LINKER="${USE_GOLD_LINKER}"

				export USE_GLOO_WITH_OPENSSL="ON"

				export USE_GLOO_WITH_OPENSSL="${USE_GLOO_WITH_OPENSSL}"

				# =================== The above code will be executed inside Docker container ===================

				EOL

28

.clang-format


 @ -44,7 +44,9 @@ ContinuationIndentWidth: 4
 Cpp11BracedListStyle: true
 DerivePointerAlignment: false
 DisableFormat:   false
 ForEachMacros:   [ FOR_EACH_RANGE, FOR_EACH, ]
 ForEachMacros:
   - FOR_EACH_RANGE
   - FOR_EACH
 IncludeCategories:
   - Regex:           '^<.*\.h(pp)?>'
     Priority:        1
 @ -58,6 +60,24 @@ IndentWrappedFunctionNames: false
 KeepEmptyLinesAtTheStartOfBlocks: false
 MacroBlockBegin: ''
 MacroBlockEnd:   ''
 Macros:
   - >-
     PyObject_HEAD_INIT(type)={
         /* this is not exactly match with PyObject_HEAD_INIT in Python source code
          * but it is enough for clang-format */
         { 0xFFFFFFFF },
         (type)
     },
   - >-
     PyVarObject_HEAD_INIT(type, size)={
         {
             /* manually expand PyObject_HEAD_INIT(type) above
              * because clang-format do not support recursive expansion */
             { 0xFFFFFFFF },
             (type)
         },
         (size)
     },
 MaxEmptyLinesToKeep: 1
 NamespaceIndentation: None
 PenaltyBreakBeforeFirstCallParameter: 1
 @ -79,7 +99,11 @@ SpacesInContainerLiterals: true
 SpacesInCStyleCastParentheses: false
 SpacesInParentheses: false
 SpacesInSquareBrackets: false
 Standard:        Cpp11
 Standard:        c++17
 StatementMacros:
   - PyObject_HEAD
   - PyObject_VAR_HEAD
   - PyException_HEAD
 TabWidth:        8
 UseTab:          Never
 ---

									
										38

.github/ISSUE_TEMPLATE.md
									
										vendored
									
											
				@ -1,38 +0,0 @@

				If you have a question or would like help and support, please ask at our

				[forums](https://discuss.pytorch.org/).

				If you are submitting a feature request, please preface the title with [feature request].

				If you are submitting a bug report, please fill in the following details.

				## Issue description

				Provide a short description.

				## Code example

				Please try to provide a minimal example to repro the bug.

				Error messages and stack traces are also helpful.

				## System Info

				Please copy and paste the output from our

				[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py)

				(or fill out the checklist below manually).

				You can get the script and run it with:

				```

				wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py

				# For security purposes, please check the contents of collect_env.py before running it.

				python collect_env.py

				```

				- PyTorch or Caffe2:

				- How you installed PyTorch (conda, pip, source):

				- Build command you used (if compiling from source):

				- OS:

				- PyTorch version:

				- Python version:

				- CUDA/cuDNN version:

				- GPU models and configuration:

				- GCC version (if compiling from source):

				- CMake version:

				- Versions of any other relevant libraries:

									
										3

.github/ISSUE_TEMPLATE/ci-sev.md
									
										vendored
									
												
				@ -5,7 +5,8 @@ about: Tracking incidents for PyTorch's CI infra.

				> NOTE: Remember to label this issue with "`ci: sev`"

				**MERGE BLOCKING** <!-- remove this line if you don't want this SEV to block merges -->

				 <!-- uncomment the below line if you don't want this SEV to block merges -->

				 <!--  **MERGE BLOCKING** -->

				## Current Status

				*Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*.

									
										24

.github/actionlint.yaml
									
										vendored
									
												
				@ -32,30 +32,6 @@ self-hosted-runner:

				    - lf.linux.8xlarge.nvidia.gpu

				    - lf.linux.16xlarge.nvidia.gpu

				    - lf.linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners with new Amazon 2023 AMI

				    - amz2023.linux.large

				    - amz2023.linux.2xlarge

				    - amz2023.linux.4xlarge

				    - amz2023.linux.12xlarge

				    - amz2023.linux.24xlarge

				    - amz2023.linux.arm64.2xlarge

				    - amz2023.linux.arm64.m7g.4xlarge

				    - amz2023.linux.arm64.m7g.4xlarge.ephemeral

				    - amz2023.linux.4xlarge.nvidia.gpu

				    - amz2023.linux.8xlarge.nvidia.gpu

				    - amz2023.linux.16xlarge.nvidia.gpu

				    - amz2023.linux.g5.4xlarge.nvidia.gpu

				    # Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account

				    - amz2023.lf.linux.large

				    - amz2023.lf.linux.2xlarge

				    - amz2023.lf.linux.4xlarge

				    - amz2023.lf.linux.12xlarge

				    - amz2023.lf.linux.24xlarge

				    - amz2023.lf.linux.arm64.2xlarge

				    - amz2023.lf.linux.4xlarge.nvidia.gpu

				    - amz2023.lf.linux.8xlarge.nvidia.gpu

				    - amz2023.lf.linux.16xlarge.nvidia.gpu

				    - amz2023.lf.linux.g5.4xlarge.nvidia.gpu

				    # Repo-specific IBM hosted S390x runner

				    - linux.s390x

				    # Organization wide AWS Windows runners

									
										4

.github/actions/build-android/action.yml
									
										vendored
									
												
				@ -42,11 +42,14 @@ runs:

				        PR_NUMBER: ${{ github.event.pull_request.number }}

				        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_REGION: us-east-1

				        DOCKER_IMAGE: ${{ inputs.docker-image  }}

				        MATRIX_ARCH: ${{ inputs.arch }}

				      run: |

				        # detached container should get cleaned up by teardown_ec2_linux

				        set -exo pipefail

				        # Fetch aws credential from IMDs

				        eval "$(python3 .github/scripts/get_aws_session_tokens.py)"

				        export container_name

				        container_name=$(docker run \

				          -e BUILD_ENVIRONMENT \

				@ -56,6 +59,7 @@ runs:

				          -e SHA1 \

				          -e BRANCH \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_REGION \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

									
										6

.github/actions/checkout-pytorch/action.yml
									
										vendored
									
												
				@ -18,8 +18,14 @@ inputs:

				runs:

				  using: composite

				  steps:

				    - name: Check if in a container runner

				      shell: bash

				      id: check_container_runner

				      run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"

				    - name: Clean workspace

				      shell: bash

				      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}

				      env:

				        NO_SUDO: ${{ inputs.no-sudo }}

				      run: |

									
										30

.github/actions/linux-test/action.yml
									
										vendored
									
												
				@ -85,15 +85,25 @@ runs:

				      with:

				        docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    - name: Check if in a ARC runner

				    - name: Check if in a container runner

				      shell: bash

				      id: check_arc_runner

				      run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"

				      id: check_container_runner

				      run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"

				    - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG

				      id: install-nvidia-driver

				      uses: pytorch/test-infra/.github/actions/setup-nvidia@main

				      if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}

				    - name: Setup GPU_FLAG for docker run

				      id: setup-gpu-flag

				      run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

				      if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}

				    - name: Setup SCCACHE_SERVER_PORT environment for docker run when on container

				      id: setup-sscache-port-flag

				      run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"

				      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}

				    - name: Lock NVIDIA A100 40GB Frequency

				      shell: bash

				@ -101,7 +111,7 @@ runs:

				        sudo nvidia-smi -pm 1

				        sudo nvidia-smi -ac 1215,1410

				        nvidia-smi

				      if: contains(matrix.runner, 'a100')

				      if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}

				    - name: Start monitoring script

				      id: monitor-script

				@ -172,6 +182,7 @@ runs:

				        NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}

				        TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_REGION: us-east-1

				        SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}

				        SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}

				        DOCKER_IMAGE: ${{ inputs.docker-image }}

				@ -181,6 +192,9 @@ runs:

				        PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}

				        DASHBOARD_TAG: ${{ inputs.dashboard-tag }}

				        HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}

				        SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}

				        IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}

				      shell: bash

				      run: |

				        set -x

				@ -199,6 +213,7 @@ runs:

				        # shellcheck disable=SC2086,SC2090

				        container_name=$(docker run \

				          ${GPU_FLAG:-} \

				          ${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \

				          -e BUILD_ENVIRONMENT \

				          -e PR_NUMBER \

				          -e GITHUB_ACTIONS \

				@ -227,6 +242,7 @@ runs:

				          -e PR_LABELS \

				          -e MAX_JOBS="$(nproc --ignore=2)" \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_REGION \

				          -e SCCACHE_S3_KEY_PREFIX \

				          -e XLA_CUDA \

				          -e XLA_CLANG_CACHE_S3_BUCKET_NAME \

				@ -234,7 +250,9 @@ runs:

				          -e PYTORCH_TEST_RERUN_DISABLED_TESTS \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          -e HUGGING_FACE_HUB_TOKEN \

				          -e SCRIBE_GRAPHQL_ACCESS_TOKEN \

				          -e DASHBOARD_TAG \

				          -e IS_A100_RUNNER \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				@ -305,7 +323,7 @@ runs:

				    - name: Teardown Linux

				      uses: pytorch/test-infra/.github/actions/teardown-linux@main

				      if: always()

				      if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'

				    # NB: We are currently having an intermittent GPU-related issue on G5 runners with

				    # A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

									
										2

.github/actions/pytest-cache-download/action.yml
									
										vendored
									
												
				@ -26,7 +26,7 @@ runs:

				        retry_wait_seconds: 30

				        command: |

				          set -eu

				          python3 -m pip install boto3==1.19.12

				          python3 -m pip install boto3==1.35.42

				    - name: Download the cache

				      shell: bash

									
										2

.github/actions/pytest-cache-upload/action.yml
									
										vendored
									
												
				@ -33,7 +33,7 @@ runs:

				        retry_wait_seconds: 30

				        command: |

				          set -eu

				          python3 -m pip install boto3==1.19.12

				          python3 -m pip install boto3==1.35.42

				    - name: Upload the cache

				      shell: bash

									
										14

.github/actions/setup-linux/action.yml
									
										vendored
									
												
				@ -20,7 +20,7 @@ runs:

				          elif [[ $runner_name_str == *"gcp"* ]]; then

				            echo "Runner is from Google Cloud Platform, No info on ec2 metadata"

				          else

				            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				            curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				          fi

				        }

				        echo "ami-id: $(get_ec2_metadata ami-id)"

				@ -28,14 +28,14 @@ runs:

				        echo "instance-type: $(get_ec2_metadata instance-type)"

				        echo "system info $(uname -a)"

				    - name: Check if in a ARC runner

				    - name: Check if in a container runner

				      shell: bash

				      id: check_arc_runner

				      run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)"  >> $GITHUB_OUTPUT

				      id: check_container_runner

				      run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"

				    - name: Start docker if docker deamon is not running

				      shell: bash

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}

				      run: |

				        if systemctl is-active --quiet docker; then

				            echo "Docker daemon is running...";

				@ -73,7 +73,7 @@ runs:

				        env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				    - name: Kill any existing containers, clean up images

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}

				      shell: bash

				      run: |

				        # ignore expansion of "docker ps -q" since it could be empty

				@ -116,7 +116,7 @@ runs:

				    - name: Check that the docker daemon is running

				      shell: bash

				      continue-on-error: true

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}

				      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}

				      run: |

				        set +x

									
										2

.github/actions/setup-win/action.yml
									
										vendored
									
												
				@ -18,7 +18,7 @@ runs:

				          # Pulled from instance metadata endpoint for EC2

				          # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html

				          category=$1

				          curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				          curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				        }

				        echo "ami-id: $(get_ec2_metadata ami-id)"

				        echo "instance-id: $(get_ec2_metadata instance-id)"

									
										14

.github/actions/upload-test-artifacts/action.yml
									
										vendored
									
												
				@ -28,7 +28,7 @@ runs:

				      run: |

				        # Remove any previous test jsons if they exist

				        rm -f test-jsons-*.zip

				        zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json'

				        zip -r "test-jsons-${FILE_SUFFIX}.zip" test/test-reports -i '*.json'

				    - name: Zip test reports for upload

				      if: runner.os != 'Windows' && !inputs.use-gha

				@ -38,7 +38,7 @@ runs:

				      run: |

				        # Remove any previous test reports if they exist

				        rm -f test-reports-*.zip

				        zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -i '*.csv'

				        zip -r "test-reports-${FILE_SUFFIX}.zip" test/test-reports -i '*.xml' -i '*.csv'

				    - name: Zip usage log for upload

				      if: runner.os != 'Windows' && !inputs.use-gha

				@ -53,8 +53,8 @@ runs:

				        if [ -f 'usage_log.txt' ]; then

				            zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt'

				        fi

				        if ls test/**/*.log 1> /dev/null 2>&1; then

				            zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log'

				        if find "test/test-reports" -name "*.log" 2>/dev/null | grep -q .; then

				            zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log'

				        fi

				    - name: Zip debugging artifacts for upload

				@ -77,7 +77,7 @@ runs:

				        FILE_SUFFIX: ${{ inputs.file-suffix }}

				      run: |

				        # -ir => recursive include all files in pattern

				        7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json'

				        7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\test-reports\*.json'

				    - name: Zip test reports for upload

				      if: runner.os == 'Windows' && !inputs.use-gha

				@ -86,7 +86,7 @@ runs:

				        FILE_SUFFIX: ${{ inputs.file-suffix }}

				      run: |

				        # -ir => recursive include all files in pattern

				        7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' -ir'!test\*.csv'

				        7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\test-reports\*.xml' -ir'!test\test-reports\*.csv'

				    - name: Zip usage log for upload

				      if: runner.os == 'Windows' && !inputs.use-gha

				@ -96,7 +96,7 @@ runs:

				        FILE_SUFFIX: ${{ inputs.file-suffix }}

				      run: |

				        # -ir => recursive include all files in pattern

				        7z a "logs-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\*.log'

				        7z a "logs-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\test-reports\*.log'

				    # S3 upload

				    - name: Store Test Downloaded JSONs on S3
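The usage-log step in this file replaces an `ls test/**/*.log` probe with `find ... | grep -q .`. In a shell without `globstar` enabled, `**` behaves like a single `*`, so the `ls` form only sees logs exactly one directory below `test`, whereas `find` descends to any depth. Whether that was the motivation for the change is not stated in the diff. A small sketch of the `find`-based probe, with a hypothetical function name `has_log_files`:

```shell
# Hypothetical helper: succeeds (exit status 0) when at least one *.log file
# exists anywhere under the given directory, and fails otherwise, including
# when the directory does not exist.
has_log_files() {
  dir="$1"
  find "$dir" -name '*.log' 2>/dev/null | grep -q .
}

# Possible use, mirroring the step in the diff:
#   if has_log_files "test/test-reports"; then
#     zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log'
#   fi
```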

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
 ed7b36b7a741253d4e41e4da3c901d83294503
 fa44bdab1fe49bab58389e7b6a33061ffced9bc7

2

.github/ci_commit_pins/torchbench.txt vendored


 @ -1 +1 @@
 dbebd44a11eb84afbf53c3c071dd105297e
 e522b45cd4535b9dfe067aa68d7315755df38f48

									
										6

.github/labeler.yml
									
										vendored
									
												
				@ -98,3 +98,9 @@

				"module: distributed_checkpoint":

				- torch/distributed/checkpoint/**

				- test/distributed/checkpoint/**

				"module: compiled autograd":

				- torch/csrc/dynamo/python_compiled_autograd.cpp

				- torch/csrc/dynamo/compiled_autograd.h

				- torch/_dynamo/compiled_autograd.py

				- torch/inductor/test_compiled_autograd.py

									
										369

.github/lf-canary-scale-config.yml
									
										vendored
									
											
				@ -1,369 +0,0 @@

				# This file is generated by .github/scripts/validate_scale_config.py in test-infra

				# It defines runner types that will be provisioned by by LF Self-hosted runners

				# scale-config.yml:

				#   Powers what instance types are available for GHA auto-scaled

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls

				#                     to avoid RequestLimitExceeded issues

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				# NOTE: Default values,

				#

				# runner_types:

				#   runner_label:

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.c.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 500

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.24xlarge.ephemeral:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.c.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.c.linux.arm64.2xlarge.ephemeral:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.c.linux.arm64.m7g.4xlarge.ephemeral:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.c.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.c.windows.g4dn.xlarge:

				    disk_size: 256

				    instance_type: g4dn.xlarge

				    is_ephemeral: true

				    max_available: 100

				    os: windows

				  lf.c.windows.g4dn.xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: g4dn.xlarge

				    is_ephemeral: false

				    max_available: 100

				    os: windows

				  lf.c.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.c.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.c.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows

									
										369

.github/lf-scale-config.yml
									
										vendored
									
											
				@ -1,369 +0,0 @@

				# This file is generated by .github/scripts/validate_scale_config.py in test-infra

				# It defines runner types that will be provisioned by by LF Self-hosted runners

				# scale-config.yml:

				#   Powers what instance types are available for GHA auto-scaled

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls

				#                     to avoid RequestLimitExceeded issues

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				# NOTE: Default values,

				#

				# runner_types:

				#   runner_label:

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 500

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.24xlarge.ephemeral:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.linux.arm64.2xlarge.ephemeral:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.linux.arm64.m7g.4xlarge.ephemeral:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				  lf.windows.g4dn.xlarge:

				    disk_size: 256

				    instance_type: g4dn.xlarge

				    is_ephemeral: true

				    max_available: 100

				    os: windows

				  lf.windows.g4dn.xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: g4dn.xlarge

				    is_ephemeral: false

				    max_available: 100

				    os: windows

				  lf.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows
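The header comment of this file lists default values for a runner entry (instance_type m4.large, os linux, max_available 20, disk_size 50, is_ephemeral true). A minimal sketch of how such defaults could be layered under an entry follows. The helper name `with_defaults` is hypothetical, and whether the autoscaler merges defaults this way is an assumption, not something the diff confirms.

```python
from typing import Any, Dict

# Defaults copied from the "NOTE: Default values" comment in the file header.
DEFAULTS: Dict[str, Any] = {
    "instance_type": "m4.large",
    "os": "linux",
    "max_available": 20,
    "disk_size": 50,
    "is_ephemeral": True,
}


def with_defaults(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `entry` with missing keys filled from DEFAULTS."""
    merged = dict(DEFAULTS)
    merged.update(entry)  # explicit values in the entry win over defaults
    return merged


# Entry transcribed from the lf.windows.g5.4xlarge.nvidia.gpu block above.
windows_g5 = {
    "disk_size": 256,
    "instance_type": "g5.4xlarge",
    "is_ephemeral": False,
    "max_available": 250,
    "os": "windows",
}

# A made-up partial entry, used only to show the defaults being applied.
partial = {"instance_type": "c5.large"}

resolved_full = with_defaults(windows_g5)
resolved_partial = with_defaults(partial)
```

Every entry in the file sets all five keys explicitly, so for them the merge is a no-op; only the made-up partial entry picks up default values.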

									
										13

.github/merge_rules.yaml
									
										vendored
									
												
				@ -86,6 +86,18 @@

				  - pull

				  - inductor

				- name: OSS CI / pytorchbot / slow tests

				  patterns:

				  - test/slow_tests.json

				  approved_by:

				  - pytorchbot

				  ignore_flaky_failures: false

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				  - slow

				- name: OSS CI /pytorchbot / Executorch

				  patterns:

				  - .ci/docker/ci_commit_pins/executorch.txt

				@ -532,6 +544,7 @@

				  - anijain2305

				  - bdhirsh

				  - zou3519

				  - isuruf

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

									
										3

.github/pytorch-probot.yml
									
										vendored
									
												
				@ -6,6 +6,7 @@ ciflow_push_tags:

				- ciflow/binaries_libtorch

				- ciflow/binaries_wheel

				- ciflow/inductor

				- ciflow/inductor-periodic

				- ciflow/inductor-rocm

				- ciflow/inductor-perf-compare

				- ciflow/inductor-micro-benchmark

				@ -16,11 +17,13 @@ ciflow_push_tags:

				- ciflow/nightly

				- ciflow/periodic

				- ciflow/rocm

				- ciflow/s390

				- ciflow/slow

				- ciflow/trunk

				- ciflow/unstable

				- ciflow/xpu

				- ciflow/torchbench

				- ciflow/autoformat

				retryable_workflows:

				- pull

				- trunk

2

.github/requirements-gha-cache.txt vendored


 @ -4,7 +4,7 @@
 #   docs/cpp/requirements.txt
 #   functorch/docs/requirements.txt
 #   .ci/docker/requirements-ci.txt
 boto3==1.19.12
 boto3==1.35.42
 jinja2==3.1.4
 lintrunner==0.10.7
 ninja==1.10.0.post1

									
										2

.github/requirements/README.md
									
										vendored
									
												
				@ -17,8 +17,6 @@ The list of support files are as follows:

				    conda environment

				  * conda-env-macOS-ARM64. This is used by MacOS (m1, arm64) build and

				    test jobs to setup the conda environment

				  * conda-env-macOS-X64. This is use by MacOS (x86-64) build and test

				    jobs to setup the conda environment

				  * conda-env-Linux-X64. This is used by Linux buck build and test jobs

				    to setup the conda environment

				* Pip:

4

.github/requirements/conda-env-Linux-X64.txt vendored


 @ -4,5 +4,5 @@ mkl-include=2022.1.0
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 setuptools=68.2.2
 typing-extensions=4.9.0
 setuptools=72.1.0
 typing-extensions=4.11.0

2

.github/requirements/conda-env-iOS.txt vendored


 @ -3,5 +3,5 @@ cmake=3.22.1
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 setuptools=68.2.2
 setuptools=72.1.0
 typing-extensions=4.11.0

4

.github/requirements/conda-env-macOS-ARM64 vendored


 @ -1,8 +1,8 @@
 numpy=1.22.3
 pyyaml=6.0
 setuptools=61.2.0
 setuptools=72.1.0
 cmake=3.22.*
 typing-extensions=4.9.0
 typing-extensions=4.11.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1

16

.github/requirements/conda-env-macOS-X64 vendored


 @ -1,16 +0,0 @@
 mkl=2021.2.0
 mkl-include=2021.2.0
 numpy=1.21.2
 pyyaml=5.3
 setuptools=46.0.0
 cmake=3.22.*
 typing-extensions=4.9.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1
 libuv=1.40.0
 pkg-config=0.29.2
 wheel=0.37.1
 # Not pinning certifi so that we can always get the latest certificates
 certifi

2

.github/requirements/pip-requirements-iOS.txt vendored


 @ -1,4 +1,4 @@
 # iOS simulator requirements
 coremltools==5.0b5
 protobuf==3.20.2
 optree==0.12.1
 optree==0.13.0

5

.github/requirements/pip-requirements-macOS.txt vendored


 @ -1,4 +1,4 @@
 boto3==1.19.12
 boto3==1.35.42
 hypothesis==6.56.4
 expecttest==0.2.1
 fbscribelogger==0.1.6
 @ -27,7 +27,8 @@ pytest-cpp==2.3.0
 rockset==1.0.3
 z3-solver==4.12.2.0
 tensorboard==2.13.0
 optree==0.12.1
 optree==0.13.0
 # NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
 # which the stringify metadata is wrong when escaping double quote
 protobuf==3.20.2
 parameterized==0.8.1

									
										67

.github/scripts/close_nonexistent_disable_issues.py
									
										vendored
									
												
				@ -3,26 +3,37 @@ import json

				import multiprocessing as mp

				import os

				import re

				import sys

				import tempfile

				from typing import Any, Dict, List, Optional, Tuple

				from pathlib import Path

				from typing import Any, Dict, List, Tuple

				import requests

				import rockset  # type: ignore[import]

				from gitutils import retries_decorator

				REPO_ROOT = Path(__file__).resolve().parent.parent.parent

				sys.path.insert(0, str(REPO_ROOT))

				from tools.testing.clickhouse import query_clickhouse

				sys.path.pop(0)

				LOGS_QUERY = """

				with

				    shas as (

				        SELECT

				            push.head_commit.id as sha,

				            distinct

				            push.head_commit.id as sha

				        FROM

				            commons.push

				            -- Not bothering with final here

				            default.push

				        WHERE

				            push.ref = 'refs/heads/viable/strict'

				            AND push.repository.full_name = 'pytorch/pytorch'

				            AND push.repository.'full_name' = 'pytorch/pytorch'

				        ORDER BY

				            push._event_time DESC

				            push.head_commit.'timestamp' desc

				        LIMIT

				            5

				    )

				@ -30,27 +41,29 @@ select

				    id,

				    name

				from

				    workflow_job j

				    default.workflow_job j final

				    join shas on shas.sha = j.head_sha

				where

				    j.name like '% / test%'

				    j.id in (select id from materialized_views.workflow_job_by_head_sha where head_sha in (select sha from shas))

				    and j.name like '% / test%'

				    and j.name not like '%rerun_disabled_tests%'

				    and j.name not like '%mem_leak_check%'

				"""

				TEST_EXISTS_QUERY = """

				select

				    count(*) as c

				    name

				from

				    test_run_s3

				    default.test_run_s3

				where

				    cast(name as string) like :name

				    and classname like :classname

				    and _event_time > CURRENT_TIMESTAMP() - DAYS(7)

				    name::String like {name: String}

				    and classname like {classname: String}

				    and time_inserted > CURRENT_TIMESTAMP() - INTERVAL 7 DAY

				limit 1

				"""

				CLOSING_COMMENT = (

				    "I cannot find any mention of this test in rockset for the past 7 days "

				    "I cannot find any mention of this test in the database for the past 7 days "

				    "or in the logs for the past 5 commits on viable/strict.  Closing this "

				    "issue as it is highly likely that this test has either been renamed or "

				    "removed.  If you think this is a false positive, please feel free to "

				@ -62,6 +75,11 @@ DISABLED_TESTS_JSON = (

				)

				@retries_decorator()

				def query_db(query: str, params: Dict[str, Any]) -> List[Dict[str, Any]]:

				    return query_clickhouse(query, params)

				def parse_args() -> Any:

				    parser = argparse.ArgumentParser()

				    parser.add_argument(

				@ -72,17 +90,6 @@ def parse_args() -> Any:

				    return parser.parse_args()

				@retries_decorator()

				def query_rockset(

				    query: str, params: Optional[Dict[str, Any]] = None

				) -> List[Dict[str, Any]]:

				    res = rockset.RocksetClient(

				        host="api.rs2.usw2.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]

				    ).sql(query, params)

				    results: List[Dict[str, Any]] = res.results

				    return results

				def download_log_worker(temp_dir: str, id: int, name: str) -> None:

				    url = f"https://ossci-raw-job-status.s3.amazonaws.com/log/{id}"

				    data = requests.get(url).text

				@ -137,13 +144,13 @@ def check_if_exists(

				    if present:

				        return True, "found in logs"

				    # Query rockset to see if the test is there

				    count = query_rockset(

				    # Query DB to see if the test is there

				    count = query_db(

				        TEST_EXISTS_QUERY, {"name": f"{name}%", "classname": f"{classname}%"}

				    )

				    if count[0]["c"] == 0:

				    if len(count) == 0:

				        return False, "not found"

				    return True, "found in rockset"

				    return True, "found in DB"

				if __name__ == "__main__":

				@ -151,7 +158,7 @@ if __name__ == "__main__":

				    disabled_tests_json = json.loads(requests.get(DISABLED_TESTS_JSON).text)

				    all_logs = []

				    jobs = query_rockset(LOGS_QUERY)

				    jobs = query_db(LOGS_QUERY, {})

				    with tempfile.TemporaryDirectory() as temp_dir:

				        pool = mp.Pool(20)

				        for job in jobs:
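The existence check in this diff changes shape: one version of the query returns `count(*) as c` and tests `count[0]["c"] == 0`, while the other selects `name` with `limit 1` and tests `len(count) == 0`. Because the +/- markers are missing here, which version is the new one is an assumption; the sketch below takes the row-based form as the new one. The names `exists_in_db` and `fake_query` are hypothetical stand-ins, and no real database is contacted.

```python
from typing import Any, Callable, Dict, List, Tuple

QueryFn = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def exists_in_db(query_fn: QueryFn, name: str, classname: str) -> Tuple[bool, str]:
    """Row-based check: a `limit 1` query returns no rows when the test is absent."""
    rows = query_fn(
        "<TEST_EXISTS_QUERY>",  # placeholder for the query text shown in the diff
        {"name": f"{name}%", "classname": f"{classname}%"},
    )
    if len(rows) == 0:
        return False, "not found"
    return True, "found in DB"


def fake_query(query: str, params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Stand-in for the database: emulates the two prefix LIKE filters and `limit 1`."""
    known = [{"name": "test_add", "classname": "TestMath"}]
    name_prefix = params["name"].rstrip("%")
    class_prefix = params["classname"].rstrip("%")
    matches = [
        {"name": row["name"]}
        for row in known
        if row["name"].startswith(name_prefix)
        and row["classname"].startswith(class_prefix)
    ]
    return matches[:1]


present = exists_in_db(fake_query, "test_add", "TestMath")
absent = exists_in_db(fake_query, "test_gone", "TestMath")
```

A `count(*)` query returns exactly one row whether or not anything matches, so indexing `[0]` is safe there. Once the query selects rows with `limit 1`, an absent test produces an empty list, which is why the check has to move to the length of the result.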

									
										22

.github/scripts/generate_binary_build_matrix.py
									
										vendored
									
												
				@ -77,6 +77,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"

				@ -333,7 +334,7 @@ def generate_wheels_matrix(

				        package_type = "manywheel"

				    if python_versions is None:

				        python_versions = FULL_PYTHON_VERSIONS + ["3.13"]

				        python_versions = FULL_PYTHON_VERSIONS + ["3.13", "3.13t"]

				    if arches is None:

				        # Define default compute archivectures

				@ -368,8 +369,15 @@ def generate_wheels_matrix(

				            # TODO: Enable python 3.13 on rocm, aarch64, windows

				            if (

				                gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")

				            ) and python_version == "3.13":

				                gpu_arch_type == "rocm"

				                or os not in ["linux", "linux-s390x", "macos-arm64"]

				            ) and python_version in ["3.13", "3.13t"]:

				                continue

				            # TODO: Enable python 3.13t on xpu and cpu-s390x or MacOS

				            if (

				                gpu_arch_type in ["xpu", "cpu-s390x"] or os == "macos-arm64"

				            ) and python_version == "3.13t":

				                continue

				            if use_split_build and (

				@ -403,7 +411,7 @@ def generate_wheels_matrix(

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]

				                            if os != "linux-aarch64"

				                            else ""

				                        ),

				@ -412,8 +420,8 @@ def generate_wheels_matrix(

				                        ),

				                    }

				                )

				                # Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA

				                if python_version == "3.10" and arch_version == "12.1":

				                # Special build building to use on Colab. Python 3.11 for 12.1 CUDA

				                if python_version == "3.11" and arch_version == "12.1":

				                    ret.append(

				                        {

				                            "python_version": python_version,

				@ -451,7 +459,7 @@ def generate_wheels_matrix(

				                            ".", "_"

				                        ),

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.4"]

				                            if os != "linux" and gpu_arch_type != "xpu"

				                            else ""

				                        ),
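As a reading aid for the hunk above, the free-threaded (3.13t) skip condition can be sketched as a standalone predicate. The function name `skips_free_threaded_build` is hypothetical and does not appear in generate_binary_build_matrix.py; only the condition itself is copied from the diff, and the rest of `generate_wheels_matrix` is not reproduced.

```python
def skips_free_threaded_build(gpu_arch_type: str, os: str, python_version: str) -> bool:
    """Hypothetical helper mirroring the 3.13t skip condition shown in the diff."""
    # Targets where the diff's TODO says python 3.13t is not enabled yet.
    unsupported_target = gpu_arch_type in ["xpu", "cpu-s390x"] or os == "macos-arm64"
    # Only the free-threaded build is skipped; regular 3.13 is governed by a separate check.
    return unsupported_target and python_version == "3.13t"


print(skips_free_threaded_build("xpu", "linux", "3.13t"))   # True
print(skips_free_threaded_build("cuda", "linux", "3.13t"))  # False
```

Factoring the check into a predicate like this is only one possible way to make the matrix filter testable in isolation; the diff itself keeps the condition inline.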

									
										68

.github/scripts/generate_ci_workflows.py
									
												
				@ -70,17 +70,15 @@ class BinaryBuildWorkflow:

				            )

				        else:

				            self.build_environment = f"{self.os}-binary-{self.package_type}"

				        if self.use_split_build:

				            # added to distinguish concurrency groups

				            self.build_environment += "-split"

				    def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:

				        output_file_path = (

				            GITHUB_DIR

				            / f"workflows/generated-{self.build_environment}-{self.branches}.yml"

				        )

				        if self.use_split_build:

				            output_file_path = (

				                GITHUB_DIR

				                / f"workflows/generated-{self.build_environment}-{self.branches}-split.yml"

				            )

				        with open(output_file_path, "w") as output_file:

				            GENERATED = "generated"  # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file

				            output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"])

				@ -116,20 +114,21 @@ LINUX_BINARY_BUILD_WORFKLOWS = [

				            isolated_workflow=True,

				        ),

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            use_split_build=True,

				            arches=["11.8", "12.1", "12.4", "cpu"],

				        ),

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				            isolated_workflow=True,

				        ),

				        use_split_build=True,

				    ),

				    # See https://github.com/pytorch/pytorch/issues/138750

				    #   BinaryBuildWorkflow(

				    #     os=OperatingSystem.LINUX,

				    #     package_type="manywheel",

				    #     build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				    #         OperatingSystem.LINUX,

				    #         use_split_build=True,

				    #         arches=["11.8", "12.1", "12.4", "cpu"],

				    #     ),

				    #     ciflow_config=CIFlowConfig(

				    #         labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				    #         isolated_workflow=True,

				    #     ),

				    #     use_split_build=True,

				    # ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="conda",

				@ -182,21 +181,22 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [

				        ),

				        branches="main",

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["11.8", "12.1", "12.4"],

				            python_versions=["3.9"],

				            use_split_build=True,

				        ),

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_PERIODIC},

				        ),

				        branches="main",

				        use_split_build=True,

				    ),

				    # See https://github.com/pytorch/pytorch/issues/138750

				    # BinaryBuildWorkflow(

				    #     os=OperatingSystem.LINUX,

				    #     package_type="manywheel",

				    #     build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				    #         OperatingSystem.LINUX,

				    #         arches=["11.8", "12.1", "12.4"],

				    #         python_versions=["3.9"],

				    #         use_split_build=True,

				    #     ),

				    #     ciflow_config=CIFlowConfig(

				    #         labels={LABEL_CIFLOW_PERIODIC},

				    #     ),

				    #     branches="main",

				    #     use_split_build=True,

				    # ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="libtorch",

									
										8

.github/scripts/github_utils.py
									
												
				@ -168,6 +168,14 @@ def gh_post_commit_comment(

				    )

				def gh_close_pr(org: str, repo: str, pr_num: int, dry_run: bool = False) -> None:

				    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"

				    if dry_run:

				        print(f"Dry run closing PR {pr_num}")

				    else:

				        gh_fetch_url(url, method="PATCH", data={"state": "closed"})

				def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:

				    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"

				    gh_fetch_url(url, method="DELETE")
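The new `gh_close_pr` helper sends a PATCH with `{"state": "closed"}` to the pull request endpoint. A minimal sketch of the request it amounts to is below, built with the standard library only and never sent. It does not use the repository's `gh_fetch_url`; `build_close_pr_request` is a hypothetical name, and the value of `GITHUB_API_URL` is an assumption, since the diff does not show where that constant is defined.

```python
import json
import urllib.request

# Assumption: the module-level constant points at the public GitHub REST API.
GITHUB_API_URL = "https://api.github.com"


def build_close_pr_request(org: str, repo: str, pr_num: int) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the request that closes a PR."""
    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
    body = json.dumps({"state": "closed"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Accept": "application/vnd.github+json"},
    )


req = build_close_pr_request("pytorch", "pytorch", 12345)  # 12345 is a placeholder PR number
print(req.get_method(), req.full_url)
```

Building the request separately from sending it is one way to get the same effect as the `dry_run` flag in the diff: the caller can inspect or log the request without touching the API.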

									
										10

.github/scripts/lintrunner.sh
									
												
				@ -17,6 +17,11 @@ if [[ -d "${CACHE_DIRECTORY}" ]]; then

				    cp -r "${CACHE_DIRECTORY}" . || true

				fi

				# if lintrunner is not installed, install it

				if ! command -v lintrunner &> /dev/null; then

				    python3 -m pip install lintrunner==0.12.5

				fi

				# This has already been cached in the docker image

				lintrunner init 2> /dev/null

				@ -33,10 +38,11 @@ python3 torch/utils/data/datapipes/gen_pyi.py

				RC=0

				# Run lintrunner on all files

				if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then

				if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then

				    echo ""

				    echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"

				    echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"

				    echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions. To apply suggested patches automatically, use the -a flag. Before pushing another commit,\e[0m"

				    echo -e "\e[1m\e[36mplease verify locally and ensure everything passes.\e[0m"

				    RC=1

				fi

									
										433

.github/scripts/runner_determinator.py
									
												
				@ -1,51 +1,107 @@

				# flake8: noqa: G004

				# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml

				#       must be kept in sync. You can do it easily by running the following command:

				#           python .github/scripts/update_runner_determinator.py

				"""

				This runner determinator is used to determine which set of runners to run a

				GitHub job on. It uses the first comment of a GitHub issue (by default

				https://github.com/pytorch/test-infra/issues/5132) as a user list to determine

				which users will get their jobs to run on experimental runners. This user list

				is also a comma separated list of additional features or experiments which the

				user could be opted in to.

				https://github.com/pytorch/test-infra/issues/5132) to define the configuration

				of which runners should be used to run which job.

				The configuration has two parts, the settings and a list of opted-in users,

				separated by a line containing "---".  If the line is not present, the

				settings are considered to be empty with only the second part, the user

				list, defined.

				The first part is a YAML block that defines the rollout settings. This can be

				used to define any settings that are needed to determine which runners to use.

				Its fields are defined by the RolloutSettings class below.

				The second part is a list of users who are explicitly opted in to the LF fleet.

				The user list is also a comma separated list of additional features or

				experiments which the user could be opted in to.

				The user list has the following rules:

				- Users are GitHub usernames with the @ prefix

				- If the first line is a "*" then all users will use the new runners

				- If the first line is a "!" then all users will use the old runners

				- Users are GitHub usernames, which must start with the @ prefix

				- Each user is also a comma-separated list of features/experiments to enable

				- A "#" prefix indicates the user is opted out of the new runners but is opting

				  into features/experiments.

				- A "#" prefix opts the user out of all experiments

				Example user list:

				Example config:

				    # A list of experiments that can be opted into.

				    # This defines the behavior they'll induce when opted into.

				    # Expected syntax is:

				    #   [experiment_name]: # Name of the experiment. Also used for the label prefix.

				    #      rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.

				    @User1

				    @User2,amz2023

				    #@UserOptOutOfNewRunner,amz2023

				    experiments:

				      lf:

				        rollout_perc: 25

				        all_branches: false

				        default: true

				    ---

				    # Opt-ins:

				    # Users can opt into the LF fleet by adding their GitHub username to this list

				    # and specifying experiments to enable in a comma-separated list.

				    # Experiments should be from the above list.

				    @User1,lf,split_build

				    @User2,lf

				    @User3,split_build

				"""

				import logging

				import os

				import random

				from argparse import ArgumentParser

				from logging import LogRecord

				from typing import Any, Iterable

				from typing import Any, Dict, FrozenSet, Iterable, List, NamedTuple, Tuple

				import yaml

				from github import Auth, Github

				from github.Issue import Issue

				WORKFLOW_LABEL_META = ""  # use meta runners

				DEFAULT_LABEL_PREFIX = ""  # use meta runners

				WORKFLOW_LABEL_LF = "lf."  # use runners from the linux foundation

				WORKFLOW_LABEL_LF_CANARY = "lf.c."  # use canary runners from the linux foundation

				RUNNER_AMI_LEGACY = ""

				RUNNER_AMI_AMZ2023 = "amz2023"

				GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")

				GH_OUTPUT_KEY_AMI = "runner-ami"

				GH_OUTPUT_KEY_LABEL_TYPE = "label-type"

				SETTING_EXPERIMENTS = "experiments"

				LF_FLEET_EXPERIMENT = "lf"

				CANARY_FLEET_SUFFIX = ".c"

				class Experiment(NamedTuple):

				    rollout_perc: float = (

				        0  # Percentage of workflows to experiment on when user is not opted-in.

				    )

				    all_branches: bool = (

				        False  # If True, the experiment is also enabled on the exception branches

				    )

				    default: bool = (

				        True  # If True, the experiment is enabled by default for all queries

				    )

				    # Add more fields as needed

				class Settings(NamedTuple):

				    """

				    Settings for the experiments that can be opted into.

				    """

				    experiments: Dict[str, Experiment] = {}

				class ColorFormatter(logging.Formatter):

				    """Color codes the log messages based on the log level"""

				@ -88,6 +144,12 @@ def set_github_output(key: str, value: str) -> None:

				        f.write(f"{key}={value}\n")

				def _str_comma_separated_to_set(value: str) -> FrozenSet[str]:

				    return frozenset(

				        filter(lambda itm: itm != "", map(str.strip, value.strip(" \n\t").split(",")))

				    )

				def parse_args() -> Any:

				    parser = ArgumentParser("Get dynamic rollout settings")

				    parser.add_argument("--github-token", type=str, required=True, help="GitHub token")

				@ -122,6 +184,13 @@ def parse_args() -> Any:

				        required=True,

				        help="Current GitHub ref type, branch or tag",

				    )

				    parser.add_argument(

				        "--eligible-experiments",

				        type=_str_comma_separated_to_set,

				        required=False,

				        default="",

				        help="comma separated list of experiments to check, if omitted all experiments marked with default=True are checked",

				    )

				    return parser.parse_args()

				@ -167,90 +236,208 @@ def get_potential_pr_author(

				def is_exception_branch(branch: str) -> bool:

				    """

				    Branches that get opted out of all experiments and should always use Meta runners

				    Branches that get opted out of experiments by default, until they're explicitly enabled.

				    """

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				def get_fleet(rollout_state: str, workflow_requestors: Iterable[str]) -> str:

				    """

				    Determines if the job should run on the LF fleet or the Meta fleet

				    Returns:

				        The appropriate label prefix for the runner, corresponding to the fleet to use.

				        This gets prefixed to the very start of the runner label.

				    """

				def load_yaml(yaml_text: str) -> Any:

				    try:

				        if rollout_state[0] == "!":

				            log.info("LF Workflows are disabled for everyone. Using meta runners.")

				            return WORKFLOW_LABEL_META

				        elif rollout_state[0] == "*":

				            log.info("LF Workflows are enabled for everyone. Using LF runners.")

				            return WORKFLOW_LABEL_LF

				        else:

				            all_opted_in_users = {

				                usr_raw.strip("\n\t@ ").split(",")[0]

				                for usr_raw in rollout_state.split()

				            }

				            opted_in_requestors = {

				                usr for usr in workflow_requestors if usr in all_opted_in_users

				            }

				            if opted_in_requestors:

				                log.info(

				                    f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."

				                )

				                return WORKFLOW_LABEL_LF

				            else:

				                log.info(

				                    f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."

				                )

				                return WORKFLOW_LABEL_META

				    except Exception as e:

				        log.error(

				            f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"

				        )

				        return WORKFLOW_LABEL_META

				        data = yaml.safe_load(yaml_text)

				        return data

				    except yaml.YAMLError as exc:

				        log.exception("Error loading YAML")

				        raise

				def get_optin_feature(

				    rollout_state: str, workflow_requestors: Iterable[str], feature: str, fallback: str

				def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:

				    """

				    Extracts the text with settings, if any, and the opted in users from the rollout state.

				    If the issue body contains "---" then the text above that is the settings

				    and the text below is the list of opted in users.

				    If it doesn't contain "---" then the settings are empty and the rest is the users.

				    """

				    rollout_state_parts = rollout_state.split("---")

				    if len(rollout_state_parts) >= 2:

				        return rollout_state_parts[0], rollout_state_parts[1]

				    else:

				        return "", rollout_state

				class UserOptins(Dict[str, List[str]]):

				    """

				    Dictionary of users with a list of features they have opted into

				    """

				def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:

				    """

				    Parse the user opt-in text into a key value pair of username and the list of features they have opted into

				    Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.

				        - Example line: "@User1,lf,split_build"

				        - A "#" prefix indicates the user is opted out of all experiments

				    """

				    optins = UserOptins()

				    for user in user_optin_text.split("\n"):

				        user = user.strip("\r\n\t -")

				        if not user or not user.startswith("@"):

				            # Not a valid user. Skip

				            continue

				        if user:

				            usr_name = user.split(",")[0].strip("@")

				            optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]

				    return optins

				def parse_settings_from_text(settings_text: str) -> Settings:

				    """

				    Parse the experiments from the issue body into a list of ExperimentSettings

				    """

				    try:

				        if settings_text:

				            # Escape the backtick as well so that we can have the settings in a code block on the GH issue

				            # for easy reading

				            # Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on

				            #       the backtick character in shell commands.

				            backtick = chr(96)  # backtick character

				            settings_text = settings_text.strip(f"\r\n\t{backtick} ")

				            settings = load_yaml(settings_text)

				            # For now we just load experiments. We can expand this if/when we add more settings

				            experiments = {}

				            for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():

				                valid_settings = {}

				                for setting in exp_settings:

				                    if setting not in Experiment._fields:

				                        log.warning(

				                            f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"

				                        )

				                    else:

				                        valid_settings[setting] = exp_settings[setting]

				                experiments[exp_name] = Experiment(**valid_settings)

				            return Settings(experiments)

				    except Exception:

				        log.exception("Failed to parse settings")

				    return Settings()

				def parse_settings(rollout_state: str) -> Settings:

				    """

				    Parse settings, if any, from the rollout state.

				    If the issue body contains "---" then the text above that is the settings

				    and the text below is the list of opted in users.

				    If it doesn't contain "---" then the settings are empty and the default values are used.

				    """

				    settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)

				    return parse_settings_from_text(settings_text)

				def parse_users(rollout_state: str) -> UserOptins:

				    """

				    Parse users from the rollout state.

				    """

				    _, users_text = extract_settings_user_opt_in_from_text(rollout_state)

				    return parse_user_opt_in_from_text(users_text)

				def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:

				    """

				    Check if a user is opted into an experiment

				    """

				    return experiment_name in user_optins.get(user, [])

				def get_runner_prefix(

				    rollout_state: str,

				    workflow_requestors: Iterable[str],

				    branch: str,

				    eligible_experiments: FrozenSet[str] = frozenset(),

				    is_canary: bool = False,

				) -> str:

				    """

				    Used to dynamically opt in jobs to specific runner-type variants.

				    settings = parse_settings(rollout_state)

				    user_optins = parse_users(rollout_state)

				    Returns:

				        The runner-type's variant name if the user has opted in to the feature, otherwise returns an empty string.

				        This variant name is prefixed to the runner-type in the label.

				    """

				    try:

				        userlist = {u.lstrip("#").strip("\n\t@ ") for u in rollout_state.split()}

				        all_opted_in_users = set()

				        for user in userlist:

				            for i in user.split(","):

				                if i == feature:

				                    all_opted_in_users.add(user.split(",")[0])

				        opted_in_requestors = {

				            usr for usr in workflow_requestors if usr in all_opted_in_users

				        }

				        if opted_in_requestors:

				    fleet_prefix = ""

				    prefixes = []

				    for experiment_name, experiment_settings in settings.experiments.items():

				        if not experiment_settings.all_branches and is_exception_branch(branch):

				            log.info(

				                f"Feature {feature} is enabled for {', '.join(opted_in_requestors)}. Using feature {feature}."

				                f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."

				            )

				            return feature

				        else:

				            log.info(

				                f"Feature {feature} is disabled for {', '.join(workflow_requestors)}. Using fallback \"{fallback}\"."

				            )

				            return fallback

				            continue

				    except Exception as e:

				        if eligible_experiments:

				            if experiment_name not in eligible_experiments:

				                exp_list = ", ".join(eligible_experiments)

				                log.info(

				                    f"Skipping experiment '{experiment_name}', as it is not in the eligible_experiments list: {exp_list}"

				                )

				                continue

				        elif not experiment_settings.default:

				            log.info(

				                f"Skipping experiment '{experiment_name}', as it is not a default experiment"

				            )

				            continue

				        # Is any workflow_requestor opted in to this experiment?

				        opted_in_users = [

				            requestor

				            for requestor in workflow_requestors

				            if is_user_opted_in(requestor, user_optins, experiment_name)

				        ]

				        enabled = False

				        if opted_in_users:

				            log.info(

				                f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."

				            )

				            enabled = True

				        elif experiment_settings.rollout_perc:

				            # If no user is opted in, then we randomly enable the experiment based on the rollout percentage

				            if random.uniform(0, 100) <= experiment_settings.rollout_perc:

				                log.info(

				                    f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."

				                )

				                enabled = True

				        if enabled:

				            label = experiment_name

				            if experiment_name == LF_FLEET_EXPERIMENT:

				                # We give some special treatment to the "lf" experiment since it determines the fleet we use

				                #  - If it's enabled, then we always list its prefix first

				                #  - If we're in the canary branch, then we append ".c" to the lf prefix

				                if is_canary:

				                    label += CANARY_FLEET_SUFFIX

				                fleet_prefix = label

				            else:

				                prefixes.append(label)

				    if len(prefixes) > 1:

				        log.error(

				            f'Failed to determine if user has opted-in to feature {feature}. Using fallback "{fallback}". Exception: {e}'

				            f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"

				        )

				        return fallback

				        prefixes = prefixes[:1]

				    # Fleet always comes first

				    if fleet_prefix:

				        prefixes.insert(0, fleet_prefix)

				    return ".".join(prefixes) + "." if prefixes else ""

				def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:

				@ -267,53 +454,37 @@ def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -

				def main() -> None:

				    args = parse_args()

				    if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):

				        log.info(f"Exception branch: '{args.github_branch}', using meta runners")

				        label_type = WORKFLOW_LABEL_META

				        runner_ami = RUNNER_AMI_LEGACY

				    else:

				        try:

				            rollout_state = get_rollout_state_from_issue(

				                args.github_token, args.github_issue_repo, args.github_issue

				            )

				    runner_label_prefix = DEFAULT_LABEL_PREFIX

				            username = get_potential_pr_author(

				                args.github_token,

				                args.github_repo,

				                args.github_actor,

				                args.github_ref_type,

				                args.github_branch,

				            )

				    try:

				        rollout_state = get_rollout_state_from_issue(

				            args.github_token, args.github_issue_repo, args.github_issue

				        )

				            label_type = get_fleet(

				                rollout_state,

				                (

				                    args.github_issue_owner,

				                    username,

				                ),

				            )

				            runner_ami = get_optin_feature(

				                rollout_state=rollout_state,

				                workflow_requestors=(

				                    args.github_issue_owner,

				                    username,

				                ),

				                feature=RUNNER_AMI_AMZ2023,

				                fallback=RUNNER_AMI_LEGACY,

				            )

				        except Exception as e:

				            log.error(

				                f"Failed to get issue. Falling back to meta runners. Exception: {e}"

				            )

				            label_type = WORKFLOW_LABEL_META

				            runner_ami = RUNNER_AMI_LEGACY

				        username = get_potential_pr_author(

				            args.github_token,

				            args.github_repo,

				            args.github_actor,

				            args.github_ref_type,

				            args.github_branch,

				        )

				    # For Canary builds use canary runners

				    if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:

				        label_type = WORKFLOW_LABEL_LF_CANARY

				        is_canary = args.github_repo == "pytorch/pytorch-canary"

				    set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)

				    set_github_output(GH_OUTPUT_KEY_AMI, runner_ami)

				        runner_label_prefix = get_runner_prefix(

				            rollout_state,

				            (args.github_issue_owner, username),

				            args.github_branch,

				            args.eligible_experiments,

				            is_canary,

				        )

				    except Exception as e:

				        log.error(

				            f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"

				        )

				    set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)

				if __name__ == "__main__":
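The reworked runner determinator splits the issue body on "---", parses the opt-in list, and then joins the enabled experiments into a label prefix with the fleet first. The sketch below condenses those three steps. It is a simplification, not the script itself: the YAML settings, rollout percentages, exception branches and eligible-experiment filtering are left out, and `split_rollout_state`, `parse_opt_ins` and `join_prefixes` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def split_rollout_state(text: str) -> Tuple[str, str]:
    """Return (settings_text, users_text); settings are empty when '---' is absent."""
    parts = text.split("---")
    if len(parts) >= 2:
        return parts[0], parts[1]
    return "", text


def parse_opt_ins(users_text: str) -> Dict[str, List[str]]:
    """Map each '@user,exp1,exp2' line to its experiment list; other lines are skipped."""
    optins: Dict[str, List[str]] = {}
    for line in users_text.split("\n"):
        line = line.strip("\r\n\t -")
        if not line.startswith("@"):
            continue  # comments, headings, and '#'-prefixed (opted-out) users
        name, *experiments = line.split(",")
        optins[name.strip("@")] = [exp.strip(" ") for exp in experiments]
    return optins


def join_prefixes(enabled: List[str], is_canary: bool = False) -> str:
    """Simplified prefix assembly: fleet ('lf') first, then at most one other experiment."""
    fleet = ""
    others: List[str] = []
    for name in enabled:
        if name == "lf":
            fleet = name + (".c" if is_canary else "")
        else:
            others.append(name)
    others = others[:1]  # the diff keeps only the first non-fleet experiment
    if fleet:
        others.insert(0, fleet)
    return ".".join(others) + "." if others else ""


state = "experiments:\n  lf:\n    rollout_perc: 25\n---\n@User1,lf,split_build\n#@User3,lf\n"
_, users = split_rollout_state(state)
print(parse_opt_ins(users))                   # {'User1': ['lf', 'split_build']}
print(join_prefixes(["split_build", "lf"]))   # lf.split_build.
```

Because `join_prefixes` takes the already-enabled experiment names as input, it shows only the ordering and truncation rules; deciding which experiments are enabled is the part that depends on the settings and the random rollout in the real script.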

									
										35

.github/scripts/sync_distributed_folder_prototype.sh
									
											
				@ -1,35 +0,0 @@

				#!/bin/bash

				set -eoux pipefail

				SYNC_BRANCH=pytorch-stable-prototype

				git config user.email "fake@example.com"

				git config user.name  "PyTorch Stable Bot"

				git fetch origin main

				git fetch origin "$SYNC_BRANCH"

				git checkout "$SYNC_BRANCH"

				# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.

				# This specific SHA was chosen as it was before the "branch point" of the stable branch

				for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)

				do

				    # `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise

				    if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]

				    then

				        echo "Skipping $SHA"

				        continue

				    fi

				    echo "Copying $SHA"

				    git cherry-pick -x "$SHA" -X theirs

				    git reset --soft HEAD~1

				    git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed

				    git checkout .

				    git commit --reuse-message=HEAD@{1}

				    git clean -f

				done

				if [[ "${WITH_PUSH}" == true ]]; then

				  git push

				fi

									
										2

.github/scripts/tag_docker_images_for_release.py
									
												
				@ -51,6 +51,8 @@ def main() -> None:

				    for platform_image in platform_images:  # type: ignore[attr-defined]

				        for arch in platform_image.keys():  # type: ignore[attr-defined]

				            if arch == "cpu-s390x":

				                continue

				            tag_image(

				                platform_image[arch],  # type: ignore[index]

				                default_tag,

									
										440

.github/scripts/test_runner_determinator.py
									
												
				@ -0,0 +1,440 @@

				from unittest import main, TestCase

				from unittest.mock import Mock, patch

				import runner_determinator as rd

				USER_BRANCH = "somebranch"

				EXCEPTION_BRANCH = "main"

				class TestRunnerDeterminatorIssueParser(TestCase):

				    def test_parse_settings(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 0

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        settings = rd.parse_settings(settings_text)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=25),

				            settings.experiments["lf"],

				            "lf settings not parsed correctly",

				        )

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=0, default=False),

				            settings.experiments["otherExp"],

				            "otherExp settings not parsed correctly",

				        )

				    def test_parse_settings_in_code_block(self) -> None:

				        settings_text = """

				        ```

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 0

				                default: false

				        ```

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        settings = rd.parse_settings(settings_text)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=25),

				            settings.experiments["lf"],

				            "lf settings not parsed correctly",

				        )

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=0, default=False),

				            settings.experiments["otherExp"],

				            "otherExp settings not parsed correctly",

				        )

				    def test_parse_all_branches_setting(self) -> None:

				        settings_text = """

				        ```

				        experiments:

				            lf:

				                rollout_perc: 25

				                all_branches: true

				            otherExp:

				                all_branches: True

				                rollout_perc: 0

				        ```

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        settings = rd.parse_settings(settings_text)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=25, all_branches=True),

				            settings.experiments["lf"],

				            "lf settings not parsed correctly",

				        )

				        self.assertTrue(settings.experiments["otherExp"].all_branches)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=0, all_branches=True),

				            settings.experiments["otherExp"],

				            "otherExp settings not parsed correctly",

				        )

				    def test_parse_users(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        users = rd.parse_users(settings_text)

				        self.assertDictEqual(

				            {"User1": ["lf"], "User2": ["lf", "otherExp"]},

				            users,

				            "Users not parsed correctly",

				        )

				    def test_parse_users_without_settings(self) -> None:

				        settings_text = """

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        users = rd.parse_users(settings_text)

				        self.assertDictEqual(

				            {"User1": ["lf"], "User2": ["lf", "otherExp"]},

				            users,

				            "Users not parsed correctly",

				        )

				class TestRunnerDeterminatorGetRunnerPrefix(TestCase):

				    def test_opted_in_user(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)

				        self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")

				    def test_opted_in_user_two_experiments(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")

				    def test_opted_in_user_two_experiments_default(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)

				        self.assertEqual("lf.", prefix, "Runner prefix not correct for User2")

				    def test_opted_in_user_two_experiments_default_exp(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(

				            settings_text, ["User2"], USER_BRANCH, frozenset(["lf", "otherExp"])

				        )

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")

				    def test_opted_in_user_two_experiments_default_exp_2(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(

				            settings_text, ["User2"], USER_BRANCH, frozenset(["otherExp"])

				        )

				        self.assertEqual("otherExp.", prefix, "Runner prefix not correct for User2")

				    @patch("random.uniform", return_value=50)

				    def test_opted_out_user(self, mock_uniform: Mock) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    @patch("random.uniform", return_value=10)

				    def test_opted_out_user_was_pulled_in_by_rollout(self, mock_uniform: Mock) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        # User3 is opted out, but is pulled into both experiments because the mocked roll (10) is within each 25% rollout

				        prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				    @patch("random.uniform", return_value=10)

				    def test_opted_out_user_was_pulled_in_by_rollout_excl_nondefault(

				        self, mock_uniform: Mock

				    ) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        # User3 is opted out, but is pulled into the default experiment (lf) because the mocked roll (10) is within its 25% rollout

				        prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)

				        self.assertEqual("lf.", prefix, "Runner prefix not correct for user")

				    @patch("random.uniform", return_value=10)

				    def test_opted_out_user_was_pulled_in_by_rollout_filter_exp(

				        self, mock_uniform: Mock

				    ) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        # User3 is opted out, but is pulled into the explicitly requested experiment (otherExp) because the mocked roll (10) is within its 25% rollout

				        prefix = rd.get_runner_prefix(

				            settings_text, ["User3"], USER_BRANCH, frozenset(["otherExp"])

				        )

				        self.assertEqual("otherExp.", prefix, "Runner prefix not correct for user")

				    @patch("random.uniform", return_value=25)

				    def test_opted_out_user_was_pulled_out_by_rollout_filter_exp(

				        self, mock_uniform: Mock

				    ) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 10

				            otherExp:

				                rollout_perc: 50

				                default: false

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        # User3 is opted out and the mocked roll (25) exceeds lf's 10% rollout; otherExp is non-default and was not requested, so no prefix is produced

				        prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    def test_lf_prefix_always_comes_first(self) -> None:

				        settings_text = """

				        experiments:

				            otherExp:

				                rollout_perc: 0

				            lf:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,otherExp,lf

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				    def test_ignores_commented_users(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        #@User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    def test_ignores_extra_experiments(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				            foo:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf,otherExp,foo

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				    def test_disables_experiment_on_exception_branches_when_not_explicitly_opted_in(

				        self,

				    ) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 100

				        ---

				        Users:

				        @User,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    def test_allows_experiment_on_exception_branches_when_explicitly_opted_in(

				        self,

				    ) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 100

				                all_branches: true

				        ---

				        Users:

				        @User,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)

				        self.assertEqual("lf.", prefix, "Runner prefix not correct for user")

				if __name__ == "__main__":

				    main()
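
The expectations in `TestRunnerDeterminatorGetRunnerPrefix` above suggest a selection rule that can be sketched roughly as follows. This is a hypothetical reconstruction inferred only from the test cases, not the actual `rd.get_runner_prefix` (which takes the raw settings text and a branch name). The names `sketch_runner_prefix`, `LF_FLEET`, and the local `Experiment` tuple are assumptions made for illustration, as is the cap of one non-fleet experiment, which is read off `test_ignores_extra_experiments`.

```python
import random
from typing import Dict, FrozenSet, Iterable, List, NamedTuple

LF_FLEET = "lf"  # assumption: the fleet experiment whose label always comes first


class Experiment(NamedTuple):
    # Hypothetical stand-in for rd.Experiment, with the fields the tests exercise
    rollout_perc: float = 0     # share of non-opted-in runs that get the experiment
    all_branches: bool = False  # whether it also applies on exception branches
    default: bool = True        # considered when no explicit filter is passed


def sketch_runner_prefix(
    experiments: Dict[str, Experiment],
    optins: Dict[str, List[str]],
    requestors: Iterable[str],
    on_exception_branch: bool = False,
    eligible: FrozenSet[str] = frozenset(),
) -> str:
    users = list(requestors)
    fleet: List[str] = []
    others: List[str] = []
    for name, exp in experiments.items():
        # Exception branches only get experiments marked all_branches
        if on_exception_branch and not exp.all_branches:
            continue
        # An explicit filter wins; otherwise only default experiments count
        if eligible:
            if name not in eligible:
                continue
        elif not exp.default:
            continue
        opted_in = any(name in optins.get(user, []) for user in users)
        # Only roll the dice when nobody opted in and a rollout is configured
        enabled = opted_in or (
            exp.rollout_perc > 0 and random.uniform(0, 100) <= exp.rollout_perc
        )
        if enabled:
            (fleet if name == LF_FLEET else others).append(name)
    # Fleet label first, then at most one other experiment (assumed limit)
    chosen = fleet + others[:1]
    return "".join(f"{label}." for label in chosen)


if __name__ == "__main__":
    exps = {"otherExp": Experiment(), "lf": Experiment()}
    users = {"User1": ["lf"], "User2": ["otherExp", "lf"]}
    print(sketch_runner_prefix(exps, users, ["User2"]))
```

Keeping the opt-in check ahead of the random draw means explicitly opted-in users never depend on the rollout percentage, which matches the tests that use `rollout_perc: 0`.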

									
										35

.github/scripts/test_trymerge.py
									
										vendored
									
												
				@ -12,7 +12,7 @@ import json

				import os

				import warnings

				from hashlib import sha256

				from typing import Any, Dict, List, Optional

				from typing import Any, List, Optional

				from unittest import main, mock, skip, TestCase

				from urllib.error import HTTPError

				@ -24,7 +24,6 @@ from trymerge import (

				    find_matching_merge_rule,

				    get_classifications,

				    get_drci_classifications,

				    get_rockset_results,

				    gh_get_team_members,

				    GitHubPR,

				    JobCheckState,

				@ -42,7 +41,6 @@ if "GIT_REMOTE_URL" not in os.environ:

				    os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"

				GQL_MOCKS = "gql_mocks.json.gz"

				ROCKSET_MOCKS = "rockset_mocks.json.gz"

				DRCI_MOCKS = "drci_mocks.json.gz"

				@ -77,16 +75,11 @@ def mock_query(

				        if err.code == 401 or err.code == 403:

				            err_msg = f"If you are seeing this message during workflow run, please make sure to update {file_name}"

				            err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with"

				            err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN,"

				            err_msg += " the rockset api key passed via ROCKSET_API_KEY,"

				            err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN"

				            err_msg += " and drci api key passed via DRCI_BOT_KEY environment variables"

				            if (

				                os.getenv("GITHUB_TOKEN") is None

				                or os.getenv("ROCKSET_API_KEY") is None

				                or os.getenv("DRCI_BOT_KEY") is None

				            ):

				            if os.getenv("GITHUB_TOKEN") is None or os.getenv("DRCI_BOT_KEY") is None:

				                err_msg = (

				                    "Failed to update cached queries as GITHUB_TOKEN or ROCKSET_API_KEY or DRCI_BOT_KEY "

				                    "Failed to update cached queries as GITHUB_TOKEN or DRCI_BOT_KEY "

				                    + "is not defined. "

				                    + err_msg

				                )

				@ -110,16 +103,6 @@ def mocked_gh_graphql(query: str, **kwargs: Any) -> Any:

				    return mock_query(gh_graphql_wrapper, GQL_MOCKS, key_function, query, kwargs)

				def mocked_rockset_results(head_sha: str, merge_base: str, num_retries: int = 3) -> Any:

				    return mock_query(

				        get_rockset_results,

				        ROCKSET_MOCKS,

				        lambda x, y: f"{x} {y}",

				        head_sha,

				        merge_base,

				    )

				def mocked_drci_classifications(pr_num: int, project: str, num_retries: int = 3) -> Any:

				    return mock_query(

				        get_drci_classifications,

				@ -273,10 +256,6 @@ def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:

				    ]

				def empty_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:

				    return []

				class DummyGitRepo(GitRepo):

				    def __init__(self) -> None:

				        super().__init__(get_git_repo_dir(), get_git_remote_name())

				@ -288,7 +267,6 @@ class DummyGitRepo(GitRepo):

				        return "super awsome commit message"

				@mock.patch("trymerge.get_rockset_results", side_effect=empty_rockset_results)

				@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)

				@mock.patch(

				    "trymerge.get_drci_classifications", side_effect=mocked_drci_classifications

				@ -604,7 +582,6 @@ class TestTryMerge(TestCase):

				            mocked_gh_fetch_merge_base.assert_called_once()

				@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)

				@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)

				@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

				@mock.patch(

				@ -843,7 +820,7 @@ class TestBypassFailures(TestCase):

				        checks = pr.get_checkrun_conclusions()

				        # Known flaky failure takes precedence over ignore current (need to set the

				        # merge base here to get the results from Rockset, and that categorize the

				        # merge base here to get the results from Dr. CI, and that categorize the

				        # broken trunk failure too

				        checks = get_classifications(

				            pr.pr_num,

				@ -929,7 +906,6 @@ class TestBypassFailures(TestCase):

				        )

				@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)

				@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)

				@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

				@mock.patch("trymerge.get_drci_classifications", return_value={})

				@ -1008,7 +984,6 @@ class TestBypassFailuresOnSandCastle(TestCase):

				        self.assertTrue(len(failed) == 2)

				@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)

				@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)

				@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

				@mock.patch(

									
										107

.github/scripts/trymerge.py
									
										vendored
									
												
				@ -36,6 +36,7 @@ from warnings import warn

				import yaml

				from github_utils import (

				    gh_close_pr,

				    gh_fetch_json_list,

				    gh_fetch_merge_base,

				    gh_fetch_url,

				@ -451,8 +452,6 @@ RE_DIFF_REV = re.compile(r"^Differential Revision:.+?(D[0-9]+)", re.MULTILINE)

				CIFLOW_LABEL = re.compile(r"^ciflow/.+")

				CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk")

				MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml"

				ROCKSET_MERGES_COLLECTION = "merges"

				ROCKSET_MERGES_WORKSPACE = "commons"

				REMOTE_MAIN_BRANCH = "origin/main"

				DRCI_CHECKRUN_NAME = "Dr.CI"

				INTERNAL_CHANGES_CHECKRUN_NAME = "Meta Internal-Only Changes Check"

				@ -1140,7 +1139,10 @@ class GitHubPR:

				            if label_base in label:

				                count += 1

				                full_label = f"{label_base}X{count}"

				        gh_add_labels(self.org, self.project, self.pr_num, [full_label], dry_run)

				        self.add_label(full_label, dry_run)

				    def add_label(self, label: str, dry_run: bool) -> None:

				        gh_add_labels(self.org, self.project, self.pr_num, [label], dry_run)

				    def merge_into(

				        self,

				@ -1174,12 +1176,12 @@ class GitHubPR:

				            for pr in additional_merged_prs:

				                pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)

				        if comment_id and self.pr_num:

				            # When the merge process reaches this part, we can assume that the commit

				            # has been successfully pushed to trunk

				            merge_commit_sha = repo.rev_parse(name=REMOTE_MAIN_BRANCH)

				        # When the merge process reaches this part, we can assume that the commit

				        # has been successfully pushed to trunk

				        merge_commit_sha = repo.rev_parse(name=self.default_branch())

				            # Finally, upload the record to Rockset. The list of pending and failed

				        if comment_id and self.pr_num:

				            # Finally, upload the record to s3. The list of pending and failed

				            # checks are at the time of the merge

				            save_merge_record(

				                comment_id=comment_id,

				@ -1201,7 +1203,18 @@ class GitHubPR:

				                ignore_current=bool(ignore_current_checks),

				            )

				        else:

				            print("Missing comment ID or PR number, couldn't upload to Rockset")

				            print("Missing comment ID or PR number, couldn't upload to s3")

				        # Usually Github will see that the commit has "resolves <pr_num>" in the

				        # commit message and close the PR, but sometimes it doesn't, leading to

				        # confusion.  When it doesn't, we close it manually.

				        time.sleep(60)  # Give Github some time to close the PR

				        manually_close_merged_pr(

				            pr=self,

				            additional_merged_prs=additional_merged_prs,

				            merge_commit_sha=merge_commit_sha,

				            dry_run=dry_run,

				        )

				    def merge_changes(

				        self,

				@ -1469,7 +1482,7 @@ def find_matching_merge_rule(

				        # Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here

				        # where the list of checks is readily available. These records will be saved into

				        # Rockset merge records

				        # s3 merge records

				        (

				            pending_mandatory_checks,

				            failed_mandatory_checks,

				@ -1496,13 +1509,41 @@ def checks_to_str(checks: List[Tuple[str, Optional[str]]]) -> str:

				def checks_to_markdown_bullets(

				    checks: List[Tuple[str, Optional[str], Optional[int]]]

				    checks: List[Tuple[str, Optional[str], Optional[int]]],

				) -> List[str]:

				    return [

				        f"- [{c[0]}]({c[1]})" if c[1] is not None else f"- {c[0]}" for c in checks[:5]

				    ]

				def manually_close_merged_pr(

				    pr: GitHubPR,

				    additional_merged_prs: List[GitHubPR],

				    merge_commit_sha: str,

				    dry_run: bool,

				) -> None:

				    def _comment_and_close(pr: GitHubPR, comment: str) -> None:

				        pr = GitHubPR(pr.org, pr.project, pr.pr_num)  # Refresh the PR

				        if not pr.is_closed():

				            gh_post_pr_comment(pr.org, pr.project, pr.pr_num, comment, dry_run)

				            gh_close_pr(pr.org, pr.project, pr.pr_num, dry_run)

				    message = (

				        f"This PR (#{pr.pr_num}) was merged in {merge_commit_sha} but it is still open, likely due to a Github bug, "

				        "so mergebot is closing it manually.  If you think this is a mistake, please feel free to reopen and contact Dev Infra."

				    )

				    _comment_and_close(pr, message)

				    for additional_pr in additional_merged_prs:

				        message = (

				            f"This PR (#{additional_pr.pr_num}) was merged as part of PR #{pr.pr_num} in the stack under {merge_commit_sha} "

				            "but it is still open, likely due to a Github bug, so mergebot is closing it manually. "

				            "If you think this is a mistake, please feel free to reopen and contact Dev Infra."

				        )

				        _comment_and_close(additional_pr, message)

				    print(f"PR {pr.pr_num} and all additional PRs in the stack have been closed.")

				@retries_decorator()

				def save_merge_record(

				    comment_id: int,

				@ -1528,7 +1569,7 @@ def save_merge_record(

				    This saves the merge records as a json, which can later be uploaded to s3

				    """

				    # Prepare the record to be written into Rockset

				    # Prepare the record to be written into s3

				    data = [

				        {

				            "comment_id": comment_id,

				@ -1550,7 +1591,8 @@ def save_merge_record(

				            "ignore_current": ignore_current,

				            "error": error,

				            # This is a unique identifier for the record for deduping purposes

				            # in rockset.  Any unique string would work

				            # in Rockset.  Any unique string would work.  This will not be used

				            # after we migrate off Rockset

				            "_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",

				        }

				    ]

				@ -1560,36 +1602,6 @@ def save_merge_record(

				        json.dump(data, f)

				@retries_decorator(rc=[])

				def get_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:

				    query = f"""

				SELECT

				    w.name as workflow_name,

				    j.id,

				    j.name,

				    j.conclusion,

				    j.completed_at,

				    j.html_url,

				    j.head_sha,

				    j.torchci_classification.captures as failure_captures,

				    LENGTH(j.steps) as steps,

				FROM

				    commons.workflow_job j join commons.workflow_run w on w.id = j.run_id

				where

				    j.head_sha in ('{head_sha}','{merge_base}')

				"""

				    try:

				        import rockset  # type: ignore[import]

				        res = rockset.RocksetClient(

				            host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]

				        ).sql(query)

				        return cast(List[Dict[str, Any]], res.results)

				    except ModuleNotFoundError:

				        print("Could not use RockSet as rocket dependency is missing")

				        return []

				@retries_decorator()

				def get_drci_classifications(pr_num: int, project: str = "pytorch") -> Any:

				    """

				@ -1935,6 +1947,7 @@ def do_revert_prs(

				        )

				        pr.add_numbered_label("reverted", dry_run)

				        pr.add_label("ci-no-td", dry_run)

				        if not dry_run:

				            gh_post_commit_comment(pr.org, pr.project, commit_sha, revert_msg)

				            gh_update_pr_state(pr.org, pr.project, pr.pr_num)

				@ -2027,7 +2040,7 @@ def categorize_checks(

				    pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    # failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset

				    # failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on s3

				    failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)

				    # If required_checks is not set or empty, consider all names are relevant

				@ -2086,7 +2099,7 @@ def categorize_checks(

				    ):

				        failed_checks = failed_checks + flaky_or_broken_trunk

				    # The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record

				    # The list of failed_checks_categorization is returned so that it can be saved into the s3 merge record

				    return (pending_checks, failed_checks, failed_checks_categorization)

				@ -2370,7 +2383,7 @@ def main() -> None:

				        handle_exception(e)

				        if args.comment_id and args.pr_num:

				            # Finally, upload the record to Rockset, we don't have access to the

				            # Finally, upload the record to s3, we don't have access to the

				            # list of pending and failed checks here, but they are not really

				            # needed at the moment

				            save_merge_record(

				@ -2393,7 +2406,7 @@ def main() -> None:

				                error=str(e),

				            )

				        else:

				            print("Missing comment ID or PR number, couldn't upload to Rockset")

				            print("Missing comment ID or PR number, couldn't upload to s3")

				    finally:

				        if not args.check_mergeability:

				            gh_remove_label(

									
										31

.github/scripts/update_runner_determinator.py
									
										vendored
									
										Executable file
									
												
				@ -0,0 +1,31 @@

				#!/usr/bin/env python3

				import re

				# Read the contents of runner_determinator.py

				with open(".github/scripts/runner_determinator.py") as script_file:

				    script_content = script_file.read()

				# Indent the script content by 10 spaces to match destination indentation

				indented_script_content = "\n".join(

				    [" " * 10 + line if line else line for line in script_content.splitlines()]

				)

				# Read the contents of _runner-determinator.yml

				with open(".github/workflows/_runner-determinator.yml") as yml_file:

				    yml_content = yml_file.read()

				# Replace the content between the markers

				new_yml_content = re.sub(

				    r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",

				    lambda match: match.group(1) + indented_script_content + match.group(3),

				    yml_content,

				    flags=re.DOTALL,

				)

				# Save the modified content back to _runner-determinator.yml

				with open(".github/workflows/_runner-determinator.yml", "w") as yml_file:

				    yml_file.write(new_yml_content)

				print("Updated _runner-determinator.yml with the contents of runner_determinator.py")
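
The substitution performed by `update_runner_determinator.py` can be tried on in-memory strings without touching the repository. The sketch below reuses the same regular expression. The function name `embed_script`, the `indent` parameter, and the sample YAML are assumptions made for illustration.

```python
import re


def embed_script(yml_content: str, script_content: str, indent: int = 10) -> str:
    # Indent every non-empty line so the script sits inside the YAML block scalar
    indented = "\n".join(
        " " * indent + line if line else line
        for line in script_content.splitlines()
    )
    # A callable replacement keeps backslashes in the script literal; a plain
    # replacement string would have its escapes interpreted by re.sub
    return re.sub(
        r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",
        lambda match: match.group(1) + indented + match.group(3),
        yml_content,
        flags=re.DOTALL,
    )


if __name__ == "__main__":
    sample_yml = (
        "run: |\n"
        "  cat <<EOF > runner_determinator.py\n"
        "  print('old')\n"
        "  EOF\n"
        "  python3 runner_determinator.py\n"
    )
    sample_script = "print('new')\npattern = r\"\\d+\"\n"
    print(embed_script(sample_yml, sample_script, indent=2))
```

The lazy `(.*?)` together with `re.DOTALL` stops at the first indented `EOF`, so only the heredoc body is replaced and the surrounding workflow steps are left as they were.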

12

.github/templates/common.yml.j2 vendored


 @ -25,7 +25,7 @@ concurrency:
             # Pulled from instance metadata endpoint for EC2
             # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
             category=$1
             curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
             curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
           }
           echo "ami-id: $(get_ec2_metadata ami-id)"
           echo "instance-id: $(get_ec2_metadata instance-id)"
 @ -40,6 +40,16 @@ concurrency:
         continue-on-error: true
         with:
           github-secret: ${{ secrets.GITHUB_TOKEN }}
       - name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
         shell: bash
         run: |
           git config --global core.longpaths true
           git config --global core.symlinks true
           # https://git-scm.com/docs/git-fsmonitor--daemon.  The daemon could lock
           # the directory on Windows and prevent GHA from checking out as reported
           # in https://github.com/actions/checkout/issues/1018
           git config --global core.fsmonitor false
       # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
       - name: Enable long paths on Windows
         shell: powershell
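
The updated `get_ec2_metadata` helper in the first hunk of this file switches to the token-based metadata flow: a `PUT` to `/latest/api/token` obtains a short-lived token, which is then sent as a header on the metadata `GET`. A rough Python equivalent of those two requests is sketched below. The function names and the `timeout` parameter are assumptions for illustration, and `get_ec2_metadata` only works when run on an EC2 instance.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds: int = 30) -> urllib.request.Request:
    # Step 1: ask the metadata service for a session token with a short TTL
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(category: str, token: str) -> urllib.request.Request:
    # Step 2: present the token when reading a metadata category
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{category}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def get_ec2_metadata(category: str, timeout: float = 2.0) -> str:
    # Performs both requests; raises URLError when not on an EC2 instance
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request(category, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

Splitting request construction from the network call keeps the header logic checkable off-instance.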

5

.github/templates/linux_binary_build_workflow.yml.j2 vendored


 @ -53,8 +53,9 @@ env:
 jobs:
   get-label-type:
     if: github.repository_owner == 'pytorch'
     name: get-label-type
     uses: ./.github/workflows/_runner-determinator.yml
     uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
     with:
       triggering_actor: ${{ github.triggering_actor }}
       issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
 @ -68,6 +69,7 @@ jobs:
     needs: get-label-type
     with:!{{ upload.binary_env_as_input(config) }}
       {%- if "aarch64" in build_environment %}
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       runs_on: linux.arm64.m7g.4xlarge.ephemeral
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
 @ -102,6 +104,7 @@ jobs:
       build_name: !{{ config["build_name"] }}
       build_environment: !{{ build_environment }}
       {%- if "aarch64" in build_environment %}
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       runs_on: linux.arm64.2xlarge
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}

3

.github/templates/windows_binary_build_workflow.yml.j2 vendored


 @ -54,8 +54,9 @@ env:
 jobs:
   get-label-type:
     if: github.repository_owner == 'pytorch'
     name: get-label-type
     uses: ./.github/workflows/_runner-determinator.yml
     uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
     with:
       triggering_actor: ${{ github.triggering_actor }}
       issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

									
										145

.github/workflows/_android-build-test.yml
									
										vendored
									
											
				@ -1,145 +0,0 @@

				name: android-build-test

				on:

				  workflow_call:

				    inputs:

				      build-environment:

				        required: true

				        type: string

				        description: Top-level label for what's being built/tested.

				      docker-image-name:

				        required: true

				        type: string

				        description: Name of the base docker image to build with.

				      sync-tag:

				        required: false

				        type: string

				        default: ""

				        description: |

				          If this is set, our linter will use this to make sure that every other

				          job with the same `sync-tag` is identical.

				      test-matrix:

				        required: true

				        type: string

				        description: |

				          A JSON description of what configs to run later on.

				env:

				  GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}

				jobs:

				  filter:

				    if: github.repository_owner == 'pytorch'

				    runs-on: [self-hosted, linux.large]

				    outputs:

				      test-matrix: ${{ steps.filter.outputs.test-matrix }}

				      is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}

				      keep-going: ${{ steps.filter.outputs.keep-going }}

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				        with:

				          fetch-depth: 1

				          submodules: false

				      - name: Select all requested test configurations

				        id: filter

				        uses: ./.github/actions/filter-test-configs

				        with:

				          github-token: ${{ secrets.GITHUB_TOKEN }}

				          test-matrix: ${{ inputs.test-matrix }}

				  build-and-test:

				    needs: filter

				    # Don't run on forked repos.

				    if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False'

				    strategy:

				      matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }}

				      fail-fast: false

				    runs-on: ${{ matrix.runner }}

				    steps:

				      - name: Setup SSH (Click me for login details)

				        uses: pytorch/test-infra/.github/actions/setup-ssh@main

				        with:

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				      # [see note: pytorch repo ref]

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				      - name: Setup Linux

				        uses: ./.github/actions/setup-linux

				      - name: Calculate docker image

				        id: calculate-docker-image

				        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				        with:

				          docker-image-name: ${{ inputs.docker-image-name }}

				      - name: Pull docker image

				        uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				        with:

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      - name: Output disk space left

				        run: |

				          sudo df -H

				      - name: Preserve github env variables for use in docker

				        run: |

				          env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				          env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				      - name: Build

				        env:

				          BUILD_ENVIRONMENT: ${{ inputs.build-environment }}

				          TORCH_CUDA_ARCH_LIST: 5.2

				          SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				        run: |

				          set -e

				          # Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because:

				          # 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build;

				          # 2) Not parallelizable by architecture: it only builds libtorch for one architecture;

				          export BUILD_LITE_INTERPRETER

				          BUILD_LITE_INTERPRETER="1"

				          if [[ "${BUILD_ENVIRONMENT}" == *"full-jit" ]]; then

				            BUILD_LITE_INTERPRETER="0"

				          fi

				          git submodule sync && git submodule update -q --init --recursive --depth 1

				          export id

				          id=$(docker run -e BUILD_ENVIRONMENT \

				            -e MAX_JOBS="$(nproc --ignore=2)" \

				            -e SCCACHE_BUCKET \

				            -e SKIP_SCCACHE_INITIALIZATION=1 \

				            -e TORCH_CUDA_ARCH_LIST \

				            -e BUILD_LITE_INTERPRETER \

				            --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				            --security-opt seccomp=unconfined \

				            --cap-add=SYS_PTRACE \

				            --tty \

				            --detach \

				            --user jenkins \

				            -v "$(pwd):/var/lib/jenkins/workspace" \

				            --cap-add=SYS_PTRACE \

				            --security-opt seccomp=unconfined \

				            --cap-add=SYS_PTRACE \

				            --security-opt seccomp=unconfined \

				            -t -d -w /var/lib/jenkins "${DOCKER_IMAGE}")

				          export COMMAND

				          # shellcheck disable=SC2016

				          COMMAND='(echo "sudo chown -R jenkins workspace && cd workspace && ./scripts/build_android_gradle.sh" | docker exec -u jenkins -e BUILD_LITE_INTERPRETER -e GRADLE_OFFLINE=1 -i "$id" bash) 2>&1'

				          echo "${COMMAND}" > ./command.sh && bash ./command.sh

				          # Skip docker push as this job is purely for size analysis purpose.

				          # Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied.

				      - name: Chown workspace

				        uses: ./.github/actions/chown-workspace

				        if: always()

				      - name: Teardown Linux

				        uses: pytorch/test-infra/.github/actions/teardown-linux@main

				        if: always()

									
190 .github/workflows/_android-full-build-test.yml vendored

				@@ -1,190 +0,0 @@

				name: android-full-build-test

				on:

				  workflow_call:

				    inputs:

				      build-environment:

				        required: true

				        type: string

				        description: Top-level label for what's being built/tested.

				      docker-image-name:

				        required: true

				        type: string

				        description: Name of the base docker image to build with.

				      sync-tag:

				        required: false

				        type: string

				        default: ""

				        description: |

				          If this is set, our linter will use this to make sure that every other

				          job with the same `sync-tag` is identical.

				      test-matrix:

				        required: true

				        type: string

				        description: |

				          A JSON description of what configs to run later on.

				env:

				  GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}

				jobs:

				  filter:

				    if: github.repository_owner == 'pytorch'

				    runs-on: [self-hosted, linux.large]

				    outputs:

				      test-matrix: ${{ steps.filter.outputs.test-matrix }}

				      is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}

				      keep-going: ${{ steps.filter.outputs.keep-going }}

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				        with:

				          fetch-depth: 1

				          submodules: false

				      - name: Select all requested test configurations

				        id: filter

				        uses: ./.github/actions/filter-test-configs

				        with:

				          github-token: ${{ secrets.GITHUB_TOKEN }}

				          test-matrix: ${{ inputs.test-matrix }}

				  build:

				    needs: filter

				    # Don't run on forked repos.

				    if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False'

				    strategy:

				      matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }}

				      fail-fast: false

				    runs-on: ${{ matrix.runner }}

				    steps:

				      - name: Setup SSH (Click me for login details)

				        uses: pytorch/test-infra/.github/actions/setup-ssh@main

				        with:

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				      # [see note: pytorch repo ref]

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				      - name: Setup Linux

				        uses: ./.github/actions/setup-linux

				      - name: Calculate docker image

				        id: calculate-docker-image

				        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				        with:

				          docker-image-name: ${{ inputs.docker-image-name }}

				      - name: Pull docker image

				        uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				        with:

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      - name: Output disk space left

				        shell: bash

				        run: |

				          sudo df -H

				      - name: Preserve github env variables for use in docker

				        shell: bash

				        run: |

				          env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				          env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				      - name: Parse ref

				        id: parse-ref

				        run: .github/scripts/parse_ref.py

				      - name: Build arm-v7a

				        uses: ./.github/actions/build-android

				        with:

				          arch: arm_v7a

				          arch-for-build-env: arm-v7a

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				          build-environment: ${{ inputs.build-environment }}

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          branch: ${{ steps.parse-ref.outputs.branch }}

				      - name: Build arm-v8a

				        uses: ./.github/actions/build-android

				        with:

				          arch: arm_v8a

				          arch-for-build-env: arm-v8a

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				          build-environment: ${{ inputs.build-environment }}

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          branch: ${{ steps.parse-ref.outputs.branch }}

				      - name: Build x86_32

				        id: build-x86_32

				        uses: ./.github/actions/build-android

				        with:

				          arch: x86_32

				          arch-for-build-env: x86_32

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				          build-environment: ${{ inputs.build-environment }}

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          branch: ${{ steps.parse-ref.outputs.branch }}

				      - name: Build x86_64

				        uses: ./.github/actions/build-android

				        with:

				          arch: x86_64

				          arch-for-build-env: x86_64

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				          build-environment: ${{ inputs.build-environment }}

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          branch: ${{ steps.parse-ref.outputs.branch }}

				      - name: Build final artifact

				        env:

				          BRANCH: ${{ steps.parse-ref.outputs.branch }}

				          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          AWS_DEFAULT_REGION: us-east-1

				          PR_NUMBER: ${{ github.event.pull_request.number }}

				          SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				          SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				          ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }}

				        run: |

				          set -eux

				          # Putting everything together

				          # ID_X86_32 container were created during build-x86_32 step

				          docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v7a" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v7a"

				          docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_64" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_x86_64"

				          docker cp "${GITHUB_WORKSPACE}/build_android_install_arm_v8a" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_arm_v8a"

				          docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_32" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_x86_32"

				          # run gradle buildRelease

				          (echo "./scripts/build_android_gradle.sh" | docker exec \

				            -e BUILD_ENVIRONMENT="pytorch-linux-focal-py3-clang9-android-ndk-r21e-gradle-build" \

				            -e MAX_JOBS="$(nproc --ignore=2)" \

				            -e AWS_DEFAULT_REGION \

				            -e PR_NUMBER \

				            -e SHA1 \

				            -e BRANCH \

				            -e SCCACHE_BUCKET \

				            -e SKIP_SCCACHE_INITIALIZATION=1 \

				            --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				            --user jenkins \

				            -u jenkins -i "${ID_X86_32}" bash) 2>&1

				          mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts"

				          docker cp "${ID_X86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/"

				      - name: Store PyTorch Android Build Artifacts on S3

				        uses: seemethere/upload-artifact-s3@v5

				        with:

				          name: ${{ inputs.build-environment }}

				          retention-days: 14

				          if-no-files-found: error

				          path: build_android_artifacts/artifacts.tgz

				      - name: Chown workspace

				        uses: ./.github/actions/chown-workspace

				        if: always()

				      - name: Teardown Linux

				        uses: pytorch/test-infra/.github/actions/teardown-linux@main

				        if: always()

									
13 .github/workflows/_bazel-build-test.yml vendored

@@ -91,14 +91,14 @@ jobs:
         with:
           docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
-      - name: Check if in a ARC runner
+      - name: Check if in a container runner
         shell: bash
-        id: check_arc_runner
-        run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
+        id: check_container_runner
+        run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
       - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
         uses: pytorch/test-infra/.github/actions/setup-nvidia@main
-        if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
+        if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
       - name: Output disk space left
         run: |
@@ -137,11 +137,15 @@ jobs:
           AWS_DEFAULT_REGION: us-east-1
           SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
           SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
+          SCCACHE_REGION: us-east-1
           TORCH_CUDA_ARCH_LIST: 5.2
           DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
           OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
           CUDA_VERSION: ${{ inputs.cuda-version }}
         run: |
+          python3 -m pip install boto3==1.19.12
+          # Fetch aws credential from IMDs
+          eval "$(python3 .github/scripts/get_aws_session_tokens.py)"
           export SHARD_NUMBER=0
           # detached container should get cleaned up by teardown_ec2_linux
           # TODO: Stop building test binaries as part of the build phase
@@ -163,6 +167,7 @@ jobs:
             -e NUM_TEST_SHARDS \
             -e MAX_JOBS="$(nproc --ignore=2)" \
             -e SCCACHE_BUCKET \
+            -e SCCACHE_REGION \
             -e SKIP_SCCACHE_INITIALIZATION=1 \
             -e REENABLED_ISSUES \
             -e TORCH_CUDA_ARCH_LIST \
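
The renamed step in the first hunk treats a runner as containerized when either `/.inarc` or `/.incontainer` exists. A minimal sketch of that check as a standalone function follows, assuming a hypothetical `root` parameter (not part of the workflow) so the logic can be exercised against a scratch directory:

```shell
#!/usr/bin/env bash
# Sketch only: echo "true" when either marker file exists, else "false".
# The `root` argument is a hypothetical addition for testing; the workflow
# step checks /.inarc and /.incontainer directly.
in_container_runner() {
  local root="${1:-}"
  if [ -f "${root}/.inarc" ] || [ -f "${root}/.incontainer" ]; then
    echo true
  else
    echo false
  fi
}

# With no argument this inspects /.inarc and /.incontainer, as the step does.
echo "IN_CONTAINER_RUNNER=$(in_container_runner)"
```

In the workflow the same line is appended to `"$GITHUB_OUTPUT"` so that the later `if:` condition can read it as a step output.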

									
4 .github/workflows/_binary-build-linux.yml vendored

@@ -271,7 +271,9 @@ jobs:
           )
           docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
           if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
-            docker exec -t "${container_name}" bash -c "bash /builder/aarch64_linux/aarch64_ci_build.sh"
+            docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/aarch64_linux/aarch64_ci_build.sh"
+          elif [[ ${{ inputs.PACKAGE_TYPE }} == "manywheel" || ${{ inputs.PACKAGE_TYPE }} == "libtorch" ]]; then
+            docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
           else
             docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/${{ inputs.PACKAGE_TYPE }}/build.sh"
           fi
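
The branch order in this hunk decides which build script runs: aarch64 environments first, then `manywheel` and `libtorch` from the in-repo `.ci` directory, and everything else from `/builder`. The sketch below mirrors that dispatch; `select_build_script` and the sample environment name `linux-binary-manywheel` are hypothetical and used only for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: echo the build script path the updated step would invoke.
# $1 = build environment name, $2 = package type.
select_build_script() {
  build_environment="$1"
  package_type="$2"
  # aarch64 builds take priority regardless of package type.
  case "${build_environment}" in
    *aarch64*)
      echo "/builder/aarch64_linux/aarch64_ci_build.sh"
      return
      ;;
  esac
  case "${package_type}" in
    manywheel|libtorch)
      # These package types now build from scripts under .ci in the repo.
      echo "/pytorch/.ci/${package_type}/build.sh"
      ;;
    *)
      echo "/builder/${package_type}/build.sh"
      ;;
  esac
}

# Hypothetical environment name, chosen for illustration.
select_build_script "linux-binary-manywheel" "manywheel"
```

In the workflow itself every branch also sources `${BINARY_ENV_FILE}` before running the script, which the sketch omits.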

Compare commits

2239 Commits v2.5.0 ... PR-FixConf

3 .buckconfig.oss
1 .ci/docker/android/AndroidManifest.xml
66 .ci/docker/android/build.gradle
6 .ci/docker/aotriton_version.txt
58 .ci/docker/build.sh
2 .ci/docker/ci_commit_pins/executorch.txt
1 .ci/docker/ci_commit_pins/triton-cpu.txt Normal file
2 .ci/docker/ci_commit_pins/triton-xpu.txt
2 .ci/docker/ci_commit_pins/triton.txt
112 .ci/docker/common/install_android.sh
4 .ci/docker/common/install_aotriton.sh
51 .ci/docker/common/install_cache.sh
11 .ci/docker/common/install_clang.sh
19 .ci/docker/common/install_conda.sh
22 .ci/docker/common/install_cpython.sh
73 .ci/docker/common/install_cuda.sh
14 .ci/docker/common/install_cuda_aarch64.sh
2 .ci/docker/common/install_cusparselt.sh
16 .ci/docker/common/install_graphviz.sh Normal file
51 .ci/docker/common/install_miopen.sh
2 .ci/docker/common/install_onnx.sh
8 .ci/docker/common/install_triton.sh
7 .ci/docker/common/install_user.sh
11 .ci/docker/common/install_xpu.sh
5 .ci/docker/conda/Dockerfile
6 .ci/docker/conda/build.sh
5 .ci/docker/libtorch/Dockerfile
1 .ci/docker/manywheel/Dockerfile
9 .ci/docker/manywheel/build.sh
14 .ci/docker/manywheel/build_scripts/ssl-check.py
51 .ci/docker/requirements-ci.txt
2 .ci/docker/triton_version.txt
5 .ci/docker/ubuntu-rocm/Dockerfile
27 .ci/docker/ubuntu/Dockerfile
10 .ci/libtorch/build.sh Normal file
21 .ci/manywheel/LICENSE Normal file
25 .ci/manywheel/build.sh Executable file
505 .ci/manywheel/build_common.sh Normal file
99 .ci/manywheel/build_cpu.sh Executable file
290 .ci/manywheel/build_cuda.sh Normal file
353 .ci/manywheel/build_libtorch.sh Normal file
263 .ci/manywheel/build_rocm.sh Executable file
26 .ci/manywheel/test_wheel.sh Executable file
39 .ci/pytorch/build.sh
6 .ci/pytorch/common-build.sh
13 .ci/pytorch/common_utils.sh
12 .ci/pytorch/create_test_cert.py
19 .ci/pytorch/macos-test.sh
153 .ci/pytorch/test.sh
2 .ci/pytorch/win-build.sh
3 .ci/pytorch/win-test-helpers/build_pytorch.bat
6 .ci/pytorch/win-test.sh
9 .circleci/scripts/binary_linux_test.sh
8 .circleci/scripts/binary_populate_env.sh
28 .clang-format
38 .github/ISSUE_TEMPLATE.md vendored
3 .github/ISSUE_TEMPLATE/ci-sev.md vendored
24 .github/actionlint.yaml vendored
4 .github/actions/build-android/action.yml vendored
6 .github/actions/checkout-pytorch/action.yml vendored
30 .github/actions/linux-test/action.yml vendored
2 .github/actions/pytest-cache-download/action.yml vendored
2 .github/actions/pytest-cache-upload/action.yml vendored
14 .github/actions/setup-linux/action.yml vendored
2 .github/actions/setup-win/action.yml vendored
14 .github/actions/upload-test-artifacts/action.yml vendored
2 .github/ci_commit_pins/audio.txt vendored
2 .github/ci_commit_pins/torchbench.txt vendored
6 .github/labeler.yml vendored
369 .github/lf-canary-scale-config.yml vendored
369 .github/lf-scale-config.yml vendored
13 .github/merge_rules.yaml vendored
3 .github/pytorch-probot.yml vendored
2 .github/requirements-gha-cache.txt vendored
2 .github/requirements/README.md vendored
4 .github/requirements/conda-env-Linux-X64.txt vendored
2 .github/requirements/conda-env-iOS.txt vendored
4 .github/requirements/conda-env-macOS-ARM64 vendored

16 .github/requirements/conda-env-macOS-X64 vendored