pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-24 15:44:58 +08:00

Author	SHA1	Message	Date
Boyuan Feng	9ad25d4c05	remove check	2024-08-05 15:22:44 -07:00
Yifu Wang	ea42027e0e	[micro_pipeline_tp] support all _scaled_mm args (#131984 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131984 Approved by: https://github.com/weifengpy	2024-08-05 21:44:37 +00:00
Edward Yang	2b5e31d099	Move sigmoid run_const_graph HOP to PyTorch core (#132526 ) Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible. Test Plan: sandcastle and oss ci Differential Revision: D60674861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526 Approved by: https://github.com/SherlockNoMad	2024-08-05 21:40:56 +00:00
Brian Hirsh	af8b8a47cb	fsdp.set_: convey to functionalization that it mutates storage (#132322 ) Fixes https://github.com/pytorch/pytorch/issues/132197 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132322 Approved by: https://github.com/albanD, https://github.com/yf225 ghstack dependencies: #132243, #132337	2024-08-05 21:28:59 +00:00
Brian Hirsh	1a0db29932	move torch._functionalize APIs to pybind. add one for marking storage mutations (#132337 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132337 Approved by: https://github.com/albanD, https://github.com/justinchuby ghstack dependencies: #132243	2024-08-05 21:28:59 +00:00
Brian Hirsh	4db368a475	make functorch CSE respect mutations as barriers (like fsdp.set_) (#132243 ) Fixes https://github.com/pytorch/pytorch/issues/132200 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132243 Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/yf225	2024-08-05 21:28:55 +00:00
Fangjun Kuang	ee0ae11b34	Fix a typo in the example code. (#132601 ) Since the backward multiplies the gradient by `n`, we must change the forward function to multiply the input tensor by `n`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132601 Approved by: https://github.com/soulitzer	2024-08-05 21:04:20 +00:00
albanD	9a1ad3345f	Fix periodic windows test (#132648 ) This test fails to clean up folders on windows for the past week, see `27f61eba58` for example Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648 Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet	2024-08-05 20:54:20 +00:00
cyy	6b12dc0224	[Reland] [11/N] Use std::nullopt and std::optional (#132622 ) Reland of #132396, which was reverted due to dependency reversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132622 Approved by: https://github.com/ezyang	2024-08-05 20:36:33 +00:00
Sam Larsen	6f4dc56735	[inductor] Default to 1 compile thread for internal (#132540 ) Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy. Rather than opting all internal usages into both a) parallel-compile, _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then start rolling out the new parallel compile implementation only to those usages that have already opted in by explicitly setting compile_thread > 1 Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540 Approved by: https://github.com/c00w	2024-08-05 20:23:16 +00:00
Pearu Peterson	1471473b84	Add tests to bsr_dense_addmm_meta. Tune bsr_dense_addmm kernel for ViT shapes. (#132646 ) As in the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132646 Approved by: https://github.com/cpuhrsch	2024-08-05 20:22:33 +00:00
Basil Wong	b7bcfdaff2	Change deprecate warning on dispatch_on_subclass to warn once (#132374 ) Summary: # Problem `TORCH_WARN` can cause massive log spam. I output the logs for before and after adding this change. Before: * The log file size was ~61.15 MB (61148028 bytes). After: * The log file size was ~56.44 MB (56444057 bytes). # Context Looks like we tried to land this change earlier but it was reverted: * D59413413 * Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function # Testing Update `test_warn_on_invalid_torch_function` would fail because the warning would not be called on the handling of the second torch function class since `TORCH_WARN_ONCE` stops repeats globally. Updated so that it runs separate programs. (Was not able to actually run the test, could someone help me with that?) Test Plan: Need help with this... Differential Revision: D60561181 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374 Approved by: https://github.com/ezyang	2024-08-05 20:02:33 +00:00
PyTorch MergeBot	2764bee942	Revert "[MPS] Add support for autocast in MPS (#99272 )" This reverts commit 6919e8baaba391ced7b4acaa553d6ea1f3b30e79. Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](`6919e8baab`) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))	2024-08-05 19:59:04 +00:00
PyTorch MergeBot	a3ea96b762	Revert "[export] Convert autocast to HOO (#131914 )" This reverts commit aec948adfc224e49213c4bc49586d4e4ba65fbbb. Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/davidberard98 due to PR shouldn't have been relanded by the bot, phabricator diff did not have any recent changes and is still internally reverted ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2269797388))	2024-08-05 19:52:09 +00:00
Jack Taylor	1d34f33d00	Scale XBLOCK in triton reduction configs to avoid hitting max grid (#128826 ) Scale XBLOCK size in triton_config_reduction to avoid hitting maxGridSize limits. This issue was observed in gpt-fast examples with large sequence length: Reproducer: https://gist.github.com/jataylo/8a0ba922fbf68e345d360a418b48b9f1 `RuntimeError: Triton Error [HIP]: Code: 9, Messsage: invalid configuration argument` Co-authored-by: Jason Ansel <jansel@jansel.net> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128826 Approved by: https://github.com/jansel, https://github.com/nmacchioni	2024-08-05 19:34:38 +00:00
David Berard	e1c2bdac2f	[easy] fix f-string messages in torch/_ops.py (#132531 ) I encountered these when making this change: ``` diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py index 3a2e07fa147..a4d003399e7 100644 --- a/test/functorch/test_ac.py +++ b/test/functorch/test_ac.py @@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase): expected = call() for budget in range(0, 11): - memory_budget = budget / 10 - torch._dynamo.reset() - with config.patch(activation_memory_budget=memory_budget): - if memory_budget is not None: - f_compile = torch.compile( - call, backend="aot_eager_decomp_partition" - ) - - self.assertEqual(expected, f_compile()) + get_mem_and_flops(call, memory_budget=budget / 10) + def test_prioritize_cheaper_matmul(self): def f(xs, ws): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531 Approved by: https://github.com/Skylion007	2024-08-05 18:58:33 +00:00
Shangdi Yu	aec948adfc	[export] Convert autocast to HOO (#131914 ) Summary: Suggested in https://github.com/pytorch/pytorch/issues/128394. If there's an autocast context manager, the predispatch (strict) graph can look something like: ``` class <lambda>(torch.nn.Module): def forward(self, x: "f32[1]"): ... _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None) mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None return (mm_1,) ``` But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`. Some potential followup improvement: 1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py` 2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status. Test Plan: CI ``` parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export run_tests("test_predispatch_autocast") ``` Reviewed By: angelayi Differential Revision: D60206382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914 Approved by: https://github.com/angelayi	2024-08-05 18:52:12 +00:00
zdevito	8d9c3a71f6	Support IPC for Expandable Segments (#130890 ) This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed. Differential Revision: [D60547506](https://our.internmc.facebook.com/intern/diff/D60547506) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890 Approved by: https://github.com/dsjohns2	2024-08-05 18:48:13 +00:00
Yidi Wu	618e2c9de4	fix torch rec test failure (#132437 ) Summary: Fixes T192448049. The module calls form an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work. Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export" Reviewed By: zhxchen17 Differential Revision: D60528900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437 Approved by: https://github.com/Skylion007	2024-08-05 18:06:07 +00:00
Max Podkorytov	1c7dc335f7	[ROCm][CK][Inductor] Enable addmm for CK backend to gemm max autotune (#130576 ) Add functional support for torch.addmm with CK backend. See also #125453 # Implementation details 1. It turns out we can use the same template between addmm and matmul; essentially, matmul is addmm with empty bias 2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed from `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into `develop` branch after this PR lands to avoid breaking the current matmul # Testing `pytest test/inductor/test_ck_backend.py -k addmm` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576 Approved by: https://github.com/chenyang78	2024-08-05 17:49:09 +00:00
Nikita Shulga	7b2664ece6	Temp disable MKL in DistributionKernels.cpp (#132532 ) Until https://github.com/pytorch/pytorch/issues/132395 is addressed Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential ) ```python import torch high_bits_for_seed = 16000000000000000000 # to use "good quality" seed _ = torch.manual_seed (high_bits_for_seed + 2024) prob = torch.ones (26) dups_mult = 0 perm_counts_mult = {} for _ in range (1_000_000): p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist()) if p in perm_counts_mult: dups_mult += 1 perm_counts_mult[p] += 1 else: perm_counts_mult[p] = 1 print ('duplicate multinomial perms: ', dups_mult) print ('multiple multinomial perms: ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item()) print ('max of perm_counts_mult: ', torch.tensor (list (perm_counts_mult.values())).max().item()) print ('len (perm_counts_mult): ', len (perm_counts_mult)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132532 Approved by: https://github.com/albanD	2024-08-05 17:40:57 +00:00
PyTorch MergeBot	baa2483cea	Revert "Refactor thunkify to return proper thunk abstraction (#132407 )" This reverts commit c65cb37657ef4f7fcd070a7e8e5121eb299919fd. Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to td strikes again ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2269577711))	2024-08-05 17:39:54 +00:00
cyy	d5045cceff	[16/N] Fix clang-tidy warnings in jit (#132604 ) Follows #132564 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132604 Approved by: https://github.com/Skylion007	2024-08-05 17:36:22 +00:00
Wouter Devriendt	e8645fa2b9	[Doc] fix some typos (found by codespell and typos) (#132544 ) Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544 Approved by: https://github.com/kit1980	2024-08-05 17:21:56 +00:00
albanD	3d87dfc088	Add basic OpenReg module scaffolding with autograd (#131708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131708 Approved by: https://github.com/ezyang	2024-08-05 17:07:11 +00:00
Will Constable	df59084012	Drop GIL around cudart APIs (#132520 ) Noticed a hang where the stuck thread blocked on cudaHostUnregister call, probably due to an internal cuda deadlock caused by something else, but was holding the GIL at the time and blocked other python threads. As far as I can tell cudart APIs all do not require the GIL held nor are they marked as thread unsafe. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520 Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja	2024-08-05 17:04:01 +00:00
Kulin Seth	6919e8baab	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet	2024-08-05 17:02:30 +00:00
Kiuk Chung	d532c00c81	[test/torch_np] Fix usages of deprecated NumPy 2.0 APIs in numpy_tests (#131909 ) Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide). I did a grep on the old API usages (see list below) and these were only referenced in test files under `test/torch_np/numpy_tests/*/.py`. Specifically, migrates the usages of the following APIs: 1. `np.sctypes` → Access dtypes explicitly instead 2. `np.float_` → `np.float64` 3. `np.complex_` → `np.complex128` 4. `np.longcomplex` → `np.clongdouble` 5. `np.unicode_` → `np.str_` 6. `np.product` → `np.prod` 7. `np.cumproduct` → `np.cumprod` 8. `np.alltrue` → `np.all` 9. `np.sometrue` → `np.any` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909 Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman	2024-08-05 16:21:08 +00:00
Xu Han	a672f6c84e	[inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py (#132615 ) [inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/132615 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-08-05 16:00:35 +00:00
Xu Han	9945caec65	[inductor] Fix autotune non-close attr crash on Windows (#132630 ) When I enabled the `autotune`-related UTs on Windows, I found the missing `close` attr issue. I checked that the DLL type is `CDLL`, which doesn't have a `close` attr. This PR checks for the `close` attr before doing the close operation. After this fix, the UTs passed. Here are some existing issues: 1. `CDLL` doesn't have a `close` attr, so the DLLs are not closed, though this didn't crash on Linux. 2. This PR only avoids the crash on Windows; it doesn't actually close the DLL either. TODO: We need to replace `CDLL` with `DLLWrapper` in `CppBenchmarkRequest`, like `CUDABenchmarkRequest`. I have added a task to track this: https://github.com/pytorch/pytorch/issues/124245 , and will follow up on this change in a later PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132630 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-08-05 16:00:27 +00:00
Aart Bik	a8490a0762	[traced-graph][sparse] propagate sparsity in fx graph (#131920 ) This PR proceeds with implementing the feature request #117188 by generalizing more cases that already work with COO to work with the compressed sparse formats as well. Feature request: https://github.com/pytorch/pytorch/issues/117188 Rebranch of older PRs (for history): https://github.com/pytorch/pytorch/pull/131474 https://github.com/pytorch/pytorch/pull/128549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920 Approved by: https://github.com/ezyang	2024-08-05 15:49:53 +00:00
Aleksei Nikiforov	14edd986b3	Fix missing include file (#132647 ) This error only appears with newer gcc releases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132647 Approved by: https://github.com/Skylion007	2024-08-05 15:49:49 +00:00
Andrew Gu	70cb16b316	[DTensor] Added naive replicate strategy for more diagonal ops (#132201 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132201 Approved by: https://github.com/wz337 ghstack dependencies: #132104	2024-08-05 15:18:56 +00:00
Edward Z. Yang	c65cb37657	Refactor thunkify to return proper thunk abstraction (#132407 ) This is superior to lru_cache because (1) it's more explicit and (2) it doesn't leak the original function after it's been forced. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407 Approved by: https://github.com/albanD ghstack dependencies: #131649	2024-08-05 14:42:40 +00:00
Brian Hirsh	b465a5843b	DTensor: add more foreach ops to supported sharding prop list (#132066 ) fixes https://github.com/pytorch/pytorch/issues/132016. Right now if you run an op for which DTensor has no sharding prop rule, and that op accepts non-trivial pytrees of input tensors as arguments, DTensor can end up looping infinitely before it has the chance to error due to not having a sharding prop rule. This PR doesn't fix the problem, but adds rules for the culprit ops (missing foreach ops) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066 Approved by: https://github.com/wanchaol	2024-08-05 13:51:59 +00:00
Gabriel Ferns	c3ee07c71c	add missing profiler include in cpp code generation (#132419 ) Summary: When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check config.profiler_mark_wrapper_call. Test Plan: This case is already covered in test_profiler_mark_wrapper_call. ``` (pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('ok', 1)] . ---------------------------------------------------------------------- Ran 1 test in 8.080s OK ``` Fixes https://github.com/pytorch/pytorch/issues/131339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-08-05 13:40:47 +00:00
Andrew Gu	b30d0916d9	[FSDP2] Added missing event wait (for future) (#132568 ) Nothing is actually wrong currently, but we should add this in case we land https://github.com/pytorch/pytorch/pull/127032 in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132568 Approved by: https://github.com/weifengpy, https://github.com/Skylion007	2024-08-05 12:44:46 +00:00
wz337	fb87796d4f	[DeviceMesh] Add supports for non-continuous slicing (#132310 ) Removes constraint of continuous slicing to allow non-continuous slicing and adds a unit test for 3D non-continuous slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310 Approved by: https://github.com/wanchaol	2024-08-05 09:30:07 +00:00
Avik Chaudhuri	27f61eba58	serde sympy functions (#132493 ) Summary: Sympy functions appearing in symbolic expressions inside tensor metadata were not being deserialized properly. Test Plan: updated test Differential Revision: D60573150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132493 Approved by: https://github.com/pianpwk	2024-08-05 08:08:50 +00:00
Feng Shi	55b0c39d82	Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (#132182 ) Summary: Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)"" The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests. The root cause: the logic for generating mask in Triton kernel needed update after a recent refactoring on triton.py. This diff includes the fix of the root cause. See D54134695 or #124969 for more details. Test Plan: Originally failed tests f585704630 f585733786 Diff patched: f586664028 f586663820 Differential Revision: D60458597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182 Approved by: https://github.com/Yuzhen11	2024-08-05 06:57:30 +00:00
haozhe.zhu	ae44b8f410	[inductor] support vectorization for torch.argmax/min(float/int64_t)-> int64_t (#131016 ) Support reduction argmin/max by scalar implementation. TestPlan: ``` python test/inductor/test_cpu_repro.py -k test_argmax_argmin_with_nan_value python test/inductor/test_cpu_repro.py -k test_argmin python test/inductor/test_cpu_repro.py -k test_reduction_cpu_only ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131016 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-05 04:31:53 +00:00
Wu, Chunyuan	1fb498d6e3	Add try except for _maybe_evaluate_static call in IndexPropagation (#132128 ) Fixes the Inductor max-autotune mode failures of the below models: - GPT2ForSequenceClassification - PegasusForConditionalGeneration - XGLMForCausalLM - hf_GPT2 - tnt_s_patch16_224 ```log File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true evaluated = self.shape_env._maybe_evaluate_static( File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper return fn_cache(self, *args, **kwargs) File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static vr = var_ranges[k] torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised: KeyError: m_start ``` The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR adds try except following the way in `torch/_inductor/sizevars.py` by adding a common utility function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-05 01:02:51 +00:00
Jianyu Huang	c7cfa51721	Always use high precision for SDPA math backend (#128922 ) Summary: feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts. Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, I think it is worthwhile to upcast FP16/BF16 inputs to FP32, do FP32/TF32 computations, and then downcast the FP32 output back to FP16/BF16. Differential Revision: D58710805 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922 Approved by: https://github.com/xw285cornell, https://github.com/drisspg	2024-08-04 23:58:14 +00:00
William Wen	01cdcbf7c8	[dynamo] revert map/zip iterator related changes (#132528 ) Need to revert due to internal hangs: S437700 This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64. Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)" This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3. Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)" This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9. Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)" This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528 Approved by: https://github.com/ZainRizvi	2024-08-04 18:46:55 +00:00
Oguz Ulgen	09f9c256ad	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-04 18:43:37 +00:00
Oguz Ulgen	6e79932543	Add basic mypy annotations to dynamo (#132415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu	2024-08-04 18:43:36 +00:00
PyTorch MergeBot	3558a8cf4a	Revert "Add basic mypy annotations to dynamo (#132415 )" This reverts commit 71e22e0959eb8d5a66833bf5c6b5903536a5bef1. Reverted https://github.com/pytorch/pytorch/pull/132415 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))	2024-08-04 18:39:29 +00:00
PyTorch MergeBot	f2ddd5e9e0	Revert "Add basic mypy annotations to inductor (#132416 )" This reverts commit 78927d37f6085a0b30269cceb731d8097302c091. Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))	2024-08-04 18:39:29 +00:00
PyTorch MergeBot	9be33bc584	Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820 )" This reverts commit 6c65fd03942415b68040e102c44cf5109d2d851e. Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/ZainRizvi due to Sorry, had to revert this to revert another PR that depends on this change ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2267629534))	2024-08-04 18:30:59 +00:00
PyTorch MergeBot	0a25666f92	Revert "[dynamo] revert map/zip iterator related changes (#132528 )" This reverts commit e81e74ca6cb45e1ab831ddfe9a2ba5c7e17fa03f. Reverted https://github.com/pytorch/pytorch/pull/132528 on behalf of https://github.com/ZainRizvi due to This stack entered a weird state in the diff train. Reverting and relanding to clean the state ([comment](https://github.com/pytorch/pytorch/pull/132528#issuecomment-2267628475))	2024-08-04 18:26:09 +00:00
Aaron Gokaslan	fd4b649e6c	[BE]: Simplify some list comps to generators C419 (#132578 ) Simplifies some list comprehensions to generators, which are more efficient. The diffs were applied automatically with ruff for the most part. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132578 Approved by: https://github.com/ezyang	2024-08-04 17:46:26 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns ``, `test/`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Xuehai Pan	c35061c542	Migrate Python code formatter from `black` to `ruff format` (#132574 ) See also: - #124845 - #123062 Closes #124845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132574 Approved by: https://github.com/ezyang	2024-08-04 17:13:31 +00:00
Jiashen Cao	09fcd792eb	[Fix]: ScriptObject lifting issue (#130952 ) #### Issue ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to be a constant and converts it directly to a GetAttr fx node. ScriptObject would also trigger `CallMethod`, and this PR adds that support as well. #### Test Plan Add test case for ScriptObject. `pytest test/export/test_converter.py -s -k test_convert_script_object` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952 Approved by: https://github.com/angelayi	2024-08-04 16:52:45 +00:00
PyTorch MergeBot	5dac4d2c78	Revert "[easy] fix f-string messages in torch/_ops.py (#132531 )" This reverts commit 908d2a153b14cbb7a39c1f4ef9a77534cf2c71bf. Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to still breaks tests ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2267584289))	2024-08-04 15:41:56 +00:00
cyy	105ba7b58c	[5/N] Fix clang-tidy warnings in aten/src/ATen (#132565 ) Follows #132001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132565 Approved by: https://github.com/Skylion007	2024-08-04 14:39:16 +00:00
David Berard	908d2a153b	[easy] fix f-string messages in torch/_ops.py (#132531 ) I encountered these when making this change: ``` diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py index 3a2e07fa147..a4d003399e7 100644 --- a/test/functorch/test_ac.py +++ b/test/functorch/test_ac.py @@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase): expected = call() for budget in range(0, 11): - memory_budget = budget / 10 - torch._dynamo.reset() - with config.patch(activation_memory_budget=memory_budget): - if memory_budget is not None: - f_compile = torch.compile( - call, backend="aot_eager_decomp_partition" - ) - - self.assertEqual(expected, f_compile()) + get_mem_and_flops(call, memory_budget=budget / 10) + def test_prioritize_cheaper_matmul(self): def f(xs, ws): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531 Approved by: https://github.com/Skylion007 ghstack dependencies: #132356, #132466	2024-08-04 14:30:42 +00:00
Xu Han	87d46d70d7	[inductor] export kernel for gemm template. (#132580 ) Changes: 1. Move `get_export_declaration` to `cpp_utils.py` as basic function. 2. Export kernel for gemm template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580 Approved by: https://github.com/ezyang	2024-08-04 11:17:19 +00:00
Xuehai Pan	d2dc173664	Remove lint dependency `ufmt` (#132573 ) `ufmt` is a combination of `black + usort`. This PR removes `ufmt` and runs `black` and `usort` separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132573 Approved by: https://github.com/ezyang ghstack dependencies: #129769, #132572	2024-08-04 10:24:09 +00:00
Xuehai Pan	f7aeb394b6	[BE][Easy] Remove empty `ISORT_SKIPLIST` (#132572 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132572 Approved by: https://github.com/ezyang, https://github.com/justinchuby ghstack dependencies: #129769	2024-08-04 10:24:09 +00:00
Xuehai Pan	f3fce597e9	[BE][Easy][17/19] enforce style for empty lines in import segments in `torch/[a-c]/` and `torch/[e-n]/` (#129769 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769 Approved by: https://github.com/ezyang	2024-08-04 10:24:09 +00:00
Dan Zimmerman	2714adce20	[caffe2] Fix compiling ATen-hip in non-opt mode (#132581 ) Summary: It looks like https://github.com/pytorch/pytorch/pull/131894 accidentally broke non-opt hip builds. I.e. `is_flash_attention_available` doesn't get inlined in non-opt mode, so all of `can_use_flash_attention` is compiled into the final object file. This includes a reference to `aotriton::v2::flash::check_gpu` which we haven't setup yet for HIP builds. Test Plan: CI Differential Revision: D60720707 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132581 Approved by: https://github.com/jianyuh, https://github.com/xw285cornell	2024-08-04 07:51:18 +00:00
cyy	522fa03e91	[Submodule] Bump ONNX to v1.16.2 (#132566 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566 Approved by: https://github.com/justinchuby	2024-08-04 07:01:54 +00:00
Wei Feng	2a8e94347f	[TP] verify numeric parity on Transformers for multiple iterations (#132543 ) Before setting up a float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations. This PR sets a baseline of TP numerics; I can verify fp8 on top of it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543 Approved by: https://github.com/tianyu-l ghstack dependencies: #132350	2024-08-04 06:43:27 +00:00
Gabriel Ferns	8ff310392e	add __torch_function__ handler to get_device cpp (#132567 ) From the issue: ``` import torch class CustomParameter(torch.nn.Parameter): @classmethod def __torch_function__(cls, func, types, args=(), kwargs=None): return func.__name__ x = CustomParameter(torch.rand(2)) print(x.square()) # 'square' print(torch.square(x)) # 'square' print(x.get_device()) # 'get_device' print(torch.get_device(x)) # -1 ``` after fix: ``` $ python repro.py square square get_device get_device ``` Fixes: https://github.com/pytorch/pytorch/issues/131944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132567 Approved by: https://github.com/ezyang	2024-08-04 04:26:30 +00:00
Xu Han	7f8a384a8f	[inductor] add msvc_cl compiler check (#132571 ) Add `msvc_cl` compiler check. Local test: [screenshot] Pull Request resolved: https://github.com/pytorch/pytorch/pull/132571 Approved by: https://github.com/ezyang	2024-08-04 03:48:25 +00:00
Feng Yuan	81b8d3586f	Update torch-xpu-ops pin (ATen XPU implementation) (#132390 ) Regular update. 1. 69 new ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml. 2. Align with PyTorch in-tree to use safe data pointer access APIs. 3. Enable FP64 conversion emulation for some platforms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390 Approved by: https://github.com/EikanWang	2024-08-04 02:22:46 +00:00
CaoE	6ec4af6865	[Inductor][CPP] Add vectorization support for double (#131886 ) Before: ``` extern "C" void kernel(const double* in_ptr0, double* out_ptr0) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp1 = decltype(tmp0)(tmp0 * tmp0); out_ptr0[static_cast<long>(x0)] = tmp1; } } } } ``` After: ``` extern "C" void kernel(const double* in_ptr0, double* out_ptr0) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<long>(x0), 16); auto tmp1 = tmp0 * tmp0; tmp1.store(out_ptr0 + static_cast<long>(x0), 16); } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131886 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-08-04 02:13:21 +00:00
PyTorch MergeBot	d984105748	Revert "[export] Convert autocast to HOO (#131914 )" This reverts commit b28c01d90d6575522d2240ce485d7dd87a7242aa. Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/ezyang due to Failing lint, but was covered up by master failure on lint ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2267248773))	2024-08-04 02:10:35 +00:00
Adnan Akhundov	6c65fd0394	[inductor] Add type hints to functions in mkldnn_fusion.py (#131820 ) Summary: ATT Test Plan: lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820 Approved by: https://github.com/eellison	2024-08-03 22:11:47 +00:00
cyy	bc46f205c4	[15/N] Fix clang-tidy warnings in jit (#132564 ) Follows #132477 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132564 Approved by: https://github.com/Skylion007	2024-08-03 19:33:24 +00:00
PyTorch MergeBot	00097f3458	Revert "C++ network flow implementation in c10 (#132188 )" This reverts commit dccce77935bb023f225b9972929fd9213e754e84. Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be failing internal tests. Please see D60702564 to investigate ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2267098420))	2024-08-03 18:44:28 +00:00
Xu Han	e3387c6712	[inductor] use uint64_t replace long to add Windows support. (#132491 ) The size of the `long` type differs between `Windows` and `Linux`. This PR uses `int64_t` instead of `long` on Windows. The `LL` suffix is used to initialize `int64_t` values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132491 Approved by: https://github.com/malfet	2024-08-03 18:38:30 +00:00
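The `long` width difference behind #132491 can be checked from Python. The following is a minimal sketch using the standard `ctypes` module; it is not taken from the PR, and the helper name `integer_widths` is hypothetical. On 64-bit Linux `long` is usually 8 bytes, on Windows it is 4 bytes, while `int64_t` is 8 bytes on both.

```python
import ctypes


def integer_widths():
    """Return the byte widths of C `long` and `int64_t` on this platform."""
    return {
        "long": ctypes.sizeof(ctypes.c_long),      # platform dependent
        "int64_t": ctypes.sizeof(ctypes.c_int64),  # always 8 bytes
    }


widths = integer_widths()
print(widths)
```

Generated code that must hold 64-bit indices therefore cannot rely on `long` if it is meant to compile on both platforms.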
Yanbo Liang	bbce517221	[Inductor][FlexAttention] TestFlexAttention -> TestFlexDecoding (#132547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132547 Approved by: https://github.com/Chillee ghstack dependencies: #132015	2024-08-03 17:26:44 +00:00
PyTorch MergeBot	21d02f8b4b	Revert "[easy] fix f-string messages in torch/_ops.py (#132531 )" This reverts commit 25903f3932b3a24d4edf323484d2159f3ac92999. Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to broke lint and tests due to conflict with 132377 ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2266743391))	2024-08-03 14:49:07 +00:00
Pian Pawakapan	a896fb1b36	check unsupported sympy functions for runtime asserts (#132457 ) Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts - previously we only did the check for CSE, and not adding new expressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457 Approved by: https://github.com/avikchaudhuri	2024-08-03 10:17:25 +00:00
Xuehai Pan	0e7e61f7ce	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-08-03 09:43:38 +00:00
Jiashen Cao	159d508f03	[Fix]: prim::If with multiple outputs and input return directly (#131779 ) #### Issue Test is not working for prim::Loop with multiple outputs. Additionally fix issue where input is directly returned, which is not supported by HigherOrderOp. #### Test Plan `pytest test/export/test_converter.py -s -k test_convert_if_multiple_out` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131779 Approved by: https://github.com/angelayi, https://github.com/SherlockNoMad	2024-08-03 08:07:21 +00:00
Xu Han	36ec0fdf10	[inductor] check compiler exist on Windows. (#132533 ) In the current Windows environment, if the MSVC environment is not activated, no clear compiler error is raised: [screenshot] With this PR, we help users see that the issue comes from the compiler: [screenshot] Pull Request resolved: https://github.com/pytorch/pytorch/pull/132533 Approved by: https://github.com/jansel	2024-08-03 07:47:11 +00:00
Adnan Akhundov	8ad9f89ccc	[inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562 ) Summary: This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally. We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it. Test Plan: ``` python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args ... ---------------------------------------------------------------------- Ran 3 tests in 3.636s OK ``` Differential Revision: D60701839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562 Approved by: https://github.com/chenyang78	2024-08-03 06:31:28 +00:00
Animesh Jain	06581c277a	[dynamo][stable-diffusion] Support dict(obj) on constrained subclasses of dict and OrderedDict (#132558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132558 Approved by: https://github.com/jansel	2024-08-03 06:31:00 +00:00
Shangdi Yu	b28c01d90d	[export] Convert autocast to HOO (#131914 ) Summary: Suggested in https://github.com/pytorch/pytorch/issues/128394. If there's an autocast context manager, the predispatch (strict) graph can look something like: ``` class <lambda>(torch.nn.Module): def forward(self, x: "f32[1]"): ... _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None) mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None return (mm_1,) ``` But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and making a submodule for the blocks between `_enter_autocast` and `_exit_autocast`. Some potential follow-up improvements: 1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py` 2) Check the current autocast status (any enabled? dtype?) and do not create a submodule if the autocast args match the current autocast status. Test Plan: CI ``` parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export run_tests("test_predispatch_autocast") ``` Reviewed By: angelayi Differential Revision: D60206382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914 Approved by: https://github.com/angelayi	2024-08-03 05:48:57 +00:00
Avik Chaudhuri	ed4493de0e	dim name is identifier (#132557 ) Summary: Dim names appear in suggested fixes so should be valid Python identifiers. Test Plan: none Differential Revision: D60696854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132557 Approved by: https://github.com/pianpwk	2024-08-03 05:28:50 +00:00
Edward Z. Yang	1f5dfe00da	Subtracer should always be real to inherit fake/real tensors from parent config (#132488 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132488 Approved by: https://github.com/zou3519	2024-08-03 04:55:42 +00:00
Justin Chu	6966d44eda	[ONNX] Rename _internal/exporter to _exporter_legacy (#132429 ) The next PR will be creating an `exporter` directory to house logic from `torch-onnx` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429 Approved by: https://github.com/titaiwangms	2024-08-03 04:23:05 +00:00
David Berard	5973aec671	[fx] python_code(verbose=True): show size/strides for all tensors (#132192 ) python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors. Testing: I ran the test script below with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below; the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures. test script: ```python import torch def fn(x): return x.cos() nt = torch.nested.nested_tensor_from_jagged( torch.randn(10, 10), torch.tensor([0, 1, 3, 6, 10]), ) torch.compile(fn)(nt) ``` logs excerpt: ``` [0/0] [__graph_code] TRACED GRAPH [0/0] [__graph_code] ===== __compiled_fn_1 ===== [0/0] [__graph_code] /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M [0/0] [__graph_code] def forward(self, L_x_: "f32[4, zf1, 10][10zf1, 10, 1]cpu", zf1: "Sym(zf1)"): [0/0] [__graph_code] l_x_ = L_x_ [0/0] [__graph_code] [0/0] [__graph_code] # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c [0/0] [__graph_code] cos: "f32[4, zf1, 10][10zf1, 10, 1]cpu" = l_x_.cos(); l_x_ = None [0/0] [__graph_code] return (cos,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192 Approved by: https://github.com/Chillee	2024-08-03 02:54:32 +00:00
Ivan Zaitsev	0b571b1058	[codemod][pyre] Add missing Pyre mode headers (#132548 ) Reviewed By: connernilsen Differential Revision: D59849027 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132548 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi	2024-08-03 02:32:53 +00:00
Yanbo Liang	373e9be457	[Inductor][FlexAttention] Add kwarg to top level for users to specify kernel params (#132015 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132015 Approved by: https://github.com/Chillee	2024-08-03 02:27:02 +00:00
David Berard	25903f3932	[easy] fix f-string messages in torch/_ops.py (#132531 ) I encountered these when making this change: ``` diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py index 3a2e07fa147..a4d003399e7 100644 --- a/test/functorch/test_ac.py +++ b/test/functorch/test_ac.py @@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase): expected = call() for budget in range(0, 11): - memory_budget = budget / 10 - torch._dynamo.reset() - with config.patch(activation_memory_budget=memory_budget): - if memory_budget is not None: - f_compile = torch.compile( - call, backend="aot_eager_decomp_partition" - ) - - self.assertEqual(expected, f_compile()) + get_mem_and_flops(call, memory_budget=budget / 10) + def test_prioritize_cheaper_matmul(self): def f(xs, ws): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531 Approved by: https://github.com/Skylion007 ghstack dependencies: #132356, #132466	2024-08-03 02:23:44 +00:00
Animesh Jain	419b76c4ac	[dynamo] Reland 132308, 132314, 132318, 132334 - Make builtin nn modules attributes static (#132539 ) Relanding 4 PRs ending at https://github.com/pytorch/pytorch/pull/132334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132539 Approved by: https://github.com/Skylion007, https://github.com/yanboliang, https://github.com/mlazos	2024-08-03 02:08:22 +00:00
Ivan Zaitsev	841cadd555	Fix discrepancies from 129973 (#132545 ) #129973 ([D59132793](https://www.internalfb.com/diff/D59132793)) was exported missing changes in `test/cpp/jit/CMakeLists.txt` this PR remediates that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132545 Approved by: https://github.com/kit1980	2024-08-03 01:57:49 +00:00
Eli Uriegas	243a763e1b	ci: Remove split-build CUDA testing from pull.yml (#132537 ) This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml. I've been observing a large spike in our usage of `g3.4xlarge` which seems to correspond to these builds in particular so removing these from `pull.yml` since they are already covered in `trunk.yml`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537 Approved by: https://github.com/ZainRizvi, https://github.com/malfet	2024-08-03 01:24:17 +00:00
Shangdi Yu	a503136583	[export] Detect whether case_name is registered in exportdb (#132420 ) Summary: - moves logging functionalities into `torch/_export/db/logging.py` file. - add a check in `_dynamo/eval_frame.py` to check for optional input and error out with `UnsupportedError` - change the case name of `torch_sym_int` to `unsupported_operator` - Check if the case name is registered in exportdb, if so, we give a link to the case in exportdb. - TODO: add test Test Plan: CI Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging: ``` E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633, 1.2414, -0.1071], E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] [-0.1936, -0.9425, -0.0824]]) E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case. https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input ...... File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching raise Unsupported( torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet ``` It also logs a `export.error.classified` event in Scuba. Reviewed By: zhxchen17 Differential Revision: D60427208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420 Approved by: https://github.com/zhxchen17	2024-08-03 01:08:48 +00:00
Joel Schlosser	64720f3b89	Introduce checks to validate public API tests (#131390 ) This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`. * Validates two public API tests: 1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset). 2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported. * Runs this in CI just before the pre-existing FC / BC checks. I've verified that re-introducing the bug that #131386 fixed causes the new check to fail: ![public_api_failure](https://github.com/user-attachments/assets/376ddef3-d14a-41f6-93e2-f935deb6555a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390 Approved by: https://github.com/albanD	2024-08-03 00:29:00 +00:00
cyy	fcef6cc6d1	[13/N] Fix clang-tidy warnings in jit (#132477 ) Follows #132209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132477 Approved by: https://github.com/Skylion007	2024-08-03 00:13:18 +00:00
Shivam Raikundalia	705ac311aa	Fix Distributed EventList usage (#132448 ) Summary: Summarized here: https://github.com/pytorch/pytorch/issues/132227 Test Plan: Use suggestion in issue, should see test passing again Differential Revision: D60614690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132448 Approved by: https://github.com/aaronenyeshi	2024-08-02 23:55:31 +00:00
Sherlock Huang	e3513fb2af	[ts_converter]handle python list append, list add, aten.to.dtype+mutation_op pattern (#132529 ) Summary: #### Description Add support for aten::append with a python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping. aten::append contributed by Jiashen Cao <jiashenc@meta.com> Fix conversion for csr_ranker_test ``` model_name: csr_ranker_test_4.ptl has_ts_model: True has_sample_inputs: True ops_maybe_missing_meta: set() script_objects: set() ts_can_run: True ts_run_exception: None can_convert: True convert_exception: None ep_result_correct: True ep_run_exception: None can_package: True package_exception: None sigmoid_can_run: False sigmoid_run_exception: RuntimeError('not for symbolics') sigmoid_result_correct: None ``` Test Plan: test_aten_add_t test_aten_append_t test_aten_to_dtype_with_mutating_storage buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test Differential Revision: D60635893 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529 Approved by: https://github.com/jiashenC	2024-08-02 23:32:37 +00:00
David Berard	85f19ce14a	Support meta["val"] that is a dict, for triton kernels and for the partitioner (#132466 ) Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]` to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner. Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466 Approved by: https://github.com/zou3519 ghstack dependencies: #132356	2024-08-02 23:24:29 +00:00
Shivam Raikundalia	bcac71517c	[Profiler] Test Logging for Empty Traces (#132444 ) Summary: Tests D60311331. Please see that diff for explanation Test Plan: This diff is adding a test itself Reviewed By: aaronenyeshi Differential Revision: D60311555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444 Approved by: https://github.com/aaronenyeshi	2024-08-02 22:04:15 +00:00
David Berard	1962f9475f	[NJT][flop counter] attention: if offsets are fake, use max seqlen (#132356 ) The flop counter is used by the partitioner, in which case the tensors passed in can be fake. The flop computations for nested attention use the offsets to determine the actual amount of compute that will be done. But when the offsets are fake, we end up with unbacked symints (from `(offsets[1:] - offsets[:-1]).to_list()`). If we find that the offsets are fake or functional tensors, then use the max sequence length instead. Repro: https://gist.github.com/davidberard98/903fb3e586edb6d1d466786e1a610eba Differential Revision: [D60597463](https://our.internmc.facebook.com/intern/diff/D60597463) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132356 Approved by: https://github.com/soulitzer	2024-08-02 20:42:29 +00:00
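The fallback described in #132356 can be illustrated without tensors. This is a simplified sketch, not the flop counter's actual code: the function name `seqlens_for_flops` is hypothetical, and `offsets=None` stands in for "the offsets are fake or functional tensors, so their values cannot be read".

```python
def seqlens_for_flops(offsets, num_seqs, max_seqlen):
    """Per-sequence lengths to feed into an attention flop estimate.

    offsets    : list of cumulative offsets, or None when values are unavailable
    num_seqs   : number of sequences in the jagged batch
    max_seqlen : known upper bound on any sequence length
    """
    if offsets is None:
        # Values cannot be inspected: assume every sequence has the maximum
        # length, which over-estimates rather than under-estimates compute.
        return [max_seqlen] * num_seqs
    # Real offsets: length i is offsets[i + 1] - offsets[i].
    return [offsets[i + 1] - offsets[i] for i in range(num_seqs)]


exact = seqlens_for_flops([0, 1, 3, 6, 10], num_seqs=4, max_seqlen=4)
bound = seqlens_for_flops(None, num_seqs=4, max_seqlen=4)
```

The fallback trades accuracy for avoiding data-dependent values, which is the point of the change when the partitioner runs on fake tensors.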
Will Constable	37c3d503b7	[pipelining] Make test_schedule quiet (#132369 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132369 Approved by: https://github.com/H-Huang ghstack dependencies: #129810, #130378	2024-08-02 20:38:17 +00:00
Will Constable	7c1cca9fda	[pipelining] Add schedule send/recv pass (#130378 ) Inserts send/recv ops where needed in a compute-only pipeline schedule. Any F or B action will require a recv op for its input and a send op for its output, except for at the ends of the pipeline. To avoid hangs caused by mixed-up orderings of sends/recvs across ranks, we pick one compute action at a time and insert both its send op (on that rank's schedule), and the matching recv op for the recipient stage (on the schedule for the rank for that stage). TODO Currently ignores a couple of edge cases - ignores batching (which is an optimization) - ignores cases where a stage sends to another stage on the same rank, and should skip the send/recv and directly access memory Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378 Approved by: https://github.com/H-Huang ghstack dependencies: #129810	2024-08-02 20:38:17 +00:00
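The send/recv insertion described in #130378 can be sketched in plain Python. This is an illustrative simplification, not the PyTorch implementation: `Action`, `add_send_recv`, and the action-kind strings are hypothetical names, and batching and same-rank shortcuts are ignored, as in the commit's TODO list.

```python
from collections import namedtuple

# kind is one of "F", "B", "SEND_F", "RECV_F", "SEND_B", "RECV_B"
Action = namedtuple("Action", "stage kind mb")


def add_send_recv(compute, stage_to_rank, num_stages):
    """Insert SEND/RECV actions into per-rank compute-only schedules."""
    out = {rank: [] for rank in compute}
    pending = {rank: list(actions) for rank, actions in compute.items()}

    def ready(rank, a):
        # A compute action can run once its input has been received.
        if a.kind == "F":
            return a.stage == 0 or Action(a.stage, "RECV_F", a.mb) in out[rank]
        return (a.stage == num_stages - 1
                or Action(a.stage, "RECV_B", a.mb) in out[rank])

    while any(pending.values()):
        progress = False
        for rank in sorted(pending):
            if not pending[rank] or not ready(rank, pending[rank][0]):
                continue
            a = pending[rank].pop(0)
            out[rank].append(a)
            progress = True
            peer = a.stage + (1 if a.kind == "F" else -1)
            if 0 <= peer < num_stages:
                # Emit the send and its matching recv together so both ranks
                # agree on the relative order of communication.
                out[rank].append(Action(a.stage, "SEND_" + a.kind, a.mb))
                out[stage_to_rank[peer]].append(
                    Action(peer, "RECV_" + a.kind, a.mb))
        if not progress:
            raise RuntimeError("schedule cannot make progress")
    return out


compute = {
    0: [Action(0, "F", 0), Action(0, "F", 1), Action(0, "B", 0), Action(0, "B", 1)],
    1: [Action(1, "F", 0), Action(1, "F", 1), Action(1, "B", 0), Action(1, "B", 1)],
}
full = add_send_recv(compute, stage_to_rank={0: 0, 1: 1}, num_stages=2)
```

Pairing each send with its recv at insertion time is what keeps the two ranks from waiting on each other in mismatched order.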
Will Constable	625f494619	[Pipelining] Add schedule unshard/reshard pass (#129810 ) Adds fsdp unshard/reshard ops to a compute-only schedule. Operates on one pp-rank's schedule at a time, since there is no cross-pp-rank coordination needed for FSDP. (Unshard/Reshard is across DP ranks within a PP group). Uses a heuristic based on examining the next N stages to run compute operations on this rank, evicting (resharding) and fetching (unsharding) ahead of time to give unshard operations a chance to overlap with compute and PP comms. - this heuristic has not been validated and may not be optimal Makes the assumption that it's fine to add the UNSHARD/RESHARD actions to the schedule regardless of if FSDP will actually be used. - this way, users do not have to tell us at PP schedule creation time if they plan to use FSDP or DDP - it is trivial to implement UNSHARD/RESHARD as no-ops inside the runtime, if FSDP is not detected on the stage module TODO - also add FSDP's reduce-scatter? or is it sufficient to leave this handled by PipelineStage at 'last backward' time - validate 'next N stages' heuristic and expose an API if needed - add an e2e test Co-authored-by: Howard Huang <howardhuang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810 Approved by: https://github.com/kwen2501, https://github.com/H-Huang	2024-08-02 20:38:17 +00:00
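The look-ahead heuristic from #129810 can be sketched for a single pp-rank. This is a guess at the general shape under stated assumptions, not the PR's code: `add_unshard_reshard`, the tuple-based actions, and the `lookahead` parameter (standing in for "next N stages") are all hypothetical.

```python
def add_unshard_reshard(compute_stages, lookahead=2):
    """Wrap one rank's compute order with UNSHARD/RESHARD actions.

    compute_stages : stage index of each compute action, in execution order
    lookahead      : how many upcoming compute actions to prefetch for
    """
    out = []
    resident = []  # stages whose parameters are currently unsharded
    for i, stage in enumerate(compute_stages):
        # Stages needed by the current and the next few compute actions.
        wanted = []
        for s in compute_stages[i:i + lookahead]:
            if s not in wanted:
                wanted.append(s)
        # Evict stages that are not needed soon.
        for s in [s for s in resident if s not in wanted]:
            out.append(("RESHARD", s))
            resident.remove(s)
        # Fetch ahead of time so the all-gather can overlap with compute.
        for s in wanted:
            if s not in resident:
                out.append(("UNSHARD", s))
                resident.append(s)
        out.append(("COMPUTE", stage))
    # Free whatever is still unsharded at the end of the schedule.
    for s in resident:
        out.append(("RESHARD", s))
    return out


schedule = add_unshard_reshard([0, 1, 0, 1], lookahead=2)
```

If FSDP is not in use, a runtime can treat the `UNSHARD` and `RESHARD` entries as no-ops, which is why the commit adds them unconditionally.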
William Wen	f379bbd46d	[dynamo] support inspect.signature.bind (#132330 ) Fixes https://github.com/pytorch/pytorch/issues/93760. This was not that small of a task... Pull Request resolved: https://github.com/pytorch/pytorch/pull/132330 Approved by: https://github.com/jansel ghstack dependencies: #132329	2024-08-02 20:37:05 +00:00
Zhengxu Chen	642257db1a	Update the FQN for auto_functionalized HOO. (#132171 ) Summary: as title. torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN which should NOT be used to talk to the backends and we should use the standard FQN name torch.ops.higher_order.auto_functionalized instead. Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch Differential Revision: D60468759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171 Approved by: https://github.com/SherlockNoMad	2024-08-02 20:34:50 +00:00
David Berard	dccce77935	C++ network flow implementation in c10 (#132188 ) The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency. So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188 Approved by: https://github.com/Chillee	2024-08-02 20:30:59 +00:00
Mikayla Gawarecki	f49d5e30eb	Change owners of test/test_transformers.py to module: multi-headed-attention (#132519 ) So flaky tests get tagged with `module: multi-headed-attention` instead of `module: nn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132519 Approved by: https://github.com/Skylion007	2024-08-02 20:12:33 +00:00
William Wen	e81e74ca6c	[dynamo] revert map/zip iterator related changes (#132528 ) Need to revert due to internal hangs: S437700 This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64. Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)" This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3. Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)" This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9. Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)" This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528 Approved by: https://github.com/ZainRizvi	2024-08-02 19:40:57 +00:00
Sam Larsen	b71cd149ce	Fix file lock issue in AotCodeCompiler (#132343 ) Summary: It looks like there are several places in AotCodeCompiler that write files in a way that isn't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this: - The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z` - The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock` - The cpp path (which also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp` - The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin` So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each other's consts_path. To fix, include the key in the consts_path. Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it. Differential Revision: D60552021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343 Approved by: https://github.com/desertfire	2024-08-02 19:01:37 +00:00
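The idea behind the fix in #132343 is that every file guarded by a lock should be named from the same key as the lock itself. A minimal sketch under that assumption follows; `artifact_paths`, the hashing scheme, and the file-name layout are hypothetical and do not reproduce Inductor's actual cache layout.

```python
import hashlib
import os


def artifact_paths(cache_dir, cpp_source):
    """Derive the lock, cpp, and consts paths from one shared key."""
    # Hypothetical key: a hash of the generated source text.
    key = "c" + hashlib.sha256(cpp_source.encode("utf-8")).hexdigest()[:32]
    return {
        "lock": os.path.join(cache_dir, "locks", key + ".lock"),
        "cpp": os.path.join(cache_dir, key + ".cpp"),
        # The consts file carries the key too, so two writers holding
        # different locks can never target the same consts file.
        "consts": os.path.join(cache_dir, key + "_consts.bin"),
    }


first = artifact_paths("/tmp/example_cache", "int a;")
second = artifact_paths("/tmp/example_cache", "int b;")
```

With this naming, any two writers that share a consts path also share a lock path, so the lock actually serializes them.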
PyTorch MergeBot	bcb4f7c172	Revert "Grouped Query Attention (#128898 )" This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))	2024-08-02 18:58:46 +00:00
Menglu Yu	afca6f5b47	[PT2][Optimus] Add missing example value for introduced nodes (#132297 ) Summary: We observed that many nodes introduced during split cat and batch fusion pattern optimization did not have example value metadata, which causes problems in our follow-up pattern optimizations, so we add all missing values. We also fix bugs in some meta updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization. We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model and can be replaced by the more advanced pattern we added; removing it eases maintenance. Test Plan: # unit test ``` buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165 Network: Up: 230KiB Down: 702KiB (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a) Jobs completed: 30. Time elapsed: 7:23.9s. Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0 ``` buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner ``` Network: Up: 1.3GiB Down: 84MiB (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01) Jobs completed: 61. Time elapsed: 21:56.5s. Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39) Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 ``` Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1}) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297 Approved by: https://github.com/jackiexu1992	2024-08-02 18:57:12 +00:00
PyTorch MergeBot	24d0a32f98	Revert "[dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308 )" This reverts commit aa0ed2496f5bf38768c9eda13112fd43359548bb. Reverted https://github.com/pytorch/pytorch/pull/132308 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132308#issuecomment-2265959993))	2024-08-02 18:55:51 +00:00
PyTorch MergeBot	e696f17467	Revert "[dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314 )" This reverts commit d6a82ce39bd8e705a4cc2cebb886f4476a7250cf. Reverted https://github.com/pytorch/pytorch/pull/132314 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132314#issuecomment-2265953367))	2024-08-02 18:52:38 +00:00
PyTorch MergeBot	e4e3575fb0	Revert "[11/N] Use std::nullopt and std::optional (#132396 )" This reverts commit d7d61904936617a6a43782868d0b1004cb70dfc0. Reverted https://github.com/pytorch/pytorch/pull/132396 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/132396#issuecomment-2265952528))	2024-08-02 18:49:42 +00:00
PyTorch MergeBot	59b73079a0	Revert "Always use high precision for SDPA math backend (#128922 )" This reverts commit fbf3bc0a602b4ec1eab169202d5b1158fe2c1def. Reverted https://github.com/pytorch/pytorch/pull/128922 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128922#issuecomment-2265949958))	2024-08-02 18:46:50 +00:00
PyTorch MergeBot	193a19ee91	Revert "[dynamo] Treat attr of unspecialized buiitin nn modules as static (#132318 )" This reverts commit 7b816d7d6d5d521f913c78f897790f66112c7d84. Reverted https://github.com/pytorch/pytorch/pull/132318 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132318#issuecomment-2265945433))	2024-08-02 18:43:32 +00:00
PyTorch MergeBot	b8f7019df0	Revert "[dynamo] Track params/buffers and mark them as static (#132334 )" This reverts commit babb249a89b51931afe16db8b498ff72cd433afc. Reverted https://github.com/pytorch/pytorch/pull/132334 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132334#issuecomment-2265942261))	2024-08-02 18:41:19 +00:00
Bin Bao	e0514a5b99	[AOTI][refactor] Consolidate how python_kernel_name is set (#132320 ) Summary: Similar to the refactoring of set_cpp_kernel, consolidate the ways of setting python_kernel_name Pull Request resolved: https://github.com/pytorch/pytorch/pull/132320 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #132319	2024-08-02 18:34:25 +00:00
Bin Bao	a9e1133faa	[AOTI][refactor] Move set_cpp_kernel to base class (#132319 ) Summary: Consolidate how cpp_kernel_name is set and make it a method in the base ExternKernel class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132319 Approved by: https://github.com/angelayi, https://github.com/chenyang78	2024-08-02 18:34:24 +00:00
Aleksei Nikiforov	df781343e2	Link libc10 to pthreads (#132484 ) It gets linked as a transitive dependency of `libmkl` on x86_64, but it must be specified explicitly on s390x. The linking issue only appears when using gcc-13 with the gold linker. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132484 Approved by: https://github.com/malfet	2024-08-02 18:03:44 +00:00
Yidi Wu	19897a1647	[export] change deepcopy to copy in _replace_set_grad_with_hop pass.. (#132181 ) Summary: Fixes T197371132. Previously, we called copy.deepcopy to avoid mutating the original signature. However, this causes errors when the signature references a FakeScriptObject, which then references a real torch.ScriptObject, due to "The tensor has a non-zero number of elements, but its data is not allocated yet." We therefore just change it to a shallow copy. This should be good enough for guarding the signature. Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export" Differential Revision: D60476839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181 Approved by: https://github.com/BoyuanFeng	2024-08-02 17:57:09 +00:00
cyy	87d58cc81f	[4/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132001 ) Follows #132000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132001 Approved by: https://github.com/Skylion007	2024-08-02 17:42:02 +00:00
cyy	207e24ff83	Enable clang-tidy on aten/src/ATen/cudnn/* (#130133 ) Continued work of applying clang-tidy Pull Request resolved: https://github.com/pytorch/pytorch/pull/130133 Approved by: https://github.com/eqy, https://github.com/Skylion007	2024-08-02 17:39:37 +00:00
Justin Chu	0c491702c4	[ONNX] Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag (#132299 ) Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow for enabling the new torch.onnx logic and hiding them during migration and testing. The actual logic migration will happen after. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299 Approved by: https://github.com/titaiwangms	2024-08-02 17:06:11 +00:00
David Berard	9167113c16	[easy][MPS] add torch.mps.is_available() (#132426 ) Just return "torch.mps.device_count() > 0", which, based on the implementation of device_count(), seems to be equivalent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132426 Approved by: https://github.com/malfet	2024-08-02 17:05:49 +00:00
Edward Z. Yang	fc32732596	Don't attempt to compute hints for unbacked expressions (#132060 ) This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway... Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060 Approved by: https://github.com/Skylion007	2024-08-02 16:39:14 +00:00
PyTorch MergeBot	8fff976355	Revert "Refactor thunkify to return proper thunk abstraction (#132407 )" This reverts commit d903e664c6b70ad17e0b316ef39d71be5edddc87. Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))	2024-08-02 16:32:43 +00:00
PyTorch MergeBot	1197550876	Revert "Don't attempt to compute hints for unbacked expressions (#132060 )" This reverts commit d342dc0179944dd317b509b3432da81701836444. Reverted https://github.com/pytorch/pytorch/pull/132060 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))	2024-08-02 16:32:43 +00:00
Edward Z. Yang	296c339f98	Ensure compiler collective is called even when no graph is compiled (#132163 ) It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163 Approved by: https://github.com/jansel	2024-08-02 16:31:54 +00:00
soulitzer	82b6480b0a	Update SavedTensorHooks TLS stack to use SafePyObject (#131700 ) Previously, we must manually manage refcounting when updating the TLS saved variable stack. With this PR, things should be handled automatically by the SafePyObject. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131700 Approved by: https://github.com/albanD	2024-08-02 16:27:16 +00:00
PyTorch MergeBot	9eeb5eebab	Revert "Ensure compiler collective is called even when no graph is compiled (#132163 )" This reverts commit 0d9c9716b2db52281f6f10a113e07936deeb6e0a. Reverted https://github.com/pytorch/pytorch/pull/132163 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132163#issuecomment-2265729449))	2024-08-02 16:16:31 +00:00
Andrii Grynenko	fca2dba7ca	[pytorch][counters] Pybind for WaitCounter (#132357 ) Summary: Basic pybind integration for WaitCounter providing a guard API. Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API). Test Plan: unit test Differential Revision: D60557660 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132357 Approved by: https://github.com/jamesperng, https://github.com/asiab4	2024-08-02 16:08:10 +00:00
PyTorch MergeBot	d224857b3a	Revert "Change signature of CompilerFn for register_backend decorator (#131880 )" This reverts commit ccf9ce8e8c3c86269003547d976da5ed1fc9511b. Reverted https://github.com/pytorch/pytorch/pull/131880 on behalf of https://github.com/albanD due to Breaking lint ([comment](https://github.com/pytorch/pytorch/pull/131880#issuecomment-2265682757))	2024-08-02 15:49:09 +00:00
Edward Z. Yang	63eb06c051	Disable SymDispatchMode when torch.compile'ing (#132433 ) Partially addresses https://github.com/pytorch/pytorch/issues/132417 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433 Approved by: https://github.com/ydwu4	2024-08-02 15:23:49 +00:00
cyy	5aafdc2f87	[3/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132000 ) Follows #131834 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132000 Approved by: https://github.com/ezyang	2024-08-02 15:00:38 +00:00
Yan Zhiwei	78f4a3919f	Remove duplicate XPU switch case in DispatchStub (#132480 ) This PR fixes the issue mentioned in https://github.com/pytorch/pytorch/issues/132481. Duplicated XPU switch cases exist in `DispatchStub.cpp`, and this PR removes them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132480 Approved by: https://github.com/nautsimon, https://github.com/malfet	2024-08-02 14:39:00 +00:00
redradist	ccf9ce8e8c	Change signature of CompilerFn for register_backend decorator (#131880 ) ## Description Add `...` to show that CompilerFn for custom backend could take additional options Re: Recreated closed PR https://github.com/pytorch/pytorch/pull/110006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131880 Approved by: https://github.com/jansel	2024-08-02 14:30:58 +00:00
Nick Westlake	053e5080f6	Enable exception chaining in call_user_compiler (#131186 ) Enable exception chaining of BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which is often the most useful for debugging, from being discarded. Example output without the patch > Traceback (most recent call last): > [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(] > [Trace back from call_user_compiler to _inplace_generalized_scatter raise RuntimeError] > torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: > RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6]) > Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information Example output with the patch > Traceback (most recent call last): > [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)] > RuntimeError: expand: attempting to expand a dimension of length 2! > The above exception was the direct cause of the following exception: > Traceback (most recent call last): > [Traceback from call_user_compiler to _inplace_generalized_scatter raise RuntimeError] > RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6]) > The above exception was the direct cause of the following exception: > Traceback (most recent call last): > [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e] > RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6]) > Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186 Approved by: https://github.com/jansel	2024-08-02 14:07:06 +00:00
Alnis Murtovi	48929184e9	AutoHeuristic: mixed_mm heuristic for A100 (#131613 ) This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402). This is what the results look like: Explanation of columns: wrong_max_spdup: In the worst case, how much better would the best choice have been wrong_gman_spdup: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean) max_spdup_default: Highest speedup achieved by the learned heuristic over the default choice gman_spdup_default: Geomean speedup achieved by the learned heuristic over the default choice max_slowdown_default: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case non_default_preds: Number of times the learned heuristic predicted a choice that is not the default choice default_better: Number of times the default choice is better than the choice made by the heuristic ``` set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2 test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2 ``` While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice. I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul. \|batch size\|prompt length\| fallback \| heuristic \| speedup \| \|----------\|-------------\|------------:\|------------:\|--------:\| \| 1 \| 7 \| 75.31 tok/s \| 148.83 tok/s\| 1.97 \| \| 1 \| 11 \| 75.99 tok/s \| 148.15 tok/s\| 1.94 \| \| 4 \| 7 \| 103.48 tok/s \| 472.00 tok/s\| 4.56 \| \| 4 \| 11 \| 103.56 tok/s \| 371.36 tok/s\| 3.58 \| \| 8 \| 7 \| 201.92 tok/s \| 813.44 tok/s\| 4.02 \| \| 8 \| 11 \| 201.76 tok/s \| 699.36 tok/s\| 3.46 \| Currently, the heuristic only applies to the following inputs: - m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback) - k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.) - mat1 not transposed - mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613 Approved by: https://github.com/eellison	2024-08-02 13:54:37 +00:00
cyy	b9cb1abf65	[12/N] Use std::optional (#132361 ) Follows #132396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361 Approved by: https://github.com/eqy	2024-08-02 13:46:46 +00:00
Animesh Jain	56f2917bef	[dynamo] Bugfix for recently added str handler (#132461 ) There is probably more work to do to improve support, but this is a hot fix to not fail on `.__func__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132461 Approved by: https://github.com/williamwen42 ghstack dependencies: #132425	2024-08-02 13:16:39 +00:00
Edward Z. Yang	0d9c9716b2	Ensure compiler collective is called even when no graph is compiled (#132163 ) It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163 Approved by: https://github.com/jansel	2024-08-02 12:18:34 +00:00
Edward Z. Yang	d342dc0179	Don't attempt to compute hints for unbacked expressions (#132060 ) This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway... Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060 Approved by: https://github.com/Skylion007 ghstack dependencies: #131649, #132407	2024-08-02 12:09:37 +00:00
Edward Z. Yang	d903e664c6	Refactor thunkify to return proper thunk abstraction (#132407 ) This is superior to lru_cache because (1) it's more explicit and (2) it doesn't leak the original function after it's been forced. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407 Approved by: https://github.com/albanD ghstack dependencies: #131649	2024-08-02 12:09:37 +00:00
Edward Z. Yang	290f09f829	Ban decorator usage of dynamo_timed (#132328 ) This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328 Approved by: https://github.com/albanD	2024-08-02 12:00:46 +00:00
Xu Han	8668bc279d	[inductor] contine to fix restrict keyword. (#132463 ) This continues the work of PR https://github.com/pytorch/pytorch/pull/132394; all uses of the `restrict` keyword in `cpp_micro_gemm.py` are now fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132463 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-02 11:09:17 +00:00
Michael Lazos	d2e9a8bf6d	[Reland] Fix inlining module-scoped store global (#132439 ) Reland https://github.com/pytorch/pytorch/pull/132224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132439 Approved by: https://github.com/anijain2305	2024-08-02 09:13:52 +00:00
Pearu Peterson	a4ea776881	Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645 ) As in the title: To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported - `sparse_xyz_tensor(indices, values, pin_memory=True)` - `sparse_xyz_tensor(indices, values).pin_memory()` - `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())` Fixes https://github.com/pytorch/pytorch/issues/115330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645 Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy	2024-08-02 08:55:55 +00:00
Animesh Jain	babb249a89	[dynamo] Track params/buffers and mark them as static (#132334 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132334 Approved by: https://github.com/ezyang, https://github.com/mlazos	2024-08-02 08:55:43 +00:00
xinyu-intel	2ee9895304	Support optimizer capturable on hpu and xpu (#132119 ) as title Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-08-02 08:19:52 +00:00
zengxian	f936e68506	[CI] Update CPU inductor smoke test model list and target (#132221 ) Fixes #132097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132221 Approved by: https://github.com/desertfire	2024-08-02 07:09:54 +00:00
eqy	e5560d10f4	[CUDA][SDPA] Fix expect export on sm90+ (#132194 ) CC @drisspg not sure what is causing the scale=0.125 to be omitted here... Pull Request resolved: https://github.com/pytorch/pytorch/pull/132194 Approved by: https://github.com/drisspg	2024-08-02 05:43:58 +00:00
David Berard	7d8b95e8fb	[easy] more debug in partitioner assert (#132456 ) Print the name of the node that didn't have good meta['val']. An internal model is failing with this assert, we need this info to debug further. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132456 Approved by: https://github.com/Chillee	2024-08-02 05:07:01 +00:00
cyy	35d14d22a0	Fix some issues detected by static analysis tools (#131989 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131989 Approved by: https://github.com/ezyang	2024-08-02 04:18:57 +00:00
Yanbo Liang	5ea0f51187	[Dynamo] Support abc.MutableMapping.get (#132363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132363 Approved by: https://github.com/anijain2305, https://github.com/mlazos	2024-08-02 04:17:35 +00:00
drisspg	2b86a7fcc7	fix printing of scores and mods names (#132424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132424 Approved by: https://github.com/Skylion007	2024-08-02 03:30:23 +00:00
cyy	07fe1dd58f	[13/N] Fix clang-tidy warnings in jit (#132411 ) Follows #132209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411 Approved by: https://github.com/Skylion007	2024-08-02 03:14:09 +00:00
James Wu	1250171866	Use fresh inductor cache on unit tests (#132432 ) Summary: This makes it so that stress tests on separate processes on the same machine don't clobber the directories of each other. InductorTestCase will automatically make a fresh tmpdir for each unit test. Test Plan: ``` buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results ``` Now passes Differential Revision: D60604811 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432 Approved by: https://github.com/masnesral	2024-08-02 03:02:36 +00:00
Animesh Jain	6c4ce4331c	[dynamo][exception] Raise Observed KeyError exception for dict __getitem__ (#132425 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132425 Approved by: https://github.com/yanboliang, https://github.com/Skylion007	2024-08-02 02:58:31 +00:00
Nikita Shulga	cd5452aace	[CUDA] `is_bf16_supported()` should not crash if there are no GPUs (#132313 ) `False` is the good answer on a system that does not have any CUDA GPUs. - Added regression test to TestTorch. Fixes https://github.com/pytorch/pytorch/issues/132303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313 Approved by: https://github.com/eqy, https://github.com/syed-ahmed	2024-08-02 02:50:43 +00:00
majing	3a355c1891	Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630 ) Fixes #130916 Per the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), we need an increasing sequence as the bins tensor. Random input doesn't make sense for torch.histogram. The case is a comparison between the CPU backend and another backend. When the input is random, kernel implementations in other backends have to align exactly with the CPU kernel, or the case fails. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-08-02 01:51:09 +00:00
Chien-Chin Huang	bc510916fa	Only make wait_tensor as a side_effect op (#132341 ) Summary: https://github.com/pytorch/pytorch/pull/131023 added all the collective ops to the side effect list. But we should only make wait_tensor a side_effect op, because all collective ops should have a corresponding wait_tensor. We should switch to use high_order effect token. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132341 Approved by: https://github.com/yf225	2024-08-02 01:24:40 +00:00
Yichen Yan	ef426d5183	[nccl] Wrap nccl code update with version check (#130419 ) Fixes the issue that PyTorch cannot be built with nccl < 2.13 after https://github.com/pytorch/pytorch/issues/128756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130419 Approved by: https://github.com/eqy, https://github.com/malfet	2024-08-02 01:22:07 +00:00
Chen Haifeng	50ed6ce277	Support built-in id function for TensorVariable on parameters (#130100 ) Fixes #130087 This patch tries to provide a built-in id function implementation for TensorVariable when the id function is called on tensors like module parameters. The id function call on intermediate tensors is not supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100 Approved by: https://github.com/anijain2305	2024-08-02 01:19:25 +00:00
Siyu Yang	64235c6a71	Skip test_fp8 in test_aot_inductor to temporarily (#132453 ) https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda. test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI. Testing: the failed test on internal diff is now skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453 Approved by: https://github.com/henrylhtsang	2024-08-02 01:18:03 +00:00
cyy	56334c854c	[2/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#131834 ) Follows #130798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131834 Approved by: https://github.com/ezyang	2024-08-02 00:49:30 +00:00
Avik Chaudhuri	ee1ef066fd	add src map to data-dependent errors (#132393 ) Summary: Currently suggested fixes pick a map from symbols to user variables. However it is possible that many user variables point to the same symbol, and some may be preferred over others. Thus we dump this info as well. Test Plan: updated test Sample error with new format: ``` Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0). (Size-like symbols: none) <snip> The following call raised this error: File "test/export/test_export.py", line 1950, in forward return r.view(items[0], items[2]) To fix the error, insert one of the following checks before this call: 1. torch._check(items[2] >= 0) 2. torch._check(items[2] < 0) (These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.) ``` Differential Revision: D60574478 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393 Approved by: https://github.com/BoyuanFeng	2024-08-02 00:31:12 +00:00
William Wen	625af2d27c	[dynamo] fix add_push_null callsites with CALL_FUNCTION_EX (#132329 ) Also fix a bug in `PyCodegen.add_push_null` where in Python <= 3.12, we may accidentally duplicate a NULL instead of the object on the stack before it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132329 Approved by: https://github.com/anijain2305	2024-08-02 00:29:21 +00:00
atalman	0016be8051	[Docker] Replace epel release rpm by yum install (#132449 ) URL: https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm is not available anymore, hence replacing this with yum epel-release install. As a backup plan, this is still available: https://archives.fedoraproject.org/pub/archive/epel/7/x86_64/Packages/e/epel-release-7-14.noarch.rpm Saved on our s3 path, just in case: https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm Please note, we are still using it for installs like this: ``` RUN yum install -y \ https://repo.ius.io/ius-release-el7.rpm \ https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm ``` Test in CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/132449 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet	2024-08-02 00:16:03 +00:00
PyTorch MergeBot	3855ac5a5d	Revert "[export] Add print_readable to unflattener (#128617 )" This reverts commit ab9791c0e342753013181eeeab300a05774fc456. Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/angelayi due to never got landed internally due to weird flow... sorry ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2264224466))	2024-08-01 23:47:29 +00:00
henrylhtsang	0c3ac428a2	[BE][typing] fix types in common pruning (#132309 ) BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309 Approved by: https://github.com/ColinPeppler	2024-08-01 23:34:33 +00:00
Mikayla Gawarecki	87ddf70fc6	Set weights_only=False in export `deserialize_torch_artifact` (#132348 ) Context: We are planning to make a BC breaking change to `torch.load` by flipping the default for `weights_only` from `False` --> `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`. To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348 Approved by: https://github.com/angelayi	2024-08-01 23:25:07 +00:00
Shangdi Yu	1362d51e7d	[AOTI] Fix number type for AOTI (#132180 ) Fixes #131338 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132180 Approved by: https://github.com/desertfire	2024-08-01 22:43:28 +00:00
Yidi Wu	35400f750f	[torchbind] don't warning for certain skippable methods. (#132306 ) Summary: Skip the warning if the fake script object doesn't implement a fake method for: 1. __obj_flatten__: for real script object only. 2. __set_state__ and __get_state__ for serialization. Don't expect it to be used during tracing. Test Plan: Existing tests. Reviewed By: angelayi Differential Revision: D60478460 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306 Approved by: https://github.com/angelayi	2024-08-01 22:40:42 +00:00
Shangdi Yu	2f54c38594	[AOTI] Fix bfloat16 in CPU (#132150 ) Fixes #122986 - add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file - Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare] 436 \| if (tensor.numel() != numel) { Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-08-01 22:26:30 +00:00
Joel Schlosser	a356a03f4a	Fix DEBUG=1 asserts for mvlgamma backward with NJT (#132422 ) mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. Fix in this PR unconditionally constructs `arange()` with the strided layout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422 Approved by: https://github.com/albanD	2024-08-01 21:53:16 +00:00
Yu, Guangye	92bebb46fa	Support XPU ABI=0 build (#130110 ) # Motivation This PR intends to support ABI=0 build for XPU backend. # Additional Context The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). The reason is that we use - gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and - icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code. Besides, - `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not. This PR depends on https://github.com/pytorch/pytorch/pull/131643. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-08-01 21:42:14 +00:00
Brian Hirsh	997f64af38	fastpath FunctionalTensor sizes() (#132084 ) Another attempt at fast-pathing sizes() in FunctionalTensor, since it appears to improve compile time perf by up to ~10%. See the investigation from https://github.com/pytorch/pytorch/issues/125977#issuecomment-2122915602. After looking at some failing tests locally I realized that we need to manually handle metadata mutations now, since the previous "smarter" size dispatch was handling the updates Pull Request resolved: https://github.com/pytorch/pytorch/pull/132084 Approved by: https://github.com/ezyang	2024-08-01 21:09:22 +00:00
PyTorch MergeBot	c8958f8f84	Revert "Ban decorator usage of dynamo_timed (#132328 )" This reverts commit 9853c048eb53946eb505424b17ac42ce46b66ac1. Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](`9853c048eb`). Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))	2024-08-01 20:20:28 +00:00
Oguz Ulgen	78927d37f6	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-01 20:14:25 +00:00
Oguz Ulgen	71e22e0959	Add basic mypy annotations to dynamo (#132415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu	2024-08-01 20:14:25 +00:00
Simon	12f61e65eb	[mtia][sdpa] MTIA SDPA dispatch via _fused_sdp_choice_stub (#132008 ) Summary: as title Differential Revision: D59823335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132008 Approved by: https://github.com/mortzur	2024-08-01 20:01:40 +00:00
Anshul Sinha	596f568592	[dtensor][debug] adding js script to pytorch github so that i can host the browser visualizer on pytorch (#132185 ) Summary This is the javascript portion that is used in CommDebugMode's visual browser. I have placed it here so that I can host the browser on PyTorch. I am following the same procedures to host as memory_viz https://github.com/pytorch/pytorch.github.io/blob/site/memory_viz.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/132185 Approved by: https://github.com/XilunWu ghstack dependencies: #132070	2024-08-01 19:50:23 +00:00
Edward Z. Yang	9853c048eb	Ban decorator usage of dynamo_timed (#132328 ) This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328 Approved by: https://github.com/albanD	2024-08-01 19:27:58 +00:00
PyTorch MergeBot	40c8f73099	Revert "Fix inlining module-scoped store global (#132224 )" This reverts commit c3a31d90e7d10a9b89b11396b6f8b20ed52bf394. Reverted https://github.com/pytorch/pytorch/pull/132224 on behalf of https://github.com/ZainRizvi due to Looks like the new import mock_store_global_crossfile_inline fails internally. Please see D60567756 for details ([comment](https://github.com/pytorch/pytorch/pull/132224#issuecomment-2263768729))	2024-08-01 19:06:36 +00:00
Michael Lazos	93979e7063	Skip frame if torch dispatch mode enabled (#131828 ) Fixes https://github.com/pytorch/pytorch/issues/105929 We now skip frames if a dispatch mode is enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131828 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2024-08-01 19:06:20 +00:00
Jianyu Huang	fbf3bc0a60	Always use high precision for SDPA math backend (#128922 ) Summary: feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts. Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, I think it is worth upcasting FP16/BF16 inputs to FP32, doing FP32/TF32 computations, and then downcasting the FP32 output back to FP16/BF16. Differential Revision: D58710805 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922 Approved by: https://github.com/xw285cornell, https://github.com/drisspg	2024-08-01 18:55:48 +00:00
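The effect of low-precision intermediates described above can be illustrated with a toy accumulation, independent of SDPA. This is only a sketch: it uses the standard library's half-precision `struct` format to emulate FP16 rounding and Python's 64-bit floats as the "higher precision" (the commit uses FP32), and the helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_low_precision(values):
    """Accumulate in FP16: every partial sum is rounded back to FP16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def sum_high_precision(values):
    """Upcast, accumulate in higher precision, downcast only the result."""
    return to_fp16(sum(values))


# Inputs are already FP16 values, as they would be for an FP16 tensor.
values = [to_fp16(0.01)] * 4096
exact = sum(values)
low_error = abs(sum_low_precision(values) - exact)
high_error = abs(sum_high_precision(values) - exact)
```

The FP16 accumulator stops growing once the increment falls below half the spacing between representable values, so its error is far larger than that of the upcast-then-downcast path.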
eellison	0eea2b3947	Cast inputs to low precision kernels in emulate low precision mode (#132345 ) Together with https://github.com/pytorch/pytorch/pull/132238, this is sufficient to give no divergence in https://github.com/pytorch/pytorch/issues/132301, although we should discuss that issue more at length. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132345 Approved by: https://github.com/zou3519	2024-08-01 18:02:10 +00:00
Ryo	ce61300141	Enable oneDNN for tanh based GELU on aarch64 (#130925 ) Provides speedup for GELU on aarch64 compared to native PyTorch implementation. e.g. 8.5x speedup compared to native implementation for 1x1x16384 on 32 threads on Graviton 3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130925 Approved by: https://github.com/malfet	2024-08-01 17:54:48 +00:00
Bin Bao	97eba8e174	[AOTI] Fix a typo in ExternKernel.codegen_const_args (#132191 ) Differential Revision: [D60513923](https://our.internmc.facebook.com/intern/diff/D60513923) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132191 Approved by: https://github.com/chenyang78	2024-08-01 17:46:25 +00:00
James Wu	f467d55329	Disable remote cache on test_aot_autograd_cache (#132409 ) Summary: AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss. Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled passes Differential Revision: D60588615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409 Approved by: https://github.com/masnesral	2024-08-01 17:26:11 +00:00
angelayi	010fc7858a	[export] Fix serialization of OpOverload w/ SymInt outputs (#132126 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1473575486613991/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/132126 Approved by: https://github.com/ydwu4	2024-08-01 17:22:04 +00:00
Xuehai Pan	ff4ca0d02a	[Easy] Fix argument name collision in `HigherOrderOperator` dispatched functions (#132377 ) Share the same spirit of #129562 - #129562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132377 Approved by: https://github.com/zou3519	2024-08-01 17:13:37 +00:00
Animesh Jain	7b816d7d6d	[dynamo] Treat attr of unspecialized builtin nn modules as static (#132318 ) This fixes the huge increase in compile time with +dynamic with inline_inbuilt_nn_modules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132318 Approved by: https://github.com/yanboliang, https://github.com/mlazos, https://github.com/ezyang ghstack dependencies: #132302, #132304, #132312, #132308, #132314	2024-08-01 17:11:18 +00:00
pratiklp00	69cbf05529	Fix recent build error on ppc64le (#129736 ) This PR will fix the recent build issue observed on ppc64le. Fixes #128130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129736 Approved by: https://github.com/albanD, https://github.com/malfet	2024-08-01 17:09:42 +00:00
Xuehai Pan	30293319a8	[BE][Easy][19/19] enforce style for empty lines in import segments in `torch/[o-z]*/` (#129771 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2024-08-01 17:07:14 +00:00
Howard Huang	c59f3fff52	[PP] Forward only schedule (#132177 ) `python test/distributed/pipelining/test_schedule_multiproc.py -k test_forward_only` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132177 Approved by: https://github.com/lessw2020	2024-08-01 16:35:56 +00:00
Yiming Zhou	ee09d066d3	[dynamo] Add line number to _warn_capture_scalar_outputs() (#132333 ) Fixes #127667. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132333 Approved by: https://github.com/anijain2305	2024-08-01 16:11:21 +00:00
Xu Han	35fcd59fd8	[inductor] make restrict_keyword cross OSs. (#132394 ) Handle the `restrict` keyword per OS, ref: https://learn.microsoft.com/en-us/cpp/cpp/extension-restrict?view=msvc-170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132394 Approved by: https://github.com/desertfire	2024-08-01 16:03:10 +00:00
Oguz Ulgen	920f0426ae	Add None return type to init -- tests rest (#132376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335, #132351, #132352	2024-08-01 15:44:51 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
Oguz Ulgen	a6985c09cb	Add None return type to init -- functorch and torchgen (#132351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335	2024-08-01 15:26:45 +00:00
Oguz Ulgen	72d2dba992	Add None return type to init (#132335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335 Approved by: https://github.com/albanD	2024-08-01 15:26:45 +00:00
atalman	30d7f0b15a	Remove wget call to builder install_cuda.sh (#132410 ) This file ``install_cuda.sh`` now lives in ``.ci/docker/common`` and will be removed from builder repo. Here is PR that removes it from builder: https://github.com/pytorch/builder/pull/1949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132410 Approved by: https://github.com/Skylion007	2024-08-01 15:22:08 +00:00
cyy	c99adce9a1	[12/N] Fix clang-tidy warnings in jit (#132209 ) Follows #132131 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209 Approved by: https://github.com/Skylion007	2024-08-01 15:12:12 +00:00
Justin Chu	0d88dd0f77	[TS2E] Remove reference to torch.onnx internals (#132186 ) Instead, this PR moves the code to the converter to avoid dependence. Feel free to refactor it afterward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132186 Approved by: https://github.com/angelayi	2024-08-01 15:08:02 +00:00
cyy	d7d6190493	[11/N] Use std::nullopt and std::optional (#132396 ) Follows #132364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132396 Approved by: https://github.com/ezyang	2024-08-01 14:46:33 +00:00
Xu Han	a4013e8b72	[inductor] cpp codegen alignas for all OSs. (#132387 ) Changes: 1. Make cpp codegen alignas work for all OSs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132387 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-08-01 14:30:09 +00:00
Xu Han	6c1f1563e1	[inductor] fix UndefinedTensorImpl singleton can't export on Windows. (#132326 ) This PR fixes the issue that `UndefinedTensorImpl::_singleton` can't be exported on Windows. The reason is that MSVC can't export class static data to external linkage, ref: https://learn.microsoft.com/en-us/cpp/cpp/using-dllimport-and-dllexport-in-cpp-classes?view=msvc-170#_pluslang_using_dllimport_and_dllexport_in_c2b2bselectivememberimportexport I use another singleton implementation to avoid the issue on Windows. With this PR, cpp_wrapper on Windows starts to work. As a next step, I will enable the cpp_wrapper UTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132326 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-08-01 13:37:12 +00:00
Xuehai Pan	6ff1e43a41	[BE][Easy][13/19] enforce style for empty lines in import segments in `test/j*/` (#129764 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129764 Approved by: https://github.com/ezyang	2024-08-01 12:13:42 +00:00
Xuehai Pan	672ce4610e	Populate submodules of `torch._C` to `sys.modules` recursively (#132216 ) See comment: `e9d1c26275/torch/__init__.py (L938-L950)` This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216 Approved by: https://github.com/ezyang	2024-08-01 12:04:59 +00:00
Max Ren	d95756f6a5	[Quantizer][Add] Fix add annotation with constant (#132092 ) Summary: Occasionally we run into a partition that looks like this for Add: ``` SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2]) ``` In this case we are adding a constant to an input, and reusing the constant later down the line. This causes our constant to be an output in our SourcePartition. The assumption that ``` add_node = add_partition.output_nodes[0] ``` will then not necessarily hold. As a result we must check that the output node is indeed a call function and not a constant. Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant Differential Revision: D60413221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092 Approved by: https://github.com/jerryzh168	2024-08-01 09:57:43 +00:00
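The check described above, picking the output that is a function call rather than assuming it is first, might look roughly like the following. `Node` and `find_add_node` are hypothetical stand-ins written for this sketch; the PR's actual code operates on real graph nodes and may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a graph node; only `name` and `op` matter here."""
    name: str
    op: str  # e.g. "call_function", or "get_attr" for a constant


def find_add_node(output_nodes: List[Node]) -> Optional[Node]:
    """Return the first output that is a call_function node, skipping constants."""
    for node in output_nodes:
        if node.op == "call_function":
            return node
    return None


# Mirrors the partition from the commit message: the constant comes first.
outputs = [Node("_constant2", "get_attr"), Node("add_2", "call_function")]
add_node = find_add_node(outputs)
```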
joydddd	bdd83c4c7f	Add Full block support to flex_decoding (#131404 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131404 Approved by: https://github.com/yanboliang	2024-08-01 07:28:52 +00:00
cyy	043e41f4f4	[10/N] Use std::nullopt and std::make_optional (#132364 ) Follows #130674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132364 Approved by: https://github.com/ezyang	2024-08-01 07:02:35 +00:00
Animesh Jain	d6a82ce39b	[dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132314 Approved by: https://github.com/yanboliang ghstack dependencies: #132302, #132304, #132312, #132308	2024-08-01 06:21:05 +00:00
Animesh Jain	aa0ed2496f	[dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132308 Approved by: https://github.com/yanboliang ghstack dependencies: #132302, #132304, #132312	2024-08-01 06:21:05 +00:00
Animesh Jain	612ea35395	[dynamo] Introduce UnspecializedBuiltinNNModuleSource (#132312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132312 Approved by: https://github.com/yanboliang ghstack dependencies: #132302, #132304	2024-08-01 06:21:05 +00:00
Tugsbayasgalan Manlaibaatar	4c29c1a96a	[EZ] adjust test to accept training IR input (#131999 ) When we do predispatch functional export, sometimes we get harmless additional detach calls. The new training IR actually outputs a slightly different (arguably more correct) result. Differential Revision: [D60348764](https://our.internmc.facebook.com/intern/diff/D60348764/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131999 Approved by: https://github.com/bdhirsh ghstack dependencies: #131988, #131995	2024-08-01 06:20:38 +00:00
Matthew Hoffman	7a779b5257	Add functions from `torch.masked._ops` to `__all__` for `torch.masked` (#131288 ) Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error: ``` "mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage] ``` Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288 Approved by: https://github.com/ezyang	2024-08-01 05:45:08 +00:00
Tugsbayasgalan Manlaibaatar	928adb7cc2	Fix empty fake mode problem (#131995 ) Title Differential Revision: [D60348541](https://our.internmc.facebook.com/intern/diff/D60348541/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131995 Approved by: https://github.com/angelayi ghstack dependencies: #131988	2024-08-01 04:55:37 +00:00
eellison	f32ab3b9e3	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail. See, repro here: P1453035092. Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-08-01 04:37:15 +00:00
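To illustrate why an ordered set gives reproducible behavior, here is a minimal insertion-ordered set built on `dict`. `SimpleOrderedSet` is a hypothetical class for this sketch and is not the `OrderedSet` used in Inductor.

```python
from typing import Hashable, Iterable, Iterator


class SimpleOrderedSet:
    """Hypothetical minimal ordered set: unique items, insertion-order iteration."""

    def __init__(self, items: Iterable[Hashable] = ()) -> None:
        # dict preserves insertion order (guaranteed since Python 3.7).
        self._items = dict.fromkeys(items)

    def add(self, item: Hashable) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[Hashable]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


names = SimpleOrderedSet(["buf2", "buf0", "buf1", "buf0"])
names.add("buf3")
ordered = list(names)
```

Iterating a built-in `set` of strings can yield a different order from one interpreter run to the next because string hashing is randomized per process, whereas the order here depends only on insertion.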
Animesh Jain	bcd1d2e832	[dynamo] Introduce UnspecializedNNModule guard source (#132304 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132304 Approved by: https://github.com/yanboliang ghstack dependencies: #132302	2024-08-01 04:35:43 +00:00
Animesh Jain	e772547d70	[dynamo][rename/refactor] Rename guard_source NN_MODULE to SPECIALIZED_NN_MODULE (#132302 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132302 Approved by: https://github.com/yanboliang	2024-08-01 04:35:43 +00:00
Dan Zimmerman	90fa64bd7e	[torch][take2] Implement BFloat16 __hip_bfloat16 overloads (#132234 ) Summary: In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm). It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here Test Plan: CI Differential Revision: D60362079 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234 Approved by: https://github.com/houseroad	2024-08-01 04:25:46 +00:00
Jiong Gong	7911b7bfb7	[inductor][cpp] stabilize do_bench_cpu (#131873 ) This PR stabilizes `do_bench_cpu` by using milliseconds for warmup and benchmark runs, aligning with Triton's do_bench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131873 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/eellison	2024-08-01 04:25:31 +00:00
Xuehai Pan	b25ef91bf1	[BE][Easy][18/19] enforce style for empty lines in import segments in `torch/d*/` (#129770 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770 Approved by: https://github.com/wconstab	2024-08-01 04:22:50 +00:00
Wei Feng	bc7ed1fbdc	[FSDP2] add __repr__ to FSDPParamGroup and FSDPParam (#132350 ) in pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`. making sure they are human readable print `FSDPParam` in pdb ``` FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256])) ``` print `FSDPParamGroup` in pdb ``` FSDPParamGroup(fqn=layers.6) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350 Approved by: https://github.com/awgu	2024-08-01 04:21:57 +00:00
Tianyu Liu	46ed33b207	add decomposition_table as an arg to get_isolated_graphmodule (#130886 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130886 Approved by: https://github.com/wanchaol	2024-08-01 04:21:43 +00:00
Tugsbayasgalan Manlaibaatar	073430ebea	Don't check for autograd state when lowering to inference IR (#131988 ) When lowering to inference IR, we shouldn't error on autograd state changes because we will have preserved the autograd state change at the training level. I think the more correct way of implementing it would be to wrap autograd ops in HOP before decomposing, but that seems low ROI. Differential Revision: [D60346235](https://our.internmc.facebook.com/intern/diff/D60346235/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131988 Approved by: https://github.com/angelayi	2024-08-01 04:15:37 +00:00
Avik Chaudhuri	81db69278d	unsupported sympy functions in export solver (#132325 ) Summary: A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691. However the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) does not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and can be easily repro'd by using them inside `torch._check`, e.g. According to ezyang these implementations need to point to new torch functions that can expect proxies (see how minimum and maximum are implemented, e.g.). Test Plan: added test (original repro provided) Differential Revision: D60540951 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325 Approved by: https://github.com/ezyang	2024-08-01 04:11:52 +00:00
PyTorch MergeBot	10344d76bd	Revert "[AOTI] Fix bfloat16 in CPU (#132150 )" This reverts commit a488113062b7231197ace8522ab3cab535c77d0b. Reverted https://github.com/pytorch/pytorch/pull/132150 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_unspec_inputs_cuda_dynamic_shapes_cuda_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10189155341/job/28189531216) [HUD commit link](`a488113062`). Test was not run on PR due to being skipped for being slow ([comment](https://github.com/pytorch/pytorch/pull/132150#issuecomment-2261895048))	2024-08-01 03:35:39 +00:00
PyTorch MergeBot	a28cda11ef	Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613 )" This reverts commit 344c15a0bb66409ec5e576992090d127cbfa2cff. Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))	2024-08-01 03:22:11 +00:00
YangQun1	589aef4bb0	Fix py codegen to delete values that don't have any users (#131028 ) Fixes #131025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028 Approved by: https://github.com/ezyang	2024-08-01 03:18:37 +00:00
rzou	718c13cd39	[inductor] Reinplacing should not allow an op to mutate the same input multiple times (#132238 ) Fixes #132196 Let's say we have: - op(x, y) that mutates both x and y - new_x, new_y = functional_op(x, y) is the functional variant If we are presented with functional_op(x, x), we must not reinplace this into op(x, x), because then it would be writing to the same Tensor. Instead, it's OK to reinplace one of them and to clone the other: ``` >>> y = x.clone() >>> op(x, y) ``` This also applies if we have views: functional_op(x, x[0]) should not reinplace into op(x, x[0]). The fix is to avoid reinplacing an arg if a view of it already has been reinplaced. Test Plan: - new and existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-08-01 02:37:03 +00:00
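The aliasing hazard described above can be reproduced with plain Python lists standing in for tensors. The particular mutations performed by `op` are made up for this sketch; only the aliasing behavior matters.

```python
def functional_op(x, y):
    """Functional variant: returns new lists, leaves inputs untouched."""
    return [v + 1 for v in x], [v * 2 for v in y]


def op(x, y):
    """In-place variant: mutates both arguments."""
    for i in range(len(x)):
        x[i] += 1
    for i in range(len(y)):
        y[i] *= 2


x = [1, 2]
expected_x, expected_y = functional_op(x, x)

# Naive reinplacing: both arguments alias the same storage, so the second
# mutation sees the result of the first.
aliased = list(x)
op(aliased, aliased)

# Safe reinplacing: clone one argument so each buffer is mutated once.
safe_x = list(x)
safe_y = list(safe_x)
op(safe_x, safe_y)
```

With the clone, the in-place results match the functional ones; without it, the single shared buffer matches neither output.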
Alnis Murtovi	344c15a0bb	AutoHeuristic: mixed_mm heuristic for A100 (#131613 )

This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is what the results look like:

Explanation of columns:
- wrong_max_spdup: In the worst case, how much better would the best choice have been
- wrong_gman_spdup: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
- max_spdup_default: Highest speedup achieved by the learned heuristic over the default choice
- gman_spdup_default: Geomean speedup achieved by the learned heuristic over the default choice
- max_slowdown_default: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
- non_default_preds: Number of times the learned heuristic predicted a choice that is not the default choice
- default_better: Number of times the default choice is better than the choice made by the heuristic

```
set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better
train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2
test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization.
To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.

| batch size | prompt length | fallback | heuristic | speedup |
|------------|---------------|-------------:|-------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s | 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s | 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s | 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s | 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s | 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s | 3.46 |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613 Approved by: https://github.com/eellison ghstack dependencies: #131610, #131611	2024-08-01 02:25:54 +00:00
Valentine233	2276d9045a	[cpu] add more VecConvert for 8bits (#131876 ) Adds more intrinsic specializations for 8bits conversions, in order to speed up bit8 SDPA in the future. - u8 -> i16 - i32 -> f32 - f32 -> i32 - i32 -> i8 (only add vec512 cause lack of avx512vl for vec256) - i16 -> i8 (only add vec512 cause lack of avx512vl for vec256) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131876 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2024-08-01 01:38:39 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
albanD	4e966e8a1c	Update inference_mode doc (#132321 ) Fix https://github.com/pytorch/pytorch/issues/132288 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132321 Approved by: https://github.com/awgu, https://github.com/soulitzer	2024-07-31 23:50:03 +00:00
Shangdi Yu	a488113062	[AOTI] Fix bfloat16 in CPU (#132150 ) Fixes #122986 - Add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file - Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare] 436 \| if (tensor.numel() != numel) { Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-07-31 23:28:24 +00:00
jainapurva	6b28af1b79	Grouped Query Attention (#128898 ) ### Approach: Using the current function declaration Constraint: Q_Heads % KV_Heads == 0 Major change: - Added a new argument enable_gqa: bool to sdpa function call - It adds a meaning to the last third dimension. Sample use cases this would enable: LLama3 ``` # LLama3 8b call to SDPA query = torch.rand(batch, 32, seq_len_q, D) key = torch.rand(batch, 8, seq_len_kv, D) value = torch.rand(batch, 8, seq_len_kv, D) output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True) # Output Shape (batch, 32, seq_len_q, D) ``` ### Design Choice: - Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0 - The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms. - By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged. 
### Benchmarks: - sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substantial improvement in the run_time of sdpa \| batch_size \| q_num_heads \| kv_num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| forward_time when enable_gqa=True \| forward_time when enable_gqa=False \| \| ------------ \| ------------- \| -------------- \| ----------- \| ------------ \| ----------- \| ----------- \| ---------------- \| \| 1 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 100.71 \| 119.70 \| \| 8 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 539.78 \| 628.83 \| \| 16 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 1056.81 \| 1225.48 \| \| 32 \| 32 \| 8 \| 2048 \| 2048 \| 2048 \| 2099.54 \| 2440.45 \| ![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b) - TorchTitan: https://github.com/pytorch/torchtitan/pull/458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-31 22:58:51 +00:00
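The head-matching rule in the Grouped Query Attention commit above can be sketched without tensors. This is a minimal illustration that assumes only what the message states (the `Q_Heads % KV_Heads == 0` constraint and a `repeat_interleave`-style expansion of the key/value heads). The helper names are hypothetical and this is not the PyTorch implementation.

```python
def repeat_interleave(items, repeats):
    """Repeat each element `repeats` times, keeping the copies adjacent."""
    return [item for item in items for _ in range(repeats)]


def expand_kv_heads(q_heads, kv_heads):
    """Return, for each query head, the index of the KV head it is paired with.

    Hypothetical helper: enforces the constraint Q_Heads % KV_Heads == 0.
    """
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group_size = q_heads // kv_heads
    return repeat_interleave(list(range(kv_heads)), group_size)


# Shapes from the commit message's Llama 3 8B example: 32 query heads, 8 KV heads.
mapping = expand_kv_heads(32, 8)
```

Under this expansion, query head `h` is paired with KV head `h // group_size`, so each group of four consecutive query heads shares one key/value head.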
eellison	f0da167ce5	Add fx graph runnable to tl parse (#130976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976 Approved by: https://github.com/ezyang	2024-07-31 22:19:35 +00:00
Oguz Ulgen	645c1052a6	Refactor local autotune remote cache to make the code less error prone (#132289 ) Fixes #132241 This PR refactors local autotune cache so that disabling it is easier and cleaner. Differential Revision: [D60537196](https://our.internmc.facebook.com/intern/diff/D60537196) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132289 Approved by: https://github.com/aorenste ghstack dependencies: #132285	2024-07-31 22:12:22 +00:00
Oguz Ulgen	b0e06d9d6a	Make config.autotune_remote_cache be a three-way option (#132285 ) Similar to fx_graph_cache config, make autotune config be three-way so we can hard enable/disable via config options. Differential Revision: [D60537105](https://our.internmc.facebook.com/intern/diff/D60537105) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132285 Approved by: https://github.com/aorenste	2024-07-31 22:12:22 +00:00
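A three-way option of the kind described above is commonly modeled as an optional boolean. The sketch below assumes that reading (`True` forces the feature on, `False` forces it off, `None` defers to some environment default). The function and parameter names are hypothetical and are not taken from the inductor config module.

```python
from typing import Optional


def resolve_three_way(config_value: Optional[bool], default_enabled: bool) -> bool:
    """Resolve a three-way switch to a concrete on/off decision.

    Assumption: None means "no explicit choice", so the default applies.
    """
    if config_value is not None:
        # An explicit True/False is a hard enable/disable and wins.
        return config_value
    return default_enabled


forced_off = resolve_three_way(False, default_enabled=True)
forced_on = resolve_three_way(True, default_enabled=False)
deferred = resolve_three_way(None, default_enabled=True)
```

The `is not None` test matters here: a plain truthiness check would treat an explicit `False` the same as "unset".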
Peter Bell	260c991e20	[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 ) This fixes a few instances where we assumed indexing expressions were non-negative. This is not valid when we have more complicated expressions involving masking e.g. pointwise cat. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761 Approved by: https://github.com/ezyang	2024-07-31 21:32:20 +00:00
Xuehai Pan	e74ba1b34a	[BE][Easy][15/19] enforce style for empty lines in import segments in `torch/_d*/` (#129767 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767 Approved by: https://github.com/anijain2305	2024-07-31 21:18:11 +00:00
Sheng Fu	ad9826208c	Remove string length limit in ET (#132169 ) Summary: ET sets the length limit of string input variables to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128 rank job. This DIFF is to temporarily remove this limit, so ET can capture the complete information of the process group. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace Reviewed By: sanrise Differential Revision: D60341306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169 Approved by: https://github.com/sraikund16, https://github.com/sanrise	2024-07-31 20:54:39 +00:00
Alnis Murtovi	d3cefc9e3a	AutoHeuristic: Collect data for mixed_mm (#131611 ) This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things: Move pad_mm related AutoHeuristic files into subdirectory Introduce an interface benchmark_runner.py that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py). The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611 Approved by: https://github.com/eellison ghstack dependencies: #131610	2024-07-31 20:45:45 +00:00
Siddharth Kotapati	f8b6e91840	Add sequoia runner to mac-mps (#132190 ) Adds MacOS 15 runners to GitHub actions for Mac-mps test suite Co-authored-by: Joona Havukainen <jhavukainen@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132190 Approved by: https://github.com/malfet	2024-07-31 20:26:04 +00:00
Sergii Dymchenko	d72e863b3e	Fix lint after PR #130572 (#132316 ) Fix lint after https://github.com/pytorch/pytorch/pull/130572 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132316 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi	2024-07-31 20:00:31 +00:00
Catherine Lee	aeb78c9849	[TD] More files for test_public_bindings (#132284 ) It relies on that file Also we care about .cpp files too apparently Pull Request resolved: https://github.com/pytorch/pytorch/pull/132284 Approved by: https://github.com/ZainRizvi	2024-07-31 19:53:40 +00:00
Andrii Grynenko	cb4c107d70	[pytorch][counters] DynamicCounter (#132166 ) Summary: Implement a callback-based dynamic counter with pluggable backends. The backend API and integration is similar to WaitCounter. Note that this counter should only be used with C++ callbacks, since making it safe to be used for GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed). Test Plan: unit test Differential Revision: D60464055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166 Approved by: https://github.com/asiab4	2024-07-31 19:52:51 +00:00
PyTorch MergeBot	dc38646c58	Revert "[pytorch][counters] Pybind for WaitCounter (#132167 )" This reverts commit 2c7bd61afa4b762e00b26bbde43685de080af32a. Reverted https://github.com/pytorch/pytorch/pull/132167 on behalf of https://github.com/clee2000 due to broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183687967/job/28172929836) [HUD commit link](`2c7bd61afa`) not tested on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132167#issuecomment-2261328275))	2024-07-31 19:51:56 +00:00
Edward Z. Yang	6955bc170d	Some updates to merge rules (#132296 ) The added people from metamates don't actually make a material difference right now but I added some for fun. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132296 Approved by: https://github.com/albanD, https://github.com/malfet	2024-07-31 19:49:08 +00:00
Gabriel Ferns	2138a710eb	enable test_max_pool2d6 after resolving empty array (#132219 ) Related to Issue: https://github.com/pytorch/pytorch/issues/131335 Resolving PR: https://github.com/pytorch/pytorch/pull/132023 Test output: ``` (pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (enable-test-max-pool2d6)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cpu_cpp_wrapper.py -k test_max_pool2d6 inline_call [] stats [('calls_captured', 3), ('unique_graphs', 1)] inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('ok', 1)] .inline_call [] stats [('calls_captured', 3), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)] . ---------------------------------------------------------------------- Ran 2 tests in 8.668s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132219 Approved by: https://github.com/desertfire	2024-07-31 19:13:54 +00:00
drisspg	cfe61e84ac	Add a 'to' method for moving to and from device for BlockMask (#132087 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132087 Approved by: https://github.com/yanboliang	2024-07-31 19:05:30 +00:00
Edward Z. Yang	898a431a46	Dump files that look like FX graphs to structured log (#132100 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132100 Approved by: https://github.com/oulgen	2024-07-31 18:45:28 +00:00
James Wu	f9e4d05c15	Save and run post compilation steps within FXGraphCache (#130572 ) This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that: - When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification). - When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch. What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it returns exactly what compile_fx_inner returns. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner. ## What's a post compile step? We define a post-compile to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include: - Setting the tracing context's output strides - Running cudagraphs if enabled - Maybe realign inputs if cudagraphs didn't run To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object. ## Splitting cudagraphs work into pre/post compile Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages. 
Implementation notes: - We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness. - ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache doesn't have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so only inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result. - Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572 Approved by: https://github.com/eellison	2024-07-31 18:32:40 +00:00
JackCaoG	b40249b462	propagate XLA's metadata after functional sync (#131076 ) Fixes https://github.com/pytorch/xla/issues/7174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076 Approved by: https://github.com/bdhirsh	2024-07-31 18:20:00 +00:00
Joel Schlosser	7eb2a99585	Fix to support unary pointwise ops when an NJT is not the first arg (#131937 ) Background: NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg. This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937 Approved by: https://github.com/soulitzer ghstack dependencies: #131898, #131704	2024-07-31 17:51:03 +00:00
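The idea of locating the nested argument wherever it appears, instead of assuming it comes first, can be illustrated with plain Python. This is a sketch under stated assumptions: `FakeNested` and `unary_pointwise` are hypothetical stand-ins and do not reproduce `jagged_unary_pointwise()` or its schema validation.

```python
class FakeNested:
    """Hypothetical stand-in for a nested tensor: just wraps a flat list of values."""

    def __init__(self, values):
        self.values = list(values)


def unary_pointwise(func, *args):
    """Apply `func` element-wise over the single FakeNested argument, at any position."""
    positions = [i for i, arg in enumerate(args) if isinstance(arg, FakeNested)]
    if len(positions) != 1:
        raise TypeError("expected exactly one FakeNested argument")
    idx = positions[0]
    results = []
    for value in args[idx].values:
        call_args = list(args)
        call_args[idx] = value  # substitute one element for the nested argument
        results.append(func(*call_args))
    return FakeNested(results)


# pow(scalar, nested): the nested argument is second, as in the commit's example.
scalar_first = unary_pointwise(pow, 2, FakeNested([1, 2, 3]))
nested_first = unary_pointwise(pow, FakeNested([1, 2, 3]), 2)
```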
Michael Lazos	c3a31d90e7	Fix inlining module-scoped store global (#132224 ) Fixes https://github.com/pytorch/pytorch/issues/132165 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132224 Approved by: https://github.com/anijain2305	2024-07-31 17:37:43 +00:00
Aaron Orenstein	6214b5388b	typing ir.py - part 1 (#131845 ) See #131852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131845 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-07-31 17:37:14 +00:00
Michael Lazos	144639797a	Improve side effects error message (#132223 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/132223 Approved by: https://github.com/anijain2305	2024-07-31 17:29:26 +00:00
PyTorch MergeBot	784a6ec5a3	Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 )" This reverts commit 13d744464f10e35c0de50feb4e2340d4dae8e05f. Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](`13d744464f`) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))	2024-07-31 16:49:21 +00:00
Sam Larsen	9826c542f0	[inductor] skip remote fx caching in failing pattern matcher tests (#132206 ) Summary: These tests are failing internally with remote caching enabled because the installed pattern increments a nonlocal counter, which we skip with a cache hit. Test Plan: ``` buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_with_mutation (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10 buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations1 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10 buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations2 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10 buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations3 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10 ``` Differential Revision: D60491503 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132206 Approved by: https://github.com/oulgen	2024-07-31 16:41:04 +00:00
datagero	bdd7a0322d	[Dynamo] Fix - `str` handler for UserDefinedObjectVariable (#130506 ) Fixes #130301 Adjusted the call_str method to handle str conversion for UserDefinedObjectVariable. Attempt in a clean branch for unrelated test errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130506 Approved by: https://github.com/oulgen, https://github.com/anijain2305	2024-07-31 16:39:59 +00:00
Yan Zhiwei	fe4f8e97cd	[Intel GPU] xpu-ops codegen via backend whitelist (#130082 ) # Motivation This PR intends to enhance the codegen to allow generating code for the XPU backend. XPU operators currently need to be registered by hand. Developers have no chance to take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort. We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes. # Usage XPU ops lie in `third_party/torch-xpu-ops`, the codegen process is triggered before the compilation of `torch-xpu-ops`. We use the following commands to generate XPU operators ` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU` The diff lies at `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen. The inputs of `gen.py` are code templates and operators yaml. We share the same templates in `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported xpu operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, the format is same as the one in `aten` # Result All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder. 
# Verification * In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman ghstack dependencies: #130019	2024-07-31 16:31:38 +00:00
David Berard	aec8bc5e4c	[easy] fix type annotation on constraint_violations variable (#127064 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127064 Approved by: https://github.com/jananisriram	2024-07-31 16:27:10 +00:00
hongxiayang	c85088b1f9	[ROCm] performance optimization for index select (#131713 ) As observed during working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR is to increase the default to improve the performance, and also slightly refactoring the code to replace the hard-coded 128 for better maintenance. By increasing the default max threads per block from 128 to 256, I saw for `aten::index_select`, its "CUDA total" time drop from 44.820ms to 33.608ms by profiling below embedding script: ``` input = torch.randint(low=0, high=16032, size=[131072], device="cuda") w = torch.randn([16032, 16384], device="cuda") with profiler.profile(record_shapes=True) as prof: x = torch.nn.functional.embedding(input, w) ``` I tested with the default from 128 to 256, 512, 1024 on several different types of devices, and observed "CUDA total" time dropping even more and more latency improvement as the number increases. Below is one example of latency improvement ratio: 128 \| 1x 256 \| 1.33x 512 \| 1.44x 1024 \| 1.49x Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script. Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster. Co-authored-by: Jeff Daily <jeff.daily@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713 Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet	2024-07-31 16:24:01 +00:00
eellison	13d744464f	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail. See, repro here: P1453035092. Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-07-31 16:22:11 +00:00
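For context on the determinism concern above: a built-in `set` does not iterate in insertion order, and for strings the order can differ between processes because of hash randomization. A dict, by contrast, preserves insertion order. The class below is a minimal sketch of an insertion-ordered set built on that property. `InsertionOrderedSet` is a hypothetical name and is not the `OrderedSet` used in Inductor.

```python
class InsertionOrderedSet:
    """Minimal set-like container that iterates in first-insertion order."""

    def __init__(self, items=()):
        # dict keys are unique and keep insertion order (Python 3.7+).
        self._items = dict.fromkeys(items)

    def add(self, item):
        self._items[item] = None

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


names = InsertionOrderedSet(["buf1", "buf0", "buf1", "buf2"])
names.add("buf0")  # already present, so the order is unchanged
ordered = list(names)
```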
Andrii Grynenko	2c7bd61afa	[pytorch][counters] Pybind for WaitCounter (#132167 ) Summary: Basic pybind integration for WaitCounter providing a guard API. Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API). Test Plan: unit test Reviewed By: asiab4 Differential Revision: D60463979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167 Approved by: https://github.com/asiab4	2024-07-31 16:04:40 +00:00
Xu Han	39a3c98aa6	[inductor] fix scalar miss constuctor for long type. (#132117 ) Fix `long` to `c10::scalar` convert issue. ![image](https://github.com/user-attachments/assets/fc44a170-e293-4688-a185-d189484f6638) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132117 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-07-31 15:40:48 +00:00
Ke Wen	b2118573d6	[BE] Unify PG assignments (#132230 ) python's `or` operator returns `bar` in cases of `foo = None or bar` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132230 Approved by: https://github.com/Skylion007, https://github.com/wconstab	2024-07-31 15:28:25 +00:00
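The `or` idiom mentioned above can be shown in a few lines. This is a generic sketch with hypothetical names (`resolve_group`, `"default_pg"`) and is not code from the commit.

```python
def resolve_group(group=None, default_group="default_pg"):
    """Return `group` if it is truthy, otherwise fall back to `default_group`."""
    return group or default_group


fallback = resolve_group()           # None or "default_pg" evaluates to "default_pg"
explicit = resolve_group("my_pg")    # a truthy left operand is returned as-is
```

One caveat of this idiom is that `or` falls back for any falsy left operand (for example `0` or an empty string), not only for `None`.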
IvanKobzarev	9c52013559	[subclasses] Fix nested subclasses flattened tensors ordering (#132096 ) get_plain_tensors() should result in DFS of leaves. The error was that plain tensors (leaves) on the same level were returned before the subclasses' plain tensors, even if the subclasses come earlier in the "flatten" list. Original issue from AO: https://github.com/pytorch/ao/issues/515 Test: TBD, need to make an asymmetric subclass with dense tensors and subclasses Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096 Approved by: https://github.com/bdhirsh	2024-07-31 14:12:51 +00:00
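The ordering requirement described above (leaves in depth-first order, following the order of the flatten list) can be sketched with a toy tree. `Wrapper` and `plain_leaves` are hypothetical stand-ins for tensor subclasses and `get_plain_tensors()`, used for illustration only.

```python
class Wrapper:
    """Hypothetical stand-in for a tensor subclass holding inner tensors in order."""

    def __init__(self, *children):
        self.children = list(children)


def plain_leaves(node):
    """Collect leaves depth-first, visiting children in their stored order."""
    if not isinstance(node, Wrapper):
        return [node]
    leaves = []
    for child in node.children:
        # Recurse immediately so a nested wrapper's leaves land before
        # any later sibling, including plain siblings at the same level.
        leaves.extend(plain_leaves(child))
    return leaves


# An asymmetric case: a nested wrapper first, then a plain leaf at the same level.
tree = Wrapper(Wrapper("a", "b"), "c")
leaves = plain_leaves(tree)
```

An ordering that collects same-level plain leaves first would instead yield `c` before `a` and `b`, which is the behavior the commit describes as the error.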
PyTorch MergeBot	5406e46b00	Revert "Add fx graph runnable to tl parse (#130976 )" This reverts commit 52c3af62d6fa4a0a4e22764a89f1877f3b1b28f9. Reverted https://github.com/pytorch/pytorch/pull/130976 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/130976#issuecomment-2260579485))	2024-07-31 13:53:57 +00:00
Ke Wen	3d7f541597	[BE][TP] Check module has bias before access (#132137 ) Some linear modules, such as the ones reconstructed by `torch.export.unflatten()`, may not have the `bias` attribute, if the original linear module has `bias=None`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132137 Approved by: https://github.com/wanchaol	2024-07-31 13:45:28 +00:00
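A defensive attribute lookup of the kind this commit title describes can be sketched as follows. `LinearLike` is a hypothetical stand-in for a linear module that may lack a `bias` attribute entirely. The actual check in the tensor-parallel code may be written differently.

```python
class LinearLike:
    """Hypothetical module stand-in; `bias` is only set when requested."""

    def __init__(self, with_bias):
        self.weight = [1.0, 2.0]
        if with_bias:
            self.bias = [0.5]


def get_bias_or_none(module):
    """Return the module's bias, or None when the attribute is missing."""
    return getattr(module, "bias", None)


with_bias = get_bias_or_none(LinearLike(with_bias=True))
without_bias = get_bias_or_none(LinearLike(with_bias=False))
```

Using `getattr` with a default covers both the missing-attribute case and the case where `bias` exists but is `None`.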
Dan Zimmerman	dad125a64b	Address clang-tidy nits in BFloat16 (#132203 ) Summary: In https://github.com/pytorch/pytorch/pull/131359 I forgot to amend with clang-tidy fixes before merging. This addresses that. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/132203 Approved by: https://github.com/houseroad	2024-07-31 13:41:56 +00:00
Yu, Guangye	45e6a364ee	Avoid autocast deprecation warning (#132207 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207 Approved by: https://github.com/awgu	2024-07-31 13:13:39 +00:00
Luca Wehrstedt	f4f7aba75d	Expose function to probe whether PyTorch was built with FlashAttention (#131894 ) This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-07-31 11:33:09 +00:00
Xuehai Pan	548c460bf1	[BE][Easy][7/19] enforce style for empty lines in import segments in `test/[a-c]/` and `test/[q-z]/` (#129758 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758 Approved by: https://github.com/ezyang	2024-07-31 10:54:03 +00:00
Janani Sriram	46994e753b	[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#132172 ) Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor. Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features. Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172 Approved by: https://github.com/davidberard98 ghstack dependencies: #132170	2024-07-31 10:51:46 +00:00
Janani Sriram	89053e382a	[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#132170 ) Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor. Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`. Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170 Approved by: https://github.com/davidberard98	2024-07-31 10:51:46 +00:00
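A padding-based softmax over the ragged dimension can be sketched in plain Python. The commit message says only that the aten padding operator is used, so the choice of negative infinity as the pad value, the list-of-lists layout, and the function name below are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def softmax_over_ragged_dim(components):
    """Softmax along the ragged dimension of a (B, *, M)-shaped list of lists.

    `components[b]` holds a variable number of rows, each with M values.
    Assumes every component has at least one row.
    """
    max_len = max(len(rows) for rows in components)
    width = len(components[0][0])
    result = []
    for rows in components:
        # Pad to a common length; exp(-inf) == 0, so padding adds no weight.
        padded = rows + [[NEG_INF] * width for _ in range(max_len - len(rows))]
        out_rows = [[0.0] * width for _ in range(len(rows))]
        for col in range(width):
            column = [row[col] for row in padded]
            peak = max(column)  # subtract the max for numerical stability
            exps = [math.exp(value - peak) for value in column]
            total = sum(exps)
            for i in range(len(rows)):  # drop the padded positions again
                out_rows[i][col] = exps[i] / total
        result.append(out_rows)
    return result


nested = [[[1.0, 2.0]], [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]
probs = softmax_over_ragged_dim(nested)
```

Each output component keeps its original number of rows, and every column within a component sums to 1.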
Xuehai Pan	e7eeee473c	[BE][Easy][14/19] enforce style for empty lines in import segments in `torch/_[a-c]/` and `torch/_[e-h]/` and `torch/_[j-z]*/` (#129765 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765 Approved by: https://github.com/ezyang	2024-07-31 10:42:50 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00
IvanKobzarev	a94e507c39	[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 ) Original issue: https://github.com/pytorch/pytorch/issues/114338 Reland of: https://github.com/pytorch/pytorch/pull/128016 Summary from previous PR: We assume only two possible mutually exclusive scenarios: Running compiled region for training (Any of inputs has requires_grad) Produced differentiable outputs should have requires_grad. Running compiled region for inference (None of inputs has requires_grad) All outputs do not have requires_grad. Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1). With current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad 2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad() Changes in partitioner? Inference and Training graphs had difference in return container, list/tuple. The changes in partitioner are done to unify and return always tuple. As a result - some changes in test_aotdispatch.py for graph contents list -> tuple. Why was it reverted? There was a regression of hf_Reformer model on inference. ``` TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode ``` Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True). Even if the torchbench inference benchmark runs inside `with torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad. As a result we started compiling training graph instead of inference. Fix for view ops: If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph. 
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890 Approved by: https://github.com/bdhirsh	2024-07-31 07:25:19 +00:00
Yiming Zhou	e9d1c26275	fix uniform op in dynamo (#132160 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132160 Approved by: https://github.com/anijain2305	2024-07-31 06:48:43 +00:00
Justin Chu	ae708e9791	[ONNX] Remove the deprecated SymbolicContext (#132184 ) Remove the deprecated SymbolicContext class from torch.onnx Pull Request resolved: https://github.com/pytorch/pytorch/pull/132184 Approved by: https://github.com/titaiwangms	2024-07-31 04:24:32 +00:00
cyy	89da94594e	[11/N] Fix clang-tidy warnings in jit (#132131 ) Follows #132122 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132131 Approved by: https://github.com/Skylion007	2024-07-31 03:45:52 +00:00
PyTorch MergeBot	91299c95ec	Revert "Add functions from `torch.masked._ops` to `__all__` for `torch.masked` (#131288 )" This reverts commit 78020ea55d1bc06898577887b80c15d6d2b967dc. Reverted https://github.com/pytorch/pytorch/pull/131288 on behalf of https://github.com/kit1980 due to Broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10172945925/job/28136657243) [HUD commit link](`78020ea55d`) ([comment](https://github.com/pytorch/pytorch/pull/131288#issuecomment-2259581854))	2024-07-31 03:45:09 +00:00
Cheng Ni	27c9262d29	Fix stdout / stderr typing in SubprocessHandler (#132071 ) Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`. Test Plan: CI Differential Revision: D60319648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071 Approved by: https://github.com/Skylion007	2024-07-31 02:51:11 +00:00
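Note on 27c9262d29: the typing change matters because a caller may legitimately pass no redirect path at all. The following is a minimal sketch of that pattern, not the actual `SubprocessHandler` code; `run_redirected` and `demo` are hypothetical names introduced only for illustration.

```python
import os
import subprocess
import sys
import tempfile
from typing import List, Optional


def run_redirected(
    cmd: List[str],
    stdout: Optional[str] = None,
    stderr: Optional[str] = None,
) -> int:
    """Run cmd, redirecting each stream to a file only when a path is given."""
    out = open(stdout, "w") if stdout is not None else None
    err = open(stderr, "w") if stderr is not None else None
    try:
        # Passing None to subprocess means "inherit the parent's stream".
        return subprocess.run(cmd, stdout=out, stderr=err).returncode
    finally:
        for handle in (out, err):
            if handle is not None:
                handle.close()


def demo() -> str:
    """Redirect stdout to a temporary file and return what was written."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        run_redirected([sys.executable, "-c", "print('hi')"], stdout=path)
        with open(path) as f:
            return f.read().strip()
```

With `str` annotations a type checker would reject `run_redirected(cmd)` called without paths; `Optional[str]` documents that `None` is a supported value.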
eellison	52c3af62d6	Add fx graph runnable to tl parse (#130976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976 Approved by: https://github.com/ezyang	2024-07-31 02:27:22 +00:00
Matthew Hoffman	deb788f6cc	Merge `torch.nn.utils.rnn` type stubs (#131872 ) I want to re-attempt: * #61467 See: * https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730 and this is one of the files I would touch. quoting @ezyang: * https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129 > The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-07-31 02:24:59 +00:00
Matthew Hoffman	78020ea55d	Add functions from `torch.masked._ops` to `__all__` for `torch.masked` (#131288 ) Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error: ``` "mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage] ``` Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288 Approved by: https://github.com/ezyang	2024-07-31 02:16:38 +00:00
Cui, Yifeng	df0494bbba	Clean redundant link libraries for XPU (#131322 ) `torch_xpu` should link to `libtorch_cpu.so` instead of `torch_cpu_library`, otherwise redundant link libraries will contaminate `torch_xpu`, especially when there are MKL in both CPU and XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131322 Approved by: https://github.com/cyyever, https://github.com/ezyang	2024-07-31 02:15:15 +00:00
Xuehai Pan	c07aa1c9c9	[Easy] reorder functions in `torch._jit_internal` (#130531 ) Split from #128633. - #128633 Move commonly used functions (e.g. `is_scripting`) to the top of the module to avoid circular dependency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130531 Approved by: https://github.com/EikanWang, https://github.com/ezyang	2024-07-31 02:12:29 +00:00
Xuehai Pan	fbe6f42dcf	[BE][Easy][8/19] enforce style for empty lines in import segments in `test/[k-p]*/` (#129759 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129759 Approved by: https://github.com/justinchuby, https://github.com/ezyang	2024-07-31 02:09:20 +00:00
atalman	914577569d	Remove python 3.8 nightly builds (#132138 ) Removing python 3.8 support in nightly builds. As per PR: https://github.com/pytorch/pytorch/issues/120718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132138 Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/huydhn	2024-07-31 01:50:03 +00:00
Anshul Sinha	05317cd8f7	[dtensor][be] improving readability and reducing repeating code (#132070 ) Summary I created functions that reduced repeating code in the console and json APIs which also improved their readability for future developers. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070 Approved by: https://github.com/XilunWu	2024-07-31 00:53:36 +00:00
Tianyu Liu	f85feef127	[DTensor] add support for custom op registration (#131108 ) `register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are :class:`DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for ``op``, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators. Here's an example: @register_sharding(aten._softmax.default) def custom_softmax_sharding(x, dim, half_to_float): softmax_dim = dim if dim >= 0 else dim + x.ndim acceptable_shardings = [] all_replicate = ([Replicate()], [Replicate(), None, None]) acceptable_shardings.append(all_replicate) for sharding_dim in range(x.ndim): if sharding_dim != softmax_dim: all_sharded = ( [Shard(sharding_dim)], [Shard(sharding_dim), None, None], ) acceptable_shardings.append(all_sharded) return acceptable_shardings Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108 Approved by: https://github.com/wanchaol	2024-07-31 00:51:16 +00:00
leslie-fang-intel	31205d5198	[Inductor][CPP] Fix Local Buffer issue with inplace result line (#132018 ) Summary If a `global buffer` has been replaced by a `local buffer`, we add this `global buffer` to `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which caused a functional failure in `f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)` In this PR, we resolve this issue by not adding this global buffer to `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-07-31 00:38:17 +00:00
Siyu Yang	882d80fd92	Add lowering for updated _scaled_mm (fixing submodules) (#130422 ) Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683. The lowering does: - for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations. - for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations. The Triton kernel template is based on `3ad9031d02` (D56337896) by @choutim, without using SPLIT_K, and that of mm `torch/_inductor/kernel/mm.py` ## Testing: - Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types. - Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast: - output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row' - P1477224245 - 2 kernels - output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row' - P1477227340 - 2 kernels - UT `python test/inductor/test_fp8.py -- TestFP8Lowering` ## Benchmarking Eager/compiled tensor-wise/row-wise scaling for various shapes: https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669 - Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance. Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes: https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446 ## Questions for reviewers: - Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)? 
## Todo: - Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422 Approved by: https://github.com/ipiszy	2024-07-30 23:48:48 +00:00
Menglu Yu	fdcd2f0dd1	[PT2][Optimus] Add unbind cat to view pass (#132152 ) Summary: We observed new graph transformation opportunity in IG_CTR, which can further remove the cat node. Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/5061a3fe-b788-4031-b3af-66d48564a2df Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199298289131 Network: Up: 2.5GiB Down: 5.7GiB (reSessionID-a49b1234-c02c-4a2d-a9ad-9f5b23557522) Jobs completed: 294061. Time elapsed: 13:47.8s. Cache hits: 68%. Commands: 106996 (cached: 72904, remote: 33875, local: 217) Tests finished: Pass 10. Fail 0. Fatal 0. Skip 1. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 ``` Counter({'pattern_matcher_nodes': 1649, 'pattern_matcher_count': 1538, 'normalization_pass': 343, 'extern_calls': 160, 'normalization_aten_pass': 39, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 9, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1}) before vs after graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1497865201 Differential Revision: D60325668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132152 Approved by: https://github.com/jackiexu1992	2024-07-30 23:27:18 +00:00
Edward Z. Yang	afb04d78c8	Don't try hard to compute alignment of unbacked expressions (#131649 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131649 Approved by: https://github.com/bdhirsh	2024-07-30 23:19:42 +00:00
Yifu Wang	5a33657b31	[micro_pipeline_tp] implement the pass for fused_scaled_matmul_reduce_scatter (#131951 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131951 Approved by: https://github.com/weifengpy	2024-07-30 23:02:49 +00:00
Joel Schlosser	524aac413c	Initial OpInfo-based testing for NJTs (#131704 ) This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing. * New tests in `TestNestedTensorOpInfo` * `test_forward()` - compares forward output to an unbind-based reference * `test_backward()` - compares forward output and grads to an unbind-based reference * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager * To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`. * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc. * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc. * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim * Several xfails were added to get things passing TODO (future PRs): * Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today) * Mixed (NT, T), (T, NT) inputs for binary ops * Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually writing sample_inputs_funcs * Address all xfails via fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704 Approved by: https://github.com/soulitzer ghstack dependencies: #131898	2024-07-30 23:02:24 +00:00
Roy Berger	93facac02c	[NeuralNetInference] Bring up iOS builds (#131917 ) Summary: Mirror Android setup to static link & use lite interpreter on iOS Test Plan: CI Reviewed By: EscapeZero Differential Revision: D60156611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131917 Approved by: https://github.com/cccclai	2024-07-30 23:01:09 +00:00
Wanchao Liang	53a5e0f1a8	[BE] delete spmd module (#132072 ) Summary: as titled, fully delete the spmd module, as we stopped working on it and the code is already broken with no unit tests enabled. We should not keep it in the codebase, as it provides no value anymore and it burdens DTensor with constantly maintaining compatibility with it (i.e. code paths/imports). Test Plan: sandcastle Differential Revision: D60402105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132072 Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/fegin, https://github.com/seemethere, https://github.com/albanD, https://github.com/yifuwang	2024-07-30 22:20:21 +00:00
Songhao Jia	a141334c88	mitigate wrong tensor.dim_order() (#131366 ) Summary: there are some issues with dim order creation. T194410923 has a detailed illustration. One of the reasons is that the `is_contiguous` function may sometimes produce an ambiguous memory format result (some tensors might be both channels_last and contiguous at the same time), and dim order generation relies on the memory format result underneath as a shortcut. To mitigate the issue, we make dim order use the shortcut if and only if the tensor belongs to a single memory format. Otherwise, we still recalculate it. Test Plan: CI Differential Revision: D60056793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366 Approved by: https://github.com/ezyang	2024-07-30 21:58:15 +00:00
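Note on a141334c88: the ambiguity described above can be reproduced with plain stride arithmetic. The sketch below is a simplified model written for this note, not PyTorch's implementation; every function name is hypothetical, and the "ignore strides of size-1 dimensions" rule is a simplifying assumption about how layout checks behave.

```python
def contiguous_strides(shape):
    """Row-major strides for a shape."""
    strides, acc = [], 1
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """NHWC strides expressed in NCHW dimension order (4-D shapes only)."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def matches(shape, strides, expected):
    # Assumption: the stride of a size-1 dimension never affects layout.
    return all(size == 1 or s == e for size, s, e in zip(shape, strides, expected))


def memory_formats(shape, strides):
    """Every memory format the tensor is consistent with."""
    formats = []
    if matches(shape, strides, contiguous_strides(shape)):
        formats.append("contiguous")
    if len(shape) == 4 and matches(shape, strides, channels_last_strides(shape)):
        formats.append("channels_last")
    return formats


def dim_order_from_strides(strides):
    """Dimensions sorted from largest to smallest stride (stable on ties)."""
    return tuple(sorted(range(len(strides)), key=lambda d: -strides[d]))


def dim_order(shape, strides):
    formats = memory_formats(shape, strides)
    # Take the shortcut only when exactly one format matches.
    if formats == ["contiguous"]:
        return tuple(range(len(shape)))
    if formats == ["channels_last"]:
        return (0, 2, 3, 1)
    return dim_order_from_strides(strides)
```

A shape with a single channel, such as `(2, 1, 3, 3)` with strides `(9, 9, 3, 1)`, satisfies both checks in this model, so `dim_order` skips the shortcut and recomputes from the strides.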
Andrew Gu	2b43fab555	[DTensor] Added naive support for `nn.init.orthogonal_` (#132104 ) Try to unblock https://github.com/pytorch/pytorch/issues/131991 - `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness. - `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104 Approved by: https://github.com/albanD, https://github.com/wanchaol	2024-07-30 21:55:09 +00:00
Zain Rizvi	3e142d766a	[EZ] Make consistent with scale-config.yml (#132164 ) Fix inconsistencies from test-infra's scale-config.yml file To be followed up by https://github.com/pytorch/test-infra/pull/5513 which will catch such inconsistencies going forward Pull Request resolved: https://github.com/pytorch/pytorch/pull/132164 Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/zxiiro	2024-07-30 21:42:23 +00:00
Lucas Pasqualin	69c34f6e4c	Corrects Error Codes from cudaHostRegister (#132089 ) Causing some terrible error messages e.g. : ``` # printing directly: cudaError.??? # casting to int first: 712 Traceback (most recent call last): File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module> main() File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main _create_cpu_state_dict(sd, share_memory=True, pin_memory=True) File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict ret = _iterate_state_dict( ^^^^^^^^^^^^^^^^^^^^ File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict ret = { ^ File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp> key: _iterate_state_dict( ^^^^^^^^^^^^^^^^^^^^ File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict ret = tensor_func(iter_object, pg, device, companion_obj) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func succ == 0 AssertionError: Pinning shared memory failed with error-code: cudaError.??? ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089 Approved by: https://github.com/Skylion007	2024-07-30 21:42:00 +00:00
Jiashen Cao	ff377e16ab	Improve logging in the TSConverter (#132082 ) Summary: Currently, running explain with TORCH_LOGS enabled causes duplicate logging because explain uses the exact same code path for conversion. This PR disables logging when running explain, and moves all logging to convert() to prevent logging from __init__ when we are just using explain. Test Plan: Manual testing with attached outputs. Reviewed By: SherlockNoMad, angelayi Differential Revision: D60199007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082 Approved by: https://github.com/ydwu4	2024-07-30 21:37:44 +00:00
Edward Z. Yang	495d413519	Include code object of frame being compiled in stack (#132161 ) This is pretty useful to have! Test plan: https://internalfb.com/intern/fblearner/details/586653862/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132161 Approved by: https://github.com/oulgen	2024-07-30 21:33:27 +00:00
rzou	19db4f6014	[capture_triton] fix special kwargs path (#132143 ) I didn't test this path when creating the orchestrator. This PR fixes that path to work in the capture_triton path. The problem is that we are handling a value that is an int (in the capture_triton path) and a ConstantVariable (in the Dynamo triton path) so we abstract that out in the orchestrator. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143 Approved by: https://github.com/oulgen	2024-07-30 20:30:40 +00:00
Xintong Hu	1118c74b5f	[PT2] Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes (#131902 ) (#132078 ) Summary: Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes Test Plan: run new UTs Reviewed By: frank-wei Differential Revision: D60258724 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132078 Approved by: https://github.com/frank-wei	2024-07-30 20:17:06 +00:00
Joel Schlosser	d53b11bb6e	Strict shape checking for NJTs with TestCase.assertEqual() (#131898 ) Background: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal. This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal. Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898 Approved by: https://github.com/soulitzer	2024-07-30 20:05:48 +00:00
Shuai Yang	58f76bc301	Revise skip torchrec logic (#130783 ) Summary: The previous logic adds skipped files when the file is imported, which happens at a very early stage. However, we could set skip_torchrec at a later stage (e.g., in APS, we set it during the trainer execution). In that case, the skip logic will still take effect since the skipped files have already been added. So in this diff, we revise the logic so that it can adapt to changes of skip_torchrec at later stages. Test Plan: Tested on APS models: buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2 commit: 2fb485d9e torchrec related paths were not skipped. Differential Revision: D59779153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783 Approved by: https://github.com/yanboliang	2024-07-30 19:55:20 +00:00
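Note on 58f76bc301: the problem is the usual difference between reading a flag once at import time and reading it on every check. A minimal sketch of that difference follows; `Config`, `should_skip_eager`, and `should_skip_lazy` are hypothetical names for this note and do not mirror the Dynamo code.

```python
class Config:
    """Hypothetical stand-in for a module-level config object."""
    skip_torchrec = True


config = Config()

# Old behaviour (sketch): the skip set is built once, at import time.
_IMPORT_TIME_SKIPS = {"torchrec"} if config.skip_torchrec else set()


def should_skip_eager(name: str) -> bool:
    # Ignores any later change to config.skip_torchrec.
    return name in _IMPORT_TIME_SKIPS


def should_skip_lazy(name: str) -> bool:
    # Revised behaviour (sketch): consult the flag on every call.
    return config.skip_torchrec and name == "torchrec"


# The flag is flipped later, e.g. during trainer setup.
config.skip_torchrec = False
```

After the flag is flipped, the eager check still skips `torchrec` while the lazy check respects the new setting.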
Li-Huai (Allan) Lin	964f97539f	[MPS] Correct nonzero warning and fix the test (#132127 ) #125355 lifted the natively supported macOS version to 14. Fixes #132110 Probably fixes this flaky test disabling issue: #126492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127 Approved by: https://github.com/malfet	2024-07-30 19:46:25 +00:00
Edward Z. Yang	f2dedc910e	Improve SpeculationLog error message (#131982 ) There are some substantive changes. Instead of recording the next instruction in the speculation log, I record the current instruction. I think this is more intuitive, we always call speculation at the beginning of executing an instruction, so logically, the entry is associated with the current instruction. (Note that self.instruction_pointer is next instruction, as conventionally we increment IP before calling speculate). The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction). At time of submission, this test case triggered the error: ``` diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py index 5ade17856e1..60ef89be346 100644 --- a/test/distributed/test_dynamo_distributed.py +++ b/test/distributed/test_dynamo_distributed.py @@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase): for r in res[1:]: self.assertEqual(res[0], r) + @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch") + @config.patch(enable_compiler_collectives=True) + def test_compiler_collectives_automatic_dynamic_speculation_divergence(self): + with _dynamo_dist_per_rank_init(self.rank, self.world_size): + torch._dynamo.utils.clear_compilation_metrics() + + # TODO: This should be possible to do inside the function, but + device = f"cuda:{self.rank}" + + @torch.compile() + def f(x, y): + zx = x.shape + zy = y.shape + return x.sum() + y.sum() + + if self.rank == 0: + dataloader = [4, 4] + else: + dataloader = [3, 4] + + for data in dataloader: + f( + torch.randn(data, device=self.rank), + torch.randn(data, device=self.rank), + ) + + metrics = torch._dynamo.utils.get_compilation_metrics() + # Number of compiles same on all nodes + res = 
[None] * self.world_size + torch.distributed.all_gather_object(res, len(metrics)) + for r in res[1:]: + self.assertEqual(res[0], r) + @requires_nccl() ``` although I plan to fix this soon. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982 Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel	2024-07-30 19:21:31 +00:00
Joel Schlosser	e6cddc9271	Fix public API tests (#131386 ) This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in: * `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers * `torch/library.py` - add `register_vmap` to `__all__` * `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore * `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API * `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386 Approved by: https://github.com/albanD	2024-07-30 18:42:54 +00:00
Aidyn-A	f217b470cc	[CMAKE] Avoid double setting of LDFLAGS (#130370 ) It was observed that in some environments `LDFLAGS` gets directly appended to `CMAKE_SHARED_LINKER_FLAGS`. As a result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS` because it is also set manually at: `1bf4a44b33/CMakeLists.txt (L541-L542)` This flag collision causes build failures at the `cmake` stage. This PR adds an instruction to `CMakeLists.txt` to avoid double setting of `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS`. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370 Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet	2024-07-30 18:16:04 +00:00
Jane Xu	3816f6420a	[BE] remove unnecessary _dispatch_sqrt by using 0.5 (#131358 ) Based on the discussion here where 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358 Approved by: https://github.com/albanD	2024-07-30 18:08:17 +00:00
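Note on 3816f6420a: raising to the power 0.5 works through the ordinary `**` operator, so it needs no type dispatch. The sketch below assumes `_dispatch_sqrt` was a helper that picked a square-root routine by argument type (its body is not shown in this log); `Vec`, `dispatch_sqrt`, and `sqrt_by_pow` are hypothetical names used only here.

```python
import math


class Vec:
    """Hypothetical elementwise container standing in for a tensor type."""

    def __init__(self, values):
        self.values = list(values)

    def __pow__(self, exponent):
        return Vec(v ** exponent for v in self.values)


def dispatch_sqrt(x):
    # Type-dispatching helper of the kind the commit removes (assumed shape).
    if isinstance(x, Vec):
        return Vec(math.sqrt(v) for v in x.values)
    return math.sqrt(x)


def sqrt_by_pow(x):
    # One expression covers floats and any type that implements __pow__.
    return x ** 0.5
```

The two approaches agree to within floating-point rounding, which is why comparisons should use a tolerance rather than exact equality.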
Aos Dabbagh	9f6d7df3d9	docs(multinomial): Add reference to `Multinomial` class (#131904 ) This PR just adds the reference to the class `torch.distributions.multinomial.Multinomial` in `torch.multinomial`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131904 Approved by: https://github.com/jbschlosser	2024-07-30 18:05:07 +00:00
PyTorch MergeBot	239d4d2489	Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit 9606d61e0c921b886d20cb61454043c6c270ae89. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/ZainRizvi due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2258871791))	2024-07-30 17:39:41 +00:00
Tristan Rice	9027db1ab8	TCPStore: fix remote address (#131773 ) (#131913 ) Summary: This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo. This relands it since it got reverted due to a fmt::format issue internally. Original Pull Request: https://github.com/pytorch/pytorch/pull/131773 Approved by: https://github.com/kurman Test Plan: Enable debug logs and verify addresses are correct ``` TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v buck2 test @//mode/dev-nosan //caffe2/test/distributed:store ``` Differential Revision: D60296583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131913 Approved by: https://github.com/kurman, https://github.com/rsdcastro, https://github.com/Skylion007	2024-07-30 17:27:33 +00:00
Florian	3864a2d834	[profiler ut] Update event name in test_profiler.py (#131757 ) Supports kernel names that contain uppercase letters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131757 Approved by: https://github.com/aaronenyeshi	2024-07-30 17:15:31 +00:00
Yidi Wu	32c57e78ed	Specialize sym node when used as device kwarg (#131811 ) Fixes https://github.com/pytorch/pytorch/issues/131189. We specialize the symint in python_arg_parser when used as kwarg device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131811 Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/albanD	2024-07-30 17:11:57 +00:00
Andrew Gu	33ce9cf7f9	[FSDP2] Relaxed overlap timing check to avoid flakiness (#132116 ) Trying to fix https://github.com/pytorch/pytorch/issues/131081 See https://github.com/pytorch/pytorch/issues/131081#issuecomment-2239443504 for detailed context. This PR is relaxing one assertion against the _baseline_ to try to fix the flakiness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132116 Approved by: https://github.com/Skylion007	2024-07-30 14:28:12 +00:00
Jeeja	16e0868a3d	[FSDP] Add hpu device to _get_remote_device_str (#132120 ) When creating chunk_sharded_tensor, _get_remote_device_str is used. By default it uses the node count to determine the device:instance. For hpu, the current device must be used to get the device instance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120 Approved by: https://github.com/awgu	2024-07-30 14:24:24 +00:00
Guilherme Leobas	a843178529	Let dynamo inline functional_call (#128646 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646 Approved by: https://github.com/zou3519	2024-07-30 14:22:23 +00:00
Shreyans Pathak	12b67bd998	Fix pyi annotation for `ProcessGroupGloo.Options` (#132080 ) This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file. Fixes #132054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080 Approved by: https://github.com/Skylion007	2024-07-30 13:52:31 +00:00
PyTorch MergeBot	499ead96ff	Revert "Grouped Query Attention (#128898 )" This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))	2024-07-30 13:11:24 +00:00
cyy	bdf57da6a6	[3/N] Enable clang-tidy on torch/csrc/inductor (#132101 ) Follows #132040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132101 Approved by: https://github.com/Skylion007	2024-07-30 13:04:57 +00:00
cyy	eccbd408e5	[10/N] Fix clang-tidy warnings in jit (#132122 ) Follows #132010 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122 Approved by: https://github.com/Skylion007	2024-07-30 12:56:31 +00:00
Sijia Chen	83db609ee5	[inductor] fix the cudagraph tree test (#132043 ) Summary: There are two kinds of exceptions: Case #1: ``` static input data pointer changed. input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward return self.static_tensor + x + self.goo(x) File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward return self.linear(x) input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward self.static_tensor.add_(torch.ones((2, 2), device="cuda")) ``` Case #2: ``` static input data pointer changed. input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace: File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward self.static_tensor.add_(torch.ones((2, 2), device="cuda")) ``` The current impl only covered the case #2 Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476 Differential Revision: D60340212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043 Approved by: https://github.com/BoyuanFeng	2024-07-30 08:35:56 +00:00
Menglu Yu	36e8289129	[PT2][Optimus] Optimize cat node inputs pattern (#131866 ) Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes ``` # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 ``` Counter({'pattern_matcher_nodes': 1589, 'pattern_matcher_count': 1497, 'extern_calls': 393, 'normalization_pass': 342, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 12, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1}) P1496150856 Differential Revision: D60274533 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131866 Approved by: https://github.com/jackiexu1992	2024-07-30 07:49:26 +00:00
Yanbo Liang	54d4f6bbca	[Inductor][FlexAttention] Correct partial/full blocks naming (#131993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131993 Approved by: https://github.com/drisspg	2024-07-30 06:40:40 +00:00
Animesh Jain	03e058189e	[dynamo] Support dict unpack of MutableMapping objects (#131961 ) Fixes https://github.com/pytorch/pytorch/issues/128067 The basic functionality was already introduced earlier. This just ensures that we support UserDefinedObjectVariable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131961 Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/yanboliang ghstack dependencies: #131827, #131956	2024-07-30 05:49:58 +00:00
Animesh Jain	f806128619	[dynamo] Skip <frozen abc> to skip __isisintance__ check on abc objects (#131956 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131956 Approved by: https://github.com/williamwen42, https://github.com/mlazos ghstack dependencies: #131827	2024-07-30 05:49:58 +00:00
Animesh Jain	13457d1da0	[dynamo][log] Suggest to use pytree when graph-break on optree (#131827 ) Discovered while working on https://github.com/pytorch/pytorch/issues/121369 On the model above, the log looks like this ~~~ /home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree._C.PyCapsule.flatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py. torch._dynamo.utils.warn_once(msg) /home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree.PyCapsule.unflatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py. torch._dynamo.utils.warn_once(msg) ~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131827 Approved by: https://github.com/zou3519, https://github.com/mlazos	2024-07-30 05:49:58 +00:00
Jiang, Yanbing	fc6066b80f	improve mkldnn_linear_pointwise_binary performance for contiguous tensor with non default contiguous strides (#132019 ) Fixes https://github.com/pytorch/pytorch/issues/131734 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132019 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-07-30 05:02:38 +00:00
PyTorch UpdateBot	40f8db5741	[audio hash update] update the pinned audio hash (#132105 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132105 Approved by: https://github.com/pytorchbot	2024-07-30 03:39:27 +00:00
Xu Han	aa1488fe02	[inductor] turn on enable_kernel_profile on Windows. (#132025 ) Enable `TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE` on Windows inductor. Local tested pass: ![image](https://github.com/user-attachments/assets/a82351af-cc56-4ba1-a8f4-08f1c38713d1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132025 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-30 03:02:09 +00:00
Xu Han	475da800c7	[inductor] optimize cflags for Windows. (#131980 ) changes: 1. optimize cflags for Windows. Ref: https://github.com/pytorch/pytorch/blob/v2.4.0/torch/utils/cpp_extension.py#L215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131980 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-30 02:59:51 +00:00
Xu Han	bdc42e3fb8	[inductor] validate_can_generate_cpp_wrapper add win32 support. (#131978 ) Changes: 1. `validate_can_generate_cpp_wrapper` add win32 support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131978 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-30 02:59:48 +00:00
eellison	baa4c9ca46	Optimize aten.cat calls of a repeated element (#132081 ) This was a particular problem for a model I saw which would have a large number of repeats, making compilation slow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132081 Approved by: https://github.com/shunting314	2024-07-30 02:56:00 +00:00
leslie-fang-intel	f8e4060484	[Inductor][CPP] Enhance cppcsevar data type deduce (#130827 ) Summary Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](`096dc444ce/torch/_inductor/codegen/common.py (L1844)`)), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction: - We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`. - To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-30 02:51:31 +00:00
William Wen	b6c1490cc0	[dynamo] make more unpack_var_sequence calls forced (#132069 ) Fixes [T197204962](https://www.internalfb.com/intern/tasks/?t=197204962) (example failure: https://www.internalfb.com/intern/testinfra/diagnostics/11540474088277914.281475138576374.1722221031/) Added tests contain a simple repro for the observed failure (`test_map_unpack_vars`). Also fixes https://github.com/pytorch/pytorch/issues/132044 Differential Revision: [D60420335](https://our.internmc.facebook.com/intern/diff/D60420335) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132069 Approved by: https://github.com/anijain2305	2024-07-30 02:30:08 +00:00
Aaron Orenstein	8721b21b38	Fix fake_tensor w/ non-view tensor (#132050 ) Summary: This code was overly complex and is confusing some guards - basically if a result cached tensor isn't a view there's no reason to be messing with its storage. Test Plan: unit tests pass Differential Revision: D60387821 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050 Approved by: https://github.com/oulgen	2024-07-30 02:17:18 +00:00
eellison	9598c58618	Add config option to skip autotuning conv (#131839 ) requested internally bc for some models the conv templates are not very helpful Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839 Approved by: https://github.com/oulgen ghstack dependencies: #131400	2024-07-30 01:57:53 +00:00
zhouyusong	5a2620302b	[inductor] Replace self_cuda_time_total function calls with self_dev… (#131029 ) …ice_time_total for wrapper_bench Pull Request resolved: https://github.com/pytorch/pytorch/pull/131029 Approved by: https://github.com/shunting314	2024-07-30 01:57:39 +00:00
Li-Huai (Allan) Lin	a147fa577b	[MPS] Fix masked_fill_ in non_contiguous cases (#131957 ) fixes #131285 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957 Approved by: https://github.com/DenisVieriu97	2024-07-30 01:34:48 +00:00
blaine-rister	3716934b1a	[Inductor] Refactor autotuning utils to compute max block sizes (#131730 ) These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc. Differential Revision: D60195669 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730 Approved by: https://github.com/eellison	2024-07-30 01:04:53 +00:00
PyTorch MergeBot	7a7dd8c29e	Revert "[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518 )" This reverts commit bcf5c68c18c6a109e1fa00829eea0428d44cfb6b. Reverted https://github.com/pytorch/pytorch/pull/131518 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit (the final PR and diff must always be identical). Conflicts arise when that happens which block the diff train. Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131518#issuecomment-2257259839))	2024-07-30 00:55:10 +00:00
angelayi	ab9791c0e3	[export] Add print_readable to unflattener (#128617 ) Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](`17b45e905a/torch/fx/graph_module.py (L824)`)), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module. Example print from `python test/export/test_unflatten.py -k test_unflatten_nested` ``` class UnflattenedModule(torch.nn.Module): def forward(self, x: "f32[2, 3]"): # No stacktrace found for following nodes rootparam: "f32[2, 3]" = self.rootparam # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None # No stacktrace found for following nodes foo: "f32[2, 3]" = self.foo(mul); mul = None bar: "f32[2, 3]" = self.bar(foo); foo = None return (bar,) class foo(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # No stacktrace found for following nodes child1param: "f32[2, 3]" = self.child1param nested: "f32[2, 3]" = self.nested(mul); mul = None # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None return add class nested(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None return div class bar(torch.nn.Module): def forward(self, add: "f32[2, 3]"): # No stacktrace found for following nodes child2buffer: "f32[2, 3]" = self.child2buffer # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None return sub ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/128617 Approved by: https://github.com/zhxchen17, https://github.com/pianpwk	2024-07-30 00:41:44 +00:00
eellison	2a4d9aa548	Disable expandable segments checkpointing internally (#132048 ) Differential Revision: [D60388286](https://our.internmc.facebook.com/intern/diff/D60388286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132048 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-07-30 00:26:39 +00:00
PyTorch MergeBot	be5e44192d	Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519 )" This reverts commit 8fe2bf212dc5e01b15cbe728958f940873230d64. Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit. Weird conflicts arise when that happens. Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2257230717))	2024-07-30 00:18:22 +00:00
Bin Bao	b1ccd0c407	[CI] Update environment variable setting for aarch64 (#132046 ) Summary: JEMALLOC_LIB and core_number need to be set differently on aarch64. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132046 Approved by: https://github.com/huydhn	2024-07-30 00:09:59 +00:00
yuqingj	e3dc20c94b	[NJT] support cat backward (#132076 ) cat_tensors_backward use narrow_symint, so we need to support aten::narrow for NJT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132076 Approved by: https://github.com/davidberard98	2024-07-29 23:49:26 +00:00
Yuzhen Huang	5298acb5c7	Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (#132065 ) Summary: Original commit changeset: 1d8cfdcef69d Original Phabricator Diff: D54134695 back out: D54134695 Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc Reviewed By: zw2326 Differential Revision: D60397377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065 Approved by: https://github.com/zw2326, https://github.com/qchip	2024-07-29 22:48:29 +00:00
eellison	8b507a922a	Mode to emulate amp numerics (#131595 ) ``` # Mode to emulate pytorch eager numerics for lower precision (fp16, bf16) # Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after # For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts # Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging # to emulate the eager numerics. ``` We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching. In theory there could also be other casts with fused reduction -> reduction, although I haven't seen this much in practice; that could be done as a follow-up. Note: this only works with the cuda backend right now. This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595 Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel	2024-07-29 22:42:23 +00:00
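The gap between eager and fused numerics described in the commit above can be illustrated with a minimal sketch. It assumes Python's `struct` half-precision format (`"e"`) as a stand-in for fp16 and plain Python floats as a stand-in for the higher-precision compute type; `to_fp16`, `eager_add3`, and `fused_add3` are hypothetical names for illustration and are not Inductor APIs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def eager_add3(a: float, b: float, c: float) -> float:
    # Eager-style (assumed model): downcast to fp16 after every pointwise op.
    t = to_fp16(a + b)
    return to_fp16(t + c)


def fused_add3(a: float, b: float, c: float) -> float:
    # Fused-style (assumed model): keep the intermediate in high precision
    # and downcast only once at the end.
    return to_fp16(a + b + c)


a, b, c = 2048.0, 1.0, 1.0
eager_result = eager_add3(a, b, c)  # the intermediate 2049 is not representable in fp16
fused_result = fused_add3(a, b, c)  # the final 2050 is representable in fp16
print(eager_result, fused_result)
```

The two results differ because the eager-style path rounds the intermediate sum, while the fused path rounds only the final value. This is the kind of discrepancy an emulation mode would reintroduce on purpose for debugging.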
soulitzer	884eadcd19	Fix multi grad hooks thread safety (#132055 ) Thanks @awgu for spotting this Pull Request resolved: https://github.com/pytorch/pytorch/pull/132055 Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/albanD	2024-07-29 22:32:59 +00:00
Edward Z. Yang	e55e9d8126	Clear speculation log when restarting due to compiler collective (#131983 ) The compiler collective can trigger an input to become dynamic, which can trigger operations to be recorded to the graph, which would change the speculation log entries (since they only start being recorded once we have a non-empty output graph). Test case triggers this situation. Production instance: https://www.internalfb.com/mlhub/pipelines/runs/mast/f584750649-TrainingApplication?job_attempt=2&version=0&env=PRODUCTION Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131983 Approved by: https://github.com/anijain2305, https://github.com/mlazos	2024-07-29 22:32:10 +00:00
PyTorch MergeBot	62b2e7a553	Revert "Add config option to skip autotuning conv (#131839 )" This reverts commit 3d4de8e96d0bb1fe19b25734a97a19dd85313692. Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))	2024-07-29 22:31:51 +00:00
Janani Sriram	8fe2bf212d	[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519 ) Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor. Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features. Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519 Approved by: https://github.com/davidberard98 ghstack dependencies: #131518	2024-07-29 22:16:32 +00:00
jainapurva	d039b14207	Grouped Query Attention (#128898 ) ### Approach: Using the current function declaration Constraint: Q_Heads % KV_Heads == 0 Major change: - Added a new argument enable_gqa: bool to sdpa function call - It adds a meaning to the last third dimension. Sample use cases this would enable: LLama3 ``` # LLama3 8b call to SDPA query = torch.rand(batch, 32, seq_len_q, D) key = torch.rand(batch, 8, seq_len_kv, D) value = torch.rand(batch, 8, seq_len_kv, D) output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True) # Output Shape (batch, 32, seq_len_q, D) ``` ### Design Choice: - Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0 - The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms. - By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged. 
### Benchmarks:
- sdpa.py: #130634 For different batch sizes, enable_gqa=True shows a substantial improvement in the run time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
| 1 | 32 | 8 | 2048 | 2048 | 2048 | 100.71 | 119.70 |
| 8 | 32 | 8 | 2048 | 2048 | 2048 | 539.78 | 628.83 |
| 16 | 32 | 8 | 2048 | 2048 | 2048 | 1056.81 | 1225.48 |
| 32 | 32 | 8 | 2048 | 2048 | 2048 | 2099.54 | 2440.45 |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)
- TorchTitan: https://github.com/pytorch/torchtitan/pull/458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-29 21:49:06 +00:00
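The head-matching rule in the Grouped Query Attention commit above (the `Q_Heads % KV_Heads == 0` constraint plus repeating key/value heads) can be sketched in plain Python without tensors. This is only an illustration of the index mapping that a `repeat_interleave` along the head dimension implies; `kv_head_for_query_head` and `expand_kv_heads` are hypothetical helper names, not part of the PyTorch API.

```python
def kv_head_for_query_head(q_head: int, q_heads: int, kv_heads: int) -> int:
    """Return the key/value head index that serves a given query head."""
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group_size = q_heads // kv_heads  # query heads sharing one KV head
    return q_head // group_size


def expand_kv_heads(kv: list, q_heads: int) -> list:
    """Repeat each KV head consecutively so the list has q_heads entries."""
    kv_heads = len(kv)
    return [kv[kv_head_for_query_head(h, q_heads, kv_heads)] for h in range(q_heads)]


# Llama 3 8B-style head counts from the commit message: 32 query heads, 8 KV heads.
mapping = [kv_head_for_query_head(h, 32, 8) for h in range(32)]
print(mapping[:8])
print(expand_kv_heads(["k0", "k1"], 4))
```

With 32 query heads and 8 key/value heads, each key/value head is shared by a group of 4 consecutive query heads.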
Yang Chen	05a8540041	[cpp-wrapper] create null pointer for zero-size array (#132023 ) zero-size array is not supported in the C or C++ standard, so we create a null pointer for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132023 Approved by: https://github.com/desertfire	2024-07-29 21:40:33 +00:00
Andrew Gu	d8358a2d86	Made `register_multi_grad_hook` return type `RemovableHandle` (#132074 ) `_MultiHandle` is private. Let us return `RemovableHandle`, which is public. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132074 Approved by: https://github.com/soulitzer	2024-07-29 21:29:34 +00:00
PyTorch MergeBot	d5e9fbb012	Revert "BE: reset dynamo before each test in test_module.py (#131372 )" This reverts commit 527901f054a947976dc587bb9cf72c86992b7c87. Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](`ca8153ae67`) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))	2024-07-29 21:15:25 +00:00
PyTorch MergeBot	a4723b566f	Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397 )" This reverts commit ca8153ae6758fbf33cc767cfd0cb384b87b8d3ca. Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](`ca8153ae67`) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))	2024-07-29 21:15:25 +00:00
Tom Ritchford	bdf5a6dca9	Add decomposition for unsqueeze_copy (#130942 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942 Approved by: https://github.com/peterbell10	2024-07-29 21:13:37 +00:00
Yanbo Liang	3c1562158e	[BE] Fix torch.compile docstring formatting issues (#131837 ) Fixes #131815 <img width="1098" alt="Screenshot 2024-07-25 at 6 58 39 PM" src="https://github.com/user-attachments/assets/d0f6edc3-419e-4096-803b-cecd45d8644b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131837 Approved by: https://github.com/williamwen42	2024-07-29 20:52:28 +00:00
Simon Mahns	dcb03106b7	[Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007 ) Summary: as title Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962 Differential Revision: D60335413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007 Approved by: https://github.com/hanzlfs, https://github.com/egienvalue	2024-07-29 20:47:18 +00:00
Joona Havukainen	082d0b80ca	Min and max NaN propagation fix in MPS backend (#130445 ) Partial fix to issue #130295 Moves min and max ops to use the NaN propagating API in MPS to align with the pytorch convention. Adds a regression test to validate the fix achieves parity with cpu backend. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445 Approved by: https://github.com/malfet	2024-07-29 20:09:15 +00:00
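As a rough illustration of what NaN propagation means for min and max in the commit above, here is a minimal pure-Python sketch. It is not the MPS implementation; `nan_propagating_max` and `nan_propagating_min` are hypothetical helpers, and they assume a non-empty iterable of floats.

```python
import math


def nan_propagating_max(values):
    """Return NaN if any element is NaN, otherwise the largest element."""
    items = list(values)
    if any(math.isnan(v) for v in items):
        return math.nan
    return max(items)


def nan_propagating_min(values):
    """Return NaN if any element is NaN, otherwise the smallest element."""
    items = list(values)
    if any(math.isnan(v) for v in items):
        return math.nan
    return min(items)


# Python's built-in max depends on argument order when NaN is involved,
# which is the kind of inconsistency a NaN-propagating reduction avoids.
print(max(1.0, math.nan), max(math.nan, 1.0))
print(nan_propagating_max([1.0, math.nan, 3.0]))
```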
Animesh Jain	f44446e851	[dynamo] Turn on inline_inbuilt_nn_modules (#131275 ) Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696 Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644) Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9) Inference sees a little bit more perf degradation but we are ok with that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275 Approved by: https://github.com/ezyang, https://github.com/jansel ghstack dependencies: #132053	2024-07-29 20:01:51 +00:00
Sam Larsen	4c2bcf92cb	[inductor] Enable FX graph caching in OSS by default (#125863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863 Approved by: https://github.com/eellison, https://github.com/oulgen	2024-07-29 19:19:54 +00:00
Xu Han	484852c02b	[Doc] update guide install mkl-static from conda to pip (#130026 ) <img width="619" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4ac3ca68-57dc-42c7-ac7a-876dc377ebcf"> The Conda intel channel is no longer available. Use `pip` to install instead of `conda`. `Windows` and `Linux` are available: Binary list: https://pypi.org/project/mkl-static/#files `MacOS` is available for an old version: https://pypi.org/project/mkl-static/2021.3.0/#files TODO: 1. cherry-pick to `release/2.4` branch, @atalman . 2. fix it also in `release/2.3` branch: https://github.com/pytorch/pytorch/pull/131853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130026 Approved by: https://github.com/jgong5, https://github.com/atalman	2024-07-29 19:19:15 +00:00
Aidyn-A	301ec32ae8	[EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059 ) Per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059 Approved by: https://github.com/Skylion007	2024-07-29 19:15:37 +00:00
Xuehai Pan	5cc34f61d1	[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 ) Add a new label `ci-test-showlocals` and add it to the test config filter. If the PR is labeled with `ci-test-showlocals`, or `ci-test-showlocals` is present in the PR comment, the test config filter will set an environment variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on failures for better debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981 Approved by: https://github.com/malfet ghstack dependencies: #131151	2024-07-29 18:53:14 +00:00
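The environment-variable gating described in the commit above might look roughly like the following sketch. Only the variable name `TEST_SHOWLOCALS` and the `--showlocals --tb=long` pytest flags come from this log; the helper `build_pytest_args` and the set of values treated as "enabled" are assumptions made for illustration, not the actual CI code.

```python
import os


def build_pytest_args(base_args, environ=None):
    """Append verbose-failure flags when TEST_SHOWLOCALS is enabled."""
    environ = os.environ if environ is None else environ
    args = list(base_args)
    # Assumption: these string values count as "enabled".
    if environ.get("TEST_SHOWLOCALS", "").lower() in ("1", "true", "yes", "on"):
        args += ["--showlocals", "--tb=long"]
    return args


print(build_pytest_args(["test/test_foo.py"], {"TEST_SHOWLOCALS": "1"}))
print(build_pytest_args(["test/test_foo.py"], {}))
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.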
Xuehai Pan	4694ee1ad2	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. 
if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. 
return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(*args, **kwargs): try: > retval = cfunc(*args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-29 18:53:14 +00:00
cyy	ab912b7fef	[2/N] Fix clang-tidy warnings in inductor (#132040 ) Follows #131979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132040 Approved by: https://github.com/Skylion007	2024-07-29 18:41:24 +00:00
cyy	c764ef6d53	[9/N] Fix clang-tidy warnings in jit (#132010 ) Follows #131997 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132010 Approved by: https://github.com/Skylion007	2024-07-29 18:38:35 +00:00
Animesh Jain	f389bca2e9	[dynamo][inline_inbuilt_nn_modules] Skip test_dpp_graphs for now (#132053 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132053 Approved by: https://github.com/laithsakka	2024-07-29 17:59:47 +00:00
Edward Z. Yang	6c6fbb4691	Fix pyi annotation for ProcessGroupNCCL.Options (#130957 ) Probably all the other options need updating too, but this is the one I needed. The accurate annotation was determined by reading torch/csrc/distributed/c10d/init.cpp Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2024-07-29 17:46:01 +00:00
Yang Chen	025242d065	[cpu-test] enable test_cpu_repro in fbcode (#132022 ) Summary: This diff enables test_cpu_repro in fbcode Test Plan: ci Differential Revision: D60364517 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132022 Approved by: https://github.com/desertfire	2024-07-29 17:45:26 +00:00
Shunting Zhang	ca8153ae67	BE: reset dynamo before each test in test_ops_gradients.py (#131397 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR resets dynamo only before each test in `test_ops_gradients.py`, which makes it easier to land. Once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397 Approved by: https://github.com/zou3519 ghstack dependencies: #131551, #131388, #131372	2024-07-29 17:39:23 +00:00
Shunting Zhang	527901f054	BE: reset dynamo before each test in test_module.py (#131372 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR resets dynamo only before each test in `test_module.py`, which makes it easier to land. Once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372 Approved by: https://github.com/zou3519 ghstack dependencies: #131551, #131388	2024-07-29 17:39:23 +00:00
Aaron Gokaslan	bd1a29b158	[BE][Ez]: Update ruff to 0.5.5. Bugfixes and better LSP support (#132037 ) Updates ruff to the latest and greatest, mainly better LSP support and bugfixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/132037 Approved by: https://github.com/malfet	2024-07-29 16:57:13 +00:00
PyTorch MergeBot	6cf493158e	Revert "Enable FlashAttention on Windows (#131906 )" This reverts commit b90bc66766c3503c1f229660710a803488d53c16. Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))	2024-07-29 16:49:23 +00:00
eellison	3d4de8e96d	Add config option to skip autotuning conv (#131839 ) Requested internally because, for some models, the conv templates are not very helpful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839 Approved by: https://github.com/oulgen ghstack dependencies: #131400	2024-07-29 16:43:58 +00:00
PyTorch MergeBot	e73a4cb21f	Revert "[pt2e][quant] Ensure BN node is erased after convert (#131651 )" This reverts commit eba2ffd278a004df8fd335328ab8ba00c978e471. Reverted https://github.com/pytorch/pytorch/pull/131651 on behalf of https://github.com/ZainRizvi due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131651#issuecomment-2256407968))	2024-07-29 16:42:24 +00:00
PyTorch MergeBot	f72266ecea	Revert "Let dynamo inline functional_call (#128646 )" This reverts commit 5aab1acc84ff4a4374c9ddd179be48b07c6c8a74. Reverted https://github.com/pytorch/pytorch/pull/128646 on behalf of https://github.com/clee2000 due to the newly added test dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers [GH job link](https://github.com/pytorch/pytorch/actions/runs/10147452270/job/28058682000) [HUD commit link](`5aab1acc84`) is broken, probably a landrace since it passed on PR ([comment](https://github.com/pytorch/pytorch/pull/128646#issuecomment-2256375501))	2024-07-29 16:26:50 +00:00
Tom Ritchford	962f248437	Add decomposition for expand_copy (#130940 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940 Approved by: https://github.com/peterbell10	2024-07-29 16:23:56 +00:00
rzou	e393c7fa05	Tighten torch.library.infer_schema input types (#130705 ) Made the following changes: - mutates_args is now keyword-only and mandatory. This is to align with torch.library.custom_op (which makes it mandatory because it's easy to miss) - op_name is now keyword-only. This helps the readability of the API - updated all usages of infer_schema This change is not BC-breaking because we introduced torch.library.infer_schema a couple of days ago. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705 Approved by: https://github.com/yushangdi ghstack dependencies: #131777	2024-07-29 16:01:19 +00:00
PyTorch MergeBot	957a89f56c	Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 )" This reverts commit 03760be2714c6ed3b4f44c4dc3ea016f557d8597. Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](`03760be271`) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))	2024-07-29 15:52:08 +00:00
Aaron Gokaslan	ca254d145f	[BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036 ) Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036 Approved by: https://github.com/malfet	2024-07-29 15:50:00 +00:00
Guilherme Leobas	5aab1acc84	Let dynamo inline functional_call (#128646 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646 Approved by: https://github.com/zou3519 ghstack dependencies: #129091, #130490	2024-07-29 15:41:03 +00:00
Guilherme Leobas	e0e4e84ef9	wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #129091	2024-07-29 15:41:03 +00:00
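The try/finally pattern named in #130490 can be illustrated in plain Python. This is a minimal sketch, not Dynamo's code: `CallState` and its `call_function` signature are hypothetical, and only the attribute name `kw_names` comes from the commit title. The point is that per-call state is restored even when the wrapped call raises.

```python
class CallState:
    """Hypothetical stand-in for an object that carries per-call state."""

    def __init__(self):
        self.kw_names = ()

    def call_function(self, fn, args, kw_names):
        # Remember the previous value so it can be restored afterwards.
        saved = self.kw_names
        self.kw_names = kw_names
        try:
            return fn(*args)
        finally:
            # Runs on both normal return and exception, so a failing call
            # cannot leave stale kw_names behind for the next call.
            self.kw_names = saved
```

Without the `finally` clause, an exception inside `fn` would skip the restore step and the next call would observe the leftover `kw_names`.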
Guilherme Leobas	1e9cdf7d91	Relax constraints for creating a `GenericContextWrappingVariable` (#129091 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2024-07-29 15:40:59 +00:00
Brian Hirsh	6cbad37bee	make `_inductor.config.rocm.supported_arch` set order deterministic for caching (#131921 ) This fixes some AOTAutograd caching tests that were failing flakily internally because they would occasionally cache miss. [T195598220](https://www.internalfb.com/intern/tasks/?t=195598220) I found it by running some stress tests and diffing the AOT cache information on each run, and ended up with this diff (`rocm.supported_arch` order was changing from run to run, although apparently not in OSS): ``` --- tmpa.txt 2024-07-26 11:03:46.220924798 -0700 +++ tmpb.txt 2024-07-26 11:03:44.053586437 -0700 @@ -1,4 +1,4 @@ -Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74: +Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh: [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False) [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False @@ -184,7 +184,7 @@ [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False -[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'} +[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'} [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False @@ 
-231,7 +231,7 @@ [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[verbose_progress]: False [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[warn_mix_layout]: False [a44txxznx23htuc7zxw7larc7yxpxzxmiqzloxznw7z2k2azqj3] inductor_config[worker_start_method]: fork -Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74: +Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh: [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False) [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False @@ -417,7 +417,7 @@ [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False -[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'} +[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'} [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131921 Approved by: https://github.com/jamesjwu, https://github.com/oulgen	2024-07-29 15:29:04 +00:00
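The cache misses described in #131921 come from hashing the text form of a set, whose element order is not guaranteed to be stable between runs. A minimal sketch of one way to make such a key order-independent follows; `cache_key` is a hypothetical helper written for illustration and is not the Inductor or AOTAutograd hashing code.

```python
import hashlib


def cache_key(config: dict) -> str:
    """Hash a config dict, normalizing sets so element order cannot leak in."""
    parts = []
    for name in sorted(config):
        value = config[name]
        if isinstance(value, (set, frozenset)):
            # Sets have no defined iteration order; sort before formatting.
            value = sorted(value)
        parts.append(f"{name}={value!r}")
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

With this normalization, `{'gfx941', 'gfx942', 'gfx940'}` and `{'gfx942', 'gfx940', 'gfx941'}` produce the same key, whereas hashing `repr(value)` directly could produce different keys for the two orderings shown in the diff above.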
Ruichen Sun	14108c1677	Fix error handling in _triton.py (#132006 ) On Windows, _triton.py raises a confusing error ("RuntimeError: Should never be installed") because Triton is not supported on Windows. The current PyTorch exception handling does not catch it. This pull request adds handling for that runtime error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006 Approved by: https://github.com/oulgen	2024-07-29 15:02:25 +00:00
Bin Bao	be3eba382f	[CI] Run perf test for perf_cpu_aarch64 (#132038 ) Summary: Run perf test for perf_cpu_aarch64 instead of regular CI test (test_linux_aarch64). Pull Request resolved: https://github.com/pytorch/pytorch/pull/132038 Approved by: https://github.com/malfet	2024-07-29 13:48:40 +00:00
PyTorch MergeBot	c35f21e5fc	Revert "[BE][tests] show local variables on failure in tests (#131151 )" This reverts commit 14158d892a2bd9b34edb5637f9a05217ea0330bd. Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](`14158d892a`) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))	2024-07-29 13:19:38 +00:00
PyTorch MergeBot	06fe99a097	Revert "[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 )" This reverts commit dfa18bf3f39c5a90b48baf956e50fa7da4462d3d. Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))	2024-07-29 13:09:41 +00:00
PyTorch MergeBot	7ef927da15	Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275 )" This reverts commit 6de65d5dd4226b6bae15352b575c81a6750c819b. Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](`6de65d5dd4`) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))	2024-07-29 12:48:27 +00:00
cyy	efca51e171	[8/N] Fix clang-tidy warnings in jit (#131997 ) Follows #131996 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131997 Approved by: https://github.com/Skylion007	2024-07-29 12:40:42 +00:00
PyTorch MergeBot	eb9409511e	Revert "support zb1p and zb2p algorithms (#130752 )" This reverts commit 8fe5b93667b60e37c12d288659a25cbd5ae53c79. Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](`8fe5b93667`) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))	2024-07-29 12:40:00 +00:00
pruthvistony	9d497887b8	Changes to support clang-19 (#131905 ) Co-authored-by: pruthvistony <pruthvigithub@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131905 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007	2024-07-29 12:38:23 +00:00
cyy	b67811abda	[1/N] Fix clang-tidy warnings in inductor (#131979 ) Fixes clang-tidy warnings in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131979 Approved by: https://github.com/Skylion007	2024-07-29 12:37:56 +00:00
Chengji Yao	d47c470f47	[dynamo] implement `var_getattr` in UserFunctionVariable (#130413 ) This PR addresses the `getattr` of UserFunctionVariable. Although this usage is uncommon, it does appear in [Megatron's code](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py#L635). ``` def linear_with_grad_accumulation_and_async_allreduce(...): .... if not linear_with_grad_accumulation_and_async_allreduce.warned: .... .... linear_with_grad_accumulation_and_async_allreduce.warned = False ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130413 Approved by: https://github.com/yanboliang	2024-07-29 08:29:59 +00:00
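The Python pattern that #130413 refers to, a function reading and writing an attribute stored on itself, can be shown without Dynamo. The sketch below uses a hypothetical `warn_once` function rather than the Megatron code; it only demonstrates the language feature that the tracer has to handle.

```python
def warn_once(message):
    """Return a warning string the first time it is called, then None."""
    # The flag lives on the function object itself, not in a global.
    if not warn_once.warned:
        warn_once.warned = True
        return f"warning: {message}"
    return None


# Attribute attached after the definition, as in the snippet quoted above.
warn_once.warned = False
```

The first call returns the warning text and flips the flag; later calls return `None` until the attribute is reset.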
Xuehai Pan	dfa18bf3f3	[CI] add new test config label `ci-test-showlocals` to control test log verbosity (#131981 ) Add a new label `ci-test-showlocals` and add it to the test config filter. If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is present in the PR comment, the test config filter sets an environment variable `TEST_SHOWLOCALS`. `pytest` then shows local variables on failures for better debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981 Approved by: https://github.com/malfet	2024-07-29 07:40:42 +00:00
Shunting Zhang	f151f25c0b	BE: reset dynamo before each test in test_torch.py (#131388 ) https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR resets dynamo only before each test in `test_torch.py`, which makes it easier to land. Once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388 Approved by: https://github.com/zou3519 ghstack dependencies: #131551	2024-07-29 04:57:34 +00:00
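How an environment variable such as `TEST_SHOWLOCALS` might be turned into a pytest option can be sketched as follows. `build_pytest_args` and the set of values treated as "off" are assumptions made for illustration; only the variable name and pytest's `--showlocals` flag are taken as given, and this is not the PyTorch test runner's code.

```python
import os


def build_pytest_args(test_file, env=None):
    """Build a pytest argument list, adding --showlocals when requested."""
    env = os.environ if env is None else env
    args = [test_file, "-v"]
    # Assumption: unset, empty, "0" and "false" all mean "do not show locals".
    if env.get("TEST_SHOWLOCALS", "0").lower() not in ("", "0", "false"):
        args.append("--showlocals")
    return args
```

Passing the environment in explicitly keeps the helper easy to test; in CI it would simply read `os.environ`.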
Wu, Chunyuan	30e7fc0fe1	Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557 ) Fix the compilation error: ```cpp /tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16’ {aka ‘const c10::BFloat16’} 401 \| cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1); \| ^~~~~~ \| \| \| at::Tensor ``` The generated code after the fix will be: ```cpp cpp_fused_div_mm_0((bfloat16)(arg2_1.data_ptr()), (bfloat16)(constant2.data_ptr()), (bfloat16)(_frozen_param1.data_ptr()), (bfloat16)(buf1.data_ptr())); ``` Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557 Approved by: https://github.com/leslie-fang-intel	2024-07-29 04:01:17 +00:00
Peter Bell	03760be271	[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 ) This fixes a few instances where we assumed indexing expressions were non-negative. This is not valid when we have more complicated expressions involving masking e.g. pointwise cat. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761 Approved by: https://github.com/ezyang	2024-07-29 03:14:13 +00:00
Yan Zhiwei	2a02b5cd22	[Intel GPU] Dispatch Stub support (#130019 ) # Motivation Structured codegen helps decouple tensor meta setting from kernel implementation. At present, XPU operators handle tensor metas in a hand-written way. We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs. With it, XPU operators can register kernel functors to operator stubs. This is a prerequisite of PR #130082, where we will modify the codegen system to generate the source files and headers that XPU needs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-07-29 02:18:52 +00:00
cyy	5b3b2b9cc7	[7/N] Fix clang-tidy warnings in jit (#131996 ) Follows #131986 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131996 Approved by: https://github.com/ezyang	2024-07-29 01:21:18 +00:00
cyy	ddd539ba6c	[6/N] Fix clang-tidy warnings in jit (#131986 ) Follows #131969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131986 Approved by: https://github.com/ezyang	2024-07-29 00:49:08 +00:00
Tianyu Liu	7b0e10f0e5	fix _MaskPartial when multiple embeddings coexist (#131264 ) Previously, using `_MaskPartial` with multiple embeddings had the following issues: 1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`, and there is more than one embedding sharing the same `vocab_size` but with different `emb_size`s. They would not share an `OpStrategy`, since each would have a different `OpSchema` when involved in computation; however, there would be a cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`), because `_MaskPartial` only has `vocab_size` as `logical_dim_size` and does not have `emb_size` as an attribute. This cache hit is undesirable and would cause trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. In this PR, we introduce `offset_shape` to represent the embedding's full shape, which avoids cache hits between embeddings of different shapes. 2. The second issue arises when two `nn.Embedding`s `emb1` and `emb2` have the same shape. There will be a cache hit not only in `_gen_transform_infos`, but also in `OpStrategy` generation. Previously, if we sequentially did `Replicate` -> `_MaskPartial` for both `emb1` and `emb2` and then sequentially did the reduction on the `_MaskPartial` of `emb1`, it would destroy the `MaskBuffer` and `emb2` would hit an error. This PR adds a `refcount` for the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264 Approved by: https://github.com/wanchaol	2024-07-29 00:40:58 +00:00
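The second issue in #131264 is a shared buffer being freed by its first user while a second user still needs it. A reference count is the usual remedy. The sketch below is a simplified illustration under that assumption: `SharedBuffer` and its method names are hypothetical and do not reproduce the `MaskBuffer` class in `torch/distributed/_tensor`.

```python
class SharedBuffer:
    """Hypothetical buffer that is freed only when its last user releases it."""

    def __init__(self):
        self.data = None
        self.refcount = 0

    def acquire(self, data):
        # The first user stores the data; later users just share it.
        if self.refcount == 0:
            self.data = data
        self.refcount += 1

    def release(self):
        if self.refcount == 0:
            raise RuntimeError("release() called on an unused buffer")
        self.refcount -= 1
        if self.refcount == 0:
            # Only the last release actually drops the data.
            self.data = None
```

With two users, the first `release()` leaves `data` in place, so the second user's reduction still finds the buffer it expects.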
Peter Bell	0ab6551bcb	[inductor] Handle NoneLayout in count_numel (#131645 ) We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645 Approved by: https://github.com/jansel	2024-07-28 23:02:22 +00:00
cyy	7c1fbc7fe9	[5/N] Remove unused parameter (#131998 ) Follows #131291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131998 Approved by: https://github.com/ezyang	2024-07-28 21:29:06 +00:00
Nikita Shulga	f901b02066	[Distributed] Do not expose `nlohmann/json.hpp` in public headers (#131925 ) Move the `<nlohmann/json.hpp>` dependency, as well as the `NCCLTraceBuffer::getCollectiveTraceJson` and `NCCLTraceBuffer::dump_json` implementations introduced by https://github.com/pytorch/pytorch/pull/129505, from the header into the .cpp file. This removes the requirement that all downstream clients depend on the library. Fixes https://github.com/pytorch/pytorch/issues/130678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131925 Approved by: https://github.com/albanD, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/c-p-i-o ghstack dependencies: #131922	2024-07-28 18:45:24 +00:00
Oguz Ulgen	75c8d59ea1	Remove mypy ignore from torch/_dynamo/variables/lazy.py (#131785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131785 Approved by: https://github.com/aorenste, https://github.com/zou3519 ghstack dependencies: #131786, #131870	2024-07-28 17:13:53 +00:00
Oguz Ulgen	7c29665f77	Remove mypy ignore from torch/testing/_internal/distributed/ (#131870 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870 Approved by: https://github.com/aakhundov ghstack dependencies: #131786	2024-07-28 17:13:53 +00:00
Oguz Ulgen	2e4807575c	Remove mypy ignore from torch/_dynamo/polyfill.py (#131786 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131786 Approved by: https://github.com/aorenste, https://github.com/zou3519	2024-07-28 17:13:49 +00:00
Adnan Akhundov	cc512ea0f6	[inductor] Fix flaky tests in test_aot_inductor.py (#131994 ) Summary: The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated. The root cause is here in Python: `8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)` which, in turn, instantiates the `Runner` object in C++ relying on `dlopen` for loading the .so. The problem is that a repeated `dlopen` call does not reload the library from the same path unless `dlclose` is called between the two `dlopen` calls. There is a `dlclose` in the `Runner`'s destructor, but it's not called, likely due to the way the loaded `runner` gets closed over in Python: `8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)` Here we copy the .so file to a unique temporary path right before loading it into a `runner` to avoid the `dlopen` staleness described above. This fixes `test_model_modified_weights` and, hopefully, will help avoid similar errors in future tests. Test Plan: Tested internally. Differential Revision: D60348165 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994 Approved by: https://github.com/chenyang78	2024-07-28 16:55:22 +00:00
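The workaround described in #131994, copying the shared library to a unique path before loading it, can be sketched with the standard library alone. `copy_to_unique_path` and the `aoti_so_` prefix are hypothetical names chosen for illustration; the actual change lives in `test_aot_inductor_utils.py` and may differ in detail.

```python
import os
import shutil
import tempfile


def copy_to_unique_path(lib_path: str) -> str:
    """Copy a file into a fresh temporary directory and return the new path."""
    # mkdtemp creates a new, uniquely named directory on every call, so the
    # returned path has never been seen by the dynamic loader before.
    tmp_dir = tempfile.mkdtemp(prefix="aoti_so_")
    dst = os.path.join(tmp_dir, os.path.basename(lib_path))
    shutil.copy(lib_path, dst)
    return dst
```

Because each call yields a different path, loading the copy avoids reusing a handle that an earlier `dlopen` of the original path left open. The caller is responsible for removing the temporary directories afterwards.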
Animesh Jain	6de65d5dd4	[dynamo] Turn on inline_inbuilt_nn_modules (#131275 ) Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696 Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644) Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9) Inference sees a little bit more perf degradation but we are ok with that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275 Approved by: https://github.com/ezyang, https://github.com/jansel ghstack dependencies: #131744, #131928, #131948	2024-07-28 13:23:00 +00:00
Adnan Akhundov	8927fc209f	[inductor] Add type hints to functions in debug.py (#131836 ) Summary: ATT Test Plan: lintrunner Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/131836 Approved by: https://github.com/eellison	2024-07-28 04:54:22 +00:00
Huy Do	500aea8d50	Build PT aarch64 on arm runner (#131964 ) Another fix is needed to address https://github.com/pytorch/pytorch/actions/runs/10118374576/job/27985575620. The build needs to be done on arm runner to stay compatible with the Docker image. ### Testing https://github.com/pytorch/pytorch/actions/runs/10118589329/job/27985670691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131964 Approved by: https://github.com/malfet	2024-07-28 04:50:38 +00:00
PyTorch MergeBot	945bf78894	Revert "[BE] typing for decorators - fx/_compatibility (#131568 )" This reverts commit 193f62fde91ee20deb5ddcd9ff4593cd78d74c64. Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident. This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))	2024-07-28 03:43:39 +00:00
PyTorch MergeBot	b002ec61b6	Revert "[BE] typing for decorators - masked/_ops (#131569 )" This reverts commit aa58af8b43ad0e615415b4d754255f5be481d41a. Reverted https://github.com/pytorch/pytorch/pull/131569 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident. This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))	2024-07-28 03:43:39 +00:00
PyTorch MergeBot	a3ba405871	Revert "[BE] typing for decorators - library (#131570 )" This reverts commit 5731b486c87bedff69aa0264d6c934bf723eb513. Reverted https://github.com/pytorch/pytorch/pull/131570 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident. This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))	2024-07-28 03:43:39 +00:00
PyTorch MergeBot	a0abb77007	Revert "[BE] typing for decorators - distributed/_tensor/ops/utils (#131571 )" This reverts commit 4b985e6f803023ec301238d2b4bab4fbea4dd03c. Reverted https://github.com/pytorch/pytorch/pull/131571 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident. This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))	2024-07-28 03:43:39 +00:00
Yifu Wang	a8a9882899	Implement fused_scaled_matmul_reduce_scatter for async-TP (#131950 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131950 Approved by: https://github.com/weifengpy ghstack dependencies: #131410, #131831, #131832, #131833	2024-07-28 03:39:12 +00:00
Yifu Wang	0538a69a8d	[micro_pipeline_tp] support all-gather -> _scaled_mm (#131833 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131833 Approved by: https://github.com/weifengpy ghstack dependencies: #131410, #131831, #131832	2024-07-28 03:39:11 +00:00
Yifu Wang	492e9a4886	[micro_pipeline_tp] add support for type-erased all-gather pattern observed in DTensor + float8_experimental (#131832 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131832 Approved by: https://github.com/weifengpy ghstack dependencies: #131410, #131831	2024-07-28 03:39:11 +00:00
PyTorch MergeBot	fd5b7d4bf9	Revert "[BE] typing for decorators - _meta_registrations (#131572 )" This reverts commit bfe0079b72aa3ed315ae8f140c97a5826c401a65. Reverted https://github.com/pytorch/pytorch/pull/131572 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	609447a626	Revert "[BE] typing for decorators - _jit_internal (#131573 )" This reverts commit f0f20f7e97716b4b077dca2a1a42930ccf990c1c. Reverted https://github.com/pytorch/pytorch/pull/131573 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	4684b8e9d7	Revert "[BE] typing for decorators - _inductor/lowering (#131574 )" This reverts commit b2cbcf710b26c4cb92d810fff46b6ddcb8d10cbf. Reverted https://github.com/pytorch/pytorch/pull/131574 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	07b7f51877	Revert "[BE] typing for decorators - _inductor/fx_passes/post_grad (#131575 )" This reverts commit 42dc5a47a157f9a441ceba53cf569cc42a640732. Reverted https://github.com/pytorch/pytorch/pull/131575 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	6a0c3bae21	Revert "[BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576 )" This reverts commit 37d76c7d48353cff5ed0d868b7ca486ad092ceaf. Reverted https://github.com/pytorch/pytorch/pull/131576 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	b1d640a2b7	Revert "[BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577 )" This reverts commit 5ee6a6dacc926da37ebe06e4206dcc307bf891f5. Reverted https://github.com/pytorch/pytorch/pull/131577 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	d3c17fea90	Revert "[BE] typing for decorators - _library/custom_ops (#131578 )" This reverts commit c65b197b85aeee61ed4c09527a8f6eecf8c20e27. Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:32 +00:00
PyTorch MergeBot	065d0fe570	Revert "[BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579 )" This reverts commit 79f0c4dc04c7976b734767d64c4833932219dcfb. Reverted https://github.com/pytorch/pytorch/pull/131579 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:31 +00:00
PyTorch MergeBot	5ced63a005	Revert "[BE] typing for decorators - utils/flop_counter (#131580 )" This reverts commit 81c26ba5ae1edf95da8f6956ae4b5ad23c9833c6. Reverted https://github.com/pytorch/pytorch/pull/131580 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:31 +00:00
PyTorch MergeBot	2c4023d65f	Revert "[BE] typing for decorators - _refs/nn/functional (#131581 )" This reverts commit dbf7c318b2dd4652467f11f4aaebaa3ed372e728. Reverted https://github.com/pytorch/pytorch/pull/131581 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:31 +00:00
PyTorch MergeBot	e448f32944	Revert "[BE] typing for decorators - signal/windows/windows (#131582 )" This reverts commit 8689d377f9b60b70efa6608e654a3889f947f4d8. Reverted https://github.com/pytorch/pytorch/pull/131582 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))	2024-07-28 03:29:31 +00:00
PyTorch MergeBot	d90f6b45c0	Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820 )" This reverts commit fb3ddafbcfe6de1c4b208c020bc5ff4c4c4faf79. Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2254327833))	2024-07-28 03:26:14 +00:00
PyTorch MergeBot	8f5cf46405	Revert "Fix public API tests (#131386 )" This reverts commit 91fcfd87600545c19b975bd6ea134f2f931bf84a. Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))	2024-07-28 03:23:04 +00:00
cyy	7be0ce51b6	Fix handle serialization error (#131871 ) It is a bug to try to serialise std::string in the C API. Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131871 Approved by: https://github.com/Skylion007	2024-07-28 00:33:20 +00:00
Aaron Orenstein	3e0ccb3a9f	Fixing fake tensor SymInt caching (#131966 ) Summary: Some tests are failing because of a weird interaction between the symbolic sizes and the `set()` - back it out for now. Differential Revision: D60320595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131966 Approved by: https://github.com/oulgen	2024-07-27 22:43:57 +00:00
Shuo Ding	d07a125af2	[Inductor] supporting pointwise intermediate nodes in B2B-GEMM (#131685 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131685 Approved by: https://github.com/eellison	2024-07-27 20:11:20 +00:00
Xuehai Pan	14158d892a	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. 
if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. 
return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(*args, **kwargs): try: > retval = cfunc(*args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-27 19:39:40 +00:00
albanD	466ea8ce54	Add fallback() to torch.library (#131707 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131707 Approved by: https://github.com/zou3519	2024-07-27 18:02:35 +00:00
cyy	8e5a367311	[5/N] Fix clang-tidy warnings in jit (#131969 ) Follows #131903 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131969 Approved by: https://github.com/ezyang	2024-07-27 17:54:20 +00:00
Xuehai Pan	918ece4f4d	[BE][Easy][11/19] enforce style for empty lines in import segments in `test/dy*/` (#129762 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129762 Approved by: https://github.com/anijain2305	2024-07-27 17:43:53 +00:00
Angela Yi	ae9f17a821	[aoti] Rename OSS DynamicArg and OpKernel (#131862 ) Summary: Fixing P1495466240 which I think is due to the fact that internal also has an "OpKernel" in the same namespace, using thrift instead of json. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4785074844896831 Differential Revision: D60273354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131862 Approved by: https://github.com/desertfire	2024-07-27 17:34:50 +00:00
PyTorch MergeBot	8cdfdb41bc	Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519 )" This reverts commit f862f457304f1952e75336f9f74e4ea3d2a5eb72. Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/atalman due to broke CI: test_nestedtensor.py::TestNestedTensorSubclassCPU::test_layer_norm_with_lengths_requires_grad_False_components_require_grad_False_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10121747545/job/27996722731) [HUD commit link](`f862f45730`) ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2254167994))	2024-07-27 14:45:47 +00:00
Nikita Shulga	07389163f0	[C10][BE] Use range loop (#131922 ) Non-functional change that iterates over entries in `getCollectiveTraceJson` and uses `C10_UNUSED` rather than the `(void)i;` trick Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922 Approved by: https://github.com/XilunWu	2024-07-27 11:26:27 +00:00
cyy	f83ef69b84	Fix typo in assignment operators (#131890 ) Most typos were introduced in #131077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131890 Approved by: https://github.com/Skylion007	2024-07-27 11:13:42 +00:00
cyy	c82441e07a	Fix std::optional checking bug (#131874 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131874 Approved by: https://github.com/Skylion007	2024-07-27 11:08:10 +00:00
Yifu Wang	93a4671746	Add out_dtypes to fused_all_gather_scaled_matmul's args (#131831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131831 Approved by: https://github.com/weifengpy ghstack dependencies: #131410	2024-07-27 11:07:43 +00:00
Yifu Wang	12cd040edd	[micro_pipeline_tp] exclude simple overlappable collectives as micro-pipeline TP candidates when reorder_for_compute_comm_overlap is enabled (#131410 ) When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410 Approved by: https://github.com/weifengpy	2024-07-27 11:07:43 +00:00
Animesh Jain	36d24925c6	[inline_inbuilt_nn_modules][inductor-cpu] More skips for dynamic shapes when inlining enabled (#131948 ) The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131948 Approved by: https://github.com/eellison, https://github.com/leslie-fang-intel ghstack dependencies: #131744, #131928	2024-07-27 10:03:49 +00:00
Will Feng	aee6bcdba4	[Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614 ) This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs. Test commands: - `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614 Approved by: https://github.com/yifuwang ghstack dependencies: #131510	2024-07-27 08:39:58 +00:00
Will Feng	9e06572704	[Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510 ) This PR creates these `GroupedSchedulerNode`s: - One for each all-gather code block (cast + copy-in + all-gather) - One for each all-gather-wait code block (all-gather-wait + copy-out) - One for each reduce-scatter code block (copy-in + reduce-scatter) - One for each reduce-scatter-wait code block (reduce-scatter-wait) This serves two goals: - Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage. - Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms"). The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next. Test commands: - `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510 Approved by: https://github.com/yifuwang	2024-07-27 08:39:58 +00:00
cyy	99e13e68e9	[4/N] Fix clang-tidy warnings in jit (#131903 ) Follows #131830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131903 Approved by: https://github.com/Skylion007	2024-07-27 08:08:14 +00:00
Janani Sriram	f862f45730	[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519 ) Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor. Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features. Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519 Approved by: https://github.com/davidberard98 ghstack dependencies: #131518	2024-07-27 07:09:10 +00:00
Janani Sriram	bcf5c68c18	[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518 ) Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor. Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`. Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518 Approved by: https://github.com/davidberard98	2024-07-27 07:09:10 +00:00
Avik Chaudhuri	c49e857d32	[pt] immutable accessors in graph signature (#131940 ) Summary: splitting PT part of D60253955 Test Plan: existing tests Differential Revision: D60296909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131940 Approved by: https://github.com/angelayi, https://github.com/zhxchen17	2024-07-27 05:32:53 +00:00
Oguz Ulgen	96c1862e0b	Remove mypy ignore from torch/_dynamo/variables/__init__.py (#131784 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131784 Approved by: https://github.com/aorenste, https://github.com/zou3519, https://github.com/Skylion007	2024-07-27 05:07:33 +00:00
drisspg	1bfe7eb7e6	Update how we do sdpa testing (#131743 ) ## Motivation This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues: 1. Standardized comparison: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`. 2. Reduced redundancy: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication. 3. Improved maintainability: The new approach simplifies tolerance adjustments across all affected tests. 4. Consistency: Standardizing tensor comparisons ensures a more uniform and reliable testing suite. These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743 Approved by: https://github.com/jainapurva, https://github.com/jbschlosser	2024-07-27 03:58:49 +00:00
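The tolerance rule described in item 1 of the commit above can be sketched roughly as follows. This is a plain-Python illustration, not the test-suite code: the helper names, the `default_atol` value of `1e-5`, and the `fudge_factor` parameter are all assumptions made for the example.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def fused_tolerance(ref_fp64, math_lowp, default_atol=1e-5, fudge_factor=1.0):
    """Tolerance for the fused kernel (hypothetical helper).

    The calculated part is the error that the same-precision math
    implementation already shows against the float64 reference; the result
    never drops below the default absolute tolerance.
    """
    calculated = fudge_factor * max_abs_diff(ref_fp64, math_lowp)
    return max(calculated, default_atol)


def fused_matches(ref_fp64, math_lowp, fused, **kwargs):
    """True if the fused output is within tolerance of the float64 reference."""
    return max_abs_diff(ref_fp64, fused) <= fused_tolerance(ref_fp64, math_lowp, **kwargs)


# Made-up numbers standing in for attention outputs.
ref = [0.1, 0.2, 0.3]                 # float64 reference
math_lowp = [0.1004, 0.1997, 0.3002]  # same-precision math implementation
fused_ok = [0.1003, 0.2002, 0.2998]   # fused kernel, error comparable to math_lowp
fused_bad = [0.11, 0.2, 0.3]          # fused kernel, error far larger

print(fused_matches(ref, math_lowp, fused_ok))
print(fused_matches(ref, math_lowp, fused_bad))
```

Deriving the tolerance from the same-precision math run means one knob (`fudge_factor` here) adjusts every comparison, which is in the spirit of the maintainability point the commit makes.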
Vishwa Raj Singh	bcdba9f91d	Added hpu backend support in fsdp utils (#127757 ) In fsdp init_utils, adding support for hpu backend device on _get_device API. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757 Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu	2024-07-27 03:30:59 +00:00
Xu Han	28fd2e905d	[inductor] enhance cpp_builder lint check. (#131752 ) enhance cpp_builder `mypy` check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131752 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-27 02:46:27 +00:00
Xu Han	a90b8b967a	[inductor] enable windows inductor UTs (#131767 ) Changes: 1. Add `skipIfWindows` function. 2. Fix the error that `fresh_inductor_cache` raises on Windows because loaded modules cannot be deleted. 3. Disable some UTs that do not pass on Windows. 4. Enable test_torchinductor in Windows CI. I have tested this and it passes on my dev machine. TODO: review and fix the skipped cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-27 02:46:03 +00:00
Avik Chaudhuri	3768faec2f	carry cond in data-dependent error (#131932 ) Test Plan: existing Differential Revision: D60302877 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131932 Approved by: https://github.com/zhxchen17	2024-07-27 02:13:04 +00:00
Xu Han	9606d61e0c	[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. 2. Use `deprecated_cpp_compile_command` only for `fb_code`, because I cannot debug it further without access to the Meta internal environment. 3. Add `TODO` comments so that a Meta employee can help continue this work. 4. As a result of item 3, only the `deprecated_cpp_compile_command` path for `fb_code` remains to be fixed, so remove `validate_new_cpp_commands`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-27 01:46:13 +00:00
Matthew Hoffman	fdf1451bfa	Add `__all__` to torch.optim to define public interface (#131959 ) There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15) ``` error: "SGD" is not exported from module "torch.optim" ``` Adding these classes/modules to `__all__` fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959 Approved by: https://github.com/ezyang	2024-07-27 01:03:25 +00:00
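As a rough illustration of what `__all__` does in the commit above, the sketch below builds a throwaway in-memory module. The module name `toy_optim` and its classes are hypothetical stand-ins, not the real `torch.optim`. Names listed in `__all__` form the module's declared public interface, which is what `from module import *` honours and what the commit reports pyright needs.

```python
import sys
import types

# Build a throwaway module in memory. `toy_optim` is a hypothetical stand-in
# used only for this example.
toy_optim = types.ModuleType("toy_optim")
exec(
    "class SGD: pass\n"
    "class Adam: pass\n"
    "class _Helper: pass\n"
    "__all__ = ['SGD']\n",
    toy_optim.__dict__,
)
sys.modules["toy_optim"] = toy_optim

# A star-import only picks up the names listed in __all__.
namespace = {}
exec("from toy_optim import *", namespace)

exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Here `Adam` exists in the module but is not exported, because it is missing from `__all__`. The commit describes the same kind of gap for the `torch.optim` classes.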
Sergii Dymchenko	8458980bbf	Move benchmarks/dynamo/huggingface configuration to YAML (#131724 ) Similar to https://github.com/pytorch/pytorch/pull/120299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724 Approved by: https://github.com/shunting314	2024-07-27 00:55:04 +00:00
Zain Rizvi	ef8d118c67	Sync with changes to test-infra's scale-config.yml (#131955 ) This synchronizes lf-canary-scale-config and lf-scale-config with the one in test-infra. This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955 Approved by: https://github.com/malfet	2024-07-27 00:25:40 +00:00
Nikita Shulga	8b04edcac1	Delete unused yml files (#131298 ) To be landed at least 3 days after the previous commit Pull Request resolved: https://github.com/pytorch/pytorch/pull/131298 Approved by: https://github.com/ZainRizvi ghstack dependencies: #130762	2024-07-27 00:21:22 +00:00
Zain Rizvi	1e00f055a4	Move distributed experimental jobs back to the amazon2 for now (#131963 ) Something about the new Amazon2023 AMI is making some distributed tests fail. Moving them back to the old AMI until the issue is fixed. These particular jobs are causing this test to fail: https://github.com/pytorch/pytorch/issues/129539 More details in https://github.com/pytorch/pytorch/issues/131962 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131963 Approved by: https://github.com/clee2000	2024-07-26 23:44:56 +00:00
Joel Schlosser	91fcfd8760	Fix public API tests (#131386 ) This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in: * `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers * `torch/library.py` - add `register_vmap` to `__all__` * `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore * `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API * `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386 Approved by: https://github.com/albanD	2024-07-26 23:38:43 +00:00
Shangdi Yu	02b922900b	[aoti] Fix float16 and bfloat16 for generated GPU code (#131437 ) Fixes #131333 Summary: - Add header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`. - change `float16` and `bfloat16` to `float` before passing to kernel. code generated before: ```cpp ..... half var_1; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1)); .... ``` code generated now: ```cpp typedef at::Half half; typedef at::BFloat16 bfloat16; ..... half var_1_tmp; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp)); float var_1 = float(var_1_tmp); .... ``` Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda` Work in progress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437 Approved by: https://github.com/desertfire	2024-07-26 23:36:11 +00:00
Bin Bao	0272934238	[Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812 ) Summary: CPU CI nodes failed to find valid VecISA because importing torch under the default pytorch directory will fail with the following msg, so switch cwd to a tmp directory. ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module> from torch.torch_version import __version__ as __version__ File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module> from torch.version import __version__ as internal_version ModuleNotFoundError: No module named 'torch.version' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812 Approved by: https://github.com/eellison, https://github.com/malfet	2024-07-26 22:31:44 +00:00
Sergii Dymchenko	5489ff8e94	Use Mermaid for the diagram in torch/ao/quantization/fx/README.md (#131412 ) preview `3a0efcdfa3/torch/ao/quantization/fx/README.md` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131412 Approved by: https://github.com/jerryzh168	2024-07-26 22:01:21 +00:00
Peter Bell	16cd1aaa1d	[inductor] Improve sort kernel perf (#131719 ) Closes #129507 This makes two changes to the sort kernel: 1. Use int16 for the indices since we only operate on small dims anyway 2. Instead of passing an explicit mask, we pass the rnumel and imply the mask from that which saves an additional reduction in the sort kernel's inner loop. In my benchmarks, this gives enough of a perf improvement to bump up the max rblock to 512. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719 Approved by: https://github.com/eellison	2024-07-26 21:56:47 +00:00
Luca Wehrstedt	b90bc66766	Enable FlashAttention on Windows (#131906 ) Let's just give this a try. Reland of https://github.com/pytorch/pytorch/pull/131875. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906 Approved by: https://github.com/drisspg	2024-07-26 21:41:56 +00:00
rzou	d73b55d64b	Support meta tensors as inputs to the triton_kernel_wrapper HOPs (#131896 ) We automatically generate FakeTensor support for them (the FakeTensor kernel for a triton kernel is "return None"). The same thing should apply to the meta kernel. Tests: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896 Approved by: https://github.com/oulgen	2024-07-26 21:41:03 +00:00
Animesh Jain	fb98cd33f1	[inline_inbuilt_nn_modules][inductor-cpu] Skip test_quantized_linear_amx (#131928 ) The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131928 Approved by: https://github.com/eellison ghstack dependencies: #131744	2024-07-26 21:28:17 +00:00
Shunting Zhang	c8626a4e1f	[BE] add a list of inductor test files to skip resetting dynamo (#131551 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551 Approved by: https://github.com/zou3519	2024-07-26 21:08:15 +00:00
Catherine Lee	fde577702d	[TD] More synonyms for filepath (#131838 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131838 Approved by: https://github.com/PaliC, https://github.com/ZainRizvi	2024-07-26 21:02:42 +00:00
Zain Rizvi	1bda3a3135	Migrate nightly.yml workflow & docs to Amazon 2023 (#131821 ) A continuation of the migration started in - https://github.com/pytorch/pytorch/pull/131250 Migrates nightly jobs and the linux-docs job in pull.yml To preserve reusability, I'm switching to a new format here that allows one to only specify the runner prefix instead of the full runner name, allowing multiple jobs to continue using the same base runner type like how they did before Validation: - Nightly builds passed in the prev commit: https://github.com/pytorch/pytorch/actions/runs/10102118461/job/27937632823?pr=131821 - Latest commit only updated the docs job in pull.yml, and that has already passed: https://github.com/pytorch/pytorch/actions/runs/10114635537/job/27974392472?pr=131821 The other in-progress jobs are irrelevant Pull Request resolved: https://github.com/pytorch/pytorch/pull/131821 Approved by: https://github.com/atalman, https://github.com/seemethere	2024-07-26 20:54:43 +00:00
James Wu	0e6df1e0fb	Disable remote cache on test (#131908 ) Summary: Fixes test internally Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees -- --exact 'caffe2/test/inductor:cudagraph_trees - test_cache_hit_forward_miss_backward (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)' Passes Differential Revision: D60293177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131908 Approved by: https://github.com/clee2000	2024-07-26 20:19:02 +00:00
Brian Hirsh	071ac38141	fast-path FakeTensor detach (#131899 ) Fixes https://github.com/pytorch/pytorch/issues/128281, see investigation at https://github.com/pytorch/pytorch/issues/128281#issuecomment-2252976926. benchmark: ``` python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM ``` time before: ``` TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435 ``` time after: ``` TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131899 Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD	2024-07-26 20:16:08 +00:00
Catherine Lee	2ec8312a28	Add rerun_disabled_tests for inductor (#131681 ) Test in prod? This also turns on mem leak check Briefly checked that ``` python3 ".github/scripts/filter_test_configs.py" \ --workflow "inductor" \ --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \ --test-matrix "{ include: [ { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" }, { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: 
"aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, ]} " \ --selected-test-configs "" \ --pr-number "${PR_NUMBER}" \ --tag "${TAG}" \ --event-name "schedule" \ --schedule "29 8 * * *" \ --branch "${HEAD_BRANCH}" ``` has rerun disabled tests option in the test matrix I don't think all these things need to run but I'm not sure which ones (probably just inductor?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681 Approved by: https://github.com/zou3519	2024-07-26 20:05:24 +00:00
Sergii Dymchenko	da1a1fa55f	Move load_yaml_file to common (#131924 ) This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924 Approved by: https://github.com/shunting314, https://github.com/huydhn	2024-07-26 19:47:52 +00:00
Bin Bao	6c95f79645	[CI] Increase the timeout for aarch64 docker build (#131926 ) Summary: Increase the timeout limit for pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks. If slow build is a problem later, we can upgrade the arm64 CI instance capability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131926 Approved by: https://github.com/avikchaudhuri	2024-07-26 19:27:45 +00:00
PyTorch MergeBot	782efd8e5b	Revert "Add rerun_disabled_tests for inductor (#131681 )" This reverts commit 85fa66be04b6f78139da4f0ec8f8b1956291e1c5. Reverted https://github.com/pytorch/pytorch/pull/131681 on behalf of https://github.com/clee2000 due to this is the wrong file ([comment](https://github.com/pytorch/pytorch/pull/131681#issuecomment-2253318038))	2024-07-26 19:08:59 +00:00
PyTorch MergeBot	0f9bf208ec	Revert "[BE][tests] show local variables on failure in tests (#131151 )" This reverts commit 054d214c504b415b155ef2da1a70764a115e1276. Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))	2024-07-26 19:03:10 +00:00
rzou	a3cdbd8189	[FlopCounterMode] Fix register_flop_formula (#131777 ) Previously, FlopCounterMode would ignore any custom ops registered through `register_flop_formula`. The problem was: - register_flop_formula(target) requires target to be an OpOverloadPacket. - register_flop_formula used register_decomposition to populate its registry - register_decomposition decomposes the OpOverloadPacket into OpOverload before putting it into the registry - FlopCounterMode ignores OpOverloads in its registry (it assumes the registry is a dictionary mapping OpOverloadPacket to flop formula). register_decomposition is too heavy of a hammer, plus this isn't a decomposition, so I changed the registration mechanism. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777 Approved by: https://github.com/Chillee	2024-07-26 18:44:50 +00:00
Vishwa Raj Singh	cd53698df0	Add hpu backend support for dynamo torchVariable _in_graph_classes() function (#129948 ) Fixes #ISSUE_NUMBER A recent change in PR `f657b2b1f8 (diff-4a52059570bb96333d8383ce6a9d01bbb114c5e34aff6028f820899ca39b5a26R80)` hard-coded the CUDA stream in the in-graph function. For non-CUDA backends (hpu in our case), this breaks the graph. This PR adds hpu backend support to the dynamo variables function _in_graph_classes(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129948 Approved by: https://github.com/yanboliang	2024-07-26 18:38:03 +00:00
eellison	5f2c80d16d	Add inductor OrderedSet (#130003 ) Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet: ``` A mutable set is a finite, iterable container. This class provides concrete generic implementations of all methods except for __contains__, __iter__, __len__, add(), and discard(). ``` In addition to implementing those methods I also had to define some methods of python's set which were not implemented in MutableSet. I reused the test from my python's lib. There were a few instances of tests that didn't pass because of edge-case behavior that is not necessary to reimplement - support self-referencing repr - erroring when a member's `__eq__` function would modify the set itself - MutableSet supports Iterables as inputs, but not sequences (pretty rare..) - Some specifics of exact equivalent type errors being thrown - [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003 Approved by: https://github.com/aorenste	2024-07-26 18:16:57 +00:00
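The approach described in the OrderedSet commit above (extend `collections.abc.MutableSet`, back it with an insertion-ordered dictionary, and supply the five abstract methods) can be sketched as follows. This is a minimal illustration under that description, not the inductor implementation; the `_dict` attribute name and the `__repr__` format are assumptions.

```python
from collections.abc import MutableSet
from typing import Dict, Hashable, Iterable, Iterator, Optional


class OrderedSet(MutableSet):
    """Minimal insertion-ordered set backed by a dict (illustrative sketch)."""

    def __init__(self, items: Optional[Iterable[Hashable]] = None) -> None:
        # dict preserves insertion order, so its keys act as the ordered set.
        self._dict: Dict[Hashable, None] = dict.fromkeys(items or ())

    # The five methods MutableSet leaves abstract:
    def __contains__(self, item: object) -> bool:
        return item in self._dict

    def __iter__(self) -> Iterator[Hashable]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self._dict)

    def add(self, item: Hashable) -> None:
        self._dict[item] = None

    def discard(self, item: Hashable) -> None:
        self._dict.pop(item, None)

    def __repr__(self) -> str:
        return f"OrderedSet({list(self._dict)})"


s = OrderedSet([3, 1, 2, 1])
s.add(0)
s.discard(1)
print(list(s))  # [3, 2, 0]
```

Operators such as `|`, `&`, and `-` come from the `MutableSet` mixins, which build their result through the class constructor, so they also return an `OrderedSet` here. Methods that exist only on the built-in `set` (for example `union()` or `update()`) would still need to be written by hand, which matches the commit's note about defining extra methods.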
Mikayla Gawarecki	1dd10ac802	[BE] [Reland] Make nn.Module state_dict load_state_dict pre-hook and state_dict post-hook public (#131690 ) Reland https://github.com/pytorch/pytorch/pull/126704 #### Fixes the issue with type of `nn.Module._state_dict_hooks` being changed in that PR which was problematic: Instead of using `Tuple(Callable, bool)` to keep track of whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook and toggle the behavior accordingly, I set an attribute on the Callable in the private API, which is never cleaned up. If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised #### Copied from previous PR description Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437 - `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook` - Add a test as this API was previously untested - `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`) ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~ - For the issue raised in https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root for the private hook - Document this for private `_register_state_dict_hook` - Remove this for the public `register_state_dict_post_hook` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690 Approved by: https://github.com/albanD	2024-07-26 18:14:07 +00:00
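The marker-attribute mechanism described in the commit above (tag the callable when it is registered through the private API, and raise `RuntimeError` if the same callable is later passed to the public API) can be sketched in plain Python. Every name here (`HookRegistry`, `register_private`, `register_public`, `_from_private_api`) is hypothetical and chosen for illustration; none of them is the `nn.Module` API.

```python
from typing import Callable, Dict


class HookRegistry:
    """Illustrative registry that tells private and public registrations apart."""

    def __init__(self) -> None:
        self._hooks: Dict[int, Callable] = {}
        self._next_id = 0

    def _store(self, hook: Callable) -> int:
        handle = self._next_id
        self._hooks[handle] = hook
        self._next_id += 1
        return handle

    def register_private(self, hook: Callable) -> int:
        # Mark the callable itself instead of storing a (hook, flag) tuple,
        # so the registry's value type stays a plain callable.
        hook._from_private_api = True  # hypothetical marker; never cleared
        return self._store(hook)

    def register_public(self, hook: Callable) -> int:
        # A callable that already carries the private marker is rejected.
        if getattr(hook, "_from_private_api", False):
            raise RuntimeError(
                "hook was already registered through the private API"
            )
        return self._store(hook)
```

The trade-off is the one the commit message states: the registry keeps its original value type, at the cost of a marker that stays on the callable for its lifetime.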
Shuqiang Zhang	8158cf2f59	[c10d] Fix split_group usage when there is a single rank (#131824 ) Summary: This is a request from xlformer team to allow single rank PG/comms Test Plan: UT Pull Request resolved: https://github.com/pytorch/pytorch/pull/131824 Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj	2024-07-26 18:11:17 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
PyTorch MergeBot	e4db5dc1c4	Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358 )" This reverts commit 4c7f22dee25649cd895bc382192d29f39e482215. Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))	2024-07-26 17:35:27 +00:00
William Wen	2576dbbc35	[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725 ) Fixes https://github.com/pytorch/pytorch/issues/112794. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131725 Approved by: https://github.com/anijain2305 ghstack dependencies: #131413, #131716	2024-07-26 17:17:09 +00:00
William Wen	35b4de32fa	[dynamo] add itertools repeat/count bytecode reconstruction (#131716 ) Also fix bugs in the count iterator variable implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131716 Approved by: https://github.com/anijain2305 ghstack dependencies: #131413	2024-07-26 17:17:09 +00:00
Boyuan Feng	40cc5c0697	[AOT Autograd] Donated Buffer (#130580 ) Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), or bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata`, so that they can be accessed in inductor. Fixes #129496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580 Approved by: https://github.com/bdhirsh	2024-07-26 17:14:34 +00:00
Siyu Yang	9589d986fa	[UT] Relax atol for test_non_contiguous_input_* (3 tests) (#131822 ) BE task T195600898 (internal). The 3 tests ``` test_non_contiguous_input_mm test_non_contiguous_input_bmm test_non_contiguous_input_addmm ``` had the following error in TestX: ``` self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2)) AssertionError: False is not true ``` The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics: ``` Mismatched elements: 25 / 38597376 (0.0%) Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed) Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed) ``` Change the absolute tolerance from 0.01 to 0.02. Also switch to use `torch.testing.assert_close` which prints out the greatest absolute/relative difference like above when the assert fails. `test_non_contiguous_input_mm_plus_mm` has a different problem, just switching to `torch.testing.assert_close` to be uniform with the other tests. Test commands: ``` python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm ``` Internal stress tests pass now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822 Approved by: https://github.com/shunting314	2024-07-26 17:11:35 +00:00
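The tolerance change in the commit above follows from the elementwise closeness rule that `torch.allclose` and `torch.testing.assert_close` document: `|actual - expected| <= atol + rtol * |expected|`. The sketch below applies that rule in plain Python to a single pair of values. The pair is hypothetical, chosen only so that its absolute difference equals the reported 0.015625; the actual tensor values at the failing indices are not given in the commit.

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Elementwise closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical element pair with an absolute difference of 0.015625.
expected, actual = 0.0, 0.015625

old_ok = within_tolerance(actual, expected, atol=1e-2, rtol=1e-2)  # original tolerance
new_ok = within_tolerance(actual, expected, atol=2e-2, rtol=1e-2)  # relaxed tolerance
print(old_ok, new_ok)  # False True
```

When the expected value is near zero, the `rtol` term contributes almost nothing, so `atol` alone decides the outcome. That is consistent with the commit raising only the absolute tolerance.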
PyTorch MergeBot	161bb67116	Revert "Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 )" This reverts commit ace6decc9948e434dfe2e253bc28341bb22aa983. Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/clee2000 due to unfortunately the internal pybind update got reverted cc @malfet ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2253147079))	2024-07-26 17:02:56 +00:00
Nikita Shulga	c382fc3fea	[Reland] Fix vulkan builds with missing overrides errors (#131760 ) Followup after https://github.com/pytorch/pytorch/pull/131524 Add note explaining why C10 macros should not be used in that header Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760 Approved by: https://github.com/atalman	2024-07-26 17:01:51 +00:00
Bin Bao	1a2edf6dca	[AOTI] Fix _mm_plus_mm codegen (#131689 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/128474 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131689 Approved by: https://github.com/chenyang78	2024-07-26 16:50:12 +00:00
PyTorch MergeBot	696e83a1da	Revert "TCPStore: fix remote address (#131773 )" This reverts commit 9039131a89a5fdb8746bd86b0a4dd91559821e36. Reverted https://github.com/pytorch/pytorch/pull/131773 on behalf of https://github.com/clee2000 due to broke internal builds D60265883, something about formatter ([comment](https://github.com/pytorch/pytorch/pull/131773#issuecomment-2253123800))	2024-07-26 16:47:57 +00:00
Yidi Wu	404a8ae8f6	[export] fix set_grad x tensor constant. (#131787 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/130379. The original error is that the verifier finds the placeholder nodes' meta["val"] missing in the subgraph of the WrapSetGradEnabled hop. In this PR, we fix it by re-ordering the replace_set_grad_with_hop_pass with the lift_constant_tensor pass, because only after lift_constant_pass do all the constant attrs have meta["val"]. Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor" Differential Revision: D60244935 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787 Approved by: https://github.com/yushangdi	2024-07-26 16:41:59 +00:00
PyTorch MergeBot	bb64702eb3	Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit 520182dbffe09943be74a8a9cd58618fc171738f. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/clee2000 due to broke internal tests D60265910 ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2253113689))	2024-07-26 16:40:03 +00:00
Alnis Murtovi	d57de73fe0	AutoHeuristic: Add support for kernel choice selection (#131610 ) This PR enables AutoHeuristic for kernel choice selection, where the feedback can not immediately be provided when AutoHeuristic is called, but only after autotuning has happened. The steps are the following: When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores the results. I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610 Approved by: https://github.com/eellison	2024-07-26 16:35:55 +00:00
PyTorch MergeBot	a38890a53f	Revert "[2/3] 3D Composability - move pp tests (#129801 )" This reverts commit 29571c5c06f6e5fd143d85c18d8a6b87d2e4e1d3. Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](`544f950d14`) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2253099894))	2024-07-26 16:30:29 +00:00
Animesh Jain	13ab92b72d	[dynamo][recompile-logs] Suggest force_parameter_static_shapes on the recompile log for parameter-related recomps (#131825 ) Discovered in https://github.com/pytorch/pytorch/issues/121369 On the user-empathy-day model, the logs look like these ~~~ W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] function: 'auto_repeat_tensors_for_time' (/home/anijain/local/lumiere-pytorch/lumiere_pytorch/lumiere.py:545) W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] last reason: 0/0: len(L['args']) == 1 W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] function: 'forward' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:150) W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] last reason: 11/0: tensor 'L['x']' size mismatch at index 0. expected 16, actual 8 W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] function: 'normalize_weight' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:127) W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] last reason: 40/1: tensor 'L['weight']' size mismatch at index 0. expected 64, actual 16. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters. W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] function: 'pack_one' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:38) W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] last reason: 58/1: tensor 'L['t']' stride mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters. W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] function: 'torch_dynamo_resume_in_pack_at_70' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/packing.py:70) W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] last reason: 62/0: tensor 'L['tensors'][0]' size mismatch at index 0. expected 16, actual 32. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters. W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] torch._dynamo hit config.cache_size_limit (8) W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] function: 'reshape' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/_backends.py:91) W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] last reason: 65/0: tensor 'L['x']' size mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters. ~~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131825 Approved by: https://github.com/ezyang ghstack dependencies: #131795, #131801, #131804	2024-07-26 16:25:21 +00:00
Zhengxu Chen	7feaa73057	[export] Remove deprecated fields from ExportedProgram ctor. (#131697 ) Summary: as title. Test Plan: CI Reviewed By: SherlockNoMad Differential Revision: D60078426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131697 Approved by: https://github.com/ydwu4	2024-07-26 16:19:46 +00:00
PyTorch MergeBot	546df5daf8	Revert "[3/3] 3D Composability - move tp dp tests (#129802 )" This reverts commit ec3829795dfb58a58ebc9ca241f7949efd60bfda. Reverted https://github.com/pytorch/pytorch/pull/129802 on behalf of https://github.com/atalman due to Need to revert https://github.com/pytorch/pytorch/pull/129801 that got remerged ([comment](https://github.com/pytorch/pytorch/pull/129802#issuecomment-2253082995))	2024-07-26 16:19:25 +00:00
cyy	2988d33c80	[3/N] Fix clang-tidy warnings in jit (#131830 ) Follows #131735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131830 Approved by: https://github.com/ezyang	2024-07-26 15:46:28 +00:00
Brian Hirsh	5612408735	_get_operation_overload: dont raise exception when overload does not exist (#131554 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131554 Approved by: https://github.com/ezyang, https://github.com/zou3519 ghstack dependencies: #131403, #131482, #131665	2024-07-26 15:38:11 +00:00
andrewor14	eba2ffd278	[pt2e][quant] Ensure BN node is erased after convert (#131651 ) Summary: Previously, when folding BN into conv, we rely on DCE to clean up the unused BN node from the graph. This works if the model is already in eval mode, but fails if the model is still in train mode because DCE doesn't remove nodes with potential side effects (in this case `_native_batch_norm_legit`). This required users to move the model to eval mode before calling convert in order to get a properly DCE'd graph. To solve this, we manually erase the BN node after folding instead of relying on DCE. This relaxes the ordering constraints between `move_exported_model_to_eval` and `convert_pt2e`. Test Plan: python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node Reviewers: jerryzh168, yushangdi Subscribers: jerryzh168, yushangdi, supriyar Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651 Approved by: https://github.com/yushangdi	2024-07-26 15:30:45 +00:00
Bin Bao	9440a4824d	[CI][dashboard] Add a workflow to collect A10g perf (#131816 ) Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually-triggered per-PR performance comparison instead of exhausting expensive A100 instances. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816 Approved by: https://github.com/huydhn	2024-07-26 14:36:14 +00:00
Dan Zimmerman	535c17efb3	[torch] Implement c10::BFloat16 ctor from __hip_bfloat16 (#131359 ) Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](`481912a1fd`)), so we gate the impl on the `__BF16_HOST_DEVICE__` macro to support older and newer versions of ROCm. Test Plan: CI Differential Revision: D60024830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359 Approved by: https://github.com/houseroad	2024-07-26 14:30:49 +00:00
Brian Hirsh	e4ace1a396	AOTDispatcher: properly bump version counter on input mutations in inference graphs (#131665 ) This ensures that in an inference setting, we properly bump the VC of mutated graph inputs. Previously, we would only properly bump the VC for training graphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131665 Approved by: https://github.com/ezyang, https://github.com/zou3519 ghstack dependencies: #131403, #131482	2024-07-26 14:22:20 +00:00
Brian Hirsh	5570a0da0a	dont dispatch aten.conj(scalar_tensor) back to python (#131482 ) https://github.com/pytorch/pytorch/issues/105290 The problem in the original flow is that: (1) the user calls `torch.mul(complex_tensor, complex_scalar)` (2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)` (3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597) (4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)` (5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar (6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars. My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually: (1) convert the scalar tensor back into a scalar (2) call scalar.conj() directly (3) convert the result back into a wrapped tensor This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway). Notably, I did not add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482 Approved by: https://github.com/zou3519, https://github.com/ezyang ghstack dependencies: #131403	2024-07-26 14:22:20 +00:00
Brian Hirsh	8bb9aa93a7	dynamo: mutations on .data should be invisible to autograd (#131403 ) Fixes https://github.com/pytorch/pytorch/issues/121353 Our handling of `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same, because: (1) any future mutations on `x.data` will be fully ignored by autograd (2) any mutations on `x.detach()` will bump x's version counter. The linked model does a .data mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing. I updated dynamo's handling so that: (1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data` (2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run. One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that: (1) I didn't want to do this for every op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data (2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2024-07-26 14:22:20 +00:00
PyTorch MergeBot	7339c8ab28	Revert "immutable accessors in graph signature (#131807 )" This reverts commit 6fd28fc228f900863d63b1c83912dcc000b084e3. Reverted https://github.com/pytorch/pytorch/pull/131807 on behalf of https://github.com/atalman due to Broke CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10111847569/job/27965364355) [HUD commit link](`608057afe2`) ([comment](https://github.com/pytorch/pytorch/pull/131807#issuecomment-2252875417))	2024-07-26 14:21:12 +00:00
Yanbo Liang	e76e566cfb	[Dynamo] Support zip_longest (#131497 ) Fixes #121348 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131497 Approved by: https://github.com/mlazos, https://github.com/jansel, https://github.com/zou3519	2024-07-26 14:06:10 +00:00
PyTorch MergeBot	c9888c2739	Revert "[BE] typing for decorators - optim/optimizer (#131583 )" This reverts commit a1dad77dfa4e244a867ca7c73e9f6b6fe36a1340. Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](`a1dad77dfa`) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))	2024-07-26 13:41:22 +00:00
PyTorch MergeBot	7ee6831ae8	Revert "Fix vulkan builds with missing overrides errors (#131760 )" This reverts commit 7260eaeca056ffa013de769c10a2bfce9505d937. Reverted https://github.com/pytorch/pytorch/pull/131760 on behalf of https://github.com/malfet due to Does not work with internal builds ([comment](https://github.com/pytorch/pytorch/pull/131760#issuecomment-2252783645))	2024-07-26 13:38:28 +00:00
zengxian	d3e932dc10	[CI] Add inductor cpu accuracy test running on AVX2 runners (#128682 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128682 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-07-26 13:24:41 +00:00
Huy Do	e73fa28ec8	[CI] Fix arm64 docker build arch (#131869 ) Attempt to fix arm64 docker build arch on https://github.com/pytorch/pytorch/pull/131855 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131869 Approved by: https://github.com/desertfire	2024-07-26 13:19:36 +00:00
Peter Bell	608057afe2	[inductor] Fix duplicated range tree codegen in split scan (#131669 ) Looks like in the halide codegen refactor, the range tree codegen was split out from initialize_range_tree into its own function, but triton_split_scan.py wasn't updated to reflect this change. The result was the codegen gets invoked twice which is benign but makes the kernel harder to read. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669 Approved by: https://github.com/Chillee	2024-07-26 13:11:26 +00:00
Bin Bao	945946e817	[AOTI] Fix another ABI-compatible CPU issue (#131798 ) Summary: This problem is seen on AOTI CPU dashboard runs, a cpp compilation error because ConstantHandle::get doesn't exist. This PR adds ConstantHandle::get so that the interface is consistent with RAIIAtenTensorHandle. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131798 Approved by: https://github.com/zou3519, https://github.com/chenyang78 ghstack dependencies: #131791	2024-07-26 11:27:58 +00:00
William Wen	7d282d8755	[dynamo] add lazy IteratorVariable implementations for map and zip (#131413 ) Fixes https://github.com/pytorch/pytorch/issues/130750. Repro of lazy/eager `map` discrepancy without `islice`: ```python def fn(a, b): y = 1 def f(x): nonlocal y y += 1 return x l = list(zip([a, b], map(f, [1, 2, 3, 4]))) return a + y ``` The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above. `MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked. We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413 Approved by: https://github.com/anijain2305	2024-07-26 10:47:38 +00:00
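The lazy/eager discrepancy behind the commit above can be reproduced in plain Python, without dynamo. `zip` stops as soon as its first iterable is exhausted, so a lazy `map` in second position is only advanced as many times as the first iterable has items. Unpacking the `map` up front, which is what the old tracing effectively did, runs the mapped function on every element. The `run` helper and its `eager` flag are illustrative names, not dynamo code.

```python
def run(eager: bool) -> int:
    """Return how many times the mapped function was called."""
    calls = 0

    def f(x):
        nonlocal calls
        calls += 1
        return x

    mapped = map(f, [1, 2, 3, 4])
    if eager:
        # Mimics unpacking the map result immediately into a sequence.
        mapped = list(mapped)

    # zip pulls from ["a", "b"] first and stops once it is exhausted.
    list(zip(["a", "b"], mapped))
    return calls


lazy_calls = run(eager=False)
eager_calls = run(eager=True)
print(lazy_calls, eager_calls)  # 2 4
```

In the commit's repro, the side effect is `y += 1`, so the lazy semantics give `y == 3` while eager unpacking gives `y == 5`, and the returned `a + y` differs accordingly.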
IvanKobzarev	115994fea2	[aotd] Align partitioner graph output type to tuple (#131759 ) Brian debugged the difference in the output type between inference and train graphs. The partitioner sometimes returns a list as its output type. After this PR it will always return a tuple. Potentially, some new graphs inside tests may land between this PR's CI jobs finishing and the PR landing. This can be easily fixed with a fast-forward fix using: ``` EXPECTTEST_ACCEPT=1 python test/test.py ``` Adding ciflows/periodic to minimize this probability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759 Approved by: https://github.com/ezyang, https://github.com/bdhirsh	2024-07-26 09:46:29 +00:00
Bin Bao	1e24f7875e	[AOTI] Fix ABI-compatible mode link issue for CPU (#131791 ) Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path. This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791 Approved by: https://github.com/zou3519, https://github.com/chenyang78	2024-07-26 09:02:13 +00:00
Avik Chaudhuri	6fd28fc228	immutable accessors in graph signature (#131807 ) Test Plan: existing tests Differential Revision: D60253955 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131807 Approved by: https://github.com/ydwu4	2024-07-26 08:56:19 +00:00
Jiang, Yanbing	bceb91222c	Fix meta error in _convert_weight_to_int4pack (#130915 ) This PR is to fix meta error in _convert_weight_to_int4pack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915 Approved by: https://github.com/jerryzh168	2024-07-26 08:36:30 +00:00
Avik Chaudhuri	2bf649f5ae	suggested fix for data-dependent error (#125378 ) Suggests fixes for data-dependent errors in non-strict export. Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues. For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR: ``` Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)). (Size-like symbols: u1) <snip>...</snip> User code: File "test/export/test_export.py", line 1944, in forward return r.view(items[0], items[2]) Suggested fixes (please choose one of the following): 1. torch._check(items[2] == r.shape[1]) 2. torch._check(items[2] != r.shape[1])" ``` Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang). In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code. Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next. It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). 
However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead. Differential Revision: D56867432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378 Approved by: https://github.com/ezyang	2024-07-26 08:34:50 +00:00
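The stack-walking idea in this commit (find the closest non-library frame and look in its locals for the symbols named in the error) can be illustrated with plain Python tracebacks. This is a rough sketch under stated assumptions, not the PR's implementation: `Sym`, `lib_view`, `user_forward`, `LIBRARY_FUNCS`, and `locals_for_symbol` are all hypothetical stand-ins, and "library frame" is decided here by function name only.

```python
class Sym:
    """Hypothetical stand-in for an unbacked symbolic integer such as u2."""

    def __init__(self, name):
        self.name = name


def lib_view(size):
    # Pretend library call that fails on a data-dependent condition.
    raise RuntimeError(f"could not guard on data-dependent expression involving {size.name}")


def user_forward():
    rows = Sym("u1")
    cols = Sym("u2")
    return lib_view(cols)


# Assumption: frames of these functions count as "library" frames.
LIBRARY_FUNCS = {"lib_view"}


def locals_for_symbol(exc, symbol_name):
    """Return names of locals, in the innermost non-library frame, bound to the symbol."""
    tb = exc.__traceback__
    frame = None
    while tb is not None:
        # Later traceback entries are closer to the error, so keep the last match.
        if tb.tb_frame.f_code.co_name not in LIBRARY_FUNCS:
            frame = tb.tb_frame
        tb = tb.tb_next
    if frame is None:
        return []
    return sorted(
        name
        for name, value in frame.f_locals.items()
        if isinstance(value, Sym) and value.name == symbol_name
    )


try:
    user_forward()
except RuntimeError as err:
    found = locals_for_symbol(err, "u2")

print(found)  # local variable names a suggested fix could refer to
```

Expressing the symbol through a user-visible local (here `cols`) is what lets a suggested fix be pasted directly into the user's code.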
Adnan Akhundov	fb3ddafbcf	[inductor] Add type hints to functions in mkldnn_fusion.py (#131820 ) Summary: ATT Test Plan: lintrunner Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820 Approved by: https://github.com/eellison	2024-07-26 08:11:34 +00:00
Janani Sriram	13e806a591	[NestedTensor] Add support for transposed NestedTensors where ragged_idx > 1 for sum and mean operators (#131517 ) Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`. Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension. Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517 Approved by: https://github.com/davidberard98	2024-07-26 07:21:32 +00:00
Xuehai Pan	63374dda69	[BE][Easy] explicitly define global constants in `torch.testing._internal.common_utils` (#129826 ) This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM". Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826 Approved by: https://github.com/Skylion007	2024-07-26 06:32:08 +00:00
Boyuan Feng	aebfd3d4de	[CUDAGraph] skip cudagraph if too many distinct sizes (#131387 ) Current implementation records a new cudagraph for every distinct input size. This leads to significant overhead if there are too many distinct input sizes. While we currently hint re-recording cudagraph from dynamic shapes, it is at [info level](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L363-L366) which is easy to overlook and leads to several issues, such as Issue #119640 and Issue #128424. This PR checks the number of cudagraph due to dynamic shapes and warns loudly if #cudagraph exceeds a threshold `cudagraph_dynamic_shape_limit`(=50). Fixes #119640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131387 Approved by: https://github.com/eellison	2024-07-26 06:17:35 +00:00
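The thresholding described in this commit can be sketched independently of CUDA. The class below is a hypothetical illustration, not the code in `cudagraph_trees.py`: it assumes that one graph is recorded per distinct input shape, and that once the number of recorded shapes reaches a limit, further new shapes are skipped with a single loud warning. The names `SizeLimitedRecorder` and `dispatch` are invented for this example.

```python
import warnings


class SizeLimitedRecorder:
    """Toy model: one recording per distinct input shape, capped at `limit`."""

    def __init__(self, limit):
        self.limit = limit
        self.recorded = set()
        self.warned = False

    def dispatch(self, shape):
        """Return "replay", "record", or "skip" for this input shape."""
        if shape in self.recorded:
            return "replay"
        if len(self.recorded) >= self.limit:
            if not self.warned:
                # Warn once, at warning level, so the message is hard to overlook.
                warnings.warn(
                    f"skipping graph recording: more than {self.limit} distinct input sizes",
                    stacklevel=2,
                )
                self.warned = True
            return "skip"
        self.recorded.add(shape)
        return "record"


recorder = SizeLimitedRecorder(limit=2)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decisions = [recorder.dispatch(s) for s in [(1,), (2,), (1,), (3,), (4,)]]

print(decisions)
print(len(caught))
```

The commit itself uses a threshold named `cudagraph_dynamic_shape_limit` with a default of 50; the limit of 2 here only keeps the example short.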
Boyuan Feng	16d7cb5049	[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 ) As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621 Approved by: https://github.com/eellison	2024-07-26 06:14:06 +00:00
Yu, Guangye	dfba85c26b	Update torch-xpu-ops pin (ATen XPU implementation) (#131643 ) # Motivation Regular update. 1. Some new ATen ops support 2. ABI=0 build support 3. Remove dispatched implementation of pin_memory&is_pinned 4. Enhance deterministic usage Pull Request resolved: https://github.com/pytorch/pytorch/pull/131643 Approved by: https://github.com/EikanWang	2024-07-26 05:51:58 +00:00
Nikita Shulga	baa93e160f	[MPS] Add native implementation for shift ops (#131813 ) Similar to how AND/OR/XOR ops are implemented TODO: Consider using MPS method calls rather than metal kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/131813 Approved by: https://github.com/manuelcandales	2024-07-26 05:01:20 +00:00
Aaron Orenstein	a1dad77dfa	[BE] typing for decorators - optim/optimizer (#131583 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583 Approved by: https://github.com/janeyx99 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582	2024-07-26 05:00:07 +00:00
Aaron Orenstein	8689d377f9	[BE] typing for decorators - signal/windows/windows (#131582 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131582 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581	2024-07-26 05:00:07 +00:00
Aaron Orenstein	dbf7c318b2	[BE] typing for decorators - _refs/nn/functional (#131581 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131581 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580	2024-07-26 05:00:03 +00:00
Aaron Orenstein	81c26ba5ae	[BE] typing for decorators - utils/flop_counter (#131580 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131580 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579	2024-07-26 04:59:58 +00:00
Adnan Akhundov	33069630ce	[inductor] Add type hints to functions in decompositions.py (#131780 ) Summary: ATT Test Plan: lintrunner Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/131780 Approved by: https://github.com/eellison	2024-07-26 04:50:23 +00:00
Avik Chaudhuri	5b05ad9697	fix non-persistent buffers (#131756 ) Summary: Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature. This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict. Test Plan: existing tests + new case with non-persistent buffer in nested module Differential Revision: D60224656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756 Approved by: https://github.com/zhxchen17, https://github.com/ydwu4	2024-07-26 04:45:30 +00:00
Animesh Jain	a617919541	[dynamo] Do not guard on keys for _forward_hooks and _forward_pre_hooks (#131682 ) Fixes https://github.com/pytorch/pytorch/issues/125836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131682 Approved by: https://github.com/bdhirsh	2024-07-26 04:39:54 +00:00
Xuan Zhang	3d7c424a75	[inductor] update users to buffers instead of scheduler nodes (#131796 ) After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes. In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796 Approved by: https://github.com/yf225	2024-07-26 03:34:26 +00:00
Isuru Fernando	6dbf343936	Fix aten implementation for low memory max_pool2d (#131717 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131717 Approved by: https://github.com/peterbell10	2024-07-26 03:23:16 +00:00
YangQun1	c2f3266c8e	Do not remove collective ops in DCE since they have side effects (#131023 ) Fixes #130918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131023 Approved by: https://github.com/yf225	2024-07-26 03:03:32 +00:00
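Why dead-code elimination must treat side-effecting ops as roots can be shown on a toy graph. This is a minimal sketch, not the pass changed by the PR; `Node`, `has_side_effect`, and `eliminate_dead_code` are hypothetical names. A collective such as an all-reduce may have no users in the graph and still be required, so liveness is seeded from the outputs and from every side-effecting node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    has_side_effect: bool = False  # e.g. a collective op


def eliminate_dead_code(nodes, outputs):
    """Keep nodes reachable from the outputs or from any side-effecting node."""
    by_name = {n.name: n for n in nodes}
    worklist = list(outputs) + [n.name for n in nodes if n.has_side_effect]
    live = set()
    while worklist:
        name = worklist.pop()
        if name in live:
            continue
        live.add(name)
        worklist.extend(by_name[name].inputs)
    # Preserve the original node order.
    return [n for n in nodes if n.name in live]


graph = [
    Node("x"),
    Node("y", inputs=["x"]),
    Node("all_reduce", inputs=["y"], has_side_effect=True),  # no users, still needed
    Node("z", inputs=["x"]),                                 # no users, truly dead
    Node("out", inputs=["x"]),
]
kept = [n.name for n in eliminate_dead_code(graph, outputs=["out"])]
print(kept)
```

Without the side-effect seed, both `all_reduce` and its producer `y` would be dropped along with `z`.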
Yu, Guangye	e0d3e4a498	remove unused code for XPU (#131856 ) # Motivation This PR aims to remove unused code in PyTorch for XPU, following https://github.com/pytorch/pytorch/pull/128179 Otherwise, CI will block without this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856 Approved by: https://github.com/EikanWang	2024-07-26 02:57:12 +00:00
Will Feng	236d055330	[Traceable FSDP2] Add partial-graph (graph-break) unit tests (#131747 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131747 Approved by: https://github.com/bdhirsh	2024-07-26 02:51:57 +00:00
PyTorch MergeBot	03f49c9523	Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 )" This reverts commit 16699c7d848fca669865d83ffff205bcbb8665be. Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))	2024-07-26 02:08:45 +00:00
Boyuan Feng	16699c7d84	[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 ) As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621 Approved by: https://github.com/eellison	2024-07-26 01:40:23 +00:00
Colin Peppler	2ff98bc57f	[inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726 Approved by: https://github.com/desertfire ghstack dependencies: #131253	2024-07-26 00:58:04 +00:00
PyTorch MergeBot	b343644f3a	Revert "MTIA equivalent of torch.cuda.memory_stats (#131673 )" This reverts commit 513ce5f69a7f53742b7aa5798082dd158beec2ed. Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))	2024-07-26 00:54:37 +00:00
Yanbo Liang	b893a57f96	[Dynamo] Fix guard_on_nn_modules unit tests discrepancy between OSS and fbcode (#131810 ) Fixes Meta internal task: [T195592220](https://www.internalfb.com/intern/tasks/?t=195592220) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131810 Approved by: https://github.com/zou3519	2024-07-26 00:24:46 +00:00
Animesh Jain	246e32055a	[benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804 ) Fixes https://github.com/pytorch/pytorch/issues/121989 We are turning on the flag by default in another PR. But that PR can go through reverts. So, forcibly adding the benchmark to prevent dashboard fluctuation in case of reverts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #131795, #131801	2024-07-26 00:20:42 +00:00
Peter Bell	c92f2a19a4	[BE] Use assertEqual in MultiKernel tests (#127725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127725 Approved by: https://github.com/lezcano ghstack dependencies: #131044, #127724	2024-07-26 00:12:43 +00:00
Peter Bell	9ae288f4be	[inductor] Simplify multi-kernel codegen by unifying kernel args (#127724 ) Persistent kernels are sometimes able to remove intermediate buffers that would otherwise be needed for the non-persistent reduction kernel. This makes multi kernel's codegen more complicated as it needs to drop these extra arguments at runtime after selecting the correct kernel to run. Instead, this PR updates the persistent kernel's `must_keep_buffers` so these aren't dropped during codegen so both kernels have the same signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724 Approved by: https://github.com/shunting314 ghstack dependencies: #131044	2024-07-26 00:12:43 +00:00
PyTorch MergeBot	14920c149b	Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275 )" This reverts commit 0455344777f354dcbbd8e661a46ca2ca20e8a913. Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](`0455344777`) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))	2024-07-26 00:12:40 +00:00
Tristan Rice	adbe4f5ecf	TCPStore: add better logging on wait timeout (#131808 ) This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout. Bonus: * fix weirdness where `connect_timeout` only supported seconds unlike the rest of our timeouts (thus minimum timeout was 1s) * Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s) Ex: ``` DistStoreError: wait timeout after 100ms, keys: /the_key ``` Test plan: ``` python test/distributed/test_store.py python test/distributed/test_c10d_gloo.py -v -k timeout ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808 Approved by: https://github.com/kurman	2024-07-25 23:54:41 +00:00
Brian Hirsh	e9443860e7	add python binding for _get_current_graph_task_keep_graph (#131038 ) Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580 For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile). There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038 Approved by: https://github.com/soulitzer	2024-07-25 23:50:40 +00:00
cyy	eac83479cc	Enable Wunused-function and Wunused-result globally (#131596 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596 Approved by: https://github.com/zou3519	2024-07-25 23:50:12 +00:00
Animesh Jain	2a4ca5ccc4	[dynamo] Pop the exception stack on handling the StopIteration natively (#131801 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131801 Approved by: https://github.com/yanboliang ghstack dependencies: #131795	2024-07-25 23:33:19 +00:00
Animesh Jain	11673851d9	[dynamo][exception][bugfix] Add a pop for < 3.11 version (#131795 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131795 Approved by: https://github.com/yanboliang	2024-07-25 23:33:19 +00:00
Colin Peppler	f885a70fab	[inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253 ) ## What is sympy fn str arg? It's a string such as `sqrt` which also happens to be a real sympy function (e.g. `sympy.sqrt`) ## Crash ``` torch/_inductor/sizevars.py", line 468, in symbolic_hint expr = self.simplify(expr) # where expr is 'sqrt' torch/_inductor/sizevars.py", line 66, in simplify return sympy.expand(expr).xreplace(self.replacements) sympy/core/function.py", line 2816, in expand return sympify(e).expand(deep=deep, modulus=modulus, **hints) AttributeError: 'function' object has no attribute 'expand' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253 Approved by: https://github.com/desertfire	2024-07-25 23:31:20 +00:00
drisspg	b4b62d3945	update to 2.5.8 (#131684 ) # Summary This stack brings the current fork of FAv2 near the top of main which is 2.6.2 Notably we need to update cutlass to 3.5.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131684 Approved by: https://github.com/jainapurva	2024-07-25 23:15:03 +00:00
Michael Lazos	51f4f87718	[Reland] Ensure staticmethods can be allowed in graph (#131789 ) Fixes https://github.com/pytorch/pytorch/issues/124735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131789 Approved by: https://github.com/anijain2305	2024-07-25 22:54:18 +00:00
wz337	4de85e3c30	[DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636 ) We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes. Since `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636 Approved by: https://github.com/wanchaol	2024-07-25 22:47:22 +00:00
Aaron Orenstein	79f0c4dc04	[BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131579 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578	2024-07-25 22:24:19 +00:00
Aaron Orenstein	c65b197b85	[BE] typing for decorators - _library/custom_ops (#131578 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577	2024-07-25 22:24:19 +00:00
Aaron Orenstein	5ee6a6dacc	[BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131577 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576	2024-07-25 22:24:19 +00:00
Aaron Orenstein	37d76c7d48	[BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131576 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575	2024-07-25 22:24:19 +00:00
Aaron Orenstein	42dc5a47a1	[BE] typing for decorators - _inductor/fx_passes/post_grad (#131575 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131575 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574	2024-07-25 22:24:19 +00:00
Aaron Orenstein	b2cbcf710b	[BE] typing for decorators - _inductor/lowering (#131574 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131574 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573	2024-07-25 22:24:19 +00:00
Aaron Orenstein	f0f20f7e97	[BE] typing for decorators - _jit_internal (#131573 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131573 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571, #131572	2024-07-25 22:24:19 +00:00
Aaron Orenstein	bfe0079b72	[BE] typing for decorators - _meta_registrations (#131572 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131572 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570, #131571	2024-07-25 22:24:19 +00:00
Aaron Orenstein	4b985e6f80	[BE] typing for decorators - distributed/_tensor/ops/utils (#131571 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131571 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569, #131570	2024-07-25 22:24:19 +00:00
Aaron Orenstein	5731b486c8	[BE] typing for decorators - library (#131570 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131570 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568, #131569	2024-07-25 22:24:19 +00:00
Aaron Orenstein	aa58af8b43	[BE] typing for decorators - masked/_ops (#131569 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131569 Approved by: https://github.com/oulgen, https://github.com/zou3519 ghstack dependencies: #131568	2024-07-25 22:24:19 +00:00
Aaron Orenstein	193f62fde9	[BE] typing for decorators - fx/_compatibility (#131568 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568 Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519	2024-07-25 22:24:19 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
Animesh Jain	0455344777	[dynamo] Turn on inline_inbuilt_nn_modules (#131275 ) Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696 Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644) Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51)) ![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9) Inference sees a little bit more perf degradation but we are ok with that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275 Approved by: https://github.com/ezyang ghstack dependencies: #131744	2024-07-25 22:14:17 +00:00
Simon Mahns	513ce5f69a	MTIA equivalent of torch.cuda.memory_stats (#131673 ) Summary: Adding MTIA equivalent of `torch.cuda.memory_stats` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673 Approved by: https://github.com/egienvalue	2024-07-25 21:59:59 +00:00
Tristan Rice	9039131a89	TCPStore: fix remote address (#131773 ) This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo. Test plan: Enable debug logs and verify addresses are correct ``` TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131773 Approved by: https://github.com/kurman	2024-07-25 21:55:25 +00:00
Xu Han	520182dbff	[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. 2. Only use `deprecated_cpp_compile_command` for `fb_code`, since I can no longer debug it without access to the Meta internal environment. 3. Add `TODO` comments asking a Meta employee to help continue this work. 4. Given item 3, only the `fb_code` use of `deprecated_cpp_compile_command` remains to be fixed, so remove `validate_new_cpp_commands`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-25 21:45:40 +00:00
Yanbo Liang	a34692c0a3	[Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552 ) Combine #131073 and #131012 and fix doc building failures. Co-authored-by: chilli <chilli@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552 Approved by: https://github.com/Chillee	2024-07-25 21:29:46 +00:00
Shengbao Zheng	89bdd9c18f	[kineto] populate src/dst rank for p2p (#130812 ) Summary: as title populate src/dst rank (global rank) for p2p kernel Differential Revision: D59794535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130812 Approved by: https://github.com/aaronenyeshi	2024-07-25 21:10:57 +00:00
Wanchao Liang	1c58aacbc8	[dtensor] move ops to private (#131211 ) as titled Differential Revision: [D60132519](https://our.internmc.facebook.com/intern/diff/D60132519) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131211 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #131212	2024-07-25 20:59:55 +00:00
Jon Janzen	605dfd8fb4	Switch sync_distributed_folder to use non-reverse order (#131683 ) `git` on GHA seems to use the reverse commit ordering that I see locally O_o Pull Request resolved: https://github.com/pytorch/pytorch/pull/131683 Approved by: https://github.com/seemethere	2024-07-25 20:44:23 +00:00
PyTorch MergeBot	fe2e6f0c51	Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit dfc9bfc8839ea3a0ffe933a64cd129fab5e4da75. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/atalman due to Breaks CI test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10099725941/job/27930133346) [HUD commit link](`2c1851f04e`) ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2251360224))	2024-07-25 20:44:04 +00:00
James Wu	1ad4e6f228	Refactor cudagraphs to use serializable placeholder info (#130252 ) This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252 Approved by: https://github.com/eellison, https://github.com/masnesral ghstack dependencies: #129384	2024-07-25 20:39:37 +00:00
eqy	69d63b2318	[CUDA][Pooling] Clean up unused `accscalar_t` in `maxpool2d` forward (#131728 ) maxpool forward doesn't actually do any accumulation and the second template param was just a dupe of the first Pull Request resolved: https://github.com/pytorch/pytorch/pull/131728 Approved by: https://github.com/mikaylagawarecki	2024-07-25 20:32:42 +00:00
Peter Bell	fdc4d6fe96	[inductor] Refactor fusion of inplace operations (#130835 ) Resubmit of #128979 `WeakDep`s force readers to have completed before a mutation overwrites the buffer, but we want to allow fusions to occur for inplace mutations where the same index is read and written. Currently this is achieved by: 1. Identifying the buffers used by the mutating op in its `dep_closure` 2. Not creating `WeakDep`s for buffers in the `dep_closure` 3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical` So we are first over-aggressive in removing `WeakDep`, then add an ad-hoc fixup. This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to `can_fuse_vertical` which selectively allows inplace operation to fuse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835 Approved by: https://github.com/lezcano	2024-07-25 20:29:01 +00:00
Zain Rizvi	61d7bb3e79	Migrate trunk workflows to Amazon2023 ami (#131677 ) A continuation of the migration started in - https://github.com/pytorch/pytorch/pull/131250 All migrated trunk jobs passed successfully Pull Request resolved: https://github.com/pytorch/pytorch/pull/131677 Approved by: https://github.com/malfet	2024-07-25 20:19:16 +00:00
James Wu	a6ebd56f7b	Factor out cudagraph post compile into its own function (#129384 ) Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start. No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384 Approved by: https://github.com/eellison	2024-07-25 20:15:44 +00:00
IvanKobzarev	58b8704f28	[aot] Keep backward mutations in backward (#129130 ) https://github.com/pytorch/pytorch/issues/127561 Mutations of inputs in backward are emitted manually, after joint_fn tracing. With the default partitioner logic they will be moved to the "forward" graph, as they are operations on forward inputs. To keep those mutations in backward: - Introduce a "subgraph" node key that can be specified with a context manager. When we do a manual `copy_` in backward on a forward input, we know that this is for backward, so we set subgraph="backward". In the partitioner: introduce an optional argument subgraph to filter out nodes with a specified subgraph (node_subgraph) and not add them to the subgraph if node_subgraph is different. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130 Approved by: https://github.com/Chillee	2024-07-25 20:02:25 +00:00
ashwani	6c31e02971	Fixes the example for `convert_conv3d_weight_memory_format` (#131742 ) Fixes #129158 Please let me know if changes are needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/131742 Approved by: https://github.com/albanD	2024-07-25 20:01:44 +00:00
Animesh Jain	fba24252bd	[dynamo][frame summary] Skip frame summary for frames from inside torch/nn/modules (#131744 ) This ensures that the stack trace points to the user code. At main (no inlining) ![image](https://github.com/user-attachments/assets/bf6f1f46-2dfe-45a2-95e1-fb733cda7e50) With inlining but without this PR ![image](https://github.com/user-attachments/assets/fcb16c4d-dd81-4e5d-a63a-391a73683deb) With inlining and this PR ![image](https://github.com/user-attachments/assets/69f10f65-c2ed-4179-acd5-a2824615129c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131744 Approved by: https://github.com/ezyang	2024-07-25 19:30:03 +00:00
Prachi Gupta	a1fad03fa8	[ROCm] Enable cudagraph expandable segments UTs in inductor/dynamo (#131111 ) Test runtimes extracted from CI logs are as follows. "linux-focal-rocm6.1-py3.8": "dynamo/test_cudagraphs_expandable_segments": 3.3185000000000002, "inductor/test_cudagraph_trees_expandable_segments": 153.233, Pull Request resolved: https://github.com/pytorch/pytorch/pull/131111 Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/peterbell10	2024-07-25 19:26:04 +00:00
Yueming Hao	8c4683c978	Add device argument to the large_grid unit test (#131702 ) A missing device argument makes this unit test run only on CPUs. Two unit tests were added in the previous PR https://github.com/pytorch/pytorch/pull/127448 but only one of them uses `device=self.device` to make sure the tests run on the correct devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131702 Approved by: https://github.com/desertfire	2024-07-25 19:19:56 +00:00
Matthew Hoffman	bf6aae1468	Improve `torch.masked.mean` and `torch.masked._std_var` scaling (#131293 ) Fixes #131292 Using `new_ones` is expensive and unnecessary. Before: ![21232fda-366a-47ea-a017-15a35cd51d0c](https://github.com/user-attachments/assets/779830f0-0027-4fab-a9e6-b99954c80bc5) After: ![aad2dfcc-52c9-4046-86ab-122b044fa19c](https://github.com/user-attachments/assets/810711c5-c4f0-4b6b-91dc-9a9e714f6ee0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131293 Approved by: https://github.com/ezyang	2024-07-25 18:52:59 +00:00
Yidi Wu	2c1851f04e	[export] fix output node's meta (#131706 ) Summary: This pr fixes all the places in strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`: The `get_attr` nodes has wrong metadata. I guess there are more things need to be fixed to get it working but it's beyond the scope of this PR. Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle" Differential Revision: D60198221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706 Approved by: https://github.com/yushangdi	2024-07-25 18:44:21 +00:00
Xu Han	dfc9bfc883	[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. 2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I can no longer debug it without access to the Meta internal environment. 3. Add `TODO` comments asking Meta employees for help continuing this work. 4. Due to item 3, only the `deprecated_cpp_compile_command` path for `fb_code` remains to be fixed, so remove `validate_new_cpp_commands`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-25 18:34:08 +00:00
PyTorch MergeBot	f3df7deab8	Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431 )" This reverts commit e9db1b059733a02e1fb726d22a0489471044ad98. Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))	2024-07-25 18:00:46 +00:00
William Wen	2423d89d0c	[dynamo] mirror training flag in OptimizedModule (#131546 ) Fixes https://github.com/pytorch/pytorch/issues/122414. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131546 Approved by: https://github.com/yanboliang, https://github.com/anijain2305	2024-07-25 17:43:09 +00:00
PyTorch MergeBot	c3679bed35	Revert "Fix py codegen to delete values that don't have any users (#131028 )" This reverts commit 91aba7baac3d2a079c0b13db25588842260c98cc. Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_functionalize [GH job link](https://github.com/pytorch/pytorch/actions/runs/10094659640/job/27915271250) [HUD commit link](`91aba7baac`) ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2251058374))	2024-07-25 17:42:18 +00:00
mori360	ec3829795d	[3/3] 3D Composability - move tp dp tests (#129802 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802 Approved by: https://github.com/fduwjj ghstack dependencies: #129800, #129801	2024-07-25 16:36:55 +00:00
mori360	29571c5c06	[2/3] 3D Composability - move pp tests (#129801 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801 Approved by: https://github.com/wconstab ghstack dependencies: #129800	2024-07-25 16:36:55 +00:00
Yidi Wu	75c4176b05	[export][BE] consolidate export and export_for_training (#131496 ) Summary: This PR consolidates the implementation of export and export_for_training to maximize code re-use. Also add some type annotations and comments in the code for better readability. Test Plan: Existing tests. Differential Revision: D60130515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131496 Approved by: https://github.com/avikchaudhuri, https://github.com/pianpwk	2024-07-25 16:35:16 +00:00
Shangdi Yu	6bc8db1d32	Rename is_training flag to have more information (#131618 ) Summary: rename is_training flag into dispatch_tracing_mode = “make_fx” or “aot_export” Test Plan: OSS CI Differential Revision: D60154327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131618 Approved by: https://github.com/ydwu4	2024-07-25 16:29:55 +00:00
angelayi	f063027d54	[aoti] Fix constant inputs passed to aoti (#131594 ) In cases where the program takes in a constant, export will specialize on the constant and embed the constant into the graph, with the graph containing a placeholder node with no users. However, inductor errors further down as typically in torch.compile, these constants don't show up as inputs. Since these constants are already embedded in the graph, we will just ignore these inputs while compiling with AOTI, and filter out the non-tensor inputs during the runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594 Approved by: https://github.com/desertfire	2024-07-25 16:22:15 +00:00
Yidi Wu	ffc6bf8149	[dynamo] lazily guard and specialize on the symint when used in f-string. (#131529 ) Fixes https://github.com/pytorch/pytorch/issues/103602. This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized." mentioned in above issue. Specifically, we create a lazy variable tracker instead of ConstantVariable when we're in FORMAT_VALUE, and when the lazy variable tracker is realized (i.e. it's going to be used), we create a ConstantVariable and the specialization/guarding happens at the time of realization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529 Approved by: https://github.com/ezyang	2024-07-25 16:16:34 +00:00
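The lazy-realization idea described in the entry above can be illustrated with a toy sketch in plain Python. This is not Dynamo's implementation: `LazyConstant`, `realize`, and the `guards` list are hypothetical names, used only to show that the specialization (here, appending a guard record) happens when the value is first used rather than when the string is created.

```python
class LazyConstant:
    """Toy stand-in for a lazy variable tracker (hypothetical, not Dynamo's class)."""

    def __init__(self, compute, guards):
        self._compute = compute    # deferred computation of the constant
        self._guards = guards      # shared list that records specializations
        self._value = None
        self._realized = False

    def realize(self):
        # Specialize (record a guard) only the first time the value is needed.
        if not self._realized:
            self._value = self._compute()
            self._guards.append(f"specialized on {self._value!r}")
            self._realized = True
        return self._value


guards = []
size = 8  # stands in for a symbolic integer

unused = LazyConstant(lambda: f"size={size}", guards)  # created but never used
used = LazyConstant(lambda: f"size={size}", guards)
message = used.realize()  # only this one contributes a guard
```

In this sketch the string that is built and then discarded (`unused`) leaves `guards` untouched, which mirrors the "prefer to NOT have specialized" goal quoted in the message.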
Sherlock Huang	96e8df6a3a	[ts_converter] Support prim::max and prim::if with multiple outputs (#131593 ) Summary: As title. Test Plan: test_converter.py Reviewed By: angelayi Differential Revision: D60147455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131593 Approved by: https://github.com/ydwu4	2024-07-25 16:13:31 +00:00
cyy	b07ea91c4c	[2/N] Fix clang-tidy warnings in jit (#131735 ) Follows #131034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131735 Approved by: https://github.com/ezyang	2024-07-25 15:56:53 +00:00
PyTorch MergeBot	49a8e061b6	Revert "Support IPC for Expandable Segments (#130890 )" This reverts commit 0e71a88f9b2ca6b950c76a061791559cdd8a8870. Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to some internal tests show shutdown issues with the change to the table that holds ipc handles ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2250767280))	2024-07-25 15:54:57 +00:00
cyy	a4be5cb50e	Simplify some c++ code (#131612 ) The simplifications were discovered by static analysis tools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131612 Approved by: https://github.com/ezyang	2024-07-25 15:07:37 +00:00
Mikayla Gawarecki	c3d099ddd1	[BE][Easy] Add hooks to doc for Optimizer base class (#131628 ) Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628 Approved by: https://github.com/janeyx99	2024-07-25 15:07:08 +00:00
Bin Bao	745b55d14a	[CI][dashboard] Add a workflow to collect aarch64 perf (#131729 ) Summary: as title Pull Request resolved: https://github.com/pytorch/pytorch/pull/131729 Approved by: https://github.com/huydhn	2024-07-25 14:58:47 +00:00
Howard Huang	1eedb0a962	fix torchrun log message (#131652 ) fixes https://github.com/pytorch/pytorch/issues/131461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131652 Approved by: https://github.com/awgu	2024-07-25 14:50:10 +00:00
Julia Guo	d0e2ab617d	Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch (#129022 ) Migration of Docker conda builds to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml Related to: https://github.com/pytorch/builder/issues/1849 Migrate scripts and workflows, adds logic to execute on PR and upload to ecr with github hash tag in order to test Docker build and nightly on PR. Test when executing on PR, upload to ecr: https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327 ``` 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b ``` Test With-Push, upload to dockerhub: https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427 ``` docker.io/pytorch/conda-builder:cpu done ``` Will upload here: https://hub.docker.com/r/pytorch/conda-builder/ Test using ecr image in the nightly workflow: https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87 Note: This is first part that will build docker and upload it to either dockerhub or ecr. After merging followup PR will need to change conda nightly workflows to either use ecr image or dockerhub image, depending if we are running it on PR or from main/release branch. Cleanup of workflows and scripts from builder repo: https://github.com/pytorch/builder/pull/1923 Co-authored-by: atalman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129022 Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet, https://github.com/chuanqi129	2024-07-25 14:36:15 +00:00
Aaron Orenstein	4a5a87168e	[BE] typing for decorators - _prims_common/wrappers (#131567 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131567 Approved by: https://github.com/oulgen, https://github.com/zou3519	2024-07-25 14:35:13 +00:00
Nikita Shulga	7260eaeca0	Fix vulkan builds with missing overrides errors (#131760 ) Followup after https://github.com/pytorch/pytorch/pull/131524 Also, use `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED` macro to suppress existing warnings Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760 Approved by: https://github.com/atalman	2024-07-25 14:29:44 +00:00
Aaron Enye Shi	fddb1bcdea	[CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964 ) Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator. Test Plan: CI, resnet run, and FBR model. Differential Revision: D59703213 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964 Approved by: https://github.com/zdevito	2024-07-25 14:06:52 +00:00
Justin Chu	c88c90a897	[TS2E] Improve logging (#131711 ) Serializing the text without having to do so can be costly for large outputs like ExportedProgram Pull Request resolved: https://github.com/pytorch/pytorch/pull/131711 Approved by: https://github.com/ydwu4	2024-07-25 13:40:10 +00:00
Jiong Gong	316c0d3e6b	[inductor][cpp][gemm] support k slicing for static shapes (#130821 ) This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible. The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disable by default). We need to identify a good heuristics in the future to enable k-slicing by default. Performance numbers with 64x4096x64, 64x10000x64, 64x20000x64 as examples on 60-core SPR as examples. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough. 
Without k-slicing AUTOTUNE linear_unary(64x4096, 64x4096, 64) cpp_packed_gemm_0 0.0108 ms 100.0% _linear_pointwise 0.0431 ms 25.1% AUTOTUNE linear_unary(64x10000, 64x10000, 64) cpp_packed_gemm_0 0.0272 ms 100.0% _linear_pointwise 0.0892 ms 30.5% AUTOTUNE linear_unary(64x20000, 64x20000, 64) cpp_packed_gemm_0 0.0781 ms 100.0% _linear_pointwise 0.1693 ms 46.1% With k-slicing: AUTOTUNE linear_unary(64x4096, 64x4096, 64) cpp_packed_gemm_0 0.0260 ms 100.0% _linear_pointwise 0.0444 ms 58.5% AUTOTUNE linear_unary(64x10000, 64x10000, 64) cpp_packed_gemm_0 0.0275 ms 100.0% _linear_pointwise 0.0893 ms 30.8% AUTOTUNE linear_unary(64x20000, 64x20000, 64) cpp_packed_gemm_0 0.0284 ms 100.0% _linear_pointwise 0.1686 ms 16.8% Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #131024	2024-07-25 13:36:38 +00:00
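The k-slicing scheme described in the entry above (per-thread partial GEMMs into temporary buffers, a barrier, then a reduction along the k-slices) can be sketched in pure Python. This is only an illustration of the technique, not the CPP GEMM template: `matmul_slice`, `k_sliced_matmul`, and `num_slices` are hypothetical names, matrices are lists of lists, and the thread pool stands in for the template's thread groups.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_slice(a, b, k_start, k_end):
    """Partial product of a (M x K) and b (K x N) over k in [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]  # temporary accumulation buffer
    for i in range(m):
        for k in range(k_start, k_end):
            a_ik = a[i][k]
            for j in range(n):
                out[i][j] += a_ik * b[k][j]
    return out


def k_sliced_matmul(a, b, num_slices):
    """Split the K dimension into slices, compute each slice in its own
    worker, then reduce the partial results into the final output."""
    k_total = len(b)
    bounds = [
        (s * k_total // num_slices, (s + 1) * k_total // num_slices)
        for s in range(num_slices)
    ]
    # Leaving the `with` block waits for every worker, acting as the barrier.
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        partials = list(pool.map(lambda se: matmul_slice(a, b, *se), bounds))
    m, n = len(a), len(b[0])
    # Reduction along the k-slices.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
result = k_sliced_matmul(a, b, num_slices=2)
```

The extra buffers and the barrier are the overhead the message refers to, which is consistent with the reported numbers: slicing only pays off once K is large enough for the parallel partial products to outweigh that overhead.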
PyTorch MergeBot	d962dba0c4	Revert "[2/3] 3D Composability - move pp tests (#129801 )" This reverts commit 84cd062fb25c6da7d33b559c28afa38420e64415. Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](`544f950d14`) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2250326191))	2024-07-25 13:30:56 +00:00
Jane Xu	9c4cf866c2	Adafactor forloop basic impl (#129905 ) #109581 At this point, the vanilla implementation (the default) is good. Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor Specifically, the impl in this PR, which attempts to replicate the paper, ``` optim = torch.optim.Adafactor([weight]) ``` is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor ``` optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False) ``` is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor ``` optim = keras.optimizers.Adafactor(learning_rate=0.01) ``` The three results respectively for the same randomly generated weights: ``` # ours tensor([[ 0.3807594, -0.3912092], [ 0.0762539, 0.5377805], [ 0.2459473, 0.4662207]]) # pytorch-optimizer tensor([[ 0.3807592, -0.3912172], [ 0.0762507, 0.5377818], [ 0.2459457, 0.4662213]]) # keras array([[ 0.38076326, -0.39121315], [ 0.0762547 , 0.5377859 ], [ 0.24594972, 0.46622536]], dtype=float32) ``` This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences: * keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01. * We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling. 
<details> Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing My script repro: ``` import torch from pytorch_optimizer import AdaFactor torch.set_printoptions(precision=7) weight = torch.tensor([[ 0.37697506, -0.39500135], [ 0.07246649, 0.53399765], [ 0.24216151, 0.46243715]], dtype=torch.float32) # bias = torch.tensor([0, 0], dtype=torch.float32) weight.grad = torch.tensor([[-0.5940447, -0.7743838], [-0.5940447, -0.7743838], [-0.5940447, -0.7743838]], dtype=torch.float32) # bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32) weight_c = weight.clone() weight_c.grad = weight.grad.clone() optim = torch.optim.Adafactor([weight]) optim.step() print(weight) optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False) optim_c.step() print(weight_c) ``` <details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905 Approved by: https://github.com/albanD	2024-07-25 13:17:19 +00:00
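The step-size rule mentioned in the entry above, `min(lr, 1 / sqrt(step))` with a default `lr` of 0.01, can be written out as a small sketch. The function name `step_size` and the 1-based `step` convention are assumptions made for illustration; this is not the code in `torch.optim.Adafactor`.

```python
import math


def step_size(step, lr=0.01):
    """rho_t = min(lr, 1 / sqrt(step)) for a 1-based step count (assumed)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return min(lr, 1.0 / math.sqrt(step))


early = step_size(1)       # lr is the smaller term early in training
late = step_size(40_000)   # 1 / sqrt(step) takes over once step > 1 / lr**2
```

With the default `lr=0.01`, the `1 / sqrt(step)` term only becomes the smaller one after 10,000 steps, so for short runs this rule behaves like a constant learning rate.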
Patryk Merchelski	e8956c9fe6	Allow cpu scalar to be moved to HPU in masked_fill_decomposition (#127871 ) Extension of the condition allowing the cpu scalar to be moved to specific devices. This fixes an HPU specific error: `torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised: RuntimeError: Expected `value` to be on same device as `a`While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})` On the HPU in eager mode the problem doesn't occur because the pytorch's implementation is not used then. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-07-25 13:04:55 +00:00
YangQun1	91aba7baac	Fix py codegen to delete values that don't have any users (#131028 ) Fixes #131025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028 Approved by: https://github.com/ezyang	2024-07-25 13:04:23 +00:00
Peter Bell	2784b3f1b7	[inductor] Fix split-scan interaction with multi-kernel (#131044 ) This fixes a couple errors that come up when multi-kernel is used with split-scan. 1. The split-scan was being marked as a persistent kernel, which allowed a multi-kernel to be created but this isn't supported. Fix is to never mark split-scan as persistent. 2. Benchmark codegen was not handling WorkspaceArg, and would raise a KeyError during codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044 Approved by: https://github.com/shunting314	2024-07-25 11:36:36 +00:00
Xuehai Pan	c04f70bb30	[BE] enable UFMT for `torch/ao/` (#128864 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128864 Approved by: https://github.com/ezyang	2024-07-25 11:30:14 +00:00
Xuehai Pan	434f60ce33	Refactor nightly checkout tool (#131134 ) Changes: - Add `-C REPO` in `git` commands so the tool can be run from anywhere, not only the repo dir - Use `pathlib.Path` as much as possible - Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)` - Add `encoding='utf-8'` for files Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134 Approved by: https://github.com/ezyang	2024-07-25 11:20:43 +00:00
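The pattern listed in the entry above (`git -C REPO`, `pathlib.Path`, and `subprocess.check_output`) might look roughly like the sketch below. The helper names `git_command` and `git_output` are hypothetical and are not taken from `tools/nightly.py`; the sketch only shows how the pieces fit together.

```python
import subprocess
from pathlib import Path


def git_command(repo, *args):
    """Build a git command that targets `repo` regardless of the current directory."""
    return ["git", "-C", str(Path(repo)), *args]


def git_output(repo, *args):
    # check_output raises CalledProcessError on a non-zero exit status
    # and returns the captured stdout, decoded here as UTF-8 text.
    return subprocess.check_output(git_command(repo, *args), encoding="utf-8").strip()


cmd = git_command(Path.home() / "pytorch", "rev-parse", "HEAD")
```

Because the repository is passed with `-C`, a call such as `git_output(repo, "rev-parse", "HEAD")` does not depend on the directory the script was started from.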
Xuehai Pan	054d214c50	[BE][tests] show local variables on failure in tests (#131151 ) ------ As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI. Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily. Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361 ```text /opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000 @classmethod def eval(cls, base, divisor): # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full # Assert triggered by inequality solver # assert base.is_integer, base # assert divisor.is_integer, divisor # We don't provide the same error message as in Python because SymPy # makes it difficult to check the types. 
if divisor.is_zero: raise ZeroDivisionError("division by zero") if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in ( int_oo, -int_oo, sympy.oo, -sympy.oo, ): return sympy.nan if base is sympy.nan or divisor is sympy.nan: return sympy.nan if base.is_zero: return sympy.S.Zero if base.is_integer and divisor == 1: return base if base.is_integer and divisor == -1: return sympy.Mul(base, -1) if ( isinstance(base, sympy.Number) and isinstance(divisor, sympy.Number) and ( base in (int_oo, -int_oo, sympy.oo, -sympy.oo) or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo) ) ): r = float(base) / float(divisor) if r == math.inf: return int_oo elif r == -math.inf: return -int_oo elif math.isnan(r): return sympy.nan else: return sympy.Integer(math.floor(r)) if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer): return sympy.Integer(int(base) // int(divisor)) if isinstance(base, FloorDiv): return FloorDiv(base.args[0], base.args[1] * divisor) # Expands (x + y) // b into x // b + y // b. # This only works if floor is an identity, i.e. x / b is an integer. for term in sympy.Add.make_args(base): quotient = term / divisor if quotient.is_integer and isinstance(divisor, sympy.Integer): # NB: this is correct even if the divisor is not an integer, but it # creates rational expressions that cause problems with dynamic # shapes. 
return FloorDiv(base - term, divisor) + quotient try: gcd = sympy.gcd(base, divisor) if gcd != 1: > return FloorDiv( sympy.simplify(base / gcd), sympy.simplify(divisor / gcd) ) base = -1.00000000000000 cls = FloorDiv divisor = -1.00000000000000 gcd = 1.00000000000000 quotient = 1.00000000000000 term = -1.00000000000000 /opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {} @wraps(func) def wrapper(args, kwargs): try: > retval = cfunc(args, **kwargs) E RecursionError: maximum recursion depth exceeded in comparison E E To execute this test, run the following from the base repo dir: E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float E E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 args = (FloorDiv, -1.00000000000000, -1.00000000000000) cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0> func = <function Function.__new__ at 0x7fc530317280> kwargs = {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151 Approved by: https://github.com/ezyang	2024-07-25 10:10:58 +00:00
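The effect of showing local variables on failure, as in the output above, can be reproduced with the standard library alone. The sketch below uses the `tb_locals` parameter of `unittest.TextTestRunner`, which is the programmatic counterpart of `--locals`; the test case `Demo` and its values are made up for illustration and are unrelated to the PyTorch test suite.

```python
import io
import unittest


def run_failing_test_with_locals():
    """Run a deliberately failing test and return the text report."""

    class Demo(unittest.TestCase):
        def test_fails(self):
            base = -1.0
            divisor = 0.0
            self.assertGreater(divisor, base + 2)  # fails: 0.0 is not > 1.0

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    stream = io.StringIO()
    # tb_locals=True adds each frame's local variables to the traceback.
    unittest.TextTestRunner(stream=stream, tb_locals=True).run(suite)
    return stream.getvalue()


report = run_failing_test_with_locals()
```

The returned report contains the usual traceback followed by lines such as `base = -1.0`, which is the information that helps when a failure only reproduces on CI.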
Anshul Sinha	c4bf4005d1	[dtensor][debug] adding new noise level which allows users to only print operations with dtensors (#131592 ) Summary I have added a new noise level between the existing levels of 1 and 2, such that the noise level controls are now: 0. prints module-level collective counts 1. prints dTensor operations not included in trivial operations (new noise level) 2. prints operations not included in trivial operations 3. prints all operations This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592 Approved by: https://github.com/XilunWu ghstack dependencies: #131419, #130996	2024-07-25 06:54:57 +00:00
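The four noise levels listed in the entry above amount to a filter over operations. A minimal sketch of that filter follows, assuming a hypothetical `TRIVIAL_OPS` set and a hypothetical `should_print` helper; neither name comes from the DTensor debug tooling, and the op names are placeholders.

```python
# Placeholder set of "trivial" operations (assumed for illustration only).
TRIVIAL_OPS = {"aten.detach", "aten.view"}


def should_print(op_name, is_dtensor_op, noise_level):
    """Decide whether an individual operation is printed at a given noise level."""
    if noise_level <= 0:
        return False  # level 0: only module-level collective counts are printed
    if noise_level == 1:
        return is_dtensor_op and op_name not in TRIVIAL_OPS  # new level
    if noise_level == 2:
        return op_name not in TRIVIAL_OPS
    return True  # level 3: all operations


printed_at_level_1 = [
    name
    for name, is_dtensor in [("aten.mm", True), ("aten.add", False), ("aten.view", True)]
    if should_print(name, is_dtensor, noise_level=1)
]
```

Each level prints a superset of the level below it, so raising the noise level never hides an operation that was previously shown.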
Adnan Akhundov	41e9f9cb7c	[inductor] Fix flaky tests in test_select_algorithm.py (#131709 ) Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_select_algorithm.py`. Test Plan: Tested internally. Differential Revision: D60202778 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131709 Approved by: https://github.com/eellison	2024-07-25 06:42:57 +00:00
Adnan Akhundov	3afdbecb23	[inductor] Fix flaky tests in test_debug_trace.py (#131722 ) Summary: When run internally in multiple parallel processes, the `test_debug_trace` hits the cache and skips writing all the expected outputs. Here we force-disable inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment. Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition. Test Plan: Tested internally. Differential Revision: D60207586 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722 Approved by: https://github.com/eellison	2024-07-25 05:56:01 +00:00
Yunqiu Guo	059f9fb30b	[BE][inductor] Type annotate `codecache.py` and `config.py` (#131427 ) As title. Checked/ Referred to the raw json file for runtime types . (and tried to cover all the missing annotations listed in the .json) this time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427 Approved by: https://github.com/eellison, https://github.com/oulgen	2024-07-25 05:54:38 +00:00
Xuehai Pan	ace6decc99	Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 ) Fix static `py::object`s with `py::gil_safe_call_once_and_store`. The following code creates a static `py::object` whose destructor is called when the program shuts down. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault. ```c++ void func() { static py::object obj = py::module_::import("foo").attr("bar"); ... } ``` The correct code is to use raw pointers rather than the instance. ```c++ void func() { static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")}; py::object obj = *obj_ptr; ... } ``` This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which runs arbitrary initialization code exactly once, thread-safely, under the Python GIL. ```c++ void func() { PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage; py::object obj = storage .call_once_and_store_result( []() -> py::object { return py::module_::import("foo").attr("bar"); } ) .get_stored(); ... } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-25 05:53:09 +00:00
Adnan Akhundov	59ef88ea5b	[inductor] Fix flaky tests in test_pad_mm (#131699 ) Summary: When run internally, some tests in `test_pad_mm.py` requiring big enough GPU to run `max_autotune=True` fail, as they're getting a smaller GPU than they need. Here we add `skipTest`s to skip the tests in these (rare) circumstances. Differential Revision: D60192586 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131699 Approved by: https://github.com/chenyang78, https://github.com/shunting314, https://github.com/eellison	2024-07-25 05:46:45 +00:00
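The pattern described above, skipping a test when the machine's GPU is too small rather than letting it fail, might look roughly like this with the standard `unittest` module. It is a sketch under stated assumptions: the threshold, the `available_gpu_bytes` probe, and the test class are all hypothetical, and a real test would query the device instead of returning a constant.

```python
import unittest

# Hypothetical threshold for this sketch; the PR does not state a number.
REQUIRED_GPU_BYTES = 40 * 1024**3


def available_gpu_bytes() -> int:
    """Hypothetical probe; a real test would ask the device for its memory."""
    return 16 * 1024**3


class PadMMSketch(unittest.TestCase):
    def test_needs_big_gpu(self):
        # Skip instead of failing when the runner got a smaller GPU.
        if available_gpu_bytes() < REQUIRED_GPU_BYTES:
            self.skipTest("GPU too small for this test")
        # Body that needs the large GPU would go here.
        self.assertTrue(True)
```

A skipped test is reported separately from failures, so an undersized runner no longer makes the suite look flaky.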
Adnan Akhundov	ee996cd63c	[inductor] Fix flaky tests in test_benchmark_fusion.py (#131733 ) Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_benchmark_fusion.py`. Test Plan: Tested internally. Differential Revision: D60211793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131733 Approved by: https://github.com/oulgen	2024-07-25 05:39:14 +00:00
Xuehai Pan	42a4df9447	Support CUDA nightly package in `tools/nightly.py` (#131133 ) Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support. ```bash # installs pytorch-nightly with cpuonly tools/nightly.py pull # The following are only available on Linux and Windows # installs pytorch-nightly with latest CUDA we support tools/nightly.py pull --cuda # installs pytorch-nightly with CUDA 12.1 tools/nightly.py pull --cuda 12.1 ``` Also add targets in `Makefile` and instructions in the contribution guidelines. ```bash # setup conda environment with pytorch-nightly make setup-env # setup conda environment with pytorch-nightly with CUDA support make setup-env-cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133 Approved by: https://github.com/ezyang	2024-07-25 05:33:52 +00:00
Adnan Akhundov	ceab3121de	[inductor] Fix flaky tests in test_memory_planning.py (#131703 ) Summary: Internally, the ABI-compatible mode is [enabled by default](`eb54ca7abe/torch/_inductor/config.py (L53)`). As a result, when the `abi_compatible: False` flag is not specified explicitly in tests that assume non-ABI-compatible C++ codegen, those tests fail internally. Here we fix one such test in `test_memory_planning.py`. Test Plan: Tested internally. Differential Revision: D60197327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703 Approved by: https://github.com/eellison	2024-07-25 05:09:08 +00:00
cyy	35bb0d3638	Fix unsigned type bug in CUDACachingAllocator.cpp (#131464 ) curr_block->size and block_state.size are both size_t, so once they are not equal, split will happen. According to the comment, it's better to use '>' Pull Request resolved: https://github.com/pytorch/pytorch/pull/131464 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-25 04:48:05 +00:00
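The reasoning in the commit above rests on how unsigned arithmetic behaves: a difference of two `size_t` values is never negative, so a test of the form "difference greater than zero" is true whenever the operands differ. The sketch below emulates this in Python. It assumes a 64-bit `size_t`, and the difference-based check is a hypothetical illustration of the bug class, not a quote of the original C++ code.

```python
SIZE_T_BITS = 64  # assumption: size_t is 64 bits wide
MASK = (1 << SIZE_T_BITS) - 1


def unsigned_sub(a: int, b: int) -> int:
    """Emulate `a - b` on size_t operands: the result wraps modulo 2**64."""
    return (a - b) & MASK


def should_split_by_difference(curr: int, wanted: int) -> bool:
    # Hypothetical buggy form: the wrapped difference is positive whenever
    # the two sizes differ, even when curr is the smaller one.
    return unsigned_sub(curr, wanted) > 0


def should_split_by_comparison(curr: int, wanted: int) -> bool:
    # Direct comparison, as the commit message suggests ('>').
    return curr > wanted


# A smaller current block wrongly triggers a split under the difference check.
print(should_split_by_difference(512, 1024))   # True
print(should_split_by_comparison(512, 1024))   # False
```

Comparing the operands directly avoids the wraparound entirely, which matches the message's recommendation to use `>`.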
Yanbo Liang	5f3f14e5e4	[BE] Annotate subgraph_lowering (#131545 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131545 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2024-07-25 04:35:26 +00:00
Jun Luo	00e19ae97a	[MTIA] Support module.mtia() (#131499 ) Summary: Following other device backends' implementation to support module.mtia() API. Test Plan: OSS and Internal CIs. Differential Revision: D60076584 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131499 Approved by: https://github.com/mikaylagawarecki	2024-07-25 04:23:48 +00:00
Xuehai Pan	2ce734cee9	[BE] enable UFMT for `torch/ao/quantization/` (#128863 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128863 Approved by: https://github.com/ezyang ghstack dependencies: #128861, #128862	2024-07-25 04:17:54 +00:00
Michael Lazos	a2f6eb33d0	Register buffer in static input test (#131686 ) Previously, without nn module inlining, dynamo would lift all tensor attributes on an nn module to be constant on the graph. With nn module inlining these need to be buffers explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131686 Approved by: https://github.com/anijain2305	2024-07-25 03:47:56 +00:00
cyy	62704db5c3	[Distributed] [10/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d/control_plane (#131671 ) Follows #130109 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131671 Approved by: https://github.com/zou3519	2024-07-25 03:46:55 +00:00
dependabot[bot]	2d7c135757	Bump setuptools from 69.5.1 to 70.0.0 in /tools/build/bazel (#130893 ) Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0. Changelog and commits are viewable in the [compare view](https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0). Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893 Approved by: https://github.com/kit1980	2024-07-25 03:32:08 +00:00
Manuel Candales	d6115439be	[MPS] Add SDPA implementation (#131362 ) This work is based off @malfet's #119200 Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362 Approved by: https://github.com/kimishpatel	2024-07-25 03:24:37 +00:00
cyy	d98d00487d	[2/N] Remove unused variables (#131468 ) Follows #122496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131468 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-25 03:08:07 +00:00
cyy	538258bc13	[1/N] Fix clang-tidy warnings in jit (#131034 ) Fix some clang-tidy warnings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131034 Approved by: https://github.com/ezyang	2024-07-25 03:03:46 +00:00
cyy	46e42ae85d	[4/N] Fix Wunused-parameter warnings (#131291 ) Follows #131271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131291 Approved by: https://github.com/ezyang	2024-07-25 02:59:22 +00:00
Xuehai Pan	03979a599e	[BE] enable UFMT for `torch/ao/pruning/` (#128862 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128862 Approved by: https://github.com/ezyang ghstack dependencies: #128861	2024-07-25 02:49:35 +00:00
Xuehai Pan	973a1362b9	[BE] enable UFMT for `torch/ao/nn/` (#128861 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128861 Approved by: https://github.com/ezyang	2024-07-25 02:49:19 +00:00
Animesh Jain	c047bddbca	[easy][dynamo] Update test for inline_inbuilt_n_modules (#131718 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131718 Approved by: https://github.com/williamwen42, https://github.com/mlazos ghstack dependencies: #131694	2024-07-25 02:49:16 +00:00
Animesh Jain	01bc2a8165	[inline-inbuilt-nn-modules] Skip mobilenet_v2 test for cpu inductor (#131694 ) Related issue https://github.com/pytorch/pytorch/issues/131693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131694 Approved by: https://github.com/eellison	2024-07-25 02:49:16 +00:00
Xuehai Pan	b5c006acac	[BE][Easy] enable UFMT for `torch/nn/` (#128865 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128865 Approved by: https://github.com/ezyang	2024-07-25 02:48:42 +00:00
cyy	8ea4c72eb2	[1/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#130798 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130798 Approved by: https://github.com/ezyang	2024-07-25 02:36:43 +00:00
angelayi	ab609d6aa6	[ts_convert] Update conversion for aten.tensor (#131549 ) Fixes aten::tensor issues in edgeml models P1492137675 \| suite \| #models \| #has_ts_model \| #has_sample_inputs \| #ts_can_run \| #can_convert \| #ep_result_correct \| #can_package \| #sigmoid_can_run \| #sigmoid_result_correct \| \|---------\|-----------\|-----------------\|----------------------\|---------------\|----------------\|----------------------\|----------------\|--------------------\|---------------------------\| \| EDGEML \| 34 \| 25 \| 23 \| 21 \| 2 \| 2 \| 2 \| 2 \| 2 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/131549 Approved by: https://github.com/jiashenC, https://github.com/SherlockNoMad	2024-07-25 01:11:03 +00:00
fduwjj	e20fb5e975	[PTD][c10d] Include PG status into flight recorder (#131268 ) We are considering consolidating the data source for logging and flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in flight recorder. We can also leverage this information to validate our FR dump. Because the dump is not synced, we might see some variation in the dump. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268 Approved by: https://github.com/shuqiangzhang	2024-07-25 01:01:00 +00:00
Zixi Qi	c3fe9075a9	[ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078 ) Summary: When tunable ops load selected kernels from a CSV file, they validate the hipblaslt version defined in hipblaslt-version.h. This PR changes the validator to fetch the hipblaslt version and revision from the hipblaslt runtime instead of the header file, as in our environment we might roll out a new version of the runtime prior to updating the header file fleet-wide. Test Plan: Verified generated tunableops kernel selection has the correct hipblaslt version from runtime: ``` Validator,PT_VERSION,2.5.0 Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty Validator,HIPBLASLT_VERSION,800-bf2c3184 Validator,ROCM_VERSION,6.0.0.0-12969-1544e39 Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack- GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676 GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019 GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723 GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047 ``` Differential Revision: D59889043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078 Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell	2024-07-25 00:54:07 +00:00
cyy	803c5b8640	[CMake] Fix private compile options for CUDA code (#130546 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130546 Approved by: https://github.com/ezyang	2024-07-25 00:22:18 +00:00
Oguz Ulgen	7a42470bcb	Annotate all InstructionTranslator (#131509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509 Approved by: https://github.com/zou3519	2024-07-24 23:45:53 +00:00
Angela Yi	7535b23a25	[export] Fix set_grad hoo if output is empty (#131511 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1467707973867409/ Test Plan: CI Differential Revision: D60135531 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131511 Approved by: https://github.com/ydwu4	2024-07-24 23:17:20 +00:00
Shangdi Yu	29c9f8c782	[export] Fix `graph_break` log registration error when importing export/_trace.py (#131523 ) Summary: When importing `_trace.py`, putting `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause the import of ``export/_trace.py`` to fail with the error: ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations. The error is raised directly on the line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations. Explicitly calling `import torch._logging` doesn't fix the issue. (see T196719676) We move ``_ALLOW_LIST`` to be a local variable. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)' buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)' Differential Revision: D60136706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523 Approved by: https://github.com/zhxchen17	2024-07-24 22:40:24 +00:00
PyTorch MergeBot	236e06f9f9	Revert "Ensure staticmethods can be allowed in graph (#130882 )" This reverts commit 93fdd0237dcfe8cb4c65f3596aef123417b760a1. Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/clee2000 due to torchrec test still broken internally D59945836 ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2249003059))	2024-07-24 22:32:41 +00:00
PyTorch MergeBot	5db5865614	Revert "Annotate all InstructionTranslator (#131509 )" This reverts commit eafbd20f23746aa6b9090d989a4ccb059f45297e. Reverted https://github.com/pytorch/pytorch/pull/131509 on behalf of https://github.com/clee2000 due to sorry need to revert this to revert something else, I think you only need to rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/131509#issuecomment-2249000843))	2024-07-24 22:29:49 +00:00
Nikita Shulga	a7e20ef7e4	[BE] Get rid of missing destructor override warning (#131204 ) Regression introduced by https://github.com/pytorch/pytorch/pull/126376 Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included ``` In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7: In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13: /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override] virtual ~HIPHooksInterface() = default; ^ /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here virtual ~AcceleratorHooksInterface() = default; ^ 1 warning generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204 Approved by: https://github.com/albanD, https://github.com/seemethere	2024-07-24 22:29:31 +00:00
Oguz Ulgen	b56939dae1	Annotate more InstructionTranslator (#131680 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131680 Approved by: https://github.com/zou3519 ghstack dependencies: #131676	2024-07-24 22:14:29 +00:00
Shangdi Yu	f9322c26b2	Remove _export/exported_program.py (#131597 ) Summary: We removed references to _export/exported_program.py in executorch in D60052318. Now we can remove this file. Update the pin to executorch. Test Plan: contbuild & OSS CI: Differential Revision: D60072980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597 Approved by: https://github.com/avikchaudhuri	2024-07-24 22:04:17 +00:00
PyTorch MergeBot	eb54ca7abe	Revert "[BE] Get rid of missing destructor override warning (#131204 )" This reverts commit 8a890b72dc3e4dcd501060c2a2fee139c235a8b8. Reverted https://github.com/pytorch/pytorch/pull/131204 on behalf of https://github.com/atalman due to sorry @malfet need to revert to make CI green, lets reland with ciflow/periodic label on ([comment](https://github.com/pytorch/pytorch/pull/131204#issuecomment-2248898033))	2024-07-24 21:08:49 +00:00
PaliC	544f950d14	[BE] Improve error message when there are internal changes (#131547 ) Fixes https://github.com/pytorch/test-infra/issues/4988 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131547 Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/atalman	2024-07-24 20:38:08 +00:00
joydddd	7f61324268	Add sparse block to flex_decoding kernel (#130884 ) fix typo Finish flex_decoding block sparse Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884 Approved by: https://github.com/drisspg	2024-07-24 20:30:25 +00:00
angelayi	b90aa18569	[aoti] Add initial custom op support (#127034 ) Re-land of https://github.com/pytorch/pytorch/pull/125242 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034 Approved by: https://github.com/malfet	2024-07-24 20:29:55 +00:00
Aaron Orenstein	44fdf24967	[BE] typing for decorators - jit/_decompositions (#131566 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131566 Approved by: https://github.com/oulgen, https://github.com/zou3519	2024-07-24 20:28:28 +00:00
Jerry Mannil	2b83e4f8d7	[ROCm] Enable flex decoding unit tests (#131048 ) Flex decoding tests are passing with upstream pytorch on MI300X/MI2XX. Only flex attention unit tests have issues. [result_mi250.log](https://github.com/user-attachments/files/16286954/result_mi250.log) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131048 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/malfet	2024-07-24 20:25:34 +00:00
mori360	84cd062fb2	[2/3] 3D Composability - move pp tests (#129801 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801 Approved by: https://github.com/wconstab ghstack dependencies: #129800	2024-07-24 20:17:54 +00:00
Justin Chu	a9e6356271	[ONNX] Update torch.onnx.export API (#131501 ) - Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`. - Make the options keyword-only arguments (bc-breaking) - Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options. - Deprecate two functions `exportable_ops` and pretty print export Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501 Approved by: https://github.com/titaiwangms	2024-07-24 20:03:17 +00:00
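The API changes listed above combine two general Python techniques: keyword-only parameters (everything after a bare `*`) and deprecation warnings for options slated for removal. The following is a minimal sketch of both. The function `export_model` and its return value are hypothetical stand-ins and do not reproduce the real `torch.onnx.export` signature.

```python
import warnings


def export_model(model, args=(), *, kwargs=None, dynamic_shapes=None, training=None):
    """Hypothetical stand-in used only to illustrate the pattern."""
    # Options after the bare `*` must be passed by keyword; passing them
    # positionally raises TypeError, which is why such a change is bc-breaking.
    if training is not None:
        # Deprecated option: still accepted, but the caller is warned.
        warnings.warn(
            "'training' is deprecated and will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
    return {
        "model": model,
        "args": args,
        "kwargs": kwargs or {},
        "dynamic_shapes": dynamic_shapes,
    }


# Keyword usage works; a third positional argument would raise TypeError.
result = export_model("model", (1, 2), kwargs={"flag": True})
```

Making options keyword-only lets new options be added later without shifting positional meaning, at the cost of breaking callers that relied on argument order.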
Justin Chu	9db567f17d	[ONNX] Set dump_exported_program to True in bench (#131670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670 Approved by: https://github.com/titaiwangms	2024-07-24 20:02:03 +00:00
Catherine Lee	85fa66be04	Add rerun_disabled_tests for inductor (#131681 ) Test in prod? THis also turns on mem leak check Briefly checked that ``` python3 ".github/scripts/filter_test_configs.py" \ --workflow "inductor" \ --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \ --test-matrix "{ include: [ { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" }, { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: 
"aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, ]} " \ --selected-test-configs "" \ --pr-number "${PR_NUMBER}" \ --tag "${TAG}" \ --event-name "schedule" \ --schedule "29 8 * * *" \ --branch "${HEAD_BRANCH}" ``` has rerun disabled tests option in the test matrix I don't think all these things need to run but I'm not sure which ones (probably just inductor?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681 Approved by: https://github.com/zou3519	2024-07-24 19:56:00 +00:00
Andrew Gallagher	65ce2bf465	Allow setting `PYTHON_LIB_REL_PATH` via environment variable (#128419 ) This allows builds to customize the location where caffe2's Python modules are installed to. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128419 Approved by: https://github.com/PaliC, https://github.com/d4l3k, https://github.com/malfet	2024-07-24 19:49:06 +00:00
mori360	074b46b7d9	[1/3] 3D Composability - move fsdp tests (#129800 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129800 Approved by: https://github.com/awgu	2024-07-24 19:47:34 +00:00
Oguz Ulgen	e0f1bf14a4	Fully type torch/utils/_config_module.py (#131676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131676 Approved by: https://github.com/zou3519	2024-07-24 19:36:09 +00:00
Zain Rizvi	05681b6838	Migrate missed experimental jobs to Amazon2023 AMI (#131485 ) Adding in a few jobs that got missed in https://github.com/pytorch/pytorch/pull/131250 Those jobs have passed with the new AMI: https://github.com/pytorch/pytorch/actions/runs/10063808680/job/27820050195?pr=131485 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131485 Approved by: https://github.com/atalman, https://github.com/malfet	2024-07-24 19:33:02 +00:00
Jithun Nair	05064f2827	[CI] Move all ROCm jobs to periodic frequency (#131637 ) `inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464 * Move rocm.yml to cron frequency * Move ROCm CI jobs from inductor.yml to inductor-rocm.yml * Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load) * After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge, but since they take 45min*3 time on average, I decided to leave them as-is since it will provide us some basic insulation against ROCm breakage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637 Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn	2024-07-24 19:26:58 +00:00
Bin Bao	8aff6caf67	[CI][dashboard] Rename cpu-x86 to cpu_x86 (#131658 ) Summary: '-' is used as a special separator by upload_dynamo_perf_stats.py, so switch to '_' instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131658 Approved by: https://github.com/huydhn	2024-07-24 19:16:52 +00:00
Bin Bao	3ce6f61416	[AOTI] Support fallback ops not in inductor_fallback_ops (#131247 ) Summary: For aten ops that are not listed in inductor_fallback_ops, AOTI will use proxy executor to execute them instead of erroring out as missing C shim implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131247 Approved by: https://github.com/angelayi	2024-07-24 19:16:43 +00:00
Zain Rizvi	aeca9845a6	Migrate Lint jobs to Amazon 2023 AMI (#131514 ) Continuing in the same vein as https://github.com/pytorch/pytorch/pull/131250, migrate all self-hosted lint.yml jobs to use the new Amazon 2023 AMI Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131514 Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/huydhn	2024-07-24 19:11:02 +00:00
Andrii Grynenko	b98b3127f7	[easy][pytorch][counters] Move WaitCounter in c10/util (#131021 ) Summary: Since WaitCounter frontend itself has minimal dependencies it's fine to be moved into c10. Specific backends can be registered/linked separately. Test Plan: unit test Reviewed By: jamesperng, asiab4, c-p-i-o Differential Revision: D59842868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021 Approved by: https://github.com/asiab4	2024-07-24 18:38:33 +00:00
William Wen	7718024d2b	[3.13] support 3.13 multiline traces in munge_exc (#131207 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #131206	2024-07-24 18:22:30 +00:00
William Wen	f0378912a0	[3.13, dynamo] fix test/dynamo/test_bytecode_utils.py (#131206 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131206 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-07-24 18:22:30 +00:00
Zhengxu Chen	a86909d251	[inductor] Type annotate constant_folding.py (#131364 ) Summary: Type annotate constant_folding.py Test Plan: mypy Reviewed By: angelayi Differential Revision: D60063872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131364 Approved by: https://github.com/angelayi	2024-07-24 18:20:06 +00:00
Haoci Zhang	8fe5b93667	support zb1p and zb2p algorithms (#130752 ) Previously, we proved that ZB2P is not truly zero bubble when num_local_stages exceeds 4, so only ZB1P was supported. We made a few tweaks to ZB2P to make it truly zero bubble. The algorithm and proof are attached. [zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752 Approved by: https://github.com/H-Huang	2024-07-24 17:58:46 +00:00
Prachi Gupta	5e6cfb7db5	Add an extra shard for distributed periodic jobs (#131498 ) Fixes issue of timeouts being observed in ROCm periodic workflow for distributed runs Pull Request resolved: https://github.com/pytorch/pytorch/pull/131498 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/clee2000	2024-07-24 16:44:53 +00:00
William Wen	106c6a49f5	[dynamo] limit number of compiles per frame (#130891 ) Fixes https://github.com/pytorch/pytorch/issues/130776 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891 Approved by: https://github.com/anijain2305	2024-07-24 16:43:40 +00:00
Aaron Orenstein	abcd329359	[BE] typing for decorators - onnx/symbolic_helper (#131565 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131565 Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519, https://github.com/titaiwangms	2024-07-24 16:39:47 +00:00
zdevito	0e71a88f9b	Support IPC for Expandable Segments (#130890 ) This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890 Approved by: https://github.com/dsjohns2	2024-07-24 15:45:40 +00:00
Zain Rizvi	eb5883f8aa	Add new runner labels to actionlint (#131525 ) Adding the labels corresponding to the Amazon2023 ami Pull Request resolved: https://github.com/pytorch/pytorch/pull/131525 Approved by: https://github.com/atalman	2024-07-24 15:28:59 +00:00
Xu Han	72d17d95d7	[inductor] Enable dynamo for Windows. RC1 (#131286 ) Changes: 1. Enable Windows in `check_if_inductor_supported`. 2. Disable Windows in `AotCodeCompiler`. 3. Force Windows inductor to `c++20` to support `std::enable_if_t`. 4. Temporarily disable the `test_x86inductor_quantizer` UT on `Windows`; it still has issues that need to be fixed: https://github.com/pytorch/pytorch/pull/131308 . With this PR, I have successfully run the first model, `resnet18`, on Windows inductor. TODO: 1. Upgrade pytorch Windows build to `c++20`. 2. Fix and re-enable `test_x86inductor_quantizer` UT on `Windows`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131286 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-24 15:26:55 +00:00
Jane Xu	4c7f22dee2	[BE] remove unnecessary _dispatch_sqrt by using 0.5 (#131358 ) Based on the discussion here where 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358 Approved by: https://github.com/albanD	2024-07-24 14:58:57 +00:00
rzou	98984422eb	[triton_op] fix autotuning (#131363 ) The problem was we were shoving SymInts into the constant_args side table. The root problem is that torch.fx.node.base_types, which we use to determine what can be put in the graph, doesn't actually have SymInt in it. This PR fixes base_types to include SymInt. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363 Approved by: https://github.com/oulgen, https://github.com/justinchuby	2024-07-24 14:03:37 +00:00
Andrew Gu	bc938184de	[FSDP2] Added `set_reduce_scatter_divide_factor` (#129286 ) This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286 Approved by: https://github.com/weifengpy	2024-07-24 12:42:35 +00:00
PyTorch MergeBot	8ffd109a00	Revert "Fix py codegen to delete values that don't have any users (#131028 )" This reverts commit 466c167b71e6021f8eadcfbae1d9156a375663ce. Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/atalman due to breaks CI ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2247771530))	2024-07-24 12:21:43 +00:00
cyyever	451462dbff	[1/N] Add missing constructors or assignment operators (#131077 ) Just mark them as deleted in most cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131077 Approved by: https://github.com/ezyang	2024-07-24 12:09:39 +00:00
Edward Z. Yang	0c6f1ca064	Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935 ) This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR: 1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective) 2. Using this infrastructure to synchronize automatic dynamic choices across all ranks The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState. Here is how automatic dynamic synchronization works: 1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program. 2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo 3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency. For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935 Approved by: https://github.com/jansel	2024-07-24 11:24:11 +00:00
Yifu Wang	85d3ee1d67	[micro_pipeline_tp] refactor all-gather and reduce-scatter pattern matchers to be more flexible and testable (#131409 ) High level goals: - Cover the all-gather and reduce-scatter pattern matchers with unit tests - Make it easier to exclude certain collectives as async-tp candidates - Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409 Approved by: https://github.com/weifengpy	2024-07-24 11:16:27 +00:00
Peter Bell	89d5391bbf	[inductor] Kill mark_node_as_mutating (#130834 ) Resubmit of #129346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834 Approved by: https://github.com/lezcano ghstack dependencies: #130832, #130833	2024-07-24 11:11:19 +00:00
Peter Bell	6415c45da5	[inductor] Use multiple outputs for flex-attention (#130833 ) Resubmit of #129344 This fixes the DCE issue for attention output Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833 Approved by: https://github.com/lezcano ghstack dependencies: #130832	2024-07-24 11:11:19 +00:00
Peter Bell	95c248751b	[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832 ) Resubmit of #129325 Previously each mutation was represented by a `MutationOutput` operation which was a new scheduler node that must be scheduled immediately afterwards. Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832 Approved by: https://github.com/lezcano	2024-07-24 11:11:14 +00:00
Justin Chu	a4c3f29047	[ONNX][BE] Remove ruff skips in torch/onnx (#131368 ) Remove all ruff skips for torch/onnx since we do not do runtime type checking anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131368 Approved by: https://github.com/titaiwangms, https://github.com/Skylion007	2024-07-24 10:56:43 +00:00
Nikita Shulga	62e566b345	[BE] Remove suppression of inconsistent missing overrides (#131524 ) This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204 - Remove global `-Wno-error=inconsistent-missing-override` - Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED` - Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524 Approved by: https://github.com/atalman	2024-07-24 10:07:36 +00:00
Avik Chaudhuri	83d19620f6	kill tmp _is_executorch flag (#131488 ) Test Plan: existing tests Differential Revision: D60126186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131488 Approved by: https://github.com/ydwu4	2024-07-24 08:51:37 +00:00
Bin Bao	1e34870796	[CI][dashboard][reland] Collect PT2 cpu perf nightly (#131560 ) Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131560 Approved by: https://github.com/huydhn	2024-07-24 08:50:33 +00:00
hxwang	276b5238ef	[bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909 ) Hey folks, I was using the `stateless_func` [here](`7c45476d38/torch/distributed/_spmd/api.py (L435)`), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084) but then introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch. ![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47) ![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909 Approved by: https://github.com/mlazos	2024-07-24 08:29:27 +00:00
cyy	41189b0da4	Simplify THPEvent_get_device (#131466 ) Because self->event.device() always returns Device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131466 Approved by: https://github.com/albanD	2024-07-24 08:24:01 +00:00
Janani Sriram	e782918b8e	[NestedTensor] Add example NestedTensor objects with inner dimension of size 1 to tests reducing along jagged dimension for NestedTensor (#131516 ) Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, , 1)` and `(B, , 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, )` and `(B, , 5)`. Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516 Approved by: https://github.com/davidberard98	2024-07-24 07:01:39 +00:00
Adnan Akhundov	e9db1b0597	Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431 ) Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it. Test Plan: ``` python test/inductor/test_triton_kernels.py -k test_triton_kernel_ autotune_with_unsupported_args ... ---------------------------------------------------------------------- Ran 3 tests in 3.636s OK ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431 Approved by: https://github.com/oulgen, https://github.com/zou3519	2024-07-24 05:37:09 +00:00
Oguz Ulgen	eafbd20f23	Annotate all InstructionTranslator (#131509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509 Approved by: https://github.com/zou3519	2024-07-24 05:31:01 +00:00
eellison	5772c13f56	Dont wrap negative indexing in scatter reduce (#131503 ) Fix for https://github.com/pytorch/pytorch/issues/131321 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503 Approved by: https://github.com/shunting314	2024-07-24 04:01:32 +00:00
Michael Lazos	9f96d4b61b	Disable inlining on cudagraph fallback tests (#131557 ) The cudagraph fallback tests should only run without nn module inlining. The [rerecord limit](`fc3d2b26cd/torch/_inductor/cudagraph_trees.py (L1922)`) is ignored if nn module inlining is disabled. Arguably it should just be higher, but this PR addresses the failures and allows inlining to be on by default on main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131557 Approved by: https://github.com/anijain2305 ghstack dependencies: #131556	2024-07-24 04:00:02 +00:00
Michael Lazos	9575b1afad	Ensure tensor dict is populated with compiled autograd (#131556 ) The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them. The context is that `tensor_dict` is used to convey if a tensor is an input with a static address. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556 Approved by: https://github.com/anijain2305	2024-07-24 04:00:02 +00:00
Oguz Ulgen	dffbd3a1e2	Add mypy typing to pattern_matcher (#131506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131506 Approved by: https://github.com/zou3519	2024-07-24 02:55:43 +00:00
Nick Westlake	7124efa81b	Include _native.h for structured_native_functions (#131208 ) In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so. This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups". Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208 Approved by: https://github.com/bdhirsh	2024-07-24 02:55:36 +00:00
Jiashen Cao	31da9ee711	Use explain function to provide more meaningful information when conversion failed. (#131214 ) Summary: In the script of testing different families of models, when the conversion failed, we switch to use output from the explain function to provide more meaningful information. Test Plan: Manual testing with attached log information. ``` buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101 ``` ``` Processing 440779101_5455.predictor.disagg.gpu.merge model_name: 440779101_5455.predictor.disagg.gpu.merge has_ts_model: True has_sample_inputs: True ops_maybe_missing_meta: set() ts_can_run: True ts_run_exception: None can_convert: False convert_exception: Unsupported nodes are found in the following list: 0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19] 1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19] ep_result_correct: None ep_run_exception: None can_package: None package_exception: None sigmoid_can_run: None sigmoid_run_exception: None sigmoid_result_correct: None ``` Reviewed By: SherlockNoMad Differential Revision: D59971446 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214 Approved by: https://github.com/angelayi	2024-07-24 02:42:18 +00:00
Animesh Jain	0ceaabaf71	[easy][inline-inbuilt-nn-modules] Update test (#131563 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131563 Approved by: https://github.com/mlazos ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480, #131512	2024-07-24 02:32:19 +00:00
Aaron Orenstein	0e780a7d69	[BE] Remove some mypy allow-untyped-decorators that are no longer needed (#131564 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131564 Approved by: https://github.com/oulgen	2024-07-24 02:00:08 +00:00
Jun Luo	abb313b466	[torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873 ) Summary: As title Test Plan: CI tests Reviewed By: joebos Differential Revision: D59036602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873 Approved by: https://github.com/hanzlfs	2024-07-24 01:52:21 +00:00
Junjie Wang (PyTorch)	aa1c78c7e9	[PTD][c10d][EZ] LOG error for nccl error rather than info (#131483 ) Summary: As title, when we get nccl exception we should log it as error not info. Test Plan: CI Reviewed By: csmodlin, rmiao Differential Revision: D60123773 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131483 Approved by: https://github.com/fegin	2024-07-24 01:08:00 +00:00
YangQun1	466c167b71	Fix py codegen to delete values that don't have any users (#131028 ) Fixes #131025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028 Approved by: https://github.com/ezyang	2024-07-24 01:03:56 +00:00
Nikita Shulga	14495ce288	[BE][MPS] Use `isOperatingSystemAtLeastVersion:` (#131513 ) Instead of trying to come up with different checks for classes responding to selectors Pull Request resolved: https://github.com/pytorch/pytorch/pull/131513 Approved by: https://github.com/atalman	2024-07-24 00:54:25 +00:00
Jiong Gong	76f7b3e560	[inductor][cpp][gemm] improve thread blocking heuristics (#131024 ) This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as squared as possible for better data reuse. Take the shape M=20000, N=64, K=128 as an example, the original heuristics couldn't use up all the threads when the number of threads is large, say 60: AUTOTUNE linear_unary(200000x128, 64x128, 64) _linear_pointwise 0.1010 ms 100.0% cpp_packed_gemm_0 0.8303 ms 12.2% 0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32) V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4) V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4) V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1) After this PR: AUTOTUNE linear_unary(200000x128, 64x128, 64) _linear_pointwise 0.1143 ms 100.0% cpp_packed_gemm_0 0.1228 ms 93.1% V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32) V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8) V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8) V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w	2024-07-24 00:36:29 +00:00
Shangdi Yu	fdc9a1404e	Remove _BLACK_LISTED_OPS (#131361 ) Summary: remove _BLACK_LISTED_OPS after https://github.com/pytorch/pytorch/pull/100749 Test Plan: contbuild & OSS CI Differential Revision: D60056130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131361 Approved by: https://github.com/angelayi	2024-07-24 00:15:27 +00:00
Nicolas Macchioni	2cf220956a	[inductor] fix CacheBase.get_system on AMD (#131365 ) Summary: CacheBase.get_system on AMD is missing device name and hip version, fix that Test Plan: on AMD: ``` buck run fbcode//mode/opt-amd-gpu scripts/nmacchioni/repros/amd_cache_key:repro {'device': {'name': 'gfx942:sramecc+:xnack-'}, 'version': {'triton': '3.0.006965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-4856d26164925fd955c779d8f67ecf47cc5754052b008714b3a580d708b13dd8-06965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-23d635e690d670bf61798e1259674b78c0ed5ba222ab6a455f329f27a758fc2d-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855-166fbf4e6f8845f354611638861a2a9e1dc2654224c278e10b566f09549dae7e-ccd93feaad4c82c8c1604557340de15fda0a3c84fe83f5a4d1e12a07a77bf3f4-cf28658fa328f7f283ec4e6ccc6c48d7c2a8ddbdf5134d3eb35c9b38ce4ace44-b9d80690b3109c2aaf5ece450d62e93b37eb6ab38552089794b3bb36e36a22b3-36130a37af1b19a0dec569aa08d30b00c74c8f02b6b632999d86dea169146792-4a620da64e0c263067f0dbf6c721f5214a5ac315625a07dd98520502ddf7e22f-6ace95666f6a4ecd2b1a7fc7ae865d1a9239608bd020cb6e4b8d15233c2dd9b3', 'hip': '6.0.32830'}, 'hash': 'c4db04316e15953dda8648f5a43a3f208f2c0ba454666cc7d78e40527aab85ec'} ``` on Nvidia: ``` buck run fbcode//mode/opt scripts/nmacchioni/repros/amd_cache_key:repro {'device': {'name': 'NVIDIA PG509-210'}, 'version': {'triton': '6de41ec76ecad84e618d692e6793a4ebe707ae68a0c033a222105daa72214d7c', 'cuda': '12.0.0'}, 'hash': 'b58d0aa37d80fc2932c1b7576ca876b77aa1258db1c14e27d1f201bd15376faf'} ``` Differential Revision: D60062972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131365 Approved by: https://github.com/eellison	2024-07-24 00:11:59 +00:00
rzou	480ae51f85	[pytree] Only import optree if it's used (#131478 ) torch.utils._pytree imports optree if it's available. Instead, we change it to import optree only if it gets used. The motivation for this is better isolation. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131478 Approved by: https://github.com/albanD	2024-07-24 00:10:49 +00:00
Animesh Jain	6850e42266	[dynamo][exception] Remove older specialization for StopIteration (#131512 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131512 Approved by: https://github.com/yanboliang ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480	2024-07-24 00:06:53 +00:00
Animesh Jain	e2b941a1b4	[dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480 ) Fixes https://github.com/pytorch/pytorch/issues/129667 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480 Approved by: https://github.com/williamwen42 ghstack dependencies: #131347, #131367, #131378, #131389, #131405	2024-07-24 00:06:53 +00:00
Anshul Sinha	e39f136c35	[debug][dtensor] implemented activation checkpointing differentiation (#130996 ) Summary While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below. Test Case torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996 Approved by: https://github.com/XilunWu ghstack dependencies: #131419	2024-07-23 23:44:56 +00:00
Anshul Sinha	7b375c3682	[dtensor][debug] changed which module tracker I inherited from to fix bug with activation checkpointing (#131419 ) Summary I switched the module tracker I had been inheriting from PyTorch’s all purpose one to the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then in the forward_pre_hook for the activation_checkpointing. Test Case pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419 Approved by: https://github.com/XilunWu	2024-07-23 23:44:56 +00:00
Yifu Wang	161c18ed0b	SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583 ) ```python # NOTE [low-contention collectives] # When a collective is overlapped with abundant compute, it makes sense to # prioritize reducing the contention between the collective and the overlapped # compute, even at the cost of a slightly slower collective. # # Common collective implementations (e.g., NCCL without user buffer # registration) optimize for throughput with no ambient compute. However, such # implementations may not be optimal when they are overlapped with compute: # - These implementations typically fuse the entire collective into a single # kernel and reserve SM resources based on the most demanding portion of the # collective, even when a large portion of the collective does not require this # much resource. # - These implementations often use SM-based P2P copy as opposed to copy # engine-based P2P copy. Copy engine-based P2P copy may not have a significant # advantage when there's no ambient compute. However, it may significantly # improve overall resource utilization in the presence of ambient compute. # # When overlapped with intensive compute (e.g., persistent matmul kernels), the # SM-usage of a collective can lead to inefficient overlapping. # # Low-contention collectives achieve their goals with the following strategies: # - Use copy engine-based copy whenever possible. # - Break down portions of a collective with different resource requirements # into multiple kernels. This improves the overlapping efficiency at the cost # of additional launching overhead. 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583 Approved by: https://github.com/weifengpy	2024-07-23 23:37:48 +00:00
Aaron Orenstein	1930698140	Fix fake tensor SymInt caching when there's a SymInt storage_offset (#131500 ) Test Plan: Internal unit tests failed before and succeeded after. Differential Revision: D60131273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131500 Approved by: https://github.com/clee2000	2024-07-23 23:37:04 +00:00
Will Feng	fc3d2b26cd	Use fake PG for test_compute_comm_reordering.py unit tests (#131415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415 Approved by: https://github.com/yifuwang	2024-07-23 22:53:23 +00:00
Shunting Zhang	980bb54361	[BE][Inductor] fix failures in test_padding.py (#131417 ) The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally. Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417 Approved by: https://github.com/eellison	2024-07-23 21:53:59 +00:00
Shunting Zhang	53f1f75061	[BE][Inductor] fix do_bench test (#131402 ) The test fails internally [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (this is a Meta-internal link). But we don't see the failure in OSS. It turns out that there are 2 issues: 1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes. 2. Fix the real issue that incompatible inputs are provided to `do_bench`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402 Approved by: https://github.com/eellison	2024-07-23 21:52:35 +00:00
Aaron Orenstein	5a0068cc69	[BE] mypy: disallow untyped decorators (#131428 ) Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations. Step 1 - Enable the error and override in all the offending files. #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428 Approved by: https://github.com/justinchuby, https://github.com/oulgen	2024-07-23 21:50:55 +00:00
Aaron Orenstein	e3ca4e79e1	Fix mypy errors introduced by #131400 (#131522 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131522 Approved by: https://github.com/zou3519, https://github.com/eellison	2024-07-23 21:25:21 +00:00
Zhengxu Chen	c9e74449f3	bump executorch commit pin. (#131486 ) Summary: as title. Target commit: `6153b1bf7b` Test Plan: CI Differential Revision: D60125590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131486 Approved by: https://github.com/huydhn	2024-07-23 21:25:07 +00:00
Nikita Shulga	8a890b72dc	[BE] Get rid of missing destructor override warning (#131204 ) Regression introduced by https://github.com/pytorch/pytorch/pull/126376 Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included ``` In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4: In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3: In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7: In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13: /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override] virtual ~HIPHooksInterface() = default; ^ /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here virtual ~AcceleratorHooksInterface() = default; ^ 1 warning generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204 Approved by: https://github.com/albanD, https://github.com/seemethere	2024-07-23 21:02:14 +00:00
Xu Zhao	4eee2e7a6d	[operator_benchmark] Remove TARGETS from broken benchmarks (#131460 ) Summary: Remove operator_benchmark caffe2 build due to the removal of caffe2: `2fd75667b4` Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain. Test Plan: Sandcastle CI Differential Revision: D60086216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460 Approved by: https://github.com/vmpuri	2024-07-23 20:06:08 +00:00
PyTorch MergeBot	8497930766	Revert "[CI][dashboard] Collect PT2 cpu perf nightly (#131369 )" This reverts commit 9851c7313d118517d21a112960044e0fdbf560b1. Reverted https://github.com/pytorch/pytorch/pull/131369 on behalf of https://github.com/atalman due to Sorry need to revert looks like , please run ciflow/inductor looks like this caused failure in [pytorch/pytorch/actions/runs/10058412015/job/27802257096](https://github.com/pytorch/pytorch/actions/runs/10058412015/job/27802257096) ([comment](https://github.com/pytorch/pytorch/pull/131369#issuecomment-2246142022))	2024-07-23 19:41:49 +00:00
PyTorch MergeBot	d4e3fd613c	Revert "[CI] Relax config name matching for cpu inductor tests (#131467 )" This reverts commit aa54bcb6d25fc7c9ac23b82b74ea45f03033c8b2. Reverted https://github.com/pytorch/pytorch/pull/131467 on behalf of https://github.com/atalman due to Sorry need to revert looks like https://github.com/pytorch/pytorch/pull/131369 broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/131467#issuecomment-2246136839))	2024-07-23 19:38:35 +00:00
Sergii Dymchenko	7b82ed2d59	Delete very old misleading info from .ci README (#131502 ) I think there is no way to salvage that by updating, so deleting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131502 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-07-23 19:27:36 +00:00
Michael Lazos	93fdd0237d	Ensure staticmethods can be allowed in graph (#130882 ) Fixes https://github.com/pytorch/pytorch/issues/124735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882 Approved by: https://github.com/anijain2305, https://github.com/williamwen42	2024-07-23 18:59:19 +00:00
jananisriram	faddb0f30c	[NestedTensor] Integrate the mean operator along the jagged dimension into NestedTensor (#131132 ) Summary: Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `` in a `(B, , M)` nested tensor. Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`. Test Plan: Verify that the new unit test passes via the following command: ``` buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean ``` ``` buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op ``` Differential Revision: D59654668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132 Approved by: https://github.com/davidberard98, https://github.com/jbschlosser	2024-07-23 18:48:34 +00:00
drisspg	120ca23a1f	Fix IMAs in Flash-Attention splitkv kernel (#131277 ) # Summary While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention. 1. Illegal global memory writes during the writing of softmax_lse_accum. This was pinpointed to the temporary lifetime of out_accum and softmax_lse_accum. These were likely getting their refcount dropped before the kernel launch that used them, allowing them to potentially get reused for other allocations. 2. After debugging this there were illegal writes in the combine kernel. I was able to pinpoint this to the write to the reduced LSE. From my understanding it was assuming that kBlockM evenly divided the global number of rows and wasn't masking out these writes. ### History My line of thinking for this: We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow-up kernel to do the reduction. Under ordinary non-roofline memory usage the CUDA caching allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references. On CI we often run near max memory usage. We change/add tests and suddenly we get close to the OOM threshold. The memory allocator will reap these segments and we get write-after-free errors. 
After that fix I did get further past the splitkv_flash kernel and then got the following error: ``` Shell ❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py softmax_lseaccum_ptr =0x7f5ebb208a00 oaccum_ptr =0x7f5ebb208c00 softmax_lse_ptr = 0x7f5ebb208800 ❯ ❯ head ima.txt -n 10 ========= COMPUTE-SANITIZER ========= Invalid __global__ write of size 4 bytes ========= at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630 ========= by thread (2,0,0) in block (0,0,0) ========= Address 0x7f5ebb208804 is out of bounds ========= and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes ``` Okay, I looked at the address and it looks like we are writing consecutive bytes past the softmax_lse_ptr from the combine func. I tried padding out the softmax_lse to q_padded and saw no more illegal memory errors on my repro: ``` ========= COMPUTE-SANITIZER ========= ERROR SUMMARY: 0 errors ``` Fixes https://github.com/pytorch/pytorch/issues/131240 Fixes https://github.com/pytorch/pytorch/issues/131227 Fixes https://github.com/pytorch/pytorch/issues/131221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277 Approved by: https://github.com/malfet	2024-07-23 18:26:49 +00:00
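The out-of-bounds write described in the commit above comes down to round-up arithmetic: a kernel that stores whole tiles without masking touches rows up to the next multiple of the tile height. A minimal sketch of that arithmetic follows; `round_up`, `seqlen_q` and `block_m` are illustrative names and values chosen here, not identifiers or sizes taken from the flash-attention sources.

```python
def round_up(n: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


# Hypothetical sizes: one query row, tiles of 16 rows.
seqlen_q = 1
block_m = 16

# Rows an unmasked, tile-granular store would touch.
rows_touched = round_up(seqlen_q, block_m)
# Rows that fall outside a buffer sized for exactly seqlen_q rows.
rows_out_of_bounds = rows_touched - seqlen_q

print(rows_touched, rows_out_of_bounds)  # 16 15
```

Either masking the store with `row < seqlen_q` or allocating `round_up(seqlen_q, block_m)` rows removes the overrun; the commit message reports taking the padding route.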
Adria Orenstein	f75d724482	Updating Types in torch/_dynamo/utils.py (#131001 ) Adds some type annotations to the torch/_dynamo/utils.py file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131001 Approved by: https://github.com/aorenste	2024-07-23 18:25:52 +00:00
Bin Bao	aa54bcb6d2	[CI] Relax config name matching for cpu inductor tests (#131467 ) Summary: Matching cpu instead of cpu_inductor should be sufficient. This fixes torchbench test failures in https://github.com/pytorch/pytorch/pull/131369. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131467 Approved by: https://github.com/zou3519	2024-07-23 18:24:29 +00:00
Avik Chaudhuri	94f22eb6b2	refactor post-trace fakification in strict (#131421 ) Summary: Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict). This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet). The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it. Test Plan: existing tests Differential Revision: D60084442 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421 Approved by: https://github.com/zhxchen17	2024-07-23 18:23:03 +00:00
Shangdi Yu	f85c35872b	Remove GraphModuleOpUpgrader in _export.serde.upgrade.py (#131373 ) Summary: Remove GraphModuleOpUpgrader in _export.serde.upgrade.py and the file Test Plan: contbuild & OSS CI Differential Revision: D60067937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131373 Approved by: https://github.com/angelayi	2024-07-23 18:09:44 +00:00
Matthias Braun	22906be8f0	Do not abort on SPARSE_STATUS_INVALID_VALUE (#130382 ) Summary: Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version which just filled the result tensor with `-NaN` instead of returning an error status. This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use. Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled` Differential Revision: D59542023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382 Approved by: https://github.com/malfet	2024-07-23 18:09:36 +00:00
Shangdi Yu	cfb9ccab6c	[export] Filter errors by exception type, add case name (#131327 ) Summary: - Log export errors to Scuba and mark them with "classified" and "unclassified" - Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute - Add `case_name` for some exceptions. Test Plan: Running the code below logs a classified error to `torch_export_usage` table in Scuba. ``` import torch from torch._export.db.case import SupportLevel class TorchSymMin(torch.nn.Module): """ torch.sym_min operator is not supported in export. """ def forward(self, x): return x.sum() + torch.sym_min(x.size(0), 100) example_args = (torch.randn(3, 2),) tags = {"torch.operator"} support_level = SupportLevel.NOT_SUPPORTED_YET model = TorchSymMin() torch.export.export(model, example_args) ``` Differential Revision: D59981459 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327 Approved by: https://github.com/zhxchen17	2024-07-23 18:01:13 +00:00
PyTorch MergeBot	6b8ec2b371	Revert "[triton_op] fix autotuning (#131363 )" This reverts commit 154f27455a62314dfb689f1fe13c0cfd52490339. Reverted https://github.com/pytorch/pytorch/pull/131363 on behalf of https://github.com/ZainRizvi due to This was a tricky one, but looking at the code it's the change to torch/fx/node.py that triggered the type violation errors. Reverting since this is now breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/131363#issuecomment-2245899858))	2024-07-23 18:01:09 +00:00
Wang, Eikan	3fe72e0c2e	[4/N] Non-Tensor: Support layout, device and dtype for aten operations (#125897 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125897 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-23 17:50:17 +00:00
Shangdi Yu	68c725a094	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 17:48:38 +00:00
Feng Shi	404d640c39	[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 ) Summary: A ComboKernel combines independent Inductor Triton kernels into a single one. Consolidation with Foreach kernel: 1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode 2) The backend kernel is consolidated into ComboKernel. (Note: this is part 1 which only deals with the 1st case above.) Details: 1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels). 2) it supports multiple kernel types, like pointwise, reduce, and may extend to matmul as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such cases) 3) the blocks are interleaved among the sub-kernels (can be extended to other arrangements), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for it. 2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. The front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py 3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what the optimal number is). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. 
For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps. 4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any gain and can even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True. Example: - element wise kernels original Pytorch function: ``` def test_activations(a, b, c): a1 = torch.nn.functional.relu(a) b1 = torch.nn.functional.sigmoid(b) c1 = torch.nn.functional.tanh(c) return a1, b1, c1 ``` combokernel ``` triton_heuristics.pointwise( size_hints=[512], tile_hint=TileHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'fp32', 5: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []} ) triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = triton_helpers.maximum(0, tmp0) tl.store(out_ptr0 + (x0), tmp1, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 400 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x1 = xindex tmp2 = tl.load(in_ptr1 + (x1), xmask) tmp3 = tl.sigmoid(tmp2) tl.store(out_ptr1 + (x1), tmp3, xmask) elif pid % 3 == 2: 
pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x2 = xindex tmp4 = tl.load(in_ptr2 + (x2), xmask) tmp5 = libdevice.tanh(tmp4) tl.store(out_ptr2 + (x2), tmp5, xmask) else: pass ``` - reduction kernels Original Pytorch function: ``` def test_reduce(a, b, c): a1 = torch.sum(a, dim=0) b1 = torch.max(b, dim=0) c1 = torch.min(c, dim=0) return a1, b1, c1 ``` Generated combokernel: ``` triton_heuristics.persistent_reduction( size_hints=[32, 32], reduction_hint=ReductionHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'i64', 5: 'fp32', 6: 'i64', 7: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []} ) triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 20 rnumel = 20 RBLOCK_0: tl.constexpr = 32 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_0)[None, :] roffset = 0 rmask = rindex < rnumel r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0) tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0]) tmp3 = tl.where(rmask & xmask, tmp1, float("-inf")) tmp4 = triton_helpers.max2(tmp3, 1)[:, None] tmp6 = tl.broadcast_to(rindex, tmp3.shape) _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1) tmp5 = tmp5_tmp[:, None] tl.store(out_ptr0 + (x0), tmp4, xmask) tl.store(out_ptr1 + (x0), tmp5, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_1: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = 
tl.arange(0, RBLOCK_1)[None, :] roffset = 0 rmask = rindex < rnumel r3 = rindex x2 = xindex tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0) tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1]) tmp10 = tl.where(rmask & xmask, tmp8, float("inf")) tmp11 = triton_helpers.min2(tmp10, 1)[:, None] tmp13 = tl.broadcast_to(rindex, tmp10.shape) _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1) tmp12 = tmp12_tmp[:, None] tl.store(out_ptr2 + (x2), tmp11, xmask) tl.store(out_ptr3 + (x2), tmp12, xmask) elif pid % 3 == 2: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_2: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_2)[None, :] roffset = 0 rmask = rindex < rnumel r5 = rindex x4 = xindex tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0) tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2]) tmp17 = tl.where(rmask & xmask, tmp15, 0) tmp18 = tl.sum(tmp17, 1)[:, None] tl.store(out_ptr4 + (x4), tmp18, xmask) else: pass ``` Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes. Test Plan: ``` buck2 test mode/dev-nosan caffe2/test/inductor:foreach ``` ``` buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels ``` Differential Revision: D54134695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969 Approved by: https://github.com/mlazos	2024-07-23 17:34:28 +00:00
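The interleaving and masking in the generated kernels above can be imitated on the CPU in plain Python, which may make the `pid % 3` / `pid // 3` pattern easier to follow. This is only a sketch: `run_combo` is a hypothetical helper written for this illustration, and sizing the grid by the largest sub-kernel is an assumption made here, not something the commit message states.

```python
import math


def run_combo(sub_kernels, xblock):
    """Simulate one combo launch.

    sub_kernels: list of (elementwise_fn, input_list, output_list).
    """
    n = len(sub_kernels)
    # Assumption: the grid covers the sub-kernel that needs the most blocks.
    max_blocks = max(math.ceil(len(inp) / xblock) for _, inp, _ in sub_kernels)
    for pid in range(n * max_blocks):
        fn, inp, out = sub_kernels[pid % n]  # pid % n selects the sub-kernel
        xoffset = (pid // n) * xblock        # pid // n is the block inside it
        for xindex in range(xoffset, xoffset + xblock):
            if xindex < len(inp):            # mask: skip lanes past the tensor end
                out[xindex] = fn(inp[xindex])


# Two tensors of different sizes share one launch.
a, b = [-1.0] * 100, [0.0] * 400
a_out, b_out = [None] * 100, [None] * 400
run_combo(
    [
        (lambda v: max(0.0, v), a, a_out),                 # relu
        (lambda v: 1.0 / (1.0 + math.exp(-v)), b, b_out),  # sigmoid
    ],
    xblock=128,
)
```

With `xblock=128` the smaller tensor needs one block and the larger needs four, so three of the program ids assigned to the first sub-kernel do no work; the mask is what keeps them from writing out of range.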
Yueming Hao	979429ca89	[inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883 ) Fixes #126338 ## Issue Summary When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only change the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks. See more discussion in #126338. This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations. ## Fix The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it falls back properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation. When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation. 
## Example ```python @torch.compile def fn(x, y): x = x.view(torch.float16) y = y.view(torch.float16) + 1 return x @ y x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16) y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16) fn(x, y) ``` The output code generated before this fix is like the following. ```python triton_poi_fused_add_view_0... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32) tl.store(out_ptr0 + (x0), tmp1, xmask) triton_poi_fused_add_view_1... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32) tmp2 = 1.0 tmp3 = tmp1 + tmp2 tl.store(out_ptr0 + (x0), tmp3, xmask) def call(args): ... triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0) del arg0_1 buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view] triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0) del arg1_1 buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view] extern_kernels.mm(buf0, buf1, out=buf2) ``` As you can see, the two `view` operations are compiled to two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion. The main issue is that the first `view` operation didn't do anything to the actual data. 
But it generates a triton kernel with a new output tensor. Another small issue is that this triton kernel can't be compiled because `bitcast=True` only supports type conversion between types of the same bit width. The following is the output code generated after this PR. ```python triton_poi_fused_add_0... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32) tmp2 = 1.0 tmp3 = tmp1 + tmp2 tl.store(out_ptr0 + (x0), tmp3, xmask) def call(args): ... triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0) del arg1_1 buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm] extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1) ``` The first `view` operation has been replaced with `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too. The following two code snippets are for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load is upcast to float32, then downcast to its original dtype to ensure values with the right precision are used. `7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)` `7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)` Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire . Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883 Approved by: https://github.com/eellison	2024-07-23 17:31:39 +00:00
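The commit above rests on the idea that a same-bit-width dtype view only reinterprets existing bytes and never copies them. That semantics can be shown without PyTorch using the standard library's `memoryview.cast`, which is used here purely as a stand-in for `aten.view.dtype` and is not how Inductor implements `DtypeView`.

```python
from array import array

# Two unsigned 16-bit words in one buffer.
buf = array("H", [0x8000, 0x0001])

# Reinterpret the same bytes as signed 16-bit values. memoryview.cast only
# converts to or from a byte format, hence the intermediate "B" step.
signed = memoryview(buf).cast("B").cast("h")

print(signed[0])    # -32768: same bits as 0x8000, read as signed
signed[1] = -1      # write through the view
print(hex(buf[1]))  # 0xffff: the original buffer changed, so nothing was copied
```

An out-of-place kernel, by contrast, would leave `buf` untouched and hand back a second buffer, which is the extra allocation the commit removes for the unfused case.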
Oguz Ulgen	f93a6a4d31	Add mypy typing to torch_version.py (#131447 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131447 Approved by: https://github.com/angelayi ghstack dependencies: #131434	2024-07-23 17:31:07 +00:00
Animesh Jain	eab1595ce2	[dynamo] Delete wrong assertion in bind_args (#131405 ) Fix - https://github.com/pytorch/pytorch/issues/130537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131405 Approved by: https://github.com/williamwen42, https://github.com/yanboliang ghstack dependencies: #131347, #131367, #131378, #131389	2024-07-23 17:28:05 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
Zain Rizvi	f7754c6dc5	Run pull jobs with new AMI (#131250 ) Migrate all pull jobs to the new Amazon 2023 AMI runner type. Exceptions: - Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately. - Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI Revert plan, in case something starts breaking when we run these new AMIs at a larger scale: - If specific jobs start failing consistently, we bring those jobs back to the old AMI - If the failure is more widespread, revert this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250 Approved by: https://github.com/malfet, https://github.com/atalman	2024-07-23 17:17:12 +00:00
PyTorch MergeBot	5f0b65bee7	Revert "Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842 )" This reverts commit d33804f8b6e2ea38f8446826a16be13ce4f9b71e. Reverted https://github.com/pytorch/pytorch/pull/130842 on behalf of https://github.com/clee2000 due to breaking some builds internally D60085710, Im not sure what the logs mean but I think its something about build size ([comment](https://github.com/pytorch/pytorch/pull/130842#issuecomment-2245799309))	2024-07-23 17:15:06 +00:00
Oguz Ulgen	4ca8705035	Add mypy typing to fx node (#131434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434 Approved by: https://github.com/zou3519	2024-07-23 17:00:31 +00:00
Sam Larsen	ded5bdb0de	Use inductor TestCase for test_replicate_with_compiler.py (#131053 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` torch.compiles. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py` Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053 Approved by: https://github.com/eellison	2024-07-23 16:59:55 +00:00
Nikita Shulga	a5ad02d05d	Remove MacOS M2 14 runner from MacMPS job (#131465 ) As it's been dead for 2+ weeks and causing queuing issues <img width="760" alt="image" src="https://github.com/user-attachments/assets/4e806cae-3a67-4acb-b84f-1a9131d2a859"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131465 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-07-23 16:51:42 +00:00
Sherlock Huang	c1ef214046	Print ExportedProgram without color by default (#131399 ) Summary: Without plugin, colored ExportedProgram is not really readable. ![image](https://github.com/user-attachments/assets/319920a9-bb4b-4ad2-bcac-0c4f76973b11) Test Plan: CI Differential Revision: D60074481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131399 Approved by: https://github.com/angelayi	2024-07-23 16:41:55 +00:00
Michael Lazos	db376fb643	Ensure non-contiguous indices are handled (#131430 ) The unaligned inputs checker was built on the assumption that static indices form a contiguous range (i.e. 0, 1, 2), but the new changes with nn module inlining break this assumption. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131430 Approved by: https://github.com/anijain2305	2024-07-23 16:37:55 +00:00
Oguz Ulgen	4f0497c747	Divorce triton and pt2 remote caching (#131345 ) Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other. Differential Revision: D60047752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345 Approved by: https://github.com/aorenste	2024-07-23 16:28:12 +00:00
rzou	154f27455a	[triton_op] fix autotuning (#131363 ) The problem was we were shoving SymInts into the constant_args side table. The root problem is that torch.fx.node.base_types, which we use to determine what can be put in the graph, doesn't actually have SymInt in it. This PR fixes base_types to include SymInt. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363 Approved by: https://github.com/oulgen	2024-07-23 16:15:00 +00:00
Zhengxu Chen	3aa45cae77	[export] Removed deprecated dialect field from EP schema. [2/2] (#131344 ) Summary: Not landable until we've updated the pin of executorch. Test Plan: CI Reviewed By: SherlockNoMad Differential Revision: D59759620 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131344 Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4	2024-07-23 16:05:10 +00:00
Teja	b61600f6cc	[pytorch] fix the leak for pinned memory when using _create_cpu_state… (#131270 ) When pin_memory and share_memory both are set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So, once tensor is created and freed, when a new tensor is created the caching allocator is allocating the same memory. This fails with below error. ``` obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None ``` This PR fixes this by unregistering this memory on tensor free by attaching a hook. This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270 Approved by: https://github.com/LucasLLC	2024-07-23 15:47:21 +00:00
PyTorch MergeBot	1e86387871	Revert "Support IPC for Expandable Segments (#130890 )" This reverts commit 32c2f84e349ad6e34b8559d3f1f9c27020ae702f. Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to variable shadowing broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2245456085))	2024-07-23 14:46:28 +00:00
chuanqiw	f064dac588	[CI] change xpu ci build runner type to reduce build time (#130922 ) The current XPU build sometimes needs 2+ hours; change the build runner to `linux.12xlarge` to reduce build time. Works for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130922 Approved by: https://github.com/atalman	2024-07-23 14:45:30 +00:00
Animesh Jain	6bbef2a06b	[dynamo] Support set on KeysView (#131389 ) Fixes https://github.com/pytorch/pytorch/issues/129664 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131389 Approved by: https://github.com/mlazos ghstack dependencies: #131347, #131367, #131378	2024-07-23 14:15:26 +00:00
Animesh Jain	e7c5e06772	[dynamo] Support __contains__ on __dict__ on UserDefinedClassVariable (#131378 ) Fixes https://github.com/pytorch/pytorch/issues/129665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131378 Approved by: https://github.com/mlazos ghstack dependencies: #131347, #131367	2024-07-23 14:15:26 +00:00
Animesh Jain	0bc5e26067	[dynamo] Support dict conversion of objects derived from MutableMapping (#131367 ) Fixes - https://github.com/pytorch/pytorch/issues/129662 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131367 Approved by: https://github.com/williamwen42 ghstack dependencies: #131347	2024-07-23 14:15:20 +00:00
Animesh Jain	a944cce5b8	[dynamo] Support if callable on list (#131347 ) Fixes https://github.com/pytorch/pytorch/issues/130720 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131347 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-07-23 14:15:15 +00:00
Nikita Shulga	250cdb2ac7	Fix cuda_half_test.cu (#131416 ) $atanh(1.0)$ is $\infty$ (see https://www.mathworks.com/help/matlab/ref/atanh.html ) and the difference between two infinities is NaN, which is neither greater than, less than, nor equal to any reasonable threshold. Fix the test by checking that atanh of .5 is equal for float and half and that atanh of 1.0 equals infinity. Fixes https://github.com/pytorch/pytorch/issues/131401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131416 Approved by: https://github.com/atalman, https://github.com/albanD	2024-07-23 14:10:20 +00:00
rzou	4ac77fc6bd	[HOP] Don't send HOPs to torch_dispatch (#131370 ) I regretted the decision in https://github.com/pytorch/pytorch/pull/130606. Most user torch_dispatchs don't have enough to actually handle the HOP correctly, so for now I'd prefer that users explicitly define the interaction between the HOP and their torch_dispatch class. An example is FlopCounterMode: if we allow HOPs to get passed to it, it will ignore auto_functionalized(mm) by default but it will record flops for mm, which is weird. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370 Approved by: https://github.com/ydwu4	2024-07-23 13:41:08 +00:00
Xinran / Allan Rui	027f35d9e6	[Inductor] Allow customize decompositions for fwd_only trace function (#131329 ) Summary: Inductor will aggressively try to decompose and lower ops into a smaller opset. However, this sometimes does not align with kernel coverage (or perf preference) on different backends (e.g. Inductor will decompose Gelu into primitive ops, but certain backends already have a Gelu op). Therefore, we need a mechanism to allow customization of decomp for the trace function so that Inductor will simply pass this op through. Reviewers: @eellison Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329 Approved by: https://github.com/eellison	2024-07-23 13:10:48 +00:00
albanD	eb146b10db	Only depend on sympy 1.12 for conda (no 3.13 there anyways) (#131355 ) Fixing nightly after https://github.com/pytorch/pytorch/pull/130895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131355 Approved by: https://github.com/atalman	2024-07-23 12:19:58 +00:00
Bin Bao	9851c7313d	[CI][dashboard] Collect PT2 cpu perf nightly (#131369 ) Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131369 Approved by: https://github.com/huydhn	2024-07-23 11:55:39 +00:00
Charlie West-Taylor	3f3b226ffc	Fixes for the extension backend tests (#130933 ) There were some miscellaneous issues I found: * The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them. * A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933 Approved by: https://github.com/eellison, https://github.com/jansel	2024-07-23 10:46:32 +00:00
Yang Chen	d8e2e1fe50	[aoti] use reshape instead of view for flattening tensors for the nan checker (#131302 ) For some non-contiguous tensors, tensor.view would trigger the following runtime error: "RuntimeError: view size is not compatible with input tensor’s size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(…) instead" So, let's use reshape instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302 Approved by: https://github.com/muchulee8, https://github.com/desertfire	2024-07-23 10:15:28 +00:00
Tom Ritchford	16247987a1	Add decomposition for t_copy (#130939 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939 Approved by: https://github.com/peterbell10	2024-07-23 08:29:19 +00:00
eellison	16a2a1aad3	Annotate graph.py (#131400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400 Approved by: https://github.com/shunting314	2024-07-23 07:04:12 +00:00
Joona Havukainen	102d8e5a63	MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038 ) The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization of the MPSGraph reshape operation in MacOS 14_4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2. Solves silentness of issue #125803. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038 Approved by: https://github.com/malfet	2024-07-23 06:30:40 +00:00
Shangdi Yu	29e2e2afb6	Revert D59561509: Multisect successfully blamed "D59561509: [FX][export] DCE pass, check schema for node impurity (#130395 )" for one test failure (#131341 ) Summary: This diff reverts D59561509 D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure: Tests affected: - [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/) Here's the Multisect link: https://www.internalfb.com/multisect/6533402 Here are the tasks that are relevant to this breakage: T191383430: 10+ tests unhealthy for ads_mtia_inference The backout may land if someone accepts it. If this diff has been generated in error, you can Commandeer and Abandon it. Test Plan: NA Differential Revision: D60029318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341 Approved by: https://github.com/angelayi	2024-07-23 05:23:47 +00:00
Brian Hirsh	b2ad16f01d	avoid OpOverloadPacket.__getattr__ calls in inductor lowering (#131348 ) We have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)`, which can be avoided (hasattr will go through getattr, which, for OpOverloadPacket, will do a lookup in the dispatch table for all overload names of the packet). Test Plan: CI Differential Revision: D60048270 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348 Approved by: https://github.com/davidberard98	2024-07-23 04:30:04 +00:00
Li-Huai (Allan) Lin	99d9b369f4	[Optim] Support tensor lr for all optimizers and check it is 1-element (#131065 ) Fixes: #130980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065 Approved by: https://github.com/janeyx99	2024-07-23 04:27:05 +00:00
nikonikolov	781189f25d	Add `nvjitlink` to the list of loadable global deps (#131295 ) To fix the cusparse dependency resolution in CUDA-12.x, that has nvJitLink dependency: ``` $ ldd -r /usr/local/cuda-11.8/lib64/libcusparse.so.11.7.5.86 linux-vdso.so.1 (0x00007ffea6f51000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb13306f000) librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb133065000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb13305f000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb132f10000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb132eeb000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb132cf7000) /lib64/ld-linux-x86-64.so.2 (0x00007fb143db7000) $ ldd -r /usr/local/cuda-12.1/lib64/libcusparse.so.12.1.0.106 linux-vdso.so.1 (0x00007ffc41909000) libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f3916b38000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3916aea000) librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3916ae0000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3916ada000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f391698b000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3916964000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3916772000) /lib64/ld-linux-x86-64.so.2 (0x00007f3929a8c000) ``` Fixes #131284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131295 Approved by: https://github.com/malfet	2024-07-23 04:26:33 +00:00
Nikita Shulga	02cd4dbcf4	[BE][CI] Get rid of duplicated code (#131406 ) Followup after https://github.com/pytorch/pytorch/pull/131061 Define `run_if_exists` function that runs cpp test if it exists and prints a warning otherwise. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-07-23 04:01:13 +00:00
Wanchao Liang	35a0e0f018	[tp] improve SequenceParallel and its documentation (#131346 ) The SequenceParallel style assumes the input torch.Tensor is ALREADY sharded on the sequence dimension if a DTensor is not passed in. Since the documentation causes some user confusion, this PR: 1. for the case where the input passed in is already a DTensor, checks the input placements and redistributes if it's not sharded on the sequence dimension 2. updates the doc to make it more explicit about the cases where the user passes in a torch.Tensor and a DTensor. This would fix https://github.com/pytorch/pytorch/issues/129355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346 Approved by: https://github.com/awgu	2024-07-23 03:57:01 +00:00
Wanchao Liang	12434504a2	[c10d] remove non-necessary tests (#131212 ) As titled: comm tensor is not being actively used, since we adopted functional collectives as our collective tracing approach. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212 Approved by: https://github.com/XilunWu	2024-07-23 03:48:55 +00:00
zengxian	8a591da3e7	[CI] Enable AOT inductor in cpu performance smoke test (#130097 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130097 Approved by: https://github.com/chuanqi129, https://github.com/desertfire	2024-07-23 03:44:13 +00:00
PyTorch MergeBot	6cbb1437c1	Revert "Add sparse block to flex_decoding kernel (#130884 )" This reverts commit 0bf59db6cc076468f44197f0d7ee41f6204c47c2. Reverted https://github.com/pytorch/pytorch/pull/130884 on behalf of https://github.com/atalman due to Sorry reverting test_causal_full_mask_vs_sdpa constantly failing on trunk ([comment](https://github.com/pytorch/pytorch/pull/130884#issuecomment-2244113663))	2024-07-23 02:10:14 +00:00
Atul Jangra	28b0ad4f46	[PT2] Minor fix in signpost (#131332 ) Summary: compile_id is a named Tuple. We want to log signposts. Test Plan: Run e2e job. Confirm this shows up correctly. Differential Revision: D60045020 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131332 Approved by: https://github.com/oulgen	2024-07-23 01:56:00 +00:00
PyTorch MergeBot	b435d84261	Revert "[custom ops] Add register_vmap for custom ops (#130589 )" This reverts commit 074b42064195c45471912f851e94c753992a9a1f. Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))	2024-07-23 01:44:44 +00:00
wizzniu	8963623494	Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 ) This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to the implementations of CUDA or other specific backends. Note: New backends that want to implement and use pin memory just need to inherit AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods. Additional context: To avoid breaking BC, this PR preserves the `device` arg of related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted, based on this PR, to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg. In the future, the `device` arg will actually be removed. Relates #124908 Relates #14560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376 Approved by: https://github.com/albanD	2024-07-23 01:44:15 +00:00
Shangdi Yu	074b420641	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - Test operators in op_db with `tests/test_vmap`. - Change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 00:54:52 +00:00
Avik Chaudhuri	1e5ecc4277	move save/load from _export to export (#131353 ) Test Plan: existing tests Differential Revision: D60053905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131353 Approved by: https://github.com/angelayi	2024-07-23 00:48:28 +00:00
angelayi	26f7dd286b	[export] Allow non-CIA ops to be preserved (#131075 ) I feel like the semantics of `run_decompositions(preserve_ops,...)` should be that we should always preserve whatever operator is put into `preserve_ops`, even if it's not CIA? Pull Request resolved: https://github.com/pytorch/pytorch/pull/131075 Approved by: https://github.com/bdhirsh	2024-07-23 00:41:48 +00:00
Jeff Daily	69b1999586	TunableOp size hotfix (#130800 ) Fixes #130727. GetSize calculation was incorrect for strided batched gemm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800 Approved by: https://github.com/xw285cornell	2024-07-22 23:42:26 +00:00
Thomas Ortner	8ae1963a61	[Autograd] Cond Higher-Order Operation (#126911 ) This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007) @ydwu4 I tried to incorporate your requests already. Currently there are two problems that I struggle with solving: 1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](`8a704035c9/torch/__init__.py (L1914-L1916)`). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond. 2. I am not entirely sure how to deal with the opinfo test in `hop_db.py` Co-authored-by: Yidi Wu <yidi@meta.com> Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911 Approved by: https://github.com/ydwu4	2024-07-22 23:18:19 +00:00
PyTorch MergeBot	c74396e890	Revert "[c10d] remove non-necessary tests (#131212 )" This reverts commit 0c074352ab62acba22265d8f19ea95851ae61d0f. Reverted https://github.com/pytorch/pytorch/pull/131212 on behalf of https://github.com/atalman due to sorry need to revert breaks OSS CI, module 'test_c10d_common' has no attribute 'CompilerTest' ([comment](https://github.com/pytorch/pytorch/pull/131212#issuecomment-2243961785))	2024-07-22 23:11:44 +00:00
PyTorch MergeBot	f8f41dcb24	Revert "[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832 )" This reverts commit deacc543f13067ab22e8fb2ab714a20dd60bb056. Reverted https://github.com/pytorch/pytorch/pull/130832 on behalf of https://github.com/atalman due to broke periodic test ([comment](https://github.com/pytorch/pytorch/pull/130832#issuecomment-2243894772))	2024-07-22 22:10:02 +00:00
PyTorch MergeBot	15eb10df02	Revert "[inductor] Use multiple outputs for flex-attention (#130833 )" This reverts commit 9df8ea1cf2d62bfe21b46188faea6ef2e29e5210. Reverted https://github.com/pytorch/pytorch/pull/130833 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130833#issuecomment-2243890944))	2024-07-22 22:07:06 +00:00
PyTorch MergeBot	f8875e8277	Revert "[inductor] Kill mark_node_as_mutating (#130834 )" This reverts commit 33f036a6f71b386d4ccb9a756ed892c144ec6a5f. Reverted https://github.com/pytorch/pytorch/pull/130834 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130834#issuecomment-2243886215))	2024-07-22 22:02:43 +00:00
Yifu Wang	d33804f8b6	Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130842 Approved by: https://github.com/fegin	2024-07-22 21:49:33 +00:00
Yifu Wang	a136a7d623	[Functional Collective] enable custom work registration from python (#130354 ) This PR does two things: - Allow tensor -> work registration in Python via `torch._C._distributed_c10d.register_work`. Calling `torch.ops._c10d_functional.wait_tensor` on a tensor would trigger `.wait()` on the registered work object. - Allow user-defined work object in Python to work with functional collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130354 Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wconstab	2024-07-22 21:45:19 +00:00
Catherine Lee	a3922acc06	[TD] More synonyms, new heuristic for test_public_bindings (#130397 ) test_public_bindings should be run on anything that changes the public API. We need to figure out in the future what is part of the public API; currently I'm using anything in torch/. flex_attention should be run on anything involving autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397 Approved by: https://github.com/malfet	2024-07-22 21:42:54 +00:00
joydddd	0bf59db6cc	Add sparse block to flex_decoding kernel (#130884 ) fix typo Finish flex_decoding block sparse Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884 Approved by: https://github.com/drisspg	2024-07-22 21:29:43 +00:00
Catherine Lee	83b355bad5	[aoti] forward fix of D60006838, add back test_multiple_output_alias (#131331 ) (#131356 ) Summary: Forward fix of D60006838. The unit test test_multiple_output_alias passed in OSS CI, but failing internally. So adding it back to skip list. Test Plan: ci Differential Revision: D60044926 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131356 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-07-22 20:17:21 +00:00
Xu Zhao	e3eaa22126	[torchbench][multisect] Run accuracy check at Diff time (#131266 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2388 We can enable accuracy checks at Diff time since it is not a performance metric. * Refactor the existing diff time test to use the new PT2 Benchmark Runner. * Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection. Test Plan: Sandcastle CI Or buck test command: ``` buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429 Reviewed By: oulgen Differential Revision: D59825601 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266 Approved by: https://github.com/oulgen	2024-07-22 20:14:28 +00:00
Wanchao Liang	0c074352ab	[c10d] remove non-necessary tests (#131212 ) As titled: comm tensor is not being actively used, since we adopted functional collectives as our collective tracing approach. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212 Approved by: https://github.com/XilunWu	2024-07-22 19:52:44 +00:00
Thanh Ha	781a33f5d8	Enable dynamic rollout for Linux trunk workflows (#131325 ) Enables dynamic migration of jobs to the LF AWS account for the Linux trunk workflow. The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132 This closes pytorch/ci-infra#250. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131325 Approved by: https://github.com/ZainRizvi	2024-07-22 19:43:24 +00:00
Shuqiang Zhang	406f510f89	[c10d] add bfloat16 support for NAN check (#131131 ) Summary: Need another dispatcher macro to support more data types. Test Plan: (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (86fcae11)]$ python test/distributed/test_c10d_nccl.py -k test_nan_assert_bfloat16 /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18: checkForNaN: block: [0,0,0], thread: [85,0,0] Assertion `!isnan(data[i])` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18: checkForNaN: block: [0,0,0], thread: [18,0,0] Assertion `!isnan(data[i])` failed. NCCL version 2.21.5+cuda12.0 devgpu009:1193787:1193787 [0] init.cc:1773 NCCL WARN Cuda failure 'device-side assert triggered' . ---------------------------------------------------------------------- Ran 1 test in 9.416s OK Pull Request resolved: https://github.com/pytorch/pytorch/pull/131131 Approved by: https://github.com/wconstab, https://github.com/XilunWu	2024-07-22 19:41:19 +00:00
Xiaodong Wang	9e753d1f20	[AMD] catch exception when other processes belong to other users (#131018 ) Summary: It is a long-known pain point that if other users are running things, the call of `torch.cuda.memory.list_gpu_processes()` will error out: ``` torch.cuda.memory.list_gpu_processes() File "torch/cuda/memory.py", line 647, in list_gpu_processes procs = amdsmi.amdsmi_get_gpu_process_list(handle) # type: ignore[attr-defined] File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list _check_res( File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res raise AmdSmiLibraryException(ret_code) amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code: 10 | AMDSMI_STATUS_NO_PERM - Permission Denied ``` So just catch this error. Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails Differential Revision: D59901053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018 Approved by: https://github.com/eqy, https://github.com/clee2000	2024-07-22 19:38:51 +00:00
Andrew Gu	23ae6e2eb3	[FSDP2] Removed state dict error for HSDP (#131320 ) Fixes https://github.com/pytorch/torchtitan/issues/441#issuecomment-2241288906. This PR avoids raising the 2D state dict error for HSDP, which does not depend on strided sharding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131320 Approved by: https://github.com/wanchaol, https://github.com/weifengpy	2024-07-22 19:23:17 +00:00
Mikayla Gawarecki	d3556786b8	Blocklist certain modules for weights_only load (#131259 ) Also bold certain text in the error message as suggested. With a GLOBAL like `os.execv` the error message is now as such ```python File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Trying to load unsupported GLOBAL posix.execv whose module posix is blocked. Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259 Approved by: https://github.com/malfet, https://github.com/albanD	2024-07-22 18:23:21 +00:00
William Wen	93ef2e53f8	[3.13, dynamo] support FORMAT_SIMPLE/FORMAT_SPEC (#130751 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130751 Approved by: https://github.com/Skylion007 ghstack dependencies: #130566, #130567, #130568, #130569	2024-07-22 18:07:40 +00:00
William Wen	375a4d7e9e	[3.13, dynamo] decompose fused load/store instructions (#130569 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130569 Approved by: https://github.com/jansel ghstack dependencies: #130566, #130567, #130568	2024-07-22 18:07:40 +00:00
William Wen	157f38bc4d	[3.13, dynamo] support STORE_FAST_LOAD_FAST (#130568 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130568 Approved by: https://github.com/jansel ghstack dependencies: #130566, #130567	2024-07-22 18:07:35 +00:00
William Wen	1e116c7a1e	[3.13, dynamo] fix END_FOR (#130567 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130567 Approved by: https://github.com/jansel ghstack dependencies: #130566	2024-07-22 18:07:32 +00:00
William Wen	4319147ca9	[3.13, dynamo] fix closures, MAKE_FUNCTION, LOAD_CLOSURE; support SET_FUNCTION_ATTRIBUTE (#130566 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130566 Approved by: https://github.com/jansel	2024-07-22 18:07:28 +00:00
PyTorch MergeBot	44e689d947	Revert "[TD] More synonyms, new heuristic for test_public_bindings (#130397 )" This reverts commit d8a35d57220cdd5ed2fe52c02bb1f78cc0b3c75b. Reverted https://github.com/pytorch/pytorch/pull/130397 on behalf of https://github.com/clee2000 due to broke lint, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130397#issuecomment-2243518651))	2024-07-22 18:03:22 +00:00
Xiaodong Wang	56bb047449	[pt2] Increase dynamo/inductor default log level to info (#131311 ) Summary: Keep the logs from being too verbose. Test Plan: CI Differential Revision: D60028647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131311 Approved by: https://github.com/oulgen	2024-07-22 17:33:29 +00:00
Catherine Lee	d8a35d5722	[TD] More synonyms, new heuristic for test_public_bindings (#130397 ) test_public_bindings should be run on anything that changes the public API. We need to figure out in the future what is part of the public API; currently I'm using anything in torch/. flex_attention should be run on anything involving autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397 Approved by: https://github.com/malfet	2024-07-22 17:06:00 +00:00
PyTorch MergeBot	b9912f31ef	Revert "[export] fix zero arg export in training_ir (#130990 )" This reverts commit 50436d5bdb5d2e29307a0c0bcfcce8d7e2da82c0. Reverted https://github.com/pytorch/pytorch/pull/130990 on behalf of https://github.com/clee2000 due to failing some executorch and torchrec tests internally D60006710 ([comment](https://github.com/pytorch/pytorch/pull/130990#issuecomment-2243395316))	2024-07-22 16:49:25 +00:00
zdevito	32c2f84e34	Support IPC for Expandable Segments (#130890 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890 Approved by: https://github.com/dsjohns2 ghstack dependencies: #130888, #130889	2024-07-22 16:15:01 +00:00
Henry Tsang	0246b28510	[aoti] refactor aoti_torch__scaled_mm and skip aoti fp8 test for some cases (#130868 ) Continuing https://github.com/pytorch/pytorch/pull/128683 and https://github.com/pytorch/pytorch/pull/130582. The api of _scaled_mm has changed. For example, there is only one return now. So change the aoti api as well. Also, tested the fp8 tests offline. The test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping them for now. The reason this wasn't known earlier is probably because the CI doesn't use H100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130868 Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire	2024-07-22 15:24:20 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
Miguel Perez	5c78581fc9	Fix documentation for tensor.repeat. (#131195 ) Fixes #130930. Adjusts the documentation which used `sizes` instead of `repeats`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131195 Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer	2024-07-22 14:48:18 +00:00
PyTorch MergeBot	26383a6cc0	Revert "Added and_masks and or_masks utilities (#131073 )" This reverts commit 92bb323d36adca097c44a2fc8d9f0d574214d801. Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))	2024-07-22 13:44:55 +00:00
Thanh Ha	3eb9fa5d58	Add support for using LF Canary runners (#131188 ) The script is updated such that if a canary build is detected and the label_type is LF runner it will run on an LF Canary runner. Closes pytorch/ci-infra#245. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131188 Approved by: https://github.com/ZainRizvi	2024-07-22 13:26:46 +00:00
eqy	69e2590490	Fix MKLDNN check in `test_aot_inductor.py` (#130982 ) `torch.ops.mkldnn._is_mkldnn_bf16_supported()` assumes MKLDNN is on the system which isn't the case for e.g., some ARM system configurations CC @tinglvv @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/130982 Approved by: https://github.com/malfet	2024-07-22 11:58:18 +00:00
chilli	92bb323d36	Added and_masks and or_masks utilities (#131073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073 Approved by: https://github.com/drisspg ghstack dependencies: #130871, #130904	2024-07-22 11:48:03 +00:00
PyTorch UpdateBot	68df24f9b6	[xla hash update] update the pinned xla hash (#126672 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126672 Approved by: https://github.com/pytorchbot	2024-07-22 11:35:36 +00:00
Wang, Eikan	6d65a2c3f4	[3/N] Non-Tensor: Support string parameter for aten operations (#125831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-07-22 09:42:35 +00:00
xinan.lin	8da19fec60	[Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849 Approved by: https://github.com/peterbell10, https://github.com/EikanWang	2024-07-22 05:59:03 +00:00
albanD	2820e1d9f8	Update CPython support policy (#130989 ) Update as specified in the RFC that was accepted: https://github.com/pytorch/rfcs/blob/master/RFC-0038-cpython-support.md Pull Request resolved: https://github.com/pytorch/pytorch/pull/130989 Approved by: https://github.com/seemethere	2024-07-22 05:29:07 +00:00
Florian	1614891946	[Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733 ) Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733 Approved by: https://github.com/aaronenyeshi	2024-07-22 04:35:21 +00:00
Nikita Shulga	c2425a3b57	[BE] Use `_linux-build.yml` instead of `-linux-build-label.yml` flavor (#130762 ) It was also introduced during the ARC experiment and supposed to be a temporary thing. Fix `use_split_build` option handling in `_linux_build.yml` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130762 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/jeanschmidt	2024-07-21 23:17:17 +00:00
Tom Ritchford	500cbb5b90	Add decomposition for view_copy (#130938 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938 Approved by: https://github.com/peterbell10 ghstack dependencies: #130937	2024-07-21 20:39:24 +00:00
Tom Ritchford	f628813066	Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937 ) * See #128416 and #129476 * Simplify xskip lists in test/functorch/test_ops.py * Add supports_out=True to OpInfos for copy ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937 Approved by: https://github.com/peterbell10	2024-07-21 20:39:24 +00:00
Aaron Orenstein	b193894b94	FakeTensor cache SymInt support (#127596 ) Adds support for SymInts in the FakeTensor cache. A couple notes: 1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain. 2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail. 3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt do not). 4. In the cache entry we store SymInts as a _SymIntOutputStub. 
Perf example: ``` python benchmarks/dynamo/timm_models.py --ci --accuracy --timing --explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda --training --amp --total-partitions 2 --partition-id 0 --output /tmp/training_timm_models.csv --filter crossvit_9_240 ``` fake tensor cache before: ``` INFO: FakeTensor cache stats: INFO: cache_hits: 68137 INFO: cache_misses: 837 INFO: cache_bypasses: INFO: symbolic shape: 48224 INFO: CompositeImplicitAutograd: 917 INFO: non-fake tensor: 70 INFO: non-FakeTensor output: 62 INFO: non-builtin: 8 INFO: dynamic output shape: 1 ``` and after: ``` INFO: FakeTensor cache stats: INFO: cache_hits: 88187 INFO: cache_misses: 14233 INFO: cache_bypasses: INFO: CompositeImplicitAutograd: 1037 INFO: non-FakeTensor output: 602 INFO: non-fake tensor: 70 INFO: unsafe view: 36 INFO: non-builtin: 8 INFO: dynamic output shape: 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596 Approved by: https://github.com/eellison ghstack dependencies: #131014, #129780	2024-07-21 19:26:38 +00:00
Aaron Orenstein	ebce85172e	FakeTensor cache SymInt support: flatten cache key (#129780 ) This is part of #127596, pulled out to make reviewing a little easier. Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path. This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780 Approved by: https://github.com/oulgen, https://github.com/eellison ghstack dependencies: #131014	2024-07-21 19:26:38 +00:00
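The idea of a flattened cache key can be sketched in a few lines. This is only an illustration of the concept, not the FakeTensor cache code: `flatten_key` and the sample `args` are hypothetical, and the real key contains other kinds of elements.

```python
from typing import Any, List


def flatten_key(obj: Any, out: List[Any]) -> None:
    """Append the leaves of a nested tuple/list/dict to `out` in a fixed order.

    Hypothetical helper: the container kind and length are recorded as well,
    so two differently shaped inputs cannot flatten to the same key.
    """
    if isinstance(obj, (tuple, list)):
        out.append(type(obj).__name__)
        out.append(len(obj))
        for item in obj:
            flatten_key(item, out)
    elif isinstance(obj, dict):
        out.append("dict")
        out.append(len(obj))
        for name in sorted(obj):  # assumes string keys, for a stable order
            out.append(name)
            flatten_key(obj[name], out)
    else:
        out.append(obj)


# A made-up operation key: op name, a shape with a symbolic entry, kwargs.
args = ("add", (3, "s0"), {"alpha": 2})
flat: List[Any] = []
flatten_key(args, flat)
key = tuple(flat)  # hashable, usable as a dict key

# A cache entry can now refer to the symbolic input with one integer
# instead of a path through the nested structure.
symbol_index = flat.index("s0")
```

With a flat key, "copy input element `symbol_index` into the output" is a single list lookup, which is the property the commit message describes.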
Aaron Orenstein	f3562e2cdc	backport dataclass(slots=True) (#131014 ) Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version. Copied the code for that functionality from python into a new decorator and ported it to use 3.8 syntax (removed use of `match`). Usage: ``` @dataclass_slots @dataclass class X: pass ``` is the same as (in py3.10): ``` @dataclass(slots=True) class X: pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-07-21 19:26:31 +00:00
Xuehai Pan	1439bd3c9c	[Easy][pytree] enable CXX pytree under `torch::deploy` (#130144 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130144 Approved by: https://github.com/zou3519 ghstack dependencies: #130895, #130139	2024-07-21 07:36:22 +00:00
Animesh Jain	ddde9dd25c	[dynamo][automatic_dynamic] Trigger dynamism on stride changes (#130232 ) Fixes https://github.com/pytorch/pytorch/issues/129798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130232 Approved by: https://github.com/ezyang	2024-07-21 03:45:54 +00:00
Chuanhao Zhuge	e506dfa640	[dynamo] Add a JK kill switch for disabling compile (#131258 ) Summary: The JK disables dynamo by passing None to set_eval_frame. Test Plan: Ran buck test mode/opt caffe2/test/dynamo:test_dynamo Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363 Network: Up: 0B Down: 0B Jobs completed: 9596. Time elapsed: 28:54.5s. Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0 Also manually write a small local test with torch.compile and toggles the code to see if PT2 can be disabled. Validated with running the test and observing the log. PT2 enabled: P1486847242. Can see dynamo log about graph breaks. PT2 disabled: P1486847727. No dynamo log. The newly added warning printed. Reviewed By: ezyang Differential Revision: D59968925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258 Approved by: https://github.com/c00w	2024-07-21 01:22:31 +00:00
cyy	1d1d074072	[3/N] Fix Wunused-parameter warnings (#131271 ) Follows #131170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131271 Approved by: https://github.com/ezyang	2024-07-20 23:31:03 +00:00
Shan19900305	d57af32e63	Fix undefined tensor error in _copy_from_and_resize when fallback to cpu. (#130237 ) 1) Add skip undefined tensor in cpu fallback when call _copy_from_and_resize; 2) Modify to_cpu function support optional tensor; 3) Add copy back to origin optional tensor when alias_info isWrite is true. @ezyang @bdhirsh Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237 Approved by: https://github.com/ezyang	2024-07-20 23:12:17 +00:00
Tristan Rice	13283fb4bc	[distributed] test_store: remove flaky bind test (#131262 ) Fixes https://github.com/pytorch/pytorch/issues/131084 There's no good way to fix this since some tests environments can bind the protected range. Removing test since the value is relatively low since it's just testing error messages. Test plan: ``` python test/distributed/test_store.py -v -k address ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131262 Approved by: https://github.com/mori360, https://github.com/XilunWu	2024-07-20 23:04:31 +00:00
Anshul Sinha	407c87a32c	[debug][dtensor] fixed updating current module (#130995 ) Summary Fixed issue with updating the current module when transitioning between child module to parent module and in the backward pass. The first issue is caused because the prehook is not called again when we go back to the parent module and that the hook being used was a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. In order to do this, I updated the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect. I fixed this issue by saving the parents of each module and using them in collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode. Test Case 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing 3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation 4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995 Approved by: https://github.com/XilunWu ghstack dependencies: #130410	2024-07-20 20:57:29 +00:00
Peter Bell	33f036a6f7	[inductor] Kill mark_node_as_mutating (#130834 ) Resubmit of #129346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834 Approved by: https://github.com/lezcano ghstack dependencies: #130831, #130832, #130833	2024-07-20 18:53:33 +00:00
Nikita Shulga	fccbe85475	[BE] Improve CUDA UpSample error message (#131252 ) `Expected grad_output.numel() <= std::numeric_limits<int32_t>::max() to be true` is not very helpful, it's better to mention method name as well as actual tensor size This error was reported in https://github.com/pytorch/pytorch/issues/131185 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131252 Approved by: https://github.com/albanD	2024-07-20 16:49:34 +00:00
PyTorch UpdateBot	a7a951a4ae	[executorch hash update] update the pinned executorch hash (#130001 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Co-authored-by: Huy Do <huydhn@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001 Approved by: https://github.com/pytorchbot	2024-07-20 16:44:07 +00:00
Xuehai Pan	b6d477fd56	[BE][Easy][16/19] enforce style for empty lines in import segments in `torch/_i*/` (#129768 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768 Approved by: https://github.com/jansel	2024-07-20 16:20:58 +00:00
Soumith Chintala	8e478d4fb1	Add Alban and Piotr into Core Maintainers (#130903 ) See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-20 16:02:42 +00:00
hongxyan	637ab85e7f	fix for launching kernel invalid config error when calling embedding … (#130994 ) …with large index Fixes #130806 When an output size of 2147483648 (=131072*16384) is expected in the above issue, it threw the following error: RuntimeError: HIP error: invalid configuration argument What happened was that the second parameter passed to hipLaunchKernel was crazy {2147483648,1,1}. Found two issues in the Indexing.cu: 1: ptrdiff_t was used but it is signed int, outTotalSize >= 2147483648 can cause overflow when doing [this](`39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)`): 2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648 As a result, 2147483648 was sent to hipLaunchKernel; the GPU does not support such a huge number, since it specifies the number of threads per block. The original code intended to set 128 threads per block, though this is debatable as the perf would not be good for the latest powerful GPUs (a TODO item to update for perf maybe?), but at least it would not cause the `invalid configuration argument` error. [Test] Run the same code snippet in the [issue](https://github.com/pytorch/pytorch/issues/130806), and print the output, its dim and numel(), which looks like below now: ``` output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726], [-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583], [-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077], ..., [ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178], [-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915], [-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]], device='cuda:0'), dim=2, numel=2147483648 ``` Added a large tensor unit test too.
``` /pytorch# pytest test/nn/test_embedding.py -k test_large_tensors ================================================================================== test session starts =================================================================================== platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0 rootdir: /dockerx/development/pytorch configfile: pytest.ini plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1 collected 288 items / 287 deselected / 1 selected Running 1 items in this shard test/nn/test_embedding.py . [100%] =========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994 Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell	2024-07-20 08:33:29 +00:00
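The numbers in this report can be checked with a short sketch. It only models signed 32-bit wraparound in plain Python and is not the code in Indexing.cu; `to_int32` is a hypothetical helper, and the 32-bit truncation is an assumed mechanism chosen because it reproduces the reported launch dimension.

```python
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


out_total_size = 131072 * 16384  # output size from the issue
threads_wanted = 128             # block size the original code intended

# With wide enough arithmetic the clamp behaves as intended.
threads_wide = min(out_total_size, threads_wanted)

# Assumption: if the size passes through a signed 32-bit value first,
# it wraps to a negative number and min() selects it instead of 128.
wrapped = to_int32(out_total_size)
threads_narrow = min(wrapped, threads_wanted)

# Reinterpreted as an unsigned launch dimension, that is the huge value
# quoted in the commit message.
launch_dim = threads_narrow % 2**32
```

Here `out_total_size` is exactly one more than `INT32_MAX`, which is why the problem only shows up at this size and above.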
Wu, Chunyuan	a8319698b3	[inductor] [cpp] improve cache blocking with CPU info (#129348 ) ## Description For single thread case, this PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size). `Mc_blocks` and `Kc_blocks` are calculated based on the below condition: - size_of_B < L1 - size_of_A < 0.5 * L2 For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations. ## Performance No regressions. Models with > 3% performance speedup are listed below: ### BF16 single thread (measured on CPU with AMX support) - static shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| detectron2_fasterrcnn_r_101_dc5\| 4% - dynamic shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| detectron2_fasterrcnn_r_101_dc5\| 4% ### FP32 single thread (measured on Ice Lake) - static shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| basic_gnn_edgecnn\| 10% - dynamic shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| basic_gnn_edgecnn\| 10% ### Next step The E2E level improvement is limited due to the below reasons: - For several HF models, we can observe performance improvement at kernel level for the gemm template kernel but since the performance is either still worse than ATen kernel (thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, so we don't see E2E level performance change. - There're models where the gemm template kernel could get > 10% performance improvement with this PR but since the kernel time is only about 3% of the E2E time, we don't observe significant E2E level improvement. 
We will continue to find possible optimizations in the gemm template kernel in follow-up PRs. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #130675, #130690	2024-07-20 06:53:31 +00:00
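The two blocking conditions quoted above (`size_of_B < L1` and `size_of_A < 0.5 * L2`) can be turned into a small calculation. The sketch below is an assumption-heavy illustration, not the Inductor heuristic: it assumes the B block is `Kc x Nr` elements and the A block is `Mc x Kc` elements, and the function name, register block sizes, and cache sizes are all made up for the example.

```python
def cache_blocks(l1_bytes, l2_bytes, mr, nr, kr, dtype_bytes):
    """Pick the largest (Mc_blocks, Kc_blocks) meeting both conditions.

    Assumed tile shapes (not taken from the PR):
      B block = (Kc_blocks * kr) x nr elements, must be < L1
      A block = (Mc_blocks * mr) x (Kc_blocks * kr) elements, must be < L2 / 2
    """
    # Largest Kc_blocks with Kc_blocks * kr * nr * dtype_bytes < l1_bytes.
    kc_blocks = max(1, (l1_bytes - 1) // (kr * nr * dtype_bytes))
    kc = kc_blocks * kr
    # Largest Mc_blocks with 2 * Mc_blocks * mr * kc * dtype_bytes < l2_bytes.
    mc_blocks = max(1, (l2_bytes - 1) // (2 * mr * kc * dtype_bytes))
    return mc_blocks, kc_blocks


# Hypothetical machine: 48 KiB L1, 2 MiB L2, fp32, 16x32 register tile.
L1, L2 = 48 * 1024, 2 * 1024 * 1024
mc_blocks, kc_blocks = cache_blocks(L1, L2, mr=16, nr=32, kr=32, dtype_bytes=4)

size_of_b = kc_blocks * 32 * 32 * 4
size_of_a = mc_blocks * 16 * kc_blocks * 32 * 4
```

Choosing `Kc_blocks` first and then `Mc_blocks` is one possible ordering; the PR may derive them differently.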
Jiong Gong	0b44e1a74c	[inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690 ) Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32 etc., and fall back to the reference micro gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n` (8, 16, 32 etc.) for packed weight. Therefore, the micro-gemm can work as is on the padded `n`. When the weight is padded, we use the local accumulation buffer to get the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16. Before AUTOTUNE linear_unary(512x768, 3073x768, 3073) _linear_pointwise 2.3563 ms 100.0% cpp_packed_gemm_0 710.5902 ms 0.3% After AUTOTUNE linear_unary(512x768, 3073x768, 3073) cpp_packed_gemm_0 1.8909 ms 100.0% _linear_pointwise 2.1016 ms 90.0% Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #130675	2024-07-20 06:30:15 +00:00
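The pad, compute, then slice pattern described here can be sketched in plain Python. This is only a toy model of the idea: `pad_to_multiple`, `matmul`, and `padded_matmul` are hypothetical helpers operating on nested lists, and nothing below reflects the actual C++ template or its weight packing.

```python
def pad_to_multiple(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


def matmul(a, b):
    """Reference product of an (m x k) and a (k x n) nested-list matrix."""
    return [[sum(a_row[p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]


def padded_matmul(a, b, block):
    """Pad the N dimension of b with zeros, multiply, then slice back."""
    n = len(b[0])
    n_pad = pad_to_multiple(n, block)
    b_pad = [row + [0.0] * (n_pad - n) for row in b]
    acc = matmul(a, b_pad)             # stands in for the local accumulation buffer
    return [row[:n] for row in acc]    # unpad before "storing" the result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # n = 3 is not a multiple of 8
result = padded_matmul(a, b, block=8)
```

For the shape in the benchmark above, `pad_to_multiple(3073, 16)` gives 3088, so only 15 extra zero columns are computed and discarded.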
ankurneog	ebc012ace6	Add hooks for execution on intel gaudi devices - 1 (#128584 ) ## Motivation This is follow up to PR:https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for Pytorch UT execution. ## Changes We are adding additional hooks to: 1. Add dtype exceptions for Gaudi/HPU 2. Extend onlyNativeDevices decorator functionality to add additional devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584 Approved by: https://github.com/albanD	2024-07-20 05:03:36 +00:00
Michael Lazos	d31f2ae904	Ensure invariant that all inputs have tensor dict (#131249 ) There was a path with freezing enabled that violated the invariant that all inputs have the "tensor_dict" meta. This ensures that `register_attr_or_module` also sets tensor_dict meta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131249 Approved by: https://github.com/anijain2305	2024-07-20 04:40:58 +00:00
drisspg	37337ef5c3	add some description on create_block_mask and mask mods (#131209 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131209 Approved by: https://github.com/joydddd	2024-07-20 04:40:48 +00:00
Yifu Wang	168c0e24a5	[IntraNodeComm] Fix some issues in two-shot all-reduce (#131244 ) Two issues: - Similar to https://github.com/pytorch/pytorch/pull/129501, two-shot all-reduce's reduction order was different across ranks. This PR fixes it. - When migrated to use SymmetricMemory, I accidentally used `get_buffer_ptrs_dev` instead of `get_buffer_ptrs` (the former is an on-device array). This PR fixes it (for https://github.com/pytorch/pytorch/issues/131215). The failing snippet provided by https://github.com/pytorch/pytorch/issues/131215 now works.
```python
import os

import torch
import torch.distributed as dist


def _get_global_rank() -> int:
    return int(os.environ.get("LOCAL_RANK", "0"))


def is_local():
    return _get_global_rank() == 0


def _get_world_size() -> int:
    return int(os.environ.get("LOCAL_WORLD_SIZE", "1"))


global_rank = _get_global_rank()
world_size = _get_world_size()
torch.cuda.set_device(global_rank)
dist.init_process_group(backend="nccl")
global_group = dist.group.WORLD
draft_group = dist.new_group([0,1])

inp = torch.full((128, 1, 4096), global_rank, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(inp, group=global_group)
expect = sum(range(world_size))
assert inp.eq(expect).all()

if 0 <= global_rank < 2:
    inp = torch.full((128, 1, 2048), global_rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp, group=draft_group)
    expect = sum(range(2))
    assert inp.eq(expect).all()

torch.cuda.synchronize()
print("success")
dist.destroy_process_group()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131244 Approved by: https://github.com/weifengpy, https://github.com/Chillee	2024-07-20 02:51:45 +00:00
Xuehai Pan	d2bd9acabd	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launch with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519 ghstack dependencies: #130895	2024-07-20 02:41:10 +00:00
Yidi Wu	50436d5bdb	[export] fix zero arg export in training_ir (#130990 ) Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few retraceability failures because run_decomposition does a retracing. edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure: a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state. The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990 Approved by: https://github.com/pianpwk	2024-07-20 02:35:13 +00:00
Sam Larsen	3c43fe068f	[inductor] parallel compile: Create new pipes for subproc communication (#131194 ) Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. https://github.com/pytorch/pytorch/issues/131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. Test Plan: I was able to repro the MemoryError in https://github.com/pytorch/pytorch/issues/131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes it. Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131194 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman	2024-07-20 02:23:01 +00:00
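The general technique (dedicated pipes whose descriptor numbers are passed on the command line) can be sketched with the standard library. This is a minimal POSIX-only illustration, not the Inductor subprocess code: `CHILD`, `round_trip`, and the upper-casing "protocol" are made up, and `pass_fds` is not available on Windows.

```python
import os
import subprocess
import sys

# Child program: reads from one inherited descriptor, writes to another.
# Its print() goes to stdout and cannot corrupt the dedicated channel.
CHILD = r"""
import os, sys
read_fd, write_fd = int(sys.argv[1]), int(sys.argv[2])
print("noise on stdout")
with os.fdopen(read_fd, "rb") as r, os.fdopen(write_fd, "wb") as w:
    w.write(r.read().upper())
"""


def round_trip(payload: bytes) -> bytes:
    """Send payload to a child over fresh pipes and return its reply."""
    parent_to_child_r, parent_to_child_w = os.pipe()
    child_to_parent_r, child_to_parent_w = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD,
         str(parent_to_child_r), str(child_to_parent_w)],
        pass_fds=(parent_to_child_r, child_to_parent_w),  # keep these open in the child
        stdout=subprocess.DEVNULL,
    )
    # The parent no longer needs the child's ends of the pipes.
    os.close(parent_to_child_r)
    os.close(child_to_parent_w)
    with os.fdopen(parent_to_child_w, "wb") as w:
        w.write(payload)  # closing signals EOF to the child
    with os.fdopen(child_to_parent_r, "rb") as r:
        reply = r.read()
    proc.wait()
    return reply


reply = round_trip(b"hello")
```

Because the message channel is separate from stdout, anything a third-party library prints in the child is harmless to the framing, which is the failure mode the commit describes.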
Peter Bell	9df8ea1cf2	[inductor] Use multiple outputs for flex-attention (#130833 ) Resubmit of #129344 This fixes the DCE issue for attention output Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833 Approved by: https://github.com/lezcano ghstack dependencies: #130831, #130832	2024-07-20 02:05:10 +00:00
Peter Bell	deacc543f1	[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832 ) Resubmit of #129325 Previously each mutation was represented by a `MutationOutput` operation which was a new scheduler node that must be scheduled immediately afterwards. Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832 Approved by: https://github.com/lezcano ghstack dependencies: #130831	2024-07-20 02:05:10 +00:00
Peter Bell	27c2a0d63b	[inductor] Separate Buffer and Operation into two concepts (#130831 ) Resubmit of #128893 Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831 Approved by: https://github.com/lezcano	2024-07-20 02:05:07 +00:00
Isuru Fernando	bb4251213b	Add decomposition for channel_shuffle (#118775 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775 Approved by: https://github.com/peterbell10	2024-07-20 01:24:41 +00:00
Xuehai Pan	f0075c179b	Pin `sympy >= 1.13.0` (#130895 ) ------ The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8. - #130836 See the PR description of #130836 for more details. `sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically: - Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations) > BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr)) `sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689. - #130689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895 Approved by: https://github.com/ezyang	2024-07-20 00:59:24 +00:00
PyTorch MergeBot	30d1826b2b	Revert "[executorch hash update] update the pinned executorch hash (#130001 )" This reverts commit 4821f72457afd7b1b5c61c1c8c3c49105c1bd22d. Reverted https://github.com/pytorch/pytorch/pull/130001 on behalf of https://github.com/clee2000 due to the test_sympy_utils failure is real, Dr. CI is wrong https://github.com/pytorch/pytorch/actions/runs/10015433275/job/27687163560 `4821f72457` ([comment](https://github.com/pytorch/pytorch/pull/130001#issuecomment-2240807631))	2024-07-20 00:56:14 +00:00
cyy	cd8bbdc71a	[2/N] Fix Wunused-parameter warnings (#131170 ) Follows #130924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131170 Approved by: https://github.com/mikaylagawarecki	2024-07-19 23:58:56 +00:00
rzou	207fb96155	[functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191 ) There's no reason to ban them for vmap or jvp, because without the {grad, vjp} transforms those just act above PyTorch autograd, which will end up saving regular Tensors. Test Plan: - some tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191 Approved by: https://github.com/drisspg	2024-07-19 23:16:27 +00:00
PyTorch UpdateBot	4821f72457	[executorch hash update] update the pinned executorch hash (#130001 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001 Approved by: https://github.com/pytorchbot	2024-07-19 23:10:20 +00:00
PyTorch MergeBot	7c299b46ca	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit 8390843eba6271dcdbec7d048e9fa4e56d4479d8. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))	2024-07-19 22:58:51 +00:00
Shuo Ding	35bf05561c	[Inductor] B2B-GEMM performance tuning with test (#130778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130778 Approved by: https://github.com/eellison	2024-07-19 22:53:57 +00:00
peaceorwell	6657b14a64	[inductor] Fix the method for checking the variable type of entry.numel (#131026 ) The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer. Co-authored-by: peterbell10 <peterbell10@live.co.uk> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026 Approved by: https://github.com/peterbell10	2024-07-19 22:51:11 +00:00
PyTorch MergeBot	0e72baddf0	Revert "[easy][pytorch][counters] Move WaitCounter in c10/util (#131021 )" This reverts commit 0ca7b6ddd91192ebffd3c88bf314d07ba6cddf50. Reverted https://github.com/pytorch/pytorch/pull/131021 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/131021#issuecomment-2240280827))	2024-07-19 21:56:09 +00:00
Shuqiang Zhang	4aef5a1134	[c10] add an option to pg_config split share (#130877 ) Summary: context is: #129865 We want to give users an option to not share comms resources so that comm opts can overlap Test Plan: Augmented UT Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877 Approved by: https://github.com/fduwjj	2024-07-19 21:11:26 +00:00
Andrii Grynenko	0ca7b6ddd9	[easy][pytorch][counters] Move WaitCounter in c10/util (#131021 ) Summary: Since WaitCounter frontend itself has minimal dependencies it's fine to be moved into c10. Specific backends can be registered/linked separately. Test Plan: unit test Reviewed By: jamesperng, asiab4, c-p-i-o Differential Revision: D59842868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021 Approved by: https://github.com/asiab4	2024-07-19 20:58:32 +00:00
Zain Rizvi	c64ad2403c	LF runners: Add new runner types for Amazon2023 AMIs (#131246 ) Add new LF runner types with the Amazon2023 ami, matching the change done in https://github.com/pytorch/test-infra/pull/5487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131246 Approved by: https://github.com/malfet	2024-07-19 20:30:41 +00:00
lessw2020	85ca88a2bb	[Distributed][PP export] update tracing to handle autocast inclusion (#130998 ) Fixes https://github.com/pytorch/pytorch/issues/128394 This updates PP export tracing to use the no_grad() context and avoid predispatch. This enables tracing for HF llama models that currently fail due to not handling the use of autocast in the Rope embeddings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130998 Approved by: https://github.com/fduwjj	2024-07-19 20:08:00 +00:00
Yidi Wu	ceee87df2e	[export] modify export code owners (#130894 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130894 Approved by: https://github.com/zhxchen17	2024-07-19 19:49:34 +00:00
PyTorch MergeBot	5f981388ec	Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 )" This reverts commit d7a78ec8b938a61297221912464f5afef288b823. Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143))	2024-07-19 19:46:26 +00:00
Li-Huai (Allan) Lin	125be005eb	[Docs] Fix fake tensor doc (#131205 ) Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205 Approved by: https://github.com/eellison	2024-07-19 17:59:45 +00:00
Animesh Jain	e49c0acc39	[dynamo] Revert https://github.com/pytorch/pytorch/pull/130416 (#131058 ) All the changes brought by the original PR have been addressed in alternative ways in the stack. Why the original PR has to be reverted requires more effort because there is some bad interaction with export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058 Approved by: https://github.com/williamwen42	2024-07-19 17:26:24 +00:00
henrylhtsang	042be441ba	[aoti] Unskip some aot inductor tests (#130973 ) Trying to unskip some tests, and if they are still broken, add reasons. ## example testing command ``` pytest -v test/inductor/test_aot_inductor.py -k test_add_complex ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130973 Approved by: https://github.com/ColinPeppler	2024-07-19 17:19:35 +00:00
Jiashen Cao	9b5c70878b	[Fix] Missing parameter happens when retracing an already jit.scripted module (#129787 ) #### Issue Model parameters sometimes do not appear in the `named_parameters()` function. For example, when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters`requires_grad=True` and buffers. #### Test Plan * `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787 Approved by: https://github.com/angelayi	2024-07-19 16:58:48 +00:00
Zhengxu Chen	abb3f2822c	[aotinductor] Support additional lifted constants supplied to const folding. (#130743 ) Summary: In export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This causes some compatibility issues when we're trying to use inductor's split_const_gm function with a lifted graph. This diff makes an additive change to split_const_gm's interface, such that, when the pass sees a placeholder node is present in the lifted_constants table, it will also use that as the source of constness. This change won't break the existing code and the lifted_constants table can be used orthogonally to the existing const folding mechanisms. Also as required by the MTIA team, we want to introduce a small callback function used to skip certain nodes during const folding. For the internal followup counterpart, see D59685145 Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm Differential Revision: D59692790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743 Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad	2024-07-19 16:48:56 +00:00
Catherine Lee	31e79aae6a	Another follow up to #130260 (#130993 ) Another followup to https://github.com/pytorch/pytorch/pull/130260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993 Approved by: https://github.com/huydhn	2024-07-19 16:43:54 +00:00
xingyunjohn1	d4a79d4a7c	Fix an example: Resolve broadcasting error in attn_bias and attn_mask… (#130209 ) … addition, fix device assignment for newly created variables in method Fix an example: Resolve broadcasting error in attn_bias and attn_mask addition, fix device assignment for newly created variables in method 1. `attn_bias += attn_mask` would cause a broadcasting error. Because the shape of `attn_bias` is (L, S), the shape of the output would be expected as (L, S) too. When the shape of input is (N, num_heads, L, S), broadcasting should be triggered. Then, the shape of the output would be (N, num_heads, L, S), which is unexpected. 2. `attn_bias` is a newly created variable in the method, and it is not assigned a device. This is my retry of #130200 . I used the wrong account in that PR. Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209 Approved by: https://github.com/mikaylagawarecki	2024-07-19 15:23:22 +00:00
sradc	451fc029fe	docs: note transposed weight initialisations (#130122 ) Fixes #129834 Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130122 Approved by: https://github.com/mikaylagawarecki	2024-07-19 15:23:03 +00:00
PyTorch MergeBot	5f3d8b8788	Revert "[c10] add an option to pg_config split share (#130877 )" This reverts commit 367213a608528ee74e67e03bf11f775e263ef480. Reverted https://github.com/pytorch/pytorch/pull/130877 on behalf of https://github.com/atalman due to breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/130877#issuecomment-2239298810))	2024-07-19 14:24:50 +00:00
Andres Suarez	25d8a0480b	[lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions Differential Revision: D59935630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187	2024-07-19 07:19:11 -07:00
Edward Z. Yang	a6a2cd6257	Typo fix (#131037 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131037 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-19 13:17:54 +00:00
Michael Lazos	1b72cf0b09	Add hasattr for tensor variable (#131008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131008 Approved by: https://github.com/anijain2305 ghstack dependencies: #131007	2024-07-19 12:43:27 +00:00
Syed Tousif Ahmed	1f961ad495	Runs aten cuda cpp tests in CI (#131061 ) It seems like these tests are never run because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` which would make the if conditions true in CUDA builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061 Approved by: https://github.com/malfet, https://github.com/eqy	2024-07-19 12:35:33 +00:00
Jack Taylor	d7a78ec8b9	[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 ) As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm. ``` >>> torch.cuda.get_device_properties(0) _CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104) >>> torch.cuda.get_device_properties(0).regs_per_multiprocessor 65536 ``` With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with AMD backend allowing us to enable inductor's dynamic_rblock_scaling on ROCm initially implemented in https://github.com/pytorch/pytorch/pull/115094 Leaving this in draft until following PRs have landed: - https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin - https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-07-19 09:45:03 +00:00
cyy	feef057691	[1/N] Fix Wunused-parameter warnings (#130924 ) Before we can turn Wunused-parameter into an error Pull Request resolved: https://github.com/pytorch/pytorch/pull/130924 Approved by: https://github.com/ezyang	2024-07-19 06:14:51 +00:00
Oguz Ulgen	eee76c86a8	Write trace_structured events to scuba (#130955 ) Summary: https://fb.workplace.com/groups/1286739428954016/posts/1287192258908733 Test Plan: Run test with tlparse and inspect https://www.internalfb.com/intern/scuba/query/?dataset=pt2_trace_structured_events Differential Revision: D59866096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130955 Approved by: https://github.com/ezyang	2024-07-19 06:02:47 +00:00
Chirag Pandya	982309b501	Initial commit of flight recorder trace (#130764 ) Summary: `fr_trace.py` is used to analyze flight recorder dump files. This script was taken from @wconstab and @zdevito. Only minor changes made were to make the linter happy and add a few odd new fields that I added in version `2.2` of the collector portions. Test Plan: Tested manually on some flight recorder data and it seems to run. TODO: Address 15 odd `#type: ignore` that I put in there to make the linter happy for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764 Approved by: https://github.com/fduwjj	2024-07-19 06:00:54 +00:00
Justin Chu	fd4899bc58	[ONNX] Run ruff pyupgrade to update type annotations (#130657 ) Use the newest syntax for type annotations Pull Request resolved: https://github.com/pytorch/pytorch/pull/130657 Approved by: https://github.com/titaiwangms	2024-07-19 05:09:44 +00:00
kausik	4f60a2e39c	Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953 ) Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. Please refer: https://github.com/pytorch/pytorch/pull/121450 However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow. This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-07-19 04:58:02 +00:00
chilli	d59803fb67	Refactored flexattention kernel (#130904 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130904 Approved by: https://github.com/drisspg ghstack dependencies: #130871	2024-07-19 04:56:32 +00:00
Animesh Jain	ac76dd606f	[dynamo] Alternative way to skip empty hooks guards on inbuilt nn modules (#131057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131057 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #131056	2024-07-19 04:42:38 +00:00
Animesh Jain	00e54e74ff	[dynamo][cpp-guards] Fix bug in dict tags (#131056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131056 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-07-19 04:42:38 +00:00
Peter Bell	3c622fbcd3	[inductor] Fix var_to_range in IndexPropagation (#130984 ) The current code assumes that indirect variables will be created by the same `IndexPropagation` instance, however that isn't true in the case of masked sub-blocks where we take in variables from the parent block. This fixes the issue by moving the var range information up to the `LoopBody` object where it can be shared by all sub-blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984 Approved by: https://github.com/lezcano	2024-07-19 03:08:00 +00:00
Feng Yuan	b556d31586	Update torch-xpu-ops pin (ATen XPU implementation) (#131015 ) Regular update. 1. New 90 ATen operators and their variants are supported for XPU. 2. Bugfixing: a. Fixing out-of-bound memory access in index_put kernel b. Fixing debug build error 3. Binary change. Split device AOT code of SYCL kernel into multiple libraries to avoid linkage failure. 4. torch-xpu-ops test case enhancement: a. Hook PyTorch testing op_db to align opInfo configuration with CUDA b. Hook _check_arg_device2 and freeze_rng_state to make XPU happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015 Approved by: https://github.com/EikanWang	2024-07-19 02:18:55 +00:00
Ma, Jing1	52cb9abb1d	Add deterministic support in nn.functional.interpolate for XPU (#129864 ) For both CUDA and XPU, there is no native deterministic implementation of `aten::upsample_bilinear` and `aten::replication_pad`. CUDA leverages the operator decomposition path in the frontend hook `nn.functional.interpolate` as its deterministic implementation. The XPU backend uses the same solution in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129864 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/EikanWang	2024-07-19 02:15:42 +00:00
Jiong Gong	39493aa934	[inductor][cpp][gemm] move bias add to epilogue (#130675 ) Speedup bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16. Before AUTOTUNE linear_unary(512x768, 3072x768, 3072) cpp_packed_gemm_0 1.9200 ms 100.0% _linear_pointwise 1.9345 ms 99.3% After AUTOTUNE linear_unary(512x768, 3072x768, 3072) cpp_packed_gemm_0 1.8321 ms 100.0% _linear_pointwise 1.9246 ms 95.2% Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-07-19 01:16:34 +00:00
xinan.lin	5a6a806b19	[Inductor UT] Generalize device-bias code in case TestFxGraphCache.test_inductor_counters. (#131006 ) [Inductor UT] Generalize device-bias code in case `TestFxGraphCache.test_inductor_counters`. Fix #131005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131006 Approved by: https://github.com/masnesral	2024-07-19 01:14:22 +00:00
Will Feng	208dffa702	[Compiled DDP] DDP + AC unit test (#130981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130981 Approved by: https://github.com/fegin	2024-07-19 01:07:41 +00:00
cyy	3cc6183ce1	Fix getAugOp error (#131033 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131033 Approved by: https://github.com/ezyang	2024-07-19 01:07:24 +00:00
Xu Han	6e7b9ee8a0	[inductor] adapt windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully. The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully. ```python import torch def foo(x, y): a = torch.sin(x) b = torch.cos(x) return a + b opt_foo1 = torch.compile(foo) print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10))) ``` Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-18 23:19:38 +00:00
Justin Chu	e880cb2fe0	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment when using torch.onnx 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-18 22:07:40 +00:00
PyTorch MergeBot	fb3674b1f4	Revert "[Autograd] Cond Higher-Order Operation (#126911 )" This reverts commit f7058b735e52a1d876912f8c96a594673a495007. Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch `f7058b735e` Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))	2024-07-18 22:06:40 +00:00
Jiashen Cao	686b7f046a	[Fix]: TSConverter handles call ops with multiple outputs (#129294 ) #### Issue * Currently, call ops do not handle IR with multiple outputs. If an op has multiple outputs, we add an implicit unpack to map output. E.g., ``` %5 : Tensor, %6 : Tensor = aten::max(%x.1, %3, %4), scope: export.test_converter.M:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:774:20 ``` * There are some cases where `prim::If` sub-blocks do not return any outputs. E.g., ``` %9 : bool = aten::gt(%8, %3), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:9 = prim::If(%9), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2 block0(): -> () block1(): = prim::RaiseException(%5, %4), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2 -> () ``` #### Test Plan We did an exhaustive search of all torch APIs that can return multiple outputs. We sample some of the common ones and add new test cases based on those. * `pytest test/export/test_converter.py -s -k test_ts2ep_multi_outputs_on_call_ops` #### Appendix * aten ops that return multiple outputs. 
``` aten._batch_norm_impl_index aten._batch_norm_no_update aten._batch_norm_with_update aten._batch_norm_with_update_functional aten._cudnn_rnn aten._efficient_attention_backward aten._efficient_attention_forward aten._embedding_bag aten._embedding_bag_forward_only aten._flash_attention_backward aten._flash_attention_forward aten._fused_adam aten._fused_dropout aten._fused_moving_avg_obs_fq_helper aten._linalg_det aten._linalg_eigh aten._linalg_slogdet aten._linalg_solve_ex aten._linalg_svd aten._native_batch_norm_legit aten._native_batch_norm_legit_functional aten._native_batch_norm_legit_no_training aten._pack_padded_sequence aten._prelu_kernel_backward aten._scaled_dot_product_efficient_attention aten._scaled_dot_product_efficient_attention_backward aten._scaled_dot_product_flash_attention aten._scaled_dot_product_flash_attention_backward aten._scaled_dot_product_flash_attention_for_cpu aten._scaled_dot_product_flash_attention_for_cpu_backward aten._thnn_fused_lstm_cell aten._thnn_fused_lstm_cell_backward_impl aten._unique2 aten._weight_norm_interface aten.adaptive_max_pool2d aten.adaptive_max_pool3d aten.aminmax aten.batch_norm_backward aten.convolution_backward aten.cudnn_batch_norm aten.cudnn_batch_norm_backward aten.cummax aten.cummin aten.fractional_max_pool2d aten.frexp aten.grid_sampler_2d_backward aten.grid_sampler_3d_backward aten.gru aten.linalg_cholesky_ex aten.linalg_eig aten.linalg_inv_ex aten.linalg_ldl_factor_ex aten.linalg_lu aten.linalg_lu_factor_ex aten.linalg_qr aten.linear_backward aten.log_sigmoid_forward aten.lstm aten.lu_unpack aten.max aten.max_pool2d_with_indices aten.max_pool3d_with_indices aten.median aten.min aten.miopen_batch_norm aten.miopen_batch_norm_backward aten.mkldnn_rnn_layer aten.mkldnn_rnn_layer_backward aten.mode aten.multilabel_margin_loss_forward aten.nanmedian aten.native_batch_norm aten.native_batch_norm_backward aten.native_dropout aten.native_group_norm aten.native_group_norm_backward aten.native_layer_norm 
aten.native_layer_norm_backward aten.nll_loss2d_forward aten.nll_loss_forward aten.quantized_gru aten.quantized_lstm aten.rnn_relu aten.rnn_tanh aten.sort aten.std_mean aten.topk aten.triangular_solve aten.unique_dim aten.var_mean ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129294 Approved by: https://github.com/angelayi	2024-07-18 21:55:18 +00:00
Alnis Murtovi	7f1cda1533	Autoheuristic: Do not store choices as metadata (#130304 ) While for optimizations like pad_mm, there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can instead take a look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`). In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304 Approved by: https://github.com/eellison	2024-07-18 21:39:42 +00:00
zdevito	4d9f2a6d56	Small expandable segments refactor. (#130889 ) Makes next PRs that will export/import segment handles easier to write. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130889 Approved by: https://github.com/dsjohns2 ghstack dependencies: #130888	2024-07-18 21:34:38 +00:00
zdevito	d8fed480ef	Move handle-creation logic into cudacaching allocator. (#130888 ) A later PR will then make the handle abstract and able to use either cudaMalloc or expandable segments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130888 Approved by: https://github.com/dsjohns2	2024-07-18 21:34:38 +00:00
Richard Zou	3e9cf1cc80	Fix potential segfault during deletion (#131036 ) Summary: See comment in code Test Plan: code reading Reviewed By: albanD Differential Revision: D59872819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131036 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-07-18 21:18:31 +00:00
Thomas Bohnstingl	f7058b735e	[Autograd] Cond Higher-Order Operation (#126911 ) This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007) @ydwu4 I tried to incorporate your requests already. Currently there are two problems that I struggle with solving: 1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](`8a704035c9/torch/__init__.py (L1914-L1916)`). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond. 2. I am not entirely sure how to deal with the opinfo test in `hop_db.py` Co-authored-by: Yidi Wu <yidi@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911 Approved by: https://github.com/ydwu4	2024-07-18 21:09:09 +00:00
JackCaoG	24467ba2ec	Update pin (#130896 ) Test the XLA pin update Pull Request resolved: https://github.com/pytorch/pytorch/pull/130896 Approved by: https://github.com/anijain2305	2024-07-18 21:04:30 +00:00
Jerry Zhang	793b17ebcb	Add numeric_debugger top level APIs (#130643 ) Summary: Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model and calculate summary for metric comparisons between nodes in two graphs * `prepare_for_propagation_comparison` * `extract_results_from_loggers` * `compare_results` Test Plan: python test/test_quantization.py -k test_prepare_for_propagation_comparison python test/test_quantization.py -k test_extract_results_from_loggers Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-18 20:54:18 +00:00
PyTorch MergeBot	726b9268d2	Revert "Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 )" This reverts commit c986aeea2d7d9403be702119e3dd4dcb18134fc2. Reverted https://github.com/pytorch/pytorch/pull/126376 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/126376#issuecomment-2237496633))	2024-07-18 20:25:20 +00:00
Peter Bell	e7f7c5c3f8	[inductor] Avoid fallback case for custom scan op lowering (#130936 ) We currently can't generate split scans when there are multiple scan values, so we normally fall back to ATen. However, for the higher order scan op, we can't fallback so it makes sense to just generate the slower kernel anyway. This avoids having special shapes where we fail to codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936 Approved by: https://github.com/lezcano	2024-07-18 19:53:47 +00:00
Shuqiang Zhang	367213a608	[c10] add an option to pg_config split share (#130877 ) Summary: context is: #129865 We want to give users an option to not share comms resources so that comm opts can overlap Test Plan: Augmented UT Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877 Approved by: https://github.com/fduwjj	2024-07-18 19:03:00 +00:00
drisspg	c015e5b9e3	Make sure that TransformGetItemToIndex for all graph replay (#131003 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131003 Approved by: https://github.com/Chillee ghstack dependencies: #130871	2024-07-18 18:32:21 +00:00
redwrasse	82242a258a	rm duplicate index_dtype arg (#130803 ) - Remove duplicate `index_dtype` argument for `_test_meta_sparse_compressed` operation. - Also remove unused `y_v_numel` variable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130803 Approved by: https://github.com/soulitzer	2024-07-18 18:30:13 +00:00
joydddd	6d9f74f0af	Add flex decoding benchmark (#130850 ) ghstack-source-id: b4f26fb66ed47907b11580c8c853737959c58811 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130788 Add benchmark for flex decoding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130850 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-07-18 18:09:25 +00:00
PyTorch MergeBot	fff92d4f18	Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494 )" This reverts commit 9f392f8294e928aec49599ad649aa899e1356102. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))	2024-07-18 17:42:05 +00:00
Pian Pawakapan	745324e487	[export] turn on hybrid symints by default (#130775 ) Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes. For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert. ``` x = torch.randn(4, 8) # [s0, s1] y = torch.randn(32) # [s2] out = x.reshape(-1) + y # this emits Eq(s0 * s1, s2), and we represent y's shape as [s0s1] in the graph. ``` However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization. We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guard we can't express with Dims/DerivedDims either are handled with Hybrid SymInts, or should be resolved with rewriting or deferring. Follow up: Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775 Approved by: https://github.com/avikchaudhuri	2024-07-18 17:40:58 +00:00
Michael Lazos	22388ffe03	Graph break on tostring for numpy remapping (#131007 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131007 Approved by: https://github.com/williamwen42	2024-07-18 17:23:41 +00:00
Boyuan Feng	8bf0be7c78	[CUDAGraph] Add operator.mul to skip list for find_input_mutations (#130986 ) The #130912 error happens since `operator.mul` does not have `_schema`. So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3. %mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {}) `%arg689_1` is a placeholder with `meta['val'] = s0`. It comes from dynamic shapes and represents the batch size since it’s also used in many other nodes such as: %view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {}) and %native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {}) To fix the issue, we can add `operator.mul` to skip list. Fixes #130912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986 Approved by: https://github.com/eellison	2024-07-18 17:11:39 +00:00
mori360	5979014059	DSD for TorchTune LoRA (#129635 ) Fixes #128745 Solve the issue with conflicts when users use full_state_dict while the model is FSDP. Currently solves the issue for `full_state_dict=True`, with error `'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).` TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is `NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635 Approved by: https://github.com/fegin	2024-07-18 17:00:35 +00:00
Zhengxu Chen	5484c86021	[export] Fully support extension op in serialization/deserialization. (#130851 ) Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be used first by executorch. Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization Differential Revision: D59825148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851 Approved by: https://github.com/angelayi	2024-07-18 16:47:53 +00:00
Iris Z	85451b2cde	[DTensor] Fix shard_dim_alltoall fake tensor return (#129945 ) shard_dim_alltoall op has a return type as a Tensor in its schemas (here: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L628), but its FakeTensor implementation returns a list of tensors (see the chunk() call here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/_collective_utils.py#L33). So it would error out when device="meta". This PR fixes the fake tensor mode return type for 1d mesh and adds a test to compare shape with non-meta tensor case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129945 Approved by: https://github.com/wanchaol	2024-07-18 16:43:40 +00:00
eellison	16aaff7783	Fix mm pad regresion - more conservative estimation of plannable inputs (#128909 ) - More conservative estimation of plannable inputs - Consider constant_pad_nd as pointwise node in concat lowering - Use aten.cat instead of constant_pad_nd when padding just a single dimension because it can be memory-planned away Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909 Approved by: https://github.com/Chillee	2024-07-18 16:42:30 +00:00
Shangdi Yu	27ded03545	[FX][export] DCE pass, check schema for node impurity (#130395 ) Change the default DCE pass to check node schema for impure nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395 Approved by: https://github.com/angelayi, https://github.com/jgong5	2024-07-18 16:31:40 +00:00
Anshul Sinha	32ff04d30a	[dtensor][debug] adding functionality to control noisiness of the debug output (#130410 ) Summary Currently, the output of CommDebugMode contains a lot of noise, such as operations that usually won’t tell the user much information such as aten.detach.default. I have created a set of these trivial operations and added a user argument noise_level for users to choose how much information they would want to receive. noise_level = 1 prints module-level collective counts noise_level = 2 prints operations not included in trivial operations and module information noise_level = 3 prints all operations In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410 Approved by: https://github.com/XilunWu	2024-07-18 16:12:59 +00:00
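The three `noise_level` settings described in the commit above amount to a simple filter over the traced operations. The following is a minimal sketch under stated assumptions, not the `CommDebugMode` implementation: `TRIVIAL_OPS` and `visible_ops` are hypothetical names, and the trivial set holds only the one operation the commit message names.

```python
# Hypothetical set of "trivial" operations; the commit names aten.detach.default
# as an example, the real set is larger.
TRIVIAL_OPS = {"aten.detach.default"}


def visible_ops(ops, noise_level):
    """Return the operations that would be printed at a given noise level."""
    if noise_level <= 1:
        # Level 1: module-level collective counts only, no per-op output.
        return []
    if noise_level == 2:
        # Level 2: everything except the trivial operations.
        return [op for op in ops if op not in TRIVIAL_OPS]
    # Level 3: all operations.
    return list(ops)
```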
Li-Huai (Allan) Lin	8ea03372a1	[MPS] Store philox counter as part of the RNG state (#130662 ) Fixes #130613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662 Approved by: https://github.com/malfet	2024-07-18 15:57:28 +00:00
cyy	7c90a82970	[Reland] [5/N] Change static functions in headers to inline (#131010 ) Reland of #130673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010 Approved by: https://github.com/Skylion007	2024-07-18 15:53:48 +00:00
PyTorch MergeBot	d6ae8bbf16	Revert "[export] Add print_readable to unflattener (#128617 )" This reverts commit 9fee87e4cd9efb55ee5427a8e6b3c57de7c599f9. Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9984688318/job/27595182606 `433ef4e444` Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2236867975))	2024-07-18 15:31:51 +00:00
PyTorch MergeBot	120fdf7ee2	Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 )" This reverts commit e98135d1ad2f999fec649ecd21b35f3d5676be43. Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/zou3519 due to broke trunk tests, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2236790805))	2024-07-18 14:58:25 +00:00
rzou	5a90ed3523	Reinplacing should ignore copy_ nodes where the mutated arg is not read (#130866 ) Might fix #127660, need to test some more cases. We update the reinplacing pass. If we have something like the following, where "sin" is a custom op (this situation should also apply to triton kernels)
```py
def graph(x):
    y = sin(x)
    z = sin(y)
    x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    z = sin(x_clone)
    x.copy_(z)

"""step 2: reinplaces the second sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    sin_out(x_clone, out=x_clone)
    x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace the first sin into the following:
```py
def graph(x):
    sin_out(x, out=x)
    z = sin(x)
    x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node doesn't actually use the original value of x!) This PR updates the reinplacing pass to ignore copy_ in its computation of if the original value of the mutated argument is still needed. NB: this also applies to triton kernels, but it was easier for me to reason about custom ops (and my repros were all for custom ops). Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866 Approved by: https://github.com/oulgen	2024-07-18 13:47:54 +00:00
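The rule in the commit above (a `copy_` into `x` writes `x` but does not read its original value) can be checked on a toy graph. This is a minimal sketch, not the Inductor reinplacing pass: the tuple-based node format and the names `reads` and `can_reinplace` are assumptions made for illustration.

```python
from typing import List, Tuple

# Each node is (output_name, op_name, input_names). For "copy_", the first
# input is the destination (overwritten) and the rest are sources (read).
Node = Tuple[str, str, Tuple[str, ...]]


def reads(node: Node, name: str) -> bool:
    """True if `node` actually reads the current value of `name`."""
    _, op, inputs = node
    if op == "copy_":
        # copy_(dst, src) only writes dst, so dst does not count as a read.
        return name in inputs[1:]
    return name in inputs


def can_reinplace(nodes: List[Node], index: int, arg: str) -> bool:
    """True if no node after `index` still needs the original value of `arg`."""
    return not any(reads(node, arg) for node in nodes[index + 1:])


# The graph from the commit message: y = sin(x); z = sin(y); x.copy_(z)
graph = [
    ("y", "sin", ("x",)),
    ("z", "sin", ("y",)),
    ("", "copy_", ("x", "z")),
]
```

Here `can_reinplace(graph, 0, "x")` holds because the only later mention of `x` is as the destination of `copy_`; adding any later node that genuinely reads `x` would make it fail.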
drisspg	dd39dca034	Removing some cruff and updating signatures for consistency (#130871 ) # Summary - This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little - Fixes a bug with indexing on block mask - Adds some doc strings to helper funcs and fixes some misc typing things - Forces functions passed to `create_block_mask` to mask_mods and updates tests files Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871 Approved by: https://github.com/joydddd, https://github.com/Chillee	2024-07-18 13:32:11 +00:00
PyTorch MergeBot	9f6db5d0e2	Revert "Ensure staticmethods can be allowed in graph (#130882 )" This reverts commit b0387449db41c90fb4226baea97a8d889a0951c4. Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/atalman due to failing torchrec tests internally, please fix and reland ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2236528473))	2024-07-18 13:31:30 +00:00
redwrasse	63a0a65df9	Define 'zero-preserving unary functions' in docs (#130804 ) Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804 Approved by: https://github.com/soulitzer	2024-07-18 13:30:29 +00:00
eqy	1b07d42171	Add @syed-ahmed to CUDA `CODEOWNERS` paths (#130971 ) CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/130971 Approved by: https://github.com/soulitzer	2024-07-18 11:55:10 +00:00
wizzniu	c986aeea2d	Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 ) This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to the implementations of CUDA or other specific backends. Note: New backends that want to implement and use pin memory just need to inherit AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods. Additional context: To avoid breaking BC, this PR preserves the `device` arg of related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In the future, the `device` arg will actually be removed. Relates #124908 Relates #14560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376 Approved by: https://github.com/albanD	2024-07-18 11:54:14 +00:00
Syed Tousif Ahmed	38b7d89aa4	Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (#130472 ) We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory). Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; however, in the multiple-allocator use case, the DataPtr would use the deleter of whichever allocator `current_custom_allocator` currently points to. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and if `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc` - when cleaning up the allocation 1, we'd be using `ncclMemFree` instead of `cudaFree`. In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter. CC: @zdevito @ptrblck @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-18 11:33:21 +00:00
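The idea of remembering the free function per allocation, rather than looking it up through a global "current allocator", can be illustrated with a closure. This is a Python sketch of the concept only; the real change is in C++, and `DataPtr`, `make_allocator`, and the log of freed addresses are hypothetical stand-ins, with the free functions named after those mentioned in the commit.

```python
from typing import Callable, List, Tuple


class DataPtr:
    """Toy data pointer that remembers its own free function (the context)."""

    def __init__(self, addr: int, free_fn: Callable[[int], None]):
        self.addr = addr
        self._free_fn = free_fn  # captured at allocation time

    def release(self) -> None:
        # No global allocator lookup: the deleter travels with the pointer.
        self._free_fn(self.addr)


def make_allocator(free_name: str, log: List[Tuple[str, int]]):
    """Build a toy allocator whose allocations record which free was used."""

    def free_fn(addr: int) -> None:
        log.append((free_name, addr))

    def alloc(addr: int) -> DataPtr:
        return DataPtr(addr, free_fn)

    return alloc


freed: List[Tuple[str, int]] = []
cuda_alloc = make_allocator("cudaFree", freed)
nccl_alloc = make_allocator("ncclMemFree", freed)

first = cuda_alloc(1)   # allocation 1 from the "cudaMalloc" allocator
second = nccl_alloc(2)  # allocation 2 from the "ncclMemAlloc" allocator
first.release()         # still freed with cudaFree, whoever allocated last
second.release()
```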
jananisriram	28a74b9fa4	[NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425 ) Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`. Functions modified in `caffe2/torch/nested/_internal/ops.py`: - `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, , M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `` dimension (`dim == 1`) to a `(B, M)` output tensor. - `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible. Functions modified in `caffe2/test/test_nestedtensor.py`: - `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. 
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor. Notes: - This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former. - The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this! - I modified some of the comments in the `sum` function as well as the unit tests for more clarity. Test Plan: Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command: ``` buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList ``` Differential Revision: D59571209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425 Approved by: https://github.com/davidberard98	2024-07-18 10:48:18 +00:00
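The pad-then-reduce approach described in the commit above can be sketched with plain Python lists. This is an illustration of the technique only and does not call `torch.ops.aten._jagged_to_padded_dense_forward`; `jagged_to_padded` and `sum_ragged_dim` are hypothetical helpers, and the trailing dimension `m` is passed explicitly as a simplifying assumption.

```python
def jagged_to_padded(nested, m, pad=0.0):
    """Pad B variable-length sequences of M-vectors to shape (B, max_len, M)."""
    max_len = max((len(seq) for seq in nested), default=0)
    return [
        list(seq) + [[pad] * m for _ in range(max_len - len(seq))]
        for seq in nested
    ]


def sum_ragged_dim(nested, m):
    """Sum over the ragged dimension (dim=1), giving a dense (B, M) result."""
    padded = jagged_to_padded(nested, m)
    # Zero padding is the identity for sum, so padded rows do not change it.
    return [[sum(row[j] for row in seq) for j in range(m)] for seq in padded]


# Logical shape (B=2, ragged, M=2): the sequences have lengths 2 and 1.
nested = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
result = sum_ragged_dim(nested, m=2)
```

Padding with zeros works for `sum` specifically; a reduction with a different identity element would need a different pad value.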
IvanKobzarev	e98135d1ad	[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 ) Reland of: https://github.com/pytorch/pytorch/pull/128016 Summary from previous PR: We assume only two possible mutually exclusive scenarios: (1) Running compiled region for training (any of the inputs has requires_grad). Produced differentiable outputs should have requires_grad. (2) Running compiled region for inference (none of the inputs has requires_grad). All outputs do not have requires_grad. Even if the user runs the region under no_grad(), but has an input Tensor with requires_grad - we go with the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only that any of the inputs requires_grad 2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under `with torch.enable_grad()` Changes in partitioner? Inference and training graphs differed in their return container, list/tuple. The changes in the partitioner are done to unify and always return a tuple. As a result - some changes in test_aotdispatch.py for graph contents list -> tuple. Why was it reverted? There was a regression of the hf_Reformer model on inference. ``` TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode ``` Because one of the compiled graphs contained outputs which are aliases to the inputs that are nn.Parameter(requires_grad=True). Even if the inference benchmark (torchbench) runs inside `with torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad. As a result we started compiling a training graph instead of an inference graph. Fix for view ops: If we have outputs that are aliases to inputs that require grad, those outputs requiring grad is not a reason to generate a training graph. This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890 Approved by: https://github.com/bdhirsh	2024-07-18 08:27:53 +00:00
Michael Lazos	cf3f4285a8	Add recursive metadata guard test (#131002 ) Ensures that nested tensors subclasses are guarded properly. It turns out this case is already handled [here](`d77af49380/torch/_dynamo/variables/builder.py (L1496)`) which will recursively wrap inner tensors adding metadata guards for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002 Approved by: https://github.com/bdhirsh	2024-07-18 08:24:43 +00:00
Xuehai Pan	134bc4fc34	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 07:49:19 +00:00
Andrii Grynenko	dfc3347c4a	[pytorch][counters] Make WaitCounter backend pluggable (#130934 ) Summary: This diff introduces a much more flexible model for WaitCounter backend: 1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time 2. Multiple backends are supported at the same time. Test Plan: unit test Reviewed By: jamesperng Differential Revision: D59795863 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934 Approved by: https://github.com/asiab4	2024-07-18 07:23:55 +00:00
PyTorch MergeBot	b732b52f1e	Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 )" This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d. Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))	2024-07-18 06:39:58 +00:00
angelayi	6c2c8ee15b	[export] Remove preserved ops from decomp list (#130970 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1466016147369925/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130970 Approved by: https://github.com/bdhirsh	2024-07-18 05:15:22 +00:00
Xuehai Pan	aecc746fcc	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 05:13:41 +00:00
Xuehai Pan	740fb22966	[BE][Easy][4/19] enforce style for empty lines in import segments in `functorch/` (#129755 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129755 Approved by: https://github.com/zou3519 ghstack dependencies: #129752	2024-07-18 05:08:03 +00:00
Animesh Jain	a085acd7d6	[dynamo] Revert back changes to UnspecializedBuiltinNNModuleVariable (#130991 ) xref - https://fb.workplace.com/groups/1075192433118967/permalink/1466525440652329/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130991 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-07-18 05:01:46 +00:00
Sam Larsen	9f392f8294	Use inductor TestCase for test_replicate_with_compiler.py (#129494 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-18 03:08:32 +00:00
PyTorch MergeBot	433ef4e444	Revert "[FX][export] DCE pass, check schema for node impurity (#130395 )" This reverts commit e22b0acc766db4a853fe8fd73e919b4adf0e3148. Reverted https://github.com/pytorch/pytorch/pull/130395 on behalf of https://github.com/yushangdi due to breaking tests, need to rebase and fix ([comment](https://github.com/pytorch/pytorch/pull/130395#issuecomment-2235192986))	2024-07-18 02:46:03 +00:00
Aidyn-A	bd56bcf0ab	[TEST] Fix _scaled_mm tests (#130897 ) This PR resolves several sets of `_scaled_mm` test failures: - `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them - `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897 Approved by: https://github.com/drisspg	2024-07-18 02:15:00 +00:00
angelayi	9fee87e4cd	[export] Add print_readable to unflattener (#128617 ) Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](`17b45e905a/torch/fx/graph_module.py (L824)`)), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module. Example print from `python test/export/test_unflatten.py -k test_unflatten_nested` ``` class UnflattenedModule(torch.nn.Module): def forward(self, x: "f32[2, 3]"): # No stacktrace found for following nodes rootparam: "f32[2, 3]" = self.rootparam # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None # No stacktrace found for following nodes foo: "f32[2, 3]" = self.foo(mul); mul = None bar: "f32[2, 3]" = self.bar(foo); foo = None return (bar,) class foo(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # No stacktrace found for following nodes child1param: "f32[2, 3]" = self.child1param nested: "f32[2, 3]" = self.nested(mul); mul = None # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None return add class nested(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None return div class bar(torch.nn.Module): def forward(self, add: "f32[2, 3]"): # No stacktrace found for following nodes child2buffer: "f32[2, 3]" = self.child2buffer # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None return sub ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/128617 Approved by: https://github.com/zhxchen17, https://github.com/pianpwk	2024-07-18 01:36:01 +00:00
cyy	a0ae77b25b	Simpilfy cub::unique_by_key code (#130907 ) It removed an unused parameter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130907 Approved by: https://github.com/ezyang	2024-07-18 01:12:00 +00:00
Alnis Murtovi	d818c3319f	Autoheuristic: add config options for specifying optimizations to collect data for and use heuristics (#130245 ) Previously, it was only possible to collect data or use a heuristic regardless of where autoheuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245 Approved by: https://github.com/shunting314	2024-07-18 01:04:36 +00:00
Edward Z. Yang	051971ab32	Reorder MIOpen conditions so getCUDAHooks only called when CUDA input (#130867 ) See post for more details: [fb.workplace.com/groups/1405155842844877/permalink/8719141948112860](https://fb.workplace.com/groups/1405155842844877/permalink/8719141948112860/) Function getCUDAHooks() returns a reference to an object without checking if the object is null. In the AutoMOS QE, which runs a ML model in Messenger Android, we are getting native crashes because of this reason: [internalfb.com/code/fbsource/[b7f8e18320f9d5d8347c3428c67301f20c3c81d2]/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504](https://www.internalfb.com/code/fbsource/%5Bb7f8e18320f9d5d8347c3428c67301f20c3c81d2%5D/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504), crash [fburl.com/logview/xi4w7jk4](https://fburl.com/logview/xi4w7jk4) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130867 Approved by: https://github.com/albanD	2024-07-18 00:59:33 +00:00
Shangdi Yu	e22b0acc76	[FX][export] DCE pass, check schema for node impurity (#130395 ) Change the default DCE pass to check node schema for impure nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395 Approved by: https://github.com/angelayi, https://github.com/jgong5	2024-07-18 00:55:20 +00:00
cyy	73d0f484b3	[structural binding][11/N] Replace std::tie with structural binding (#130830 ) Follows #130784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130830 Approved by: https://github.com/janeyx99	2024-07-18 00:45:06 +00:00
eellison	e14d1d10ef	Unwrap Identity in prepare indexing (#130967 ) We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](`752c817898/torch/_inductor/codegen/triton.py (L1554)`) which would do wrapping with tl.full. I couldn't generate a minimal repro to add as a test but I have a repro you can check here: P1483831261 There is already a test that we don't expand the int32 intermediates to int64. Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-07-18 00:43:53 +00:00
Will Feng	d77af49380	[Traceable FSDP2] Preserve fsdp.set_ op through lowering; Add unit test for multiple .set_ into same primal; Add unit test for FSDP2 module layer reuse (#130786 ) Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_fullgraph_backend_inductor` - `pytest -rA test/functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_fsdp_set__into_same_input` - `PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py -k TestAOTAutogradWithCache.test_input_mutation_fsdp_set__into_same_input` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130786 Approved by: https://github.com/bdhirsh ghstack dependencies: #129773	2024-07-17 23:25:42 +00:00
Will Feng	fc3dbcd1c3	[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 ) FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead. This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op). One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes. --- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor` --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773 Approved by: https://github.com/eellison	2024-07-17 22:51:20 +00:00
Oguz Ulgen	442bfa7fc4	Fix mypy error (#130992 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130992 Approved by: https://github.com/izaitsevfb	2024-07-17 22:49:23 +00:00
Oguz Ulgen	a0da1265c5	Define key in codecache (#130979 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules' ``` Differential Revision: D59875657 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979 Approved by: https://github.com/jamesjwu	2024-07-17 22:44:50 +00:00
Andrew Gu	31e3330040	[Reland][FSDP2] Allowed `List[nn.Module]` as arg (#130949 ) This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication. Approach At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node. To implement the runtime schedule, we define new forward hooks that run based on the following semantics: - If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op. - If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op. - First and last are determined by scoreboarding against a set of the modules. - This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward. Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`. Examples This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382. If at least one of the modules in the list does not run forward before backward, then there will be a warning message like: ``` 1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. 
Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)] ``` --- Changes for reland: none since breakage was from PR below Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949 Approved by: https://github.com/weifengpy ghstack dependencies: #130947	2024-07-17 22:40:14 +00:00
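The "first pre-hook, last post-hook" scoreboarding semantics from the commit above can be sketched without any FSDP machinery. This is a simplified model, not the FSDP2 implementation: `GroupHooks` and its method names are hypothetical, and modules are represented by strings.

```python
class GroupHooks:
    """Run a group's pre-hook once (first module) and post-hook once (last)."""

    def __init__(self, modules, pre_hook, post_hook):
        self.modules = set(modules)
        self.pre_hook = pre_hook
        self.post_hook = post_hook
        self.pre_seen = set()   # scoreboard of modules that ran pre-forward
        self.post_seen = set()  # scoreboard of modules that ran post-forward

    def on_pre_forward(self, module):
        is_first = not self.pre_seen
        self.pre_seen.add(module)
        if is_first:
            self.pre_hook()  # only the first module actually runs it

    def on_post_forward(self, module):
        self.post_seen.add(module)
        if self.post_seen == self.modules:
            self.post_hook()  # only the last module actually runs it
            self.reset()

    def reset(self):
        # Also called at the end of backward, so that a module that never ran
        # forward does not leave stale state for the next iteration.
        self.pre_seen.clear()
        self.post_seen.clear()


events = []
hooks = GroupHooks(
    ["norm", "output"],
    pre_hook=lambda: events.append("pre"),
    post_hook=lambda: events.append("post"),
)
for name in ["norm", "output"]:
    hooks.on_pre_forward(name)
    hooks.on_post_forward(name)
```

After the loop, `events` holds one `"pre"` and one `"post"` even though both modules fired their hooks.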
Andrew Gu	ff7e021e94	[Reland][PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 ) (#130947 ) This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`. --- Changes for reland: - The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs). - However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check). - This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`. Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947 Approved by: https://github.com/weifengpy, https://github.com/atalman	2024-07-17 22:40:13 +00:00
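The relaxed `@contract` check described in the commit above (same number of modules in and out, FQNs preserved coordinate-wise) can be written as a small validation function. This is a sketch under stated assumptions rather than the decorator's code: `check_fqns_preserved` is a hypothetical name, and each module is represented only by the list of its fully qualified names.

```python
from typing import List, Sequence


def check_fqns_preserved(
    input_fqns: Sequence[List[str]], output_fqns: Sequence[List[str]]
) -> None:
    """Require N == M and that xi and yi have the same FQNs for every i."""
    if len(input_fqns) != len(output_fqns):
        raise ValueError(
            f"expected {len(input_fqns)} output modules, got {len(output_fqns)}"
        )
    for i, (x, y) in enumerate(zip(input_fqns, output_fqns)):
        if sorted(x) != sorted(y):
            raise ValueError(f"FQNs differ for module {i}: {x} vs {y}")


# The output modules may be different objects, as long as FQNs line up.
check_fqns_preserved(
    [["weight", "bias"], ["weight"]],
    [["weight", "bias"], ["weight"]],
)
```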
Boyuan Feng	90105a4f3e	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 21:59:52 +00:00
PyTorch MergeBot	874bbc53c9	Revert "Define key in codecache (#130979 )" This reverts commit 4112f687831fb6f3554ff659a0be45909a1b4639. Reverted https://github.com/pytorch/pytorch/pull/130979 on behalf of https://github.com/clee2000 due to broke lint on torch/_inductor/codecache.py https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 `f0faecd291` ([comment](https://github.com/pytorch/pytorch/pull/130979#issuecomment-2234392332))	2024-07-17 21:59:19 +00:00
Isuru Fernando	43a6d20883	Add decomposition for reflection_pad{1,2,3}d_backward (#130299 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130299 Approved by: https://github.com/lezcano ghstack dependencies: #130130	2024-07-17 21:56:00 +00:00
PyTorch MergeBot	0eb43ed189	Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 )" This reverts commit f0faecd2915d73e56917922cc995237cef064e50. Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint, but for torch/_inductor/codecache.py this time https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 `f0faecd291` ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234387254))	2024-07-17 21:55:48 +00:00
Nikita Shulga	ebdfc7e37d	[BE] Rename `ISORT_WHITELIST` to `ISORT_SKIPLIST` (#130987 ) To better represent what this list is doing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130987 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi	2024-07-17 21:52:56 +00:00
Jeff Daily	df5919393c	[ROCm] std::clamp work-around for hip-clang compiler (#127812 ) Fixes #127666. Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA using std::clamp and the ROCm replacement broke Windows builds. The replacement generates the same PTX as std::clamp, so the replacement is used unconditionally. See https://godbolt.org/z/Wde9KW3v4 for a sample. Original patch comes from @lamikr. Modified to improve efficiency. https://github.com/lamikr/rocm_sdk_builder/pull/37 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-17 21:31:17 +00:00
Boyuan Feng	f0faecd291	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 21:27:45 +00:00
Oguz Ulgen	4112f68783	Define key in codecache (#130979 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules' ``` Differential Revision: D59875657 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979 Approved by: https://github.com/jamesjwu	2024-07-17 21:19:13 +00:00
PyTorch MergeBot	0b134c15cd	Revert "Relax constraints for creating a `GenericContextWrappingVariable` (#129091 )" This reverts commit 882fd9186924b4632fba65033717d97d15ad3339. Reverted https://github.com/pytorch/pytorch/pull/129091 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 `a8bd2933d9` ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))	2024-07-17 20:59:40 +00:00
PyTorch MergeBot	c49f909aab	Revert "wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490 )" This reverts commit a8bd2933d9eaf24ec9582001efa844de499d9e93. Reverted https://github.com/pytorch/pytorch/pull/130490 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 `a8bd2933d9` ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))	2024-07-17 20:59:40 +00:00
Animesh Jain	65b4163bd2	[dynamo][nn-module] Make slice getitem on nn module container sourceless (#130852 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130852 Approved by: https://github.com/mlazos ghstack dependencies: #130773	2024-07-17 20:17:08 +00:00
Guilherme Leobas	a8bd2933d9	wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #129091	2024-07-17 20:07:06 +00:00
Guilherme Leobas	882fd91869	Relax constraints for creating a `GenericContextWrappingVariable` (#129091 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2024-07-17 20:07:06 +00:00
PyTorch MergeBot	41f5d5dcaf	Revert "[inductor] adapte windows file path (#130713 )" This reverts commit e51e971a8675826e517a78bf2a97f8e2df5f4abd. Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to sorry but I think it's still failing, this time on windows CUDA https://github.com/pytorch/pytorch/actions/runs/9971126834/job/27552761451 `bb62e9d7c3`. It was not run on PR due to being on the periodic workflow, which isn't usually run on PRs due to capacity issues for windows CUDA machines. I will add ciflow/periodic to the PR to ensure the test gets run ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2234092078))	2024-07-17 19:37:16 +00:00
PyTorch MergeBot	1bf4a44b33	Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 )" This reverts commit ef0511245a92bae7057c195dcae2efc237b96f16. Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint for test/export/test_converter.py https://github.com/pytorch/pytorch/actions/runs/9979009143/job/27577181982 `ef0511245a`. Probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234067407))	2024-07-17 19:21:52 +00:00
Michael Lazos	b0387449db	Ensure staticmethods can be allowed in graph (#130882 ) Fixes https://github.com/pytorch/pytorch/issues/124735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882 Approved by: https://github.com/anijain2305, https://github.com/williamwen42	2024-07-17 19:18:30 +00:00
Michael Lazos	e4f9d01cd9	Add test for dataclass field accesses (#130848 ) Fixes https://github.com/pytorch/pytorch/issues/120108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130848 Approved by: https://github.com/williamwen42, https://github.com/anijain2305	2024-07-17 19:14:23 +00:00
Michael Lazos	470f07c840	Add guard override capability for tensor subclass metadata (#130780 ) Fixes https://github.com/pytorch/pytorch/issues/114405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130780 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh ghstack dependencies: #130779	2024-07-17 19:13:53 +00:00
Michael Lazos	bea6762c01	Add guards on subclass metadata (#130779 ) This PR adds guards in dynamo which verify the equality of tensor subclass metadata along with tests verifying the expected recompile behavior. The next PR adds the capability to override the guard behavior to possibly perform the check in a less expensive manner. Toward fixing https://github.com/pytorch/pytorch/issues/114405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130779 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2024-07-17 19:13:52 +00:00
Bin Bao	752c817898	[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796 ) Summary: Unify the argument codegen logic between python wrapper and cpp wrapper. Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796 Approved by: https://github.com/oulgen	2024-07-17 18:37:23 +00:00
chilli	efefea52e0	renamed inductor kernel args in flexattention properly (#130869 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130869 Approved by: https://github.com/drisspg, https://github.com/joydddd ghstack dependencies: #130809, #130818	2024-07-17 18:36:03 +00:00
chilli	480a5bd881	Renamed mask_fn to mask_mod (#130818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818 Approved by: https://github.com/drisspg ghstack dependencies: #130809	2024-07-17 18:36:03 +00:00
Pian Pawakapan	d96c80649f	[export] constants & non-persistent buffers for training IR (#130864 ) Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR. Test Plan: test_export Differential Revision: D59820909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864 Approved by: https://github.com/angelayi	2024-07-17 18:27:53 +00:00
Boyuan Feng	ef0511245a	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 17:48:36 +00:00
Catherine Lee	d552e5c3d5	Fix ciflow/nightly triggering commit hash update workflow (#130570 ) Move the if statement to be higher so people don't get the below ![image](https://github.com/user-attachments/assets/e9be7d7c-6400-4f80-880f-d58dcb4b5495) like https://togithub.com/pytorch/pytorch/pull/130465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130570 Approved by: https://github.com/ZainRizvi	2024-07-17 17:13:50 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Oguz Ulgen	1e13cb2f28	Log cache state to structured logs (#130845 ) https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpRm4MaD/0_0_0/fx_graph_cache_hash_4.json Differential Revision: D59795574 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130845 Approved by: https://github.com/jamesjwu	2024-07-17 16:45:45 +00:00
lezcano	af0b5ee924	Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 ) We don't need to generate so many samples for these very expensive ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199 Approved by: https://github.com/peterbell10, https://github.com/zou3519	2024-07-17 16:29:36 +00:00
Sam Larsen	6e916f112f	[inductor] skip fx remote cache for 2 tests in test_metrics.py (#130853 ) Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here. Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10` Differential Revision: D59825899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853 Approved by: https://github.com/oulgen	2024-07-17 16:17:43 +00:00
fduwjj	1fb572289b	[BE][c10d] Add a warning messages in the comment about cuda hang (#130844 ) Add comments to warn users of a potential hang in the CUDA event query in NCCLPG. Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130844 Approved by: https://github.com/wconstab	2024-07-17 15:51:19 +00:00
Isuru Fernando	b7d2abd766	Fix vectorized ops.masked (#130130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-07-17 14:55:11 +00:00
Xuehai Pan	b29b23137c	[Easy] Fix argument name collision in dispatched functions (#129562 ) Use positional-only argument to avoid naming collision with aten ops arguments that are named "self". ```python In [1]: def foo(self, *args, **kwargs): ...: print(self, args, kwargs) ...: In [2]: def bar(self, /, *args, **kwargs): ...: print(self, args, kwargs) ...: In [3]: foo(1, 2, self=3) TypeError: foo() got multiple values for argument 'self' In [4]: bar(1, 2, self=3) 1 (2,) {'self': 3} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562 Approved by: https://github.com/zou3519, https://github.com/fegin	2024-07-17 14:39:56 +00:00
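The collision described in the commit above can be reproduced as a standalone script. This is a minimal sketch: `dispatch_colliding` and `dispatch_positional_only` are hypothetical names chosen for illustration and are not functions from the PyTorch code base.

```python
def dispatch_colliding(self, *args, **kwargs):
    # `self` can be bound by keyword, so a keyword argument named "self"
    # collides with the first positional argument.
    return self, args, kwargs


def dispatch_positional_only(self, /, *args, **kwargs):
    # The `/` marker (Python 3.8+) makes `self` positional-only, so a keyword
    # named "self" lands in **kwargs instead.
    return self, args, kwargs


try:
    dispatch_colliding(1, 2, self=3)
except TypeError as exc:
    print("collision:", exc)

result = dispatch_positional_only(1, 2, self=3)
print(result)
```

The second call succeeds because the keyword `self=3` is forwarded untouched in `kwargs`, which is what a dispatcher needs when the wrapped op itself has an argument named `self`.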
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Yutao Xu	32995dec28	Add support for XPU accumulate type (#128579 ) Provide an accumulate type interface specifically for XPU, similar to what was done for MPS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128579 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-07-17 14:33:53 +00:00
Xuehai Pan	76169cf691	[BE][Easy][9/19] enforce style for empty lines in import segments in `test/[e-h]*/` (#129760 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129760 Approved by: https://github.com/ezyang	2024-07-17 14:25:29 +00:00
angelayi	cbf274d4a7	[aoti] Add packaging solution (#129895 ) In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python. `compile_so` takes the path to the package, a device, and a desired so_path location, compiles the package into a .so, and saves it to the specified location. `load_package` takes a path to the package and device, calls _extract_so, and then creates a callable to run the compiled model. The zipfile generated looks like the following: ``` |- version |- archive_format |- data |- aotinductor |- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin # AOTI cuda generated cubin files |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp # AOTI generated cpp file |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags # Flags for compiling the .o |- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o # AOTI saved const.o |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags # Flags for linking the files to form the .so |- constants |- constants.pt # Constants saved using torch.save, can be loaded using mmap ``` The workflow is something like: ``` with torch.no_grad(): ep = torch.export.export( model, example_inputs, dynamic_shapes=dynamic_shapes, strict=False, ) gm = ep.module() package_path = torch._inductor.aot_compile( gm, example_inputs, options= { "aot_inductor.output_path": "my_path.pt2", # or a directory "aot_inductor.package": True, } ) compiled_model = torch._inductor.package.load_package(package_path, device) return compiled_model ``` I tried turning on loading the weights using mmap by default, but had some trouble with it, so that is just left as a todo Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895 Approved by: https://github.com/malfet	2024-07-17 13:56:58 +00:00
PyTorch MergeBot	94a910b43b	Revert "Renamed mask_fn to mask_mod (#130818 )" This reverts commit 1a97bcf93b2ac98505ef6ff011ccb3565e456596. Reverted https://github.com/pytorch/pytorch/pull/130818 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/130818#issuecomment-2233367318))	2024-07-17 13:47:08 +00:00
PyTorch MergeBot	d027aef8f8	Revert "Removed q_num_blocks from constructor (#130819 )" This reverts commit 03c660468eb57772e82c1034613f5ff8781c775a. Reverted https://github.com/pytorch/pytorch/pull/130819 on behalf of https://github.com/atalman due to Internal problem with previous PR in stack https://github.com/pytorch/pytorch/pull/130818 ([comment](https://github.com/pytorch/pytorch/pull/130819#issuecomment-2233359569))	2024-07-17 13:43:35 +00:00
Alnis Murtovi	4b7ff35622	Fix flex_attention import in score_mod (#130906 ) torch.nn.attention._flex_attention has been renamed to torch.nn.attention.flex_attention, so the import does not work currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130906 Approved by: https://github.com/Chillee	2024-07-17 13:37:08 +00:00
PyTorch MergeBot	e1b2d8f975	Revert "[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 )" This reverts commit de177b50f89e45a57ac056ee64a64d7775b450ff. Reverted https://github.com/pytorch/pytorch/pull/130482 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/130482#issuecomment-2233309217))	2024-07-17 13:21:50 +00:00
xinan.lin	d3a11a0198	[Inductor] Handle device_put op in constant folding. (#130824 ) Fix #130823 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130824 Approved by: https://github.com/eellison, https://github.com/EikanWang ghstack dependencies: #130817	2024-07-17 10:13:36 +00:00
xinan.lin	2af2d26562	[Inductor UT] Generalize device-bias code in test_triton_kernels.py and test_torchinductor.py (#130817 ) [Inductor UT] Generalize newly introduced device-bias code in test_triton_kernels.py::test_add_kernel and test_torchinductor.py::test_ctr_not_moved_to_cuda_when_used_in_index_put Fix #130814 , #130838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130817 Approved by: https://github.com/zou3519	2024-07-17 10:13:36 +00:00
William Wen	2300bb2a88	[3.13, dynamo] support TO_BOOL (#130565 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130565 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460, #130461, #130564	2024-07-17 09:47:58 +00:00
William Wen	539acf7656	[3.13, dynamo] support CALL_KW (#130564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130564 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460, #130461	2024-07-17 09:47:58 +00:00
William Wen	e2365c05d7	[3.13, dynamo] fix instruction line numbers (#130461 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130461 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460	2024-07-17 09:47:58 +00:00
William Wen	82b2e7a253	[3.13, dynamo] fix CALL_FUNCTION_EX in symbolic_convert (#130460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130460 Approved by: https://github.com/jansel ghstack dependencies: #130459	2024-07-17 09:47:58 +00:00
William Wen	8c9a996091	[3.13, dynamo] support LOAD_FAST_LOAD_FAST and STORE_FAST_STORE_FAST (#130459 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130459 Approved by: https://github.com/jansel	2024-07-17 09:47:58 +00:00
Adrian Wälchli	bb62e9d7c3	Avoid autocast deprecation warning in DataParallel (#130660 ) Fixes #130659 Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130660 Approved by: https://github.com/guangyey, https://github.com/fegin, https://github.com/albanD	2024-07-17 08:32:19 +00:00
Xuehai Pan	f6838d521a	[BE][Easy][5/19] enforce style for empty lines in import segments in `tools/` and `torchgen/` (#129756 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129756 Approved by: https://github.com/ezyang	2024-07-17 06:44:35 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Xu Han	e51e971a86	[inductor] adapte windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully. The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully. ```python import torch def foo(x, y): a = torch.sin(x) b = torch.cos(x) return a + b opt_foo1 = torch.compile(foo) print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10))) ``` Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-17 06:36:11 +00:00
Andrii Grynenko	7c45476d38	[pytorch][counters] WaitCounter cleanup (#130664 ) Summary: This diff does a minor cleanup of WaitCounters: 1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process 2. Updates API to enable measuring duration of individual wait operations Test Plan: unit test Differential Revision: D59709324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664 Approved by: https://github.com/c-p-i-o, https://github.com/asiab4	2024-07-17 04:42:35 +00:00
Colin Peppler	419b8df0b6	[inductor][easy] add debug logs for inlining constants (#130799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130799 Approved by: https://github.com/chenyang78	2024-07-17 04:21:08 +00:00
Yu, Guangye	f2552dcc3d	refactor cached tensor more generic (#129359 ) # Motivation Solve https://github.com/pytorch/pytorch/issues/129027 by refactoring cached tensor to be generic. # Additional Context No API names change; this only decouples the code from the CUDA build option. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129359 Approved by: https://github.com/eqy, https://github.com/EikanWang, https://github.com/albanD	2024-07-17 03:00:08 +00:00
Yu, Guangye	c6aa03bd4e	Add allow_xpu to enable XPU UTs (#130312 ) # Motivation enable UTs under folder test/xpu/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130312 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-07-17 02:40:28 +00:00
Wang, Eikan	fc238db62a	Separate AOTI Eager utils as a single file (#125819 ) The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819 Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire	2024-07-17 02:27:11 +00:00
Aaron Gokaslan	d1c4e6b55f	[BE]: Enable a few additional ruff rules (#130700 ) Enables a few extra ruff rules, most of which do not have any violations as I already cleaned them with earlier PRs; this just turns them on to enforce them. Adds 1 noqa as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700 Approved by: https://github.com/justinchuby, https://github.com/ezyang	2024-07-17 02:06:04 +00:00
Yu, Guangye	c24c50da92	fix tensor print behavior for XPU (#130523 ) # Motivation Some XPU devices don't support the `double` data type, so we have to use `tensor.to(torch.float)` if it is an XPU tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130523 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD	2024-07-17 02:03:32 +00:00
Edward Z. Yang	aa95fb99af	On advice of James March, log pid instead of tid (#130679 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130679 Approved by: https://github.com/jmarchfb	2024-07-17 02:00:10 +00:00
Jack Taylor	e9023d57b0	[ROCm] Return correct AMDSMI socket_power metric (#130331 ) Extending on the change in https://github.com/pytorch/pytorch/pull/127729 Depending on gcnArch the API to return socket power will change based on underlying gpu_metrics. This PR will handle both cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130331 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/malfet	2024-07-17 01:58:58 +00:00
chilli	03c660468e	Removed q_num_blocks from constructor (#130819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130819 Approved by: https://github.com/drisspg ghstack dependencies: #130809, #130818	2024-07-17 01:41:20 +00:00
chilli	1a97bcf93b	Renamed mask_fn to mask_mod (#130818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818 Approved by: https://github.com/drisspg ghstack dependencies: #130809	2024-07-17 01:41:20 +00:00
chilli	6024fea0f8	Compute q_num_blocks from kv_num_blocks if q_num_blocks is not passed in (#130809 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130809 Approved by: https://github.com/drisspg	2024-07-17 01:41:15 +00:00
Tristan Rice	ef9d9be236	TCPStoreLibUvBackend: log port on error (#130797 ) Adds better error messages when a socket fails to bind in libuv. New format: ``` The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied ``` Old format: ``` The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use ``` Test plan: Added test in `test_store.py` ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797 Approved by: https://github.com/kurman	2024-07-17 01:34:15 +00:00
Sam Larsen	25cb4426d3	[inductor] Add num_matches_for_scatter_upon_const_tensor to list of cached metrics (#130843 ) Summary: test/inductor:scatter_optimization is using this counter and fails with remote caching enabled Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:scatter_optimization -- --exact 'caffe2/test/inductor:scatter_optimization - test_cross_entropy_loss (caffe2.test.inductor.test_scatter_optimization.TestScatterOpt)' --run-disabled --stress-runs 10` Differential Revision: D59817406 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130843 Approved by: https://github.com/oulgen	2024-07-17 00:41:22 +00:00
PyTorch MergeBot	8458dc8966	Revert "Use inductor TestCase for distributed tests (#129494 )" This reverts commit 3cd2ae331a5ed6839456bb0025c729a1ee50bc84. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))	2024-07-17 00:32:48 +00:00
PyTorch MergeBot	d7a8e8f7c5	Revert "[PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 )" This reverts commit b27695791e9cc4eedb1b713b1be20398bfeb911b. Reverted https://github.com/pytorch/pytorch/pull/127773 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/127773#issuecomment-2232004006))	2024-07-16 23:48:09 +00:00
Lei Wang (Server LLVM)	9a6d81b178	Fix pytorch JIT build for LLVM 18+ (#130661 ) Summary: LLVM upstream (https://github.com/llvm/llvm-project/pull/97824) changed `getHostCPUFeatures` to return a StringMap. Fix this to unblock T195389358 Test Plan: ``` buck2 build mode/opt-clang-thinlto --upload-all-actions -c unicorn.hfsort="1" -c cxx.extra_cxxflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference -ferror-limit=0" -c cxx.extra_cflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference" -c cxx.profile="fbcode//fdo/autofdo/unicorn/topaggr/top_aggregator_server:autofdo" unicorn/topaggr:top_aggregator_server ``` Differential Revision: D59708722 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130661 Approved by: https://github.com/Skylion007	2024-07-16 23:47:48 +00:00
eqy	de177b50f8	[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 ) CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482 Approved by: https://github.com/drisspg	2024-07-16 23:45:21 +00:00
PyTorch MergeBot	4f40a7078e	Revert "[FSDP2] Allowed `List[nn.Module]` as arg (#127786 )" This reverts commit d3ab8cecedd7843b8caed5946404704a18479811. Reverted https://github.com/pytorch/pytorch/pull/127786 on behalf of https://github.com/atalman due to bottom pr from the stack is failing on internal error ([comment](https://github.com/pytorch/pytorch/pull/127786#issuecomment-2231999178))	2024-07-16 23:45:17 +00:00
Michael Lazos	7919f0b952	Add buffer static input tests to cudagraph trees (#130402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402 Approved by: https://github.com/eellison ghstack dependencies: #130393	2024-07-16 22:12:38 +00:00
Michael Lazos	415d5e53ae	Propagate buffer and parameter indices through AOT (#130393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393 Approved by: https://github.com/bdhirsh	2024-07-16 22:12:38 +00:00
PyTorch MergeBot	5f3c356a56	Revert "[inductor] adapte windows file path (#130713 )" This reverts commit 69e99172450e40536bf2e6c110183d34a0e283e2. Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to broke functorch\test_eager_transforms.py on windows https://github.com/pytorch/pytorch/actions/runs/9958208725/job/27530132704 `69e9917245`. Test failure on PR is real, possibly force merged to get around lint error? ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2231901793))	2024-07-16 22:07:55 +00:00
soulitzer	2eec02523b	[autograd] Support GradientEdge as output for torch.autograd.grad (#127766 ) This is useful for splitting grad to run in two parts while preserving intermediates:

<details>
<summary> Click to see code </summary>

```python
import collections
import weakref
from typing import Deque

import torch
from torch.autograd.graph import GradientEdge


def _get_grad_fn_or_grad_acc(t):
    if t.requires_grad and t.grad_fn is None:
        return t.view_as(t).grad_fn.next_functions[0][0]
    else:
        return t.grad_fn


def reverse_closure(roots, target_nodes):
    # Recurse until we reach a target node
    closure = set()
    actual_target_nodes = set()
    q: Deque = collections.deque()
    for node in roots:
        if node is not None and node not in closure:
            closure.add(node)
            q.append(node)
    while q:
        node = q.popleft()
        reverse_edges = node.metadata.get("reverse_edges", [])
        for holder_ref, idx in reverse_edges:
            ref = holder_ref()
            # A dead weak reference means the reverse graph was already freed
            if ref is None:
                raise RuntimeError("Reverse graph is no longer alive")
            fn = ref.node
            if fn in closure or fn is None:
                continue
            if fn in target_nodes:
                actual_target_nodes.add(fn)
                continue
            closure.add(fn)
            q.append(fn)
    return closure, actual_target_nodes


# Enable weak pointer
class Holder():
    def __init__(self, node):
        self.node = node


# TODO: use weak references to avoid reference cycle
def construct_reverse_graph(roots):
    q: Deque = collections.deque()
    root_seen = set()
    reverse_graph_refs = []
    for node in roots:
        if node is not None and node not in root_seen:
            q.append(node)
            root_seen.add(node)
    while q:
        node = q.popleft()
        for fn, idx in node.next_functions:
            if fn is not None:
                # Don't necessarily need to store on the graph
                reverse_edges = fn.metadata.get("reverse_edges", [])
                if len(reverse_edges) == 0:
                    q.append(fn)
                holder = Holder(node)
                holder_ref = weakref.ref(holder)
                reverse_graph_refs.append(holder)
                reverse_edges.append((holder_ref, idx))
                fn.metadata["reverse_edges"] = reverse_edges
    return reverse_graph_refs


def get_param_groups(inputs, params):
    inputs_closure, _ = reverse_closure(inputs, set())
    param_groups = dict()  # keyed on intermediates
    for i, param in enumerate(params):
        closure, intersected = reverse_closure([param], inputs_closure)
        param_group = {
            "params": set([param]),
            "intermediates": set(intersected),
        }
        for input_node in intersected:
            existing = param_groups.get(input_node, None)
            if existing is not None:
                existing["params"] = existing["params"].union(param_group["params"])
                existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"])
                param_group = existing
            else:
                param_groups[input_node] = param_group

    # Sanity check: union of all param_groups params should be equal to all params
    union_params = set()
    seen_ids = set()
    unique_param_groups = []
    for param_group in param_groups.values():
        if id(param_group) not in seen_ids:
            seen_ids.add(id(param_group))
            unique_param_groups.append(param_group)
            union_params = union_params.union(param_group["params"])
    assert union_params == set(params)

    return unique_param_groups


def compute_grads_only_inputs2(roots, inps, weights):
    root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots))
    inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps))
    weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights))

    reverse_graph_refs = construct_reverse_graph(root_grad_fns)
    param_groups = get_param_groups(inp_grad_fns, weight_grad_fns)
    del reverse_graph_refs

    for param_group in param_groups:
        for i, intermediate in enumerate(param_group["intermediates"]):
            def get_hook(param_group, i):
                def hook(grad_inputs):
                    if param_group.get("grads", None) is None:
                        param_group["grads"] = [None] * len(param_group["intermediates"])
                    param_group["grads"][i] = grad_inputs
                return hook

            # These are always "split" nodes that we need to recompute, so
            # save their inputs.
            intermediate.register_prehook(get_hook(param_group, i))

    dinputs = torch.autograd.grad((out,), inputs=tuple(inps), grad_outputs=(torch.ones_like(out),), retain_graph=True)
    return dinputs, param_groups


def compute_grads_only_weights2(user_weights, param_groups):
    all_dweights = dict()
    for param_group in param_groups:
        # TODO: Handle case where intermediate can have multiple outputs
        intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"])
        weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"])

        assert all(len(g) == 1 for g in param_group["grads"])
        # [NEW!] Able to pass a GradientEdge to autograd.grad as output
        # We do not need to retain_graph because... guarantee no overlap?
        print("trying to execute: ", intermediate_edges, weights_edges)
        dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple()))
        for w, dw in zip(param_group["params"], dweights):
            all_dweights[w] = dw
    # return grads in the original order weights were provided in
    out = []
    for w in user_weights:
        grad_acc = _get_grad_fn_or_grad_acc(w)
        out.append(all_dweights[grad_acc])
    return tuple(out)
```

</details>

```python
import torch.nn as nn

# Setup
mod1 = nn.Linear(10, 10)
mod2 = nn.Linear(10, 10)

a = torch.rand(10, requires_grad=True)

weights = tuple(mod1.parameters()) + tuple(mod2.parameters())
inps = (a,)

out = mod2(mod1(a))


class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        rs = func(*args, **kwargs)
        print(f"{func.__module__}.{func.__name__}")
        return rs


print(" -- SPLIT -- ")
# Compute gradients in two parts
with LoggingTensorMode():
    print("PART 1")
    dinputs, state = compute_grads_only_inputs2((out,), inps, weights)
    print("PART 2")
    dweights = compute_grads_only_weights2(weights, state)

out = mod2(mod1(a))

print(" -- REF -- ")

# Compare with reference
with LoggingTensorMode():
    ref_all_gradients = torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),))

for actual, ref in zip(dinputs + dweights, ref_all_gradients):
    print(torch.allclose(actual, ref))
```

<img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671">

```
PART 1
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.ones_like.default
V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
PART 2
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0))
V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0))
V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766 Approved by: https://github.com/albanD	2024-07-16 21:46:19 +00:00
PyTorch MergeBot	c1e7e40f24	Revert "[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 )" This reverts commit f2f31027ce8dc3985663bf6eaa66f3c5559b724a. Reverted https://github.com/pytorch/pytorch/pull/129773 on behalf of https://github.com/clee2000 due to failed inductor/test_torchinductor_dynamic_shapes.py on mac https://github.com/pytorch/pytorch/actions/runs/9963396991/job/27530249256 `f2f31027ce`. The build failed on PR so test jobs didn't run ([comment](https://github.com/pytorch/pytorch/pull/129773#issuecomment-2231808437))	2024-07-16 20:54:14 +00:00
Atul Jangra	4e479568df	[PT2] Log compile ID in the signpost event (#130801 ) Summary: We should log the compile ID as well for easier comparison. Having gone through some of this data, I think we should make a few more changes as well. Reland for D59725870 Test Plan: Sandcastle and Pytorch Differential Revision: D59789110 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801 Approved by: https://github.com/oulgen	2024-07-16 20:47:36 +00:00
Yifu Wang	2ceade37c5	[SymmetricMemory] put socket files in /tmp (#130757 ) Currently the socket files are put in the current directory, which may not be writable in all environments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130757 Approved by: https://github.com/Chillee ghstack dependencies: #130756	2024-07-16 20:21:05 +00:00
Yifu Wang	0468f2616a	[SymmetricMemory] make sure different subgroups with the same name use different store prefixes (#130756 ) This fixes a race condition in which different subgroups with the same name on the same host would use the same store. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130756 Approved by: https://github.com/Chillee	2024-07-16 20:21:05 +00:00
Will Feng	f2f31027ce	[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 ) FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead. This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op). One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes. --- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor` --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773 Approved by: https://github.com/eellison	2024-07-16 20:07:41 +00:00
Sam Larsen	156b99cfb1	[inductor] Handle inductor counters in fx graph cache (#130635 ) Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit Test Plan: new unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635 Approved by: https://github.com/eellison	2024-07-16 20:07:16 +00:00
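Note on #130635: the idea of saving counter deltas in a cache entry and replaying them on a hit can be sketched in plain Python. This is only an illustration under stated assumptions, not inductor's implementation: `counters`, `cache`, `compile_graph`, `cached_compile`, and the counter names are all hypothetical stand-ins.

```python
from collections import Counter

counters = Counter()  # hypothetical stand-in for a global counter table
cache = {}            # hypothetical stand-in for the FX graph cache


def compile_graph(key):
    # Pretend that compiling bumps some counters as a side effect.
    counters["pattern_matcher_count"] += 3
    counters["fusions"] += 1
    return f"compiled:{key}"


def cached_compile(key):
    if key in cache:
        result, deltas = cache[key]
        counters.update(deltas)  # replay the saved deltas on a cache hit
        return result
    before = Counter(counters)
    result = compile_graph(key)
    deltas = Counter(counters)
    deltas.subtract(before)          # counters changed by this compile
    cache[key] = (result, +deltas)   # unary + drops zero and negative entries
    return result


cached_compile("g")
after_miss = dict(counters)
cached_compile("g")
after_hit = dict(counters)
print(after_miss, after_hit)
```

With the replay step, the counters after a hit look the same as if the graph had been compiled twice, which is what code that inspects the counters would expect.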
David Berard	d548417d95	[NJT] throw an exception if nested_tensor_from_jagged is fx-traced without being fx.wrapped (#130702 ) The NJT constructor can't be fx-traced safely due to the dummy nt used: `774ca93fd2/torch/nested/_internal/nested_tensor.py (L501-L508)` The error doesn't appear immediately, but appears if you try to move a module with an fx-traced NJT constructor onto a different device, or try to serialize it. Let's throw an error if we try to fx-trace the NJT constructor so users know to wrap the call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130702 Approved by: https://github.com/jbschlosser, https://github.com/soulitzer	2024-07-16 19:21:10 +00:00
PyTorch MergeBot	0851de5b16	Revert "[ONNX] Remove beartype usage (#130484 )" This reverts commit 1794c35912025aa44b0d70f67ff664b4f7bd1014. Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/clee2000 due to test_sympy_utils failure is real https://github.com/pytorch/pytorch/actions/runs/9961499559/job/27523758780 `1794c35912`. Dr CI is matching with commits in current commit? ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2231575577))	2024-07-16 18:41:51 +00:00
Joel Schlosser	09b1b113f5	Cache min / max seq len for torch.nested.as_nested_tensor(t) (#130766 ) For the `torch.nested.as_nested_tensor(t)` constructor, computing min / max seq len is trivial since the sequence lengths are all the same. Might as well cache them during construction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130766 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-07-16 18:32:47 +00:00
Edward Z. Yang	408c921d96	Make hashing a SymInt raise an error again (#130548 ) See https://github.com/pytorch/pytorch/issues/130547 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/lezcano	2024-07-16 18:30:30 +00:00
Xu Zhao	1d8baa4df2	[torchbench][servicelab] Fix servicelab test failures (#130781 ) Fix servicelab test failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781 Approved by: https://github.com/desertfire	2024-07-16 17:35:13 +00:00
Justin Chu	1794c35912	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved its Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when using torch.onnx. 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to adopt py310 as the lowest supported Python version before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-16 17:34:36 +00:00
Jiashen Cao	67e22d6c61	[Fix]: Convert operator that does specialization to its symbolic counterpart (#129578 ) #### Issue During conversion, use the symbolic operator when one exists. #### Test Plan `pytest test/export/test_converter.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129578 Approved by: https://github.com/angelayi	2024-07-16 17:19:57 +00:00
Pian Pawakapan	e8998d68c8	[export] add non-strict training IR (#130062 ) Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved. Refraining from unifying export/export_for_training, per @ydwu4's feedback :) Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR Differential Revision: D59349454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062 Approved by: https://github.com/ydwu4, https://github.com/zhxchen17	2024-07-16 17:08:00 +00:00
Sidney Tsang	d2f44eabe7	[Export] Support aten.full.default and aten.full_like.default (#130639 ) Summary: Add operator tests for full & full_like operators Test Plan: Rerun kernel test using ``` buck2 run //glow/fba/tests:run_kernel mode/dev -- --kernel splat --config "input=1;dtype=fp32;fill_value=42.0" -tl_time ``` {F1752274071} Operator tests ``` buck2 run mode/{opt,inplace} //caffe2/torch/fb/test_library:afg_operator_test -- -k __full__ ``` {F1752340913} Differential Revision: D59593849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130639 Approved by: https://github.com/StellarrZ	2024-07-16 16:50:04 +00:00
Colin Peppler	f272e0ab4a	[inductor] support unbacked symint divisors in vars_and_sizes (#130595 ) Scenario: ``` >>> nodes IterationRangesEntry( x2, divisor=192u0 + 192576, length=s1, (xindex//(192u0 + 192576)), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) IterationRangesEntry( x1, divisor=192, length=u0 + 1003, ModularIndexing(xindex, 192, u0 + 1003), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) IterationRangesEntry( x0, divisor=1, length=192, ModularIndexing(xindex, 1, 192), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) ``` Think about whether using fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntry? Unless, one of the lengths divides by an unbacked symint? Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595 Approved by: https://github.com/aakhundov, https://github.com/ezyang	2024-07-16 16:21:38 +00:00
drisspg	2b43d339fe	Make FlexAttention API public (#130755 ) # Summary Makes the prototype API flex_attention public Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755 Approved by: https://github.com/Chillee	2024-07-16 16:21:25 +00:00
PyTorch MergeBot	cbda8be537	Revert "Propagate buffer and parameter indices through AOT (#130393 )" This reverts commit 69a77389e2c4052834c89a25757cdbf5f83b6208. Reverted https://github.com/pytorch/pytorch/pull/130393 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 `80236dca90` lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))	2024-07-16 15:43:34 +00:00
PyTorch MergeBot	9cb23ba85b	Revert "Add buffer static input tests to cudagraph trees (#130402 )" This reverts commit 80236dca90b0874cb2b6f9c9fa5f159c55726401. Reverted https://github.com/pytorch/pytorch/pull/130402 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 `80236dca90` lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))	2024-07-16 15:43:34 +00:00
Sam Larsen	c509319210	[inductor] Disable remote fx graph cache in test_snode_runtime (#130655 ) Summary: Unfortunately we can't save / restore metrics.metrics.node_runtimes in the cache entries because these contain objects that don't pickle: `TypeError: cannot pickle 'PyCapsule' object`. Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:snode_runtime -- --exact 'caffe2/test/inductor:snode_runtime - test_mm (caffe2.test.inductor.test_snode_runtime.ComputeBoundedTests)' --run-disabled --jobs 18 --stress-runs 10` Differential Revision: D59705654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130655 Approved by: https://github.com/oulgen	2024-07-16 15:11:17 +00:00
Aaron Enye Shi	aa4ad711ef	[CCA][Memory Snapshot] Create TraceEntryRingBuffer class for alloc_trace logic (#130741 ) Summary: Move the alloc_trace logic into a separate class, to reduce the risk of deadlocks when mixing with CCA's lock. Switch to a std::mutex instead of std::recursive_mutex. This lets us reuse the logic in the TraceEntryRingBuffer class for later diffs. Test Plan: CI, resnet run, and FBR model. Differential Revision: D59690408 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741 Approved by: https://github.com/davidberard98	2024-07-16 15:01:48 +00:00
eellison	e11c41035c	Directly use empty strided in cudagraph copy (#130777 ) We had an issue with the `-1` somehow ending up in negative num elements required. not sure why the original didn't work - we should land if CI is green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130777 Approved by: https://github.com/BoyuanFeng	2024-07-16 14:37:30 +00:00
Aaron Orenstein	4c3348932c	typing: convert_frame (#130670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130670 Approved by: https://github.com/Skylion007 ghstack dependencies: #130669	2024-07-16 14:31:35 +00:00
Aaron Orenstein	ea25febfab	typing: storage (#130669 ) This isn't a full typing of the file - it just fixes some uses of unbound 'T' (if you use a TypeVar as an output it also needs to be an input). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130669 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-07-16 14:31:35 +00:00
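Note on #130669: the remark that a TypeVar used as an output also needs to be an input can be illustrated with a small sketch. The function `first` below is hypothetical and is not taken from `torch/storage.py`.

```python
from typing import List, TypeVar

T = TypeVar("T")

# Problematic pattern (left as a comment): T appears only in the return
# type, so a type checker has nothing to bind it to.
#     def make() -> T: ...


# Bound pattern: T also appears in a parameter, so the return type is
# tied to the type of the argument.
def first(items: List[T]) -> T:
    return items[0]


value = first([3, 1, 2])  # a checker infers int here
print(value)
```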
Isuru Fernando	8390843eba	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-16 14:29:29 +00:00
David Berard	1fbfb3202d	[docs][TorchScript] document c10::AliasAnalysisKind::CONSERVATIVE (#130765 ) I spent a while trying to search this to remember what this was called. Adding it to the OVERVIEW.md docs so it's easier to search Pull Request resolved: https://github.com/pytorch/pytorch/pull/130765 Approved by: https://github.com/nmacchioni, https://github.com/eellison, https://github.com/aaronenyeshi	2024-07-16 14:20:31 +00:00
Xu Han	69e9917245	[inductor] adapte windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully. The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch


def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b


opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-16 13:53:39 +00:00
Aaron Gokaslan	53e5b8ac5b	[BE]: Update flake8-comprehensions and enable C420 (#130699 ) Uses `dict.fromkeys` whenever possible as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions has added a new rule which covers this. Using `dict.fromkeys` is faster when the value being added to the dictionary is the same at every iteration and is immutable; it also removes an unnecessary dict comprehension. This rule will be enabled with our current ruleset in RUF in 0.6 as C420. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699 Approved by: https://github.com/lezcano, https://github.com/ezyang	2024-07-16 13:47:49 +00:00
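Note on #130699: a minimal sketch of the kind of rewrite this rule points at, and of why the message stresses that the value should be immutable. The key names are made up for illustration and do not come from the PR.

```python
keys = ["weight", "bias", "running_mean"]  # hypothetical keys

# Pattern the rule targets: a dict comprehension whose value is the same
# constant on every iteration.
with_comprehension = {k: None for k in keys}

# Equivalent form; the value defaults to None.
with_fromkeys = dict.fromkeys(keys)

# Caveat with a mutable value: fromkeys stores ONE shared object under
# every key, so a change made through one key is visible through all.
shared = dict.fromkeys(keys, [])
shared["weight"].append(1)

print(with_fromkeys)
print(shared)
```

The shared-object behavior is why a comprehension such as `{k: [] for k in keys}` is not interchangeable with `dict.fromkeys(keys, [])`.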
Xu Zhao	213685ba97	[torchao][pt2 benchmark runner] Run performance test non-alternately (#130136 ) Summary: By default, performance tests (speedup experiments) will run the baseline and test backend alternately. However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized. Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend). Test Plan: ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 ``` ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune ``` Differential Revision: D59332736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136 Approved by: https://github.com/jerryzh168	2024-07-16 13:38:17 +00:00
eellison	67c6941b4e	Update torch.cat decomp for 0-dim (#130763 ) Fix for https://github.com/pytorch/pytorch/issues/130615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130763 Approved by: https://github.com/Skylion007, https://github.com/mlazos	2024-07-16 13:34:01 +00:00
Jiong Gong	705da70f2c	[inductor][cpp] align dtype convert cache between vec and scalar kernels (#130677 ) The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store" which might be removed and ignored. This led to inconsistent code between the vec and scalar kernels: we generate the scalar kernel first, followed by the vector kernel, so a store buffer removed during scalar codegen could affect the vector kernel codegen. This PR moves the caching from "store" to the "to_dtype" calls, which are not affected by removed buffers. `pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py` before

```c++
extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = at::vec::convert<float>(tmp3);
            auto tmp5 = tmp1 + tmp4;
            auto tmp6 = at::vec::convert<bfloat16>(tmp5);
            tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

after

```c++
extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = tmp1 + tmp2;
            auto tmp5 = at::vec::convert<bfloat16>(tmp4);
            tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677 Approved by: https://github.com/leslie-fang-intel	2024-07-16 13:25:05 +00:00
PyTorch MergeBot	68a4f2a3df	Revert "Tighten torch.library.infer_schema input types (#130705 )" This reverts commit ca2d424c6e5358f9fee8dc9ee7477de76b50f848. Reverted https://github.com/pytorch/pytorch/pull/130705 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/130705#issuecomment-2230821876))	2024-07-16 12:57:11 +00:00
Andrea Frittoli	dee0f43fde	Add a CI job to check runner det sync (#129746 ) Add a new CI job that runs only when the runner determinator files are modified. The job checks that the runner_determinator.py script is in sync with the version embedded in _runner-determinator.yaml. Fixes TBD Pull Request resolved: https://github.com/pytorch/pytorch/pull/129746 Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi, https://github.com/jeanschmidt	2024-07-16 11:44:55 +00:00
Jovian Anthony Jaison	e57101d927	Add testing regarding SparseAdam state_dicts (#130645 ) Summary: - Updated SparseAdam to run test_state_dict_deterministic unit test. - Made gradients sparse while keeping weights dense in the above test. Test Plan: - Ran test_optim.py locally. Fixes #116507 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645 Approved by: https://github.com/janeyx99	2024-07-16 11:29:22 +00:00
cyy	168e41009b	[structural binding][10/N] Replace std::tie with structural binding (#130784 ) Follows #130404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130784 Approved by: https://github.com/malfet	2024-07-16 10:28:14 +00:00
Xuehai Pan	747b38c131	[BE][Easy][2/19] enforce style for empty lines in import segments in `.ci/` and `.github/` (#129753 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129753 Approved by: https://github.com/malfet ghstack dependencies: #129752	2024-07-16 09:40:00 +00:00
Yu, Guangye	096dc444ce	Keep zero check be compatible with different sympy versions (#130729 ) # Motivation I found a difference between sympy 1.12 and 1.13.

```python
# for 1.12
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
True
```

```python
# for 1.13
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
False
```

This difference affects the result of [safe_mul](`6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)`): when `a = sympy.Number(0.0)` and `b = inf`, the result is `nan` under sympy 1.13, whereas the expected result is 0.

```python
def safe_mul(a, b):
    # Make unknown() * wrap(0.0) == wrap(0.0)
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b
```

Across sympy versions, `sympy.Number(0)` consistently compares equal to 0.0.

```python
>>> import sympy
>>> a = sympy.Number(0)
>>> a == 0.0
True  # for different sympy versions
```

So, use 0.0 when checking for zero in safe_mul to stay compatible with different sympy versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729 Approved by: https://github.com/lezcano, https://github.com/EikanWang	2024-07-16 08:39:00 +00:00
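Note on #130729: why the zero check in `safe_mul` matters can be seen without sympy. The sketch below copies the `safe_mul` shape from the message but, as an assumption made purely for illustration, uses plain Python floats in place of sympy numbers. It does not reproduce the sympy 1.12 versus 1.13 comparison difference itself.

```python
import math


def safe_mul(a, b):
    # Treat an exact zero as absorbing, so 0.0 * inf yields 0.0 instead of nan.
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b


naive = 0.0 * math.inf              # IEEE 754 multiplication gives nan
guarded = safe_mul(0.0, math.inf)   # the zero check short-circuits first

print(naive, guarded)
```

If the zero comparison ever evaluates to false for a value that is really zero, the function falls through to the plain product and the `nan` reappears, which is the failure mode the message describes.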
Animesh Jain	fedae41c57	[dynamo] Do not mark nn.module containers as BuiltinNNModuleVariable (#130773 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130773 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-07-16 06:55:46 +00:00
Aaron Gokaslan	83eedf66b9	Update libfmt submodule to 11.0.1 (#130628 ) Update libfmt to 11.0.1; this is a reopen of https://github.com/pytorch/pytorch/pull/129962. The update requires a kineto update and moves fmt::join into a separate include, so that include was added where necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628 Approved by: https://github.com/aaronenyeshi	2024-07-16 06:12:11 +00:00
chuanqiw	c549629696	[CD] Fix xpu nightly wheel test failure (#130742 ) The xpu nightly wheel test hit a permission issue on the `linux.idc.xpu` runner: those runners are onboarded with the `jenkins` user, but the binary test runs in the docker container directly as `root`, so the temp files can't be deleted. See https://github.com/pytorch/pytorch/actions/runs/9935452320/job/27448053625#step:8:91 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130742 Approved by: https://github.com/atalman	2024-07-16 05:31:20 +00:00
cyy	95dbbf713e	[Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109 ) Follows #125102 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109 Approved by: https://github.com/ezyang	2024-07-16 04:23:42 +00:00
Wanchao Liang	7b2e802f31	[dtensor] add a few dunder methods to pointwise ops (#130754 ) fixes https://github.com/pytorch/pytorch/issues/130671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130754 Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/msaroufim ghstack dependencies: #130753	2024-07-16 02:53:35 +00:00
Wanchao Liang	2b2671a7b1	[dtensor] fix foreach_norm when ord is 2 (#130753 ) As titled, this fixes the case where `ord` is passed as 2 (the default value) and the op dispatch does not receive the default value. We simply check whether the args schema received an `ord` field or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753 Approved by: https://github.com/awgu	2024-07-16 02:53:35 +00:00
Aaron Gokaslan	a29052a0bf	[BE][Ez]: Update ruff to 0.5.2 (#130698 ) Update ruff to 0.5.2, which brings bug fixes and performance improvements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130698 Approved by: https://github.com/ezyang	2024-07-16 01:31:30 +00:00
Adrian Wälchli	ad314a2f05	Pass `torch.load(weights_only=)` internally to avoid FutureWarning (#130663 ) Fixes #130658 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130663 Approved by: https://github.com/malfet, https://github.com/LucasLLC	2024-07-16 01:24:38 +00:00
Sam Larsen	3cd2ae331a	Use inductor TestCase for distributed tests (#129494 ) Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-16 01:24:35 +00:00
Brian Hirsh	39eeaac4e5	inductor: avoiding moving constructor to cuda when it would cause h2d sync in index_put_ fallback (#130338 ) My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335, see issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur a H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on, that may-or-may-not incur a H2D sync if we change any of their inputs from cpu to cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338 Approved by: https://github.com/eellison	2024-07-16 00:48:58 +00:00
Jiang, Yanbing	93a03edcf9	Update error message in meta__convert_weight_to_int4pack (#130707 ) This PR is to fix error message in https://github.com/pytorch/pytorch/pull/129940. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130707 Approved by: https://github.com/lezcano, https://github.com/malfet	2024-07-16 00:44:35 +00:00
Xuehai Pan	a3abfa5cb5	[BE][Easy][1/19] enforce style for empty lines in import segments (#129752 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-16 00:42:56 +00:00
eqy	5e617d7ef5	[CUDA] Actually bump tolerances for `test_grad_pca_lowrank` (#130770 ) Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770 Approved by: https://github.com/Skylion007	2024-07-16 00:41:10 +00:00
Michael Lazos	80236dca90	Add buffer static input tests to cudagraph trees (#130402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402 Approved by: https://github.com/eellison ghstack dependencies: #130391, #130392, #130503, #130393	2024-07-16 00:25:38 +00:00
Michael Lazos	69a77389e2	Propagate buffer and parameter indices through AOT (#130393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393 Approved by: https://github.com/bdhirsh ghstack dependencies: #130391, #130392, #130503	2024-07-16 00:25:38 +00:00
Michael Lazos	200d3d0a89	Remove static param counting if inlining NN modules (#130503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130503 Approved by: https://github.com/bdhirsh ghstack dependencies: #130391, #130392	2024-07-16 00:25:34 +00:00
Michael Lazos	0d0c09702a	Update mark_static_address for inlining NN modules (#130392 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130392 Approved by: https://github.com/anijain2305 ghstack dependencies: #130391	2024-07-16 00:25:29 +00:00
Michael Lazos	d8616eb66a	Mark nn_module params and buffers as static in dynamo (#130391 ) This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391 Approved by: https://github.com/anijain2305	2024-07-16 00:25:23 +00:00
eellison	9ab8d47f9d	Constant folding for dynamic shape node (#129686 ) Extend constant folding to dynamic shape nodes, supporting only pointwise ops and some restricted ops. We support dynamic shapes by limiting constant folding to ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead from constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937; joint work with @imzhuhl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-16 00:17:11 +00:00
yuqingj	ea4f310ff1	[Nested Tensor][easy] Add softmax backward support (#130602 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130602 Approved by: https://github.com/davidberard98, https://github.com/jbschlosser	2024-07-16 00:07:42 +00:00
Andrew Gu	d3ab8ceced	[FSDP2] Allowed `List[nn.Module]` as arg (#127786 ) This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication. Approach At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node. To implement the runtime schedule, we define new forward hooks that run based on the following semantics: - If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op. - If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op. - First and last are determined by scoreboarding against a set of the modules. - This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward. Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`. Examples This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382. If at least one of the modules in the list does not run forward before backward, then there will be a warning message like: ``` 1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. 
Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786 Approved by: https://github.com/yf225, https://github.com/weifengpy ghstack dependencies: #127773	2024-07-15 23:54:10 +00:00
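The first/last hook semantics described in the FSDP2 commit above can be sketched in plain Python. This is a toy model under stated assumptions: the class name `GroupHookScoreboard` and its methods are hypothetical, modules are represented by strings, and the real `fully_shard` hooks do considerably more than this.

```python
from typing import Callable, Hashable, Iterable, List, Set


class GroupHookScoreboard:
    """Run a pre-hook for the first module in a group and a post-hook for the last."""

    def __init__(
        self,
        modules: Iterable[Hashable],
        pre_hook: Callable[[], None],
        post_hook: Callable[[], None],
    ) -> None:
        self._modules = frozenset(modules)
        self._pre_hook = pre_hook
        self._post_hook = post_hook
        self._pre_seen: Set[Hashable] = set()
        self._post_seen: Set[Hashable] = set()

    def on_pre_forward(self, module: Hashable) -> None:
        # Only the first module to arrive actually runs the pre-hook.
        if not self._pre_seen:
            self._pre_hook()
        self._pre_seen.add(module)

    def on_post_forward(self, module: Hashable) -> None:
        # Only the module that completes the set actually runs the post-hook.
        self._post_seen.add(module)
        if self._post_seen == self._modules:
            self._post_hook()
            self.reset()

    def reset(self) -> None:
        # Also called at the end of backward, in case some module never ran.
        self._pre_seen.clear()
        self._post_seen.clear()


events: List[str] = []
board = GroupHookScoreboard(
    ["norm", "output"],
    pre_hook=lambda: events.append("pre"),
    post_hook=lambda: events.append("post"),
)

# Iteration 1: both modules run forward, one after the other.
for name in ("norm", "output"):
    board.on_pre_forward(name)
    board.on_post_forward(name)

# Iteration 2: only "norm" runs, so the post-hook never fires.
board.on_pre_forward("norm")
board.on_post_forward("norm")
board.reset()  # end of backward clears the scoreboard

# Iteration 3: the pre-hook fires again because the scoreboard was cleared.
board.on_pre_forward("norm")
```

Without the `reset()` at the end of backward, the scoreboard in iteration 3 would still contain `"norm"` and the pre-hook would be skipped, which is the case the commit message calls out.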
Andrew Gu	b27695791e	[PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 ) This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773 Approved by: https://github.com/weifengpy	2024-07-15 23:54:10 +00:00
Bilal Khan	54a932b0ac	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. 
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. 
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/eqy, https://github.com/eellison	2024-07-15 23:23:23 +00:00
Sijia Chen	006020ff6e	Fix the cudagraph capture of SDPA (#130712 ) Summary: The scalar tensor by default is on CPU, which failed the cuda graph capture. To fix the issue, we put the scalar tensor on GPU Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2' Differential Revision: D59740639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712 Approved by: https://github.com/Skylion007, https://github.com/chenyang78	2024-07-15 23:05:48 +00:00
Alnis Murtovi	50ef099ad0	Learn a heuristic to decide whether to pad before mm (#128643 ) This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code. The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code. The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set. The heuristic can return "unsure", which means that it is not sure which choice is best, and as a result autotuning will happen. On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often. The results for randomly generated data and huggingface look as follows: `max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`. The heuristic is learned as a regression tree that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set. 
A100 ``` max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold 15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530 16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530 17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530 18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530 19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530 ``` The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned: A100 hf-train: 60 A100 hf-inf: 10 ## Results on running huggingface locally This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds. #pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure"). I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency. 
Results on huggingface training ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2 BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1 BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2 CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1 DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2 PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2 PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1 RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1 TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2 Mean difference in speedup: 0.01 Mean compilation latency saved: -0.80s Mean compilation latency reduction: 1.68% ``` Results on huggingface inference ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2 BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1 BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2 DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2 
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2 PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1 TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2 Mean difference in speedup: 0.00 Mean compilation latency saved: -0.38s Mean compilation latency reduction: 2.07% ``` For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-07-15 23:04:06 +00:00
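The threshold rule and the two error metrics from the AutoHeuristic commit above can be sketched as follows. This is an illustration under assumptions: the function names `choose` and `geomean` are hypothetical, the margin is assumed to be the ratio of two positive predicted scores, and the `wrong_speedups` values are made up rather than taken from the reported results.

```python
import math
from typing import Sequence


def choose(pad_score: float, no_pad_score: float, threshold: float) -> str:
    """Return "pad", "no_pad", or "unsure" from two positive predicted scores."""
    if pad_score >= no_pad_score:
        best, best_score, other_score = "pad", pad_score, no_pad_score
    else:
        best, best_score, other_score = "no_pad", no_pad_score, pad_score
    # Assumption: "how much better" is measured as a ratio of the two scores.
    if best_score / other_score < threshold:
        return "unsure"  # the caller falls back to autotuning
    return best


def geomean(values: Sequence[float]) -> float:
    """Geometric mean, the statistic behind `gman_wrong_speedup`."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


A100_THRESHOLD = 1.702530  # value quoted in the commit message

confident = choose(2.0, 1.0, A100_THRESHOLD)  # margin 2.0x clears the threshold
hesitant = choose(1.5, 1.0, A100_THRESHOLD)   # margin 1.5x does not

wrong_speedups = [1.25, 1.80]  # made-up values for illustration
max_wrong_speedup = max(wrong_speedups)
gman_wrong_speedup = geomean(wrong_speedups)
```

A higher threshold trades fewer wrong decisions for more "unsure" results, which is why the commit tunes it on the validation set.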
Sam Larsen	9a5204dc2d	[inductor] Remove "spawn" as an option for parallel compile method (#130746 ) Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it any more, so just remove as an option. Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746 Approved by: https://github.com/Skylion007	2024-07-15 22:55:54 +00:00
Jiashen Cao	3f031b96c6	[Fix] Correctly identifying arguments for sub-blocks with renaming logic during TorchScript to ExportedProgram conversion (#128386 ) #### Issue Fix two issues related to input lifting when there are sub-blocks. * Some inputs may appear in nested sub-blocks, which requires a recursive search to identify which arguments need to be lifted / passed in from the top-level block. * Some inputs to the sub-block are intermediate results, meaning their names are purely numeric. This causes issues during code generation (i.e., invalid argument names). We rename those to valid names. #### Test Plan * `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param` * `pytest test/export/test_converter.py -s -k test_hidden_input_name` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386 Approved by: https://github.com/angelayi	2024-07-15 22:48:13 +00:00
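The two fixes in the converter commit above (a recursive search through nested sub-blocks, and renaming purely numeric names) can be sketched on a toy data structure. Everything here is hypothetical: blocks are modeled as plain dicts, and `collect_lifted`, `to_valid_name`, and the `arg_` prefix are illustrative names, not the converter's actual API.

```python
from typing import Dict, List, Set

# Toy block: {"defined": [...], "used": [...], "blocks": [nested blocks]}
Block = Dict[str, list]


def collect_lifted(block: Block) -> Set[str]:
    """Names used in `block` or any nested sub-block that it does not define."""
    defined = set(block.get("defined", []))
    needed = set(block.get("used", [])) - defined
    for sub in block.get("blocks", []):
        # Recurse, then drop whatever this level can supply itself.
        needed |= collect_lifted(sub) - defined
    return needed


def to_valid_name(name: str, prefix: str = "arg_") -> str:
    """Prefix purely numeric names so they become valid Python identifiers."""
    return f"{prefix}{name}" if name.isdigit() else name


innermost: Block = {"defined": ["t"], "used": ["t", "7"], "blocks": []}
middle: Block = {"defined": [], "used": ["x", "w"], "blocks": [innermost]}
top: Block = {"defined": ["x"], "used": ["x"], "blocks": [middle]}

lifted: List[str] = [to_valid_name(n) for n in sorted(collect_lifted(top))]
```

Here `w` and the numeric intermediate `7` are only referenced in nested blocks, so a non-recursive scan of the top-level block would miss both.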
Jerry Zhang	b893aa71ca	Rename generate_numeric_debug_handle to numeric_debugger (#130590 ) Summary: att Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-15 22:42:27 +00:00
WeiChunyu-star	535016967a	Enable UFMT on all of torch/sparse (#130545 ) Partially addresses #123062 Ran lintrunner on: - torch/sparse Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/130545 Approved by: https://github.com/ezyang	2024-07-15 22:35:52 +00:00
Alex Dennis	7d4f50de19	dynamo add support for `defaultdict(set)` (#130745 ) Fixes #130554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130745 Approved by: https://github.com/Skylion007	2024-07-15 22:23:33 +00:00
William Wen	3928ca2ab6	[dynamo] update call map to allow multiple input parameters (#130748 ) Fixes https://github.com/pytorch/pytorch/issues/128072. Commandeering https://github.com/pytorch/pytorch/pull/128282 since the issue is now hi pri. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130748 Approved by: https://github.com/Skylion007, https://github.com/anijain2305	2024-07-15 22:16:49 +00:00
eqy	6f32dc0c7b	Don't pass error message as `places` in `assertGreaterAlmostEqual` (#130648 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648 Approved by: https://github.com/awgu	2024-07-15 22:14:49 +00:00
PyTorch MergeBot	dff9d68f18	Revert "Fix names conflict when lifting (#129817 )" This reverts commit 53cf46b8c602f8512d49a5c30bca7fcf5411e25c. Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to Failing inductor/test_flex_attention.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27478084137 `74da2a467f` Sorry for the churn, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2229519886))	2024-07-15 22:08:45 +00:00
PyTorch MergeBot	78799e82b0	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit 1bc390c5f5ac065c156f55f4eceed267ecc67b41. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 `1bc390c5f5`. Test was introduced by `fa5f572748` which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))	2024-07-15 21:59:46 +00:00
Yifu Wang	db3a641b71	Implement operator for micro-pipelined all-gather -> _scaled_mm (#129289 ) This PR implements `torch.ops.symm_mem.fused_all_gather_scaled_matmul`. It's similar to `torch.ops.symm_mem.fused_all_gather_matmul`, except that it takes scales and calls ` _scaled_mm`. [Profiling Trace vs. Baseline](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0gmg1f2_) (FB internal only) Co-authored-by: Will Feng <yf225@cornell.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129289 Approved by: https://github.com/Chillee, https://github.com/weifengpy, https://github.com/drisspg	2024-07-15 21:48:35 +00:00
Shuqiang Zhang	77fb5b0e23	[c10d] a new Pytorch API (split_group) to create a process group (#130507 ) This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407 ncclCommSplit Summary: In current PyTorch/c10d, the new_group API is used to create a new process group from the default pg. When device_id is specified in init_process_group and nccl is used as the backend, the new_group call will use ncclCommSplit to create the nccl communicators to save communicator resources. It has a few drawbacks: (1) Redundant calls: suppose the default group has 256 ranks and we need 32 child PGs, each with 8 ranks. In this case, each rank needs to call new_group and ncclCommSplit 32 times because of how the new_group API is implemented and the collective requirement of ncclCommSplit. For a specific global rank, 31 calls of ncclCommSplit would be no_color splits, and only 1 of them is a colored split. With the proposed new split_group API, we expect only 1 call of split_group/ncclCommSplit to be needed per rank in the above example. (2) new_group can only split from default_pg: ideally, a new pg should be able to be split from any pg. With the new split_group API, users can create new PGs using ncclCommSplit with fewer calls and initialize the PG eagerly. This is also useful when creating many P2P communicators. Test Plan: New UTs: e.g., python test/distributed/test_c10d_nccl.py -k test_comm_split_group_larger_scale Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507 Approved by: https://github.com/wconstab	2024-07-15 21:26:43 +00:00
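The call-count argument in the `split_group` commit above is simple arithmetic and can be checked with a short script. This is a counting model only: it assumes ranks are assigned to child groups contiguously, uses hypothetical helper names, and does not call any `torch.distributed` API.

```python
from typing import Dict

WORLD_SIZE = 256
GROUP_SIZE = 8
NUM_GROUPS = WORLD_SIZE // GROUP_SIZE  # 32 child process groups


def color_of(rank: int) -> int:
    """Index of the child group a rank belongs to (contiguous assignment assumed)."""
    return rank // GROUP_SIZE


def new_group_calls(rank: int) -> Dict[str, int]:
    """Every rank takes part in creating every child group, one call per group."""
    colored = sum(1 for group in range(NUM_GROUPS) if color_of(rank) == group)
    return {"total": NUM_GROUPS, "colored": colored, "no_color": NUM_GROUPS - colored}


def split_group_calls(rank: int) -> Dict[str, int]:
    """With the proposed API each rank makes one colored split for its own group."""
    return {"total": 1, "colored": 1, "no_color": 0}


per_rank_old = new_group_calls(rank=100)
per_rank_new = split_group_calls(rank=100)
```

Under these assumptions each rank goes from 32 split calls, 31 of which are no-color splits, down to a single colored split.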
Nikita Shulga	ac3e2cb64a	[BE] Delete unused -rg.yml workflow (#130759 ) As well as `_linux-test-label.yml` as ARC experiment is dead Pull Request resolved: https://github.com/pytorch/pytorch/pull/130759 Approved by: https://github.com/ZainRizvi	2024-07-15 20:41:59 +00:00
Iris Zhang (PyTorch)	ee6f0ab190	[DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 ) (#130685 ) Summary: As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different. This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases. As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash. ``` test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32 ``` Adding an additional is_initialized() check since APF has a test mocking the backend without pg initialized. Therefore, we need to add the is_initialized() check to avoid test failure. In real use case, we should have a pg initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs. Test Plan: ``` [irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends' ``` Reviewed By: gag1jain Differential Revision: D59725924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685 Approved by: https://github.com/gag1jain	2024-07-15 20:05:26 +00:00
chilli	27322355de	Added some more documentation to block mask creation (#130649 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130649 Approved by: https://github.com/drisspg ghstack dependencies: #130626	2024-07-15 19:48:42 +00:00
yuqingj	0e79e1f958	[NJT+SDPA]Fix flash_attention output when batch_size=1 and seq_len=1 (#130652 ) fix issue #130196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130652 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jbschlosser	2024-07-15 19:44:04 +00:00
PyTorch MergeBot	074a5c0c9b	Revert "[BE] bump `optree` version to 0.12.1 (#130139 )" This reverts commit 8fcb156e8b5697a8f292db6db2a1803c5f4ce2d7. Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py `8fcb156e8b` ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))	2024-07-15 19:42:11 +00:00
Xu Han	f1456c74a0	Fix mkl-static issue for Windows. (#130697 ) Background: We found a performance regression in the PyTorch Windows release/2.4: https://github.com/pytorch/pytorch/issues/130619 After some debugging, I found that the PyTorch Windows static mkl build options are wrong: 1. The thread lib is wrong. 2. The `openmp` lib and config are missing. > Debug history: https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226782504 and https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226418611 This PR fixes the `mkl-static` build options issue. Reference: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.c6izlg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130697 Approved by: https://github.com/jgong5, https://github.com/atalman	2024-07-15 19:28:11 +00:00
Wanchao Liang	a7cfe40c9b	[dtensor] Improve from_local API with run_check (#130289 ) as titled, this PR: 1. switch `run_check` to be by default False and add extra doc/comments about the correctness guarantee. Since I observed so many calls forget to use run_check=False, we should simply switch to not perform metadata check and make our documentation explicit 2. Implement metadata check by picking up the changes from https://github.com/pytorch/pytorch/pull/115229 3. Improve the from_local documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/130289 Approved by: https://github.com/awgu, https://github.com/wz337 ghstack dependencies: #130286, #130287, #130288	2024-07-15 18:52:55 +00:00
Wanchao Liang	3342f3aa4e	[dtensor] simplify sdpa strategies (#130288 ) as titled, this PR simplifies both flash and efficient attention op strategy generation paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/130288 Approved by: https://github.com/tianyu-l ghstack dependencies: #130286, #130287	2024-07-15 18:52:55 +00:00
Wanchao Liang	7d82dc2c23	[dtensor] slice_backward to use op strategy (#130287 ) as titled. slice_backward currently forwards the sharding unconditionally, which is mathematically wrong. This PR switches it to an op strategy and only allows replication Pull Request resolved: https://github.com/pytorch/pytorch/pull/130287 Approved by: https://github.com/awgu ghstack dependencies: #130286	2024-07-15 18:52:49 +00:00
Zhanghan Wang	53cf46b8c6	Fix names conflict when lifting (#129817 ) ## Bug description When the pending args that may be lifted [here](`58f346c874/torch/_dynamo/output_graph.py (L1866)`) share the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](`58f346c874/torch/_dynamo/output_graph.py (L2081)`) can end up creating a name ([here](`58f346c874/torch/fx/graph.py (L1008)`)) that overwrites one of the args to lift, which produces a wrong output graph. ## Reproducing Below is a reproducible example, ```python import logging from typing import List import torch from functorch.compile import aot_module_simplified, make_boxed_func @torch.library.custom_op("mylib::somefunc_forward", mutates_args=()) def somefunc_forward( input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: return torch.ones_like(input_) @somefunc_forward.register_fake def _(input_, weight, shape): return torch.empty_like(input_) @torch.library.custom_op("mylib::somefunc_backward", mutates_args=()) def somefunc_backward( grad_output: torch.Tensor, input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: print(f"backward.{grad_output.shape=}") print(f"backward.{input_.shape=}") print(f"backward.{weight.shape=}") print(f"backward.{shape=}") assert list(weight.shape) == shape return torch.ones_like(weight) @somefunc_backward.register_fake def _(grad_output, input_, weight, shape): return torch.empty_like(weight) def a_func(grad_output, input_, weight_, shape): return torch.ones_like(input_.sum() * weight_) class SomeFunc(torch.autograd.Function): @staticmethod def forward(ctx, input, weight, normalized_shape): ctx.normalized_shape = normalized_shape input_ = input.contiguous() weight_ = weight.contiguous() output = somefunc_forward(input_, weight_, ctx.normalized_shape) ctx.save_for_backward(input_, weight_) return output @staticmethod def backward(ctx, grad_output): input_, weight_ = ctx.saved_tensors # grad_weight = 
a_func(grad_output, input_, weight_, ctx.normalized_shape) grad_weight = somefunc_backward( grad_output.contiguous(), input_, weight_, ctx.normalized_shape, ) return None, grad_weight, None class MyModel(torch.nn.Module): def __init__(self): super().__init__() self.weight = torch.nn.Parameter(torch.ones(7)) def forward(self, x): return SomeFunc.apply(x, self.weight, [7]) model = MyModel() torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True) def aot_print_backend(gm, sample_inputs): # Forward compiler capture def fw(gm, sample_inputs): print(f"----- fw") gm.print_readable() return make_boxed_func(gm.forward) # Backward compiler capture def bw(gm, sample_inputs): print(f"----- bw") gm.print_readable() return make_boxed_func(gm.forward) # Call AOTAutograd gm_forward = aot_module_simplified( gm, sample_inputs, fw_compiler=fw, bw_compiler=bw ) return gm_forward model = torch.compile( model, backend=aot_print_backend, dynamic=False, ) out = model(torch.rand((128, 4, 7))) out.mean().backward() ``` The log shows the calls into create_graph_input: ```log V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none) V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none) ``` And the generated backward graph looks like ```log class GraphModule(torch.nn.Module): def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"): contiguous_1 = contiguous contiguous_2 = contiguous_1 # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(), contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous(); somefunc_forward_default = None # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return 
self._opoverload(*args, **kwargs) somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]); contiguous = contiguous_1 = contiguous_2 = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (None, somefunc_backward_default) ``` The original code of `somefunc_backward` takes the arguments `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which triggers the assertion I added in `somefunc_backward`. ## Environment ```log Collecting environment information... PyTorch version: 2.5.0a0+git0b7e8df Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: macOS 14.5 (arm64) GCC version: Could not collect Clang version: 15.0.0 (clang-1500.3.9.4) CMake version: version 3.26.4 Libc version: N/A Python version: 3.9.19 (main, May 6 2024, 14:39:30) [Clang 14.0.6 ] (64-bit runtime) Python platform: macOS-14.5-arm64-arm-64bit Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Apple M3 Pro Versions of relevant libraries: [pip3] numpy==2.0.0 [pip3] optree==0.11.0 [pip3] torch==2.5.0a0+git0b7e8df [pip3] torchgraph==0.0.1 [conda] numpy 2.0.0 pypi_0 pypi [conda] optree 0.11.0 pypi_0 pypi [conda] torch 2.5.0a0+git0b7e8df dev_0 <develop> [conda] torchgraph 0.0.1 dev_0 <develop> ``` ## How to fix? I put in a naive fix that adds the potential args to lift into used_names. This accesses private variables; I will fix that if this issue makes sense to you. 
@zou3519 @oulgen Co-authored-by: rzou <zou3519@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817 Approved by: https://github.com/zou3519	2024-07-15 18:49:12 +00:00
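The name collision described in this message can be illustrated without PyTorch. The sketch below is a simplified stand-in for suffix-based unique naming, not the actual `torch/fx/graph.py` code; `fresh_name` and the `used` set are hypothetical names chosen for illustration. It shows how a new node can take the name of a pending arg unless the pending names are reserved first, which is the idea behind the fix.

```python
def fresh_name(base, used):
    """Hypothetical helper: return base, base_1, base_2, ... (first one not in `used`)."""
    name, i = base, 0
    while name in used:
        i += 1
        name = f"{base}_{i}"
    used.add(name)
    return name


# Args that are waiting to be lifted as graph inputs.
pending = ["contiguous", "contiguous_1"]

# Buggy order: the pending names are not reserved yet, so a new
# `.contiguous()` node is given a name that a pending arg also wants.
used = set()
clashing = fresh_name("contiguous", used)

# Fixed order: reserve the pending names first, then allocate.
used = set(pending)
safe = fresh_name("contiguous", used)

print(clashing, safe)
```

Reserving the names up front costs one set update per pending arg and keeps the allocator itself unchanged.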
Guilherme Leobas	b4b64f76e5	Ensure tensors devices match on `torch.index_put` batch rule impl (#130479 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130479 Approved by: https://github.com/zou3519	2024-07-15 18:16:31 +00:00
Joel Schlosser	00d71b3e86	Tweak tolerances for test_vjp_linalg_tensorsolve_cuda_float32 to pass in Windows / debug builds (#130449 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130449 Approved by: https://github.com/zou3519, https://github.com/malfet ghstack dependencies: #128238, #130360	2024-07-15 17:35:34 +00:00
PyTorch MergeBot	9e161af179	Revert "Increase tolerance for tensorsolve tests (#130620 )" This reverts commit 103b6ccab2bd025dfacc8c8a91f71f3d68e50426. Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))	2024-07-15 17:35:04 +00:00
Xuehai Pan	8fcb156e8b	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launching with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519	2024-07-15 17:27:07 +00:00
PyTorch MergeBot	1e897a0ca4	Revert "Fix names conflict when lifting (#129817 )" This reverts commit 74da2a467f166e00316aee82ba24835ca563ed87. Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to broke dynamo/test_inline_inbuilt_nn_modules.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27461141919 `74da2a467f`. Test passed on PR, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2228993570))	2024-07-15 17:09:52 +00:00
Edward Z. Yang	0099e15b47	Also put unbacked symbols in symbol_to_node in split_module pass (#130535 ) This is not a complete fix but it is a simple one, full fix tracked in https://github.com/pytorch/pytorch/issues/130534 Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7510238679103969/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130535 Approved by: https://github.com/malfet	2024-07-15 16:56:01 +00:00
rzou	ca2d424c6e	Tighten torch.library.infer_schema input types (#130705 ) Made the following changes: - mutates_args is now keyword-only and mandatory. This is to align with torch.library.custom_op (which makes it mandatory because it's easy to miss) - op_name is now keyword-only. This helps the readability of the API - updated all usages of infer_schema This change is not BC-breaking because we introduced torch.library.infer_schema a couple of days ago. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705 Approved by: https://github.com/yushangdi	2024-07-15 16:43:57 +00:00
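The effect of making `mutates_args` keyword-only and mandatory can be seen with a plain Python function. The sketch below uses a hypothetical stand-in, `infer_schema_stub`, and is not the real `torch.library.infer_schema`; only the parameter names `mutates_args` and `op_name` come from the message above, and the returned string is made up for illustration.

```python
def infer_schema_stub(fn, *, mutates_args, op_name=None):
    """Hypothetical stand-in: everything after `*` must be passed by keyword."""
    prefix = op_name if op_name is not None else ""
    return f"{prefix}{fn.__name__}(mutates={tuple(mutates_args)})"


def add_one(x):
    return x + 1


# Passing mutates_args by keyword works.
ok = infer_schema_stub(add_one, mutates_args=())

# Omitting it is rejected, because it has no default value.
try:
    infer_schema_stub(add_one)
    missing_error = None
except TypeError as exc:
    missing_error = exc

# Passing it positionally is rejected as well.
try:
    infer_schema_stub(add_one, ())
    positional_error = None
except TypeError as exc:
    positional_error = exc
```

Both failures happen at call time, so a caller who forgets `mutates_args` finds out immediately.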
PyTorch MergeBot	9df4bc6a0d	Revert "Constant folding for dynamic shape node (#129686 )" This reverts commit b7d287fbec0a05a3d4c9524006e6bfd1de6a71a0. Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally. Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))	2024-07-15 15:19:24 +00:00
Yu, Guangye	7cd48df2da	Refine the logic of device construction when only device index is given (#129119 ) # Motivation Before this PR, constructing a device from only a device index produced a `cuda` device, or a `PrivateUse1` device if a `PrivateUse1` backend was registered. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) >>> b tensor([1, 2], device='cuda:0') ``` This works well on a CUDA GPU, but on XPU it reports the wrong device type and raises an error. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled ``` This PR refines the logic to use the currently available device type instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119 Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang ghstack dependencies: #129463, #129205, #129363	2024-07-15 14:34:29 +00:00
Yu, Guangye	9cae2160f5	Introduce the concept of Accelerators to PyTorch doc (#129363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #129463, #129205	2024-07-15 14:24:46 +00:00
Zhanghan Wang	74da2a467f	Fix names conflict when lifting (#129817 ) ## Bug description When the pending args that may be lifted [here](`58f346c874/torch/_dynamo/output_graph.py (L1866)`) share the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](`58f346c874/torch/_dynamo/output_graph.py (L2081)`) can end up creating a name ([here](`58f346c874/torch/fx/graph.py (L1008)`)) that overwrites one of the args to lift, which produces a wrong output graph. ## Reproducing Below is a reproducible example, ```python import logging from typing import List import torch from functorch.compile import aot_module_simplified, make_boxed_func @torch.library.custom_op("mylib::somefunc_forward", mutates_args=()) def somefunc_forward( input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: return torch.ones_like(input_) @somefunc_forward.register_fake def _(input_, weight, shape): return torch.empty_like(input_) @torch.library.custom_op("mylib::somefunc_backward", mutates_args=()) def somefunc_backward( grad_output: torch.Tensor, input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: print(f"backward.{grad_output.shape=}") print(f"backward.{input_.shape=}") print(f"backward.{weight.shape=}") print(f"backward.{shape=}") assert list(weight.shape) == shape return torch.ones_like(weight) @somefunc_backward.register_fake def _(grad_output, input_, weight, shape): return torch.empty_like(weight) def a_func(grad_output, input_, weight_, shape): return torch.ones_like(input_.sum() * weight_) class SomeFunc(torch.autograd.Function): @staticmethod def forward(ctx, input, weight, normalized_shape): ctx.normalized_shape = normalized_shape input_ = input.contiguous() weight_ = weight.contiguous() output = somefunc_forward(input_, weight_, ctx.normalized_shape) ctx.save_for_backward(input_, weight_) return output @staticmethod def backward(ctx, grad_output): input_, weight_ = ctx.saved_tensors # grad_weight = 
a_func(grad_output, input_, weight_, ctx.normalized_shape) grad_weight = somefunc_backward( grad_output.contiguous(), input_, weight_, ctx.normalized_shape, ) return None, grad_weight, None class MyModel(torch.nn.Module): def __init__(self): super().__init__() self.weight = torch.nn.Parameter(torch.ones(7)) def forward(self, x): return SomeFunc.apply(x, self.weight, [7]) model = MyModel() torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True) def aot_print_backend(gm, sample_inputs): # Forward compiler capture def fw(gm, sample_inputs): print(f"----- fw") gm.print_readable() return make_boxed_func(gm.forward) # Backward compiler capture def bw(gm, sample_inputs): print(f"----- bw") gm.print_readable() return make_boxed_func(gm.forward) # Call AOTAutograd gm_forward = aot_module_simplified( gm, sample_inputs, fw_compiler=fw, bw_compiler=bw ) return gm_forward model = torch.compile( model, backend=aot_print_backend, dynamic=False, ) out = model(torch.rand((128, 4, 7))) out.mean().backward() ``` The log shows the calls into create_graph_input: ```log V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none) V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none) ``` And the generated backward graph looks like ```log class GraphModule(torch.nn.Module): def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"): contiguous_1 = contiguous contiguous_2 = contiguous_1 # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(), contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous(); somefunc_forward_default = None # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return 
self._opoverload(*args, **kwargs) somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]); contiguous = contiguous_1 = contiguous_2 = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (None, somefunc_backward_default) ``` The original code of `somefunc_backward` takes the arguments `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which triggers the assertion I added in `somefunc_backward`. ## Environment ```log Collecting environment information... PyTorch version: 2.5.0a0+git0b7e8df Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: macOS 14.5 (arm64) GCC version: Could not collect Clang version: 15.0.0 (clang-1500.3.9.4) CMake version: version 3.26.4 Libc version: N/A Python version: 3.9.19 (main, May 6 2024, 14:39:30) [Clang 14.0.6 ] (64-bit runtime) Python platform: macOS-14.5-arm64-arm-64bit Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Apple M3 Pro Versions of relevant libraries: [pip3] numpy==2.0.0 [pip3] optree==0.11.0 [pip3] torch==2.5.0a0+git0b7e8df [pip3] torchgraph==0.0.1 [conda] numpy 2.0.0 pypi_0 pypi [conda] optree 0.11.0 pypi_0 pypi [conda] torch 2.5.0a0+git0b7e8df dev_0 <develop> [conda] torchgraph 0.0.1 dev_0 <develop> ``` ## How to fix? I put in a naive fix that adds the potential args to lift into used_names. This accesses private variables; I will fix that if this issue makes sense to you. 
@zou3519 @oulgen Co-authored-by: rzou <zou3519@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817 Approved by: https://github.com/zou3519	2024-07-15 13:41:46 +00:00
rzou	ee039c0614	[custom_op] triton_op API V0 (#130637 ) This is the initial version of an API to create custom operators whose implementations are backed by triton kernels. While user-defined triton kernels work out-of-the-box with torch.compile, you may wish to construct a custom operator if you need to compose with other PyTorch subsystems, like Tensor subclasses or vmap. I'm hoping to get design feedback on this and ship it so that we can begin experimenting with customers. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637 Approved by: https://github.com/albanD	2024-07-15 13:00:54 +00:00
cyy	6beec34b1c	[structural binding][9/N] Replace std::tie with structural binding (#130404 ) Follows #130544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130404 Approved by: https://github.com/janeyx99	2024-07-15 10:14:52 +00:00
Aaron Gokaslan	ac28ae18dc	[BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827 ) Updates pybind11 submodule to v2.13.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827 Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD	2024-07-15 08:58:56 +00:00
Animesh Jain	1d983bbb28	[easy][inline-inbuilt-nn-module] Update test output (#130681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130681 Approved by: https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #130654, #130420	2024-07-15 06:19:53 +00:00
Animesh Jain	1a266def4f	[dynamo][unsoundness but very controlled] Skip guards on inbuilt nn module hooks (#130420 ) Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x. This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine). Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420 Approved by: https://github.com/jansel ghstack dependencies: #130654	2024-07-15 06:19:53 +00:00
Li-Huai (Allan) Lin	dc7725cc16	[halide-backend] Random number generation (#130211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211 Approved by: https://github.com/jansel	2024-07-15 05:03:24 +00:00
Isuru Fernando	1bc390c5f5	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-15 04:16:17 +00:00
Wu, Chunyuan	a3c0bab502	[inductor] [cpp] use non-temporal tile load for A (#129455 ) Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1. Verified with AMP static and dynamic shapes on a CPU with AMX support: there is no obvious end-to-end performance boost (and no regression either). We expect a performance gain once https://github.com/pytorch/pytorch/pull/129348 (also in this ghstack) is added on top of this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129455 Approved by: https://github.com/jgong5	2024-07-15 04:07:29 +00:00
Nikita Shulga	c547b2e871	Fix python detection in cuda.cmake (#130651 ) If Python package has not been detected previously, call it here This fixes regression introduced by https://github.com/pytorch/pytorch/pull/128801 that results in annoying, but harmless warning reported in https://github.com/pytorch/pytorch/issues/129777 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130651 Approved by: https://github.com/Skylion007	2024-07-15 03:45:31 +00:00
PyTorch MergeBot	c0897919da	Revert " [5/N] Change static functions in headers to inline (#130673 )" This reverts commit 4410c44ae6fd8eb36f2358ac76f7d988ca7537c5. Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes CUDA build 12.1/12.4 to timeout in trunk, I am not sure what I am looking at yet, so attempt to revert to see if it fixes trunk. Plz keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))	2024-07-15 03:27:11 +00:00
cyy	28f6ae2718	[9/N] Replace c10::optional with std::optional (#130674 ) Follows #130509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674 Approved by: https://github.com/Skylion007	2024-07-15 00:48:43 +00:00
Haoci Zhang	774ca93fd2	Added zb1p schedule (#130210 ) Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241. The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof: ![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7) Since ZB2P generates longer schedules in some cases, and we might need a collective for a fault-tolerance all-reduce at the end of every iteration for llama 4, we are holding off on implementing a fancier ZBV schedule for now unless it would be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210 Approved by: https://github.com/H-Huang	2024-07-14 17:32:59 +00:00
cyy	5fe9515d35	[structural binding][8/N] Replace std::tie with structural binding (#130544 ) Follows #130216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130544 Approved by: https://github.com/ezyang	2024-07-14 13:23:20 +00:00
leslie-fang-intel	81322aee74	[Inductor][CPP] Support more than one LocalBuffer (#129121 ) Summary Support more than one Local Buffer in an outer-loop fused node, and also the case where multiple global buffers share the same local buffer. TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion ``` Next Step - [✓] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126967	2024-07-14 11:31:14 +00:00
leslie-fang-intel	adaa0fea5a	[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 ) Summary Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)`). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)`). In the current implementation, the Inductor uses the output buffer of Kernel Group Args to store and load temporary result (such as `exp`), since this buffer is corresponding to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take this `exp` as example) is only utilized internally within decomposed `softmax`, this buffer can be replaced with a thread-local buffer similar to ATen's approach. In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation. 
TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion ``` Next Step - [ ] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-07-14 11:28:10 +00:00
awayzjj	dcaa111dc8	support intersection by polyfill (#130672 ) Fixes https://github.com/pytorch/pytorch/issues/130557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130672 Approved by: https://github.com/anijain2305	2024-07-14 10:44:26 +00:00
Xuehai Pan	4d7bf72d93	[BE][Easy] fix ruff rule needless-bool (SIM103) (#130206 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206 Approved by: https://github.com/malfet	2024-07-14 08:17:52 +00:00
Boyuan Feng	fa5f572748	[cudagraph] fallback to eager if re-record too many times (#129349 ) Summary: CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across function invocations. This assumption reduces the number of tensor copies and improves performance. We also use `check_static_inputs_are_stable()` to check at runtime whether the assumption holds. While the assumption holds in most cases, we recently observed a few cases where it does not: - [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (an nn module) is used in multiple places, and different parameters and buffers with different tensor addresses are passed to it - Some user code changes the tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION) - Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs. Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph when that happens. However, re-recording the same function too many times can introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees. A summary of how static tensor address changes are now handled: - For each child node, check the assumption via `check_invariants`. If it holds, execute the node under the assumption. - If the assumption does not hold for any child node, re-record if the function_id has not been recorded too many times for the current_node. 
- If the function_id has been re-recorded too many times, fall back to the eager function and emit a warning. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349 Approved by: https://github.com/eellison	2024-07-14 04:17:24 +00:00
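The three-step handling summarized in this message can be sketched in plain Python. This is a simplified model under stated assumptions, not the CUDAGraph Trees implementation: `TreeNode`, `dispatch`, and the address tuples are hypothetical, and only the limit of 5 comes from the message. Whether the real check is strict or inclusive at the boundary is also an assumption here.

```python
from collections import defaultdict

MAX_RECORDING = 5  # default of cudagraph_max_recording quoted in the message


class TreeNode:
    """Hypothetical stand-in for one node of the CUDAGraph Trees."""

    def __init__(self):
        # function_id -> static-input address tuples that already have a recording
        self.recordings = defaultdict(list)


def dispatch(node, function_id, static_addrs):
    """Return which path the call takes: "replay", "record" or "eager"."""
    recorded = node.recordings[function_id]
    # 1. A child was recorded with exactly these addresses: replay it.
    if static_addrs in recorded:
        return "replay"
    # 2. No child matches: re-record while the budget is not exhausted.
    #    (Assumption: the boundary check is a strict "<".)
    if len(recorded) < MAX_RECORDING:
        recorded.append(static_addrs)
        return "record"
    # 3. Budget exhausted: run eagerly (the PR also emits a warning here).
    return "eager"


node = TreeNode()
# Seven calls, each with a different parameter address.
paths = [dispatch(node, function_id=0, static_addrs=(addr,)) for addr in range(7)]
# A later call that reuses the first address can still replay its recording.
replayed = dispatch(node, function_id=0, static_addrs=(0,))
print(paths, replayed)
```

Keeping the budget per node and per function_id means one unstable function does not force unrelated functions onto the eager path.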
cyy	4410c44ae6	[5/N] Change static functions in headers to inline (#130673 ) Follows #128286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673 Approved by: https://github.com/ezyang	2024-07-14 03:15:28 +00:00
Shivam Raikundalia	6f275ae4d0	Add kwinputs to Kineto Traces (#130373 ) Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces. Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested. Differential Revision: D59472345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373 Approved by: https://github.com/davidberard98	2024-07-14 00:40:59 +00:00
chilli	f9f85bfc0b	[Inductor] FlexAttention supports partial masking (#130415 ) (#130626 ) This is the new version of https://github.com/pytorch/pytorch/pull/130415 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Approved by: https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626 Approved by: https://github.com/drisspg, https://github.com/yanboliang	2024-07-14 00:37:26 +00:00
William Wen	cbb7e26acd	[3.13, dynamo] fix jump target offset calculation (#130458 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130458 Approved by: https://github.com/jansel ghstack dependencies: #130383, #130384, #130385	2024-07-13 23:32:06 +00:00
William Wen	0b5792c0ae	[3.13, dynamo] fix NULL ordering in symbolic_convert CALL (#130385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130385 Approved by: https://github.com/jansel ghstack dependencies: #130383, #130384	2024-07-13 23:32:05 +00:00
William Wen	87b406d7e5	[3.13, dynamo] codegen TO_BOOL before conditional jump (#130384 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130384 Approved by: https://github.com/jansel ghstack dependencies: #130383	2024-07-13 23:32:02 +00:00
William Wen	92ac9ee83c	[3.13, dynamo] swap null and pop_null in codegen (#130383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130383 Approved by: https://github.com/jansel	2024-07-13 23:31:57 +00:00
Gagan Jain	97cfc65dbc	Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 )" (#130676 ) Summary: Original commit changeset: 80c2ca639146 Original Phabricator Diff: D59612200 Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends' Differential Revision: D59719562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676 Approved by: https://github.com/xunnanxu	2024-07-13 23:19:22 +00:00
Tobias Ringwald	e5de25896f	Fixed CUDA randint generation for large ranges. (#126066 ) Fixes #125224 For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224). This also affects multiple other random functions, such as `torch.rand` and `torch.randn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-07-13 21:42:27 +00:00
PyTorch MergeBot	1f162a5fce	Revert "[Inductor][CPP] Support vectorization of remainder (#129849 )" This reverts commit 5bc18ec0a181fac0994522fefaf664f917d64b86. Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to fails the compilation of executorch benchmark internally ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2227054413))	2024-07-13 19:28:34 +00:00
Animesh Jain	8714b7fc69	[dynamo][cpp-guards] Use dict tags to skip guards on immutable dict getitems (#130654 ) Reduces the guard overhead from 3.7k units to 2.1k units. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130654 Approved by: https://github.com/jansel	2024-07-13 15:31:10 +00:00
cyy	7c83f5f7d5	[8/N] Replace c10::optional with std::optional (#130509 ) Follows #130510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509 Approved by: https://github.com/ezyang	2024-07-13 13:05:36 +00:00
PyTorch MergeBot	0effcb70ef	Revert "[ONNX] Remove beartype usage (#130484 )" This reverts commit f44739cf42e22a569bd1bdb0c113f8a069c17a41. Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/huydhn due to Sorry for reverting your change but those failures show up in trunk after the commit landed `f44739cf42`, I am reverting it to see if it fix trunk ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2226812311))	2024-07-13 07:52:59 +00:00
Aaron Orenstein	567482973d	typing fake_tensor.py (#128041 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041 Approved by: https://github.com/eellison ghstack dependencies: #129182	2024-07-13 06:07:40 +00:00
drisspg	1ad0f38a37	Fix IMAs in FlexAttention + autotuning (#130352 ) # Summary Makes the error message better for non-divisible sequence lengths. Updates: this PR was blocked due to two IMAs. - The first is that when the kv indices end up being an 'arange', i.e. there are non sparse blocks, we end up loading off of kv_indices + 1. - The second I don't really have a clear answer for. We were hitting an IMA here: `9f401187c7/torch/_inductor/kernel/flex_attention.py (L846)` I noticed that for our inputs 2048 and q_blocksize = 128 we were again exactly at 16. Something felt fishy. I suspect we launch one extra sparse_q block, but why only during autotuning... ### Repro: https://gist.github.com/drisspg/f312a66426f3440b7756c6c0cc037f4c ### After this change: ``` ========= COMPUTE-SANITIZER AUTOTUNE flex_attention(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x16, 1x1x16x16) triton_flex_attention_0 2.1118 ms 100.0% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_3 2.4306 ms 86.9% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_1 2.5729 ms 82.1% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_4 2.8035 ms 75.3% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_2 2.8837 ms 73.2% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, 
ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 SingleProcess AUTOTUNE benchmarking takes 0.7225 seconds and 1.5218 seconds precompiling AUTOTUNE flex_attention_backward(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x2048, 1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x16, 1x1x16x16, 1x1x16, 1x1x16x16) triton_flex_attention_backward_30 2.7763 ms 100.0% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_15 3.1404 ms 88.4% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_backward_14 3.2604 ms 85.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_7 3.4176 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_backward_8 3.4182 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=4, num_warps=4 triton_flex_attention_backward_34 3.4939 ms 79.5% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 triton_flex_attention_backward_6 3.6517 ms 76.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 
triton_flex_attention_backward_26 3.7000 ms 75.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 triton_flex_attention_backward_22 4.0120 ms 69.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_18 4.5052 ms 61.6% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 SingleProcess AUTOTUNE benchmarking takes 6.6558 seconds and 6.3567 seconds precompiling torch.Size([1, 1, 2048, 64]) Test completed successfully! ========= ERROR SUMMARY: 0 errors ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130352 Approved by: https://github.com/Skylion007, https://github.com/Chillee	2024-07-13 05:27:39 +00:00
Will Feng	c03e667276	[Inductor][PatternMatcher] Always prevent match across mutations (#130584 ) Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important. Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier? A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization is complicated to maintain, and the changes to partitioner (in `torch/_functorch/partitioners.py`) also feels hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity. Test commands: - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm` - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584 Approved by: https://github.com/jansel	2024-07-13 03:39:21 +00:00
joydddd	3710a79622	Flex Attention HOP: Add support for flex decoding (#129415 ) # Flex Decoding tl;dr This PR adds a `flex_decoding` kernel to the higher-order op `flex_attention` as the backend for multi-head attention decoding. The higher-order op `flex_attention` was introduced in (https://github.com/pytorch/pytorch/pull/121845) to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, to create an efficient fused flash attention kernel instantiation. The `flex_attention` kernel is efficient for attention over long queries (>512 tokens). This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference, where short queries (<32 tokens) attend to long key/value sequences. ### Details LLM decoding iteratively attends each newly generated token ( query length = 1 ) to a long key/value context (up to 132k). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B) and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on modern GPUs. `flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output. ## Examples Consider a Group Query Attention (GQA) decoding case, where a query token of 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends to the newly generated query token (Mq = 1). We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M=Hq//Hkv. 
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and a key and value of shape (B=2, H=Hkv=2, N=4096, D=64), which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64). ```Python import torch from torch.nn.attention._flex_attention import _flex_attention as flex_attention torch.manual_seed(0) # Let's create some input tensors # query of shape (B, Hkv, Hq//Hkv, D) # key/value of shape (B, Hkv, N, D) query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32) key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32) value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32) # Let's create a new score_modification checkerboard. def checkerboard(score, batch, head, token_q, token_kv): score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score) score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score) return score # Let's call flex_attention with this new score modification for decoding. # The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8. output = flex_attention(query, key, value, score_mod=checkerboard) compiled_flex_attention = torch.compile(flex_attention) out_compiled = compiled_flex_attention(query, key, value, score_mod=checkerboard) torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2) ``` ## Future Plans - This PR does not implement a load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, it must be padded to a q length of 16, or to the next 2^n of the query length if q_len > 16, i.e. 
```python q_scale = torch.randn(Hq//Hkv, device="cuda") q_scale = torch.nn.functional.pad(q_scale, (0, 16-Hq//Hkv)) # Pad captured buffer def bias_mod(score, batch, head, q, kv): score = score + q_scale[q] return score ``` - Backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel is lacking mask support and only takes a query length that is a multiple of 128. - Dynamic shape and max_autotuning are currently not working - Add block sparse mask support (#129216 is a draft for flex_attention kernel) - Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415 Approved by: https://github.com/Chillee	2024-07-13 00:41:48 +00:00
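The split-KV reduction described in #129415 above can be illustrated in plain Python. This is only a sketch of the idea, not the Triton kernel from the PR: the function names (`attend`, `attend_block`, `attend_split_kv`) are hypothetical, it handles a single query vector for one head, and it assumes each block reports its unnormalized output, its running maximum, and its sum of exponentials so that blocks can be merged afterwards.

```python
import math


def _scores(q, keys):
    """Dot product of the query with every key."""
    return [sum(a * b for a, b in zip(q, k)) for k in keys]


def attend(q, keys, values):
    """Reference: softmax attention for one query over the whole KV sequence."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attend_block(q, keys, values):
    """Partial result for one KV block: (unnormalized output, block max, sum of exps)."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return acc, m, sum(weights)


def attend_split_kv(q, keys, values, block_size):
    """Process KV blocks independently, then reduce the partial results."""
    partials = [
        attend_block(q, keys[i:i + block_size], values[i:i + block_size])
        for i in range(0, len(keys), block_size)
    ]
    global_max = max(m for _, m, _ in partials)
    dim = len(values[0])
    out = [0.0] * dim
    denom = 0.0
    for acc, m, exp_sum in partials:
        # Rescale each block from its local max to the global max before summing.
        scale = math.exp(m - global_max)
        denom += exp_sum * scale
        for d in range(dim):
            out[d] += acc[d] * scale
    return [o / denom for o in out]
```

Each call to `attend_block` is independent of the others, which is what would let the blocks run in parallel along N; only the final rescale-and-sum step needs all partial results.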
Justin Chu	f44739cf42	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when using torch.onnx. 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `\|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-13 00:08:25 +00:00
Colin Peppler	a7f54c7f8a	[dynamo] add meta fn for aten.kthvalue.default (#130562 ) I saw ``` torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562 Approved by: https://github.com/jingsh, https://github.com/zou3519	2024-07-12 23:48:31 +00:00
Aaron Orenstein	634b62f111	typing proxy_tensor.py (#129182 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182 Approved by: https://github.com/Chillee	2024-07-12 23:17:09 +00:00
PyTorch MergeBot	ea78b0c177	Revert "Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 )" This reverts commit a17d1e5322229a31f868d98987996a04736933a6. Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/izaitsevfb due to internal needs pybind update ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2226499397))	2024-07-12 23:07:37 +00:00
inkcherry	f422027fce	fix torch.linalg.lstsq input check (#130612 ) Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236) The current case does not meet the vector scenario requirements, and it lacks sufficient checks (relying solely on ```dim_diff``` is insufficient). Consequently, it triggers an internal assertion error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612 Approved by: https://github.com/lezcano	2024-07-12 23:06:52 +00:00
Yifu Wang	06ebf87a1e	Fix and improve reorder_compute_for_overlap (#130573 ) Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes. - The unit tests are now fixed and re-enabled. - Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573 Approved by: https://github.com/Chillee ghstack dependencies: #129980	2024-07-12 22:25:49 +00:00
Mikayla Gawarecki	619029e892	[easy] Small rendering fix in Tensor.module_load doc (#130489 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130489 Approved by: https://github.com/janeyx99	2024-07-12 22:12:53 +00:00
rzou	95046c86e3	[HOP] add HOP x torch_dispatch interaction (#130606 ) This involved beefing up the Python dispatcher to handle torch_dispatch. Given a HOP and a torch_dispatch Tensor subclass: - the HOP will show up in the subclass's `__torch_dispatch__` - you can also use HOP.py_impl to register a rule for the HOP x subclass interaction - (coming soon) we'll offer a way to open register HOP x subclass interaction without needing to touch the subclass's `__torch_dispatch__` or the HOP's .py_impl. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606 Approved by: https://github.com/ydwu4	2024-07-12 21:51:36 +00:00
rzou	f093cd4086	Fix custom ops warning during export (#130623 ) Fixes https://github.com/pytorch/pytorch/issues/130588 The problem was we were warning on all custom ops, not just ones marked as CompositeImplicitAutograd. This PR changes the warning to just warn on CompositeImplicitAutograd ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130623 Approved by: https://github.com/williamwen42	2024-07-12 21:34:29 +00:00
Mikayla Gawarecki	7c289c2a5c	Add torch.serialization.safe_globals context manager (#127939 ) Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939 Approved by: https://github.com/albanD	2024-07-12 20:38:43 +00:00
PyTorch MergeBot	f0d7164cb9	Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit 2abc7cc21b8a215f000ac037c316ca178e9ade81. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to breaks meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2226313943))	2024-07-12 20:36:00 +00:00
albanD	103b6ccab2	Increase tolerance for tensorsolve tests (#130620 ) Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620 Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet	2024-07-12 20:08:18 +00:00
Scott Wolchok	af4da0799c	[PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465 ) As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en) is part of the base aarch64 instruction set, so it should work fine on mobile. Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-12 19:46:30 +00:00
dshi7	d727e2f2d1	add total wall time in calculate_time_spent (#130611 ) Fixes #ISSUE_NUMBER Actual wall time is fwd_entire_frame_time + bwd_inductor_compile. `calculate_time_spent` is accessed internally for monitoring use https://fburl.com/code/iiurj5m6. However, summing the values up loses the fwd/bwd breakdown. This PR adds a new key, `total_wall_time`, without affecting dynamo counters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130611 Approved by: https://github.com/oulgen, https://github.com/Yuzhen11	2024-07-12 19:32:44 +00:00
eqy	60fc01d0ab	[CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401 ) Repro from @eellison Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-07-12 18:57:07 +00:00
Bertrand Thia	43b98fa521	Add debug repr to SymNode (#129925 ) Fixes #129403 Create a separate printing function to debug SymNode, since we can't easily change the `__repr__` that is used by GraphModule.recompile() to create a pythonic version of a graph. This is my first contribution, please let me know if there is anything that I should look into in further detail. Thank you for your guidance! 🙏 I hope to contribute more in the future! @aorenste Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925 Approved by: https://github.com/aorenste	2024-07-12 18:31:23 +00:00
Jack Taylor	2c4303c1d1	[ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters (#130617 ) Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: ``` if torch.version.hip is not None: ``` Which was incorrectly replaced by: ``` if self.device_props.type != "hip": ``` Perhaps we need to write some unit tests here in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130617 Approved by: https://github.com/masnesral	2024-07-12 18:29:59 +00:00
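The inverted condition described in #130617 above is easy to reproduce in isolation. The sketch below is illustrative only: `DeviceProps` is a hypothetical stand-in for the real device properties object, and the "fixed" predicate assumes the repair was simply to flip `!=` to `==`, which the commit message implies but does not quote.

```python
from dataclasses import dataclass


@dataclass
class DeviceProps:
    """Hypothetical stand-in for the device properties object; only `type` matters here."""
    type: str


def use_rocm_tuning_buggy(props: DeviceProps) -> bool:
    # Mirrors the mistaken port: true for every device EXCEPT ROCm/HIP.
    return props.type != "hip"


def use_rocm_tuning_fixed(props: DeviceProps) -> bool:
    # Assumed fix: same meaning as the original `torch.version.hip is not None` check.
    return props.type == "hip"


for device in (DeviceProps("hip"), DeviceProps("cuda")):
    print(device.type, use_rocm_tuning_buggy(device), use_rocm_tuning_fixed(device))
```

A unit test along these lines, one assertion per device type, is the kind of coverage the commit message suggests adding.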
Yidi Wu	741c1710e8	[cond] inlining into one of the branches when pred is a python constant (#130493 ) Reland https://github.com/pytorch/pytorch/pull/128709. When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants. We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches, Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493 Approved by: https://github.com/BoyuanFeng	2024-07-12 18:02:09 +00:00
Yidi Wu	0bf9a091ec	[torchbind] add tracing_mode support (#129586 ) Sometimes, it could be difficult to write a fake class e.g. when the original implementation is using some third-party libraries or users are certain that the class is safe to trace with the real object. This PR allows user to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class. Test Plan: `pytest test/export/test_torchbind.py -k safe` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586 Approved by: https://github.com/zou3519	2024-07-12 18:01:47 +00:00
William Wen	c3e77d144e	[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 ) Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython. This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame. We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12. This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185 Approved by: https://github.com/jansel	2024-07-12 17:56:38 +00:00
Tom Ritchford	b0a597fcb4	Fix #121334 : graph break on constant method call (#130158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130158 Approved by: https://github.com/lezcano	2024-07-12 17:34:46 +00:00
Chirag Pandya	4865c6425c	Add new control plane handler (#129712 ) Summary: Add a new control plane handler to retrieve flight recorder data as JSON. Test Plan: Unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129712 Approved by: https://github.com/wconstab	2024-07-12 17:32:01 +00:00
Nikita Shulga	55dc82bef9	[EZ] Make test_pytree_inputs actually run tests on CUDA (#130593 ) Right now it's only running it on CPU even when `self.device` is set to CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/130593 Approved by: https://github.com/angelayi	2024-07-12 17:17:28 +00:00
Pian Pawakapan	988ed4d5db	[export] clean up allow_complex_guards_as_runtime_asserts flag (#130596 ) Summary: removes underscore, cleans up dead code in DimConstraints Test Plan: existing export tests Reviewed By: angelayi Differential Revision: D59612746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596 Approved by: https://github.com/angelayi	2024-07-12 17:17:11 +00:00
Chien-Chin Huang	dafef3ff35	[CP] Make CP loss curve on par with TP (#129515 ) Summary: This PR changes two implementations to make the CP (CP8) loss curve on par with TP (TP8). 1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement. 2. Use the out, grad_out, softmax_lse passed by autograd to do the backward. This implementation is similar to the implementation in transformer engine. The original implementation reruns the SDPA to get the output and logsumexp and uses the recalculated results to infer the corrected softmax_lse. But that implementation does not give better accuracy or a better loss curve. Instead, that implementation converges slower. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515 Approved by: https://github.com/d4l3k, https://github.com/wanchaol ghstack dependencies: #129512, #129514	2024-07-12 16:55:28 +00:00
Nikita Shulga	c35f12c67c	[EZ] Add formatting changes to .git-blame-ignore-revs (#130627 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130627 Approved by: https://github.com/izaitsevfb, https://github.com/clee2000	2024-07-12 16:37:46 +00:00
Aidyn-A	22fd89c904	[TEST][Inductor] Fix scaled_mm call (#130582 ) `_scaled_mm` no longer returns `amax` (see #128683) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130582 Approved by: https://github.com/drisspg	2024-07-12 16:25:18 +00:00
Edward Z. Yang	34e57025e1	Add unsigned int types to torch/types.h (#130616 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130616 Approved by: https://github.com/NicolasHug, https://github.com/albanD	2024-07-12 16:24:29 +00:00
PyTorch MergeBot	2b1df24877	Revert "Make hashing a SymInt raise an error again (#130548 )" This reverts commit 3100455b8eeebdfbc3428ff9454579ac50666faf. Reverted https://github.com/pytorch/pytorch/pull/130548 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels.py https://github.com/pytorch/pytorch/actions/runs/9908970127/job/27377960411 `3100455b8e`. Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130548#issuecomment-2225912018))	2024-07-12 16:20:12 +00:00
leslie-fang-intel	2a1f22e57f	Change BN to eval before QAT Convert phase (#130598 ) Summary In the QAT convert phase, we fold bn into conv and do DCE to this BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` for a safe DCE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598 Approved by: https://github.com/jgong5, https://github.com/yushangdi	2024-07-12 16:03:56 +00:00
titaiwangms	18418a7dbb	[ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586 ) The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586 Approved by: https://github.com/justinchuby	2024-07-12 15:47:59 +00:00
Mayank Mishra	e5657024b5	Fix loss_parallel with BF16 logits (#130550 ) Fixes #130549 This PR uses the specific dtype for the `grad_input` buffer and fixes the error Pull Request resolved: https://github.com/pytorch/pytorch/pull/130550 Approved by: https://github.com/tianyu-l	2024-07-12 15:47:38 +00:00
Shangdi Yu	ea4b80e6d6	[FX][export] strict DCE pass, check schema for node impurity (#130552 ) Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py` caused by dead code elimination removing nodes with side effects. For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully. A call_function node is impure if it has at least one mutable argument. Fixed the tests below: test_to_module_with_mutated_buffer_multiple_update_sub_later, test_export_input_mutation_static_shape, test_buffer_util. Another attempt at modifying the original DCE pass was made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552 Approved by: https://github.com/pianpwk	2024-07-12 15:43:27 +00:00
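The impurity rule in #130552 above ("a call_function node is impure if it has at least one mutable argument") can be shown on a toy graph. This is a minimal sketch and not the FX pass itself: `Node`, its `mutates_args` flag (standing in for a schema check), and `eliminate_dead_code` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy graph node. `mutates_args` stands in for 'schema has a mutable argument'."""
    name: str
    inputs: List[str] = field(default_factory=list)
    mutates_args: bool = False


def eliminate_dead_code(nodes: List[Node], outputs: List[str]) -> List[Node]:
    """Drop nodes whose results are unused, but always keep impure (mutating) nodes."""
    live = set(outputs)
    kept = []
    # Walk backwards so a node's users are seen before the node itself.
    for node in reversed(nodes):
        if node.name in live or node.mutates_args:
            kept.append(node)
            live.update(node.inputs)
    kept.reverse()
    return kept


graph = [
    Node("x"),
    Node("buf"),
    Node("unused_add", inputs=["x"]),                     # pure and unused: removable
    Node("copy_", inputs=["buf", "x"], mutates_args=True),  # unused result, but mutates buf
    Node("out", inputs=["x"]),
]
print([n.name for n in eliminate_dead_code(graph, outputs=["out"])])
```

A pass that looked only at whether a result is used would also drop `copy_` here, silently losing the buffer update; keeping mutating nodes (and, transitively, their inputs) avoids that.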
Nikita Shulga	febadda107	[MPS] Fix `torch.[all\|any]` for 5+D tensors (#130542 ) Work around a bug in `reductionAndWithTensor:` that kills the app with the following assert if given a 5+D tensor as input ``` Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76. ``` by reshaping the tensor to a 2D/3D one before running the reduction. Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue. Enabled `test_reduction_ops_5D` and added a regression test to it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #130541	2024-07-12 15:06:22 +00:00
Bert Maher	d443fbc025	[inductor] Cache precompilation functions based on configs (#130350 ) Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs. Add the config's hashes to the key to avoid this problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350 Approved by: https://github.com/eellison	2024-07-12 14:21:49 +00:00
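The caching problem described in #130350 above can be reproduced with a small dictionary-backed pool. The sketch below is an assumption-laden illustration, not Inductor's code: `PrecompilePool`, `config_hash`, and the config dictionaries are all hypothetical, and a list of backend names stands in for the pool of futures.

```python
import hashlib


def config_hash(config: dict) -> str:
    """Stable hash of one config dict (hypothetical helper)."""
    payload = repr(sorted(config.items())).encode()
    return hashlib.sha256(payload).hexdigest()


class PrecompilePool:
    """Toy cache of precompilation results, keyed with or without config hashes."""

    def __init__(self, key_includes_configs: bool):
        self.key_includes_configs = key_includes_configs
        self._pool = {}

    def _key(self, name, configs):
        if self.key_includes_configs:
            return (name, tuple(config_hash(c) for c in configs))
        return (name,)

    def precompile(self, name, configs):
        key = self._key(name, configs)
        if key not in self._pool:
            # Stand-in for submitting compile jobs and storing their futures.
            self._pool[key] = [c["backend"] for c in configs]
        return self._pool[key]


triton = [{"backend": "triton", "BLOCK_M": 64}]
cutlass = [{"backend": "cutlass", "tile": 128}]

old = PrecompilePool(key_includes_configs=False)
old.precompile("mm", triton)
stale = old.precompile("mm", cutlass)   # same key, so the Triton entry comes back

new = PrecompilePool(key_includes_configs=True)
new.precompile("mm", triton)
fresh = new.precompile("mm", cutlass)   # distinct key, so Cutlass gets its own entry
```

With the name-only key, the second set of choices never gets precompiled; adding the config hashes to the key gives each set its own entry.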
rzou	9c69684af8	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-12 14:13:01 +00:00
rzou	ba941769b5	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-12 14:13:01 +00:00
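The open-registration idea in #130064 above, a rule keyed on a (torch_dispatch_class, operator) pair, can be sketched without PyTorch. Everything here is hypothetical: `register_rule`, `dispatch`, and `LoggingValue` are illustrative names, operators are plain strings, and lookup uses the exact type only, which may differ from how the real mechanism resolves subclasses.

```python
from typing import Any, Callable, Dict, Tuple

# Registry mapping (class, operator name) -> rule. Hypothetical, for illustration only.
_RULES: Dict[Tuple[type, str], Callable[[Any], Any]] = {}


def register_rule(cls: type, op_name: str, rule: Callable[[Any], Any]) -> None:
    """Attach a rule for one (class, operator) pair without editing the class."""
    _RULES[(cls, op_name)] = rule


def dispatch(op_name: str, arg: Any, fallback: Callable[[Any], Any]) -> Any:
    """Run the registered rule for this pair if there is one, else the fallback."""
    rule = _RULES.get((type(arg), op_name))
    return rule(arg) if rule is not None else fallback(arg)


class LoggingValue:
    """Stand-in for a third-party subclass whose source we cannot modify."""

    def __init__(self, value: int):
        self.value = value


def default_add_one(x: LoggingValue) -> int:
    return x.value + 1


# Extend the (LoggingValue, "add_one") interaction from the outside.
register_rule(LoggingValue, "add_one", lambda x: ("logged", x.value + 1))

with_rule = dispatch("add_one", LoggingValue(1), default_add_one)
without_rule = dispatch("mul_two", LoggingValue(1), lambda x: x.value * 2)
```

The point of the design is the one stated in the commit: a user can supply behavior for a specific pair even when neither the subclass nor the operator is theirs to edit.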
Edward Z. Yang	ae3ac9cb64	Only test _is_param if doing instance check on Parameter base (#130578 ) Fixes https://github.com/pytorch/pytorch/issues/111348 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130578 Approved by: https://github.com/Skylion007	2024-07-12 13:55:13 +00:00
Edward Z. Yang	6f54e961ea	Add trace_shape_events artifact tracing for ShapeEnv events (#130473 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130473 Approved by: https://github.com/lezcano	2024-07-12 13:50:25 +00:00
Edward Z. Yang	3100455b8e	Make hashing a SymInt raise an error again (#130548 ) See https://github.com/pytorch/pytorch/issues/130547 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-07-12 13:49:56 +00:00
Will Constable	b75cc70875	[Pipelining] add looped schedules to fsdp/ddp test (#130563 ) It feels like an oversight that these were not tested, especially since the test case already handles multi schedules specially but no multi-schedules were being tested Pull Request resolved: https://github.com/pytorch/pytorch/pull/130563 Approved by: https://github.com/H-Huang	2024-07-12 13:39:47 +00:00
PyTorch MergeBot	da030e7add	Revert "[Inductor] FlexAttention supports partial masking (#130415 )" This reverts commit 207564bab1c4fe42750931765734ee604032fb69. Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))	2024-07-12 13:20:18 +00:00
Yanbo Liang	207564bab1	[Inductor] FlexAttention supports partial masking (#130415 ) This is the new version of #130235 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415 Approved by: https://github.com/Chillee	2024-07-12 07:19:28 +00:00
Chien-Chin Huang	e568c91a7b	[CP] Fix the incorrect ring schedule in the fwd and bwd (#129514 ) Summary: 1. The argument order for all_to_all_single is "block, output_split_size, input_split_sizes, pg". 2. Uses the correct ring order for the grad_kv. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129514 Approved by: https://github.com/d4l3k, https://github.com/drisspg, https://github.com/wanchaol ghstack dependencies: #129512	2024-07-12 07:05:36 +00:00
Chien-Chin Huang	0d8dedb01b	[dtensor] Add dtensor to TORCH_LOGS (#129512 ) Summary: Add the basic log for dispatcher of dtensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/129512 Approved by: https://github.com/wanchaol, https://github.com/XilunWu	2024-07-12 06:50:53 +00:00
Harshavardhan Reddy Bommireddy	b6215f44ef	DCP checkpoint_dist_client integration (#130452 ) Summary: Integrate scope tracking with `checkpoint/fb/logging_handlers.py`. Add a map of uuid -> tracker context manager. When the logging handler receives one of the following events: * `start`: create a scope_tracker object, call `__enter__`, and add it to the map under its uuid * `end`: retrieve the scope_tracker object by uuid and call `__exit__`. * `exception`: retrieve the scope_tracker object by uuid and call `__exit__` with the current exception info. Test Plan: Test with a bento notebook (attached), with a runtime_error in the finish_checkpoint method. scuba records: https://fburl.com/scuba/workflow_signpost/ddttgmv2 Differential Revision: D56654417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452 Approved by: https://github.com/LucasLLC	2024-07-12 06:01:56 +00:00
Tarun Karuturi	ff25dfca5a	Save quantization_tag in export graph serialization (#127473 ) Summary: `quantization_tag` is first-class metadata in quantization flows and is preserved by them. Since we want to store quantized exported graphs, we also need to preserve this metadata, as it is used in later flows. Only JSON-supported metadata is allowed to be serialized. Test Plan: Added test case Differential Revision: D57939282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473 Approved by: https://github.com/angelayi	2024-07-12 05:06:40 +00:00
eellison	b7d287fbec	Constant folding for dynamic shape node (#129686 ) Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937 joint work with @imzhuhl Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-12 03:44:29 +00:00
Sijia Chen	ae0edadea0	[SDPA] Replace `masked_fill_` with `aten::where` (#130281 ) Summary: full context in D59385876 Based on the offline discussion with PT2 folks, we switched to change the SDPA impl to mitigate the AOTI lowering issue Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile Differential Revision: D59495634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281 Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby	2024-07-12 03:04:31 +00:00
yan-yhy	c16e90fe06	The device_suffix in a test_name is "privateuse1" sometimes. (#130091 ) When running some test cases on the privateuse1 device, the device_suffix in a test_name is sometimes 'privateuse1'. For example, a test_name that should be 'test_Dropout1d_npu' is sometimes 'test_Dropout1d_privateuse1'. When setUpClass() doesn't set it, the device_suffix is "privateuse1". Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091 Approved by: https://github.com/zou3519	2024-07-12 02:51:40 +00:00
Yifu Wang	9ae40c6bc0	Fix and improve raise_comms and sink_waits (#129980 ) The tests for the `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes. Correctness issues: - The original passes did not take mutation into consideration and may yield a semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput). Effectiveness issues: - The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all. - The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node. This PR: - Make the passes handle mutation correctly. - Instead of moving individual comm/wait nodes, schedule all nodes using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues. - The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it. - Enable the tests in CI. Co-authored-by: Will Feng <yf225@cornell.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980 Approved by: https://github.com/yf225	2024-07-12 01:55:47 +00:00
Will Feng	c6a676add4	[Traceable FSDP2][Inductor] Add GroupedSchedulerNode to contain nodes that must be scheduled together (#128568 ) As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them). This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements: - At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise all of AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedScheduleNode which provides simple connection points to build the "chaining" on. i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake dep between the GroupedScheduleNode, which is a very clean and easy-to-understand way to express this schedule. - The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`. From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them). ---- Q: Can we leverage `ir.Subgraph`? A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that: 1. `ir.Subgraph` requires defining the subgraph in FX IR. 2. 
There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node. 3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR). ---- Test commands: - `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568 Approved by: https://github.com/eellison, https://github.com/mlazos	2024-07-12 01:42:38 +00:00
Michael Lazos	c101c4517a	Add python type for list iterators (#130511 ) Fixes https://github.com/pytorch/pytorch/issues/117026 Also not sure why this was missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130511 Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/anijain2305	2024-07-12 01:14:18 +00:00
PyTorch MergeBot	536b5b19b5	Revert "Simplify c10::string_view (#130009 )" This reverts commit 10c7f037fe3271cb3865816c216007ba403f5347. Reverted https://github.com/pytorch/pytorch/pull/130009 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/130009#issuecomment-2224223526))	2024-07-12 00:46:49 +00:00
Feny Patel	7f2436014e	add MTIA as valid device type for prof averages (#130340 ) Summary: Add MTIA as valid device option for getting profile averages Test Plan: Tested with auto-trace on MTIA Differential Revision: D59486392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130340 Approved by: https://github.com/aaronenyeshi	2024-07-12 00:39:01 +00:00
PyTorch MergeBot	7ce5b5767c	Revert "Make c10::string_view an alias of std::string_view (#130417 )" This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac. Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))	2024-07-12 00:37:04 +00:00
Shivam Raikundalia	b5b91b418d	[Easy] Update record_function Comment (#130561 ) Summary: Users have been confused about why user annotations on GPU tracks do not show when doing GPU-only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled. Test Plan: N/A it is just updating a comment Differential Revision: D59649390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561 Approved by: https://github.com/aaronenyeshi	2024-07-11 23:51:25 +00:00
Pian Pawakapan	18b7633bfb	[export] fix kwargs in run_decompositions() for training IR (#130553 ) Re-exporting GraphModule expects all inputs to be in args, though not in pytree-flattened format. This avoids failing when we run with a fx.Interpreter subclass in [AOTAutograd tracing](`973037be6a/torch/_functorch/_aot_autograd/traced_function_transforms.py (L760-L762)`). Removes 7 test failures for training IR export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130553 Approved by: https://github.com/zhxchen17, https://github.com/ydwu4	2024-07-11 22:53:18 +00:00
Yidi Wu	26c2b92525	[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 ) Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph. This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident. Test Plan: Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680 Approved by: https://github.com/angelayi	2024-07-11 22:46:21 +00:00
Edward Z. Yang	9c6c0deadc	Add eager_compile_backwards_failure to tlparse (#130434 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130434 Approved by: https://github.com/albanD	2024-07-11 22:35:33 +00:00
PyTorch MergeBot	d97d962082	Revert "Add decompositions for copy variants of view ops (#128416 )" This reverts commit 68751799b85aa7f659420801bdbb8451f01ab09a. Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))	2024-07-11 22:09:23 +00:00
PyTorch MergeBot	a2f630a9a4	Revert "Decompose expand_copy and permute_copy (#129476 )" This reverts commit 7d4cb2109823f1c4001dff62b461bb9eda07ca17. Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))	2024-07-11 22:06:15 +00:00
eellison	fc872e98f3	Infer prim tags from equivalent aten ones (#130367 ) Take intersection of all the tags for corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed but thats a separate issue). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367 Approved by: https://github.com/ezyang	2024-07-11 20:53:52 +00:00
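The rule in #130367 (a prim gets only the tags shared by every corresponding aten overload) amounts to a set intersection. A minimal sketch, with a hypothetical `infer_prim_tags` helper and plain strings standing in for tag objects:

```python
def infer_prim_tags(overload_tags):
    """overload_tags maps an overload name to the set of tags it carries."""
    tag_sets = [set(tags) for tags in overload_tags.values()]
    if not tag_sets:
        return set()
    # A tag survives only if every overload carries it.
    return set.intersection(*tag_sets)


tags = infer_prim_tags({
    "default": {"pointwise", "core"},
    "out": {"pointwise"},
})
print(tags)
```

Taking the intersection is the conservative choice: one untagged overload is enough to leave the prim untagged.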
Zhengxu Chen	726a287271	[export] Expand verifier to be multiple on ExportedProgram (#130364 ) Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field. Test Plan: CI Differential Revision: D59408546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-07-11 20:34:49 +00:00
mengph	5c6edd29ec	Turn on splitShare=1 to make the optimization of comm_split effective. (#129929 ) Fixes #129865 Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit will bring performance and memory benefits. However, the config parameter of the ncclCommSplit function in pytorch does not set "splitShare=1", which results in the optimization of ncclCommSplit being turned off and the benefits being lost. This PR turns on splitShare=1 to make the optimization of comm_split effective. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929 Approved by: https://github.com/shuqiangzhang	2024-07-11 20:14:58 +00:00
Nikita Shulga	c50b189280	Move trunk windows builds to CUDA-12.1 (#130446 ) That should catch build regressions that were previously only detectable during the nightly builds Win + CUDA-11.8 builds and tests are still run as part of periodic workflow Pull Request resolved: https://github.com/pytorch/pytorch/pull/130446 Approved by: https://github.com/atalman	2024-07-11 19:50:57 +00:00
Tijmen Blankevoort	bc18863713	Corner-case fix for upscale_histogram in the new HistogramObserver (#130316 ) Summary: Small fix to the bucketize function that caused a run-time error in some corner cases. Test Plan: Unit tests Differential Revision: D59508432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130316 Approved by: https://github.com/jerryzh168	2024-07-11 19:49:21 +00:00
Yidi Wu	cd9bae30de	Allow kwargs in _remove_effect_tokens_pass (#130491 ) Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes it and adds a test for it. Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs Reviewed By: angelayi Differential Revision: D59603147 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491 Approved by: https://github.com/angelayi	2024-07-11 19:03:19 +00:00
PyTorch MergeBot	578388bed8	Revert "Support for expandable segments with cuda graph trees (#128068 )" This reverts commit fdc83610f272610ce50d1a6f5b6354f2df1baabb. Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))	2024-07-11 18:58:13 +00:00
Yidi Wu	1cae60a87e	Caching attr_proxy for nn_module attribute to fix guard check failure (#130280 ) Fixes https://github.com/pytorch/pytorch/issues/129939 Differential Revision: [D59594605](https://our.internmc.facebook.com/intern/diff/D59594605) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130280 Approved by: https://github.com/anijain2305	2024-07-11 18:21:35 +00:00
Chien-Chin Huang	0a4fe2ff86	[DSD] Use no_grad() to make some operations faster and avoid possible memory leakage (#130355 ) Use no_grad() to make some operations faster and avoid possible memory leakage Pull Request resolved: https://github.com/pytorch/pytorch/pull/130355 Approved by: https://github.com/wz337	2024-07-11 18:18:08 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals:
- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:
```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```
```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```
The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
PyTorch MergeBot	492de213e2	Revert "Change deprecated warning on dispatch_on_subclass to warn once (#130047 )" This reverts commit f21a21828ac6e16d903ee88f726fdb2278c04782. Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/albanD due to The failure on the PR are valid, they should not have been ignored ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2223488933))	2024-07-11 17:24:02 +00:00
Iris Zhang (PyTorch)	f21a21828a	Change deprecated warning on dispatch_on_subclass to warn once (#130047 ) Summary: Right now the deprecated warning fires on every operator that calls into torch_function. Changing it to TORCH_WARN_ONCE instead. More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/ Test Plan: Sandcastle Differential Revision: D59338775 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047 Approved by: https://github.com/XilunWu	2024-07-11 17:02:26 +00:00
wz337	3896ba3260	[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 ) Fixes #ISSUE_NUMBER As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different. This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases. As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash. ``` test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-07-11 17:02:18 +00:00
Dmitry Nikolaev	72d9135679	increase tensor size to force out of memory exception on the latest generations of GPUs (#130334 ) This PR fixes profiler/test_profiler.py::.TestProfiler::test_oom_tracing Test expects OOM by allocating huge tensor. But MI300X has enough memory to allocate such a tensor. This PR increases tensor size with a large margin to force OutOfMemory exception on MI300X and future GPU generations Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334 Approved by: https://github.com/jeffdaily, https://github.com/janeyx99	2024-07-11 16:59:40 +00:00
Nikita Shulga	9c1ba5ac10	[BE] Cleanup unused vars in MPS (#130541 ) And move `using namespace mps` outside of every function, as there is no need to repeat it. Use `getTensorsStringKey` instead of explicit `getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541 Approved by: https://github.com/Skylion007	2024-07-11 16:48:03 +00:00
Edward Z. Yang	68ad3eb722	Do not set hints for mark_unbacked quantities (#130483 ) Fixes https://github.com/pytorch/pytorch/issues/130456 When we mark_unbacked a size, we actually DO have a hint for it (because we have a real input tensor for it), and previously, we were accidentally putting it into the hint field of SymNode. If the marked unbacked size is zero or one, this can lead to inconsistency between hint compute and static evaluation compute under guard size oblivious, since that's the whole point of size oblivious. The answer is to scrub out hints on mark unbacked ints. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483 Approved by: https://github.com/lezcano	2024-07-11 15:51:00 +00:00
chuanqiw	ca023f77bc	[CD] Add pytorch xpu wheel build in nightly (#129560 ) Add pytorch xpu wheel build in nightly after the xpu build image enabling PR https://github.com/pytorch/builder/pull/1879 merged Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560 Approved by: https://github.com/atalman	2024-07-11 15:49:04 +00:00
Shangdi Yu	fb9bc6d74a	[custom op] add doc for CustomOpDef.set_kernel_enabled (#130406 ) <img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096"> <img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406 Approved by: https://github.com/zou3519	2024-07-11 15:47:35 +00:00
James Wu	5ed72ff5f5	Reduce all tensors to their metadata in AOTAutogradCache; add tests (#128583 ) This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this. One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior. Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583 Approved by: https://github.com/oulgen ghstack dependencies: #128335	2024-07-11 15:39:09 +00:00
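The memoization effect that #128583 works around can be reproduced with the standard `pickle` module alone. This sketch is not the AOTAutogradCache code; `dumps` is a local helper, and the example relies on CPython building a fresh string object for each `"".join(...)` call with several parts.

```python
import io
import pickle


def dumps(obj, fast):
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
    pickler.fast = fast  # True disables the memo table
    pickler.dump(obj)
    return buf.getvalue()


# Two equal strings that are distinct objects.
a = "".join(["cache", "_key"])
b = "".join(["cache", "_key"])

shared = [a, a]    # second element is the same object as the first
distinct = [a, b]  # equal values, different identities

# With memoization, the repeated object becomes a back-reference,
# so equal values can pickle to different bytes.
print(dumps(shared, fast=False) == dumps(distinct, fast=False))
# Without memoization, only the values matter.
print(dumps(shared, fast=True) == dumps(distinct, fast=True))
```

Note that `fast` mode cannot handle self-referential objects, since it no longer tracks what has already been written.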
Oguz Ulgen	be7bf20234	Add JK to enable fx graph cache for amd (#130463 ) Test Plan: ad hoc testing Differential Revision: D59593961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130463 Approved by: https://github.com/nmacchioni, https://github.com/mxz297	2024-07-11 15:28:38 +00:00
Jiang, Yanbing	6f662e9575	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple the int4 model checkpoint from different ISAs and different platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being re-generated on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`. Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto a device. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression from this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-11 15:26:48 +00:00
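The `[n][k] int32` to `[n][k / 2] uint8` change in #129940 stores two 4-bit values per byte. A pure-Python sketch of one row follows; the helper names are hypothetical, and placing the even-indexed value in the high nibble is an assumption rather than something the commit message states.

```python
def pack_int4_row(row):
    """Pack k int4 values (each 0..15) into k // 2 bytes."""
    assert len(row) % 2 == 0 and all(0 <= v <= 15 for v in row)
    # Assumption: even index -> high nibble, odd index -> low nibble.
    return bytes((row[i] << 4) | row[i + 1] for i in range(0, len(row), 2))


def unpack_int4_row(packed):
    """Inverse of pack_int4_row."""
    out = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out


row = [1, 2, 15, 0]
packed = pack_int4_row(row)
print(packed.hex())
```

Four int32 values occupy 16 bytes and pack into 2 bytes, which is the `1 / 8` size reduction the commit message mentions.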
cyy	c4a2b6a943	[2/N] Fix NVCC warnings (#130214 ) Follows #130191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130214 Approved by: https://github.com/ezyang	2024-07-11 14:46:53 +00:00
Animesh Jain	a833582dbb	[dynamo][tuple] Optimize guard for small tuples - helps conv2d guards (#130400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130400 Approved by: https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #130285, #130368, #130416	2024-07-11 14:13:24 +00:00
Animesh Jain	f7d7b94017	[dynamo][unspecialized-nn-module] Distinguish between user-defined and builtin nn module (#130416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130416 Approved by: https://github.com/jansel ghstack dependencies: #130285, #130368	2024-07-11 14:13:24 +00:00
Animesh Jain	fed8b0055f	[dynamo][bugfix] Fix the value for key manager (#130368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130368 Approved by: https://github.com/jansel ghstack dependencies: #130285	2024-07-11 14:13:19 +00:00
Animesh Jain	9c612df504	[dynamo][cpp-guards][QOL] Print NO_TENSOR_ALIASING guard once (#130285 ) NO_TENSOR_ALIASING guard lists all tensors. Printing it on every occurrence is ugly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130285 Approved by: https://github.com/jansel	2024-07-11 14:13:14 +00:00
cdzhan	bac10cdd6f	[DCP] Fix duplicated logging messages when enable both c10d and dcp logger (#130423 ) Fixes #129951 . Would you take a moment to review it? @LucasLLC Pull Request resolved: https://github.com/pytorch/pytorch/pull/130423 Approved by: https://github.com/Skylion007	2024-07-11 13:43:39 +00:00
Yifu Wang	0d66ccaf23	[IntraNodeComm] fix an issue where input check fails when running all-reduce on sub groups (#130492 ) Tested against the following snippet with `ENABLE_INTRA_NODE_COMM=1`.
```python
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(f"cuda:{local_rank}")
    dist.init_process_group("nccl")

    draft_group = dist.new_group([0, 1, 2, 3])
    target_group = dist.new_group([4, 5, 6, 7])

    inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp)
    expect = sum(range(world_size))
    assert inp.eq(expect).all()

    if 0 <= rank < 4:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=draft_group)
        expect = sum(range(4))
        assert inp.eq(expect).all()
    else:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=target_group)
        expect = sum(range(4, 8))
        assert inp.eq(expect).all()

    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130492 Approved by: https://github.com/Chillee	2024-07-11 13:39:14 +00:00
PyTorch MergeBot	f261c6ebe8	Revert "[halide-backend] Update CI pin (#130258 )" This reverts commit 4fcfd475bea24b832da32a0c4d464dd87c73a2a9. Reverted https://github.com/pytorch/pytorch/pull/130258 on behalf of https://github.com/albanD due to Seems to have broken trunk pretty bad `4fcfd475be` ([comment](https://github.com/pytorch/pytorch/pull/130258#issuecomment-2222935064))	2024-07-11 13:26:01 +00:00
albanD	354edb232a	Make public binding test only consider files that are packaged in the wheels (#130497 ) In particular, when creating the PyTorch wheel, we use setuptools find_packages `551b3c6dca/setup.py (L1055)` which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages. So this PR is reverting the change to stop skipping these namespace packages as, even though they are in the codebase, they are not in the published binaries and so we're ok relaxing the public API and importability rules for them. A manual diff of the two traversal methods: ``` torch._inductor.kernel.bmm torch._inductor.kernel.conv torch._inductor.kernel.flex_attention torch._inductor.kernel.mm torch._inductor.kernel.mm_common torch._inductor.kernel.mm_plus_mm torch._inductor.kernel.unpack_mixed_mm torch._strobelight.examples.cli_function_profiler_example torch._strobelight.examples.compile_time_profile_example torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks torch.ao.quantization.experimental.APoT_tensor torch.ao.quantization.experimental.adaround_fake_quantize torch.ao.quantization.experimental.adaround_loss torch.ao.quantization.experimental.adaround_optimization torch.ao.quantization.experimental.apot_utils torch.ao.quantization.experimental.fake_quantize torch.ao.quantization.experimental.fake_quantize_function torch.ao.quantization.experimental.linear torch.ao.quantization.experimental.observer torch.ao.quantization.experimental.qconfig torch.ao.quantization.experimental.quantizer torch.csrc.jit.tensorexpr.codegen_external 
torch.csrc.jit.tensorexpr.scripts.bisect torch.csrc.lazy.test_mnist torch.distributed._tensor.examples.checkpoint_example torch.distributed._tensor.examples.comm_mode_features_example torch.distributed._tensor.examples.comm_mode_features_example_argparser torch.distributed._tensor.examples.convnext_example torch.distributed._tensor.examples.torchrec_sharding_example torch.distributed._tensor.examples.visualize_sharding_example torch.distributed.benchmarks.benchmark_ddp_rpc torch.distributed.checkpoint.examples.async_checkpointing_example torch.distributed.checkpoint.examples.fsdp_checkpoint_example torch.distributed.checkpoint.examples.stateful_example torch.distributed.examples.memory_tracker_example torch.fx.experimental.shape_inference.infer_shape torch.fx.experimental.shape_inference.infer_symbol_values torch.include.fp16.avx torch.include.fp16.avx2 torch.onnx._internal.fx.analysis.unsupported_nodes torch.onnx._internal.fx.passes._utils torch.onnx._internal.fx.passes.decomp torch.onnx._internal.fx.passes.functionalization torch.onnx._internal.fx.passes.modularization torch.onnx._internal.fx.passes.readability torch.onnx._internal.fx.passes.type_promotion torch.onnx._internal.fx.passes.virtualization torch.utils._strobelight.examples.cli_function_profiler_example torch.utils.benchmark.examples.sparse.compare torch.utils.benchmark.examples.sparse.fuzzer torch.utils.benchmark.examples.sparse.op_benchmark torch.utils.tensorboard._convert_np torch.utils.tensorboard._embedding torch.utils.tensorboard._onnx_graph torch.utils.tensorboard._proto_graph torch.utils.tensorboard._pytorch_graph torch.utils.tensorboard._utils torch.utils.tensorboard.summary torch.utils.tensorboard.writer ``` These are all either namespace packages (which we want to remove) or package that are not importable (and tagged as such in the test). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497 Approved by: https://github.com/aorenste	2024-07-11 13:22:04 +00:00
Eddie Yan	215013daad	[cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494 ) Limit cuDNN SDPA to head-dim 128 globally. Apparently the support for 256 is only for the forward on sm90+, which would be clunky to maintain as it would mean dispatching different for forward/backward. CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130494 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2024-07-11 13:21:18 +00:00
cyy	9822fdc354	[7/N] Replace c10::optional with std::optional (#130510 ) Follows #130438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130510 Approved by: https://github.com/janeyx99	2024-07-11 13:21:05 +00:00
Wang, Eikan	f52b2ee90f	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`. Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman	2024-07-11 13:17:25 +00:00
Edward Z. Yang	2a51ccc77e	When translation validation is enabled, assert that hint is consistent (#130478 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130478 Approved by: https://github.com/lezcano	2024-07-11 13:02:31 +00:00
cyy	c9551a3f50	Make c10::string_view an alias of std::string_view (#130417 ) Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417 Approved by: https://github.com/ezyang	2024-07-11 12:31:06 +00:00
cyy	c5b66c3fe1	Enable -Werror=pedantic on torch targets (#130319 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130319 Approved by: https://github.com/ezyang	2024-07-11 12:27:32 +00:00
Isuru Fernando	5db9bd467e	Skip test_nnc_correctness for new op _unsafe_masked_index (#130375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375 Approved by: https://github.com/lezcano	2024-07-11 08:17:16 +00:00
Benson Ma	b1942a1af4	[fbgemm_gpu] Break up `fbgemm_cuda_utils.cuh`, pt 10 (#130468 ) Summary: X-link: https://github.com/pytorch/FBGEMM/pull/2814 X-link: https://github.com/facebookresearch/FBGEMM/pull/19 - Break up `fbgemm_cuda_utils.cuh`, pt 10 Test Plan: ``` buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/jagged/... \| grep -v '-' \| xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % \|\| exit 255' buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/tbe/... \| grep -v '-' \| xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % \|\| exit 255' buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/sparse/... \| grep -v '-' \| xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % \|\| exit 255' buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//smart/inference_platform_sp/llm_predictor_amd:service buck2 build --flagfile fbcode//mode/amd-gpu fbcode//hpc/ops:sparse_ops buck2 build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/benchmarks/operator_benchmark/pt:add_test ``` Reviewed By: spcyppt Differential Revision: D59545097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130468 Approved by: https://github.com/ezyang	2024-07-11 07:10:27 +00:00
Xu Han	79c41bb58a	[inductor] switch CppCodeCache to new cpp_builder. (#130132 ) Changes: 1. switch CppCodeCache to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-11 07:03:43 +00:00
Wanchao Liang	75ab027fbb	[dtensor] move bernoulli to op strategy (#130286 ) as titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/130286 Approved by: https://github.com/awgu, https://github.com/yifuwang	2024-07-11 06:43:11 +00:00
Bilal Khan	fdc83610f2	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. 
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. 
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/zdevito, https://github.com/eqy	2024-07-11 05:33:09 +00:00
fduwjj	da24823e06	[BE][EZ] Migrate to new dcp save and load APIs (#130475 ) While experimenting with DCP for distributed inference, I found that we are still using deprecated DCP APIs, even in unit tests. This PR switches to the new API with the unified lowercase "dcp" name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475 Approved by: https://github.com/wz337	2024-07-11 04:13:39 +00:00
Will Feng	5835ff1ed5	[Easy][Inductor] Add comment for .min_order and .max_order (#130390 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130390 Approved by: https://github.com/anijain2305	2024-07-11 03:58:03 +00:00
Shangdi Yu	a4576dad34	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-11 03:39:07 +00:00
Will Constable	9f401187c7	[pipelining] Refactor test_schedule to fix "-k" (#130294 ) This is kind of a short-sighted workaround and we should actually come up with a way to fix this in general, but I got annoyed that I can't use -k to filter tests in test_schedule, and realized it's because we jam tests using the new MultiProcContinuousTest fixture together with old-style tests. For now I separate the two types of tests so -k works again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294 Approved by: https://github.com/H-Huang	2024-07-11 03:18:02 +00:00
Mikayla Gawarecki	dfd1d1971e	Fix warning when pickle.load torch.Storage (#130246 ) Fixes https://github.com/pytorch/pytorch/issues/130242 Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246 Approved by: https://github.com/albanD	2024-07-11 02:40:29 +00:00
Jason Ansel	4fcfd475be	[halide-backend] Update CI pin (#130258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258 Approved by: https://github.com/eellison	2024-07-11 02:26:16 +00:00
Jerry Zhang	df9d1b44e7	Preserve _numeric_debug_handle through deepcopy and re-export (#129287 ) Summary: * Added support for preserving it during deepcopy; the args need to be remapped since _numeric_debug_handle refers to the nodes in the graph. TODO: fully support re-export; currently the metadata for the output node is not preserved. Test Plan: python test/test_quantization.py -k test_deepcopy_preserve_handle python test/test_quantization.py -k test_copy_preserve_handle all related tests: python test/test_quantization.py -k TestGenerateNumericDebugHandle Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287 Approved by: https://github.com/zhxchen17	2024-07-11 02:19:41 +00:00
Edward Z. Yang	a205a53c50	Make sym_node log more useful (#130436 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130436 Approved by: https://github.com/Skylion007	2024-07-11 01:42:53 +00:00
Edward Z. Yang	79e34800c3	Suppress guards generated by empty_strided in ir_node_to_tensor (#130431 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130431 Approved by: https://github.com/IvanKobzarev	2024-07-11 01:19:11 +00:00
cyy	798b9652f7	[6/N] Replace c10::optional with std::optional (#130438 ) Follows #130408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438 Approved by: https://github.com/janeyx99	2024-07-11 01:15:37 +00:00
leslie-fang-intel	5bc18ec0a1	[Inductor][CPP] Support vectorization of remainder (#129849 ) Summary: When checking the vectorization status across the 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849 Approved by: https://github.com/jgong5, https://github.com/lezcano ghstack dependencies: #130405	2024-07-11 00:50:50 +00:00
Vladimir Fokow	6adc725157	doc - fix the `max_norm` value in a note (#129687 ) `max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](`ec284d3a74/torch/nn/modules/sparse.py (L30)`) says). That note was created in #45595 The current pull request cleans it up. The value `True` in the note can confuse the users to think it can be a boolean. In fact, a counter-intuitive behavior will happen if users try to set it to `False`: it will be interpreted as 0, so the values of the embedding will become 0 - not what the users were expecting by setting it to `False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2024-07-11 00:01:17 +00:00
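The `max_norm=False` pitfall described in the commit above comes down to Python treating `bool` as a numeric type. The following is a minimal pure-Python sketch of that behavior, not PyTorch's actual implementation: `renorm_row` is a hypothetical helper that rescales one embedding row so its L2 norm does not exceed `max_norm`.

```python
import math


def renorm_row(row, max_norm):
    """Hypothetical helper: rescale `row` so its L2 norm is at most `max_norm`."""
    # A bool is silently accepted as a number: False -> 0.0, True -> 1.0.
    limit = float(max_norm)
    norm = math.sqrt(sum(v * v for v in row))
    if norm > limit:
        scale = limit / norm
        return [v * scale for v in row]
    return list(row)


row = [3.0, 4.0]                 # L2 norm is 5.0
clipped = renorm_row(row, 2.5)   # a float limit halves the row
zeroed = renorm_row(row, False)  # False acts as 0.0, so the row collapses to zeros
```

Under these assumptions, passing `False` does not disable renormalization; it sets the limit to zero, which matches the counter-intuitive outcome the commit message warns about.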
Sam Larsen	358da54be5	[inductor] Better messaging when triton version is too old (#130403 ) Summary: If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior: 1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem. 2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..." Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403 Approved by: https://github.com/eellison	2024-07-10 23:45:50 +00:00
Andrew Gu	ceedee23ec	[DTensor] Included meshes in cross-mesh error msg (#130454 ) The current error message is not actionable since we do not know which meshes are involved. Including the `__repr__` of each mesh in the error helps but is not always sufficient. `7d4cb21098/torch/distributed/device_mesh.py (L395-L408)` The problem is that `DeviceMesh.__eq__` is actually pretty involved, and we cannot see all parts of the `__eq__` criteria just from the `__repr__` (e.g. the thread ID). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130454 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-07-10 22:40:57 +00:00
Xu Han	2abc7cc21b	[inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-10 22:28:29 +00:00
Ivan Zaitsev	551b3c6dca	Use irange to avoid -Wsign-compare errors (#130388 ) Fixes meta-internal errors after importing #128753 (see [D59498679](https://www.internalfb.com/diff/D59498679)) ``` fbcode/caffe2/aten/src/ATen/Context.cpp:286:34: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare] for (auto index = 0; index < at::getNumGPUs(); index++) { ~~~~~ ^ ~~~~~~~~~~~~~~~~ 1 error generated. ``` Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130388 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-07-10 22:07:51 +00:00
PyTorch MergeBot	ce499eee0c	Revert "Add API for open registration between operators and subclasses (and modes) (#130064 )" This reverts commit c23d103afae65588772cb30037ea4110f01f6f41. Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/izaitsevfb due to fails internal builds, see [D59553526](https://www.internalfb.com/diff/D59553526) ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2221587575))	2024-07-10 21:50:32 +00:00
Chirag Pandya	83c95c48f7	Flight recorder data as JSON (#129505 ) Summary: Provide a new API to retrieve flight recorder data as JSON. The one minor difference between flight recorder data as Pickle versus JSON is that the JSON API does not retrieve stack traces at the moment, because they end up being far too much data. Test Plan: unit test Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505 Approved by: https://github.com/wconstab, https://github.com/d4l3k	2024-07-10 21:50:27 +00:00
PyTorch MergeBot	86bca69c5f	Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261 )" This reverts commit bb9a73f767526e0d23c60360db5212b6bed0e8bc. Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))	2024-07-10 21:43:28 +00:00
PyTorch MergeBot	e14a0f45ed	Revert "[reland][custom ops] infer schema (#130079 )" This reverts commit bef085bdfa62cc14589c70279de17108b2c2089f. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))	2024-07-10 21:40:16 +00:00
Jon Janzen	46c52661bc	Use a better cherry-pick strategy for stable pytorch w/ distribute changes (#129987 ) 1. Update the branch name from internal feedback 2. Only cherry-pick in the changes to these folders Pull Request resolved: https://github.com/pytorch/pytorch/pull/129987 Approved by: https://github.com/seemethere	2024-07-10 20:55:36 +00:00
Catherine Lee	80a421a54d	[TD] Pin numpy to 1.26.0 in indexer (#130442 ) Temporarily pin 1.26.0 to get the workflow working while I go sort out which dependencies need to be updated Succeeding run: https://github.com/pytorch/pytorch/actions/runs/9877733366/job/27280052419?pr=130442 Tested by adding my branch to the trust relationship for the policy and removing the environment Pull Request resolved: https://github.com/pytorch/pytorch/pull/130442 Approved by: https://github.com/atalman, https://github.com/malfet	2024-07-10 20:52:24 +00:00
PyTorch MergeBot	cd2638be09	Revert "[pipelining] Refactor test_schedule to fix "-k" (#130294 )" This reverts commit 1352f13f7827cd1862a6e0507fb17dccddf73dc2. Reverted https://github.com/pytorch/pytorch/pull/130294 on behalf of https://github.com/clee2000 due to broke lint https://github.com/pytorch/pytorch/actions/runs/9879591538/job/27286156803 ([comment](https://github.com/pytorch/pytorch/pull/130294#issuecomment-2221376073))	2024-07-10 20:26:58 +00:00
PyTorch MergeBot	b81767161e	Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 )" This reverts commit 08d5423d339ac4b302f8ae6b63b334e032104753. Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9879109008/job/27286339304 `08d5423d33` test was not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2221368245))	2024-07-10 20:22:24 +00:00
Pian Pawakapan	1b3b4c2fb9	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) (#130380 ) original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train) Summary: This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example: ``` z = torch.cat([x, x], dim=0) # 2s0 w = z.repeat(y.shape[0]) # 2s0s1 _w = w.shape[0] s0 = x.shape[0] s1 = y.shape[0] _w0 = 2 s0 _w = _w0 * s1 ``` Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example: ``` torch.sym_constrain_range_for_size(n, min=2, max=16) torch.sym_constrain_range(n, min=4, max=20) torch._check(n >= 0) torch._check(n >= 3) torch._check(n <= 14) torch.sym_constrain_range_for_size(n) torch._check(n >= 4) torch._check(n <= 14) ``` Test Plan: contbuild & OSS CI, see `940e4477ab` Original Phabricator Test Plan: Imported from GitHub, without a `Test Plan:` line. Differential Revision: D59543603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380 Approved by: https://github.com/izaitsevfb	2024-07-10 19:23:37 +00:00
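The range deduplication in the commit above (several overlapping range constraints on `n` reduced to one min check and one max check) can be illustrated with a small pure-Python sketch. This is only an illustration of the idea, not the ShapeEnv code: `tightest_bounds` is a hypothetical helper, and each constraint is modeled as a `(lower, upper)` pair with `None` meaning unbounded.

```python
def tightest_bounds(constraints):
    """Hypothetical helper: collapse (lower, upper) pairs into the tightest single pair."""
    lo, hi = None, None
    for lower, upper in constraints:
        if lower is not None:
            lo = lower if lo is None else max(lo, lower)
        if upper is not None:
            hi = upper if hi is None else min(hi, upper)
    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"contradictory constraints: {lo} > {hi}")
    return lo, hi


# The five constraints from the commit message example, in order:
# range_for_size(2..16), range(4..20), n >= 0, n >= 3, n <= 14
constraints = [(2, 16), (4, 20), (0, None), (3, None), (None, 14)]
bounds = tightest_bounds(constraints)
```

The resulting pair corresponds to the two remaining checks in the commit's example, `n >= 4` and `n <= 14`.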
Will Constable	1352f13f78	[pipelining] Refactor test_schedule to fix "-k" (#130294 ) This is kind of a short-sighted workaround and we should actually come up with a way to fix this in general, but I got annoyed that I can't use -k to filter tests in test_schedule, and realized it's because we jam tests using the new MultiProcContinuousTest fixture together with old-style tests. For now I separate the two types of tests so -k works again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294 Approved by: https://github.com/H-Huang	2024-07-10 18:32:51 +00:00
Feng Yuan	cf090e222e	Update torch-xpu-ops pin (ATen XPU implementation) (#130333 ) 1. Fixing compilation error due to PyTorch update. The helper function prototype changes, `checkIndexTensorTypes`. 2. Fixing compilation error due to PyTorch update. PyTorch forced -Werror=unused-function. 3. Fixing inductor case failure due to CUDA bias implementation in the case. https://github.com/pytorch/pytorch/issues/130426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-07-10 18:10:53 +00:00
Nikita Shulga	4b7ee51260	[BE][MPS] Cleanup optimizers code (#130453 ) - Fix C++20 forward compatibility warnings, namely ``` warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions] multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name, ``` - Use nested namespaces - Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace - Use more convenience methods (rather than call by hand) - Use C++14 `return f();` for void functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453 Approved by: https://github.com/Skylion007	2024-07-10 18:00:05 +00:00
IvanKobzarev	08d5423d33	[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 ) Reland of: https://github.com/pytorch/pytorch/pull/128016 Summary from previous PR: We assume only two possible, mutually exclusive scenarios: (1) Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad. (2) Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad. Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only that any of the inputs requires_grad 2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under with.enable_grad() Changes in partitioner? Inference and training graphs differed in their return container, list/tuple. The changes in the partitioner unify this to always return a tuple. As a result, there are some changes in test_aotdispatch.py for graph contents, list -> tuple. Why was it reverted? There was a regression of the hf_Reformer model on inference. ``` TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode ``` This happened because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True). Even though the torchbench inference benchmark runs inside `with torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad. As a result, we started compiling a training graph instead of an inference graph. Fix for view ops: If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph. This is handled in aot_autograd.py, where output_and_mutation_safe are calculated. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890 Approved by: https://github.com/bdhirsh	2024-07-10 17:56:32 +00:00
PyTorch MergeBot	0beeac35fa	Revert "[cond] inlining into one of the branches when pred is a python constant (#128709 )" This reverts commit fe3e6878c4bb2a6001045c179fd7fa9838242558. Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/ydwu4 due to causing an error on trunk because of a land race: `fe3e6878c4` ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2221104043))	2024-07-10 17:47:19 +00:00
Shivam Raikundalia	b4b7477d3f	Fix CPU Annotation Overlapping with Python Events (#129599 ) Summary: Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0. To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events. Test Plan: Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599 Approved by: https://github.com/aaronenyeshi	2024-07-10 17:33:56 +00:00
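The single-pass adjustment described in the commit above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the profiler's actual post-processing code: `align_steps` is a hypothetical helper, events are plain `(start, end)` tuples, both lists are sorted by start time, and the Python events are assumed not to nest or overlap each other.

```python
def align_steps(steps, py_events):
    """Hypothetical helper: move each step's start back to the start of the
    Python event that partially overlaps it. Both inputs are sorted by start."""
    adjusted = []
    i = 0  # index into py_events; it only moves forward, giving O(s + n) overall
    for start, end in steps:
        # Skip Python events that finished before this step begins.
        while i < len(py_events) and py_events[i][1] <= start:
            i += 1
        if i < len(py_events):
            py_start, py_end = py_events[i]
            # Partial overlap: the Python event began earlier and is still
            # running when the step starts.
            if py_start < start < py_end:
                start = py_start
        adjusted.append((start, end))
    return adjusted


steps = [(5, 20), (25, 40)]
py_events = [(2, 12), (14, 18), (22, 30)]
aligned = align_steps(steps, py_events)
```

Because the event index is never reset between steps, each list is traversed once, which is where the linear bound quoted in the commit message comes from.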
Ivan Zaitsev	6b3460ae0d	fix discrepancy from the export of #126601 (#130296 ) #126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet	2024-07-10 17:26:44 +00:00
Tom Ritchford	7d4cb21098	Decompose expand_copy and permute_copy (#129476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 17:12:01 +00:00
AIM \| Nara	a7aa066b09	Fix link to dynamo in torch/fx readme (#130233 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130233 Approved by: https://github.com/janeyx99	2024-07-10 17:00:49 +00:00
Laith Sakka	a09910d3a9	add strobelight profile links to tlparse (#129703 ) Summary: title. Test Plan: buck2TORCH_TRACE=~/my_trace_log_dir buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compile_time_profiler_example tlparse ~/my_trace_log_dir result https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpBrQJcL/index.html {F1726980413} Differential Revision: D59130581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129703 Approved by: https://github.com/aorenste	2024-07-10 16:53:21 +00:00
Yidi Wu	fe3e6878c4	[cond] inlining into one of the branches when pred is a python constant (#128709 ) When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants. We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on de-serialized graphs. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches. Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709 Approved by: https://github.com/zou3519	2024-07-10 16:44:27 +00:00
atalman	9d94b122f0	Fix usage of USE_ROCM when calling cudaFuncGetAttributes (#130441 ) This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with ``` C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0); ``` Fixes https://github.com/pytorch/pytorch/issues/130437 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-07-10 16:30:43 +00:00
Richard Barnes	ae73489b7d	[codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/aten/src/ATen/native/cuda/DistributionTemplates.h (#130433 ) Test Plan: Sandcastle Reviewed By: meyering Differential Revision: D59528276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130433 Approved by: https://github.com/malfet	2024-07-10 16:30:37 +00:00
Shangdi Yu	bef085bdfa	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-10 16:18:36 +00:00
chilli	ce4d95143f	Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 ) After this PR, our numerical error is within 3% of FA2 for forward and gradients. Prior, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error. ![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250 Approved by: https://github.com/drisspg ghstack dependencies: #130227	2024-07-10 16:14:45 +00:00
chilli	a7715e36de	Add block mask utility support for batches and heads > 1 (#130227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227 Approved by: https://github.com/yanboliang	2024-07-10 16:14:45 +00:00
Shangdi Yu	c83b941141	[export] add dynamic shapes argument and infer from graph nodes (#129928 ) Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`. On a high level, the issue is caused by not detecting fake_mode when there's no input. Change plan: 1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`. 2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg. 3) If the input is a graph module, then we can traverse the graph and check. 4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type. Fixes #129927 Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph). So in ` _strict_export_lower_to_aten_ir`, we create another fake_mode. `dynamo_fake_mode` is not the same as the fake_mode used by dynamo. Change plan: check `gm_torch_level` graph's node meta "example_value" for fake mode in addition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928 Approved by: https://github.com/angelayi	2024-07-10 15:51:05 +00:00
cyy	d31f866b33	[BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409 ) AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409 Approved by: https://github.com/malfet	2024-07-10 15:50:47 +00:00
Chien-Chin Huang	81ea298600	Wrap the test func with try/except to always call destroy_process_group (#124961 ) This avoids the process group warning about not calling destroy_process_group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961 Approved by: https://github.com/wanchaol, https://github.com/wz337	2024-07-10 15:36:38 +00:00
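The pattern named in the commit above, running cleanup whether or not the test body raises, can be sketched with a small decorator. This is an illustration only: `always_cleanup`, `fake_destroy`, and `failing_test` are hypothetical names, and a plain function stands in for the real process group teardown. The sketch uses `try`/`finally`, which is one way to get the "always call" behavior.

```python
import functools


def always_cleanup(cleanup):
    """Return a decorator that calls `cleanup` after the wrapped test,
    even when the test raises."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            try:
                return test_fn(*args, **kwargs)
            finally:
                # Runs on success and on failure alike.
                cleanup()
        return wrapper
    return decorator


calls = []


def fake_destroy():
    # Stand-in for the real teardown call; records that it ran.
    calls.append("destroyed")


@always_cleanup(fake_destroy)
def failing_test():
    raise RuntimeError("test body failed")


try:
    failing_test()
except RuntimeError:
    pass
```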
Michael Eisel	81df076bfd	Fix Apple crash when running PyTorch with Metal API validation turned on (#130377 ) Fixes #130376 (at least, for my usage) There may be other places in the code base where `-setBytes:length:` is called with a length of 0 besides this, but this is the case that has triggered for me. Please let me know if there are any specific tests I should run. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130377 Approved by: https://github.com/malfet	2024-07-10 15:07:47 +00:00
Andres Lugo-Reyes	417c83e7cf	[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 ) The needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560 This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069, unskipping the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping. The tests previously marked with xfail have been modified to not expect a failure iff running on ROCm, as they now pass. Behavior is unchanged for them on other architectures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966 Approved by: https://github.com/malfet	2024-07-10 14:53:41 +00:00
rzou	b38de2f9e2	[decomps] Fix aten._to_copy decomp (#130381 ) `aten._to_copy` can receive a python number as input. This occurs in torch.compile support for vmap (see #130188). Previously, this would raise an assertion error. This PR changes it so that if we see a python number, we call torch.scalar_tensor on it first (h/t @bdhirsh). Fixes #130362 Fixes #130188 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381 Approved by: https://github.com/Chillee	2024-07-10 14:34:28 +00:00
cyy	bd3452f431	[5/N] Change #include <c10/util/Optional.h> to #include <optional> (#130408 ) Follows #130329 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130408 Approved by: https://github.com/malfet	2024-07-10 14:29:43 +00:00
Li-Huai (Allan) Lin	99967e1119	[MPS][TYPE_PROMOTION] Fix Clamp (#130226 ) Summary: 1. Fixed #130201 by adding type promotion. 2. Added proper tests. 3. Found torch's type promotion is different from numpy as follows: ```python import torch import numpy as np np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64') torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32 ``` ~Not sure the proper way to handle it, it causes numpy ref tests to fail.~ Reason here, so think I'm gonna xfail it: `3c1cf03fde/test/test_ops.py (L260-L264)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226 Approved by: https://github.com/malfet	2024-07-10 14:27:39 +00:00
rzou	6ce0bd7d3b	[HOP] Use user directed names for variables where possible (#130271 ) Afaict the previous check was too strict. Removing it passes all the mutation tests (mutation checks happen via the TensorVariable's mutable_local). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271 Approved by: https://github.com/Chillee, https://github.com/ydwu4	2024-07-10 13:59:20 +00:00
PyTorch MergeBot	637cc8d27f	Revert "update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 )" This reverts commit 6367f02a0e136ced05c665301bcdaa4d76690457. Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main `6367f02a0e` ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))	2024-07-10 13:48:32 +00:00
atalman	a1590e16df	Add single Python 3.10, single Cuda 12.1 build with dependencies included (#130349 ) Build large wheel for Python 3.10, CUDA 12.1 that will be used in Colab. Build name: ``manywheel-py3_11-cuda12_1-full-build`` We still have all code to support the full build in builder repo, here: https://github.com/pytorch/builder/blob/main/manywheel/build_cuda.sh#L151 Test: ``` import sys import torch sys.version_info print(torch.__version__) sys.version_info 2.3.0+cu121 sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130349 Approved by: https://github.com/malfet	2024-07-10 12:57:39 +00:00
Li-Huai (Allan) Lin	cb2bce98de	[MPS][BE] Reduce the number of parameters encoded for no momentum fused SGD (#130131 ) Summary: 1. Reduce the number of parameters encoded for no momentum fused SGD 2. Use convenience functions `mtl_setBuffer` and `mtl_setBytes`. Just a BE, no significant performance difference is observed. Test plan: Relying on CI signals Pull Request resolved: https://github.com/pytorch/pytorch/pull/130131 Approved by: https://github.com/janeyx99, https://github.com/malfet	2024-07-10 07:58:38 +00:00
Jiang, Yanbing	6367f02a0e	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across test machines without being re-generated on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`. Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto the device. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression from this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-10 07:38:42 +00:00
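The `1 / 8` size claim in the commit above follows from the shapes: a row of `k` int32 values takes `4 * k` bytes, while `k / 2` uint8 values take `k / 2` bytes. A minimal pure-Python sketch of packing two 4-bit values into each byte is shown below. The function names are hypothetical, and the nibble order (even index in the high four bits) is an assumption for illustration; the commit message does not state which order the real operator uses.

```python
def pack_int4_row(values):
    """Pack a row of 4-bit values (0..15) into bytes, two values per byte.

    Assumption: the even-indexed value goes in the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("row length k must be even")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits")
    return bytes(
        (values[i] << 4) | values[i + 1] for i in range(0, len(values), 2)
    )


def unpack_int4_row(packed):
    """Inverse of pack_int4_row under the same nibble-order assumption."""
    out = []
    for byte in packed:
        out.append(byte >> 4)    # high nibble
        out.append(byte & 0x0F)  # low nibble
    return out


row = [1, 15, 0, 7]                # k = 4 values
packed = pack_int4_row(row)        # k / 2 = 2 bytes
int32_bytes = 4 * len(row)         # 16 bytes if each value were an int32
ratio = len(packed) / int32_bytes  # 2 / 16
```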
leslie-fang-intel	e29657efb6	[Inductor][CPP] Fix typo in merge rules (#130405 ) Summary There is a typo of the `CPU Inductor` group in `merge_rules.yaml` which should be `test/inductor/test_cpu_repro.py` instead of `test/inductor/test_cpu_repo.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130405 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-07-10 07:13:03 +00:00
cyy	10c7f037fe	Simplify c10::string_view (#130009 ) Make c10::basic_string_view a subclass of std::basic_string_view for easier replacement in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130009 Approved by: https://github.com/ezyang	2024-07-10 05:02:16 +00:00
Xuehai Pan	a17d1e5322	Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 ) Fix static `py::object`s with `py::gil_safe_call_once_and_store`. The following code creates a static `py::object` whose destructor is called when the program shuts down. The destructor calls `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault. ```c++ void func() { static py::object obj = py::module_::import("foo").attr("bar"); ... } ``` A correct alternative is to use a raw pointer rather than the instance. ```c++ void func() { static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")}; py::object obj = *obj_ptr; ... } ``` This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which runs arbitrary initialization code exactly once, thread-safely, under the Python GIL. ```c++ void func() { PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage; py::object obj = storage .call_once_and_store_result( []() -> py::object { return py::module_::import("foo").attr("bar"); } ) .get_stored(); ... } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341 Approved by: https://github.com/ezyang	2024-07-10 04:23:37 +00:00
rzou	5abe7ebd41	Add new (private) capture_triton API (#130178 ) When applied to a triton kernel, capture_triton allows the triton kernel to be captured when tracing with make_fx. It does this by transforming the call to the triton kernel into a call to the triton_kernel_wrapper_mutation HOP, which can actually be traced into a graph via make_fx. We have two main use cases for this: - non-strict export doesn't use Dynamo, but people want to use non-strict export to export programs with triton kernels. non-strict export uses make_fx tracing, so this is a necessary step in that direction. - People want to write inductor passes that replace a sequence of operators with a call to a function that may contain a triton kernel. The way these passes work today is that we have an FX graph and want to replace a subgraph of it with a new subgraph. We obtain said subgraph by calling make_fx on the function; this won't work on raw triton kernels but will work if one uses capture_triton. Test Plan: - I wrote some manual tests to run make_fx over two of the triton kernels in test_triton_kernels. It would be nice to be able to run make_fx through all of the tests in the file but I'm not sure how to do that refactor right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178 Approved by: https://github.com/oulgen ghstack dependencies: #130177	2024-07-10 03:09:29 +00:00
rzou	99c68f7bea	Refactor TritonKernelVariable's logic so it can be shared (#130177 ) TritonKernelVariable's logic tells us how to go from a user-defined triton kernel and a grid to a call to the triton_kernel_wrapper_mutation HOP. We want to re-use this in a setting without Dynamo; in the next PR up, we create a new decorator (capture_triton) that, when applied to a triton kernel, transforms a call to the triton kernel into a call to the triton_kernel_wrapper_mutation HOP. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177 Approved by: https://github.com/oulgen, https://github.com/ydwu4	2024-07-10 03:09:29 +00:00
Valentine233	868d9a4f12	[cpu][flash attention] fix nan issue (#130014 ) Fixes #127055. NaNs are generated in flash attention because the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in lazy softmax. We fix the issue by avoiding the related calculation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130014 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-07-10 02:33:26 +00:00
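The two NaN sources named in the commit above are ordinary IEEE 754 behavior and can be reproduced with the Python standard library alone. The sketch below is illustrative: `naive_rescale` and `safe_rescale` are hypothetical names, and the guard shown is only one possible way to avoid the computation, not necessarily the one used in the actual kernel.

```python
import math

NEG_INF = float("-inf")


def naive_rescale(old_max, new_max):
    """Rescale factor exp(old_max - new_max), as used when a running
    softmax maximum is updated. Yields NaN when both maxima are -inf."""
    return math.exp(old_max - new_max)


def safe_rescale(old_max, new_max):
    """Assumed guard for illustration: skip the subtraction when the new
    maximum is still -inf (for example, every score so far is masked)."""
    if new_max == NEG_INF:
        return 1.0  # assumption: nothing has been accumulated yet
    return math.exp(old_max - new_max)


nan_from_exp = naive_rescale(NEG_INF, NEG_INF)  # exp(nan) is nan
nan_from_mul = float("inf") * 0.0               # inf * 0 is nan
guarded = safe_rescale(NEG_INF, NEG_INF)
```

Once a NaN enters the running sum it propagates through every later update, which is why the guard has to sit before the subtraction rather than after it.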
Tom Ritchford	68751799b8	Add decompositions for copy variants of view ops (#128416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 01:39:09 +00:00
cyy	007e75958f	[4/N] Change #include <c10/util/Optional.h> to #include <optional> (#130329 ) Follows #130300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130329 Approved by: https://github.com/ezyang	2024-07-10 01:26:50 +00:00
awayzjj	9912209743	check if the input fx graph of aot_compile return tuple (#129824 ) Fixes https://github.com/pytorch/pytorch/issues/129719 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129824 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2024-07-10 01:18:55 +00:00
cyy	85b8503621	[Caffe2] Remove Caffe2 documentation (#130089 ) Due to the removal of Caffe2 code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089 Approved by: https://github.com/r-barnes, https://github.com/albanD	2024-07-10 00:52:16 +00:00
cyy	7a3ab1fe79	[structural binding][7/N] Replace std::tie with structural binding (#130216 ) Follows #120353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216 Approved by: https://github.com/albanD	2024-07-10 00:52:04 +00:00
PyTorch MergeBot	fb696bf264	Revert "Add block mask utility support for batches and heads > 1 (#130227 )" This reverts commit 64139987c0588f2eef198a0b9fd6904783b37b2c. Reverted https://github.com/pytorch/pytorch/pull/130227 on behalf of https://github.com/izaitsevfb due to breaks internal builds, please see D59498662 ([comment](https://github.com/pytorch/pytorch/pull/130227#issuecomment-2218842579))	2024-07-09 22:34:39 +00:00
PyTorch MergeBot	44815ed67e	Revert "Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 )" This reverts commit 3e48d927332915e1ecbd3c7f2c6b9680428f181e. Reverted https://github.com/pytorch/pytorch/pull/130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130250#issuecomment-2218840674))	2024-07-09 22:32:54 +00:00
Catherine Lee	5b5a1f5202	Add on to Mark some test_decomp tests as slow on win #130260 (#130337 ) An add on to https://github.com/pytorch/pytorch/pull/130260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337 Approved by: https://github.com/malfet	2024-07-09 22:30:53 +00:00
Joel Schlosser	fd43a2ba27	Forward fix for test_compare_cpu_cuda_float32 (#130360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360 Approved by: https://github.com/malfet ghstack dependencies: #128238	2024-07-09 22:28:39 +00:00
PyTorch MergeBot	3be4922a9d	Revert "[HOP] Use user directed names for variables where possible (#130271 )" This reverts commit adb65682affdfc37f724c02ea8c8930d3925fc07. Reverted https://github.com/pytorch/pytorch/pull/130271 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9863205414/job/27236960046 `adb65682af` Test not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130271#issuecomment-2218832643))	2024-07-09 22:24:39 +00:00
Zhengxu Chen	37d4d04309	[torchscript] Add logging for model id. (#130118 ) Summary: as title. Test Plan: CI Reviewed By: angelayi Differential Revision: D59348256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130118 Approved by: https://github.com/BoyuanFeng	2024-07-09 22:24:16 +00:00
Riley Dulin	fb5cb17fbe	[torch][fx] Add normalize_args constructor argument to FxGraphDrawer (#130348 ) Summary: When writing out Graphviz files for graphs, sometimes the arguments are all in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not remember the stride, padding, dilation order. Add an option `normalize_args` (defaults to False) to normalize all args into kwargs. This should help the readability of a graph. Differential Revision: D59529417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348 Approved by: https://github.com/mcremon-meta	2024-07-09 22:16:54 +00:00
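The idea behind normalizing positional arguments into keyword arguments, as described in the commit above, can be sketched with the standard library's `inspect` module. This does not show how `FxGraphDrawer` implements its `normalize_args` option; `normalize_to_kwargs` is a hypothetical helper, and `conv2d` here is a plain Python stand-in whose parameter order is assumed for illustration.

```python
import inspect


def conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    """Stand-in function with a conv2d-like parameter order (assumed)."""


def normalize_to_kwargs(fn, args, kwargs):
    """Bind positional and keyword arguments to fn's signature and return
    them all as a name -> value mapping."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    return dict(bound.arguments)


# Positional call where it is easy to forget which number is which.
normalized = normalize_to_kwargs(conv2d, ("x", "w", None, 2, 1), {})
```

With names attached, a rendered node can show `stride=2, padding=1` instead of a bare run of numbers.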
Aaron Enye Shi	df83142131	[CCA][Memory Snapshot] Stop duplicating annotations to all device_traces (#130315 ) Summary: This diff fixes a bug, where all record_annotations will save a TraceEntry to each of the device_traces. Instead, we should only save annotations to the current device_trace that is being called by the thread calling the native allocator's recordAnnotation. Test Plan: CI and ran workloads on MVAI WPR FBR. Reviewed By: zdevito Differential Revision: D59477339 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315 Approved by: https://github.com/zdevito	2024-07-09 21:38:47 +00:00
rzou	bb9a73f767	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-09 21:11:27 +00:00
rzou	c23d103afa	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-09 21:11:27 +00:00
PyTorch MergeBot	9c9744c3ac	Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599 )" This reverts commit 940e4477ab0b81eea25051447cf5f599080c903f. Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/izaitsevfb due to breaking internal APS tests, see D59498864 ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2218724762))	2024-07-09 21:03:49 +00:00
Tristan Rice	f85bda8bdd	c10d/Handlers: expose running handlers from Python (#130149 ) This adds a `_run_handler` method that will invoke a specific handler. Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130149 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-07-09 20:20:59 +00:00
Tianyi Tao	1d93367cfa	Fix typo (#130305 ) Fixes #130241. This PR reopens #130244 in an attempt to fix the failed job. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130305 Approved by: https://github.com/Skylion007	2024-07-09 20:02:00 +00:00
Chen Lai	721a798886	add bits16 to graph dtype_abbrs (#130339 ) As title, patch the dtype in torch.fx.graph Pull Request resolved: https://github.com/pytorch/pytorch/pull/130339 Approved by: https://github.com/angelayi	2024-07-09 19:58:51 +00:00
Jerry Mannil	42f647219a	[ROCm] Add int4 support (#129710 ) - Add AMD support for int4 kernel - Only supports CDNA2 and CDNA3 gpus for now - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types - Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus - Fix torchscript issues due to hipify for `__nv_bfloat16` type - TorchScript has its own implementation for bfloat16 type - Implemented in `__nv_bloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h) - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify Fixes #124699 Fixes pytorch-labs/gpt-fast/issues/154 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-09 19:49:12 +00:00
rzou	adb65682af	[HOP] Use user directed names for variables where possible (#130271 ) Afaict the previous check was too strict. Removing it passes all the mutation tests (mutation checks happen via the TensorVariable's mutable_local). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271 Approved by: https://github.com/Chillee, https://github.com/ydwu4 ghstack dependencies: #130255, #130268	2024-07-09 19:42:52 +00:00
cyy	a6345d3477	[CMake] [3/N] Remove unused code (#130322 ) Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322 Approved by: https://github.com/r-barnes	2024-07-09 19:33:33 +00:00
Tianyi Tao	3477ee38e4	fix the use of initial learning rate in the OneCycleLR example (#130306 ) Fixes #127649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306 Approved by: https://github.com/janeyx99	2024-07-09 18:58:07 +00:00
Peter Bell	3689471ea4	[inductor] Add FileCheck to flex attention epilogue test (#129343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343 Approved by: https://github.com/lezcano	2024-07-09 18:15:55 +00:00
Yifu Wang	c6cce976b2	Fix an issue where ENABLE_INTRA_NODE_COMM=1 + multiple process groups leads to failure (#130269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130269 Approved by: https://github.com/Chillee	2024-07-09 17:42:09 +00:00
Yidi Wu	cb4bec311a	Fix nodes has more than one output users after replace_set_grad_with_hop pass (#129716 ) Summary: Previously, when we inlined subgraphs that don't have a different require_grad environment, we didn't clean up the users of the nodes in the subgraph and directly used them to replace the output of the call_modules. This left dead dependencies recorded in node.users. This PR fixes this. Test Plan: Added a new test. Also see the torchrec tests: Step 1: buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold Step 2: buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli -- --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save Differential Revision: D59132214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716 Approved by: https://github.com/angelayi	2024-07-09 17:04:03 +00:00
Eddie Yan	e4c51d22c5	[cuDNN] Cleanup < 8.5 #ifdefs (#130283 ) We've said cuDNN 8.5 is the minimum supported version for a bit now Pull Request resolved: https://github.com/pytorch/pytorch/pull/130283 Approved by: https://github.com/Skylion007	2024-07-09 16:35:39 +00:00
Shangdi Yu	cab90b0049	[custom ops] disable kernel temporarily (#130190 ) Fixes #128621 Sometimes we want to disable the backend implementation for testing/benchmarking purposes. For example: ```python @custom_op("mylib::f", mutates_args=()) def f(x: Tensor) -> Tensor: return torch.zeros(1) print(f(torch.randn(1))) # tensor([0.]) @f.register_kernel("cpu") def _(x): return torch.ones(1) print(f(torch.randn(1))) # tensor([1.]) with f.set_kernel_enabled("cpu", enabled = False): print(f(0)) # tensor([0.]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130190 Approved by: https://github.com/williamwen42, https://github.com/zou3519	2024-07-09 16:13:50 +00:00
Richard Zou	edf273edf4	Revert some PRs (#130303 ) Summary: Revert https://github.com/pytorch/pytorch/pull/129346 thru https://github.com/pytorch/pytorch/pull/128893 For S430832 Test Plan: Tests Differential Revision: D59503843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303 Approved by: https://github.com/bdhirsh	2024-07-09 14:46:00 +00:00
cyy	71efbf701d	[3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300 ) Follows #130236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300 Approved by: https://github.com/ezyang	2024-07-09 13:32:57 +00:00
milesial	a5f816df18	Add more dtypes to __cuda_array_interface__ (#129621 ) `__cuda_array_interface__` was missing some unsigned integer dtypes as well as BF16. numba doesn't support BF16 so I skip tests for that one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129621 Approved by: https://github.com/lezcano	2024-07-09 10:47:19 +00:00
chilli	3e48d92733	Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250 Approved by: https://github.com/drisspg ghstack dependencies: #130160, #130106, #130224, #130227	2024-07-09 09:24:06 +00:00
eqy	86fb76e871	[SDPA] Clean up `print` in `test/test_transformers.py` (#130302 ) Left this in #125343, oops... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130302 Approved by: https://github.com/awgu	2024-07-09 09:20:52 +00:00
Yichen Yan	953c6476bd	[CMAKE] Look for `Development.Module` instead of `Development` (#129669 ) Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347), when building a python module, it should find the `Development.Module` module, not `Development`, which includes `Development.Module` and `Development.Embed`, and will expect the shared python library only. After this PR and before #124613, pytorch could be built with a static libpython (e.g. in manylinux). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669 Approved by: https://github.com/malfet	2024-07-09 09:16:43 +00:00
Valentin Andrei	b139b5090f	[pytorch] Name threads in thread pools for better debugging (#130270 ) Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools. This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270 Approved by: https://github.com/d4l3k, https://github.com/albanD	2024-07-09 08:03:47 +00:00
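The commit above names threads in PyTorch's native thread pools. As a rough Python analogy only (this is not the code touched by the PR, and the `pt_worker` prefix is a made-up name), the standard library's `ThreadPoolExecutor` shows why a per-pool name helps when debugging: every worker reports a name that identifies its pool instead of inheriting a generic one.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def current_thread_name():
    # Report the name of whichever pool thread runs this task.
    return threading.current_thread().name


# "pt_worker" is a hypothetical prefix chosen for this sketch.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="pt_worker") as pool:
    worker_names = [pool.submit(current_thread_name).result() for _ in range(4)]

print(sorted(set(worker_names)))
```

Without `thread_name_prefix`, the workers get a generic default name, which is the kind of ambiguity the commit describes for unnamed native pools.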
Yuanhao Ji	312652c325	[RFC] Add support for device extension autoloading (#127074 ) Fixes #122468 - Load device extensions at the end of `torch/__init__.py` - Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0` run test: ```python python test/run_test.py -i test_autoload_enable python test/run_test.py -i test_autoload_disable ``` doc: https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html co-author: @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074 Approved by: https://github.com/albanD, https://github.com/jgong5	2024-07-09 06:14:13 +00:00
Aaron Enye Shi	6c4efd4e95	[Memory Snapshot][BE] Clean up record function callback scope (#130265 ) Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback. Test Plan: Ran locally, works fine. https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle Differential Revision: D59477046 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265 Approved by: https://github.com/davidberard98	2024-07-09 05:23:48 +00:00
Sam Larsen	ded469cfbd	[issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275 ) Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;) Test Plan: Ran both ways: * `python test/inductor/test_memory_planning.py` * `pytest test/inductor/test_memory_planning.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275 Approved by: https://github.com/zou3519	2024-07-09 05:20:56 +00:00
Xu Han	e235db98c9	[Inductor] Add aot_mode UT to new cpp_builder. (#130105 ) Changes: 1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT. 2. Switch AotCodeCompiler vec isa command gen to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-09 04:08:35 +00:00
Sheng Fu	31df1d235e	Support tensor stride (#129297 ) Summary: X-link: https://github.com/facebookresearch/param/pull/126 Support tensor stride for execution trace. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda profiler.test_execution_trace.TestExecutionTrace Differential Revision: D58900476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129297 Approved by: https://github.com/sanrise, https://github.com/izaitsevfb	2024-07-09 03:55:46 +00:00
Edward Z. Yang	e836ee1955	Enhancements to recompiles logs (#130043 ) ---- - We now record on CacheEntry what the compile id that populated it was, so now we can say why a specific frame was rejected - Add structured log for recompiles under name artifact "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start - Slightly reformat multi-reason printing; since we only report one guard failure seems better to have it as a single line Example output: ``` V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3 V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] triggered by the following guard failure(s): V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5 ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043 Approved by: https://github.com/anijain2305	2024-07-09 03:40:56 +00:00
cyy	29861779ce	[2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236 ) Follows #128301. The changes were made by grep and sed Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236 Approved by: https://github.com/ezyang	2024-07-09 03:17:24 +00:00
rzou	d1e0653fad	[fx][easy] print_readable should recursively apply options (#130268 ) For example, print_readable(colored=True) should also print submodules with colors. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/130268 Approved by: https://github.com/Chillee ghstack dependencies: #130255	2024-07-09 02:50:20 +00:00
rzou	f2c9f0c0db	[HOP] improve naming for subgraph inputs (#130255 ) Previously, subgraph input names were whatever the input proxies were, which were confusing. This PR changes those names to be whatever the names of the arguments the functions being speculate_subgraph'ed are. This is best-effort: if we can't figure it out then we go back to the previous strategy. Test Plan: - existing expecttests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255 Approved by: https://github.com/ydwu4	2024-07-09 02:46:40 +00:00
Jane Xu	abe81d5d05	Fix the rest of foreach flakers (#130277 ) Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect. Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to https://github.com/pytorch/pytorch/pull/129003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277 Approved by: https://github.com/soulitzer	2024-07-09 02:08:21 +00:00
PyTorch MergeBot	d44c30e2f9	Revert "Add API for open registration between operators and subclasses (and modes) (#130064 )" This reverts commit 922d2737d5e0ad22ee1dcf91c48ab09d641de840. Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_profiler_tree is failing in trunk after this lands `922d2737d5`, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2216135497))	2024-07-09 01:48:38 +00:00
Catherine Lee	75fa10066d	Mark some test_decomp tests as slow on win (#130260 ) Auto slow test detection is marking and then unmarking these as slow, so permanently mark them as slow on windows. These tests take >500s on windows. This is part of the reason why test_decomp keeps failing on windows (ex `da66e50e6e`). The other part is something to do with reruns + thresholds that I am still investigating Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-07-09 00:16:31 +00:00
Will Constable	7f08d3d9a0	[C10D] Fix corrupt log due to uint_8 printing as char (#130184 ) Previously, jobs would log lines like this due to interpreting an int8 value as a signed char when streaming out. "ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@" We need a better solution for avoiding this systematically, but at least for now fix the spot we know about. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184 Approved by: https://github.com/eeggl, https://github.com/Skylion007	2024-07-08 23:37:50 +00:00
Jerry Zhang	4c19623800	Change numeric_debug_handle to store per-node id (#129811 ) Summary: Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack, but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional support for numerical debugging for inputs and are willing to hack around to achieve this. This PR changes the structure of numeric_debug_handle to store a unique_id for each node instead. e.g. graph: ``` node = op(input_node, weight_node) ``` Before: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3} ``` After: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1 ``` Test Plan: python test/test_quantization.py -k TestGenerateNumericDebugHandle Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811 Approved by: https://github.com/tarun292	2024-07-08 23:36:19 +00:00
Will Constable	a28bb3268d	[Pipelining] Reorder _Action from F1_1 to 1F1 (#129786 ) Also steers away from accessing _Action via positional unpacking since that is error-prone Co-authored-by: Howard Huang <howardhuang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786 Approved by: https://github.com/H-Huang	2024-07-08 23:07:51 +00:00
Huy Do	60d9f3f7d9	Set the epoch timestamp when uploading data to dynamoDB (#130273 ) This moves us away from the `_event_time` field from Rockset, which we cannot use when reimporting the data Pull Request resolved: https://github.com/pytorch/pytorch/pull/130273 Approved by: https://github.com/clee2000	2024-07-08 22:58:32 +00:00
Yueming Hao	b4cc25f126	[custom_op]Fix self in mutation_args (#130179 ) Fixes #124933 ## Issue Summary If users declare `self` in `mutates_args`, an error occurs: `TypeError: AutoFunctionalized.__call__() got multiple values for argument 'self'`. For the following example, the schema for mutates_args is parsed as {"self": FakeTensor}. `6df963a2c8/torch/_higher_order_ops/auto_functionalize.py (L234)` In the above line, it is unwrapped as `self=FakeTensor`, which leads to incorrect argument passing because `self` is the conventional first parameter name for methods of a class, such as https://github.com/pytorch/pytorch/compare/main...findhao/fix-self-custom-ops#diff-9453b6b52a54783beec3dd1c60248620f61c3a524d404a188af17bbdf6be3d9eR292 . ```python import torch @torch.library.custom_op("mylib::foo", mutates_args={"self"}) def foo(self: torch.Tensor) -> None: self.sin_() x = torch.randn(3) @torch.compile(backend="inductor", fullgraph=True) def f(x): foo(x) f(x) ``` ## Fix This PR renames every affected `self` parameter to `self_`, following the existing convention in `6fc771d19b/torch/_ops.py (L667)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130179 Approved by: https://github.com/zou3519	2024-07-08 22:55:50 +00:00
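The collision described in the commit above can be reproduced without PyTorch. The sketch below is a minimal pure-Python model, assuming only that a callable object forwards a `{"self": ...}` mapping as keyword arguments; `Dispatcher`, `FixedDispatcher`, and `call_with_self_kwarg` are hypothetical names for illustration, not PyTorch classes.

```python
class Dispatcher:
    """Toy stand-in for a callable object that forwards keyword arguments."""

    def __call__(self, op_name, **kwargs):
        return op_name, kwargs


class FixedDispatcher:
    """Same idea, but the receiver is named `self_`, leaving `self` free for callers."""

    def __call__(self_, op_name, **kwargs):
        return op_name, kwargs


def call_with_self_kwarg(dispatcher):
    # Mimics unwrapping a parsed {"self": value} mapping into keyword arguments.
    mutated_args = {"self": "fake_tensor"}
    return dispatcher("mylib::foo", **mutated_args)


try:
    call_with_self_kwarg(Dispatcher())
    collision_message = None
except TypeError as exc:
    # The bound receiver and the keyword both claim the name `self`.
    collision_message = str(exc)

fixed_result = call_with_self_kwarg(FixedDispatcher())
print(collision_message)
print(fixed_result)
```

Renaming the receiver is enough here because Python binds keyword arguments by parameter name; once no parameter is called `self`, the keyword falls through to `**kwargs`.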
Andrey Talman	17ca0d0edf	Add linux manywheel python 3.13 binary workflows (#130030 ) Test with passing linux manywheel workflows is here: https://github.com/pytorch/pytorch/pull/121979 Builder PR already merged: https://github.com/pytorch/builder/pull/1910 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130030 Approved by: https://github.com/albanD	2024-07-08 22:50:15 +00:00
Joel Schlosser	00335a27b4	Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175 ) This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on. NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity). API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction: 1. Min / max seq len has already been obtained from somewhere (manual calculation, static values, etc.) and they should be used in the cache 2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually) 3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175 Approved by: https://github.com/davidberard98, https://github.com/soulitzer	2024-07-08 22:14:52 +00:00
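To make concrete what "min / max sequence length" means for a jagged layout, here is a small pure-Python sketch that derives both values from an offsets list. It only illustrates the quantity being cached; `seq_len_bounds` is a hypothetical helper, not part of `torch.nested`, and the real computation would run on tensors (which is where the GPU -> CPU sync comes from).

```python
def seq_len_bounds(offsets):
    """Return (min_len, max_len) for sequences delimited by consecutive offsets."""
    if len(offsets) < 2:
        raise ValueError("need at least two offsets to delimit one sequence")
    # Sequence i spans offsets[i]:offsets[i + 1], so its length is the difference.
    lengths = [end - start for start, end in zip(offsets, offsets[1:])]
    return min(lengths), max(lengths)


# Three sequences of lengths 3, 1, and 5 packed back to back.
bounds = seq_len_bounds([0, 3, 4, 9])
print(bounds)
```

Supplying such values up front, as the commit allows, avoids recomputing them on demand later.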
rzou	922d2737d5	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-08 22:13:05 +00:00
PyTorch MergeBot	44a773c121	Revert "[custom ops] infer schema (#130079 )" This reverts commit 3fe324ffb612c8712f6af7639c1e7bcec5f3b4fd. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit `3fe324ffb6` ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))	2024-07-08 22:02:29 +00:00
PyTorch MergeBot	f9bb258892	Revert "[Inductor] Add aot_mode UT to new cpp_builder. (#130105 )" This reverts commit 21eeedb4554edab22b42bcb2f75f19e85652b72e. Reverted https://github.com/pytorch/pytorch/pull/130105 on behalf of https://github.com/izaitsevfb due to Breaks 46 tests internally at meta with: OSError: CUDA_HOME environment variable is not set ([comment](https://github.com/pytorch/pytorch/pull/130105#issuecomment-2215392198))	2024-07-08 21:40:03 +00:00
PyTorch MergeBot	5e467604c3	Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit dc5f37193f8d144d3de8525bf64eb1775d91e932. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2215355259))	2024-07-08 21:25:28 +00:00
PyTorch MergeBot	09d57f577b	Revert "[inductor] switch CppCodeCache to new cpp_builder. (#130132 )" This reverts commit 3957b3b34976896e0b13e1d09cf19e1da5b8292e. Reverted https://github.com/pytorch/pytorch/pull/130132 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130132#issuecomment-2215352180))	2024-07-08 21:22:39 +00:00
Yang Chen	856fe230c7	[AOTI] better approach to generating runtime checks for symbolic dimensions (#130220 ) Previously, we only handled cases where the symbolic dimension is of Symbol. We should use bound_sympy which handles more general cases for us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130220 Approved by: https://github.com/aakhundov	2024-07-08 20:46:38 +00:00
Shangdi Yu	3fe324ffb6	[custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-08 20:46:23 +00:00
PyTorch MergeBot	1e61cb8c87	Revert "[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 )" This reverts commit b428f1ad77aedfd150e920c8b0d23b7e6393ad6f. Reverted https://github.com/pytorch/pytorch/pull/129185 on behalf of https://github.com/huydhn due to dr ci categorization is wrong, the test_linalg xsuccess is real, theres also a test_jit failure https://github.com/pytorch/pytorch/actions/runs/9844339391/job/27178009798 `b428f1ad77` ([comment](https://github.com/pytorch/pytorch/pull/129185#issuecomment-2215230345))	2024-07-08 20:37:07 +00:00
Anshul Sinha	f059201e0d	[dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn (#130072 ) Summary In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below: Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072 Approved by: https://github.com/XilunWu ghstack dependencies: #129994	2024-07-08 20:12:52 +00:00
atalman	3e53cae0fc	Release 2.4 matrix update. Future releases dates (#130267 ) Added Release Compatibility Matrix for release 2.4 Updated future release dates for 2.6-2.9 Updated possible patch release date for 2.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130267 Approved by: https://github.com/malfet, https://github.com/albanD	2024-07-08 20:09:17 +00:00
Xia, Weiwen	36e2608783	[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 ) Description Add fusion path for dynamic quant and for QAT. The following patterns can be matched for static quant with QAT cases: `qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant` The following patterns can be matched for dynamic quant cases: `qx -> qlinear -> add -> optional relu` Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear python test/test_quantization.py -k test_linear_unary python test/test_quantization.py -k test_linear_binary Differential Revision: [D57655830](https://our.internmc.facebook.com/intern/diff/D57655830) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2024-07-08 20:04:39 +00:00
Tristan Rice	a8985a97f9	elastic/store: use wait instead of get for barrier (#130148 ) Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first. Test Plan: CI + existing tests -- no behavior change Differential Revision: D59396199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148 Approved by: https://github.com/kurman, https://github.com/wconstab	2024-07-08 19:53:42 +00:00
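The round-trip saving described in the commit above can be modeled with a toy in-memory store. This is a sketch under stated assumptions: `CountingStore` is hypothetical, it counts one "round trip" per `wait` and one more for the fetch inside `get`, and it does not block the way a real distributed store would.

```python
class CountingStore:
    """Toy key-value store that counts simulated network round trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def set(self, key, value):
        self.data[key] = value

    def wait(self, keys):
        # One round trip to confirm that every key is present.
        self.round_trips += 1
        missing = [key for key in keys if key not in self.data]
        if missing:
            raise KeyError(f"keys not yet set: {missing}")

    def get(self, key):
        # Modeled as a wait followed by a second round trip for the value.
        self.wait([key])
        self.round_trips += 1
        return self.data[key]


def barrier_with_get(store, key):
    store.get(key)  # result discarded, as in the old barrier


def barrier_with_wait(store, key):
    store.wait([key])


old_store, new_store = CountingStore(), CountingStore()
for store in (old_store, new_store):
    store.set("barrier/done", b"1")
barrier_with_get(old_store, "barrier/done")
barrier_with_wait(new_store, "barrier/done")
print(old_store.round_trips, new_store.round_trips)
```

Since the barrier never uses the returned value, switching to `wait` drops the extra fetch without changing behavior.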
Jeeja	22c809aa73	[FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110 ) When loading an optimizer checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device. In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). This returns only the index, so the empty tensor is created on CUDA by default. This PR changes it to use torch.device(device_type,device_module(device_type).current_device()) to get the device with the index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110 Approved by: https://github.com/fegin	2024-07-08 18:52:13 +00:00
James Wu	9158bb7837	Ignore functional tensor wrapper when caching (#128335 ) This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work. To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335 Approved by: https://github.com/bdhirsh	2024-07-08 18:39:20 +00:00
Michael Lazos	6dc64026cb	Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046 ) In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies: `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However, in the foreach case, and this case specifically, it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses `buf394`. As a result, we would hit the unbound local error. @eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092) To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short, we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning. Co-authored-by: David Berard <dberard@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046 Approved by: https://github.com/eellison	2024-07-08 18:25:16 +00:00
chilli	64139987c0	Add block mask utility support for batches and heads > 1 (#130227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227 Approved by: https://github.com/yanboliang ghstack dependencies: #130160, #130106, #130224	2024-07-08 18:15:35 +00:00
chilli	cd683212a2	Fix indexing twice with score_mod (#130224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224 Approved by: https://github.com/yanboliang ghstack dependencies: #130160, #130106	2024-07-08 18:15:35 +00:00
Jithun Nair	e16276b9bf	[ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753 ) This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gfx94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944. Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992 ### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture: _Using setter function:_ ``` >>> torch.backends.cuda.preferred_blas_library(backend="cublaslt") [W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator()) [W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator()) <_BlasBackend.Cublas: 0> ``` _Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_ ``` root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python >>> import torch >>> torch.backends.cuda.preferred_blas_library() [W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator()) <_BlasBackend.Cublas: 0> ``` ### and the following on a gfx90a (supported by hipblasLT) architecture: _Using setter function:_ ``` >>> import torch >>> torch.backends.cuda.preferred_blas_library() <_BlasBackend.Cublaslt: 1> >>> torch.backends.cuda.preferred_blas_library(backend="cublas") <_BlasBackend.Cublas: 0> >>> torch.backends.cuda.preferred_blas_library(backend="cublaslt") [W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. 
(function operator()) <_BlasBackend.Cublaslt: 1> ``` _Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_ ``` root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python >>> import torch >>> torch.backends.cuda.preferred_blas_library() <_BlasBackend.Cublaslt: 1> ``` (Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753 Approved by: https://github.com/malfet	2024-07-08 17:43:41 +00:00
William Wen	b428f1ad77	[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 ) Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython. This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame. We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12. This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185 Approved by: https://github.com/jansel	2024-07-08 17:39:05 +00:00
Jason Ansel	d325aaef39	[halide-backend] Use get_reduction_combine_fn for reduction ops (#130212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212 Approved by: https://github.com/eellison	2024-07-08 17:23:32 +00:00
Anshul Sinha	a18568f293	[dtensor][debug] Added functionality to convert log into a json file (#129994 ) Summary Currently, users have 2 options to view the tracing data. The first is through console where colored text is used to help users read the information. The second is they can log the information to a text file to view the log, which is useful in instances where the log is too long to fit in the console. However, depending on the model complexity, these logs could go on for thousands of lines making it difficult for the user to find specific information. In order to fix this, I have added the functionality to convert the log into a JSON file, which will be used to create a tree view in a browser, allowing the user to collapse parts of the log that will not be useful to them. I have given the user the option to pass their own file path, but have a default one in the event that none is provided. The expected output of the beginning json file and the browser view for the MLP model are shown below: <img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8"> <img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994 Approved by: https://github.com/XilunWu	2024-07-08 17:15:34 +00:00
Abhinav Podili	61017eb77b	Add missing mapping between DLDevice and ATenDevice for MAIA (#129615 ) This PR adds the missing mapping between `DLDevice` and `ATenDevice` for the MAIA device. These changes are necessary for `dlpack` support for `maia` tensors. [MAIA has already been added to the DLDeviceType enum in the dlpack repo](`bbd2f4d324/include/dlpack/dlpack.h (L120)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129615 Approved by: https://github.com/albanD	2024-07-08 17:08:39 +00:00
Edan Tessel Sneh	63743b223c	[AO] catch qparam mismatch for cat (#123769 ) Summary: Use &= instead of |=, since |= ignores an incorrect scale/zp. Change the scale check to use float comparison instead of int comparison. Issue a warning instead of an error for backward compatibility (ex: P1204628034). Test Plan: see warning in: P1204628034 Reviewed By: jerryzh168 Differential Revision: D55699212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769 Approved by: https://github.com/jerryzh168	2024-07-08 16:47:14 +00:00
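Why `|=` hides a mismatch while `&=` catches it can be shown in a few lines of plain Python. This is an illustrative sketch, not the quantized `cat` implementation: the function names, the `rel_tol` value, and the use of `math.isclose` for the float comparison are all assumptions made for the example.

```python
import math


def qparams_match_buggy(scales, zero_points):
    # Starting from True, `|=` can never flip the flag to False,
    # so a mismatching input is silently ignored.
    ok = True
    for scale, zp in zip(scales[1:], zero_points[1:]):
        ok |= scale == scales[0] and zp == zero_points[0]
    return ok


def qparams_match_fixed(scales, zero_points, rel_tol=1e-6):
    # `&=` keeps the flag True only while every input agrees with the first.
    ok = True
    for scale, zp in zip(scales[1:], zero_points[1:]):
        same_scale = math.isclose(scale, scales[0], rel_tol=rel_tol)  # float comparison
        ok &= same_scale and zp == zero_points[0]
    return ok


mismatched = ([0.1, 0.2], [0, 0])
print(qparams_match_buggy(*mismatched), qparams_match_fixed(*mismatched))
```

Comparing scales as floats matters as well: truncating 0.1 and 0.2 to integers would make both 0 and report a false match.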
Catherine Lee	f4774d64bf	Skip test_profile_memory on windows (#130037 ) The test was introduced in https://github.com/pytorch/pytorch/pull/128743 It is failing on windows cuda `a9a744e442/1` (it is skipped on cpu jobs) After talking with the author and Aaron, I have been advised to skip it on windows, as windows support for kineto is not a high priority Pull Request resolved: https://github.com/pytorch/pytorch/pull/130037 Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi	2024-07-08 16:11:51 +00:00
PyTorch MergeBot	d7b7f8b79f	Revert "[ROCm] Add int4 support (#129710 )" This reverts commit d0ad13fa42fc2e9935bd3bda2937a3491276d274. Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))	2024-07-08 16:07:53 +00:00
Joel Schlosser	c8ab2e8b63	Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238 ) This PR: * Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed). * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`: * Uncovered a bunch of test issues: * Test breakdown (>100 total) * A lot of tolerance issues (tweaked tolerance values to fix) * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype) * 3 actually broken semantics (for masked tensor; added xfails) * 4 Jacobian mismatches (added xfails) * 2 nan results (skip for now, need fixing) * 3 results too far from reference result (add xfails) * Skips MPS tests for now (there are so many failures!). Those will default to the old behavior. before (no seed setting): ``` real 0m21.306s user 0m19.053s sys 0m5.192s ``` after (with seed setting): ``` real 0m21.905s user 0m19.578s sys 0m5.390s ``` * Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command. 
```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238 Approved by: https://github.com/janeyx99, https://github.com/justinchuby	2024-07-08 16:06:38 +00:00
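The per-sample seeding and single-sample restriction described in #128238 can be sketched roughly as follows. This is a minimal standard-library illustration, not the actual `TrackedInputIter` code: `seeded_samples`, `index_from_env`, and the CRC-based seed derivation are hypothetical, and `random.seed` stands in for seeding the PyTorch RNG. Only the env var name `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` comes from the commit message.

```python
import os
import random
import zlib


def seeded_samples(sample_fns, test_name, restrict_to=None):
    """Yield (index, sample), reseeding the RNG before each sample is built.

    sample_fns: zero-argument callables, each building one sample (hypothetical).
    restrict_to: optional index; when set, only that sample is generated.
    """
    # Derive a stable, test-name-specific seed (hypothetical scheme).
    seed = zlib.crc32(test_name.encode())
    for index, make_sample in enumerate(sample_fns):
        if restrict_to is not None and index != restrict_to:
            continue
        random.seed(seed)  # stand-in for seeding the framework RNG
        yield index, make_sample()


def index_from_env(var="PYTORCH_OPINFO_SAMPLE_INPUT_INDEX"):
    """Read the optional sample index from the environment, if present."""
    value = os.environ.get(var)
    return int(value) if value is not None else None


sample_fns = [lambda: random.random() for _ in range(5)]
all_samples = dict(seeded_samples(sample_fns, "test_bar_add_cuda_uint8"))
only_third = dict(
    seeded_samples(sample_fns, "test_bar_add_cuda_uint8", restrict_to=3)
)
```

Because the seed is reset before every sample, sample 3 comes out the same whether or not samples 0 to 2 were generated first, which is what makes a single-index repro meaningful.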
Feny Patel	acf9e31cf8	adding MTIA to supported activities (#130052 ) Summary: Put the hasMTIA block in the if condition as well to let MTIA activities be added to supported activities Test Plan: Tested with auto-trace Differential Revision: D59280848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130052 Approved by: https://github.com/aaronenyeshi	2024-07-08 15:20:05 +00:00
Alnis Murtovi	16d53cb7d5	Only run mixed_mm heuristic if shapes are static (#130081 ) If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081 Approved by: https://github.com/Chillee	2024-07-08 14:20:55 +00:00
Simon Fan	010009e642	[compiled autograd] c++ autograd function saved_data: lift tensors (#130057 ) Avoid recompiles when a custom C++ autograd function uses ctx->saved_data to save tensors. iv.toTensor can return a reference for `after(iv.toTensor())`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130057 Approved by: https://github.com/jansel	2024-07-08 07:42:07 +00:00
cyy	f4dcf2ae93	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-07-08 07:03:53 +00:00
Animesh Jain	f053be2a97	[dynamo] Graph break on random_ op (#130222 ) Fixes https://github.com/pytorch/pytorch/issues/121621 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130222 Approved by: https://github.com/jansel	2024-07-08 06:10:24 +00:00
Sijia Chen	31bb65de19	[Inductor] Fix conditional codegen (#129492 ) Summary: We have a cache to guarantee that each `sym` is codegen'd only once; see the following code:
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, this does not cover the case where the same `sym` needs to be codegen'd in both branches of a condition (the true branch and the false branch), which caused an `undefined symbols` error: P1441378833 To fix the issue, we use a stack to capture the state before doing the condition codegen and restore the state after doing the codegen. Test Plan: TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size Differential Revision: D58973730 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492 Approved by: https://github.com/aakhundov	2024-07-08 05:33:47 +00:00
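The save-and-restore idea in #129492 might look roughly like the sketch below. It is a toy model under stated assumptions, not the Inductor wrapper code: `ToyWriter`, `push_state`, `pop_state`, and `codegen_branch` are hypothetical names, and the "computed sizes" cache is reduced to a plain set of strings.

```python
class ToyWriter:
    """Toy code writer that emits each size definition at most once per scope."""

    def __init__(self):
        self.lines = []
        self.computed_sizes = set()
        self._saved_states = []  # stack of snapshots of computed_sizes

    def ensure_size_computed(self, sym, expr):
        # Emit the definition only if it is not already visible in this scope.
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        self.lines.append(f"{sym} = {expr}")

    def push_state(self):
        # Snapshot the cache before entering a branch.
        self._saved_states.append(set(self.computed_sizes))

    def pop_state(self):
        # Forget definitions emitted inside the branch: they are not
        # visible to the sibling branch or to code after the condition.
        self.computed_sizes = self._saved_states.pop()

    def codegen_branch(self, header, sym, expr):
        self.push_state()
        self.lines.append(header)
        self.ensure_size_computed(sym, expr)
        self.pop_state()


writer = ToyWriter()
writer.codegen_branch("if pred:", "ps0", "s0 * 2")
writer.codegen_branch("else:", "ps0", "s0 * 2")
```

Without the push/pop pair, the second branch would skip the definition of `ps0` because the first branch had already marked it as computed, which mirrors the undefined-symbol failure described in the commit message.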
Animesh Jain	c5c9dbece1	[dynamo][user-defined] Simplify and improve scope of UserDefinedObject var_getattr (#130169 ) Fixes https://github.com/pytorch/pytorch/issues/122649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130169 Approved by: https://github.com/jansel ghstack dependencies: #118448, #130159	2024-07-08 04:10:56 +00:00
Jerry Mannil	d0ad13fa42	[ROCm] Add int4 support (#129710 ) Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction. Only supports CDNA2 and CDNA3 gpus for now. Fixes #124699 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-07 23:54:22 +00:00
Animesh Jain	d1b832e739	[inductor][mkl][inline-inbuilt-nn-modules] Change assertion (#130219 ) Fixes the test in the next PR - `python test/inductor/test_mkldnn_pattern_matcher.py -k TestDynamicPatternMatcher.test_conv3d_unary_dynamic_shapes` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130219 Approved by: https://github.com/leslie-fang-intel	2024-07-07 21:32:07 +00:00
Pian Pawakapan	940e4477ab	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599 Approved by: https://github.com/ezyang	2024-07-07 20:10:14 +00:00
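The bound deduplication in the second example of #128599 amounts to keeping only the tightest lower and upper bound for a symbol. A minimal sketch of that arithmetic, assuming a hypothetical `merge_bounds` helper and a plain list of `(op, value)` pairs rather than the real ShapeEnv machinery:

```python
import math


def merge_bounds(constraints):
    """Collapse single-symbol bound checks into one (lower, upper) pair.

    constraints: iterable of (op, value) with op in {">=", "<="}.
    """
    lower, upper = -math.inf, math.inf
    for op, value in constraints:
        if op == ">=":
            lower = max(lower, value)  # keep the tightest lower bound
        elif op == "<=":
            upper = min(upper, value)  # keep the tightest upper bound
        else:
            raise ValueError(f"unsupported operator: {op!r}")
    return lower, upper


# The constraints on n from the commit message, in order of appearance.
bounds = merge_bounds(
    [(">=", 2), ("<=", 16), (">=", 4), ("<=", 20), (">=", 0), (">=", 3), ("<=", 14)]
)
```

For the constraints in the commit message this yields `(4, 14)`, matching the two remaining `torch._check` calls.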
Simon Mahns	0c44684901	[Typo] Fix typo in DispatchKeyExtractor.h (#130221 ) Summary: typo_helper Test Plan: ci Differential Revision: D59424671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130221 Approved by: https://github.com/Skylion007	2024-07-07 19:43:31 +00:00
PyTorch MergeBot	e423224546	Revert "[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 )" This reverts commit 98929ceae3873f18f4747b88cdff708fde107aa7. Reverted https://github.com/pytorch/pytorch/pull/126967 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/126967#issuecomment-2212337926))	2024-07-07 06:16:32 +00:00
PyTorch MergeBot	1b57dce35f	Revert "[Inductor][CPP] Support more than one LocalBuffer (#129121 )" This reverts commit f794cf59bd0891ff4a4337e0d919ee68ba1f0472. Reverted https://github.com/pytorch/pytorch/pull/129121 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/129121#issuecomment-2212337590))	2024-07-07 06:13:40 +00:00
leslie-fang-intel	f794cf59bd	[Inductor][CPP] Support more than one LocalBuffer (#129121 ) Summary Support more than one local buffer in an outer-loop fused node, as well as the case where multiple global buffers share the same local buffer. TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion ``` Next Step - [✓] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126967	2024-07-07 05:43:08 +00:00
leslie-fang-intel	98929ceae3	[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 ) Summary Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)`). Comparing the generated code with ATen, the performance bottleneck appears to be related to the use of a [local buffer in ATen](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)`). In the current implementation, Inductor uses the output buffer of Kernel Group Args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (taking `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach. This PR introduces the `LocalBuffer` optimization. With it, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation. 
TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion ``` Next Step - [ ] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-07-07 05:34:57 +00:00
Xuehai Pan	a3ce9eddd6	[BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198 Approved by: https://github.com/Skylion007	2024-07-07 00:58:22 +00:00
peaceorwell	9983242c8e	[inductor] support adding a new inductor backend using PrivateUse1 (#129953 ) Add handling custom device registered by PrivateUse1 in init_backend_registration() func Fixes #129952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129953 Approved by: https://github.com/jansel	2024-07-06 21:15:40 +00:00
Shuo Ding	3d138af943	[Inductor] First implementation of the B2B-GEMM pass with tests (#129995 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129995 Approved by: https://github.com/eellison	2024-07-06 19:10:22 +00:00
Xu Han	3957b3b349	[inductor] switch CppCodeCache to new cpp_builder. (#130132 ) Changes: 1. switch CppCodeCache to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 18:57:44 +00:00
Xu Han	dc5f37193f	[inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 18:44:13 +00:00
cyy	dfe3534134	[1/N] Fix NVCC warnings (#130191 ) Fixes NVCC warnings, as the required steps to enable Werror on CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130191 Approved by: https://github.com/Skylion007	2024-07-06 18:25:04 +00:00
Xuehai Pan	3f50e197c4	[BE] annotate `torch.autograd.graph` (#129558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558 Approved by: https://github.com/soulitzer	2024-07-06 18:14:16 +00:00
Xu Han	01ec03bac6	[inductor] switch HalideCodeCache to new cpp_builder. (#130146 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130146 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 17:35:17 +00:00
cyy	2f219f7d79	Enforce unused-{variable/function} checks to all torch targets (#130189 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130189 Approved by: https://github.com/ezyang	2024-07-06 16:03:01 +00:00
cyy	096eca2f9a	[2/N] Replace exceptions with static_assert(false) in some templates (#130116 ) Follows #127371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130116 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-06 13:23:05 +00:00
Nikita Shulga	520a4642bf	[CI] Enable build with asserts (#129924 ) Not a standard CMake config, as far as I can tell, but it introduces an important concept of an optimized build without `NDEBUG`. Test by running `python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)"`, which is a no-op unless debug_assert_fail is enabled. Add the recently added `_unsafe_masked_index`/`_unsafe_masked_index_put_accumulate` to DONT_ENFORCE_SAME_TENSOR_IMPL_OR_STORAGE so that tests involving those ops do not fail with an internal assert. Suppress a number of internal asserts to make CI green, see https://github.com/pytorch/pytorch/issues/130073 Fixes https://github.com/pytorch/pytorch/issues/102105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129924 Approved by: https://github.com/atalman, https://github.com/albanD	2024-07-06 13:14:32 +00:00
chilli	da66e50e6e	Added compile option to create_block_mask (#130106 ) Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 trillion element tensor if fully materialized. ``` print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 220, 220, _compiled=True))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106 Approved by: https://github.com/yanboliang ghstack dependencies: #130160	2024-07-06 08:09:56 +00:00
PyTorch MergeBot	963f430d13	Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599 )" This reverts commit 0267b2ddcb58aa66b2b62336216da7df4f9939d8. Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a landrace and fails inductor/test_cudagraph_trees in trunk `0267b2ddcb` ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2211690518))	2024-07-06 07:20:05 +00:00
Aaron Enye Shi	aa4899eee9	[CCA][Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180 ) Summary: Multiple threads can be calling the alloc_trace std::vector, which will result in SIGSEGVs when objects are double freed, accessed after free, or two inserts at the same time. We need to lock when inserting, accessing or removing TraceEntry in alloc_trace. Test Plan: This is a rare crash, which was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. Here are a few jobs that crashed before, and this diff fixes. Differential Revision: D59380507 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180 Approved by: https://github.com/eqy, https://github.com/kit1980	2024-07-06 06:14:54 +00:00
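The locking discipline described in #130180 (lock when inserting, accessing, or removing trace entries) can be illustrated with a small Python analogue. This is only a sketch of the pattern: the actual fix is in the C++ caching allocator around a `std::vector`, the `AllocTrace` class and its method names here are hypothetical, and CPython's list `append` is already atomic under the GIL, so the lock below demonstrates the structure rather than fixing a real crash.

```python
import threading


class AllocTrace:
    """Shared trace buffer in which every access goes through one lock."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def record(self, entry):
        with self._lock:  # insert under the lock
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:  # read under the lock, return a private copy
            return list(self._entries)

    def clear(self):
        with self._lock:  # remove under the lock
            self._entries.clear()


def worker(trace, thread_id, count):
    for i in range(count):
        trace.record((thread_id, i))


trace = AllocTrace()
threads = [
    threading.Thread(target=worker, args=(trace, tid, 1000)) for tid in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
entries = trace.snapshot()
```

Returning a copy from `snapshot` keeps readers from iterating over the buffer while another thread mutates it.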
PyTorch MergeBot	e019540c9e	Revert "Fix the SDPA AOT export issue (#130164 )" This reverts commit 1927c406844affbfe3496d5cbc31d4ebe11c8bfb. Reverted https://github.com/pytorch/pytorch/pull/130164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ExecuTorch tests in trunk `1927c40684` ([comment](https://github.com/pytorch/pytorch/pull/130164#issuecomment-2211667777))	2024-07-06 05:59:49 +00:00
chilli	bf609630ae	Fix a bunch of stride issues with FlexAttention (#130160 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160 Approved by: https://github.com/yanboliang	2024-07-06 03:58:14 +00:00
Edward Z. Yang	10c831567b	Make sympify'ing SymInt/etc produce their sympy expression (#130166 ) There is one huge problem this fixes: today, sympify(symint) produces a float(!!) because Sympy attempts to see if you can coerce the symint to float in sympify and of course this works on SymInt. However, this also has another nontrivial effect: anywhere in Inductor where sympy expressions are passed around, it is also valid to pass around a SymInt now. I'm ambivalent about this: it's currently a mistake to be passing around a SymInt when a sympy expression is expected. But maybe this is fine? Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166 Approved by: https://github.com/yf225	2024-07-06 03:56:45 +00:00
Jason Ansel	acd03ca2d9	[halide-backend] Support scan kernels (#129035 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #130129	2024-07-06 03:49:50 +00:00
Jason Ansel	c5110f6388	[halide-backend] Use 0D scalar inputs/outputs (#130129 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129 Approved by: https://github.com/shunting314	2024-07-06 03:49:50 +00:00
Pian Pawakapan	0267b2ddcb	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599 Approved by: https://github.com/ezyang	2024-07-06 03:44:49 +00:00
PyTorch UpdateBot	7c43f59a45	[audio hash update] update the pinned audio hash (#129429 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429 Approved by: https://github.com/pytorchbot	2024-07-06 03:34:12 +00:00
Animesh Jain	bd0252fb98	[dynamo][user-defined] Support method descriptors (#130159 ) Fixes https://github.com/pytorch/pytorch/issues/120650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130159 Approved by: https://github.com/jansel ghstack dependencies: #118448	2024-07-06 02:03:09 +00:00
Daulet Askarov	a1a2023eb8	Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972 ) Summary: It turns out that the device passed as a param to is_pinned is meant to be the accelerator device with respect to which pinning is expected. Passing 'cpu' always makes the return value false, regardless of whether the actual tensor is a CPU tensor pinned to CUDA. Besides, there is a PR https://github.com/pytorch/pytorch/pull/126376 about to be merged that automatically uses the correct accelerator device, which obviates the need for users to pass any kind of explicit device and doesn't create a CUDA context for pure CPU tensors. Note, https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 test is expected to be broken by this diff, but it should be fixed forward by https://github.com/pytorch/pytorch/pull/126376 Test Plan: Sandcastle. Differential Revision: D59283190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129972 Approved by: https://github.com/LucasLLC	2024-07-06 01:07:32 +00:00
Sijia Chen	1927c40684	Fix the SDPA AOT export issue (#130164 ) Summary: ## Context TL;DR: aot_export failed for the SDPA memory efficient backend when using `inference_mode` The CMF AOTI lowering started to fail on trunk. We have a script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found that the issue starts from D58701607 ## Root Cause Under `inference_mode()`, `aten::scaled_dot_product_attention` was not decomposed before `functionalization`. Since the op itself is an out-of-place op, `functionalization` left it unchanged; it was then decomposed into `masked_fill_`, which was in turn decomposed into `copy_`. So the flow is `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure Under `torch.no_grad()`, `aten::sdpa` was decomposed before `functionalization`, so the flow is `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-place ops` ---> good ## How to fix Long-term: The issue is tracked in https://github.com/pytorch/pytorch/issues/129418. The long-term fix could be to run one more round of `functionalization` after the `decompose` step, like `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good Short-term: That would probably be a big change. To unblock the production use case, this diff marks `aten::sdpa` as an op that should be decomposed. Test Plan: local repro works now buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti Differential Revision: D59385876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164 Approved by: https://github.com/zou3519	2024-07-06 00:57:47 +00:00
Shunting Zhang	c5ede865c4	[pt2-bench] raise tolerance for squeezenet1_1 (#130165 ) The training accuracy for this model starts to regress. It does not show up on the weekly run yet but 1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c) 2. I can repro it locally Command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend inductor --device cuda --only squeezenet1_1 ``` Raise the tolerance to fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165 Approved by: https://github.com/jansel ghstack dependencies: #129996, #129941, #130005, #130163	2024-07-06 00:49:15 +00:00
Shunting Zhang	0fcbca9adb	[pt2-bench] use eval mode for vision_maskrcnn (#130163 ) Try to fix https://github.com/pytorch/pytorch/issues/130161 `--accuracy` works because we use eval mode, while `--training` does not work because we use training mode but TorchBench does not return target tensors. In training mode, vision_maskrcnn requires target tensors. I fix that by always using eval mode for vision_maskrcnn training. With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163 Approved by: https://github.com/jansel ghstack dependencies: #129996, #129941, #130005	2024-07-06 00:49:15 +00:00
cyy	e5841bb8d5	[3/N] Enforce unused-function and unused-variable checks (#130084 ) Follows #129878. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130084 Approved by: https://github.com/ezyang	2024-07-05 23:56:00 +00:00
Shuqiang Zhang	126796d239	[c10d] fixing a UT after a change in eager mode new group (#130167 ) Summary: After https://github.com/pytorch/pytorch/pull/129284, new_group is eager if device_id is specified, which broke one UT. This PR fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130167 Approved by: https://github.com/wconstab	2024-07-05 23:18:30 +00:00
Xuehai Pan	d1d0a7080f	[torchgen] reference generated comment to actual location of the generator and template (#130020 ) As per title.
```diff
# torch/_VF.pyi
- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```
```diff
# torch/return_types.pyi
- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```
```diff
# torch/_C/__init__.pyi
- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```
```diff
# torch/_C/_nn.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```
```diff
# torch/_C/_VariableFunctions.pyi
- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```
```diff
# torch/nn/functional.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020 Approved by: https://github.com/ezyang	2024-07-05 21:47:14 +00:00
PyTorch MergeBot	6fc771d19b	Revert "Change deprecate warning on dispatch_on_subclass to warn once (#130047 )" This reverts commit 8ff243bcf190bab62348310693f0ad2f90061c89. Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function on multiple jobs `8ff243bcf1` https://github.com/pytorch/pytorch/actions/runs/9812489165/job/27097342443. Dr CI is doing something weird about the unstable failures ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2211409090))	2024-07-05 21:03:36 +00:00
Catherine Lee	df50452279	Pin optree==0.11.0 on windows CI (#130155 ) doctests test_testing Failing run has 0.12.0 https://github.com/pytorch/pytorch/actions/runs/9804335516/job/27072891998 Succeeding run has 0.11.0 https://github.com/pytorch/pytorch/actions/runs/9798330845/job/27057359554 It is already pinned for mac and linux Pull Request resolved: https://github.com/pytorch/pytorch/pull/130155 Approved by: https://github.com/huydhn, https://github.com/atalman	2024-07-05 20:28:58 +00:00
Lucas Pasqualin	18e75c098b	[DCP] Adds Checkpointing Team (dcp) to merge rules (#129582 ) [DCP] Adds Checkpointing Team (dcp) to merge rules. Please comment to this PR if you think you should be added as well! Pull Request resolved: https://github.com/pytorch/pytorch/pull/129582 Approved by: https://github.com/fegin	2024-07-05 20:09:31 +00:00
Eddie Yan	739fc01ac9	[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 ) The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally: `a21d4363d2/c10/cuda/CUDAStream.h (L132)` OUTDATED below: The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time

import torch
import torch.distributed as dist


def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)
    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead. 
The issue can be worked around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization, which deliberately tried to avoid a device synchronization. This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device. CC @wujingyue @Aidyn-A @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908 Approved by: https://github.com/kwen2501	2024-07-05 19:53:54 +00:00
Huy Do	faebaef089	[EZ] Fix typo in upload stats OIDC rolename (#130168 ) My mistake from https://github.com/pytorch/pytorch/pull/129544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130168 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman	2024-07-05 19:38:24 +00:00
PaliC	3d56673b24	[Split Build][BE] remove extraneous .py, .a, and .so files (#130053 ) Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script which produces the binary to do this. That PR is https://github.com/pytorch/builder/pull/1912 Verification: The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions)
```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd                                      (pytorch-3.10)
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \)    (pytorch-3.10)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053 Approved by: https://github.com/atalman	2024-07-05 19:05:32 +00:00
Iris Zhang (PyTorch)	8ff243bcf1	Change deprecation warning on dispatch_on_subclass to warn once (#130047 ) Summary: Right now the deprecation warning fires on every operator that calls into torch_function. Change it to TORCH_WARN_ONCE instead. More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/ Test Plan: Sandcastle Differential Revision: D59338775 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047 Approved by: https://github.com/XilunWu	2024-07-05 18:52:49 +00:00
PyTorch MergeBot	784e3b4123	Revert "Change numeric_debug_handle to store per-node id (#129811 )" This reverts commit a9a744e442975cfbc6f4b26a532e5c1b3d9d5692. Reverted https://github.com/pytorch/pytorch/pull/129811 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129811#issuecomment-2211245852))	2024-07-05 18:14:02 +00:00
Huy Do	889ed48a22	Fix missing id-token write in upload stats (#130153 ) Fix the mistake from https://github.com/pytorch/pytorch/pull/129544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130153 Approved by: https://github.com/clee2000	2024-07-05 18:05:46 +00:00
Jiashen Cao	7c5f3cd049	Add explain function to TSConverter. (#129968 ) Summary: The explain function performs a conversion dry run and reports to users which operators are unsupported or fail the conversion. Test Plan: * `pytest test/export/test_converter.py` Differential Revision: D59251934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968 Approved by: https://github.com/angelayi	2024-07-05 18:04:29 +00:00
Animesh Jain	7ea8a3c9b8	[dynamo] Validate check_fn (#118448 ) Fixes - https://github.com/pytorch/pytorch/issues/128090 Tracker issue here - https://github.com/pytorch/pytorch/issues/129937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118448 Approved by: https://github.com/jansel, https://github.com/ezyang	2024-07-05 18:04:12 +00:00
Joel Schlosser	7192ee0735	Default to input tensor device for as_nested_tensor(t) (#130050 ) Fixes #129647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130050 Approved by: https://github.com/YuqingJ	2024-07-05 17:50:08 +00:00
Huy Do	a33ee73a28	Upload perf stats to both Rockset and dynamoDB (#129544 ) To avoid outage on HUD, I plan to migrate perf stats to dynamoDB as follows: 1. Upload perf stats to both Rockset and dynamoDB 2. Copy all the existing content from Rockset to dynamoDB 3. Create new Rockset tables to map to dynamoDB 4. Switch HUD to use the new Rockset tables (temporarily) 5. Delete the existing tables This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422 ### Testing ``` python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_" ... Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats ``` And confirm the same number of documents is on the table ![Screenshot 2024-07-03 at 18 10 35](https://github.com/pytorch/pytorch/assets/475357/6c055c96-00ca-4cb3-bbe5-fe4914f9da9b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544 Approved by: https://github.com/clee2000	2024-07-05 16:31:49 +00:00
James Wu	e7ab7b83bc	Have torch_key hash entire torch directory (#129250 ) Summary: Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files. There's an argument to be made to only hash .py and .cpp files. Maybe we can fix the glob to do that. We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources. Test Plan: Unit tests still pass on OSS For torch_key: ``` buck2 build caffe2:src_hash.txt -v 2 --show-output ``` See the output, then make any change to any torch file. See that the hash changes. Reviewed By: oulgen Differential Revision: D58875785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250 Approved by: https://github.com/oulgen	2024-07-05 15:37:16 +00:00
PyTorch MergeBot	eea4ece256	Revert "[audio hash update] update the pinned audio hash (#129429 )" This reverts commit 30fc4b06f55c7c4a915f938d7d5d6abbbc23bf61. Reverted https://github.com/pytorch/pytorch/pull/129429 on behalf of https://github.com/jeanschmidt due to pytorch bot should not have allowed this merge, as there are failing jobs ([comment](https://github.com/pytorch/pytorch/pull/129429#issuecomment-2210894639))	2024-07-05 13:38:44 +00:00
PyTorch MergeBot	4b05d9d233	Revert "[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 )" This reverts commit c9f1db265e317829b3a4d3af5be5c9266874dcd4. Reverted https://github.com/pytorch/pytorch/pull/129908 on behalf of https://github.com/jeanschmidt due to Seems to have introduced windows errors on main ([comment](https://github.com/pytorch/pytorch/pull/129908#issuecomment-2210888890))	2024-07-05 13:34:59 +00:00
Shunting Zhang	8f6765f7a7	[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 ) This model's accuracy test recently regressed. The debugging process to find the cause went quite smoothly, so I'm writing it down in case it is helpful. Clicking the model name beit_base_patch16_224 on the dashboard shows the pass status of the model over e.g. the past month. For this model, we can see that it started to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is that the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/ ``` gives us the list of Inductor PRs between the good and bad commits: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Looking roughly through the PRs, I suspected that ``` ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . After that, the accuracy test passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the graph change may cause the partitioner to make different recomputation decisions, which can change numerics. Since this is not a real issue, I'll raise the tolerance to make it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #129996, #129941	2024-07-05 10:26:39 +00:00
Shunting Zhang	c0735a3dd3	[pt2-bench] fix accuracy failure for a few models (#129941 ) This PR batches the fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not due to a real issue. ## sebotnet33ts_256 The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I cannot repro locally, but from the dashboard log: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` From the error message on the dashboard: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` a tolerance of 0.02 should suppress this error. ## gluon_inception_v3 This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` From the error message on the dashboard ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` raising the tolerance should suppress this error. ## mobilenetv3_large_100 Fails in MA (max-autotune) mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. ## yolov3 Fails on the dashboard with the error ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fix it by using a larger multiplier for smaller tensors and raising the tolerance. ## timm_efficientdet Fails on the dashboard with the error ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I cannot repro locally with the command ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-05 10:26:39 +00:00
Shunting Zhang	8f1c2e1e28	[pt2-bench] pass acc test if ref is NaN (#129996 ) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` we can see that both the reference number and the pt2 number are NaN. I changed torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-05 10:26:39 +00:00
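The rule described in #129996 (pass when both RMSE values are NaN) can be sketched in isolation. This is a minimal illustration, not the code in `torch._dynamo.utils.same`: the helper name `rmse_check` is hypothetical, and the pass criterion `res_error <= multiplier * ref_error + tol / 10` is an assumption chosen only to make the sketch concrete.

```python
import math


def rmse_check(res_error: float, ref_error: float,
               multiplier: float = 3.0, tol: float = 1e-3) -> bool:
    """Hypothetical helper; not the PyTorch implementation."""
    # If both the compiled result and the eager reference are NaN relative
    # to the fp64 baseline, there is nothing meaningful to compare, so pass.
    if math.isnan(res_error) and math.isnan(ref_error):
        return True
    # Assumed criterion: the compiled error may exceed the reference error
    # by at most a multiplicative factor plus a small absolute slack.
    # Any comparison with a single NaN evaluates to False, i.e. a failure.
    return res_error <= multiplier * ref_error + tol / 10.0


# Values taken from the dashboard logs quoted in the commit messages.
print(rmse_check(float("nan"), float("nan")))  # vision_maskrcnn
print(rmse_check(0.00096, 0.00009))            # timm_efficientdet, tol=0.001
```

Under these assumptions the first call passes and the second fails, which matches the motivation for raising the tolerance in #129941.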
Yu, Guangye	78a0b010eb	Refine XPU UTs (#130138 ) # Motivation 1. enable all test cases related to `TestXpu` running in XPU CI. 2. make `test_lazy_init` stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130138 Approved by: https://github.com/EikanWang	2024-07-05 09:56:22 +00:00
Jason Ansel	3240bff56a	[benchmarking] Add join_results.py (#129202 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129202 Approved by: https://github.com/yanboliang, https://github.com/shunting314	2024-07-05 06:55:30 +00:00
PyTorch UpdateBot	30fc4b06f5	[audio hash update] update the pinned audio hash (#129429 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429 Approved by: https://github.com/pytorchbot	2024-07-05 03:32:29 +00:00
Eddie Yan	c9f1db265e	[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 ) The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally: `a21d4363d2/c10/cuda/CUDAStream.h (L132)` OUTDATED below: The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following: ``` import logging import os import time import torch import torch.distributed as dist def main(): logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s") backend = 'nccl' group = torch.distributed.init_process_group(backend=backend) rank = torch.distributed.get_rank(group=group) for i in range(4): time.sleep(rank) logging.info(f"Rank {rank}: enter barrier {i}") dist.barrier() logging.info(f"Rank {rank}: exit barrier {i}") dist.destroy_process_group() if __name__ == "__main__": main() ``` appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead. 
The issue can be worked around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to go against the spirit of the stream synchronization, which deliberately tries to avoid a device synchronization. This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device. CC @wujingyue @Aidyn-A @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908 Approved by: https://github.com/kwen2501	2024-07-04 20:36:58 +00:00
Lei Zhang	7128504424	[inductor] Add Triton template for Conv3D (#129518 ) This commit adds a Triton template for Conv3D ops, following the same logic as Conv2D. Conv3D ops aren't used as frequently as Conv2D, so they may be less optimized in various libraries. Having a Triton-based inductor implementation can therefore improve performance in some cases. Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129518 Approved by: https://github.com/jansel, https://github.com/jataylo	2024-07-04 20:30:50 +00:00
Kurt Mohler	e590168865	Enable sharing meta tensors between processes (#129520 ) Fixes #129436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520 Approved by: https://github.com/ezyang	2024-07-04 20:29:48 +00:00
Xu Han	21eeedb455	[Inductor] Add aot_mode UT to new cpp_builder. (#130105 ) Changes: 1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT. 2. Switch AotCodeCompiler vec isa command gen to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-04 19:08:56 +00:00
chuanqiw	d496145534	[CD] Add triton xpu wheel build (#129730 ) Enable the triton xpu wheel build first, then add the pytorch xpu nightly wheel build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129730 Approved by: https://github.com/atalman	2024-07-04 17:55:20 +00:00
Huy Do	f78b79daaa	Forward fix the missing torch.nn.Module.set_submodule from D59140215 (#130075 ) Summary: This is to forward fix D59140215 from a PyTorch open source contributor T194074371. On PyTorch side, we need to use isinstance instead of type when checking for nn.Module. This is the same way get_submodule is currently implemented. Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test` Differential Revision: D59254638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075 Approved by: https://github.com/mikaylagawarecki	2024-07-04 17:46:56 +00:00
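The difference between an exact `type` check and `isinstance`, which #130075 relies on, can be shown without PyTorch. In this minimal sketch the classes `Module` and `Linear` are hypothetical stand-ins for `torch.nn.Module` and a subclass of it; the actual check inside `set_submodule` is not reproduced here.

```python
class Module:
    """Hypothetical stand-in for torch.nn.Module."""


class Linear(Module):
    """Hypothetical stand-in for any subclass of the base module class."""


def is_module_by_type(obj: object) -> bool:
    # Exact type comparison: every subclass instance is rejected.
    return type(obj) is Module


def is_module_by_isinstance(obj: object) -> bool:
    # isinstance follows the inheritance chain, so subclasses are accepted.
    return isinstance(obj, Module)


layer = Linear()
print(is_module_by_type(layer))        # False
print(is_module_by_isinstance(layer))  # True
```

Since practically every real module is a subclass of the base class, the exact-type check would reject almost all valid inputs.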
Howard Huang	5b5f4b02c2	[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 ) # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369 Approved by: https://github.com/wconstab ghstack dependencies: #129368	2024-07-04 16:38:30 +00:00
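The "log the timestep and schedule plan, then re-raise" pattern mentioned in #129369 can be sketched as follows. The names `run_schedule`, `actions`, and `execute` are hypothetical and do not come from `schedules.py`; only the try/except shape is the point.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def run_schedule(actions: Sequence[str], execute: Callable[[str], None]) -> None:
    """Hypothetical runner: execute each action in order, logging context on failure."""
    for timestep, action in enumerate(actions):
        try:
            execute(action)
        except Exception:
            # Record where the schedule failed and what the full plan was,
            # then re-raise so the caller still sees the original exception.
            logger.error(
                "Schedule failed at timestep %d (action %r); plan: %s",
                timestep, action, list(actions),
            )
            raise
```

A bare `raise` preserves the original traceback, so the extra log line adds context without hiding the root cause.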
PyTorch MergeBot	6dfa53ca76	Revert "[pt2-bench] pass acc test if ref is NaN (#129996 )" This reverts commit 51fa0bd436cf627bd0c8ccf3a3a8b9c07d260622. Reverted https://github.com/pytorch/pytorch/pull/129996 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
PyTorch MergeBot	fa3953a2e1	Revert "[pt2-bench] fix accuracy failure for a few models (#129941 )" This reverts commit dafbd603ee6672d9592ec72b59300a2631f431d2. Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
PyTorch MergeBot	54da35a2e0	Revert "[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 )" This reverts commit 0af8c8a981e79b05767089e57e81262dbbf2b1b4. Reverted https://github.com/pytorch/pytorch/pull/130005 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
Yu, Guangye	57d05f2616	[RELAND] Add xpu to getAccelerator (#129205 ) # Motivation Add `xpu` support to `getAccelerator`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205 Approved by: https://github.com/albanD, https://github.com/gujinghui ghstack dependencies: #129463	2024-07-04 10:26:52 +00:00
Yanbo Liang	551f3b92b2	[Dynamo] Add assertion for tensor unpack shape mismatch (#130077 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130077 Approved by: https://github.com/Chillee	2024-07-04 09:25:08 +00:00
Yu, Guangye	f3962cfd9c	[RELAND] XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463 Approved by: https://github.com/gujinghui, https://github.com/albanD	2024-07-04 08:46:34 +00:00
Animesh Jain	fa4e489d70	[dynamo][dynamic-shapes] Graph break if out shape changes on out= variants (#130074 ) Fixes https://github.com/pytorch/pytorch/issues/130068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130074 Approved by: https://github.com/ezyang ghstack dependencies: #129913, #129914	2024-07-04 08:36:12 +00:00
Yan Zhiwei	e98587c58d	Update torch-xpu-ops pin (ATen XPU implementation) (#129353 ) 188 new ATen operators/variants are added in the pin update, involving eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests ported to enhance functionality coverage. Aligned source file directory structure with ATen native. Fixed corner case failures in aten::resize, aten::index_add and aten::index_put. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353 Approved by: https://github.com/EikanWang	2024-07-04 07:36:17 +00:00
titaiwangms	bffb278700	[ONNX] Add `artifacts_dir` to torch-onnx-patch in benchmark (#130069 ) Add `artifacts_dir` to torch-onnx-patch to save error report for debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069 Approved by: https://github.com/justinchuby	2024-07-04 07:11:02 +00:00
Li-Huai (Allan) Lin	d62d351107	[Optim][BE] Change str(device) to _get_device_type(device) (#129984 ) Prevent using vague expressions like `"cuda" in str(device)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129984 Approved by: https://github.com/janeyx99 ghstack dependencies: #129451, #129552	2024-07-04 06:44:48 +00:00
Li-Huai (Allan) Lin	42f3d7e948	[MPS] Add mps profiler env vars to docs (#129552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552 Approved by: https://github.com/malfet ghstack dependencies: #129451	2024-07-04 06:44:48 +00:00
cyy	07b06f0f0a	[2/N] Remove outdated CMake code (#130006 ) Follows #129851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130006 Approved by: https://github.com/drisspg	2024-07-04 06:24:22 +00:00
Jithun Nair	26be691e6b	Unify shard logic for inductor and dynamo test_config (#129508 ) Addresses https://github.com/pytorch/pytorch/pull/129480#issuecomment-2189954552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129508 Approved by: https://github.com/clee2000, https://github.com/huydhn	2024-07-04 06:04:29 +00:00
Anshul Sinha	9c9ac670a0	[dtensor][be] Reduced redundant LOC by creating functions to set up models used in example (#129613 ) Summary As the CommModeFeature example file grew, too many lines of code were repeated for setting up the models used. I created two functions, one to handle MLP and MLPStacked models and the other for transformer models. The output of the examples is unchanged. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613 Approved by: https://github.com/XilunWu ghstack dependencies: #129602	2024-07-04 06:00:58 +00:00
Anshul Sinha	0b9995c1ce	[dtensor][debug] Added forward and backward differentiation for module level tracing (#129602 ) Summary Previously, comm_mode only allowed users to differentiate between forward and backward passes at the operational level. I modified the code so that users can now see the collective counts for the passes at a module level. I slightly changed how the output is formatted, making it easier to differentiate between a collective count and an operation. I designed the operational trace table function so that, in the future, a user can use command-line arguments to choose the level of information to display instead of having two similar functions. Finally, I updated the output and test cases for the comm_mode example and test files. The expected output for the first 3 examples is shown below: <img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661"> <img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77"> <img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. 
pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602 Approved by: https://github.com/XilunWu, https://github.com/wz337	2024-07-04 06:00:58 +00:00
Peter Bell	e2e624a02f	[AOTAutograd] Micro-optimize runtime_wrapper (#128188 ) This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188 Approved by: https://github.com/bdhirsh	2024-07-04 03:53:06 +00:00
Animesh Jain	a7a7363be0	[dynamo] Skip side effect tracking for c wrappers/descriptors (#129914 ) Fixes PYTORCH_TEST_WITH_DYNAMO=1 pytest -vs test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass Pull Request resolved: https://github.com/pytorch/pytorch/pull/129914 Approved by: https://github.com/jansel ghstack dependencies: #129913	2024-07-04 03:14:45 +00:00
Animesh Jain	da8af685ac	[dynamo] Skip ID_MATCH guard on GetSetDescriptorType (#129913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129913 Approved by: https://github.com/jansel	2024-07-04 03:14:45 +00:00
Jiong Gong	8405ba21c1	[inductor][cpp] fix the vec conversion between float and int64 on AVX2 (#130013 ) Fixes https://github.com/pytorch/pytorch/issues/129863 AVX2 has no single instruction that converts between fp and int64, so the conversion has to be emulated. The original fast implementation (see https://stackoverflow.com/questions/41144668) assumes the data range is within [-2^51, 2^51]. The issue reported in https://github.com/pytorch/pytorch/issues/129863 has input data outside this range, which failed the test. This PR supports the full range of the conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130013 Approved by: https://github.com/lezcano	2024-07-04 03:01:49 +00:00
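The range restriction mentioned in #130013 can be illustrated with a scalar sketch of one well-known "magic number" conversion trick. This is an assumption about the general technique, written in plain Python with `struct`; it is not the vectorized AVX2 code from the PR, and the helper names are hypothetical.

```python
import struct

# 2**52 + 2**51: adding this to a double in [-2**51, 2**51] lands in the
# binade [2**52, 2**53), where the spacing between doubles is exactly 1.
MAGIC = 6755399441055744.0


def _bits(x: float) -> int:
    """Reinterpret the 64 bits of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def fast_double_to_int64(x: float) -> int:
    """Hypothetical helper: valid only for integer-valued x in [-2**51, 2**51]."""
    # After the addition, the low mantissa bits hold x as an offset from the
    # magic constant, so subtracting the constant's bit pattern recovers x.
    return _bits(x + MAGIC) - _bits(MAGIC)


print(fast_double_to_int64(3.0))    # 3
print(fast_double_to_int64(-3.0))   # -3
# Outside the assumed range the sum leaves the binade and the result is wrong.
print(fast_double_to_int64(2.0 ** 52) == 2 ** 52)  # False
```

The last line shows why inputs outside the assumed range need a different, full-range code path.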
cyy	99ec7bbee7	Force inconsistent-missing-override for torch targets (#130010 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130010 Approved by: https://github.com/ezyang	2024-07-04 02:37:57 +00:00
Shunting Zhang	0af8c8a981	[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 ) This model's accuracy test recently regressed. The debugging process to find the cause went quite smoothly, so I'm writing it down in case it is helpful. Clicking the model name beit_base_patch16_224 on the dashboard shows the pass status of the model over e.g. the past month. For this model, we can see that it started to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is that the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/ ``` gives us the list of Inductor PRs between the good and bad commits: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Looking roughly through the PRs, I suspected that ``` ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . After that, the accuracy test passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the graph change may cause the partitioner to make different recomputation decisions, which can change numerics. Since this is not a real issue, I'll raise the tolerance to make it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #129996, #129941	2024-07-04 01:14:29 +00:00
Shunting Zhang	dafbd603ee	[pt2-bench] fix accuracy failure for a few models (#129941 ) This PR batches the fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not due to a real issue. ## sebotnet33ts_256 The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I cannot repro locally, but from the dashboard log: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` From the error message on the dashboard: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` a tolerance of 0.02 should suppress this error. ## gluon_inception_v3 This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` From the error message on the dashboard ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` raising the tolerance should suppress this error. ## mobilenetv3_large_100 Fails in MA (max-autotune) mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. ## yolov3 Fails on the dashboard with the error ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fix it by using a larger multiplier for smaller tensors and raising the tolerance. ## timm_efficientdet Fails on the dashboard with the error ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I cannot repro locally with the command ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-04 01:14:29 +00:00
Shunting Zhang	51fa0bd436	[pt2-bench] pass acc test if ref is NaN (#129996 ) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` we can see that both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-04 01:14:29 +00:00
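The NaN case needs special handling because any ordering comparison against NaN is false, so a plain `res <= bound` check can never pass. A small sketch of the idea, assuming a hypothetical helper `rmse_passes` and a precomputed `bound` (neither is taken from the PyTorch source):

```python
import math


def rmse_passes(res_rmse: float, ref_rmse: float, bound: float) -> bool:
    """Hypothetical check: accept when both errors are NaN, otherwise
    compare the result error against the allowed bound."""
    # If eager (reference) and compiled (result) both produce NaN,
    # they agree with each other, so treat the comparison as a pass.
    if math.isnan(res_rmse) and math.isnan(ref_rmse):
        return True
    # NaN on only one side falls through and fails, since NaN <= x is False.
    return res_rmse <= bound


nan = float("nan")
print(rmse_passes(nan, nan, 0.001))
print(rmse_passes(nan, 0.0001, 0.001))
```

Only the both-NaN case is accepted here; a NaN on just one side still counts as a mismatch.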
drisspg	9108b74bbc	Updates to scaled_mm for rowwise scaling (#130059 ) # Summary This updates _scaled_mm's API to enforce that input scales are always 2 dimensional. This resolves ambiguity around scaling scheme Pull Request resolved: https://github.com/pytorch/pytorch/pull/130059 Approved by: https://github.com/vkuzo	2024-07-04 00:53:17 +00:00
Tristan Rice	cd70ac884f	c10d/Utils: better error message on 0 bytes (#130056 ) This improves the error messages on 0 bytes sent/received. We currently log it as a connection reset when it's caused by other reasons. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130056 Approved by: https://github.com/kurman, https://github.com/rsdcastro	2024-07-04 00:48:20 +00:00
cyy	efb73eda51	[2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878 ) Follows #128670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129878 Approved by: https://github.com/ezyang	2024-07-04 00:39:28 +00:00
Shangdi Yu	d95a019704	[export] construct empty graph when there's no tensor computation (#129541 ) Fixes [#127110](https://github.com/pytorch/pytorch/issues/127110). When input module does not contain any tensor computation, we would create a graph with inputs and outputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129541 Approved by: https://github.com/angelayi	2024-07-04 00:26:17 +00:00
Shangdi Yu	2fe7c1fe04	[custom ops] Support factory function (#129978 ) Fixes #129389 If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a "device: torch.device" argument. We switch on the device argument to select the correct backend to dispatch to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978 Approved by: https://github.com/zou3519	2024-07-04 00:10:52 +00:00
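With no Tensor inputs there is nothing to infer a backend from, so the device argument has to drive dispatch. The following is a plain-Python sketch of that idea only. The registry, the decorator `register_impl`, and the op `my_zeros` are all hypothetical names, and the code does not use or imitate the `torch.library` API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: device type -> implementation.
_IMPLS: Dict[str, Callable[[int], List[float]]] = {}


def register_impl(device_type: str):
    """Register a device-specific implementation (illustrative only)."""
    def decorator(fn):
        _IMPLS[device_type] = fn
        return fn
    return decorator


@register_impl("cpu")
def _zeros_cpu(n: int) -> List[float]:
    return [0.0] * n


@register_impl("cuda")
def _zeros_cuda(n: int) -> List[float]:
    return [0.0] * n  # stand-in for a real device kernel


def my_zeros(n: int, device: str) -> Tuple[str, List[float]]:
    """Factory function: no tensor inputs, so switch on `device`."""
    device_type = device.split(":")[0]  # "cuda:1" -> "cuda"
    if device_type not in _IMPLS:
        raise NotImplementedError(f"no implementation for {device_type!r}")
    return device_type, _IMPLS[device_type](n)


print(my_zeros(2, "cuda:1"))
```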
PyTorch MergeBot	779fc8119e	Revert "XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 )" This reverts commit 6353a12e6a80f06217645b10fb69cffeac08a8d0. Reverted https://github.com/pytorch/pytorch/pull/129463 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129463#issuecomment-2207529072))	2024-07-03 23:43:15 +00:00
PyTorch MergeBot	8a9725bedb	Revert "Add xpu to getAccelerator (#129205 )" This reverts commit 3e2df3ca9d0a593e09bc94c14bbf2b213413cbf3. Reverted https://github.com/pytorch/pytorch/pull/129205 on behalf of https://github.com/kit1980 due to Need to revert https://github.com/pytorch/pytorch/pull/129463 which breaks Meta builds ([comment](https://github.com/pytorch/pytorch/pull/129205#issuecomment-2207514346))	2024-07-03 23:37:24 +00:00
Jerry Zhang	a9a744e442	Change numeric_debug_handle to store per-node id (#129811 ) Summary: Previously we store edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack, but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional support for numerical debugging for inputs and willing to hack around to achieve this. This PR changes the structure of numeric_debug_handle to store unique_id for each node instead. e.g. graph: ``` node = op(input_node, weight_node) ``` Before: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3} ``` After: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1 ``` Test Plan: python test/test_quantization.py -k TestGenerateNumericDebugHandle Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811 Approved by: https://github.com/tarun292	2024-07-03 22:03:31 +00:00
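To make the before/after shapes concrete, here is a toy sketch that tags a node both ways. The `Node` class, the helper names, and the string value of `NUMERIC_DEBUG_HANDLE_KEY` are assumptions for illustration; only the two metadata layouts come from the commit message.

```python
import itertools

NUMERIC_DEBUG_HANDLE_KEY = "numeric_debug_handle"  # assumed key value


class Node:
    """Toy stand-in for a graph node (not torch.fx.Node)."""
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = tuple(inputs)
        self.meta = {}


def tag_per_edge(node, ids):
    # Before: one id per input edge plus one for the output.
    handle = {inp: next(ids) for inp in node.inputs}
    handle["output"] = next(ids)
    node.meta[NUMERIC_DEBUG_HANDLE_KEY] = handle


def tag_per_node(node, ids):
    # After: a single unique id for the node itself.
    node.meta[NUMERIC_DEBUG_HANDLE_KEY] = next(ids)


input_node, weight_node = Node("input"), Node("weight")
node = Node("op", [input_node, weight_node])

tag_per_edge(node, itertools.count(1))
before = node.meta[NUMERIC_DEBUG_HANDLE_KEY]
tag_per_node(node, itertools.count(1))
after = node.meta[NUMERIC_DEBUG_HANDLE_KEY]
print(len(before), after)
```

The per-node form is simpler to consume, at the cost of no longer distinguishing individual inputs, which is the trade-off the summary describes.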
Zain Rizvi	b0d0114f5b	Enable automigration for windows jobs (#129977 ) Enable Windows jobs to automatically use LF runners when the author is opted-in Pull Request resolved: https://github.com/pytorch/pytorch/pull/129977 Approved by: https://github.com/clee2000	2024-07-03 22:02:56 +00:00
Yukio Siraichi	a79bb8db91	Make `_embedding_bag_backward` explicitly dispatch to CPU and CUDA. (#129691 ) This PR modifies the `_embedding_bag_backward` item inside _native_functions.yaml_, so that it dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`. Context: PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not allow third-party backends (e.g. XLA) to modify their implementation, since this dispatch key has higher priority. When calling the `_embedding_bag_backward` operation using XLA, a dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors. Problem: `_embedding_bag_backward` has a `sparse` parameter that controls whether the operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does not support sparse tensors. In order to fall back to dense execution, i.e. change the flag at runtime, we need to be able to modify its implementation. Solution: we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA, which allowed us to introduce our own kernel for it. Additionally, this PR refactored the representation of its mode from constant integers into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode` and `int != EmbeddingBagMode`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691 Approved by: https://github.com/lezcano	2024-07-03 21:54:49 +00:00
rzou	7bbd6cf931	[custom_ops] Mark older custom ops prototypes as deprecated (#130032 ) I've had at least one person try to call APIs from here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130032 Approved by: https://github.com/yushangdi, https://github.com/williamwen42	2024-07-03 21:11:05 +00:00
Shivam Raikundalia	a21d4363d2	[Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973 ) Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP Test Plan: Ran resnet. Trace looks good https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces Reviewed By: aaronenyeshi, swolchok Differential Revision: D59132793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973 Approved by: https://github.com/aaronenyeshi	2024-07-03 19:28:52 +00:00
Zhengxu Chen	042d764872	[export] Update example inputs format for DB. (#129982 ) Summary: To give user a simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs. Test Plan: CI Differential Revision: D59288920 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982 Approved by: https://github.com/angelayi	2024-07-03 17:53:15 +00:00
Brian Hirsh	9b902b3ee3	AOTI: dont treat views of buffers as constants (#129688 ) More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below). Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first. I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison ) ``` import torch from torch.export import export from torch.export._trace import _export # Copyright (c) Meta Platforms, Inc. and affiliates. # All rights reserved. # # This source code is licensed under the BSD 3-Clause license found in the # LICENSE file in the root directory of this source tree. 
import copy import io import random import unittest import pytest import torch import torch.nn as nn import torch.nn.functional as F from float8_experimental.float8_dynamic_linear import Float8DynamicLinear from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear from float8_experimental.float8_tensor import Float8Tensor from float8_experimental.float8_utils import compute_error random.seed(0) torch.manual_seed(0) is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0) import torch.nn.utils.parametrize as parametrize # NOTE: we should upstream this directly into export and make it more automatic! class UnwrapTensorSubclass(torch.nn.Module): def forward(self, tensors): todo = list(tensors) for tp, meta, inner_tensors in reversed(self.rebuild_stack): nb_tensor = len(inner_tensors) inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])} todo = todo[nb_tensor:] rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None) todo.append(rebuilt) assert len(todo) == 1 return todo[0] def right_inverse(self, tensor): assert type(tensor) is not torch.Tensor rebuild_stack = [] plain_tensors = [] todo = [tensor] while todo: obj = todo.pop() inner_tensors, metadata = obj.__tensor_flatten__() rebuild_stack.append((type(obj), metadata, inner_tensors)) for attr_name in inner_tensors: val = getattr(obj, attr_name) if type(val) is torch.Tensor: plain_tensors.append(val) else: assert isinstance(val, torch.Tensor) todo.append(val) self.rebuild_stack = rebuild_stack return plain_tensors def unwrap_tensor_subclass(model, filter_fn=None): for name, child in model.named_children(): if ( isinstance(child, Float8DynamicLinear) and hasattr(child, "weight") and type(child.weight) is not torch.Tensor and isinstance(child.weight, torch.Tensor) ): parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass()) unwrap_tensor_subclass(child) return model class FeedForward(nn.Module): def __init__(self) -> 
None: super().__init__() self.w1 = nn.Linear(4096, 14336, bias=False) self.w3 = nn.Linear(4096, 14336, bias=False) self.w2 = nn.Linear(14336, 4096, bias=False) def forward(self, x: torch.Tensor) -> torch.Tensor: return self.w2(F.silu(self.w1(x)) * self.w3(x)) def reset_parameters(self): for m in self.modules(): if isinstance(m, nn.Linear): m.reset_parameters() export_model = FeedForward().to("cuda") swap_linear_with_float8_linear( export_model, Float8DynamicLinear, from_float_kwargs={"pre_quantize_weight": True}, ) export_model = unwrap_tensor_subclass(export_model) batch_size = 4 num_tokens = 1024 embedding_dim = 4096 input_tensor = torch.randn( batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32 ) example_args = (input_tensor,) # NOTE: this breaks unless we use strict=False, pre_dispatch=False! exported_program: torch.export.ExportedProgram = _export( export_model, example_args, strict=False, pre_dispatch=False, ) with torch.no_grad(): so_path = torch._inductor.aot_compile(exported_program.module(), example_args) print(so_path) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688 Approved by: https://github.com/eellison	2024-07-03 17:24:08 +00:00
Edward Z. Yang	35600bcaad	Print float with full precision, don't truncate (#130027 ) Fixes https://github.com/pytorch/pytorch/issues/119338 Exercised in https://github.com/pytorch/pytorch/pull/118448 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130027 Approved by: https://github.com/lezcano, https://github.com/Skylion007	2024-07-03 17:20:19 +00:00
chilli	01e41f1814	Modified autotuning for flex_attention to pass in (proper) fake inputs for the block sparse entries (#129915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129915 Approved by: https://github.com/yanboliang, https://github.com/eellison ghstack dependencies: #129846, #129950	2024-07-03 17:08:45 +00:00
chilli	e2eb33b089	Added methods to blockmask to visualize them (#129950 ) <img width="319" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/319b10f4-f6fe-4ff8-9529-d366ff411b95"> <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/27a8953a-3c50-4922-b5d0-4ea5630a133a"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129950 Approved by: https://github.com/yanboliang, https://github.com/drisspg ghstack dependencies: #129846	2024-07-03 17:08:45 +00:00
Edward Z. Yang	29c68df600	Stop immediately specializing common constants 0/1 for plain int (#128327 ) Fixes https://github.com/pytorch/pytorch/issues/128319 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327 Approved by: https://github.com/lezcano ghstack dependencies: #129983	2024-07-03 16:41:51 +00:00
James Wu	9e1e58e052	Support allowlisted modules and op overloads in AOTAutogradCache (#128329 ) Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329 Approved by: https://github.com/bdhirsh	2024-07-03 14:59:24 +00:00
Edward Z. Yang	64a04d2225	Make sparse empty constructors specialize instead of fail on symbolic inputs (#129983 ) Exercised in https://github.com/pytorch/pytorch/pull/128327 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129983 Approved by: https://github.com/anijain2305	2024-07-03 13:27:19 +00:00
Xuehai Pan	735044191f	[Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884 ) The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`. `5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)` Then the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when rerendering the default value to a string. ```python ", ".join(string.split(",")) ``` ```python # before def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ... # after def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884 Approved by: https://github.com/ezyang	2024-07-03 11:45:24 +00:00
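The split-and-join one-liner from this message can be tried on its own. The wrapper name `add_space_after_commas` is hypothetical, and the `strip()` call is an addition of this sketch (not part of the PR) so that input which already has a space after the comma does not end up with two.

```python
def add_space_after_commas(default: str) -> str:
    """Re-render a default value such as "(0,1)" with ", " separators."""
    # strip() is this sketch's own guard against doubled spaces.
    return ", ".join(part.strip() for part in default.split(","))


# The exact expression quoted in the commit message:
one_liner = ", ".join("(0,1)".split(","))

print(one_liner)
print(add_space_after_commas("(0,1)"))
print(add_space_after_commas("(0, 1)"))
```

For the schema default `(0,1)` both forms give the same rendering; they differ only on input that is already spaced.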
Huy Do	8f70bf7a94	Skip TestSDPAPrivateUse1Only on FBCODE (#129997 ) Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading PyTorch C++ extension requires Ninja which is not going to work with BUCK Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers` Differential Revision: D59304327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997 Approved by: https://github.com/drisspg	2024-07-03 06:48:51 +00:00
Valentine233	62b710782d	change LayoutLMForSequenceClassification inference accuracy tolerance (#129728 ) Fixes #128510. https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit the SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with BF16 inference single thread. This PR tends to increase the model tolerance and make the check pass. Note that even the math-version SDPA could have the issue because of some small implementation diff. The test log: Single thread ``` correct_result: SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None) new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000 E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits fail_accuracy ``` Multiple threads ``` correct_result: SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-03 06:28:27 +00:00
Jason Ansel	4fc9157e90	[halide-backend] Disable split reductions for Halide (#129320 ) In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129321	2024-07-03 05:56:40 +00:00
Jason Ansel	0abcca85b7	[halide-backend] Support manual schedules (#129321 ) Currently using this for some by-hand hacking, but might need to implement our own scheduler later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321 Approved by: https://github.com/shunting314	2024-07-03 05:56:40 +00:00
Edward Z. Yang	8af58f66bb	Fix typo in floordiv solver code that affects flipped relation (#129888 ) Fixes https://github.com/pytorch/pytorch/issues/123535 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888 Approved by: https://github.com/lezcano	2024-07-03 04:47:32 +00:00
Edward Z. Yang	424cd1e1df	Enable TORCH_TRACE by default on Conda on Mast (#129988 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129988 Approved by: https://github.com/kunalb	2024-07-03 03:35:45 +00:00
Catherine Lee	1026b0f687	Use setup-miniconda step from test-infra for llm retrieval workflow (#129720 ) Undo https://github.com/pytorch/pytorch/pull/129722 Use the setup-miniconda step written in test-infra to install miniconda in the llm retrieval workflow. It comes with a cache so we don't have to worry about hitting cache limits. The llm retrieval job was failing due to too many requests https://github.com/pytorch/pytorch/issues/129718#issue-2379260544 `2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129720 Approved by: https://github.com/PaliC, https://github.com/malfet, https://github.com/huydhn	2024-07-03 03:02:23 +00:00
chilli	31fc5b8966	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-03 02:34:03 +00:00
Tristan Rice	9ee8c18309	TCPStore: add ping to verify network connectivity on connect (#129985 ) This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent. This adds support for PING to both the libuv and legacy backend. Example error: ``` [trainer85612\|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer [trainer85612\|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first): ... [trainer85612\|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637 [trainer85612\|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868 [trainer85612\|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775 ``` Test plan: ``` python test/distributed/test_store.py -v ``` ``` tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py starting pool started 90000 started 30000 started 70000 started 20000 started 80000 started 60000 started 0 [W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. 
init 20000 set 20000 init 80000 set 80000 init 70000 set 70000 init 60000 set 60000 init 30000 set 30000 init 90000 set 90000 started 40000 init 40000 set 40000 started 50000 init 50000 set 50000 started 10000 init 10000 set 10000 init 0 set 0 run finished 617.2992351055145 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985 Approved by: https://github.com/rsdcastro, https://github.com/kurman	2024-07-03 02:09:44 +00:00
Catherine Lee	91a8376d47	run_test: Unset cpp stacktraces after reruns (#129004 ) Rerun the failing test singly with the env var set. If it succeeds, start a new process without the cpp stack traces env var. We don't want to waste time generating these if we don't have to. They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these. Adds new --rs (run single) to be used the same way --scs and --sc are. It will only run the single test in the step current file. https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs. In the above: * test_checkpoint_valid failed, then passed in another subprocess. The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free) * test_format_traceback_short failed consistently, but it continued to run because keep-going was set Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004 Approved by: https://github.com/PaliC	2024-07-03 01:50:15 +00:00
xinan.lin	c77c139878	[Intel Triton] Update Intel Triton to resolve installation issue on manylinux. (#129847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129847 Approved by: https://github.com/Skylion007, https://github.com/gujinghui, https://github.com/atalman ghstack dependencies: #129782	2024-07-03 01:46:32 +00:00
dilililiwhy	c686304277	Enable UFMT on test/test_public_bindings.py (#128389 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > test/test_public_bindings.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389 Approved by: https://github.com/malfet	2024-07-03 01:43:41 +00:00
xinan.lin	3b77b122c5	[Inductor UT] update rtol for convolution on XPU. (#129782 ) [Inductor UT] update rtol for convolution on XPU. Fix https://github.com/pytorch/pytorch/issues/129974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129782 Approved by: https://github.com/atalman	2024-07-03 01:37:16 +00:00
Shiyan Deng	1e27af335e	[easy] enhance local model loading (#129897 ) Summary: 1. add one more model lib dep. 2. add error message when torchscript failed to find a class in python compilation unit. Test Plan: CI Reviewed By: jingsh Differential Revision: D59243250 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897 Approved by: https://github.com/jingsh	2024-07-03 00:29:02 +00:00
Simon Fan	be2d79a16b	[dynamic] config to disable duck sizing (#129804 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129804 Approved by: https://github.com/ezyang	2024-07-03 00:20:54 +00:00
Yanbo Liang	111f9b5d44	[Dynamo] Add config to skip/inline torchrec (#129912 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912 Approved by: https://github.com/anijain2305	2024-07-03 00:14:51 +00:00
PyTorch MergeBot	89646ebb11	Revert "[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 )" This reverts commit 4b8a5e03745924c8f987dc072fa4d41f4cb6f103. Reverted https://github.com/pytorch/pytorch/pull/129680 on behalf of https://github.com/kit1980 due to breaking internal builds, see D59181183 ([comment](https://github.com/pytorch/pytorch/pull/129680#issuecomment-2204737227))	2024-07-03 00:03:50 +00:00
Peter Bell	921c116089	[inductor] Kill mark_node_as_mutating (#129346 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129346 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325, #129343, #129344	2024-07-02 23:50:07 +00:00
Peter Bell	b2ac8d2af3	[inductor] Use multiple outputs for flex-attention (#129344 ) This fixes the DCE issue for attention output Pull Request resolved: https://github.com/pytorch/pytorch/pull/129344 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325, #129343	2024-07-02 23:50:07 +00:00
Peter Bell	45844e0d4e	[inductor] Add FileCheck to flex attention epilogue test (#129343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325	2024-07-02 23:50:04 +00:00
Peter Bell	7955cd3e83	[inductor] Make UserDefinedTritonKernel a multi-output operation (#129325 ) Previously each mutation was represented by a `MutationOutput` operation which was a new scheduler node that must be scheduled immediately afterwards. Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325 Approved by: https://github.com/lezcano ghstack dependencies: #128893	2024-07-02 23:50:00 +00:00
Peter Bell	fb078c20c1	[inductor] Separate Buffer and Operation into two concepts (#128893 ) Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893 Approved by: https://github.com/lezcano	2024-07-02 23:49:57 +00:00
rzou	872d972e41	[custom_op] better error message on no returns (#129896 ) I run into this a lot. I can imagine that it would look opaque to users, so made it more friendly Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>." Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896 Approved by: https://github.com/yushangdi	2024-07-02 23:34:23 +00:00
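The old message exposes `inspect._empty`, which is what `inspect.signature` reports when a function has no return annotation. A minimal sketch of detecting that case and raising a clearer error follows. The helper name `require_return_annotation` and the wording of the new message are assumptions, not the text used by `infer_schema`.

```python
import inspect


def require_return_annotation(func):
    """Return func's return annotation, or raise a readable error."""
    annotation = inspect.signature(func).return_annotation
    # Signature.empty is the sentinel printed as <class 'inspect._empty'>.
    if annotation is inspect.Signature.empty:
        raise ValueError(
            f"{func.__name__}: missing return type annotation; "
            "add one such as '-> None' (illustrative wording)"
        )
    return annotation


def annotated(x: int) -> None:
    return None


def unannotated(x: int):
    return x


print(require_return_annotation(annotated))
```

Note that `-> None` is a real annotation and is distinct from having no annotation at all, which is why the sentinel comparison uses `is`.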
Shangdi Yu	aa0352ca38	[custom ops] add default value support for device types (#129792 ) Fixes #129371 I think the first case in Issue #129371 is already supported in the current code? Since it takes care of string default values. This PR adds support for device type default values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129792 Approved by: https://github.com/zou3519	2024-07-02 23:31:29 +00:00
Edward Z. Yang	d7680a564b	Bug fixes for disabling 0/1 specialization on plain int (#129961 ) These bug fixes will be exercised in https://github.com/pytorch/pytorch/pull/128327 but I separate them from the actual policy change (which is more risky) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129961 Approved by: https://github.com/lezcano	2024-07-02 23:19:48 +00:00
eqy	29ffa20bb1	[CUDA] Bump tolerances for `test_grad_pca_lowrank` (#129902 ) The revert of #127199 seems to surface an additional failure on A100, so this is a small tolerance bump to account for it. I did find what appears to be a race condition in one of the kernels used in this workload but I'm not sure it's related here... CC @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902 Approved by: https://github.com/ezyang	2024-07-02 23:17:02 +00:00
PyTorch MergeBot	b5fdbc1a9f	Revert "[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 )" This reverts commit ec789a3c9ddd4e550b3dea6934ce2d41deb98784. Reverted https://github.com/pytorch/pytorch/pull/129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 `ec789a3c9d`. You can see the error on the PR, but Dr. CI classified it wrong ([comment](https://github.com/pytorch/pytorch/pull/129369#issuecomment-2204568418))	2024-07-02 22:30:53 +00:00
Sheng Fu	b6f781e433	Bug fix for capturing execution trace grid function (#129832 ) Summary: The grid function takes a variable number of arguments: one, two, or three numbers. The current implementation captured them as a single tuple, for example "grid((16,))". The fix is to capture them as a variable number of elements, so the previous example becomes "grid(16,)". PARAM et-replay code will be modified to reflect this change in a follow-up diff. Test Plan: buck2 test mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2 Differential Revision: D59195933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832 Approved by: https://github.com/Skylion007, https://github.com/davidberard98	2024-07-02 22:23:57 +00:00
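The difference between the two captured strings can be reproduced with ordinary tuple formatting. This is only an illustration of the string shapes named in the summary. The function names are hypothetical and the code is not the execution-trace implementation.

```python
def render_grid_nested(*grid: int) -> str:
    """Old shape: the whole tuple is passed as one argument."""
    return f"grid({grid})"


def render_grid_flat(*grid: int) -> str:
    """New shape: the tuple's elements are the arguments themselves."""
    if not 1 <= len(grid) <= 3:
        raise ValueError("grid takes one, two, or three numbers")
    return f"grid{grid}"


print(render_grid_nested(16))
print(render_grid_flat(16))
print(render_grid_flat(16, 2, 1))
```

The trailing comma in the one-element case comes from Python's tuple repr, which is consistent with the "grid(16,)" form quoted in the message.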
Colin Peppler	39357ba06f	[dynamo] don't constrain range on the replacement for a symbol (#129907 ) # Error ``` File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range constrain_range(size, min=min, max=max) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range a.node.shape_env._constrain_range(a.node.expr, min, max) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper return fn(*args, **kwargs) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}" torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'> ``` # Context I ran into the following scenario: ``` getitem = ... sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1 _check_is_size = torch._check_is_size(sym_size_int) # we fail at this guy sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234) # runtime assertion add = sym_size_int + sym_size_int_1 eq = add == sym_size_int _assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'") ``` Everything but getitem was inserted into the FX graph by insert_deferred_runtime_asserts() `7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)` In the above scenario, we fail trying to constrain the range on `s0 + s1`, which is not a `sympy.Symbol`. And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`. # Approach Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it. 
In the scenario above, we'll be okay to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, no need to constrain the range on `s1 + s1`. We can confirm this with `TORCH_LOGS="+dynamic"`. ``` torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update) torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198] ``` `600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)` Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907 Approved by: https://github.com/jingsh	2024-07-02 21:46:40 +00:00
PyTorch MergeBot	c22e66896f	Revert "Fix typo in floordiv solver code that affects flipped relation (#129888 )" This reverts commit 3c6c3b94486d49614bae5e76e7bd6b9579f643d4. Reverted https://github.com/pytorch/pytorch/pull/129888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the updated test starts to fail flakily in trunk somehow, so I am reverting the change to see if it helps ([comment](https://github.com/pytorch/pytorch/pull/129888#issuecomment-2204442653))	2024-07-02 21:16:59 +00:00
wz337	1ddb100318	[FSDP1][Easy] Remove Spammy Log Line in _runtime_utils.py (#129967 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129967 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/Skylion007	2024-07-02 21:08:57 +00:00
PyTorch UpdateBot	deefc10dd3	[executorch hash update] update the pinned executorch hash (#129428 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129428 Approved by: https://github.com/pytorchbot	2024-07-02 20:39:39 +00:00
cyy	26de2c2487	[3/N] Enable clang-tidy on torch/csrc/jit/serialization/* (#129850 ) Follows #129300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129850 Approved by: https://github.com/ezyang	2024-07-02 20:08:48 +00:00
Li-Huai (Allan) Lin	8ec5ba960f	[MPS] Add tensor_lr overloads to fused adam & adamw (#129451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451 Approved by: https://github.com/janeyx99	2024-07-02 19:46:30 +00:00
Edward Z. Yang	2631a96f2a	Stop updating hints (#129893 ) Some profiling suggests that the repeated maybe evaluate static calls are expensive. Ref: https://github.com/pytorch/pytorch/issues/123964 With test script: ``` import torch import torch._dynamo.config torch._dynamo.config.capture_scalar_outputs = True @torch.compile(fullgraph=True) def f(a, b): xs = b.tolist() for x in xs: torch._check_is_size(x) torch._check(x <= 20) return a.split(xs) N = 20 splits = torch.randint(10, (N,)) sz = splits.sum().item() f(torch.randn(sz), splits) ``` Before: ``` real 0m18.526s user 0m16.555s sys 0m11.031s ``` After: ``` real 0m13.831s user 0m12.152s sys 0m10.941s ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893 Approved by: https://github.com/lezcano	2024-07-02 19:24:33 +00:00
Anshul Sinha	1f6c1fcd36	[dtensor][debug] add operation tracing to comm_mode (#129017 ) Summary I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule, making it easier for users to debug. The tracing now includes the input shape and sharding of each operation's DTensor arguments. As with the module collective tracing, the user also has the option to log the tracing table to an output.txt file. I have decided not to include the example output for transformer as it is too many lines. The expected output for the MLP_operation_tracing is shown below: <img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786"> <img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017 Approved by: https://github.com/XilunWu	2024-07-02 19:05:05 +00:00
Huy Do	bf05ea2bab	Re-generate Linux build workflows after #124014 (#129976 ) This looks like a landrace as lint passed on #124014 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129976 Approved by: https://github.com/kit1980	2024-07-02 18:57:20 +00:00
Yanbo Liang	080149cb38	[Inductor][FlexAttention] Add helper functions of converting score_mod to block_mask (#129909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129909 Approved by: https://github.com/Chillee, https://github.com/drisspg ghstack dependencies: #129831, #129859	2024-07-02 18:48:16 +00:00
Yanbo Liang	1f3e2d7877	[Inductor] Rename TemplatedAttention to FlexAttention (#129859 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129859 Approved by: https://github.com/Chillee, https://github.com/drisspg ghstack dependencies: #129831	2024-07-02 18:48:16 +00:00
Michael Lazos	aa7ea6b45c	Add wraps back (#129933 ) Fixes https://github.com/pytorch/pytorch/issues/129922 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-07-02 18:24:02 +00:00
Howard Huang	ec789a3c9d	[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 ) # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369 Approved by: https://github.com/wconstab ghstack dependencies: #129368	2024-07-02 18:19:28 +00:00
Howard Huang	4eb449f7dc	[pipelining] add small logging section to docs (#129368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368 Approved by: https://github.com/wconstab	2024-07-02 18:19:28 +00:00
Yanbo Liang	34e94c507a	[Inductor] Make FlexAttention block_mask argument as tuple (#129831 ) Re-organize ```block_mask``` related arguments a tuple to reduce the individual argument number. I was trying to use named tuple, but aot autograd doesn't work well with named tuple. The only downside of using tuple rather than named tuple is we need to use index to access its element. But we only need this at one place, it should be fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-07-02 17:18:33 +00:00
Animesh Jain	9105d54c6b	[dynamo][sparse] Graph break on sparse tensors (#129883 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129883 Approved by: https://github.com/ezyang ghstack dependencies: #129830, #129858, #129857, #129881	2024-07-02 16:51:56 +00:00
Animesh Jain	75443d3daf	[dynamic-shapes] Dont create symbol if .item() is a nan (#129881 ) Passes ` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/torch_np/numpy_tests/lib/test_function_base.py::TestInterp::test_scalar_interpolation_point` in the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129881 Approved by: https://github.com/ezyang, https://github.com/zou3519 ghstack dependencies: #129830, #129858, #129857	2024-07-02 16:51:56 +00:00
Nikita Shulga	d146a62e77	[MPS][BE] Introduce `mtl_setBytes` (#129910 ) Which for primitive types calls `[encoder setBytes:&val length:sizeof(val) index:idx];` and for container types passes number of elements equal to the size of the container Pull Request resolved: https://github.com/pytorch/pytorch/pull/129910 Approved by: https://github.com/Skylion007	2024-07-02 16:36:57 +00:00
Shangdi Yu	9fb2dec7a6	[custom ops] Add unknown arg (#129614 ) Fixes #129372 Add a mutated_args="unknown" that pessimistically assumes that all inputs to the operator are being mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129614 Approved by: https://github.com/zou3519	2024-07-02 16:10:14 +00:00
Tijmen Blankevoort	e3b3431c42	Fix for HistogramObserver (#129387 ) Summary: There were two problems with the HistogramObserver: 1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point. 2. The Histogram doesn't seem to actually update if the range of the new x falls within the old one These issues were both fixed. On top of this, I greatly simplified the logic for the histogram updating. Now, it doesn't do the downsampling anymore, which saves a ton of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, I reduced the default for this. Also the code is cleaner now, much easier to follow what's happening. test_histogram_observer_same_inputs was likely wrong - If I pass 0s and 1s to my histogramobserver, I want them to actually count! The current test now thinks it's good to discard and ignore these values. Test Plan: You can run the included tests. Differential Revision: D58931336 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387 Approved by: https://github.com/jerryzh168	2024-07-02 15:41:44 +00:00
PyTorch MergeBot	03440a1c13	Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846 )" This reverts commit badc638eb68c0b07ae3b857e885e6d0137b218aa. Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))	2024-07-02 15:25:34 +00:00
Aart Bik	3fd128361e	[traced-graph][sparse] add relay override for layout_impl (#129930 ) In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found: ``` Layout layout() const { ... if .. { ... } else if (is_sparse_compressed()) { // Typically, the tensor dispatch keys define the tensor layout // uniquely. This allows using non-virtual layout method for // better performance. However, when tensor's layout depends, // say, on tensor attributes, one must use this execution path // where the corresponding tensor impl class overwrites virtual // layout_impl() method. return layout_impl(); } else { ... } } ``` However, this override was never implemented. This PR put the override in place, to prepare for sparsity propagation in another PR. https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930 Approved by: https://github.com/ezyang	2024-07-02 15:24:34 +00:00
Edward Z. Yang	dacc33d2fa	Make sym_min/sym_max handle Numpy scalars (#129917 ) Internal xref: https://fb.workplace.com/groups/1069285536500339/posts/7773876449374514/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129917 Approved by: https://github.com/Skylion007	2024-07-02 14:59:20 +00:00
Xuehai Pan	f1df13f023	[BE][Easy] Fix `PYI001`: unprefixed-type-param in `torch/utils/data/datapipes` (#129885 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129885 Approved by: https://github.com/ezyang	2024-07-02 14:56:27 +00:00
Joel Schlosser	257b9c7936	Fix layout for _like() factories on NJTs (#129879 ) Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here: `433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)` This PR explicitly sets `layout=torch.strided` when redispatching `_like()` factories on `values`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879 Approved by: https://github.com/soulitzer	2024-07-02 14:51:23 +00:00
Aaron Gokaslan	6c2a8b6b38	[Ez][BE]: Enable new stable ruff rules (#129825 ) Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825 Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet	2024-07-02 14:47:10 +00:00
Xu Han	2926655761	[inductor] optimize cpp builder configuration code (#129577 ) Changes: 1. Combine the ISA selection condition dispatch code. 2. Unify the macOS OpenMP configuration code. 3. Clean up unused code. Co-authored-by: Jason Ansel <jansel@jansel.net> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-02 14:41:59 +00:00
Aaron Gokaslan	6cb0ad3375	[BE]: Update NCCL submodule to 2.21.5 (#124014 ) Update NCCL to the latest version. This release is mostly bugfixes with a few new minor features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124014 Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman	2024-07-02 14:39:33 +00:00
Peter Bell	dc75ec252a	[inductor] Fix can_merge check for expr=q0*q1 (#129806 ) Fixes #111884 In the minimised reproducer, we have a loop with the index expression `-q0*q1` for which in the merge tester we get: ``` expr1 = - 0 * (_merge_tester * 16) = 0 expr2 = - _merge_tester * 0 = 0 ``` so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0*q1` is always zero! Here I change the test so we have at least one case where no zeros are substituted so we can catch this situation. In the normal strided case we get e.g. ``` expr = 16 q0 + q1 expr1 = 16 * _merge_tester2 + (16 * _merge_tester1) expr2 = 16 * (_merge_tester2 + _merge_tester1) ``` which are still equivalent expressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806 Approved by: https://github.com/lezcano	2024-07-02 14:30:02 +00:00
leslie-fang-intel	37e3c60897	[Inductor][CPP] Remove redundant INT8-specific logic in the INT8 GEMM template (#129470 ) Summary Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with FP32/BF16/FP16 GEMM Template. Test Plan ``` numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129470 Approved by: https://github.com/jgong5 ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221	2024-07-02 13:15:15 +00:00
leslie-fang-intel	b6379591a9	[Inductor][CPP] Pass weight dtype explicitly for cpp gemm template (#129221 ) Summary This PR mainly refactor 2 things: 1. Passing in weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered. 2. Add an util function to get the output data type and compute data type from input data type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048, #129049, #129103, #129220	2024-07-02 13:06:32 +00:00
leslie-fang-intel	72fa864098	[Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM (#129220 ) Summary Add the AMX micro gemm kernel with int8 data type. Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [✓] Binary Fusion - [✓] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129220 Approved by: https://github.com/jgong5 ghstack dependencies: #128825, #129048, #129049, #129103	2024-07-02 12:53:35 +00:00
leslie-fang-intel	a796358330	[Inductor][CPP] Enable Quantized Linear GEMM Template with Binary Fusion (#129103 ) Summary Based on previous PR, add the config to support quantized linear binary - optional(unary) post op fusion. - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16/uint8 - Post Op Fusion: with binary and optional[Unary] post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [✓] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129103 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048, #129049	2024-07-02 12:45:10 +00:00
leslie-fang-intel	86e2d16ba0	[Inductor][Quant] Change the schema of QLinear Binary (#129049 ) Summary We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template. - Extra input of binary post-op is a tensor which needs to be an input node of autotuning, we need to move it at front of `output_scale` which is a scalar. - We also move it at front of `bias`, since `bias` is optional tensor for this fusion, but `other` is a must to have for linear binary fusion. Test Plan ``` python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048	2024-07-02 12:36:38 +00:00
PyTorch MergeBot	07450e9713	Revert "[MPS] Add support for autocast in MPS (#99272 )" This reverts commit 6240cfd5c751bea6ca91dc765085e1d871b22345. Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))	2024-07-02 12:29:51 +00:00
Fuzzkatt	0441173ab2	Add slowTest marker to test_linalg_solve_triangular_large (#129903 ) In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903 Approved by: https://github.com/lezcano	2024-07-02 12:27:12 +00:00
Jack Taylor	95a5958db4	[ROCm] Update nightly triton-rocm pin to release branch (#129361 ) Update pin to tip of https://github.com/triton-lang/triton/commits/release/3.0.x/ following upstream strategy here https://github.com/pytorch/pytorch/pull/126098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129361 Approved by: https://github.com/peterbell10	2024-07-02 11:49:52 +00:00
Edward Z. Yang	3c6c3b9448	Fix typo in floordiv solver code that affects flipped relation (#129888 ) Fixes https://github.com/pytorch/pytorch/issues/123535 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888 Approved by: https://github.com/lezcano	2024-07-02 11:15:03 +00:00
Edward Z. Yang	8ef8240172	Don't mark conversion to float as is_integer = False (#129890 ) Zero is an integer, so if you say is_integer = False, you are also saying the result cannot be zero, which is undesirable. This is exercised by next PR in the stack. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129890 Approved by: https://github.com/lezcano	2024-07-02 11:08:09 +00:00
Edward Z. Yang	eb1ff76f23	Make are_strides_like_channels_last size oblivious (#129677 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129677 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #129869	2024-07-02 11:05:20 +00:00
Edward Z. Yang	ebeeb22669	Correctly put mark_unbacked symbols in shape_env_to_source_to_symbol_cache (#129869 ) Internal xref: https://www.internalfb.com/intern/anp/view/?source=version_selector&id=5534845 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129869 Approved by: https://github.com/albanD	2024-07-02 11:05:20 +00:00
Xu Han	567dd1a3ca	[inductor] unify toolchain code. (#129816 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and it is a continuation of https://github.com/pytorch/pytorch/pull/129789 Changes: 1. Unify the cpp builder's toolchain code. 2. Move all build-related code to `cpp_builder.py`. 3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816 Approved by: https://github.com/jansel	2024-07-02 09:52:06 +00:00
chilli	badc638eb6	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-02 09:31:38 +00:00
awayzjj	ccc4ee7793	check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839 ) Fixes https://github.com/pytorch/pytorch/issues/127043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839 Approved by: https://github.com/lezcano	2024-07-02 09:20:49 +00:00
Jeff Willette	5c9d5272e4	fixes #124582 (#128483 ) added check for existence of outputs requiring grad to make_graphed_callables. added new test case, updated existing test case to include parameterless modules. Fixes #124582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-02 08:45:59 +00:00
Haoci Zhang	1ad683033b	Implemented flexible PP schedule (#129597 ) Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have num_rounds = max(1, n_microbatches // pp_group_size) and it works as long as n_microbatches % num_rounds is 0. As a few examples, support pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0. pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0. Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129) Tested using the config in (1), schedule looks like the following graph: ``` =========== ALL_RANK_ACTIONS =========== Rank 0 Rank 1 Rank 2 Rank 3 Step 00: F0_s0 None None None Step 01: F1_s0 F0_s1 None None Step 02: F2_s0 F1_s1 F0_s2 None Step 03: F3_s0 F2_s1 F1_s2 F0_s3 Step 04: F4_s0 F3_s1 F2_s2 F1_s3 Step 05: F0_s4 F4_s1 F3_s2 F2_s3 Step 06: F1_s4 F0_s5 F4_s2 F3_s3 Step 07: F2_s4 F1_s5 F0_s6 F4_s3 Step 08: F3_s4 F2_s5 F1_s6 F0_s7 Step 09: F4_s4 F3_s5 None B0_s7 Step 10: F5_s0 None F2_s6 F1_s7 Step 11: None None B0_s6 B1_s7 Step 12: None F4_s5 F3_s6 F2_s7 Step 13: None B0_s5 B1_s6 B2_s7 Step 14: F6_s0 F5_s1 F4_s6 F3_s7 Step 15: B0_s4 B1_s5 B2_s6 B3_s7 Step 16: F7_s0 F6_s1 F5_s2 F4_s7 Step 17: B1_s4 B2_s5 B3_s6 B4_s7 Step 18: F8_s0 F7_s1 F6_s2 F5_s3 Step 19: B2_s4 B3_s5 B4_s6 B0_s3 Step 20: F9_s0 F8_s1 F7_s2 F6_s3 Step 21: B3_s4 B4_s5 B0_s2 B1_s3 Step 22: F5_s4 F9_s1 F8_s2 F7_s3 Step 23: B4_s4 B0_s1 B1_s2 B2_s3 Step 24: F6_s4 F5_s5 F9_s2 F8_s3 Step 25: B0_s0 B1_s1 B2_s2 B3_s3 Step 26: F7_s4 F6_s5 F5_s6 F9_s3 Step 27: B1_s0 B2_s1 B3_s2 B4_s3 Step 28: F8_s4 F7_s5 F6_s6 F5_s7 Step 29: B2_s0 B3_s1 B4_s2 B5_s7 Step 30: F9_s4 F8_s5 F7_s6 F6_s7 Step 31: B3_s0 B4_s1 B5_s6 B6_s7 Step 32: None F9_s5 F8_s6 F7_s7 Step 33: B4_s0 B5_s5 B6_s6 B7_s7 Step 34: None None F9_s6 F8_s7 Step 35: B5_s4 B6_s5 B7_s6 B8_s7 Step 36: None None None F9_s7 Step 37: B6_s4 B7_s5 B8_s6 B9_s7 Step 38: None None None None Step 39: B7_s4 
B8_s5 B9_s6 B5_s3 Step 40: None None None None Step 41: B8_s4 B9_s5 B5_s2 B6_s3 Step 42: None None None None Step 43: B9_s4 B5_s1 B6_s2 B7_s3 Step 44: None None None None Step 45: B5_s0 B6_s1 B7_s2 B8_s3 Step 46: None None None None Step 47: B6_s0 B7_s1 B8_s2 B9_s3 Step 48: None None None Step 49: B7_s0 B8_s1 B9_s2 Step 50: None None Step 51: B8_s0 B9_s1 Step 52: None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597 Approved by: https://github.com/H-Huang	2024-07-02 07:54:38 +00:00
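The round-count rule quoted in the flexible PP schedule commit above (#129597) can be sketched as follows. This is a minimal illustration that assumes only the two formulas stated in the commit message; the helper name `flex_pp_num_rounds` is hypothetical and is not a function from `torch.distributed.pipelining`.

```python
def flex_pp_num_rounds(n_microbatches: int, pp_group_size: int) -> int:
    """Hypothetical helper: compute num_rounds as described in the commit message.

    num_rounds = max(1, n_microbatches // pp_group_size), and the schedule is
    usable only when n_microbatches % num_rounds == 0.
    """
    if n_microbatches < 1 or pp_group_size < 1:
        raise ValueError("n_microbatches and pp_group_size must be positive")
    num_rounds = max(1, n_microbatches // pp_group_size)
    if n_microbatches % num_rounds != 0:
        raise ValueError(
            f"n_microbatches={n_microbatches} is not divisible by "
            f"num_rounds={num_rounds}"
        )
    return num_rounds


# The two examples from the commit message.
print(flex_pp_num_rounds(10, 4))  # 10 // 4 = 2, and 10 % 2 == 0
print(flex_pp_num_rounds(3, 4))   # 3 // 4 = 0, clamped to 1, and 3 % 1 == 0
```

Under this rule a configuration such as `n_microbatches = 11`, `pp_group_size = 4` would be rejected, because `num_rounds` is 2 and 11 is odd.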
Yu, Guangye	3e2df3ca9d	Add xpu to getAccelerator (#129205 ) # Motivation Add `xpu` support to `getAccelerator`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205 Approved by: https://github.com/albanD, https://github.com/gujinghui ghstack dependencies: #129463	2024-07-02 06:48:24 +00:00
Yu, Guangye	6353a12e6a	XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463 Approved by: https://github.com/gujinghui, https://github.com/albanD	2024-07-02 06:48:24 +00:00
Xu Han	76259ebfdd	[inductor] split cpu vec isa to dedicated file (keep git history) (#129789 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1 Changes: 1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`. <img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92"> 2. Make `cpu_vec_isa.py` the dedicated file for CPU vec ISA. It is also easy to extend to more architectures and vec ISAs. 3. Update code for the above changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-02 05:29:05 +00:00
Jovian Anthony Jaison	f6edd1f7c9	[BE] Make ActivationWrapper an abstract class (#129808 ) Fixes #95481 Test Plan: Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got TypeError as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129808 Approved by: https://github.com/Skylion007	2024-07-02 04:29:43 +00:00
PyTorch MergeBot	c2d0b7b96d	Revert "[ROCm] std::clamp work-around for hip-clang compiler (#127812 )" This reverts commit 8c2c3a03fb87c3568a22362d83b00d82b9fb3db2. Reverted https://github.com/pytorch/pytorch/pull/127812 on behalf of https://github.com/ezyang due to windows trunk job failing ([comment](https://github.com/pytorch/pytorch/pull/127812#issuecomment-2201653245))	2024-07-02 01:52:31 +00:00
Kulin Seth	6240cfd5c7	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet	2024-07-02 01:49:52 +00:00
Howard Huang	600bf978ba	[Pipelining] Add to/from CSV format and improved __repr__ (#129264 ) _Action.__repr__ gets rearranged so it doesn't require an underscore or a 's' prefix, but still keeps multi-digit stage and microbatch indices separated by an alpha character indicating the action type. to/from CSV methods allow dumping a generated schedule to CSV format for offline visualization or manual editing in a spreadsheet and reloading to use at runtime. Co-authored-by: Howard Huang <howardhuang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264 Approved by: https://github.com/H-Huang	2024-07-02 01:26:23 +00:00
wz337	83e6ec2ccd	[FSDP2+TP] Disable 2D state_dict (#129519 ) Fixes #ISSUE_NUMBER Gonna fill in the RFC but just want to run CI to see if anything else breaks. Test: ``` python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519 Approved by: https://github.com/awgu	2024-07-02 01:26:14 +00:00
cyy	46366888d7	Remove outdated CMake code (#129851 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851 Approved by: https://github.com/ezyang	2024-07-02 00:40:37 +00:00
Nikita Shulga	7e4329c258	[EZ][BE] Bump min cmake version to 3.18 (#129906 ) As this is a min CMake version supported by top level PyTorch Hides ``` CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required): Compatibility with CMake < 3.5 will be removed from a future version of CMake. Update the VERSION argument <min> value or use a ...<max> suffix to tell CMake that the project does not need compatibility with older versions. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906 Approved by: https://github.com/kit1980	2024-07-01 23:06:49 +00:00
Zain Rizvi	9645eaaaec	[BE] Improve logging for runner-determinator (#129679 ) This lets us be more flexible about what data we output and throwing exceptions. It's also less likely to break when others make changes (e.g. any print statement would have broken this code before since the printed output was expected to only be a json) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129679 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt, https://github.com/Skylion007	2024-07-01 22:31:35 +00:00
soulitzer	eeef68671d	[autograd] Do not detach when unpacking tensors that do not require grad (#127959 ) In this PR: - Ensure that if a tensor not requiring grad is saved for backward unpacking does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad). - Update non-reentrant checkpoint to also no longer detach for this case. Alternatives: - For custom autograd Function, you could directly save on ctx to work around this, but that would not work for when we switch to using custom ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959 Approved by: https://github.com/YuqingJ ghstack dependencies: #125795, #128545, #129262	2024-07-01 21:57:36 +00:00
Jithun Nair	87693b534c	[ROCm] Use AOTriton as a dynamic library (#129094 ) This PR enables using AOTriton as a shared library dependency instead of a static one. Resolves the issue of linker errors when trying to build PyTorch for a lot of (>7 or so) gfx archs due to huge size of aotriton static library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094 Approved by: https://github.com/malfet	2024-07-01 21:39:27 +00:00
Jeff Daily	8c2c3a03fb	[ROCm] std::clamp work-around for hip-clang compiler (#127812 ) Fixes #127666. Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work-around this by replacing std::clamp with min and max for USE_ROCM builds. Patch comes from @lamikr. Modified to use #ifndef USE_ROCM. https://github.com/lamikr/rocm_sdk_builder/pull/37 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 21:00:33 +00:00
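The `std::clamp` work-around in the commit above (#127812) replaces the call with `min` and `max`. The identity it relies on can be sketched as follows. This is an illustration in Python rather than the actual C++ patch, and the function name `clamp_via_min_max` is hypothetical.

```python
def clamp_via_min_max(value, lo, hi):
    """Clamp value into [lo, hi] using only min and max.

    Mirrors the idea of replacing std::clamp(value, lo, hi) with
    min(max(value, lo), hi); like std::clamp, it assumes lo <= hi.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    return min(max(value, lo), hi)


print(clamp_via_min_max(-3, 0, 10))  # below the range: returns lo
print(clamp_via_min_max(5, 0, 10))   # inside the range: returned unchanged
print(clamp_via_min_max(42, 0, 10))  # above the range: returns hi
```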
Andres Lugo-Reyes	750c701e49	[ROCm] Update xlogy comment detailing issue (#128151 ) update skip reason comment with more accurate descriptor Pull Request resolved: https://github.com/pytorch/pytorch/pull/128151 Approved by: https://github.com/zou3519	2024-07-01 20:58:58 +00:00
Animesh Jain	78cda9a810	[symbolic-shapes] Add FloatPow in the symbolic shape guard closure (#129857 ) Fixes test failure raised in the next diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129857 Approved by: https://github.com/ezyang ghstack dependencies: #129830, #129858	2024-07-01 20:44:59 +00:00
Animesh Jain	53d67165c0	[dynamo] Skip FUNCTION_MATCH guards for descriptors (#129858 ) Hard to write tests. This PR makes many test pass in the stack such as `PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858 Approved by: https://github.com/mlazos ghstack dependencies: #129830	2024-07-01 20:44:59 +00:00
Jithun Nair	f86dbae247	Fix typo in lxml requirement (#129695 ) Extra period at the end throws off pip: ``` root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier) lxml==5.0.0. ~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt) ``` Not sure why CI docker builds do not have an issue with this period. Typo comes from `f73b1b9388` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695 Approved by: https://github.com/huydhn	2024-07-01 19:43:37 +00:00
Huy Do	fdd0a7f9b4	Run test_mps_allocator_module serially (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-07-01 18:44:48 +00:00
PyTorch MergeBot	b02186ffc1	Revert "Allow get attributes on DDP similar to FSDP (#128620 )" This reverts commit 065c386990dce444db17eff7b254bf79e82450ef. Reverted https://github.com/pytorch/pytorch/pull/128620 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if the trunk error on inductor is fixed ([comment](https://github.com/pytorch/pytorch/pull/128620#issuecomment-2200717876))	2024-07-01 17:57:00 +00:00
Hao Dong	bb0f3df562	Fix index issues in torch.fx.interpreter (#129527 ) Summary: Fix index issues in torch.fx.interpreter by changing range from `[:i]` to `[:i+1]`. Because if there are `n` elements, the last index `i` of the `for` loop is `n-1` and `[:i]` can only get access to elements from index `0` to index `n-2` and miss the last element. `[:i+1]` can get access to all elements correctly. Test Plan: Test with Node API Differential Revision: D59028395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527 Approved by: https://github.com/dulinriley	2024-07-01 17:46:13 +00:00
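The off-by-one described in the commit above can be seen with plain list slicing. This is a minimal sketch rather than the `torch.fx.interpreter` code itself; the list `nodes` and the variable names are illustrative assumptions.

```python
# Stand-in for a sequence of n elements visited by a for loop.
nodes = ["placeholder", "call_function", "call_method", "output"]

# The last loop index is n - 1.
i = len(nodes) - 1

# [:i] stops before index i, so it covers indices 0 .. n-2 and misses the last element.
exclusive = nodes[:i]

# [:i + 1] covers indices 0 .. n-1, i.e. every element up to and including index i.
inclusive = nodes[:i + 1]
```

With Python's half-open slices, `[:i + 1]` is the form that includes the element at the current loop index.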
zhangfeiv0	1956d87c1f	Increase riscv implementation in DepthwiseConvKernel (#127867 ) Summary: Increase riscv implementation in DepthwiseConvKernel. Compile: export USE_CUDA=0 export USE_DISTRIBUTED=0 export USE_MKLDNN=0 export MAX_JOBS=4 export CMAKE_CXX_COMPILER=clang++ export CMAKE_C_COMPILER=clang export CMAKE_C_FLAGS=-march=rv64gcv export CMAKE_CXX_FLAGS=-march=rv64gcv python3 setup.py develop --cmake Test Plan: Correctness - Check the results of the run before and after test_convolution.py python3 test/run_test.py --include nn/test_convolution --keep-going Before: ===== 9 passed, 13 skipped, 564 deselected in 46.55s ===== The following tests failed consistently: test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64 test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32 After: ===== 9 passed, 13 skipped, 564 deselected in 48.13s ===== The following tests failed consistently: test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d 
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64 test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32 Performance - Compare the results before and after mobilenet_v2 python3 run.py mobilenet_v2 -d cpu -t eval Before: Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32. CPU Wall Time per batch: 19590.647 milliseconds CPU Wall Time: 19590.647 milliseconds Time to first batch: 5271.3518 ms CPU Peak Memory: 0.3809 GB After: Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32. CPU Wall Time per batch: 13523.530 milliseconds CPU Wall Time: 13523.530 milliseconds Time to first batch: 2696.0304 ms CPU Peak Memory: 0.3408 GB Versions: Clang version: 17.0.2 Platform: CanMV-K230 Architecture: riscv64 OS: Ubuntu 23.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867 Approved by: https://github.com/malfet	2024-07-01 17:11:21 +00:00
PyTorch MergeBot	c9dc9887db	Revert "Enable UFMT on test/test_public_bindings.py (#128389 )" This reverts commit fe5424d0f8604f6e66d827ae9f94b05cb7119d55. Reverted https://github.com/pytorch/pytorch/pull/128389 on behalf of https://github.com/clee2000 due to broke test_mps.py::TestMPS::test_mps_allocator_module? https://github.com/pytorch/pytorch/actions/runs/9730750763/job/26854426294 `fe5424d0f8` Not sure how this change can do that. Build failed on PR so test didn't run ([comment](https://github.com/pytorch/pytorch/pull/128389#issuecomment-2200589719))	2024-07-01 16:34:04 +00:00
PyTorch MergeBot	433b691f98	Revert "[inductor] optimize cpp builder configuration code (#129577 )" This reverts commit 2e3ff394bf94d3b9cbab0fe8a93a9ea7c9cb4267. Reverted https://github.com/pytorch/pytorch/pull/129577 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D59181128 ([comment](https://github.com/pytorch/pytorch/pull/129577#issuecomment-2200554824))	2024-07-01 16:14:06 +00:00
PyTorch MergeBot	19e17216a2	Revert "[inductor] split cpu vec isa to dedicate file (keep git history) (#129789 )" This reverts commit 58f346c874a8a982679b4d4f3876602cc05d66d4. Reverted https://github.com/pytorch/pytorch/pull/129789 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/129577 ([comment](https://github.com/pytorch/pytorch/pull/129789#issuecomment-2200545144))	2024-07-01 16:08:44 +00:00
PyTorch MergeBot	b6dc37bb4e	Revert "[inductor] unificate toolchain code. (#129816 )" This reverts commit 67c9ec2b6d12ffd0e83861dcc16c1cd1a9b74d35. Reverted https://github.com/pytorch/pytorch/pull/129816 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #129577 ([comment](https://github.com/pytorch/pytorch/pull/129816#issuecomment-2200539687))	2024-07-01 16:06:22 +00:00
cyy	ca5d13c672	[1/N] Enable unused variable warnings on torch_cpu and fix some violations (#128670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128670 Approved by: https://github.com/ezyang	2024-07-01 14:56:46 +00:00
PyTorch MergeBot	e385bf8ef8	Revert "[halide-backend] Disable split reductions for Halide (#129320 )" This reverts commit a18eb651d352e45860a96869abaf9fb7b215eac6. Reverted https://github.com/pytorch/pytorch/pull/129320 on behalf of https://github.com/jeanschmidt due to This PR is breaking internal builds, please check comments on it D59204360 ([comment](https://github.com/pytorch/pytorch/pull/129320#issuecomment-2200351678))	2024-07-01 14:44:35 +00:00
PyTorch MergeBot	a83eaf1c3a	Revert "[halide-backend] Support manual schedules (#129321 )" This reverts commit 9ae78a578caff195821ad535a9e8d8ef59552142. Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664))	2024-07-01 14:42:33 +00:00
Xu Zhao	cc9b005bf2	Enable torchao nightly workflow (#129779 ) Summary: Make the following improvements: * Schedule the torchao benchmark nightly * Enable torchbench, timm, and huggingface models * Refactor the benchmarking script to better arrange the benchmarking groups Test workflow: https://github.com/pytorch/benchmark/actions/runs/9705589352 X-link: https://github.com/pytorch/benchmark/pull/2336 Differential Revision: D59074571 Pulled By: xuzhao9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129779 Approved by: https://github.com/jerryzh168	2024-07-01 14:28:38 +00:00
Xuehai Pan	75f64e1203	Fix test `test_type_hints.py::TestTypeHints::test_doc_examples` (#129829 ) As per the title, this test was broken for months. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129829 Approved by: https://github.com/ezyang	2024-07-01 13:28:37 +00:00
Jack Taylor	e1b426b345	[ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650 ) Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 11:40:09 +00:00
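The fallback described in the commit above amounts to preferring `HIP_VISIBLE_DEVICES` and consulting `CUDA_VISIBLE_DEVICES` only when the former is unset. The sketch below illustrates that lookup order under stated assumptions: `visible_devices_setting` is a hypothetical helper, not the real `_parse_visible_devices`, and it returns the raw string without parsing it.

```python
import os


def visible_devices_setting(env=None):
    """Return the raw device-visibility string, preferring the HIP variable.

    Hypothetical helper for illustration only; it does not parse the value.
    """
    env = os.environ if env is None else env
    hip = env.get("HIP_VISIBLE_DEVICES")
    if hip is not None:
        # HIP_VISIBLE_DEVICES is set (even if empty), so it takes precedence.
        return hip
    # Fall back to CUDA_VISIBLE_DEVICES; None means neither variable is set.
    return env.get("CUDA_VISIBLE_DEVICES")


only_cuda = visible_devices_setting({"CUDA_VISIBLE_DEVICES": "0,1"})
both_set = visible_devices_setting(
    {"HIP_VISIBLE_DEVICES": "2", "CUDA_VISIBLE_DEVICES": "0,1"}
)
neither = visible_devices_setting({})
```

Passing a plain dict instead of `os.environ` keeps the sketch testable without mutating the process environment.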
cyy	313eec02cc	Add hash function of std::string_view to torch/csrc/lazy/core/hash.h (#128800 ) For easier moving of c10::string_view to std::string_view in PyTorch/XLA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128800 Approved by: https://github.com/ezyang	2024-07-01 09:53:34 +00:00
Ramana Cherukuri	f6a0be5023	Add warpSize to Device properties (#128449 ) Adding warp_size to CudaDeviceProperties. >>> import torch >>> prop = torch.cuda.get_device_properties(torch.cuda.current_device()) >>> prop.warp_size 64 >>> @jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449 Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet	2024-07-01 09:13:32 +00:00
Nikita Shulga	04a0d85620	[BE] Print all pip packages installed on the system after TorchChat (#129809 ) This makes it easier to debug regressions like the one last Wednesday, when the release of a new torchao version resulted in TorchBench downgrading pytorch to 2.3.1. Test plan: Look at the log output; for example, https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158 contains ``` + echo 'Print all dependencies after TorchBench is installed' Print all dependencies after TorchBench is installed + python -mpip freeze absl-py==2.1.0 accelerate==0.31.0 aiohttp==3.9.5 aiosignal==1.3.1 astunparse==1.6.3 async-timeout==4.0.3 attrs==23.2.0 audioread==3.0.1 beautifulsoup4==4.12.3 boto3==1.19.12 botocore==1.22.12 bs4==0.0.2 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-07-01 04:51:53 +00:00
cyy	eb1583dbc1	[2/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129300 ) Follows #129055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129300 Approved by: https://github.com/ezyang	2024-07-01 01:09:00 +00:00
Animesh Jain	e62073d799	[dynamo] Skip FUNCTION_MATCH on method-wrapper objects (#129830 ) Fixes https://github.com/pytorch/pytorch/issues/118563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129830 Approved by: https://github.com/jansel	2024-06-30 20:21:18 +00:00
eqy	24b6c5a41f	[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 ) Fix for #129579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-06-30 19:37:44 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
eqy	7b0e9a27ba	Restore `allowed_info` in OOM message when applicable (#129546 ) Seems to be removed following #99699? Pull Request resolved: https://github.com/pytorch/pytorch/pull/129546 Approved by: https://github.com/Skylion007	2024-06-30 17:22:32 +00:00
Eddie Yan	8755e035d2	[CUDA][Pooling] Fix 64-bit indexing in `avg_pool_2d` backward attempt 2 (#129818 ) Somehow the original PR was missing the `CUDA_KERNEL_LOOP_TYPE` change??? Thanks @johnc-keen @Chillee for the great repro! (#129785) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129818 Approved by: https://github.com/Chillee, https://github.com/Skylion007	2024-06-30 16:52:33 +00:00
eqy	4dd3cff234	[CUDA] Fix more `DeviceIndex` printing (#128540 ) Same `char` dtype causing device index `0` to be interpreted as a null-terminator, see also #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128540 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-06-30 16:44:14 +00:00
eqy	68484621fe	[cuDNN][functorch] Bump tolerances for `nn.functional.conv2d` in `test_vmap_autograd_grad` (#129796 ) Newer versions of cuDNN can dispatch to a winograd kernel here on A100 which affects numerics a bit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129796 Approved by: https://github.com/Skylion007	2024-06-30 16:36:12 +00:00
Weizhuo Zhang	fff633f087	[CI] Enable AOT inductor FP32 accuracy test for CPU (#129040 ) This PR enabled AOT inductor backend FP32 accuracy check for CPU in CI workflow, which could catch AOT inductor issue at early stage. Test Time cost: \| Suite \| Precision \| Time cost \| \|------------- \|----------- \|----------- \| \| Huggingface \| FP32 \| 1h12m \| \| Timm models \| FP32 \| 1h32m \| \| Torchbench \| FP32 \| 1h40m \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/129040 Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/malfet	2024-06-30 14:00:09 +00:00
Randolf Scholz	8a5fda0377	added type hints for __contains__ (#129653 ) - Fixes #129646 - Added test in test/typing/reveal/tensor_constructors.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129653 Approved by: https://github.com/ezyang	2024-06-30 11:49:11 +00:00
leslie-fang-intel	1a689ea38c	[Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op (#129048 ) Summary Based on the previous PR, add the config to support int8 output and unary post op fusion with `ReLU` and `GeLU` - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16/uint8 - Post Op Fusion: with unary post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [ ] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129048 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825	2024-06-30 09:53:55 +00:00
leslie-fang-intel	35a197defa	[Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825 ) Summary Support the int8 GEMM Template with a reference MicroInt8GEMM kernel for the case: - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16 - Post Op Fusion: without unary post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise ``` Next Step - [ ] Unary post op fusion - [ ] Int8 output - [ ] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-30 09:45:43 +00:00
dilililiwhy	fe5424d0f8	Enable UFMT on test/test_public_bindings.py (#128389 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > test/test_public_bindings.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389 Approved by: https://github.com/ezyang	2024-06-30 08:49:51 +00:00
Xuehai Pan	4ee1cb9b95	[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426 Approved by: https://github.com/malfet	2024-06-30 01:36:07 +00:00
PyTorch MergeBot	2effbcfcd8	Revert "[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 )" This reverts commit 6d75604ef135925e8c85363c2f4a5e0b6f7fef28. Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))	2024-06-29 23:24:06 +00:00
Xu Han	67c9ec2b6d	[inductor] unificate toolchain code. (#129816 ) This PR implements plan 2 of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 and is a follow-up to https://github.com/pytorch/pytorch/pull/129789 Changes: 1. Unify the cpp builder's toolchain code. 2. Move all build-related code to `cpp_builder.py`. 3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816 Approved by: https://github.com/jansel	2024-06-29 23:21:13 +00:00
leslie-fang-intel	3fec0efd34	[Inductor][CPP] Support vectorization of bitwise fn (#129733 ) Summary When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support for 6 bitwise functions. We also remove `bitwise_xor` from the `ops_to_bool` list, which sets the output data type to bool in data type propagation. That seems wrong: according to https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type as the input, and the testcase `test_bitwise3` failed due to this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733 Approved by: https://github.com/jgong5, https://github.com/Skylion007	2024-06-29 17:25:27 +00:00
Xuehai Pan	6d75604ef1	[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426 Approved by: https://github.com/malfet	2024-06-29 15:42:09 +00:00
Xuehai Pan	7837a12474	[BE] enforce style for empty lines in import segments (#129751 ) This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet: > Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one) `usort` allows empty lines within import segments. For example, `usort` do not change the following code: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` This PR first sort imports via `isort`, then re-sort the file using `ufmt` (`usort` + `black`). This enforces the following import style: 1. no empty lines within segments. 2. single empty line between segments. 3. two spaces after import statements. All the code snippets above will be formatted to: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` which produces a consistent code style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751 Approved by: https://github.com/malfet	2024-06-29 14:15:24 +00:00
Jason Ansel	9ae78a578c	[halide-backend] Support manual schedules (#129321 ) Currently using this for some by-hand hacking, but might need to implement our own scheduler later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321 Approved by: https://github.com/shunting314 ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320	2024-06-29 14:06:28 +00:00
Jason Ansel	a18eb651d3	[halide-backend] Disable split reductions for Halide (#129320 ) In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026, #127506, #129036	2024-06-29 14:06:28 +00:00
Jason Ansel	4cb8cb04a7	[halide-backend] Enable bfloat16 support (#129036 ) Requires https://github.com/halide/Halide/pull/8255 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026, #127506	2024-06-29 14:06:25 +00:00
Jason Ansel	b93bf55b6a	[halide-backend] Add GPU support (#127506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026	2024-06-29 14:06:21 +00:00
Jason Ansel	86cadc6385	[halide-backend] Dimension-based indexing (#129026 ) Prior to this the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers dimensions, and this 1D index triggers a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code.

Before
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1,]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129026 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025	2024-06-29 14:06:16 +00:00
Jason Ansel	da5f37515e	[halide-backend] Generate standalone runtime (#129025 ) This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417	2024-06-29 14:06:12 +00:00
Jason Ansel	e34b7e6af3	[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-29 14:06:08 +00:00
Howard Huang	13d4be1dc7	[pipelining] Support W action for schedules (#129233 ) Add support for the `W` action in `_step_microbatches`. ## TODO: - Clean up the tests; there's a lot of copy-pasted, repeated code there Co-authored-by: Will Constable <whc@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129233 Approved by: https://github.com/wconstab ghstack dependencies: #128983, #128976	2024-06-29 11:51:40 +00:00
Howard Huang	a6da01bd01	[pipelining] Support arbitrary stage ordering on ranks (#128976 ) Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665 Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumption and allows arbitrary ordering of stages. For example, in the added test we use rank 0: [stage0, stage3], rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing) which runs 1 microbatch through this schedule ``` F0_0 None None F0_3 B0_3 None None B0_0 None F0_1 F0_2 None None B0_2 B0_1 None ``` Co-authored-by: Will Constable <whc@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976 Approved by: https://github.com/wconstab ghstack dependencies: #128983	2024-06-29 11:51:39 +00:00
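To make the relaxed assumption in the commit above concrete, the sketch below compares the stage assignment produced by the quoted `range(rank, total_stages, num_local_stages)` formula with the arbitrary assignment used in the added test. The helper names `looped_stage_ids` and `covers_all_stages` are hypothetical and are not part of the pipelining API.

```python
def looped_stage_ids(rank, total_stages, num_local_stages):
    """Stage ids a rank would own under the previous looped-schedule assumption."""
    return list(range(rank, total_stages, num_local_stages))


def covers_all_stages(assignment, total_stages):
    """Check that every stage id appears exactly once across all ranks."""
    owned = sorted(s for stage_ids in assignment.values() for s in stage_ids)
    return owned == list(range(total_stages))


# Two ranks, four stages, two local stages per rank.
looped = {rank: looped_stage_ids(rank, 4, 2) for rank in range(2)}

# The ordering from the commit message: rank 0 owns stages 0 and 3, rank 1 owns 1 and 2.
arbitrary = {0: [0, 3], 1: [1, 2]}

both_valid = covers_all_stages(looped, 4) and covers_all_stages(arbitrary, 4)
```

Both assignments cover every stage exactly once, but only the first matches the formula, which is why the formula could not remain a hard requirement.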
Will Constable	18ae3bab2f	[Pipelining] Support separate dw_runner for PipelineStage (#128983 ) Fixes #128974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128983 Approved by: https://github.com/H-Huang	2024-06-29 11:51:34 +00:00
谭九鼎	b0e5c9514d	use shutil.which in check_compiler_ok_for_platform (#129069 ) the same as https://github.com/pytorch/pytorch/pull/126060 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129069 Approved by: https://github.com/ezyang	2024-06-29 11:38:51 +00:00
Xuehai Pan	56935684c3	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X \| Y`, `Optional[X] -> X \| None`, `Optional[Union[X, Y]] -> X \| Y \| None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-29 09:23:39 +00:00
Xuehai Pan	9120992c72	[BE][Easy] enable postponed annotations in `torchgen` (#129376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376 Approved by: https://github.com/ezyang ghstack dependencies: #129375	2024-06-29 09:23:39 +00:00
Xuehai Pan	8a67daf283	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-29 09:23:35 +00:00
Xu Han	58f346c874	[inductor] split cpu vec isa to dedicate file (keep git history) (#129789 ) This PR implements plan 1 of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 Changes: 1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`. 2. Make `cpu_vec_isa.py` a dedicated file for CPU vec ISA. This also makes it easier to extend to more architectures and vec ISAs. 3. Update code for the above changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-29 07:19:54 +00:00
Animesh Jain	a676b7c5f3	Add XGLMForCausalLM to the flaky model list (#129776 ) Not failing on devGPU. Went to CI machine ... flaky. So adding to the flaky list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129776 Approved by: https://github.com/mlazos ghstack dependencies: #129583, #129610, #129775	2024-06-29 05:47:28 +00:00
Animesh Jain	5d1763d159	Add lcnet to the inline_inbuilt_nn_module list (#129775 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775 Approved by: https://github.com/mlazos ghstack dependencies: #129583, #129610	2024-06-29 05:47:28 +00:00
Wenlei He	89696db4b0	Revert "[LLVM/TensorExpr] Update for an API change in LLVM 18." (#129797 ) This reverts commit 20f394f10a389bcf13485929be8862f98ad4b322 (https://github.com/pytorch/pytorch/pull/117086) LLVM upstream changed the pass builder API again, so registerPassBuilderCallbacks no longer takes extra boolean for PopulateClassToPassNames. Update accordingly. Relevant LLVM upstream change: https://github.com/llvm/llvm-project/pull/96321 https://github.com/llvm/llvm-project/pull/96462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129797 Approved by: https://github.com/dcci	2024-06-29 05:17:20 +00:00
Boyuan Feng	3ef44df667	[ts-migration] support prim::SetAttr and fix prim::GetAttr (#129440 ) - Lifting Tensor Constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers. So previous converter cannot access tensor constant attributes. This PR fixed the issue. - Add SetAttr support for tensor attributes by copy_. - Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass on non-tensor attributes. So we can support the following use case: ```python def forward(self, x): c1 = self.count self.count += 1 c2 = self.count return x + c1 + c2 ``` - Fixed a bug in GetAttr to support the following use case: ```python def forward(self, inp): x = self.buffer self.buffer += 1 y = self.buffer return x + y + inp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440 Approved by: https://github.com/angelayi	2024-06-29 05:08:13 +00:00
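The `SetAttr` use case in the commit above depends on each read of `self.count` seeing the value current at that point in `forward`. A minimal plain-Python sketch of that semantics follows; it is not the TorchScript converter itself, and the class name `CountingModule` is hypothetical, introduced only for illustration.

```python
class CountingModule:
    """Plain-Python stand-in for a module with a non-tensor attribute (hypothetical name)."""

    def __init__(self) -> None:
        self.count = 0

    def forward(self, x: int) -> int:
        c1 = self.count   # read before the mutation
        self.count += 1   # attribute mutation (the SetAttr in the commit's example)
        c2 = self.count   # read after the mutation, must observe the new value
        return x + c1 + c2


module = CountingModule()
first = module.forward(10)   # c1 = 0, c2 = 1
second = module.forward(10)  # c1 = 1, c2 = 2
```

A converter that replayed a stale value for the second read would compute `x + 2 * c1` instead, which is why the commit tracks the current value of each non-tensor attribute.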
Yanbo Liang	ec47d4d9a8	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-29 04:44:38 +00:00
Yanbo Liang	7b5a8424a1	[GPT-fast] Update micro benchmark numbers as A100-50G (#129799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129799 Approved by: https://github.com/Chillee	2024-06-29 04:36:07 +00:00
Mayank Mishra	065c386990	Allow get attributes on DDP similar to FSDP (#128620 ) FSDP implements the following logic, but it is missing from DDP. This PR adds an equivalent method to DDP. ```python def __getattr__(self, name: str) -> Any: """Forward missing attributes to the wrapped module.""" try: return super().__getattr__(name) # defer to nn.Module's logic except AttributeError: return getattr(self._fsdp_wrapped_module, name) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620 Approved by: https://github.com/awgu	2024-06-29 01:57:22 +00:00
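The attribute forwarding described in the commit above can be sketched without PyTorch. The example below is an illustration under assumptions, not the DDP or FSDP implementation: `InnerModel`, `ForwardingWrapper`, and `_wrapped_module` are hypothetical names. It relies on the fact that Python calls `__getattr__` only after normal attribute lookup has failed.

```python
from typing import Any


class InnerModel:
    """Stand-in for the user's module (hypothetical)."""

    def __init__(self) -> None:
        self.hidden_size = 16

    def describe(self) -> str:
        return f"hidden_size={self.hidden_size}"


class ForwardingWrapper:
    """Stand-in for a wrapper such as DDP (hypothetical)."""

    def __init__(self, module: Any) -> None:
        self._wrapped_module = module
        self.world_size = 1  # attribute owned by the wrapper itself

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup on the wrapper fails.
        # Look in __dict__ directly so a missing _wrapped_module
        # cannot trigger infinite recursion.
        wrapped = self.__dict__.get("_wrapped_module")
        if wrapped is None:
            raise AttributeError(name)
        return getattr(wrapped, name)


wrapper = ForwardingWrapper(InnerModel())
own_value = wrapper.world_size         # resolved on the wrapper
forwarded_value = wrapper.hidden_size  # forwarded to the inner model
forwarded_call = wrapper.describe()    # methods are forwarded as well
```

Attributes defined on the wrapper always win, so forwarding only fills in names the wrapper does not define.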
Nikita Shulga	2bc6f329b2	Make PyTorch argparser understand complex (#129580 ) It understands float and int, so why not `complex`. Test plan: `python -c "import torch;print(torch.rand(3, dtype=complex))"` Fixes https://github.com/pytorch/pytorch/issues/126837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129580 Approved by: https://github.com/albanD	2024-06-29 01:21:12 +00:00
PyTorch MergeBot	dfd55d1714	Revert "[cond] inlining into one of the branches when pred is a python constant (#128709 )" This reverts commit 23adf166e166bd56e3446284939af7e46a181079. Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking one ExecuTorch test ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2197806850))	2024-06-29 01:03:55 +00:00
PyTorch MergeBot	3d96217891	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit 9e1f3ecaa710785a1ab03c6ad5093a5566d6c5e5. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))	2024-06-29 00:47:15 +00:00
Haiping Zhao	c0782e7c81	Kineto profiler: collecting observer traces from C++ child threads (#128743 ) Summary: In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is, pushProfilingCallbacks() is not called on child threads, therefore, no observer traces are collected on these threads, entirely missing in the final output. This diff provides a new API that a child thread may elect to call to register itself onto the profiler that was started in main thread (or whatever the Python thread that manages the profiler). Test Plan: ``` buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread ``` Reviewed By: aaronenyeshi Differential Revision: D56669942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743 Approved by: https://github.com/aaronenyeshi	2024-06-29 00:44:30 +00:00
PyTorch MergeBot	a32ce5ce34	Revert "[BE][Easy] enable postponed annotations in `tools` (#129375 )" This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0. Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:25 +00:00
PyTorch MergeBot	6063bb9d45	Revert "[BE][Easy] enable postponed annotations in `torchgen` (#129376 )" This reverts commit 494057d6d4e9b40daf81a6a4d7a8c839b7424b14. Reverted https://github.com/pytorch/pytorch/pull/129376 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:25 +00:00
PyTorch MergeBot	83caf4960f	Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 )" This reverts commit e40f50cb87bcd176a380b729af5dda13dbe9c399. Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:24 +00:00
PyTorch MergeBot	00d7bba2fa	Revert "[BE] enforce style for empty lines in import segments (#129751 )" This reverts commit f5ff1a3ab9ef279655308266029faf6543a8a1ca. Reverted https://github.com/pytorch/pytorch/pull/129751 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129751#issuecomment-2197799814))	2024-06-29 00:41:41 +00:00
PyTorch MergeBot	fa6c0fe3e4	Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749 )" This reverts commit 9450e198aa0bdf6f81ccb8ad2f74c06e81d1af6e. Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2197790226))	2024-06-29 00:16:47 +00:00
Andrew Gu	24f69eef6a	[FSDP2] Ran reduce-scatter copy-in in default stream (#129721 ) This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools. - In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer). - This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests). Experiment (Before) Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1: ``` [rank0]:2024-06-27 16:38:56,620 - root - INFO - step: 1 loss: 12.2764 memory: 71.99GiB(75.75%) wps: 1,436 mfu: 8.41% [rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:2024-06-27 16:38:57,943 - root - INFO - step: 2 loss: 12.1001 memory: 79.82GiB(83.98%) wps: 6,195 mfu: 36.28% [rank0]:2024-06-27 16:38:59,266 - root - INFO - step: 3 loss: 11.7697 memory: 79.82GiB(83.98%) wps: 6,193 mfu: 36.27% [rank0]:2024-06-27 16:39:00,587 - root - INFO - step: 4 loss: 11.2807 memory: 79.82GiB(83.98%) wps: 6,203 mfu: 36.32% [rank0]:2024-06-27 16:39:01,910 - root - INFO - step: 5 loss: 10.9494 memory: 79.82GiB(83.98%) wps: 6,198 mfu: 36.30% ``` (After) Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1: ``` [rank0]:2024-06-27 16:41:12,106 - root - INFO - step: 1 loss: 12.2560 memory: 69.46GiB(73.08%) wps: 1,158 
mfu: 6.78% [rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:2024-06-27 16:41:13,502 - root - INFO - step: 2 loss: 12.0949 memory: 77.29GiB(81.32%) wps: 5,870 mfu: 34.37% [rank0]:2024-06-27 16:41:14,839 - root - INFO - step: 3 loss: 11.7770 memory: 77.29GiB(81.32%) wps: 6,130 mfu: 35.90% [rank0]:2024-06-27 16:41:16,154 - root - INFO - step: 4 loss: 11.3188 memory: 77.29GiB(81.32%) wps: 6,230 mfu: 36.48% [rank0]:2024-06-27 16:41:17,474 - root - INFO - step: 5 loss: 10.9443 memory: 77.29GiB(81.32%) wps: 6,211 mfu: 36.37% ``` 2.53 GiB reduction in peak reserved memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721 Approved by: https://github.com/weifengpy, https://github.com/yifuwang	2024-06-28 23:55:12 +00:00
Sahan Paliskara	f06e3a1569	[Split Build] Make script not crash if split build is not set (#129774 ) Fixes issue causing https://github.com/pytorch/pytorch/actions/runs/9704484834/job/26801889463 to crash Pull Request resolved: https://github.com/pytorch/pytorch/pull/129774 Approved by: https://github.com/atalman	2024-06-28 23:50:18 +00:00
Aaron Gokaslan	7bda23ef84	[BE]: Update ruff to 0.5.0 (#129744 ) Update ruff to 0.5.0 so we can enable some of the new checks I've been wanting to add to the codebase. This first step just updates the code to comply with some rule changes and a couple of minor API changes / deprecations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744 Approved by: https://github.com/ezyang	2024-06-28 21:49:56 +00:00
Mohamed Yassine Kabouri	0a337613f8	Fix typo in stack_module_state doc (#129126 ) I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126 Approved by: https://github.com/zou3519	2024-06-28 21:36:40 +00:00
Xuehai Pan	f5ff1a3ab9	[BE] enforce style for empty lines in import segments (#129751 ) This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet: > Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one) `usort` allows empty lines within import segments. For example, `usort` does not change any of the following snippets: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style: 1. no empty lines within segments. 2. single empty line between segments. 3. two spaces after import statements. All the code snippets above will be formatted to: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` which produces a consistent code style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751 Approved by: https://github.com/malfet	2024-06-28 21:02:59 +00:00
Joona Havukainen	5b96a552df	Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484 ) Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value. The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484 Approved by: https://github.com/kulinseth	2024-06-28 20:57:40 +00:00
Zaida Zhou	bc8883a7c4	fix the error msg in device_mesh (#129747 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747 Approved by: https://github.com/awgu, https://github.com/wconstab	2024-06-28 20:12:09 +00:00
Mikayla Gawarecki	45f3e20527	Improve error message for weights_only load (#129705 ) As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and should print the code for the user to call `add_safe_globals` New formatting looks like such In the case that `add_safe_globals` can be used ```python >>> import torch >>> from torch.testing._internal.two_tensor import TwoTensor >>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt") >>> torch.load("two_tensor.pt", weights_only=True) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function. Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. 
``` For other issues (unsupported bytecode) ```python >>> import torch >>> t = torch.randn(2, 3) >>> torch.save(t, "protocol_5.pt", pickle_protocol=5) >>> torch.load("protocol_5.pt", weights_only=True) /data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this. warnings.warn( Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149 Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. ``` Old formatting would have been like: ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. 
WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705 Approved by: https://github.com/albanD, https://github.com/vmoens ghstack dependencies: #129239, #129396, #129509	2024-06-28 19:36:31 +00:00
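The allowlisting idea behind `add_safe_globals` in the commit above can be illustrated with the standard library alone. This is a sketch under assumptions, not PyTorch's `weights_only` unpickler: `AllowlistUnpickler` and `safe_loads` are hypothetical names, and the error text only loosely mirrors the message quoted above. It uses the documented `pickle.Unpickler.find_class` hook to reject any global that was not explicitly allowed.

```python
import io
import pickle
from fractions import Fraction


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only explicitly allowed classes (hypothetical helper)."""

    def __init__(self, file, allowed=()):
        super().__init__(file)
        # Map (module, qualified name) to the class object the caller trusts.
        self._allowed = {(c.__module__, c.__qualname__): c for c in allowed}

    def find_class(self, module, name):
        try:
            return self._allowed[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"Unsupported global: {module}.{name} was not an allowed global"
            ) from None


def safe_loads(data, allowed=()):
    """Load pickled bytes, permitting only the classes listed in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(Fraction(1, 3))
restored = safe_loads(payload, allowed=[Fraction])  # succeeds: Fraction is allowlisted
```

Calling `safe_loads(payload)` without the allowlist raises `pickle.UnpicklingError`, while containers of plain numbers and strings load without touching `find_class` at all.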
Rachel Guo	99456a612b	[AOTI] Properly indent launchKernel calls in AOTInductor (#129616 ) Summary: There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor: the launchKernel() call isn't properly indented. Added indentation for the launchKernel() code block when it sits inside an "if" condition, i.e. when `grid_uses_symbolic_shapes` is `True`. Test Plan: Test command run (in pytorch oss): `TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda` The generated output code was then manually verified in a path like `/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp` The same was verified for the test case `test_zero_grid_with_unbacked_symbols_abi_compatible_cuda`. Differential Revision: D58897157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616 Approved by: https://github.com/ColinPeppler	2024-06-28 19:16:18 +00:00
Animesh Jain	6120aa3718	[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 ) TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow. With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model. Functionality impact - The only difference between dict and OrderedDict is the `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks but this PR keeps the OrderedDict for hooks untouched (we should still follow up with hooks but in a separate PR). Perf impact - I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). Typing impact - I don't anticipate any. For all the user visible methods for nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #129163	2024-06-28 18:30:13 +00:00
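The ordering argument in the commit above can be checked with a few lines of standard-library Python. This is only an illustration of `dict` versus `OrderedDict` behavior, not code from `nn.Module`; the keys and values are made up for the example.

```python
from collections import OrderedDict

# A plain dict preserves insertion order (guaranteed since Python 3.7).
modules = {}
modules["conv"] = "Conv2d"
modules["bn"] = "BatchNorm2d"
modules["relu"] = "ReLU"
order_plain = list(modules)

# move_to_end exists only on OrderedDict.
hooks = OrderedDict(modules)
hooks.move_to_end("conv")
order_moved = list(hooks)

# The same reordering on a plain dict: pop the key and re-insert it.
modules["conv"] = modules.pop("conv")
order_emulated = list(modules)
```

Since `_parameters`, `_modules`, and `_buffers` never call `move_to_end`, according to the commit message, a plain `dict` gives them the same iteration order.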
zrr1999	db4c7bb7fc	Refine typing annotation for compile (#129136 ) before ![image](https://github.com/pytorch/pytorch/assets/46243324/91372d0f-ad0e-4abe-9582-7fe892f99ec8) after ![image](https://github.com/pytorch/pytorch/assets/46243324/175066ff-78f9-44a1-a3bb-5df809f7e86d) Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129136 Approved by: https://github.com/ezyang	2024-06-28 17:57:44 +00:00
FEI	59e4e92556	sdp::SDPBackend::flash_attention support PrivateUse1 (#126392 ) Fixes https://github.com/pytorch/pytorch/issues/124271 cc @cpuhrsch @drisspg @albanD @soulitzer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392 Approved by: https://github.com/drisspg	2024-06-28 17:48:40 +00:00
Chien-Chin Huang	26d633b721	[BE] Correctly catch skip signals emitting from sys.exit in Sandcastle (#129731 ) https://github.com/pytorch/pytorch/pull/129581 does not work correctly with Sandcastle environment. This PR fixes the issue. Differential Revision: [D59144062](https://our.internmc.facebook.com/intern/diff/D59144062/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129731 Approved by: https://github.com/wz337	2024-06-28 17:24:12 +00:00
Isuru Fernando	c12a4f2e65	Add decomposition for slice_scatter (#123744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744 Approved by: https://github.com/peterbell10	2024-06-28 17:02:10 +00:00
Joel Schlosser	6897631ceb	Guard on inner tensor names for traceable wrapper subclasses (#129618 ) Fixes #129601 Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`. This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618 Approved by: https://github.com/anijain2305	2024-06-28 16:30:25 +00:00
Ying Zhao	b84036e3fb	[AOTI] Fix test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation (#129173 ) Fixes #122978 ## Summary To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation - Error 1 ``` error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’ 613 \| ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_); \| ^ ... torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note: no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’ 188 \| MiniArrayRef<const int64_t> sizes, \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~ ``` Fix: added constructor for empty array in arrayref_tensor.h - Error 2 ``` error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’} 625 \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw)); \| ^~~~ \| \| \| torch::aot_inductor::ArrayRefTensor<float> ``` Fix: in cpp_wrapper_cpu.py, added codegen to call convert ArrayRefTensor to AtenTensorHandle first. ## Test Plan ``` python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation ``` Before the fix, detailed in #122978: ``` \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw)); \| ^~~~ \| \| \| torch::aot_inductor::ArrayRefTensor<float> /home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’ Ran 1 test in 4.377s FAILED (errors=1) ``` After the fix ``` /home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. 
To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( stats [('calls_captured', 3), ('unique_graphs', 1)] inductor [('extern_calls', 1)] . ---------------------------------------------------------------------- Ran 1 test in 9.633s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173 Approved by: https://github.com/chenyang78	2024-06-28 16:27:42 +00:00
Oguz Ulgen	04264efab6	Add structured logging on FXGraphCache hit (#129588 ) We'll also want to do this for AOTAutogradCache once that's ready Differential Revision: [D59144226](https://our.internmc.facebook.com/intern/diff/D59144226) Co-authored-by: Oguz Ulgen <oulgen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129588 Approved by: https://github.com/oulgen, https://github.com/xmfan	2024-06-28 16:06:22 +00:00
Xuehai Pan	e40f50cb87	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X \| Y`, `Optional[X] -> X \| None`, `Optional[Union[X, Y]] -> X \| Y \| None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-28 15:37:57 +00:00
Xuehai Pan	494057d6d4	[BE][Easy] enable postponed annotations in `torchgen` (#129376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376 Approved by: https://github.com/ezyang ghstack dependencies: #129375	2024-06-28 15:37:57 +00:00
Xuehai Pan	59eb2897f1	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-28 15:37:54 +00:00
Xu Han	2e3ff394bf	[inductor] optimize cpp builder configuration code (#129577 ) Changes: 1. Combine the ISA selection dispatch code. 2. Unify the macOS OpenMP configuration code. 3. Remove unused code. Co-authored-by: Jason Ansel <jansel@jansel.net> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-28 15:08:54 +00:00
Manuel Candales	eabe6574c0	[metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628 Approved by: https://github.com/kimishpatel	2024-06-28 15:01:30 +00:00
Andrew Gu	635d6c9d66	[FSDP2] Ran post-acc-grad hooks manually (#129450 ) FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually. Discussion Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity. Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not. Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually. Caveats - Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass). - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be. - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers. - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`. - The post-accumulate-grad hook runs in the reduce-scatter stream. 
Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream. - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see a roughly 3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues. - This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope. Experiments (torchtitan) - Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision: - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped) - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450 Approved by: https://github.com/weifengpy, https://github.com/yf225	2024-06-28 14:50:09 +00:00
Nikita Shulga	fe4032fe20	[BE][CMake] Do not use `EXEC_PROGRAM` (#129714 ) It has been deprecated since CMake 3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html This makes the following warning disappear: ``` CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM): Policy CMP0153 is not set: The exec_program command should not be called. Run "cmake --help-policy CMP0153" for policy details. Use the cmake_policy command to set the policy and suppress this warning. Use execute_process() instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714 Approved by: https://github.com/kit1980	2024-06-28 13:29:52 +00:00
Yu, Guangye	98d34d849d	Add a XPU UT to ensure lazy init (#129638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129638 Approved by: https://github.com/gujinghui	2024-06-28 13:22:17 +00:00
Randolf Scholz	22a06869f2	include jit/*.pyi (#129654 ) Fixes #108781, see https://github.com/pytorch/pytorch/pull/108782#issuecomment-1927321532 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129654 Approved by: https://github.com/ezyang	2024-06-28 12:40:11 +00:00
Xu Han	424068d0d2	[Windows] remove mkl shared library dependency. (#129493 ) # Background I have fixed the missing mkl shared library dependency issue in PyTorch on Windows: https://github.com/pytorch/pytorch/issues/124009 The solution is to link the mkl library statically into the torch_cpu module: 1. PyTorch PR to link mkl statically: https://github.com/pytorch/pytorch/pull/124925 2. builder PR to install the mkl static library: https://github.com/pytorch/builder/pull/1790 I double-checked that the current build links mkl statically: https://github.com/pytorch/pytorch/issues/124009#issuecomment-2160941802 # Goal Remove the setup.py `install_requires` entry that installs the mkl shared library for PyTorch on Windows. It is no longer required because mkl is now statically linked. This reduces the network traffic of a PyTorch install and avoids installing an unneeded mkl shared library package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129493 Approved by: https://github.com/malfet	2024-06-28 11:42:21 +00:00
Shan19900305	a0dac3de31	Noise tensor using same size/stride with input to promote performance when channel last situation. (#129467 ) All ops in the _dropout_impl function are point-wise ops. When the input and output tensors have the same size and stride, these operators perform better. So I have removed the memory format argument from at::empty_like when making the noise tensor. @ezyang Test code: ``` import torch input1 = torch.randn((50, 20, 50 ,30)).cuda() input2 = torch.randn((50, 20, 50 ,30)).cuda().to(memory_format=torch.channels_last) input3 = torch.randn((50, 20, 50 , 50)).cuda()[...,10:40] dropout = torch.nn.Dropout(p=0.5, inplace=True) # warmup: for i in range(20): output = dropout(input1) start_event = torch.cuda.Event(enable_timing=True) end_event = torch.cuda.Event(enable_timing=True) num = 10000 start_event.record() for i in range(num): output = dropout(input1) end_event.record() end_event.synchronize() time = start_event.elapsed_time(end_event) print("input1 each time: {0}.".format(time * 1.0/num), flush =True) start_event.record() for i in range(num): output = dropout(input2) end_event.record() end_event.synchronize() time = start_event.elapsed_time(end_event) print("input2 each time: {0}.".format(time * 1.0/num), flush =True) start_event.record() for i in range(num): output = dropout(input3) end_event.record() end_event.synchronize() time = start_event.elapsed_time(end_event) print("input3 each time: {0}.".format(time * 1.0/num), flush =True) ``` Test result: \| Operator \| Input size / stride \| Memory format passed to empty \| Time (ms) \| Notes -- \| -- \| -- \| -- \| -- \| -- 1 \| dropout \| (50, 20, 50 ,30) / (30000, 1500, 30, 1) \| LEGACY_CONTIGUOUS_MEMORY_FORMAT \| 0.0426735 \| 2 \| dropout \| (50, 20, 50 ,30) / (30000, 1, 600, 20) \| LEGACY_CONTIGUOUS_MEMORY_FORMAT \| 0.0461689 \| 3 \| dropout \| (50, 20, 50 ,30) / (50000, 2500, 50, 1) \| LEGACY_CONTIGUOUS_MEMORY_FORMAT \| 0.0512882 \| 4 \| dropout \| (50, 20, 50 ,30) / (30000, 1500, 30, 1) \| none (size/stride follow the input) \| 0.0426598 \| roughly the same as 1 5 \| dropout \| (50, 20, 50 ,30) / (30000, 1, 600, 20) \| none (size/stride follow the input) \| 0.0422751 \| about 8.4% faster than 2 6 \| dropout \| (50, 20, 50 ,30) / (50000, 2500, 50, 1) \| none (size/stride follow the input) \| 0.0509037 \| roughly the same as 3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129467 Approved by: https://github.com/ezyang	2024-06-28 10:06:13 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit b7e7a4cb01de394af7686ab6feb216a8a5c716bb. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
PyTorch MergeBot	d21993bbb8	Revert "[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 )" This reverts commit 7854d84acbfb7a4e3e807951188535a0316b585e. Reverted https://github.com/pytorch/pytorch/pull/129587 on behalf of https://github.com/huydhn due to Sorry for revert yet another of your change but I need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196187332 ([comment](https://github.com/pytorch/pytorch/pull/129587#issuecomment-2196198756))	2024-06-28 06:01:07 +00:00
PyTorch MergeBot	c43923a116	Revert "[Inductor] FlexAttention supports block sparse mask (#129216 )" This reverts commit b9d3cedd648d4ed9d0bf5b918893341e5f95289a. Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is still failing in trunk `b9d3cedd64`, maybe a landrace given that TD has been turned off ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2196182882))	2024-06-28 05:44:46 +00:00
hippocookie	73eb4503cc	Enable UFMT for numpy_test files, test_xnnpack_integration.py (#129023 ) Fixes #123062 Run lintrunner on files: test/test_xnnpack_integration.py ```bash $ lintrunner FLAKE8 success! CLANGFORMAT success! MYPY success! MYPYSTRICT success! CLANGTIDY success! TYPEIGNORE success! TYPENOSKIP success! NOQA success! NATIVEFUNCTIONS success! NEWLINE success! CONSTEXPR success! SPACES success! TABS success! INCLUDE success! PYBIND11_INCLUDE success! ERROR_PRONE_ISINSTANCE success! PYBIND11_SPECIALIZATION success! PYPIDEP success! EXEC success! CUBINCLUDE success! RAWCUDADEVICE success! RAWCUDA success! ROOT_LOGGING success! DEPLOY_DETECTION success! CMAKE success! SHELLCHECK success! ACTIONLINT success! TESTOWNERS success! TEST_HAS_MAIN success! CALL_ONCE success! ONCE_FLAG success! WORKFLOWSYNC success! UFMT success! COPYRIGHT success! BAZEL_LINTER success! LINTRUNNER_VERSION success! ATEN_CPU_GPU_AGNOSTIC success! MERGE_CONFLICTLESS_CSV success! RUFF success! ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129023 Approved by: https://github.com/ezyang	2024-06-28 05:40:31 +00:00
Peter Bell	b019f38fdd	[inductor] Fix pattern replacements with multiple users (#129689 ) Fixes #129685 After matching a pattern, we currently try to remove all the nodes of that pattern, which doesn't work if any intermediate node has users outside of the pattern. In which case we can't delete those particular nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689 Approved by: https://github.com/shunting314	2024-06-28 05:16:17 +00:00
eqy	7854d84acb	[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 ) Fix for #129579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-06-28 04:42:45 +00:00
Daniel Richard G.	8d4216af8c	Fix compile error with Intel oneAPI compiler (#129589 ) I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error: ``` [ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2: /home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct] 119 \| for (const ::std::optional<int64_t> to : tos) { \| ^ /home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying 119 \| for (const ::std::optional<int64_t> to : tos) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| & 1 error generated. ``` This change makes the compiler happy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589 Approved by: https://github.com/colesbury	2024-06-28 02:35:10 +00:00
Yidi Wu	4b8a5e0374	[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 ) Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph. This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident. Test Plan: Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680 Approved by: https://github.com/angelayi	2024-06-28 02:22:30 +00:00
Nikita Shulga	4b598d87d3	Fix FindBLAS.cmake (#129713 ) Fixes regression introduced by https://github.com/pytorch/pytorch/pull/125227 by adding `INCLUDE(CheckFunctionExists)` that fixes ``` CMake Error at cmake/Modules/FindBLAS.cmake:413 (check_function_exists): Unknown CMake command "check_function_exists". ``` Fixes https://github.com/pytorch/pytorch/issues/129693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129713 Approved by: https://github.com/kit1980	2024-06-28 02:15:16 +00:00
Yanbo Liang	b9d3cedd64	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-28 01:32:54 +00:00
Will Feng	c07a799ed5	[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 ) Test command: `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247 Approved by: https://github.com/bdhirsh ghstack dependencies: #129502	2024-06-28 01:04:49 +00:00
xinan.lin	36b9d9cfcd	[Inductor UT] Generalize device-bias code in newly added UT `test_scatter_optimization.py` (#129622 ) [Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py and test_torchinductor_dynamic_shapes.py Fix issue #129624 , #129642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129622 Approved by: https://github.com/EikanWang, https://github.com/peterbell10	2024-06-28 01:04:21 +00:00
Shangdi Yu	deaab33f3f	[custom op] add error message (#129417 ) Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370) Suggest correct a List type annotation when input is in Tuple type. To avoid confusion, we only suggest a type if the type is supported. Example: Tuple[int, int] -> List[int] Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]] Tuple[int, ...] -> List[int] ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417 Approved by: https://github.com/zou3519	2024-06-28 01:03:14 +00:00
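The Tuple-to-List suggestions listed in the commit above can be sketched in plain Python. This is only an illustration under stated assumptions: `suggest_list_annotation` is a hypothetical helper, not the function used in PyTorch's `infer_schema`, and `int` stands in for `Tensor` so that the snippet runs without torch installed.

```python
from typing import List, Optional, Tuple, Union, get_args, get_origin


def suggest_list_annotation(annotation):
    """Return a List[...] suggestion for a Tuple[...] annotation, or None.

    Hypothetical helper for illustration; the real error-message logic in
    PyTorch may differ.
    """
    if get_origin(annotation) is not tuple:
        return None
    # Tuple[int, ...] carries an Ellipsis marker; drop it.
    args = [a for a in get_args(annotation) if a is not Ellipsis]
    if not args:
        return None
    base_types = set()
    saw_optional = False
    for arg in args:
        if get_origin(arg) is Union and type(None) in get_args(arg):
            # Optional[T] is Union[T, None]; keep T and remember the Optional.
            inner = [a for a in get_args(arg) if a is not type(None)]
            if len(inner) != 1:
                return None
            base_types.add(inner[0])
            saw_optional = True
        else:
            base_types.add(arg)
    # Only suggest a List when every element shares one underlying type.
    if len(base_types) != 1:
        return None
    (base,) = base_types
    return List[Optional[base]] if saw_optional else List[base]


print(suggest_list_annotation(Tuple[int, int]))
print(suggest_list_annotation(Tuple[int, int, Optional[int]]))
```

Returning `None` for mixed tuples such as `Tuple[int, str]` mirrors the commit's stated intent to suggest a type only when the suggestion is itself supported.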
PyTorch MergeBot	8ba0f6c7c2	Revert "[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 )" This reverts commit f2840bb22079a6952c61446a3d0dfc12f6452852. Reverted https://github.com/pytorch/pytorch/pull/129164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal dper3 tests ([comment](https://github.com/pytorch/pytorch/pull/129164#issuecomment-2195888838))	2024-06-28 00:49:39 +00:00
Xuehai Pan	9e1f3ecaa7	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-28 00:35:15 +00:00
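The equivalences behind steps 2 and 4 of the commit above can be checked with a short standard-library snippet. The file path is a made-up example, and `PurePosixPath` / `posixpath` are used here only so the result does not depend on the host operating system.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical file location, used only for illustration.
path = PurePosixPath("/repo/torch/utils/data/dataset.py")

# Step 2: nested dirname calls versus .parent.parent
nested_dirname = posixpath.dirname(posixpath.dirname(str(path)))
via_parent = str(path.parent.parent)

# Step 4: a chain of N .parent accesses versus .parents[N - 1]
chained = path.parent.parent.parent.parent
indexed = path.parents[3]

print(nested_dirname == via_parent)
print(chained == indexed)
```

`parents[0]` is the immediate parent, so a chain of four `.parent` accesses corresponds to index 3.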
Nikita Shulga	d4b6ff6fbe	Disable llm-td step (#129722 ) As it often fails during conda install step with `Unexpected HTTP response: 429` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129722 Approved by: https://github.com/kit1980, https://github.com/clee2000	2024-06-28 00:12:32 +00:00
Will Feng	0ffb17547e	[Simple FSDP] Add unit test for torch.compile + reparameterization + SAC (#129641 ) This can reproduce the error in https://github.com/pytorch/pytorch/issues/129684. Adding a unit test so that we hold the line for torch.compile + reparameterization + SAC to always be working, to pave the path for Tianyu's intern's project. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129641 Approved by: https://github.com/tianyu-l	2024-06-28 00:00:36 +00:00
Jeff Daily	169b4ca07e	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet	2024-06-27 23:53:13 +00:00
cyy	fb5888c719	Remove unused type traits in torch/csrc/utils (#128799 ) Follows #127852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128799 Approved by: https://github.com/ezyang	2024-06-27 23:51:18 +00:00
Peter Bell	3fc279633b	[ATen] Make argsort.stable CompositeImplicitAutograd (#129529 ) It literally just calls `at::sort` and returns the indices, so is composite compliant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129529 Approved by: https://github.com/lezcano	2024-06-27 23:49:16 +00:00
Xuehai Pan	7cf0b90e49	[BE] enable UFMT in `torch.utils.data` (#127705 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127705 Approved by: https://github.com/ezyang ghstack dependencies: #127706, #127704	2024-06-27 23:16:24 +00:00
Xuehai Pan	f911957573	[BE] sort imports in `torch.utils.data` (#127704 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127704 Approved by: https://github.com/ezyang ghstack dependencies: #127706	2024-06-27 23:16:24 +00:00
Xuehai Pan	d80939e5e9	[BE] enable UFMT for `torch/storage.py` (#127706 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127706 Approved by: https://github.com/ezyang	2024-06-27 23:16:24 +00:00
Yifu Wang	67416a2996	[c10d] Introduce a util for detecting DMA connectivity among devices (#129510 ) This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices. The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible. `_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced. Example: ```python3 >>> from torch._C._autograd import DeviceType >>> from torch._C._distributed_c10d import _detect_dma_connectivity >>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink") >>> for row in connectivity.matrix: ... print(row) ... [0, 18, 18, 18, 18, 18, 18, 18] [18, 0, 18, 18, 18, 18, 18, 18] [18, 18, 0, 18, 18, 18, 18, 18] [18, 18, 18, 0, 18, 18, 18, 18] [18, 18, 18, 18, 0, 18, 18, 18] [18, 18, 18, 18, 18, 0, 18, 18] [18, 18, 18, 18, 18, 18, 0, 18] [18, 18, 18, 18, 18, 18, 18, 0] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510 Approved by: https://github.com/weifengpy	2024-06-27 23:02:07 +00:00
yousufmo	305ba62906	Add support to `GradScaler` for respecting an already set `grad_scale` value (#123429 ) Fixes #123428 Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429 Approved by: https://github.com/ezyang	2024-06-27 22:40:54 +00:00
Will Constable	83a4a8b510	[C10D] clean up pointless 'or None' clause (#129522 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129522 Approved by: https://github.com/awgu	2024-06-27 22:40:11 +00:00
Chien-Lin Chen	5e7ac69a67	[Dynamic Shapes] fixed dynamic shape inference (#128807 ) Made a dynamic dimension that is indirectly bound to an integer constrained. After each ShapeEnv._refine_ranges, check whether the new ValueRange is a singleton; if it is, replace the symbol. Fixes #122307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128807 Approved by: https://github.com/ezyang	2024-06-27 22:33:32 +00:00
Catherine Lee	b8398b771c	Upload test stats when workflow regardless of conclusion (#129694 ) Always upload test stats, regardless of the workflow conclusion, so that we can get status for cancelled workflows (especially ones that were cancelled manually). There aren't that many workflow conclusions, so we might as well always run it and see what happens. Undoes [this old PR](https://togithub.com/pytorch/pytorch/pull/79180) Notable pitfalls from the above: it might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's ok to slowly deal with what comes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694 Approved by: https://github.com/huydhn	2024-06-27 21:12:21 +00:00
Shivam Raikundalia	1d0efedc85	[Profiler] Add TSC Clock Callback to CUPTI (#125036 ) Summary: Right now we use the default clock for CUPTI which is not monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885 This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler Test Plan: Obtained following trace using resnet test: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces TBD: Add benchmarks Differential Revision: D56584521 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036 Approved by: https://github.com/aaronenyeshi	2024-06-27 21:07:43 +00:00
Xu Han	602b5cb218	[inductor] switch HalideCodeCache to new cpp_builder. (#129441 ) The original PRs were damaged by conflicts and rebases: https://github.com/pytorch/pytorch/pull/128303, https://github.com/pytorch/pytorch/pull/129144 This PR just switches `HalideCodeCache` to the new cpp_builder and is not `fb_code` related, so it can merge without an `fb_code` test. Let's land this change first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129441 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-27 20:50:13 +00:00
Tugsbayasgalan Manlaibaatar	39427288f4	Taskify training IR + run_decomp flow failures (#129547 ) Differential Revision: [D59069088](https://our.internmc.facebook.com/intern/diff/D59069088) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129547 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #128077, #129092, #129249	2024-06-27 20:43:22 +00:00
Yidi Wu	23adf166e1	[cond] inlining into one of the branches when pred is a python constant (#128709 ) When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants. We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709 Approved by: https://github.com/zou3519	2024-06-27 20:28:50 +00:00
Sanket Jayant Purandare	71f5ecd1ee	Fixed Memory Leaks in tests (#129640 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129640 Approved by: https://github.com/clee2000 ghstack dependencies: #129400	2024-06-27 20:26:21 +00:00
Tugsbayasgalan Manlaibaatar	dabaebd339	Make run_decomp work (#129249 ) In this PR, we implement the first version of the training_ir.run_decomp functionality. Since we don't return the modified buffers as extra outputs in training IR, our previous strategy of reusing the graph signature won't work. In fact, this run_decomp is more similar to retracing, so I reuse some of the export steps here. After this PR: export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops. Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249 Approved by: https://github.com/zhxchen17 ghstack dependencies: #128077, #129092	2024-06-27 19:16:07 +00:00
Tugsbayasgalan Manlaibaatar	ec284d3a74	Prototype for export_for_training (#129092 ) This PR implements export_for_training where the IR is not-functional, pre-dispatch aten IR. The general strategy: 1. Call dynamo to get torch IR 2. Lift param/buffer 3. call make_fx TODO: 1. run_decomp doesn't work 2. not-strict is not supported Differential Revision: [D59069087](https://our.internmc.facebook.com/intern/diff/D59069087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129092 Approved by: https://github.com/zhxchen17 ghstack dependencies: #128077	2024-06-27 18:27:11 +00:00
Angela Yi	4dcc1ceff3	[dynamo] Fakify result of delegate (#128752 ) Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake. Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only` Differential Revision: D58617091 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752 Approved by: https://github.com/ydwu4	2024-06-27 17:59:52 +00:00
Zain Rizvi	389492e264	Fix runner determinator bug (#129612 ) Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names) This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo. Even though the code isn't directly invoked, having it there makes testing and linting easier/possible Also includes lint fixes Note: if you just review the .yml file you'll see all the relevant diffs ### Testing: #### Before ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo {"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."} ``` #### After ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."} ``` Aside: updated test case after rebase: ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo --github-repo python/pythonss --github-ref-type branch {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-06-27 17:51:09 +00:00
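The username mismatch described in the commit above can be illustrated with a small sketch. It assumes the opt-in issue lists one `@username` per line; `normalize_username`, `parse_opted_in_users`, and `is_opted_in` are hypothetical names for illustration and are not taken from `runner_determinator.py`.

```python
def normalize_username(name: str) -> str:
    # Strip surrounding whitespace and any leading "@" so both sides compare equal.
    return name.strip().lstrip("@")


def parse_opted_in_users(issue_body: str) -> set:
    # Assumption: the issue body lists one user per line, e.g. "@ZainRizvi".
    return {normalize_username(line) for line in issue_body.splitlines() if line.strip()}


def is_opted_in(actor: str, issue_body: str) -> bool:
    # Compare normalized names instead of prefixing "@" on only one side.
    return normalize_username(actor) in parse_opted_in_users(issue_body)


issue_body = "@ZainRizvi\n@someone-else\n"
print(is_opted_in("ZainRizvi", issue_body))
print(is_opted_in("not-listed", issue_body))
```

Normalizing both the issue entries and the incoming actor name avoids the failure mode where one side carries the "@" prefix and the other does not.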
Brian Hirsh	a4d7aa498b	[Traceable FSDP2] Add auto-functionalize support for mutable list[Tensor] (copy from Brian's PR #127347 ); enable E2E inductor unit test for transformer model (#129502 ) Copy of Brian's PR: https://github.com/pytorch/pytorch/pull/127347 with additional changes to support mutable `List[Tensor]` in Inductor. Also enable E2E inductor unit test for Traceable FSDP2 + transformer model. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_set_` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_aot_eager` - `pytest -rA test/dynamo/test_misc.py::MiscTests::test_auto_functionalize_tensorlist` - `pytest -rA test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_list_cuda` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129502 Approved by: https://github.com/zou3519	2024-06-27 17:50:57 +00:00
Aleksei Nikiforov	9174d14551	Don't install remaining caffe2 python files (#129067 ) It is assumed that they are no longer needed, and keeping their installation as is breaks the "python setup.py develop --user" workflow when a non-root user is used. This change is a follow-up to 3d617333e700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129067 Approved by: https://github.com/cyyever, https://github.com/r-barnes	2024-06-27 17:25:59 +00:00
Richard Barnes	e0bba37d66	[codemod] Add `[[noreturn]]` to 2 files inc caffe2/c10/util/TypeCast.cpp (#129575 ) Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization. Test Plan: Sandcastle Reviewed By: dmm-fb Differential Revision: D59003594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575 Approved by: https://github.com/Skylion007	2024-06-27 17:23:22 +00:00
Dmitry Rogozhkin	321bdcb372	Fix device propagation for checkpointing (#128671 ) Fixes: #128478 In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved in forward(). These tensors are CPU-only tensors and don't carry device information. As a result, the CUDA device was assumed as the default, which is not correct if the user runs on some other device, for example XPU. This patch saves full device information in forward() and uses it in backward() to get the device type. Previously, forward() saved only the device index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671 Approved by: https://github.com/guangyey, https://github.com/soulitzer	2024-06-27 17:14:13 +00:00
Jeff Daily	04206d1898	TunableOp hotfix, unit test follow-up (#129606 ) PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues. This is a follow-up set of unit tests that would exercise the problems seen previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606 Approved by: https://github.com/atalman	2024-06-27 17:01:04 +00:00
Peter Bell	5c6af2b583	[cpu] Fix div with rounding_mode="floor" when division overflows (#129536 ) Fixes #77742 `Sleef_fmod` returns NaN when the division overflows, where `libm` returns 0. In this narrow case we can drop the `fmod` from the calculation entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129536 Approved by: https://github.com/lezcano	2024-06-27 16:50:47 +00:00
PyTorch MergeBot	5ceba6a3cb	Revert "[Inductor] FlexAttention supports block sparse mask (#129216 )" This reverts commit 4082759925a712b7cb340164d3da3a1dab372d9f. Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/clee2000 due to broke functorch/aot_dispatch and test_proxy_tensor on windows https://github.com/pytorch/pytorch/actions/runs/9691331440/job/26743164471 `4082759925` missed on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2195087274))	2024-06-27 15:57:52 +00:00
Adnan Akhundov	82c8fc3a2b	[inductor] Add size_hint to conv dilation (#129631 ) Summary: [Here](`ea588d7fd3/torch/_inductor/kernel/conv.py (L252)`) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`. Test Plan: ``` $ python test/inductor/test_torchinductor.py -k test_convolution5 ... ---------------------------------------------------------------------- Ran 2 tests in 7.329s OK ``` Differential Revision: D59097019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631 Approved by: https://github.com/chenyang78	2024-06-27 15:27:57 +00:00
Chien-Chin Huang	483dbfcf2a	[BE] Correctly catch skip signals emitting from sys.exit (#129581 ) Some tests in test_c10d_nccl.py override `_join_process()` and `_check_return_codes()`, which causes the skip signals not to be caught properly. This PR fixes the issue. Differential Revision: [D59067457](https://our.internmc.facebook.com/intern/diff/D59067457/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129581 Approved by: https://github.com/fduwjj	2024-06-27 15:12:51 +00:00
Huy Do	2d9012ad25	Forward fix internal pyre failure from D58983461 (#129525 ) Summary: Somehow, using underscore alias of some builtin types breaks pyre Test Plan: All failed tests from D58983461 are passing: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking ``` Differential Revision: D59029768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525 Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet	2024-06-27 14:41:20 +00:00
Aaron Enye Shi	0680e6cd1c	[Profiler] Add sraikund16 to profiler paths in CODEOWNERS (#129591 ) Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too. Test Plan: CI Differential Revision: D59072152 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591 Approved by: https://github.com/sraikund16	2024-06-27 14:22:09 +00:00
Animesh Jain	ad607b91f4	[dynamo][onnx] Skip some dynamic=True test with inlining in built nn modules (#129610 ) These tests fail with dynamic=True when inlining in built nn modules. There are a few more recompilations. Since `dynamic=True` is not a recommended usage, I am skipping these tests for now. This is the tracking issue to come back later and fix/update these tests - https://github.com/pytorch/pytorch/issues/129456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129610 Approved by: https://github.com/yanboliang ghstack dependencies: #129583	2024-06-27 10:56:24 +00:00
Chen, Zejun	a028e5862d	[profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554 ) Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler. Fix https://github.com/pytorch/pytorch/issues/101861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554 Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi	2024-06-27 10:46:05 +00:00
y-sq	ff026f3d0a	Fix an issue in meta_scaled_mm (#129521 ) Summary: To fix the following failure cases: For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`. --------- From the inductor generated code (https://fburl.com/everpaste/epcagkrd) ``` V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]), ... ... V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn) ... ... V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), ) ... ... V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] assert_size_stride(permute_10, (6560, 656), (1, 6656)) ... ... V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16) ``` Inductor gives the mat2 (`permute_10`) a different stride (`6656`) instead of using its shape[0] (`(6560, 656)`). Therefore, the `stride[1] == shape[0]` condition fails. To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases. Test Plan: Run the failed case again. It works with the fix. ----- Sandcastle / GitHub CI will make sure the existing tests could still pass. Reviewed By: vkuzo Differential Revision: D58994704 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521 Approved by: https://github.com/drisspg	2024-06-27 07:03:34 +00:00
Yang Cao	9f29a2291c	Feat: Updated torch.nn.Modules.set_submodules() (#127714 ) modified: torch/nn/modules/module.py Implemented feature request by #127712. Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127714 Approved by: https://github.com/mikaylagawarecki	2024-06-27 06:38:54 +00:00
Animesh Jain	c9798d123b	[dynamo][compile-time] Manually trace torch.nn.Module.parameters (#129583 ) With this PR, we are not worse than no-inlining for Dynamo-only compilation time (there is a little bit of noise, so the outlier of 0.89 is probably ok here). For most of the models, we see positive numbers because of better caching in `UserDefinedObjectVariable`. ![image](https://github.com/pytorch/pytorch/assets/13822661/719d34fd-3e7f-4886-b7e0-1dbfc7141aa5) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129583 Approved by: https://github.com/jansel	2024-06-27 06:06:04 +00:00
Valentin Andrei	cf392d8a89	[pytorch][cuda] Generate kernels for 5x5 filters on depth wise convolution backward (#129609 ) In #125362 we improved the default implementation of depth wise convolution 2D forward pass by precomputing boundaries of accessed slices instead of doing expensive edge checks in the inner loops. We also generated kernels for 5x5 filters as this is a common problem size. In this PR we tried to apply the same strategy to the backward kernel, but we only saw good gains from generating code for 5x5 filters. We could also write a fallback implementation that precomputes access boundaries when filter size and stride are not known at compile time, which may bring some speedup, but that kernel would very rarely be called. This PR also hints the thread count at compile time and leaves only the unroll directive that seems to help performance. Before: ``` B C iH iW kH kW conv2d-backward (cuda) conv2d-fp16-backward (cuda) 0 8.0 64.0 1024.0 1008.0 5.0 5.0 89.002686 26.400480 1 8.0 64.0 1008.0 1008.0 5.0 5.0 88.885025 25.995296 2 4.0 48.0 720.0 539.0 6.0 1.0 9.488832 9.091136 3 4.0 120.0 379.0 283.0 6.0 1.0 4.194640 3.844432 4 4.0 32.0 713.0 532.0 6.0 1.0 8.027296 7.700064 5 4.0 3.0 712.0 542.0 31.0 31.0 15.618095 15.097760 6 4.0 120.0 379.0 288.0 1.0 6.0 3.788224 3.499648 7 1024.0 384.0 1.0 928.0 1.0 3.0 18.988289 14.152768 8 4.0 24.0 687.0 512.0 6.0 1.0 6.902704 6.685056 9 96.0 96.0 112.0 112.0 5.0 5.0 15.672400 4.953984 10 96.0 80.0 56.0 56.0 5.0 5.0 3.261152 1.250320 11 64.0 128.0 64.0 84.0 3.0 3.0 3.172192 1.515648 12 16.0 960.0 7.0 7.0 5.0 5.0 0.197024 0.072736 13 16.0 64.0 112.0 112.0 3.0 3.0 1.126240 0.650304 ``` After: ``` conv2d-performance: B C iH iW kH kW conv2d-backward (cuda) conv2d-fp16-backward (cuda) 0 8.0 64.0 1024.0 1008.0 5.0 5.0 76.278656 26.418720 1 8.0 64.0 1008.0 1008.0 5.0 5.0 73.211617 26.018433 2 4.0 48.0 720.0 539.0 6.0 1.0 8.901312 9.322912 3 4.0 120.0 379.0 283.0 6.0 1.0 3.815616 3.992208 4 4.0 32.0 713.0 532.0 6.0 1.0 7.753024 8.032433 5 4.0 3.0 712.0 542.0 31.0 31.0 15.244144 15.277296 6 4.0 120.0 379.0 288.0 1.0 6.0 3.503264 3.552976 7 1024.0 384.0 1.0 928.0 1.0 3.0 16.682976 14.167969 8 4.0 24.0 687.0 512.0 6.0 1.0 6.802576 7.019040 9 96.0 96.0 112.0 112.0 5.0 5.0 12.713024 4.958656 10 96.0 80.0 56.0 56.0 5.0 5.0 2.648352 1.254752 11 64.0 128.0 64.0 84.0 3.0 3.0 3.213568 1.517952 12 16.0 960.0 7.0 7.0 5.0 5.0 0.182208 0.076256 13 16.0 64.0 112.0 112.0 3.0 3.0 1.139952 0.652432 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129609 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-06-27 06:01:47 +00:00
Yanbo Liang	4082759925	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-27 05:44:27 +00:00
Jiang, Yanbing	5ee893a84a	Add inductor support for conv3d transpose (#129458 ) This PR adds Conv3d Transpose support in inductor. It basically reuses and expands the Conv2d Transpose implementation and unit tests to cover Conv3d Transpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129458 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-27 05:27:10 +00:00
Wei Wang	9b5b93c58f	[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 ) Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first. This PR also gives a current read on cu121 and cu124 parity. Essentially reverting #127150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423 Approved by: https://github.com/atalman, https://github.com/eqy	2024-06-27 05:22:18 +00:00
Yifu Wang	ea588d7fd3	[SymmetricMemory] use SCM_RIGHTS socket control message to share exported cumem handle (#129412 ) `SymmetricMemory` currently uses the `pidfd_getfd` syscall to share the exported cumem fd among devices. The syscall is introduced in linux kernel 5.6 which is relatively new and not available everywhere. This PR replaces the use of the `pidfd_getfd` syscall with socket + SCM_RIGHTS control message. The approach is demonstrated in [memMapIPCDrv](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/memMapIPCDrv) in [cuda-samples](https://github.com/NVIDIA/cuda-samples/tree/master/Samples) (relevant code: https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_multiprocess.cpp). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129412 Approved by: https://github.com/Chillee	2024-06-27 04:38:13 +00:00
Li-Huai (Allan) Lin	84ad5452f6	[MPS] Fused SGD optimizer (#129350 ) ``` [-------------------------------------- Fused SGD --------------------------------------] \| Fused: True \| Fused: False 1 threads: ------------------------------------------------------------------------------ numel: 1024, num_tensors: 100, momentum: True \| 2 \| 15 numel: 1024, num_tensors: 100, momentum: False \| 2 \| 5 numel: 65536, num_tensors: 100, momentum: True \| 3 \| 16 numel: 65536, num_tensors: 100, momentum: False \| 2 \| 5 numel: 1048576, num_tensors: 100, momentum: True \| 11 \| 16 numel: 1048576, num_tensors: 100, momentum: False \| 8 \| 6 numel: 1024, num_tensors: 500, momentum: True \| 29 \| 70 numel: 1024, num_tensors: 500, momentum: False \| 20 \| 24 numel: 65536, num_tensors: 500, momentum: True \| 33 \| 76 numel: 65536, num_tensors: 500, momentum: False \| 22 \| 26 numel: 1048576, num_tensors: 500, momentum: True \| 70 \| 80 numel: 1048576, num_tensors: 500, momentum: False \| 43 \| 40 numel: 1024, num_tensors: 1000, momentum: True \| 108 \| 139 numel: 1024, num_tensors: 1000, momentum: False \| 72 \| 48 numel: 65536, num_tensors: 1000, momentum: True \| 116 \| 150 numel: 65536, num_tensors: 1000, momentum: False \| 77 \| 52 numel: 1048576, num_tensors: 1000, momentum: True \| 190 \| 170 numel: 1048576, num_tensors: 1000, momentum: False \| 120 \| 50 ``` ```python def profile_fused_sgd(): from torch.optim.sgd import sgd import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, momentum_buffer_list, fused): fn( params, grads, momentum_buffer_list, momentum=True if len(momentum_buffer_list) > 0 else False, dampening=0.0, nesterov=False, foreach=False, fused=fused, lr=1e-3, weight_decay=.0, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device = "mps" results = [] for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]): sublabel = f"numel: {numel}, num_tensors: 
{num_tensors}, momentum: {momentum}" print(sublabel) params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)] momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else [] fn = sgd for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, momentum_buffer_list, fused)', label='Fused SGD', sub_label=sublabel, globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350 Approved by: https://github.com/janeyx99 ghstack dependencies: #129006, #129008, #129007, #129105	2024-06-27 04:37:14 +00:00
Eddie Yan	e19042481b	[cuDNN][cuDNN Frontend] Bump cuDNN FE submodule to 1.5.2 (#129592 ) Some relevant fixes include stride-0 support 👀 CC @drisspg @Skylion007 @vedaanta Pull Request resolved: https://github.com/pytorch/pytorch/pull/129592 Approved by: https://github.com/Skylion007	2024-06-27 04:01:23 +00:00
Antoni Viros	9450e198aa	Conversions between strided and jagged layouts for Nested Tensors (#115749 ) This PR does 3 things: 1. Adds a copy-free strided->jagged layout conversion for NT 2. Adds a copy-free jagged->strided layout conversion for NT 3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749 Approved by: https://github.com/jbschlosser	2024-06-27 03:41:28 +00:00
Kurman Karabukaev	c9ceae3fac	Use JK for mast rdzv handler tcpstore handling and additional logging (#129603 ) Summary: Use JK to control the release instead of using an env variable to toggle the feature. Note: sharing the store reduces shutdown races, as the TCPStore lifecycle is managed outside of trainer rank execution time. Test Plan: CI Differential Revision: D59071544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603 Approved by: https://github.com/d4l3k	2024-06-27 03:34:52 +00:00
Yidi Wu	b9697eacd3	[torchbind] support tensor ops inside of __obj_flatten__ (#129605 ) As titled. Previously, __obj_flatten__ could run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, the tensors inside of a script object are real tensors, which causes the fake tensor mode to error out saying that we need to first fakify all the tensors (because allow_non_fake_inputs is set to True). In this PR, we disable all the dispatch modes when running to_fake_obj. Note that the output of `__obj_flatten__` will be fakified and filled inside of the corresponding FakeScriptObject. So during tracing, we'll be using a FakeScriptObject that has fake tensor contents. Test Plan: Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605 Approved by: https://github.com/angelayi	2024-06-27 03:07:31 +00:00
Nikita Shulga	cdbd6542d0	Fix inductor benchmarks (#129620 ) By installing torchao explicitly, as torchao-0.3.0, which was recently released to pypi, introduced a hard dependency on torch-2.3.1, which results in the following cryptic error: `RuntimeError: operator torchvision::nms does not exist` TODOs: - Figure out what installs torchao from pypi rather than building it from source - Add proper CI pin for torchao Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-06-27 02:59:08 +00:00
garfield1997	27a14405d3	enable device index check for all device types (#126767 ) enable device index check for all device types for grad setter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126767 Approved by: https://github.com/albanD	2024-06-27 01:09:53 +00:00
Boyuan Feng	0b7e8df7d8	[CUDAGraph Trees] Enable input mutation support in OSS (#129184 ) Summary: Enable input mutation support for cudagraph trees in OSS. Test Plan: CI Differential Revision: D58847850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129184 Approved by: https://github.com/eellison	2024-06-27 00:49:45 +00:00
yuqingj	7bb558fd6e	add _flash_attention_forward and _efficient_attention_forward to compute intensive ops in partitioner (#129533 ) Avoid recompute of SDPA during the backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129533 Approved by: https://github.com/drisspg	2024-06-27 00:49:00 +00:00
Jiashen Cao	b6689e0fb8	[ts migration] add logging as part of torch logging system (#129405 ) #### Description Add more verbose logging of conversion process. Output which IR is being converted, which function is used to do conversion, and whether it succeeds. #### Example `TORCH_LOGS="+export,ts2ep_conversion" pytest test/export/test_converter.py -s -k test_prim_tolist` ``` test/export/test_converter.py I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] TorchScript graph I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] graph(%x.1 : Long(3, strides=[1], requires_grad=0, device=cpu)): I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject() I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %2 : int = prim::Constant[value=1](), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %3 : int = prim::Constant[value=0](), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] return (%4) I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_CreateObject] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 
torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_tolist] succeeds I0624 13:19:26.427000 140608224474112 torch/_export/converter.py:760] TS2EPConverter IR-to-IR conversion succeeds ``` #### Test Plan `pytest test/export/test_converter` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129405 Approved by: https://github.com/angelayi	2024-06-27 00:20:20 +00:00
Tugsbayasgalan Manlaibaatar	90f6043368	Don't decompose functional composite ops in export inference IR (#128077 ) Recently we decided to split export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was we wanted to keep the composite ops that user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in export inference path. Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077 Approved by: https://github.com/bdhirsh	2024-06-26 23:07:55 +00:00
Chirag Pandya	64f1111d38	Expose nlohmann json to torch (#129570 ) Summary: Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available for use from `torch.distributed`. The next PR makes actual use of this header. imported-using-ghimport Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D59035246 Pulled By: c-p-i-o Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570 Approved by: https://github.com/d4l3k, https://github.com/malfet	2024-06-26 21:59:26 +00:00
HOOLoLo	5ad2ad5921	Update start_, end_ and retired only for the right entry when retiring a work (#128948 ) Fixes #128805 If the buffer size of NCCLTraceBuffer is 10 and the pg has recorded 11 works, the entry for work 0 will have been overwritten by work 10, so when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128948 Approved by: https://github.com/wconstab, https://github.com/c-p-i-o	2024-06-26 21:58:00 +00:00
Elias Ellison	b8e5678ad2	Delete lazy ddp optimizer (#120727 ) This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides. Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727 Approved by: https://github.com/jansel, https://github.com/yf225	2024-06-26 21:53:54 +00:00
Shivam Raikundalia	13316a8d46	[Profiler] Add Rank to NCCL Debug Info (#129528 ) Summary: We need to add the Rank information to the NCCL debug data so that kineto can infer all the necessary process group info such that on-demand can create distributedInfo metadata. Kineto portion will be added in a follow up diff Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances Differential Revision: D59028819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528 Approved by: https://github.com/aaronenyeshi	2024-06-26 21:24:05 +00:00
Catherine Lee	7b1988f922	[ez] Give trymerge id token write permissions after #129503 (#129594 ) Forgot to do this in #129503 Also fix minor typo Pull Request resolved: https://github.com/pytorch/pytorch/pull/129594 Approved by: https://github.com/huydhn	2024-06-26 20:33:14 +00:00
Catherine Lee	795db80975	Upload release tag source code to s3 (#128842 ) Upload tarball containing source code to s3 for release tags Can be found here https://us-east-1.console.aws.amazon.com/s3/buckets/pytorch?region=us-east-1&bucketType=general&prefix=source_code/test/&showversions=false D58695048 for adding permissions to allow uploading to the s3 folder Pull Request resolved: https://github.com/pytorch/pytorch/pull/128842 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-26 20:32:40 +00:00
Andrea Frittoli	28480dd7dc	[CI] Fix runner determinator for ciflow (#129500 ) In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator. In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it. Both the inline and standalone python scripts have been updated for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500 Approved by: https://github.com/malfet, https://github.com/zxiiro	2024-06-26 20:27:06 +00:00
James Perng	d3d6764082	[pytorch][logging] add fb internal ODS implementation of wait counter (#128605 ) * created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` * uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it. * this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons. * `facebook::data_preproc::WaitCounterUs` uses 3 singletons: 1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton 2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton 3. publisherSingleton — normal singleton since it manages resources (threads) * `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0. * linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl` * wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation. as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library? Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605 Approved by: https://github.com/c-p-i-o ghstack dependencies: #128466	2024-06-26 19:11:21 +00:00
Catherine Lee	90f82426b9	RS migration - trymerge to upload merge records to s3 (#129503 ) Uploads merge records to the ossci-raw-job-status (public) bucket instead of directly to rockset. The runner used by trymerge is a GH runner, so it doesn't have access to s3. Instead, I save the record as a json and upload the json to s3 in a different step that runs after the aws credentials are configured. The role is defined [here](https://togithub.com/pytorch-labs/pytorch-gha-infra/pull/421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129503 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet	2024-06-26 19:06:52 +00:00
PyTorch MergeBot	895316119d	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit 0314c4c101c44d5d89b4fad9d37a012dc6f31128. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))	2024-06-26 19:03:57 +00:00
PyTorch MergeBot	e9aefad641	Revert "[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 )" This reverts commit 551e4127185195ae8a5331dc8bbfdffd5d4dd1b8. Reverted https://github.com/pytorch/pytorch/pull/128423 on behalf of https://github.com/nWEIdia due to Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/128423#issuecomment-2192423840))	2024-06-26 18:54:41 +00:00
Shangdi Yu	cca85c96cd	[export] minor typo fix (#129543 ) Fixes a typo in torch.export doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543 Approved by: https://github.com/angelayi	2024-06-26 18:35:31 +00:00
Sam Larsen	87d14ad419	[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 ) Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR: * Fix the with_fresh_cache_if_config() decorator * Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257 Approved by: https://github.com/oulgen	2024-06-26 18:34:48 +00:00
Huy Do	61bf1452a3	Add one more shard for CPU jobs (#129299 ) The first shard is very close to 3.5h and sometimes times out now `1c75ddff35 (26540310592)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129299 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi	2024-06-26 18:32:10 +00:00
Andres Lugo	b9a1c2c991	[ROCm] Enable F8 Inductor Unit tests (#128353 ) First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-26 18:30:43 +00:00
Saurabh Mishra	8e4f7f742f	[DCP] Capture reader, writer and planner components in the DCP API logger (#129548 ) Summary: Capture reader, writer and planner components in the DCP API logger Test Plan: logs can be found in scuba pytorch_dcp_logging https://fburl.com/scuba/pytorch_dcp_logging/ruqez1ki Differential Revision: D59040866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129548 Approved by: https://github.com/wz337, https://github.com/fegin	2024-06-26 18:11:16 +00:00
Isuru Fernando	7373492c9b	Use _unsafe_masked_index in masked_scatter decomposition (#123667 ) and remove masked_scatter_with_index inductor prims Pull Request resolved: https://github.com/pytorch/pytorch/pull/123667 Approved by: https://github.com/peterbell10	2024-06-26 17:18:24 +00:00
Jack Taylor	1b1fd0f4fe	[ROCm] Use additional shard for inductor workflow to resolve timeouts (#129480 ) This will help with timeouts on the inductor workflow. The cuda equivalent job also moved to 2 shards since `e0aa992d73` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129480 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet	2024-06-26 17:18:20 +00:00
Nikita Shulga	bc68907caa	[EZ][BE] Replace `assertTrue` with more appropriate checks (#129569 ) Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e. - `assertTrue(x == y)` -> `assertEqual(x, y)` - `assertTrue(not x)` -> `assertFalse(x)` - `assertTrue(x > y)` -> `assertGreater(x, y)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007	2024-06-26 16:29:59 +00:00
Piotr Kluska	9cf8e5dd32	chore(quantization): Enable PT2E symmetric dynamic quantization (#124615 ) In the `_find_choose_qparams_node` function, check whether the current node is affine or symmetric. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124615 Approved by: https://github.com/kimishpatel, https://github.com/malfet	2024-06-26 16:14:58 +00:00
PyTorch MergeBot	f7708ffebb	Revert "[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378 )" This reverts commit 52009068bc39ebc846bd37b44f5f9c5f62257778. Reverted https://github.com/pytorch/pytorch/pull/129378 on behalf of https://github.com/clee2000 due to broke inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_sympy_expr_arg_abi_compatible_cuda and a few other tests https://github.com/pytorch/pytorch/actions/runs/9680978494/job/26713689249 `52009068bc`. The tests were added in https://github.com/pytorch/pytorch/pull/129301 which is before your base ([comment](https://github.com/pytorch/pytorch/pull/129378#issuecomment-2192032697))	2024-06-26 15:46:17 +00:00
Xu Zhao	474d743dba	[torchao][benchmark] Skip all accuracy tests by returning `pass_due_to_skip` (#129545 ) Summary: As the title says. Test Plan: ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy ``` Differential Revision: D59040593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545 Approved by: https://github.com/HDCharles	2024-06-26 14:21:53 +00:00
Mikayla Gawarecki	25cec43678	Remove dependency on private _compat_pickle in CPython (#129509 ) Use the IMPORT_MAPPING and NAME_MAPPING from here https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129509 Approved by: https://github.com/malfet ghstack dependencies: #129239, #129396	2024-06-26 14:20:27 +00:00
Mikayla Gawarecki	3b531eace7	Add example for torch.serialization.add_safe_globals (#129396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396 Approved by: https://github.com/albanD, https://github.com/malfet ghstack dependencies: #129239	2024-06-26 14:20:27 +00:00
Mikayla Gawarecki	303ad8d7f5	Add warning for weights_only (#129239 ) Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239 Approved by: https://github.com/albanD, https://github.com/malfet	2024-06-26 14:20:19 +00:00
Bin Bao	52009068bc	[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378 ) Summary: Unify the UserDefinedTritonKernel argument codegen logic between python wrapper and cpp wrapper. This prepares for later PRs that will simplify AOTI codegen. Differential Revision: [D59002226](https://our.internmc.facebook.com/intern/diff/D59002226) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129378 Approved by: https://github.com/oulgen, https://github.com/chenyang78 ghstack dependencies: #129267	2024-06-26 13:53:27 +00:00
Bin Bao	42d490d41d	[AOTI][refactor] Move generate_user_defined_triton_kernel (#129267 ) Summary: Move generate_user_defined_triton_kernel from cpp_wrapper_cpu to cpp_wrapper_cuda as it's for CUDA only Differential Revision: [D58953005](https://our.internmc.facebook.com/intern/diff/D58953005) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129267 Approved by: https://github.com/chenyang78	2024-06-26 13:50:39 +00:00
Jean Schmidt	53fafdd0c3	[BE] Runner determinator: more resilient user matching (#129462 ) Small improvements on runner determinator script: * Don't do splitting of the issue comment, unless necessary; * Match username against a set over a list; * Match both triggering_actor and issue owner over only actor (to avoid edge cases, where we get `pytorch-bot[bot]`) * Add stripping, to remove potential breaking and not visible whitespaces; * Don't use linux.4xlarge as a runner: it should not depend on meta runners, for reliability; Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462 Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi	2024-06-26 13:47:52 +00:00
PyTorch MergeBot	211f38e742	Revert "[ALI] [Reland] Use LF runners for Lint (#129071 )" This reverts commit 1b92bdd0ea326cd30bc3945602701ffe28c85fd5. Reverted https://github.com/pytorch/pytorch/pull/129071 on behalf of https://github.com/malfet due to All LF jobs are backlogged, so revert this one ([comment](https://github.com/pytorch/pytorch/pull/129071#issuecomment-2191676677))	2024-06-26 13:19:00 +00:00
Yifu Wang	92be3403ea	Fix an issue in oneShotAllReduce where different ranks perform reduction in different order (#129501 ) In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the subsequent reduction is also performed in this order, which differs across ranks. This can result in slight numerical differences across ranks, which can lead to a hang in data-dependent applications like speculative decoding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501 Approved by: https://github.com/Chillee	2024-06-26 08:43:10 +00:00
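The numerical effect described in the commit above comes from floating-point addition not being associative. The sketch below is a plain-Python illustration, not the CUDA kernel: `round_robin_order` and `reduce_in_order` are hypothetical helpers, and the three input values are chosen only because they make the effect visible.

```python
def round_robin_order(rank, world_size):
    # Hypothetical read order: each rank starts at its next peer and wraps around.
    return [(rank + 1 + step) % world_size for step in range(world_size)]


def reduce_in_order(values, order):
    # Accumulate the per-rank contributions in the given order.
    acc = 0.0
    for idx in order:
        acc += values[idx]
    return acc


values = [0.1, 0.2, 0.3]  # one contribution per rank
world_size = len(values)

# Reducing in each rank's own read order: results may differ between ranks.
per_rank = [
    reduce_in_order(values, round_robin_order(rank, world_size))
    for rank in range(world_size)
]

# Reducing in one fixed order on every rank: results are identical.
fixed = [reduce_in_order(values, range(world_size)) for _ in range(world_size)]

print(per_rank)
print(fixed)
```

Reading in round-robin order can still be kept for load balancing; only the order of the additions needs to be the same on every rank.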
Animesh Jain	f2840bb220	[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 ) The TorchDynamo guard mechanism guards on the key order of a dictionary if the user iterates over it. For a standard dict, we can write a fast C++ implementation using PyDict_Next. But with OrderedDict, we have to rely on the `keys` Python API to get the key ordering. This makes guard evaluation slow. With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason not to use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing a ~20% reduction in guard overhead for an HF model. Functionality impact - The only difference between dict and OrderedDict is the `move_to_end` method of OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up with hooks, but in a separate PR). Perf impact - I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). Typing impact - I don't anticipate any. For all the user-visible methods of nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes are required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #129163	2024-06-26 07:59:42 +00:00
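The two properties the commit above relies on can be checked with the standard library alone. This is a small sketch with made-up keys, not code from the PR: plain `dict` preserves insertion order (guaranteed since Python 3.7), and `move_to_end` exists only on `OrderedDict`.

```python
from collections import OrderedDict

# A plain dict keeps keys in insertion order.
plain = {}
plain["weight"] = 1
plain["bias"] = 2
plain["running_mean"] = 3
keys_plain = list(plain)

# OrderedDict adds reordering helpers such as move_to_end.
ordered = OrderedDict(plain)
ordered.move_to_end("weight")
keys_ordered = list(ordered)

# A plain dict has no such method.
plain_has_move_to_end = hasattr(plain, "move_to_end")

print(keys_plain)
print(keys_ordered)
print(plain_has_move_to_end)
```

Code that only inserts, deletes, and iterates (as the commit says is the case for `_parameters`, `_modules` and `_buffers`) does not depend on the `OrderedDict`-only method.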
Will Feng	ead97ee486	[Compile+SAC] Only warn for in-place ops once (#129397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129397 Approved by: https://github.com/tianyu-l	2024-06-26 07:25:02 +00:00
cdzhan	c422a9549d	[easy][DCP] Fix test_fsdp_ep.py for _MeshEnv.create_child_mesh API ch… (#129445 ) …ange Update test/distributed/checkpoint/e2e/test_fsdp_ep.py for #127465 change. Failure info: ```bash [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Caught exception: [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Traceback (most recent call last): [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 657, in run_test [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] getattr(self, test_name)() [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 539, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] fn() [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_utils.py", line 2744, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] method(args, kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 369, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] func(self, args, *kwargs) # type: ignore[misc] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 
180, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] return func(args, *kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/checkpoint_utils.py", line 44, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] func(self, args, **kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 76, in test_e2e [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, 0, "dp") [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] TypeError: _MeshEnv.create_child_mesh() takes 3 positional arguments but 4 were given [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] To execute this test, run the following from the base repo dir: [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] python test/distributed/checkpoint/e2e/test_fsdp_ep.py -k TestFSDPWithEP.test_e2e [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129445 Approved by: https://github.com/fegin, https://github.com/wz337	2024-06-26 06:43:30 +00:00
wz337	8b8e2fcdda	[DCP] Fix Optimizer Learning Rate not being loaded correctly (#129398 ) Fixes #129079 Currently, tensor objects are loaded correctly in place, but non-tensor objects such as the learning rate are not loaded correctly after `f518cf811d`, which is a regression introduced in 2.3. This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression in non-tensor loading. Test: ``` # test to make sure lr is loading correctly python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict # test to make sure load on meta device model still works python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398 Approved by: https://github.com/fegin	2024-06-26 06:41:47 +00:00
Sheng Fu	000f2d637b	Refactoring the code to make it lint clean (#129424 ) Summary: Refactoring the code to make it lint clean Test Plan: buck2 build mode/dev-tsan caffe2/test:test_profiler_cuda Differential Revision: D58971175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129424 Approved by: https://github.com/aaronenyeshi	2024-06-26 06:12:01 +00:00
Li-Huai (Allan) Lin	610894e978	[MPS][BE] Generalize Fused optimizers (#129105 ) This PR generalizes the multi_tensor_apply function for other fused optimizers Pull Request resolved: https://github.com/pytorch/pytorch/pull/129105 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008, #129007	2024-06-26 06:00:41 +00:00
Pian Pawakapan	d02bba519c	[export] match fake mode for _decompose_exported_program() (#129421 ) Summary: _decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen. #thanks to tugsbayasgalan for the fix! Test Plan: test_experimental Differential Revision: D58977694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421 Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17	2024-06-26 05:52:31 +00:00
Chien-Chin Huang	7420bad74c	[BE] Do not assert if the barrier is not created (#129497 ) The folder will be created as long as TEMP_DIR is set and the program has write permission. This ensures that some test environments can run the spawn tests. Differential Revision: [D59020736](https://our.internmc.facebook.com/intern/diff/D59020736/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129497 Approved by: https://github.com/fduwjj, https://github.com/wz337	2024-06-26 05:51:36 +00:00
Anshul Sinha	c04cec609d	[dtensor][debug] fixing CommDebugMode module collective tracing (#128887 ) Summary The logic for CommDebugMode module collective tracing is incorrect as it only worked for leaf module nodes on the model's module tree. If we had a sub-module that had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set leading to double-counting collectives. This problem was addressed by checking to make sure the current sub-module was not already in the parent set. The output of the below test cases should remain the same. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887 Approved by: https://github.com/XilunWu ghstack dependencies: #128729	2024-06-26 05:25:57 +00:00
Anshul Sinha	bd3a11776f	[dtensor][test] test case suite for comm_mode features (#128729 ) Summary Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features. #128536 Test Plan pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729 Approved by: https://github.com/XilunWu	2024-06-26 05:25:57 +00:00
Tugsbayasgalan Manlaibaatar	6181e65cd8	Nested tensor subclass support (#127431 ) When we have nested tensor subclasses, we need to recursively flatten/unflatten in fake tensor creation and AOTAutograd. Most of the PR is a mechanical change that makes today's single-level flatten logic recursive. Differential Revision: [D58533224](https://our.internmc.facebook.com/intern/diff/D58533224) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127431 Approved by: https://github.com/bdhirsh	2024-06-26 04:45:22 +00:00
Huy Do	cda4d4887d	Skip signals from older runs of the same workflows (#129291 ) I discovered this bug in trymerge when debugging https://github.com/pytorch/pytorch/pull/129013 in which Dr.CI reported no relevant failures while mergebot complained about some unrelated ROCm failures https://github.com/pytorch/pytorch/pull/129013#issuecomment-2183009217. It turns out that mergebot took into account stale signals from older runs of the same workflow here. For example, * https://github.com/pytorch/pytorch/actions/runs/9604985361 was the first run where it had a ROCm failure * While https://github.com/pytorch/pytorch/actions/runs/9608926565 was the second attempt and it was all green Notice that both runs came from the same push to commit [be69191](`be69191f2d`) with [ciflow/rocm/129013](https://github.com/pytorch/pytorch/tree/ciflow/rocm/129013). So, we just need to check the signals from the newer run. Note that Dr.CI handles this part correctly using the logic in https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/drci/drci.ts#L1079-L1088. So, the fix in this PR is to bring the same logic to trymerge. ### Testing `pytest -v test_trymerge.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129291 Approved by: https://github.com/ZainRizvi	2024-06-26 03:49:09 +00:00
James Perng	c718e2f43b	[pytorch][logging] add empty wait counter implementation (#128466 ) Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128466 Approved by: https://github.com/c-p-i-o	2024-06-26 03:47:17 +00:00
xinan.lin	54f27b886e	[Inductor UT] Reuse test_distributed_patterns.py for Intel GPU (#129437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129437 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-06-26 02:58:45 +00:00
CaoE	555f71a15b	Fix test_auto_simd in machine with AMX support (#129444 ) Fixes #129438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129444 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-06-26 02:50:55 +00:00
cdzhan	a89a1ed072	[easy][DCP] make BroadcastingTorchSaveReader device generic (#129231 ) Test: test/distributed/checkpoint/test_format_utils.py passes on GPU and other devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129231 Approved by: https://github.com/fegin	2024-06-26 02:37:30 +00:00
Peter Bell	90d5a6f001	[inductor] Add lowering and codegen for aten.sort (#128458 ) Closes #125633 Benchmarks:

| Shape | dim | stable | compiled | eager | speedup |
|-------------|-----|--------|----------|---------|---------|
| (256, 4096) | 0 | False | 0.73 ms | 1.26 ms | 1.7 |
| (256, 4096) | 0 | True | 0.75 ms | 1.27 ms | 1.7 |
| (4096, 256) | 1 | False | 0.20 ms | 0.73 ms | 3.7 |
| (4096, 256) | 1 | True | 0.21 ms | 0.73 ms | 3.5 |
| (255, 4096) | 0 | False | 1.05 ms | 1.48 ms | 1.4 |
| (255, 4096) | 0 | True | 1.03 ms | 1.47 ms | 1.4 |
| (4096, 255) | 1 | False | 0.52 ms | 0.98 ms | 1.9 |
| (4096, 255) | 1 | True | 0.54 ms | 1.00 ms | 1.9 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458 Approved by: https://github.com/lezcano, https://github.com/eellison	2024-06-26 01:36:39 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
Yanbo Liang	9554a9af87	[GPT-benchmark] Distinguish LLM models and micro-benchmarks (#129498 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129498 Approved by: https://github.com/huydhn	2024-06-26 00:25:05 +00:00
Catherine Lee	0d0d42c4a7	test_qat_mobilenet_v2 succeeding on dynamo (#129532 ) https://github.com/pytorch/pytorch/actions/runs/9669572961/job/26677024995 The test is usually marked as slow, so it doesn't get run on dynamo, since dynamo doesn't have a slow equivalent. However, it is succeeding, so we might as well do what the logs tell us to do and remove the expected failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129532 Approved by: https://github.com/malfet, https://github.com/kit1980	2024-06-25 23:55:12 +00:00
Peter Bell	112ef79f29	[inductor] Remove comm-specific node attributes from scheduler (#129084 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129084 Approved by: https://github.com/lezcano	2024-06-25 23:52:19 +00:00
wz337	d1f9e822dd	[DTensor][Test] Update implicit replication unit tests for tensor arg being the first in args list (#127803 ) Change the operands order so we can have test coverage for when the first arg is a tensor arg instead of DTensor arg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127803 Approved by: https://github.com/XilunWu	2024-06-25 23:51:58 +00:00
Will Feng	575bc1e3af	[Reopen #114036 ] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295 Approved by: https://github.com/Chillee	2024-06-25 23:47:08 +00:00
joydddd	f389541ce0	Add Strided Input test for flex attention (#128915 ) Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-06-25 23:26:34 +00:00
Catherine Lee	87ebd627a7	RS migration - upload sccache stats to s3 instead of rockset (#129490 ) Upload sccache stats to s3 instead of rockset. I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now. We should consider deleting this entirely if we don't plan on using it. I will work on copying existing data over from rockset to s3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-06-25 23:23:16 +00:00
PyTorch MergeBot	52341c28e8	Revert "[FSDP2] Ran post-acc-grad hooks manually (#129450 )" This reverts commit 7ebffef4d02a3cc68dbbcf44b92d63c7fe0ebb67. Reverted https://github.com/pytorch/pytorch/pull/129450 on behalf of https://github.com/clee2000 due to broke distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager `7ebffef4d0` https://github.com/pytorch/pytorch/actions/runs/9667812641/job/26671489454. Test got added in https://github.com/pytorch/pytorch/pull/129157 which is before your mergebase ([comment](https://github.com/pytorch/pytorch/pull/129450#issuecomment-2190174363))	2024-06-25 23:13:57 +00:00
Yifu Wang	bbd47f7b2f	Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762 ) This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762 Approved by: https://github.com/wanchaol	2024-06-25 22:32:21 +00:00
Chien-Chin Huang	1c5df9107d	[BE] Fix several incorrect skip tests (#129488 ) These tests may not be skipped properly if the NCCL library exists but CUDA is not available. Differential Revision: [D59013855](https://our.internmc.facebook.com/intern/diff/D59013855/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129488 Approved by: https://github.com/wz337, https://github.com/fduwjj	2024-06-25 22:10:31 +00:00
Shunting Zhang	fd414d6189	[inductor] don't materialize the large sparse matrix in CE bwd (#129043 ) Inductor currently materializes a large sparse matrix in the backward pass for CrossEntropyLoss and loads it to compute gradients of the Softmax input. If we could fuse the sparse matrix computation into its consumers, we would get both perf and memory usage wins. The FX graph snippet that constructs this sparse matrix looks like: ``` full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False) scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0); full_default_3 = where_2 = None ``` Leveraging the following observations: - the scatter is applied to an all-zero tensor (or, more generally, a const tensor) - the index tensor for the scatter has a single element on the scatter dimension (in this case it's the label tensor) allows us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels: ``` def inner_fn(idx): selector_idx = list(idx) selector_idx[dim] = 0 # can do this since the index tensor has a single element on the scatter dimension selector = selector_loader(selector_idx) return ops.where( selector == ops.index_expr(idx[dim], torch.int64), ops.constant(val, dtype), ops.constant(background_val, dtype), ) ``` ## Test result on microbenchmark For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms and memory footprint from 10.524GB to 7.227GB on A100. (On H100, we improve latency from 27.54ms to 23.51ms and memory footprint from 10.574GB to 7.354GB.) The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K], which is about 3GB in size. 
On A100, avoiding loading and storing such a tensor saves roughly 3GB x 2 / 1.5TB/s = 4ms. ## Test result on llm.c We also tested this on llm.c, and the saving is much larger, especially for memory footprint. This is because autotuning allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details). For the llm.c PyTorch implementation on A100, we improve from 171K tokens/s and 33.6G peak memory usage to 180K tokens/s and 18.6G peak memory usage. (A 45% saving of peak memory) ## Test on PyTorch 2.0 Dashboard The optimization is quite general, especially for transformers. We tested this on the PyTorch 2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71). TLDR, for the Huggingface benchmark suite, we get a 6% geomean perf improvement and a 10% geomean memory footprint improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043 Approved by: https://github.com/jansel, https://github.com/Chillee	2024-06-25 21:25:50 +00:00
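The idea in the commit above can be illustrated without Inductor. The following is a pure-Python sketch under stated assumptions, not the actual lowering: `scatter_on_const` and `pointwise_select` are hypothetical names, and the 1.5 TB/s bandwidth is simply the figure quoted in the commit message. It shows that scattering one value per row into a constant matrix gives the same result as a per-element comparison against the label, and it redoes the back-of-envelope estimate.

```python
def scatter_on_const(rows, cols, labels, val, background):
    # Reference behaviour: materialize the constant matrix, then scatter.
    out = [[background] * cols for _ in range(rows)]
    for i, label in enumerate(labels):
        out[i][label] = val
    return out


def pointwise_select(i, j, labels, val, background):
    # Fusable form: each element is computed independently, nothing is stored.
    return val if labels[i] == j else background


def traffic_ms(rows, cols, bytes_per_elem, bandwidth_bytes_per_s):
    # One store plus one load of the full matrix, in milliseconds.
    return 2 * rows * cols * bytes_per_elem / bandwidth_bytes_per_s * 1e3


labels = [2, 0, 3]
dense = scatter_on_const(3, 4, labels, -1.0, 0.0)
fused = [
    [pointwise_select(i, j, labels, -1.0, 0.0) for j in range(4)]
    for i in range(3)
]

# bf16[32768, 50257] at an assumed 1.5 TB/s
estimate = traffic_ms(32768, 50257, 2, 1.5e12)
print(dense == fused, round(estimate, 2))
```

The estimate lands in the same range as the roughly 4.5 ms latency improvement reported for A100.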
Will Constable	e1499f6342	[C10D] Make new_group eager when used with comm_split (#129284 ) If users pass `device_id` to init_process_group, they enable eager init for the default group. Then if they subsequently call `new_group`, the device_id argument is not required as it should be assumed to match the one used for init_process_group. However, both `init_process_group` and `new_group` apis share a helper function, which expects a `device_id` value that defaults to None. When it's None, eager initialization is disabled. This PR ensures that if a device_id was passed to init_process_group, the same device_id will automatically be fed into the helper function for any new_group calls that follow. Test plan I found an existing test in CI `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly. I updated the test in the PR to pass with my change. I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and observed eager initialization of the 'lows' and 'highs' PGs before the 'Here' print. ``` import torch import torch.distributed as dist dist.init_process_group(backend="nccl", device_id =torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}")) dist.new_group([0, 1], group_desc="lows") dist.new_group([2, 3], group_desc="highs") print("Here") torch.distributed.destroy_process_group() ``` Output: https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284 Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet	2024-06-25 21:09:34 +00:00
Zhengxu Chen	e58ef5b65f	[export] Rewrite exportdb formatting. (#129260 ) Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library. Test Plan: CI Differential Revision: D58886554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260 Approved by: https://github.com/tugsbayasgalan	2024-06-25 21:04:53 +00:00
Wei Wang	551e412718	[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 ) Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first. This PR also gives a current read on cu121 and cu124 parity. Essentially reverting #127150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423 Approved by: https://github.com/atalman, https://github.com/eqy	2024-06-25 20:59:49 +00:00
Max Podkorytov	79959d707c	[Inductor][ROCm] Composable Kernel backend for Inductor (#125453 ) This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection. The implementation is heavily influenced by the series of PRs which adds CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are (1) customizing compiler for the ROCm platform (2) customizing template code generation for Composable Kernel Universal GEMM instances. We provide config tuning knobs for balancing between instance sources compilation time and finding the best instance. ### Testing Install the ck library ``` pip install git+https://github.com/rocm/composable_kernel@develop ``` Run the test ``` TORCH_LOGS=+torch._inductor \ pytest --capture=tee-sys test/inductor/test_ck_backend.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453 Approved by: https://github.com/eellison, https://github.com/jansel	2024-06-25 20:54:14 +00:00
DiweiSun	ae0f84d89c	[CI] Enable amp accuracy check for inductor cpu (#127758 ) This enables the inductor AMP accuracy check on CPU in the CI workflow to catch issues early. Three suites are included: timm, huggingface and torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127758 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-25 20:34:18 +00:00
Jiashen Cao	45f2876934	[Fix] NumToTensor resulted from numel() and size() in TSConverter (#128761 ) #### Issue In jit.trace, torch.numel() is automatically cast to a `LongTensor`. But during conversion, we lost the casting part. `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. In this PR, we add a cast to convert it to the correct `dtype`. #### Test Plan We activate a previously failing test case. * `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761 Approved by: https://github.com/angelayi	2024-06-25 20:20:03 +00:00
Jeff Daily	e68ee2cadb	TunableOp hotfix (#129281 ) Fixes. - PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 had a memory leak. - The strided batched gemm size calculation for buffer rotation was incorrect resulting in a mem fault. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129281 Approved by: https://github.com/xw285cornell, https://github.com/eqy, https://github.com/mxz297	2024-06-25 20:12:46 +00:00
Chirag Pandya	1865fe282f	Log whenever we sleep (#129197 ) Summary: Log whenever we sleep for heartbeatTimeout. Useful for debugging stuck jobs. This will eventually turn into a metric. Test Plan: none. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129197 Approved by: https://github.com/Skylion007, https://github.com/d4l3k, https://github.com/wconstab	2024-06-25 20:09:41 +00:00
PyTorch MergeBot	b1f486aff9	Revert "Add warning for weights_only (#129239 )" This reverts commit 381ce0821c3fa2b342f0b8660c76cc27f48543c4. Reverted https://github.com/pytorch/pytorch/pull/129239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm `381ce0821c`, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))	2024-06-25 19:30:07 +00:00
PyTorch MergeBot	7cf454ec52	Revert "Add example for torch.serialization.add_safe_globals (#129396 )" This reverts commit f18becaaf1c7a7bf851e3ae8d215eee8dba688b6. Reverted https://github.com/pytorch/pytorch/pull/129396 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm `381ce0821c`, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))	2024-06-25 19:30:07 +00:00
Tristan Rice	0298560ca2	TCPStore: improve connect and retry logic (#129261 ) We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times. This PR does a few things: * Retry that connect and validate up to the specified timeout. * Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep. * Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141 Test plan: ``` python test/distributed/test_store.py -v ./build/bin/BackoffTest ``` Will do internal testing with some large scale jobs to ensure TCPStore works correctly. At 4k scale: 4x improvement ``` tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 1.98 secs fish external usr time 0.93 secs 91.00 micros 0.93 secs sys time 1.98 secs 954.00 micros 1.97 secs tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10) tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 8.20 secs fish external usr time 2.15 secs 0.00 micros 2.15 secs sys time 2.76 secs 843.00 micros 2.76 secs ``` ```py import time import os import threading from multiprocessing import Pool WORLD_SIZE = 10000 import torch.distributed as dist def run(rank): should_log = rank % (WORLD_SIZE // 10) == 0 if should_log: print(f"started {rank}") store = dist.TCPStore( host_name="devvm4382.nao0.facebook.com", port=29500, world_size=WORLD_SIZE, is_master=rank == 0, use_libuv=True, ) if should_log: 
print(f"init {rank}") store.set(f"key{rank}", "1234") if should_log: print(f"set {rank}") del store def noop(rank): pass print("starting pool") with Pool(WORLD_SIZE) as pool: pool.map(noop, range(WORLD_SIZE), 1) print("pool hot") start = time.time() pool.map(run, range(WORLD_SIZE), 1) print("run finished", time.time()-start) ``` ``` tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10) starting pool pool hot started 0 [W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. started 1000 init 1000 set 1000 started 2000 init 2000 set 2000 started 3000 init 3000 set 3000 started 4000 init 4000 set 4000 started 5000 init 5000 set 5000 started 6000 init 6000 set 6000 started 7000 init 7000 set 7000 started 8000 init 8000 set 8000 started 9000 init 9000 set 9000 init 0 set 0 run finished 0.705092191696167 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261 Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o	2024-06-25 19:24:22 +00:00
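The retry strategy described in the TCPStore change above (exponential backoff with jitter, bounded by an overall timeout) can be sketched in Python. This is a minimal illustration under stated assumptions, not the C++ code from the PR: `retry_with_backoff`, its parameters, and the default intervals are hypothetical names and values chosen for the sketch.

```python
import random
import time


def retry_with_backoff(op, timeout_s, initial_s=0.01, max_s=1.0, multiplier=2.0,
                       jitter=0.5, sleep=time.sleep, clock=time.monotonic,
                       rng=random.random):
    """Call op() until it succeeds or timeout_s has elapsed.

    Hypothetical helper: the names and defaults are assumptions made for
    illustration, not the constants used by TCPStore.
    """
    deadline = clock() + timeout_s
    interval = initial_s
    while True:
        try:
            return op()
        except ConnectionError:
            # Give up (re-raise) once the overall deadline has passed.
            if clock() >= deadline:
                raise
        # Spread the sleep within +/- `jitter` of the nominal interval so that
        # many clients do not retry in lockstep.
        delay = interval * (1.0 + jitter * (2.0 * rng() - 1.0))
        sleep(delay)
        # Grow the nominal interval geometrically, up to a cap.
        interval = min(interval * multiplier, max_s)
```

Compared with a fixed 1s sleep, the first retries happen quickly, and the jitter keeps thousands of workers from hitting the listen backlog at the same instant. The `sleep`, `clock`, and `rng` parameters exist only so the sketch can be exercised deterministically.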
Nikita Shulga	816e8a3f21	[MacOS] Improve libomp packaging (#129473 ) Instead of replacing `@rpath/libomp.dylib` with `@loader_path/libomp.dylib`, keep it in place and add `@loader_path` as a new rpath. This should prevent double-loading of the OpenMP runtime, because with `@rpath` the loader is allowed to reuse other libraries, but the `@loader_path` directive forces it to load the library from the location relative to the executable. Test plan: - Prepare the environment ```shell conda create -n py310-cf python=3.10 numpy pip -c conda-forge conda activate py310-cf pip install torch --index-url https://download.pytorch.org/whl/test/cpu ``` - Verify that OpenMP is loaded twice and then crashes ```shell KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())" ``` output: ``` LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 16.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 12.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no 2.4.0 True zsh: segmentation fault KMP_VERSION=true python -c ``` - Install artifact from this PR and make sure it passes the same test ```shell python -mpip install ~/Downloads/torch-2.5.0.dev20240625-cp310-none-macosx_11_0_arm64.whl KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())" ``` output ``` LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP 
link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 16.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no 2.5.0.dev20240625 True tensor(1.0000) ``` - Make sure it still uses bundled OpenMP if none is available in the environment ``` conda uninstall numpy -c conda-forge KMP_VERSION=true python -c "from ctypes import cdll, c_char_p, c_uint32; import torch; from ctypes import cdll, c_char_p, c_uint32; libdyld = cdll.LoadLibrary('libSystem.dylib'); libdyld._dyld_image_count.restype = c_uint32; libdyld._dyld_get_image_name.restype = c_char_p; libdyld._dyld_get_image_name.argtypes = [c_uint32]; print(torch.rand(300, 300).abs().max()); libs = [libdyld._dyld_get_image_name(i).decode('ascii') for i in range(libdyld._dyld_image_count())]; print([l for l in libs if 'libomp.dylib' in l])" ``` Fixes https://github.com/pytorch/pytorch/issues/124497 and https://github.com/pytorch/pytorch/issues/126385 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129473 Approved by: https://github.com/atalman	2024-06-25 19:12:34 +00:00
PyTorch MergeBot	b045878f81	Revert "Remove test_mps_allocator_module XFAIL (#129340 )" This reverts commit c888ee36325148ed99db4298bf2ae739ebbeacdc. Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))	2024-06-25 18:37:54 +00:00
Andrew Gu	7ebffef4d0	[FSDP2] Ran post-acc-grad hooks manually (#129450 ) FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually. Discussion Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity. Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not. Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually. Caveats - Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass). - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be. - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers. - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`. - The post-accumulate-grad hook runs in the reduce-scatter stream. 
Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream. - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a 3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues. - This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope. Experiments (torchtitan) - Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision: - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped) - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450 Approved by: https://github.com/weifengpy	2024-06-25 18:34:56 +00:00
Yidi Wu	dd00f5e78d	Fixes T192448049 (#129146 ) Differential Revision: D58767610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129146 Approved by: https://github.com/angelayi	2024-06-25 17:50:15 +00:00
Weizhuo Zhang	53f462c506	Write dynamo benchmarks performance result to csv when throw exceptions (#126764 ) Performance mode issue: When the dynamo benchmark performance warm-up fails, the result is not written to the csv file, but the accuracy result is written as `fail_to_run` even when the dynamo pass fails. As a result, the number of models in the accuracy csv file does not match the number in the performance csv file. ![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c) - Fix: Models that fail warm-up are now recorded in the csv file, as shown in the following image: ![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117) Accuracy mode issue: `detectron2_fasterrcnn_r` models failed in accuracy mode but were tested successfully in performance mode. The accuracy failure is the same as in PR `ee557d8f61`. ``` Dynamic Shape: Traceback (most recent call last): File "benchmarks/dynamo/torchbench.py", line 449, in <module> torchbench_main() File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main main(TorchBenchmarkRunner(), original_dir) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main process_entry(0, runner, original_dir, args) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry return run(runner, args, original_dir) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run assert marked, f"nothing in example_inputs had a dim with {batch_size}" AssertionError: nothing in example_inputs had a dim with 4 ``` ![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a) - Fix: same as PR `ee557d8f61`: setting batch_size to 4 is skipped when testing dynamic shapes. 
Dynamic shapes passrate improved from 89% -> 95% \| Comp Item \| Compiler \| suite \| before \| After fix \| \|-----------\|----------\|------------\|------------\|------------\| \| Pass Rate \| Inductor \| torchbench \| 89%, 73/82 \| 95%, 79/83 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764 Approved by: https://github.com/jansel	2024-06-25 17:49:04 +00:00
atalman	e317a8b264	Add guard to use AMX for x86_64 only (#129479 ) Trying to mitigate aarch64 and s390 nightly failures as per this comment: https://github.com/pytorch/pytorch/pull/127195#issuecomment-2189177949 Fixes https://github.com/pytorch/pytorch/issues/129443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129479 Approved by: https://github.com/nWEIdia, https://github.com/malfet	2024-06-25 17:31:28 +00:00
PyTorch MergeBot	45b2931b7e	Revert "[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414 )" This reverts commit b24787b7576c184a54d13c1833ada23a395f5c31. Reverted https://github.com/pytorch/pytorch/pull/129414 on behalf of https://github.com/ZainRizvi due to This PR seems to be causing multiple macOS failures. Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))	2024-06-25 17:05:55 +00:00
PyTorch MergeBot	fb40ba6fc2	Revert "[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 )" This reverts commit aa4ee2cb9e1f9be6bbdd27654e0f768b7fe9be6c. Reverted https://github.com/pytorch/pytorch/pull/127247 on behalf of https://github.com/ZainRizvi due to This PR seems to be causing multiple macOS failures. Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))	2024-06-25 17:05:55 +00:00
PyTorch MergeBot	ad76da6c16	Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 )" This reverts commit 7b57ddd38c6d502ba313c0e6b0c92b6787d69986. Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 `4c1e4c5f30`, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))	2024-06-25 16:48:32 +00:00
PyTorch MergeBot	b38f6d4cd2	Revert "[inductor] Enable FX graph caching in OSS by default (#125863 )" This reverts commit 4c1e4c5f307f9743014a08cf97d3fa8de7e1ce5f. Reverted https://github.com/pytorch/pytorch/pull/125863 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 `4c1e4c5f30`, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))	2024-06-25 16:48:32 +00:00
vinithakv	f8db12a538	Fix logic to find sbgemm in BLAS library (#125227 ) The current logic that sets the HAS_SBGEMM flag is skipped when the BLAS libraries have already been found, i.e., when they are set from the environment variable BLAS=OpenBLAS. If BLAS_LIBRARIES is already set, the code that checks whether the BLAS library has sbgemm is never executed. This commit moves that check outside the conditional so that it always runs. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227 Approved by: https://github.com/malfet	2024-06-25 16:34:38 +00:00
Zhengxu Chen	665d6ea05b	[export] Fix IR canonicalization. (#129401 ) Summary: As title. We should unpack the results from _canonicalize_graph. Differential Revision: D58963429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129401 Approved by: https://github.com/tugsbayasgalan	2024-06-25 16:33:02 +00:00
Joel Schlosser	e364290718	Support linear backward for NJT with dim > 3 (#129393 ) Replaces usage of `torch.mm()` with `torch.matmul()` in NJT's impl of linear_backward to support higher dims. See [here](https://github.com/pytorch/pytorch/issues/125214#issuecomment-2184968703) for more context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129393 Approved by: https://github.com/soulitzer	2024-06-25 16:06:23 +00:00
Klein Shen	0e6bb7f1ce	[caffe2][be] migrate global static initializer (#128784 ) Summary: The Caffe2 lib has 200+ global static initializer usages, each of which is a paper cut for startup perf. Details in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154. This diff migrates StorageImpl.cpp Additional Context: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154 Test Plan: CI Differential Revision: D58639283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128784 Approved by: https://github.com/aaronenyeshi	2024-06-25 15:30:49 +00:00
Nikita Shulga	fd4af87855	Fix non-portable path warning (#129474 ) macOS uses a case-insensitive filesystem by default, but it's better to specify the include path using proper capitalization. Should fix ``` MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path] #include <Aten/native/mps/operations/FusedOptimizerOps.h> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw	2024-06-25 15:17:21 +00:00
drisspg	cb1c56caba	Set target dependencies to always build for sm90a on rowwise scaling (#129402 ) # Summary Instead of landing global builder changes; https://github.com/pytorch/builder/pull/1878 This PR targets only the Rowwise file and adds the sm90a featurs. Verified locally by setting: ``` TORCH_CUDA_ARCH_LIST=9.0 ``` We can see in the build.ninja file that the proper flags are set: ``` build caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o: CUDA_COMPILER__torch_cuda_unscanned_Release /home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu \|\| cmake_object_order_depends_target_torch_cuda DEFINES = -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS DEP_FILE = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o.d FLAGS = -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 
-Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-unused-function,-Wno-maybe-uninitialized -Wno-deprecated-copy -gencode arch=compute_90a,code=sm_90a INCLUDES = -I/home/drisspg/meta/pytorch/build/aten/src -I/home/drisspg/meta/pytorch/aten/src -I/home/drisspg/meta/pytorch/build -I/home/drisspg/meta/pytorch -I/home/drisspg/meta/pytorch/third_party/onnx -I/home/drisspg/meta/pytorch/build/third_party/onnx -I/home/drisspg/meta/pytorch/third_party/foxi -I/home/drisspg/meta/pytorch/build/third_party/foxi -I/home/drisspg/meta/pytorch/aten/src/THC -I/home/drisspg/meta/pytorch/aten/src/ATen/cuda -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/drisspg/meta/pytorch/build/caffe2/aten/src -I/home/drisspg/meta/pytorch/aten/src/ATen/.. -I/home/drisspg/meta/pytorch/build/nccl/include -I/home/drisspg/meta/pytorch/c10/cuda/../.. -I/home/drisspg/meta/pytorch/c10/.. 
-I/home/drisspg/meta/pytorch/third_party/tensorpipe -I/home/drisspg/meta/pytorch/build/third_party/tensorpipe -I/home/drisspg/meta/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/drisspg/meta/pytorch/torch/csrc/api -I/home/drisspg/meta/pytorch/torch/csrc/api/include -isystem /home/drisspg/meta/pytorch/build/third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/drisspg/meta/pytorch/third_party/protobuf/src -isystem /home/drisspg/meta/pytorch/third_party/ittapi/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.3/include -isystem /home/drisspg/meta/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/drisspg/meta/pytorch/third_party/ideep/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/cudnn_frontend/include OBJECT_DIR = caffe2/CMakeFiles/torch_cuda.dir OBJECT_FILE_DIR = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129402 Approved by: https://github.com/malfet	2024-06-25 13:54:51 +00:00
Li-Huai (Allan) Lin	71ebe5121a	[MPS] Fast math env var (#129007 ) Allow users to decide whether they want to have fast math enabled via env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008	2024-06-25 13:52:07 +00:00
Shangdi Yu	bbdeff76fc	fix add decomposition for complex numbers (#129044 ) Fixes #125745 Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work. Fix: re-shape the complex tensors after view and before broadcasting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044 Approved by: https://github.com/zou3519, https://github.com/lezcano	2024-06-25 11:05:41 +00:00
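The broadcasting problem described in the complex-add fix above can be illustrated at the level of shapes alone. The sketch below is plain Python and does not call PyTorch; `broadcast_shape`, `view_as_real_flat`, and `view_as_real_pairs` are hypothetical helpers. The PR text only says the tensors are re-shaped after the view and before broadcasting, so modelling that as a trailing axis of size 2 is an assumption.

```python
def broadcast_shape(a, b):
    """Return the broadcast of two shapes, or None if they are incompatible."""
    out = []
    # Walk dimensions from the right, padding the shorter shape with 1s.
    for i in range(1, max(len(a), len(b)) + 1):
        x = a[-i] if i <= len(a) else 1
        y = b[-i] if i <= len(b) else 1
        if x != y and 1 not in (x, y):
            return None
        out.append(max(x, y))
    return tuple(reversed(out))


def view_as_real_flat(shape):
    # Models x.view(x.real.dtype): the last dimension doubles.
    return shape[:-1] + (shape[-1] * 2,)


def view_as_real_pairs(shape):
    # Assumed shape of the fix: keep (real, imag) as a trailing axis of size 2.
    return shape + (2,)


a, b = (3, 1), (3, 4)                      # complex shapes that broadcast fine
complex_result = broadcast_shape(a, b)
flat_result = broadcast_shape(view_as_real_flat(a), view_as_real_flat(b))
pairs_result = broadcast_shape(view_as_real_pairs(a), view_as_real_pairs(b))
```

Here the flat view turns the size-1 dimension into size 2, which no longer broadcasts against 8, while the paired layout keeps the size-1 dimension intact.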
Sanket Jayant Purandare	6508f0f5d4	Improved backward tracking and attribution, fixed typing for python < 3.10 (#129400 ) For #125323 * Fixes typing for python < 3.10 * Fixes #129390 For #124688 * Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params. * Fixed premature per-module bw peak state initialization for AC. * This improves per-module stats; the global `peak_mem` was already accurate and remains unaffected. For #128508 * When AC is applied to a `mod (nn.Module)`, the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since the `ModTracker` maintains the `parents` attribute as a set, the `post-fw` during backward was prematurely removing the module from `parents`. * With the fix we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0. * Added tests to ensure this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400 Approved by: https://github.com/awgu, https://github.com/huydhn	2024-06-25 10:54:58 +00:00
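The per-module counter described in the commit above can be sketched as follows. This is a standalone illustration, not the `ModTracker` code: the `ParentTracker` class and its `enter`/`exit` methods are hypothetical names, and the hook events are simulated by plain method calls.

```python
from collections import Counter


class ParentTracker:
    """Tracks which modules are currently active, tolerating re-entry."""

    def __init__(self):
        # Number of outstanding "enter" events per module name.
        self._depth = Counter()

    def enter(self, name):
        self._depth[name] += 1

    def exit(self, name):
        self._depth[name] -= 1
        # Only drop the module once every enter has been matched by an exit.
        if self._depth[name] == 0:
            del self._depth[name]

    @property
    def parents(self):
        return set(self._depth)


tracker = ParentTracker()
tracker.enter("mod")   # pre-bw
tracker.enter("mod")   # pre-fw (recomputation under activation checkpointing)
tracker.exit("mod")    # post-fw: must not remove "mod" yet
still_active = "mod" in tracker.parents
tracker.exit("mod")    # post-bw: counter reaches 0, module is removed
```

With a plain set, the `post-fw` event would have discarded the module while its backward pass was still running; the counter defers removal until the matching `post-bw`.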
Alexander Grund	63474620ab	test_jit: Replace plain assert by test assert (#128950 ) The plain assert doesn't show the values in case of failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/128950 Approved by: https://github.com/zou3519	2024-06-25 09:04:53 +00:00
Xuehai Pan	0314c4c101	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-25 08:28:38 +00:00
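The rewrites listed in the commit above are equivalences between `os.path` and `pathlib` spellings, which can be checked directly. The sketch uses `PurePosixPath` and `posixpath` so that it behaves the same on every platform; the path itself is a made-up example.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical file location, used only to compare the spellings.
path = PurePosixPath("/repo/tools/pkg/sub/script.py")

# Step 2: nested dirname calls vs. chained .parent
nested_dirname = posixpath.dirname(posixpath.dirname(str(path)))
chained_parent = str(path.parent.parent)

# Step 4: four chained .parent accesses vs. .parents[3]
four_parents = path.parent.parent.parent.parent
indexed_parent = path.parents[3]
```

Note the off-by-one in step 4: `.parents[0]` is the immediate parent, so N chained `.parent` accesses correspond to `.parents[N - 1]`.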
Fuzzkatt	4ca8eecca4	skip test_graph_capture_oom for jetson (#128661 ) On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error: ``` RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File "/usr/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, **kwargs) File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom with self.assertRaisesRegex(RuntimeError, oom_regex): File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__ self._raiseFailure('"{}" does not match "{}"'.format( File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure raise self.test_case.failureException(msg) AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. " ``` This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661 Approved by: https://github.com/eqy, https://github.com/atalman	2024-06-25 08:25:11 +00:00
eqy	8bfd9e9815	[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 ) cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant ~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~ Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-25 06:01:50 +00:00
Jiong Gong	533c4190f9	[inductor][cpp] support nested kernel with indirect indexing (#129223 ) This PR makes sure the current kernel is used for generating CSE variables when nested kernel codegen is involved, e.g., nested CppKernel is used to generate epilogue of CppTemplateKernel. Without the fix, the epilogue with indirect indexing would fail to run. pytest -k test_linear_with_embedding_bias_False_cpu test_cpu_select_algorithm.py Epilogue code Before: ```c++ { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)m_start)); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp11 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0x0)), 16); auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 ? tmp3 : tmp0; auto tmp5 = decltype(tmp4)(tmp4 + tmp2); auto tmp6 = tmp1 ? tmp5 : tmp4; auto tmp7 = tmp6; auto tmp8 = c10::convert<int64_t>(tmp7); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); auto tmp10 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384Ltmp6)), 16); auto tmp12 = (tmp11); auto tmp13 = tmp10 + tmp12; tmp13.store(Y + static_cast<long>(n_start + x1 + (384Lm_start) + (384Lx0))); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp11 = local_acc_buf[static_cast<long>(x1 + (N0x0))]; auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 ? tmp3 : tmp0; auto tmp5 = decltype(tmp4)(tmp4 + tmp2); auto tmp6 = tmp1 ? 
tmp5 : tmp4; auto tmp7 = tmp6; auto tmp8 = c10::convert<int64_t>(tmp7); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); auto tmp10 = in_ptr3[static_cast<long>(n_start + x1 + (384Ltmp6))]; auto tmp12 = c10::convert<float>(tmp11); auto tmp13 = decltype(tmp10)(tmp10 + tmp12); Y[static_cast<long>(n_start + x1 + (384Lm_start) + (384Lx0))] = tmp13; } } } ``` Epilogue code After: ```c++ { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)m_start)); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp13 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0x0)), 16); auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 < 0; auto tmp5 = tmp4 ? tmp3 : tmp0; auto tmp6 = decltype(tmp5)(tmp5 + tmp2); auto tmp7 = tmp5 < 0; auto tmp8 = tmp7 ? tmp6 : tmp5; auto tmp9 = tmp8; auto tmp10 = c10::convert<int64_t>(tmp9); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384Ltmp8)), 16); auto tmp14 = (tmp13); auto tmp15 = tmp12 + tmp14; tmp15.store(Y + static_cast<long>(n_start + x1 + (384Lm_start) + (384Lx0))); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp13 = local_acc_buf[static_cast<long>(x1 + (N0x0))]; auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 < 0; auto tmp5 = tmp4 ? 
tmp3 : tmp0; auto tmp6 = decltype(tmp5)(tmp5 + tmp2); auto tmp7 = tmp5 < 0; auto tmp8 = tmp7 ? tmp6 : tmp5; auto tmp9 = tmp8; auto tmp10 = c10::convert<int64_t>(tmp9); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); auto tmp12 = in_ptr3[static_cast<long>(n_start + x1 + (384Ltmp8))]; auto tmp14 = c10::convert<float>(tmp13); auto tmp15 = decltype(tmp12)(tmp12 + tmp14); Y[static_cast<long>(n_start + x1 + (384Lm_start) + (384Lx0))] = tmp15; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129223 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-06-25 05:21:00 +00:00
cdzhan	665dbc2f52	[easy][DCP] Fix test_fine_tuning.py for get/set_state_dict API changes (#129365 ) Update test/distributed/checkpoint/e2e/test_fine_tuning.py for https://github.com/pytorch/pytorch/pull/112203 change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129365 Approved by: https://github.com/fegin	2024-06-25 05:12:02 +00:00
titaiwangms	0e1e289033	[ONNX] Benchmark refactored ONNX export (#129427 ) Reuse torch.onnx.export with torch_onnx patch to test ExportedProgram -> ONNX IR exporter Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427 Approved by: https://github.com/justinchuby	2024-06-25 04:47:53 +00:00
Mikayla Gawarecki	f18becaaf1	Add example for torch.serialization.add_safe_globals (#129396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396 Approved by: https://github.com/albanD ghstack dependencies: #129244, #129251, #129239	2024-06-25 04:19:44 +00:00
Mikayla Gawarecki	381ce0821c	Add warning for weights_only (#129239 ) Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239 Approved by: https://github.com/albanD ghstack dependencies: #129244, #129251	2024-06-25 04:19:44 +00:00
Mikayla Gawarecki	c5f7755e86	Allow BUILD/NEWOBJ instruction for items added via torch.serialization.add_safe_globals (#129251 ) Previously, allowlisting functions/classes via `torch.serialization.add_safe_globals(obj)` for the `weights_only` Unpickler had the following effect: - For a [`GLOBAL`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1939) instruction, `GLOBAL obj.__module__ obj.__name__` would be allowed and translated back to obj to be pushed back to the stack. - For a [`REDUCE`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1982) instruction where we expect the stack to contain `func` and `args`, `func` is allowed if it was added via `add_safe_globals` However, it did not have an effect on `BUILD` and `NEWOBJ` instructions Some classes may be rebuilt via [`NEWOBJ`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L2091-L2104) instruction, which indicates that their constructor should be used to rebuild the class. Further, a [`BUILD`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1984-L2007) instruction might be used if an object's `__reduce__`/`__reduce_ex__` returns a non-None value for `state`. Which indicates a `__setstate__` or `__dict__.update`. This PR makes sure that adding objects to the allowlist will also allow `NEWOBJ` and `BUILD` instructions for them. In particular, the update for `NEWOBJ` should unblock allowlisting of [`ScaledMMConfig`](`d4ade877df/float8_experimental/float8_tensor.py (L26-L30)`) in float8_experimental @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/129251 Approved by: https://github.com/albanD ghstack dependencies: #129244	2024-06-25 04:19:44 +00:00
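To see where `NEWOBJ` and `BUILD` come from, it helps to list the opcodes that the standard pickler emits for an ordinary class instance. The sketch below uses only the standard library; `opcode_names` is a hypothetical helper, and `argparse.Namespace` is picked merely as a convenient importable class with instance state and no custom `__reduce__`. It does not exercise the `weights_only` unpickler itself.

```python
import argparse
import pickle
import pickletools


def opcode_names(obj, protocol=2):
    """Return the names of the pickle opcodes emitted for obj."""
    data = pickle.dumps(obj, protocol=protocol)
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


# A plain class instance: rebuilt via cls.__new__ (NEWOBJ), then its
# attribute dict is restored (BUILD).
names = opcode_names(argparse.Namespace(lr=0.1))
```

At protocol 2 the stream contains `GLOBAL` (to look up the class), `NEWOBJ`, and `BUILD`, and no `REDUCE`, which is why allowlisting a class only for `GLOBAL` and `REDUCE` is not enough to load such objects.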
Mikayla Gawarecki	1bb1e3463c	Fix allowlisting of builtins for weights_only unpickler (#129244 ) Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a `GLOBAL <python2_mod>.<python2_name>` instruction is encountered, it is properly mapped to `<python3_mod>.<python3_name>`, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63). This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted). An example is as follows: `__builtin__` was renamed to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html) > Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`! However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`. ```python >>> import pickle >>> import pickletools >>> print.__module__ 'builtins' >>> with open('print.pkl', 'wb') as f: >>> pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch >>> with open('print.pkl', 'rb') as f: >>> pickletools.dis(f) 0: \x80 PROTO 2 2: c GLOBAL '__builtin__ print' # pickle saves the module string as __builtin__ !!! :( 21: q BINPUT 0 23: . 
STOP ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244 Approved by: https://github.com/albanD	2024-06-25 04:19:44 +00:00
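The renaming described in #129244 can be reproduced in memory with a minimal sketch like the one below. It uses only the standard `pickle` and `pickletools` modules, not PyTorch's `weights_only` unpickler; the variable names are illustrative.

```python
import pickle
import pickletools

# Protocol 2 with fix_imports (the default) writes the Python 2 module name.
payload = pickle.dumps(print, protocol=2)

# Collect the argument of every GLOBAL opcode, formatted as "<module> <name>".
global_args = [
    arg
    for opcode, arg, _pos in pickletools.genops(payload)
    if opcode.name == "GLOBAL"
]
print(global_args)

# The stock unpickler maps the Python 2 name back to the Python 3 object.
restored = pickle.loads(payload)
print(restored is print, print.__module__)
```

The pickle stream refers to `__builtin__ print` even though `print.__module__` is `builtins`, so an allowlist keyed on the Python 3 name only matches if the unpickler applies the same reverse mapping that the stock unpickler does.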
Will Feng	aa4ee2cb9e	[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 ) Test command: `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247 Approved by: https://github.com/bdhirsh ghstack dependencies: #129414	2024-06-25 03:13:38 +00:00
Will Feng	b24787b757	[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414 ) This makes it easier to do pattern-matching on `fsdp.split_with_sizes_copy` in Inductor passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129414 Approved by: https://github.com/bdhirsh	2024-06-25 03:08:56 +00:00
Isuru Fernando	e6bfa2958b	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-25 02:45:02 +00:00
Zain Rizvi	4d04203852	[BE] Runner determinator: Expect usernames to be prefixed with '@' (#129246 ) Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@". This will make typos way less likely since GitHub's autocomplete/autoformatting will help out. For now, I've updated the issue to have usernames both with and without the @ while this change rolls out. Testing: Ran the script locally on both this issue and a new test issue and verified they both had the expected output: ``` (venv) (base) ➜ ~/pytorch git:(zainr/improve-get-workflow-type) python .github/scripts/get_workflow_type.py --github-token github_pat_*** --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff" {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246 Approved by: https://github.com/zxiiro, https://github.com/huydhn	2024-06-25 02:39:33 +00:00
Kazuaki Ishizaki	533395e204	Fix build error on s390x (#129326 ) This PR fixes the build error on s390 after #127195. The following is the log of the build on s390x. This is because `SYS_arch_prctl` is not defined on s390x. ``` ... [792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o [793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o /usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. 
-I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp 
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()': /pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'? 60 \| long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA); \| ^~~~~~~~~~~~~~ \| SYS_prctl [794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o [795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o [796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o [797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o [798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o [799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o [800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o [801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o [802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o ninja: build stopped: subcommand failed. Building wheel torch-2.5.0a0+git94dc325 -- Building version 2.5.0a0+git94dc325 cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch cmake --build . --target install --config Release Build step 'Execute shell' marked build as failure ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129326 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-06-25 02:39:13 +00:00
Animesh Jain	c4dd752d97	[dynamo][compile-time][inlining-inbuilt-nn-modules] Manually implement nn.Module._call_impl (#129285 ) # Compile time for eager backend ## AlbertForMaskedLM No inlining - 3.65 seconds Inlining on main - 7.48 seconds Inlining + this PR - 2.86 seconds ## MobileBertForMaskedLM No inlining - 26.90 seconds Inlining on main - 48.21 seconds Inlining + this PR - 24.25 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/129285 Approved by: https://github.com/jansel ghstack dependencies: #129316, #129315	2024-06-25 01:31:26 +00:00
Animesh Jain	514f9279f8	[dynamo][compile-time] Manually implement nn.Module.__getattr__ to reduce compile time (#129315 ) # Compile time for eager backend ## AlbertForMaskedLM No inlining - 3.65 seconds Inlining on main - 7.48 seconds Inlining + this PR - 6.70 seconds ## MobileBertForMaskedLM No inlining - 26.90 seconds Inlining on main - 48.21 seconds Inlining + this PR - 43.85 seconds Next PR in the stack makes the total compile time better/comparable to no inlining Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315 Approved by: https://github.com/jansel ghstack dependencies: #129316	2024-06-25 01:31:26 +00:00
PyTorch MergeBot	c012013aa6	Revert "Add Strided Input test for flex attention (#128915 )" This reverts commit 41bb81b58279f492e72bd270b3b071dd2953ed8c. Reverted https://github.com/pytorch/pytorch/pull/128915 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk, i.e. `41bb81b582 (26627138290)` ([comment](https://github.com/pytorch/pytorch/pull/128915#issuecomment-2187695317))	2024-06-25 00:43:34 +00:00
Colin Peppler	1315be4893	[aotinductor] only autotune at compile time when enabled via config (#129413 ) internal breakage when enabled. Test Plan: CI Differential Revision: D58965784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129413 Approved by: https://github.com/jingsh, https://github.com/desertfire	2024-06-25 00:41:10 +00:00
Antoni Vros	78e40b271b	Change index_put on GPU to accept FP8 inputs (#128758 ) As the title says, this PR changes the dispatcher for the CUDA index_put_ kernel to accept FP8 inputs. This is useful for Transformers models where the KV cache is FP8 and has been pre-allocated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128758 Approved by: https://github.com/eqy, https://github.com/drisspg	2024-06-25 00:38:03 +00:00
wz337	8b6391ee59	[Test][DTensor] Temporarily skip gloo test for test_depthwise_convolution (#129391 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129391 Approved by: https://github.com/awgu	2024-06-25 00:29:50 +00:00
Shunting Zhang	81de71fdc5	[inductor] fix a double clone in coordesc tuning (#129399 ) It's embarrassing that there is a hidden double clone bug in coordinate descent tuning. In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But later on in `CachingAutotuner.bench` we do that again. This double clone is fine if - the tensor is small - the allocation of the tensor is not on the critical path for memory footprint. But neither holds for the quite common usage of cross entropy loss. This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue of peak memory usage increasing due to autotuning still exists. This bug just makes it worse (since we double allocate). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-06-25 00:18:51 +00:00
Nikita Shulga	14dc08ddc7	Inductor to fail gracefully on Voltas for bf16 tensors (#129288 ) Volta (sm_7x) GPUs do not have HW support for the bfloat16 datatype. It is emulated in software, so PyTorch eager can use bfloat16 tensors, but Triton cannot. So if a graph with either CUDA bf16 input or output tensors is used, raise warnings and skip the frame. Add optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`. Test plan: Modify `is_bf16_supported` to return False and see that the warning is generated Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288 Approved by: https://github.com/eqy, https://github.com/jansel	2024-06-25 00:04:13 +00:00
Sam Larsen	4c1e4c5f30	[inductor] Enable FX graph caching in OSS by default (#125863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863 Approved by: https://github.com/eellison, https://github.com/oulgen ghstack dependencies: #129257	2024-06-24 23:39:43 +00:00
Sam Larsen	7b57ddd38c	[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 ) Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR: * Fix the with_fresh_cache_if_config() decorator * Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257 Approved by: https://github.com/oulgen	2024-06-24 23:39:43 +00:00
Yidi Wu	b22f0f5f51	[torchbind] fix bug of mutating FakeScriptObjects twice in aot_export (#128844 ) This PR does two things: 1. It duplicates the fake script object because aot_export traces the program twice; otherwise the result of the first trace would make the result of the second trace wrong. 2. It also adds a new test for methods that return constant outputs. Before this PR, there was no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant return operators in the graph because torchbind objects are stateful and deleting them would remove the implicit state mutation inside of the object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844 Approved by: https://github.com/angelayi	2024-06-24 23:14:34 +00:00
joydddd	41bb81b582	Add Strided Input test for flex attention (#128915 ) Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-06-24 22:56:39 +00:00
yuqingj	00f675bb4c	[Nested Tensor]fix sdpa backward for the special case with ragged second batch dim and constant length (#128349 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128349 Approved by: https://github.com/jbschlosser	2024-06-24 22:35:07 +00:00
Joel Schlosser	7b7f357042	Fix DEBUG=1 asserts with NJT ops (#129014 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-06-24 22:32:01 +00:00
Isuru Fernando	5f912f480c	Fix max_pool2d decomposition for empty list and integer limits (#129106 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106 Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #129096, #129097	2024-06-24 22:19:42 +00:00
Isuru Fernando	e096faaf30	Fix rot90 decomposition for no rotation (#129097 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129097 Approved by: https://github.com/peterbell10 ghstack dependencies: #129096	2024-06-24 22:19:42 +00:00
Isuru Fernando	fbca70718f	Fix scatter lowering when src is a Number (#129096 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129096 Approved by: https://github.com/peterbell10	2024-06-24 22:19:39 +00:00
Zain Rizvi	8edb7b96b1	Enable dynamic rollout for pull workflow (#129243 ) Enables dynamic migration of jobs to the LF AWS account for the pull workflow. For now, it leaves out a few jobs that need a bit more testing, namely Windows and Android runners. The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132 Note: The non-pull jobs updated are the ones that are synced to jobs in pull.yml (via `sync-tag`) and thus have to be updated whenever their corresponding pull.yml jobs are edited Based on https://github.com/pytorch/pytorch/pull/128597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129243 Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/malfet	2024-06-24 22:15:53 +00:00
ajbrent	30bfdf1afc	Errors when 0-dim tensor of complex or bool type passed to aminmax. (#128404 ) Fixes #126742 Added errors for the case of 0-dim tensors of complex or bool types passed to aminmax. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128404 Approved by: https://github.com/janeyx99	2024-06-24 21:46:49 +00:00
PyTorch UpdateBot	18fdc0ae5b	[executorch hash update] update the pinned executorch hash (#129099 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129099 Approved by: https://github.com/pytorchbot	2024-06-24 21:01:40 +00:00
Xuehai Pan	93a33bf3ac	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 18:04:38 +00:00
PyTorch MergeBot	1a54bb0f96	Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 )" This reverts commit 4f9399bd0d2bc0cbd14348b80e32b263de5c6bc0. Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121))	2024-06-24 16:50:15 +00:00
PyTorch MergeBot	063facf352	Revert "[halide-backend] Generate standalone runtime (#129025 )" This reverts commit 10c64c3b49e2008a50f9229e600c68c8a3d49292. Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467))	2024-06-24 16:47:25 +00:00
Huy Do	c888ee3632	Remove test_mps_allocator_module XFAIL (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980	2024-06-24 16:26:38 +00:00
PyTorch MergeBot	cb4919344a	Revert "[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 )" This reverts commit e53d9590287cbf97521f96d055910394f6e9a849. Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))	2024-06-24 16:18:43 +00:00
PyTorch MergeBot	7b910285db	Revert "[inductor] Refactor fusion of inplace operations (#128979 )" This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b. Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))	2024-06-24 15:29:40 +00:00
Colin Peppler	df51d0b623	[aotinductor][UserDefinedTritonKernel] use appropriate expr printer when printing args (#129301 ) Encountered the following C++ compile error. ``` Declared in this scope; did you mean ‘std::max’? 619 \| auto var_5 = max(1, u0); ``` This PR uses the C++ printer during C++ codegen; before this PR, the Python printer was used even during C++ codegen. Differential Revision: [D58913123](https://our.internmc.facebook.com/intern/diff/D58913123) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129301 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-24 15:23:05 +00:00
Xuehai Pan	e53d959028	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 14:35:41 +00:00
soulitzer	c89a9f5d17	Allow SAC policy_fn to return bool for backward compatibility (#129262 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129262 Approved by: https://github.com/Chillee, https://github.com/fmassa ghstack dependencies: #125795, #128545	2024-06-24 13:54:30 +00:00
Andrew Gu	9094248090	[FSDP2] Fixed `unshard` without lazy init (#129241 ) Previously, the `FSDPCommContext` only defines the stream attributes when `FSDPCommContext.init` is called from lazy initialization. This means that if the user calls `module.unshard()` before lazy init (e.g. first forward pass), then it would error in `wait_for_unshard()`. This PR fixes this by making sure that the stream attributes are defined, only with the default stream, at construction time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241 Approved by: https://github.com/Skylion007, https://github.com/weifengpy	2024-06-24 13:31:54 +00:00
Will Feng	d21f311af8	[Easy][Traceable FSDP2] Skip rocm for the E2E tests (#129339 ) The CUDA implementation of `resize_storage_bytes_` doesn't run on rocm yet, so need to skip it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129339 Approved by: https://github.com/msaroufim	2024-06-24 06:38:33 +00:00
Xuehai Pan	662e9e1076	[BE] enable UFMT for `torch/nn/functional.py` (#128592 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592 Approved by: https://github.com/mikaylagawarecki	2024-06-24 06:24:12 +00:00
leslie-fang-intel	8a2fed7e6a	[Inductor][CPP] Fallback QLinear Binaryfusion from postop sum to binary add when others is view (#128808 ) Summary In the int8 GEMM Template, we view the input from 3D to 2D and view the output back to 3D for QLinear, which makes the output of this QLinear a `view`. If this output view is then fed into a QLinear-Binary fusion, it breaks the assumption of QLinear-Binary with the in-place post op `sum`. For this case, we change the post-op name from in-place `sum` to out-of-place `add`, which is similar to the FP32/BF16 Linear Inplace handling in `1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543)`. TestPlan ``` clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808 Approved by: https://github.com/jgong5 ghstack dependencies: #128804	2024-06-24 01:12:18 +00:00
leslie-fang-intel	287c68c5ec	[Inductor][Quant] Use output dtype torch.uint8 explicitly (#128804 ) Summary Previously, we implicitly used `None` as the output data type for uint8 in the lowering of QLinear/QConv. This is unclear, so we should use `torch.uint8` explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128804 Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5	2024-06-24 01:08:49 +00:00
PaliC	7b9e6430ed	[Split Build] Add periodic and trunk CI for cuda builds (#129269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129269 Approved by: https://github.com/atalman	2024-06-23 17:04:37 +00:00
Xuehai Pan	f85d1e845a	[BE] enable UFMT for `torch/nn/*.py` (#128593 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593 Approved by: https://github.com/mikaylagawarecki	2024-06-23 16:05:13 +00:00
Will Feng	dadc0ed4c8	[Traceable FSDP2] Add `aot_eager` backend E2E tests for transformer model (#129157 ) This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157 Approved by: https://github.com/awgu ghstack dependencies: #129203	2024-06-23 06:11:11 +00:00
Brian Hirsh	b91a9dc328	[Brian's PR #128754 ] Use torch.ops.fsdp.set_ for FSDP2 storage resize; dont functionalize resize_, set_, split_with_sizes_copy.out (#129203 ) This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203 Approved by: https://github.com/bdhirsh	2024-06-23 06:07:19 +00:00
Xuehai Pan	62ccf6d7cd	[BE] enable UFMT for `torch/nn/modules` (#128594 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594 Approved by: https://github.com/mikaylagawarecki	2024-06-23 05:37:57 +00:00
sanketpurandare	440d8fbd4a	FSDP2 Memory Tracker (#125323 ) * __->__ #125323 ### Why do we need the FSDP Memory Tracker? Tuning Decisions 1. What is the expected peak memory with current configuration? 2. If I change my FSDP wrapping, how much effect will it have on peak memory? 3. What is the best batch size to use? 4. What is the maximum sequence length that one can run with current configuration? 5. How does increasing/decreasing the “DP” world size affect peak memory? 6. How much memory do I save if I move the optimizer to the CPU? 7. Which activation checkpointing policy should I use? 8. If I have various SAC policies, how do they compare against each other? 9. What happens if I apply different SAC policies to different FSDP units? 10. If I make my gradient reduction in fp32, what effect will it have on memory? 11. If I want to use a custom mixed precision policy, how will it affect the peak memory? 12. When does it make sense to use HSDP? 13. Can I reshard to a smaller mesh without increasing peak memory substantially? 14. Can I safely disable post forward reshard without causing an OOM? Debugging 1. Which module contributes most to activation memory? 2. Which FSDP unit is holding a lot of unsharded memory? 3. AC is not releasing memory? The FSDP2 Memory Tracker addresses all of the above. 
It is based on: * #124688 * #128508 Example and Output: ``` if __name__== "__main__": from contextlib import nullcontext from functools import partial import torch from torch.distributed._composable import checkpoint from torch.distributed._composable.fsdp import ( CPUOffloadPolicy, fully_shard, MixedPrecisionPolicy, ) from torch.distributed._tensor import DeviceMesh from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker from torch._subclasses.fake_tensor import FakeTensorMode from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, TransformerBlock, ) from torch.testing._internal.distributed.fake_pg import FakeStore dev = torch.device("cuda:0") torch.cuda.set_device(dev) world_size = 4 store = FakeStore() torch.distributed.init_process_group( "fake", rank=0, world_size=world_size, store=store ) mesh = DeviceMesh("cuda", torch.arange(0, world_size)) torch.cuda.empty_cache() torch.manual_seed(42) use_fake_mode = False with FakeTensorMode() if use_fake_mode else nullcontext(): vocab_size = 8192 bsz, seq_len = 32, 1024 with torch.device(dev): model_args = ModelArgs( n_layers=2, n_heads=16, vocab_size=vocab_size, max_seq_len=seq_len, dropout_p=0.1, ) model = Transformer(model_args) foreach = True mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32) offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode) reshard_after_forward = True fsdp_config = { } fully_shard_fn = partial( fully_shard, mesh=mesh, reshard_after_forward=reshard_after_forward, offload_policy=offload_policy, mp_policy=mp_policy, ) for module in model.modules(): if isinstance(module, TransformerBlock): checkpoint(module, preserve_rng_state=not use_fake_mode) fully_shard_fn(module) fully_shard_fn(model) optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach) torch.manual_seed(42) inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev) torch.cuda.reset_accumulated_memory_stats() 
torch.cuda.reset_peak_memory_stats() fmt = FSDPMemTracker(model, optim) fmt.track_inputs((inp,)) with fmt: for iter_idx in range(2): loss = model(inp).sum() loss.backward() optim.step() optim.zero_grad() if iter_idx == 0: fmt.reset_mod_stats() mem_stats = torch.cuda.memory_stats() tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"] cuda_peak_active = mem_stats["active_bytes.all.peak"] fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True) fmt.display_snapshot("peak", units="MiB", tabulate=True) print( f"peak active: {cuda_peak_active / (1024**3)} GiB \| " f"Tracker Max: {tracker_peak / (1024**3)} GiB" ) if not use_fake_mode: print(f"Accuracy: {tracker_peak/cuda_peak_active}") try: torch.distributed.destroy_process_group() except Exception as e: print(e) ``` <img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323 Approved by: https://github.com/awgu	2024-06-23 05:23:00 +00:00
Animesh Jain	17d1723aee	[dynamo][unspecialized-nn-modules] Remove dead (also incorrect) code (#129316 ) This code is unused because we just inline the `.parameters` call. The code was also wrong because side-effects only track the first level of mutations. An object might not be marked as mutated if one of its child objects (like a dict) is mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316 Approved by: https://github.com/jansel	2024-06-23 03:02:27 +00:00
Huy Do	cac6f99d41	Fix Windows CUDA periodic inductor/test_pattern_matcher test (#129198 ) The check was run on Windows and crashed there because Windows doesn't have triton, i.e. https://github.com/pytorch/pytorch/actions/runs/9606662121/job/26502347998#step:15:13196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129198 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/malfet	2024-06-23 02:32:27 +00:00
Manuel Candales	749c03406c	[metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965 ) Adds _convert_weight_to_int4pack MPS kernel Replaces previous int4mm Metal shader, with shader authored by @kimishpatel which improves perf by ~40% Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965 Approved by: https://github.com/malfet	2024-06-23 02:10:46 +00:00
rzou	856541c701	[custom_op] support default dtype values (#129189 ) This PR: - moves some of the dtype-string utilities into ScalarType.{h, cpp} - adds a new utility to get a mapping from dtype name to the C++ dtype - the parser now checks if the string is a dtype name; if it is then it pulls the C++ dtype from the mapping. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189 Approved by: https://github.com/albanD ghstack dependencies: #129177, #129178, #129179	2024-06-23 00:13:23 +00:00
Isuru Fernando	3e02ecd740	Test only one sample with huber_loss (#129245 ) Fixes https://github.com/pytorch/pytorch/issues/129238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129245 Approved by: https://github.com/huydhn	2024-06-22 21:15:39 +00:00
Xuehai Pan	94dc3253a0	[BE][Easy] enable UFMT for `torch/distributed/` (#128870 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870 Approved by: https://github.com/fegin, https://github.com/wconstab	2024-06-22 18:53:28 +00:00
Will Feng	e165a5971f	[Traceable FSDP2] Fix support for CUDA resize_storage_bytes_ (#129215 ) Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions. This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215 Approved by: https://github.com/jansel	2024-06-22 18:38:47 +00:00
Anshul Sinha	0e6118a68e	[dtensor][debug] added logging module tracing table to file feature (#128721 ) Summary Currently, the only way for users to view the module tracing table is to print it to the console, which can be hard to read. I have added functionality to comm_debug_mode that lets a user log the module tracing table to an output.txt file, giving the user more options to view module tracing. I have implemented the use case in the module tracing examples. The expected output is shown below for MLPModule tracing: <img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721 Approved by: https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: #128720	2024-06-22 18:14:13 +00:00
Anshul Sinha	1afd492d88	[dtensor][example] add functionality allowing users to choose which example they'd to run (#128720 ) Summary The previous example file would run all examples at the same time, leading to confusing output as the 4 processes would mix up the order. To fix this, I have added the ability to choose which example to run, making the output easier for users to read. Due to importing from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above the previously mentioned import. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h The first four outputs will be the same as the outputs seen in previous PRs. The expected output for the help argument is shown below: <img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720 Approved by: https://github.com/XilunWu	2024-06-22 18:14:13 +00:00
Jason Ansel	10c64c3b49	[halide-backend] Generate standalone runtime (#129025 ) This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417	2024-06-22 17:39:52 +00:00
Jason Ansel	4f9399bd0d	[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-22 17:39:52 +00:00
William Wen	79aabaf626	[3.13, dynamo] codegen PUSH_NULL when callable is codegen'd (#129172 ) Significant bytecode generation API change! The new suggested convention for generating bytecode that calls a function is to wrap the instructions that push a callable onto the stack with `add_push_null`, then call that callable with `create_call_function` with `push_null=False` (see diff for examples). In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable. This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable. This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172 Approved by: https://github.com/jansel	2024-06-22 17:25:23 +00:00
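The ordering rule described in the commit above can be illustrated with a toy model. This is a sketch, not Dynamo's implementation: `add_push_null_sketch` and the string instruction names are hypothetical stand-ins, and only the relative placement of the NULL (before the callable on Python <=3.12, after it on 3.13) is taken from the commit message.

```python
def add_push_null_sketch(callable_insts, py_version):
    """Return the instructions with a NULL placed where py_version expects it.

    Hypothetical helper for illustration only; instructions are plain strings
    rather than real bytecode objects.
    """
    if py_version >= (3, 13):
        # Python 3.13: the NULL is pushed after the callable.
        return list(callable_insts) + ["PUSH_NULL"]
    # Python <= 3.12: the NULL is pushed before the callable.
    return ["PUSH_NULL"] + list(callable_insts)


load_fn = ["LOAD_GLOBAL f"]
old_layout = add_push_null_sketch(load_fn, (3, 12))
new_layout = add_push_null_sketch(load_fn, (3, 13))
```

Because callers only say "this sequence pushes a callable", the version check lives in one place and no later stack rotation is needed to move the NULL.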
Mengwei Liu	905dfa186c	Fix ConstraintViolationError exception string when exprs are int (#129271 ) As titled. If `expr1` and `expr2` are ints, there is no need to call `.xreplace`. See example error: ``` UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271 Approved by: https://github.com/lezcano	2024-06-22 16:33:40 +00:00
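A minimal sketch of the guard this commit describes, under stated assumptions: `FakeSymbolicExpr` is a hypothetical stand-in for a symbolic expression type that exposes `xreplace`, and `render_expr` is an illustrative name, not a function from the PyTorch code base. The only point shown is that plain ints are formatted directly while symbolic expressions go through substitution first.

```python
class FakeSymbolicExpr:
    """Hypothetical stand-in for a symbolic expression with an xreplace() method."""

    def __init__(self, text):
        self.text = text

    def xreplace(self, mapping):
        # Substitute each symbol name with its replacement text.
        out = self.text
        for old, new in mapping.items():
            out = out.replace(old, new)
        return FakeSymbolicExpr(out)

    def __str__(self):
        return self.text


def render_expr(expr, mapping):
    """Format expr for an error message; ints have no xreplace, so skip it."""
    if isinstance(expr, int):
        return str(expr)
    return str(expr.xreplace(mapping))


int_text = render_expr(35, {"s0": "n"})
sym_text = render_expr(FakeSymbolicExpr("s0 + 1"), {"s0": "n"})
```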
Jiong Gong	920ebccca2	[inductor][cpp] refactor CppTemplateKernel to inherit CppKernel (#129101 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129101 Approved by: https://github.com/leslie-fang-intel	2024-06-22 12:50:37 +00:00
Peter Bell	72e3aca227	[inductor] Refactor fusion of inplace operations (#128979 ) `WeakDep`s force readers to have completed before a mutation overwrites the buffer, but we want to allow fusions to occur for inplace mutations where the same index is read and written. Currently this is achieved by: 1. Identifying the buffers used by the mutating op in its `dep_closure` 2. Not creating `WeakDep`s for buffers in the `dep_closure` 3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical` So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup. This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to `can_fuse_vertical` which selectively allows inplace operations to fuse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979 Approved by: https://github.com/lezcano ghstack dependencies: #129082, #129083	2024-06-22 12:38:22 +00:00
Peter Bell	88a35b5b64	BE: User future annotations in _inductor/comms.py (#129083 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129083 Approved by: https://github.com/lezcano ghstack dependencies: #129082	2024-06-22 12:38:22 +00:00
Peter Bell	73ba226d98	[inductor] Linear time dead node elimination (#129082 ) The nodes are already topologically sorted by this point, so DCEing a chain of nodes will take one full iteration per node. Simply reversing the iteration order means all users will be removed before checking a node. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082 Approved by: https://github.com/lezcano	2024-06-22 12:38:17 +00:00
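The reverse-iteration idea in the commit above can be sketched in a few lines. This is an illustration under assumptions, not Inductor's scheduler code: `Node` and `eliminate_dead_nodes` are hypothetical names, and liveness is tracked with a set of node names rather than Inductor's own user bookkeeping. Because the input list is topologically sorted, walking it backwards guarantees every user of a node has been decided before the node itself is checked, so one pass is enough.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)  # names of nodes this node reads
    has_side_effects: bool = False


def eliminate_dead_nodes(nodes, graph_outputs):
    """Drop unused nodes in a single reverse pass over a topologically sorted list."""
    live = set(graph_outputs)
    kept = []
    for node in reversed(nodes):
        if node.name in live or node.has_side_effects:
            kept.append(node)
            # Anything a kept node reads must also be kept.
            live.update(node.inputs)
    kept.reverse()  # restore topological order
    return kept


# "b" and "c" form a dead chain: c reads b, b reads a, nothing reads c.
graph = [Node("a"), Node("b", ["a"]), Node("c", ["b"]), Node("d", ["a"])]
kept_names = [n.name for n in eliminate_dead_nodes(graph, ["d"])]
```

A forward pass over the same list would only discover that `c` is dead on the first sweep and `b` on the next, which is the one-iteration-per-node behavior the commit message describes.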
Jiong Gong	cb126711cd	[merge_rule] add more cpp inductor files (#129192 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129192 Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman	2024-06-22 09:04:14 +00:00
PaliC	b57fa8d9c0	[BE] Remove JNI from libtorch builds (#124995 ) Removes jni files from the libtorch build as we do not plan to distribute them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124995 Approved by: https://github.com/malfet	2024-06-22 07:41:54 +00:00
Driss Guessous	9ffdbb5d12	Forward Fix PR for #128683 (#129037 ) Summary: This forward fixes this diff: D58699985 Since we have a few things in flight it would be much better to forward fix this test Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)' Differential Revision: D58767577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037 Approved by: https://github.com/vkuzo	2024-06-22 05:50:21 +00:00
PaliC	64743de6d8	[Split Build][BE] consolidate pip install commands (#129253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129253 Approved by: https://github.com/atalman ghstack dependencies: #129011	2024-06-22 05:49:14 +00:00
PaliC	7661d1220a	[Split Build] Fix typo in pull ci (#129270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129270 Approved by: https://github.com/atalman	2024-06-22 05:48:01 +00:00
PaliC	b0044e2e18	[Split Build] Support nightly release (#129011 ) This PR adds the split build to our binaries workflow. Validation for the workflow is done using the PR above in conjunction with https://github.com/pytorch/builder/pull/1876. Test Workflow: Check CI in the workflow above Pull Request resolved: https://github.com/pytorch/pytorch/pull/129011 Approved by: https://github.com/atalman	2024-06-22 05:45:14 +00:00
Huy Do	b72ef9df0d	Update torchbench model expected accuracy values after pinning numpy (#129213 ) After pinning numpy on torchbench, we need to move torchbench inductor benchmark jobs out of unstable state asap, so that more failures don't sneak in. I'm updating the expected values here to make trunk green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129213 Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/desertfire	2024-06-22 04:59:50 +00:00
Aaron Enye Shi	f42d5b6dca	[Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242 ) Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric. Test Plan: CI and ran locally. Differential Revision: D58875576 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242 Approved by: https://github.com/zdevito	2024-06-22 04:05:55 +00:00
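The lazy-initialization pattern named in the commit above can be sketched generically. This is not the memory snapshot code: `LazyCallback`, `ensure_registered`, and `register_annotation_callback` are hypothetical names used only to show that registration is deferred until recording actually starts and then happens exactly once.

```python
class LazyCallback:
    """Defer an expensive registration until it is first needed."""

    def __init__(self, factory):
        self._factory = factory
        self._handle = None  # nothing is registered at construction time

    def ensure_registered(self):
        # Register on first use; later calls reuse the same handle.
        if self._handle is None:
            self._handle = self._factory()
        return self._handle


calls = []


def register_annotation_callback():
    # Stand-in for the real registration work.
    calls.append("registered")
    return object()


lazy = LazyCallback(register_annotation_callback)
registered_at_construction = len(calls)
first = lazy.ensure_registered()   # e.g. invoked when recording starts
second = lazy.ensure_registered()  # a repeat call does no extra work
```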
chilli	858fb05dac	Modify ExternKernelAlloc with NoneLayout to not assign its result to anything (#129188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129188 Approved by: https://github.com/yifuwang	2024-06-22 02:57:44 +00:00
Will Constable	2f8b301c32	Clean up distributed/CONTRIBUTING.md (#128450 ) Click [here](`cf6c88af48/torch/distributed/CONTRIBUTING.md`) to see the rendered version of the file in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450 Approved by: https://github.com/wanchaol	2024-06-22 02:41:22 +00:00
James Wu	5b14943213	Run TestAOTAutograd test suite with cache (#128222 ) This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache. To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss. We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future. In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222 Approved by: https://github.com/bdhirsh	2024-06-22 02:13:28 +00:00
Animesh Jain	c5b9ee7408	[easy][dynamo] Remove try except from call_getattr (#129217 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129217 Approved by: https://github.com/lezcano ghstack dependencies: #129098, #129015	2024-06-21 23:56:00 +00:00
PyTorch MergeBot	1c75ddff35	Revert "[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 )" This reverts commit 40e8675fcbb233c98ec532607d5cd421ec850253. Reverted https://github.com/pytorch/pytorch/pull/128271 on behalf of https://github.com/malfet due to This makes PyTorch buildable only with CuDNN v9 ([comment](https://github.com/pytorch/pytorch/pull/128271#issuecomment-2183576996))	2024-06-21 23:29:20 +00:00
mori360	ef55446538	[FSDP2] Add 'TORCH_LOGS=+fsdp' to log hooks(pre/post forward/backward) and FQN (_init_fqns) (#128663 ) Summary: Add '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs Example: `TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py` Description: Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, `FSDPParamGroup.post_backward`, `FSDPState._root_pre_forward` (if it is the root), and `FSDPState._root_post_backward_final_callback`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663 Approved by: https://github.com/weifengpy, https://github.com/awgu	2024-06-21 23:25:58 +00:00
Menglu Yu	9d1b65b569	[PT2][Observability] Change the log logic (#129201 ) Summary: We only log the multiplier when users changes the default value. Test Plan: see signal Differential Revision: D58854330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129201 Approved by: https://github.com/Skylion007, https://github.com/dshi7	2024-06-21 21:48:34 +00:00
Eddie Yan	40e8675fcb	[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 ) cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant ~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~ Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271 Approved by: https://github.com/ezyang	2024-06-21 21:40:23 +00:00
Mashrur Morshed	9103b40a47	Fix small typo in docstring in ParameterList (#129193 ) In the docstring of `nn.ParameterList`, ParameterDict.append/extend was being used, which is most likely a typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129193 Approved by: https://github.com/mikaylagawarecki	2024-06-21 20:53:52 +00:00
Andrew M. James	92ca17d85d	Update triton pin (#126098 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126098 Approved by: https://github.com/bertmaher	2024-06-21 18:46:15 +00:00
Aaron Gokaslan	d52684e9a8	[BE]: Update CUDNN_frontend submodule to v1.5.1 (#128612 ) Updates submodule to cudnn_frontend v1.5.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128612 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-21 18:17:35 +00:00
soulitzer	ebf25e128c	[autograd] Do not stash version counter for saved tensor (#128545 ) Fixes https://github.com/pytorch/pytorch/issues/128611 We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking: ``` at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const { TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor"); auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach( /*version_counter=*/self.unsafeGetTensorImpl()->version_counter(), /*allow_tensor_metadata_change=*/ self.unsafeGetTensorImpl()->allow_tensor_metadata_change()); return at::Tensor(self_impl_copy); } ``` This changes the behavior when hooks are involved: - Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would've smashed the saved version counter onto that during unpack, which is not quite correct because the tensor returned by user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, if the user did their operations not in inference mode (unlikely). - In this PR, we restore the version counter using the version counter from the unpack hook's output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545 Approved by: https://github.com/albanD ghstack dependencies: #125795	2024-06-21 18:03:06 +00:00
Zhuoran Zhao	58cefaf53b	Fix hipify regular expression for AOTI wrapper (#128912 ) Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here since in hipify_torch, it will apply positive lookbehind (?<=\W) and lookahead (?=\W) to the pattern to avoid matching keyword at the beginning and end of code line. However, this can happen in codegen, which will cause the pattern to not match. Test Plan: ``` buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify ``` ``` File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f Note: Using experimental modern dice Network: Up: 0B Down: 0B (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231) Jobs completed: 10947. Time elapsed: 0.5s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) BUILD SUCCEEDED /data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None /data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... 
ok ---------------------------------------------------------------------- Ran 3 tests in 0.262s OK ``` e2e test: ``` TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 \| tee local_benchmark_log.txt ``` Differential Revision: D58705216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912 Approved by: https://github.com/desertfire	2024-06-21 18:00:40 +00:00
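The boundary problem described in the hipify commit above can be reproduced with Python's `re` module. This is a sketch under assumptions: the keyword `cudaStream_t` and the sample lines are chosen for illustration, and the `relaxed` pattern is one possible way to tolerate line boundaries, not necessarily the pattern the PR defines for `RE_PYTORCH_PREPROCESSOR`.

```python
import re

KEYWORD = "cudaStream_t"  # illustrative keyword, not taken from the PR

# Positive lookbehind/lookahead: a non-word character must exist on both sides,
# so a keyword at the very start or end of the text cannot match.
strict = re.compile(r"(?<=\W)(" + re.escape(KEYWORD) + r")(?=\W)")

# Negative assertions: only require that no word character is adjacent,
# which also holds at the start and end of the text.
relaxed = re.compile(r"(?<!\w)(" + re.escape(KEYWORD) + r")(?!\w)")

at_line_start = "cudaStream_t stream;"
mid_line = "void f(cudaStream_t stream);"
inside_identifier = "my_cudaStream_t_alias x;"

strict_start = strict.search(at_line_start)
strict_mid = strict.search(mid_line)
relaxed_start = relaxed.search(at_line_start)
relaxed_inside = relaxed.search(inside_identifier)
```

Generated code often places a type name at column zero, which is the case where the strict form silently fails to match.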
iibrahimli	2db33054b3	Disable fast path in `TransformerEncoderLayer` when there are forward (pre-)hooks attached to modules (#128415 ) Fixes #128413 Disable fast-path if there are forward hooks or pre-hooks. Example failure case given in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128415 Approved by: https://github.com/mikaylagawarecki	2024-06-21 17:38:08 +00:00
Bin Bao	8edd4c71c6	[AOTI][refactor] Remove GridExprCppPrinter (#129142 ) Summary: Previously we thought using CppPrinter is not ABI-compatibility safe, but c10/util/generic_math.h has been changed to header-only implementation, so we can remove GridExprCppPrinter now. Differential Revision: [D58854214](https://our.internmc.facebook.com/intern/diff/D58854214) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129142 Approved by: https://github.com/chenyang78	2024-06-21 17:18:37 +00:00
Jason Ansel	bdc39eef3b	[inductor] Add --inductor-config benchmark flag (#129034 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129024, #129033	2024-06-21 16:53:42 +00:00
Jason Ansel	bb4ab59651	[inductor] Run more test on correct device (#129033 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129033 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129024	2024-06-21 16:53:42 +00:00
Jason Ansel	feb3f3ad77	[inductor] Refactors for Halide backend (#129024 ) Pulling these inductor-related refactors out of the larger Halide backend PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-21 16:53:35 +00:00
chilli	237c4e6163	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: <img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb"> After: <img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534"> After fixing strides: ![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-21 15:58:53 +00:00
William Wen	bdd11483ea	[3.13] get C dynamo to compile with python callback and custom frame eval (#129171 ) Start enabling parts of C Dynamo for 3.13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129171 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-21 15:58:02 +00:00
xinan.lin	b0ae0db815	[Inductor][Intel GPU] Support reduction split. (#129120 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129120 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #129124	2024-06-21 15:11:59 +00:00
xinan.lin	fb0c51b61c	[Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722 (#129124 ) [Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722. Currently, XPU CI does not gate PR merge. So, we have to do some post-CI fixing as some PRs may break XPU CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129124 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-06-21 15:08:17 +00:00
PyTorch MergeBot	715b09ae2d	Revert "Fix DEBUG=1 asserts with NJT ops (#129014 )" This reverts commit 2bb8ee602b264b652a9dbd6877da61018054d313. Reverted https://github.com/pytorch/pytorch/pull/129014 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129014#issuecomment-2182922009))	2024-06-21 15:03:02 +00:00
cyy	479ce5e2f4	Remove outdated CUDA code from CMake (#128801 ) It's possible to simplify some CUDA handling logic in CMake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801 Approved by: https://github.com/r-barnes, https://github.com/malfet	2024-06-21 15:00:00 +00:00
cyy	2c7c286fa4	[1/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129055 Approved by: https://github.com/r-barnes	2024-06-21 14:56:31 +00:00
lezcano	53be7ff0e4	Make tl.atomic_add relaxed (#129133 ) We don't use any fancy synchronization within our atomic ops, we just want them to be atomic, so it is better to have them be relaxed than the default acquire/release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129133 Approved by: https://github.com/peterbell10	2024-06-21 14:49:58 +00:00
Bin Bao	62e5d045c0	[AOTI] Auto-tune Triton kernels in a seperate block (#129057 ) Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library. There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor: * Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation. * Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption. * Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while. This PR is the first step towards solving these problems by moving Triton kernel autotuning to the compile time and use random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate a separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first-pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen into one-pass will come later once this solution is proven stable and generates as performant kernels as before. 
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-21 14:34:13 +00:00
Sahdev Zala	9795dba1e0	Optim package docstring fix (#129086 ) Fix docstrings in various files in the optim package. This is the last remaining fix for issue #112593. The fix can be verified by running `pydocstyle path-to-file --count`. Fixes #112593 Related #128248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086 Approved by: https://github.com/janeyx99	2024-06-21 14:30:53 +00:00
Xuehai Pan	b697808056	[BE][Easy] eliminate relative import in `torchgen` (#128872 ) Fix generated by: ```bash ruff check --config 'lint.flake8-tidy-imports.ban-relative-imports="all"' --fix --select=TID $(fd '.pyi?$' torchgen) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128872 Approved by: https://github.com/zou3519	2024-06-21 14:11:46 +00:00
Joel Schlosser	e1c1052829	Backward support for unbind() with NJT (#128032 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032 Approved by: https://github.com/soulitzer	2024-06-21 14:05:23 +00:00
haozhe.zhu	27ae1f981d	[inductor] fix linear_add_bias for autocast case (#129138 ) Previously `linear_add_bias` only supported the case where the added tensor is `bfloat16`. ``` class M(torch.nn.Module): def __init__(self, dtype): super().__init__() self.linear1 = torch.nn.Linear(10, 64, bias=False) self.bias1 = torch.randn(64).bfloat16() # if the bias is not bf16, we will crash def forward(self, x): return self.linear1(x) + self.bias1 ``` For `Autocast(bf16)` cases, `self.bias1` will not be converted to bf16. We also did not check the dtype of the weight and bias in the pattern matcher, which leads to an error if the weight is bf16 while the bias is fp32. We have 2 options to resolve this: - Check the bias/weight dtype and only fold the bias when they have the same dtype - Fold them even when they do not have the same dtype, by inserting to_dtypes for the `bias node` to force it to have the same dtype as the weight. This PR chose option 1, since we can't implicitly cast the bias to bf16 here, which would lose precision. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138 Approved by: https://github.com/jgong5	2024-06-21 14:04:30 +00:00
rzou	5d8e23b49c	[custom_op] Support string default values in schema (#129179 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129179 Approved by: https://github.com/albanD ghstack dependencies: #129177, #129178	2024-06-21 13:31:40 +00:00
rzou	08b616281f	[custom ops] Switch out references from old landing page to new landing page (#129178 ) Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129178 Approved by: https://github.com/albanD ghstack dependencies: #129177	2024-06-21 13:31:40 +00:00
rzou	311fadb1fb	[docs] Redirect custom ops landing page to the correct place (#129177 ) I'm moving it to pytorch/tutorials Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177 Approved by: https://github.com/albanD	2024-06-21 13:31:32 +00:00
Yifu Wang	217aac96d7	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers).

### Python API Example

```python
import torch
from torch._C._distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypass SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p

Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels.

* __->__ #128582 Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-21 08:49:11 +00:00
Simon Fan	f0443ad174	[compiled autograd] flatten runtime inputs with fast path (#129116 ) covered by test_compiled_autograd.py and test_standalone_compile.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129116 Approved by: https://github.com/jansel ghstack dependencies: #127960, #128905, #128982, #128987, #129181	2024-06-21 08:16:33 +00:00
Simon Fan	d97dfe9313	[compiled autograd] move inputs to cuda with non_blocking=True (#129181 ) non_blocking=True requires first pinning, which shouldn't be a problem given that they are cpu scalars Pull Request resolved: https://github.com/pytorch/pytorch/pull/129181 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #127960, #128905, #128982, #128987	2024-06-21 08:16:33 +00:00
Simon Fan	8f320fd6c6	[compiled autograd] treat input params as static (#128987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128987 Approved by: https://github.com/eellison, https://github.com/BoyuanFeng ghstack dependencies: #127960, #128905, #128982	2024-06-21 08:16:33 +00:00
Simon Fan	fafa1867d1	[compiled autograd] use in_compiled_autograd_region instead of compiled_autograd_enabled_count (#128982 ) The current implementation of compiled_autograd_enabled_count affects the entire region under the context manager, so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too: - no lazy compile for compiled fw - no aot autograd cache for inference graphs. We instead maintain a flag while we execute the compiled backward callable, to isolate the special handling to the compiled backward graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982 Approved by: https://github.com/jansel ghstack dependencies: #127960, #128905	2024-06-21 08:16:33 +00:00
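The difference between a counter that covers the whole context manager and a flag that is set only while the compiled backward callable runs can be sketched in plain Python. This is an illustration under assumptions, not the PyTorch implementation: `_in_region`, `in_compiled_autograd_region()` and `run_compiled_backward()` are hypothetical stand-ins that only borrow the flag name from the commit title.

```python
import contextlib

# Hypothetical module-level flag; True only while a compiled backward runs.
_in_region = False


def in_compiled_autograd_region() -> bool:
    return _in_region


@contextlib.contextmanager
def _compiled_backward_region():
    global _in_region
    prior = _in_region
    _in_region = True
    try:
        yield
    finally:
        # Restore the previous value even if the callable raises.
        _in_region = prior


def run_compiled_backward(compiled_fn, *args):
    """Run the (assumed) compiled backward callable with the flag set."""
    with _compiled_backward_region():
        return compiled_fn(*args)


# Code outside the backward callable never observes the flag.
print(in_compiled_autograd_region())
print(run_compiled_backward(in_compiled_autograd_region))
print(in_compiled_autograd_region())
```

Because the flag is scoped to the callable rather than to the user's `with` block, unrelated torch.compile calls inside that block would not see the special handling.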
Simon Fan	68b33453f4	[aot autograd] collect static parameter metadata when graphs fallback to inference (#128905 ) Like https://github.com/pytorch/pytorch/pull/126820, but for graphs that have requires_grad inputs but no requires_grad outputs, i.e. inference graphs. The implementation of the inference graph fallback was throwing away the static parameter information during metadata recomputation. Also adds a cudagraphs counter to make this easier to test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128905 Approved by: https://github.com/mlazos ghstack dependencies: #127960	2024-06-21 08:16:33 +00:00
Simon Fan	123812790b	[compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960 Approved by: https://github.com/jansel	2024-06-21 08:16:33 +00:00
Anshul Sinha	aee512cc9d	[dtensor][op] Fixed stack op strategy (#129018 ) Summary: The previous stack op strategy was causing the input to be resharded, resulting in a list index out of range error. I delayed the resharding until after the input_specs were created so that the new dimension could be inserted, preventing the error above. I have also run all the other test cases to ensure the changes did not introduce any new bugs. Test Plan: pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018 Approved by: https://github.com/XilunWu	2024-06-21 08:10:28 +00:00
Animesh Jain	6b5fbc544e	[dynamo] Use polyfill to trace through the attributes of torch.jit.* and lru_cache_wrapper (#128336 ) Earlier we were taking the vt for `obj` and then monkeypatching that `vt.source` to be `obj._torchdynamo_inline`. If one accesses `obj.attr_a`, this would cause problems because Dynamo would then search for it in `obj._torchdynamo_inline.attr_a`. This PR makes it more functional, so that we have different vts for `obj` and `obj._torchdynamo_inline`. Fixes https://github.com/pytorch/pytorch/issues/93698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128336 Approved by: https://github.com/jansel, https://github.com/yanboliang ghstack dependencies: #129117	2024-06-21 07:44:44 +00:00
Jiong Gong	914d3ca2ba	[inductor][cpp] BF16 AMX micro-gemm support (#127195 ) This PR adds the intrinsics-based micro-gemm for BF16 using Advanced Matrix eXtension (AMX) instructions available in 4th and 5th Gen Intel Xeon processors. A compilation check is added to `codecache.py` to check the validity of the compiler support. Also, since AMX requires an initialization in the Linux kernel for its extra register state, an initialization function is added to do that and is triggered via `codecache.py`. Performance speedups of >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:

Static shapes, single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Static shapes, multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes, single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Dynamic shapes, multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195 Approved by: https://github.com/jansel	2024-06-21 07:21:47 +00:00
Wu, Chunyuan	632910e2a8	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-21 07:19:28 +00:00
Sanket Jayant Purandare	62e425ab03	Memory Tracker for tracking Module wise memory (#124688 ) We present a utility, MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:
1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at 8 points: pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied), post-backward, plus a peak memory snapshot during forward and during backward.
2. Each such snapshot provides the per-device (cpu, cuda etc.) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per-module attributes and the corresponding captured snapshots.
4. Recording the global peak memory consumption per device and the respective breakdowns.
5. Ability to do all of this under FakeTensorMode, so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution.
8. Utility functions to display all of these statistics in a user-friendly and human-readable manner.

These features will enable users to anticipate OOMs, debug and pinpoint where the majority of memory comes from, and experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (e.g. number of layers, hidden dimensions, number of attention heads) and inter-device memory movement (e.g. CPU off-loading), among others. Basically anything and everything related to device memory.

* __->__ #128508

Example:
> import torch
> import torchvision.models as models
> from torch.distributed._tools.mem_tracker import MemTracker
> device, dtype = "cuda", torch.float32
> with torch.device(device):
>     model = models.resnet18().to(dtype=dtype)
> optim = torch.optim.Adam(model.parameters(), foreach=True)
> mem_tracker = MemTracker()
> mem_tracker.track_external(model, optim)
> with mem_tracker as mt:
>     for i in range(2):
>         input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
>         model(input_batch).sum().backward()
>         optim.step()
>         optim.zero_grad()
>         if i == 0:
>             # to account for lazy init of optimizer state
>             mt.reset_mod_stats()
> mt.display_snapshot("peak", units="MiB", tabulate=True)
> mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
> # Check for accuracy of peak memory
> tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
> cuda_max = torch.cuda.max_memory_allocated()
> accuracy = tracker_max / cuda_max
> print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")

Output (screenshot): https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688 Approved by: https://github.com/awgu	2024-06-21 07:15:32 +00:00
PaliC	2b1b055a96	[Split Build] Fix libtorch_python RPATH (#129088 ) In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH. What the rpath should look like: ``` sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so \| grep "RPATH" (pytorch-3.10) RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib: ``` Before ``` sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so \| grep "RPATH" (pytorch-3.10) RPATH /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib: ``` After ``` sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so \| grep "RPATH" (pytorch-3.10) RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib: ``` Testing that this works is in the above PR. Similarly, after running ciflow/binaries the output of objdump -p should not change https://www.diffchecker.com/14PRmCNz/ (checked manywheel py 3.10 cuda 12.1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088 Approved by: https://github.com/malfet	2024-06-21 06:49:19 +00:00
Animesh Jain	c008488b9c	[dynamo][guards] Dont run TYPE_MATCH for DICT_LENGTH C++ guard (#129163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129163 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-21 06:27:19 +00:00
cyy	5c676bb8b3	Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-06-21 06:16:44 +00:00
Colin L. Rice	3a2fdbb142	[dynamo] - Add JK killswitch for dynamo compilation. (#128538 ) This allows easy disablement of dynamo in emergency situations where env variables are hard to set. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128538 Approved by: https://github.com/jansel	2024-06-21 06:14:06 +00:00
PyTorch MergeBot	f73b451e78	Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013 )" This reverts commit ff89ebc50a738c734496393dc25313cf197fd0b4. Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the test_torchinductor_opinfo test starts to fail after this commit `ff89ebc50a`, I am reverting to see if it helps trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2182042422))	2024-06-21 05:46:46 +00:00
Deng Weishi	b542825066	Enable deterministic support for oneDNN (#127277 ) This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848. As requested for the Torchbenchmark models, this PR enables the deterministic attribute for the oneDNN operators on XPU backends, such as convolution, deconvolution and matmul. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui	2024-06-21 05:21:24 +00:00
Animesh Jain	e8dbb45e98	[dynamo][user-defined-object] Check that object is valid (#129117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129117 Approved by: https://github.com/yf225	2024-06-21 04:18:54 +00:00
cyy	e99a24ce7c	Remove TensorImpl_test.cpp (#129054 ) It's not used because of removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129054 Approved by: https://github.com/albanD, https://github.com/malfet	2024-06-21 04:17:36 +00:00
Brian Hirsh	880e894c39	[Brian's PR #128981 ] fix dynamo isinstance inlining for nn.Parameter + subclasses (#129162 ) This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128981, with very small changes to work around numpy related errors. For discussions, please see Brian's original PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129162 Approved by: https://github.com/bdhirsh	2024-06-21 03:48:10 +00:00
eellison	8cd9b10456	Fix exp decomp numerics (#129154 ) Our previous implementation would sometimes generate `inf` because we did not do the same numerics tricks as in eager: See comment / [link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TransformationHelper.h#L123-L144) : ``` # curand_uniform has (0,1] bounds. log(1) is 0 and exponential excludes 0. # we need log to be not 0, and not underflow when converted to half # fast __logf approximation can underflow, so set log to -epsilon/2 for 1 or close to 1 args ``` Fix for https://github.com/pytorch/pytorch/issues/127749. Added a test for non-inf, but it would be great to have more robust decomp distribution tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129154 Approved by: https://github.com/bdhirsh, https://github.com/zou3519	2024-06-21 03:21:30 +00:00
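The numerics trick quoted in #129154 can be sketched for a single scalar sample. This is a plain-Python illustration, not the actual decomposition or the eager CUDA kernel: `exponential_from_uniform` is a hypothetical helper, and it assumes a uniform draw `u` in (0, 1] that is turned into an exponential sample via `-log(u) / lambd`.

```python
import math
import sys


def exponential_from_uniform(u: float, lambd: float = 1.0) -> float:
    """Map a uniform sample u in (0, 1] to an exponential sample (hypothetical helper)."""
    if not 0.0 < u <= 1.0:
        raise ValueError("u must lie in (0, 1]")
    eps = sys.float_info.epsilon
    # log(1) is 0, but the exponential distribution excludes 0, so for u equal
    # to (or indistinguishable from) 1 use -eps/2 instead of log(u).
    log_u = -eps / 2 if u >= 1.0 - eps / 2 else math.log(u)
    return -log_u / lambd


# The boundary sample u == 1 now yields a small positive, finite value.
print(exponential_from_uniform(1.0))
print(exponential_from_uniform(0.5))
```

The same guard keeps the result strictly positive and finite at the boundary, which is the property the new non-inf test in the commit is described as checking.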
chilli	ff89ebc50a	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb After: https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534 After fixing strides: https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-21 03:01:16 +00:00
Zain Huda	0acd09aecd	[torchrec][pt-d][model store] introduce LocalShardsWrapper for DTensor (#129150 ) Summary: Same as D57688538, recreated because of GH issues This diff introduces LocalShardsWrapper which is crucial to migrating from using ShardedTensor to DTensor in TRec state dict representation. As well as any changes needed in PT-D and ModelStore to support this. It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank as needed by TRec sharding logic. This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP) See D54375878 for how it is used. LocalShardsWrapper supports the following torch ops: + torch.ops._c10d_functional.all_gather_into_tensor.default + aten._to_copy.default + aten.view.default + aten.equal.default + aten.detach.default With extensibility to add more as required by use cases. See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach. NOTE: This version of LocalShardsWrapper does not support empty shards, that is added in the next diff enabling CW. D57063512 Test Plan: ` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details` `buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details` Sandcastle Reviewed By: XilunWu, wanchaol Differential Revision: D58570479 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150 Approved by: https://github.com/XilunWu	2024-06-21 01:58:51 +00:00
wz337	31c9e3d2f4	[FSDP][Test] Test save model save with FSDP1 and load into FSDP2 applied model (#129028 ) A lot of models have already been saving the model state in FULL_STATE_DICT mode with FSDP1 in APF. This unit test is just to demonstrate FSDP1 -> FSDP2 transition. The use of deprecating APIs in this test is intentional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129028 Approved by: https://github.com/awgu, https://github.com/fegin	2024-06-21 01:40:58 +00:00
Pian Pawakapan	8758fedbfc	[export] copy sym ops when respecting call module signature (#129153 ) Summary: Export, through AOTAutograd, [deduplicates](`11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)`) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules. If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener Test Plan: export unflatten test case Reviewed By: TroyGarden, angelayi Differential Revision: D58697231 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153 Approved by: https://github.com/angelayi	2024-06-21 01:40:22 +00:00
Valentine233	5da428d9eb	[cpu][flash attention] fix attention mask issue (#128816 ) For attention mask in flash attention: - Fix the issue of accessing illegal memory when the last size of mask is 1. - Add UT of attention mask for various shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128816 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-06-21 01:12:48 +00:00
PyTorch MergeBot	d4022b4658	Revert "[BE] enable UFMT for `torch/nn/modules` (#128594 )" This reverts commit 95ac2d648279ebc73feccf6d8eccafa4b2759de8. Reverted https://github.com/pytorch/pytorch/pull/128594 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128594#issuecomment-2181788935))	2024-06-21 00:50:08 +00:00
PyTorch MergeBot	cc8193c707	Revert "[BE] enable UFMT for `torch/nn/functional.py` (#128592 )" This reverts commit f6e6e55fa7d883a89ba99584f8632c260519ba73. Reverted https://github.com/pytorch/pytorch/pull/128592 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128592#issuecomment-2181783936))	2024-06-21 00:44:16 +00:00
PyTorch MergeBot	9c929f6ce9	Revert "[BE][Easy] enable UFMT for `torch/distributed/` (#128870 )" This reverts commit a0e1e20c4157bb3e537fc784a51d7aef1e754157. Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))	2024-06-21 00:38:28 +00:00
Jiong Gong	9dd8f8cf8b	[cpuinfo][submodule] bump cpuinfo to the latest to support amx isa check (#127505 ) Fix https://github.com/pytorch/pytorch/issues/127368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127505 Approved by: https://github.com/ezyang	2024-06-21 00:17:44 +00:00
Myungjin Lee	c027c8935b	[distributed] NCCL result code update (#128777 ) The nccl result codes are outdated. This PR fixes #128756. Fixes #128756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128777 Approved by: https://github.com/Skylion007	2024-06-20 23:51:39 +00:00
Huy Do	43060a1dbc	Add shard support to test_inductor (#129160 ) I added one more shard for inductor tests earlier in https://github.com/pytorch/pytorch/pull/129108, but didn't realize that the second shard didn't do any inductor tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129160 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-06-20 23:41:00 +00:00
Joel Schlosser	31d5753247	Short-term fix to preserve NJT metadata cache in torch.compile (#122836 ) Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile. For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors. NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing. Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836 Approved by: https://github.com/soulitzer	2024-06-20 23:15:53 +00:00
PyTorch MergeBot	63a724d8e1	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit 8771e3429c3d7327f08c48d547ad73546d5603b3. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))	2024-06-20 22:31:29 +00:00
Jing Xu	5fba5d83f0	add xpu for amp (#127276 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet	2024-06-20 21:49:35 +00:00
Jane Xu	adc14adb88	Fix flakiness with test_binary_op_list_error_cases (#129003 ) So how come this PR fixes any flakiness? Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky. Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach. So we improve the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003 Approved by: https://github.com/soulitzer	2024-06-20 21:48:22 +00:00
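The failure mode described in #129003 (exact error-message matching breaks once a C++ stack trace is appended) can be reproduced without PyTorch. In this sketch everything is hypothetical: `failing_op`, its message text and the fake stack trace are made up for illustration, and `fails_with` is simply a regex-based check in the spirit of "improve the test".

```python
import re


def failing_op(show_stacktrace: bool) -> None:
    """Raise an error whose message optionally carries a (made-up) stack trace."""
    message = "lists must have the same length"  # hypothetical message
    if show_stacktrace:
        message += "\nException raised from frame_a (hypothetical stack trace)"
    raise RuntimeError(message)


def fails_with(fn, exc_type, pattern: str) -> bool:
    """Return True if fn raises exc_type with a message that contains pattern."""
    try:
        fn()
    except exc_type as exc:
        return re.search(pattern, str(exc)) is not None
    return False


def message_of(fn) -> str:
    try:
        fn()
    except RuntimeError as exc:
        return str(exc)
    return ""


# Exact comparison depends on whether the stack trace is appended.
print(message_of(lambda: failing_op(True)) == "lists must have the same length")
# Pattern matching is stable in both modes.
print(fails_with(lambda: failing_op(False), RuntimeError, r"same length"))
print(fails_with(lambda: failing_op(True), RuntimeError, r"same length"))
```

In a `unittest` test case the same idea is available through `assertRaisesRegex`, which also searches the message instead of comparing it exactly.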
Thanh Ha	61fa3de4cb	ci: Hardcode runner-determinator (#128985 ) Hardcode the runner-determinator script for testing ALI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128985 Approved by: https://github.com/ZainRizvi	2024-06-20 21:14:23 +00:00
PyTorch MergeBot	aace8ffc00	Revert "[BE] enable UFMT for `torch/nn/*.py` (#128593 )" This reverts commit a87d82abd746240e7b46b992fa9df7ae6d3e6d4a. Reverted https://github.com/pytorch/pytorch/pull/128593 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128593#issuecomment-2181562604))	2024-06-20 21:09:44 +00:00
Animesh Jain	f2f4dde2d3	[dynamo] Remove ID_MATCH for FSDPModuleVariable (#129015 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129015 Approved by: https://github.com/yf225 ghstack dependencies: #129098	2024-06-20 19:23:32 +00:00
PyTorch MergeBot	e84cf805d2	Revert "Modularize aten parameter parser and checker (#125308 )" This reverts commit 60bbdc0b40656cf70b2b098c7d715e19f031fb0d. Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))	2024-06-20 18:52:05 +00:00
PyTorch MergeBot	254487f288	Revert "Separate AOTI Eager utils as a single file (#125819 )" This reverts commit 18634048a1f939a961b7c96b0acfe78b474c821e. Reverted https://github.com/pytorch/pytorch/pull/125819 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125819#issuecomment-2181317332))	2024-06-20 18:49:08 +00:00
PyTorch MergeBot	73340f0909	Revert "[3/N] Non-Tensor: Support string parameter for aten operations (#125831 )" This reverts commit a52c8ace98afe76dc9e2c330b415972fd1529077. Reverted https://github.com/pytorch/pytorch/pull/125831 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125831#issuecomment-2181313892))	2024-06-20 18:45:41 +00:00
Brian Hirsh	8c2542623b	[Traceable FSDP2] [Dynamo] Add tracing support for out-variant custom ops that return None (#129078 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129078 Approved by: https://github.com/yanboliang	2024-06-20 17:46:13 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	734891ac22	Fix export log script (#128967 ) Summary: Title Test Plan: CI Differential Revision: D58699557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128967 Approved by: https://github.com/jiashenC	2024-06-20 17:01:00 +00:00
Tijmen Blankevoort	ddb95dbb0d	Fixing equalize with three things and improving functionality (#124632 ) Summary: (1) Make code work when a first layer does not have a bias. (2) Make it possible to provide both modules and module names as input. (3) Allow sequences of contiguous layers as input, which then get split into pairs. (4) Fix documentation to be clearer about the inputs to be provided. Test Plan: Run this new version of the algorithm on a network and see if it throws errors. There's also this notebook to run and test: N5199827. If you tell me where I can find the tests for this code, I can add some simple unit tests as well. Differential Revision: D55895862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632 Approved by: https://github.com/jerryzh168	2024-06-20 16:55:56 +00:00
PyTorch MergeBot	832fc35211	Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013 )" This reverts commit 6d2b3c90f144d7b77d51da27e6696192b2b97ebd. Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing a flexattention test to fail on ROCm. Can you please fix that test before remerging this in? See `6d2b3c90f1` for details ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2181133070))	2024-06-20 16:51:41 +00:00
Zhengxu Chen	65286883d4	[export] reland "experimental joint graph API." (#129081 ) Summary: The previous diff was reverted even though CI was green. Test Plan: CI Differential Revision: D58790048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081 Approved by: https://github.com/tugsbayasgalan	2024-06-20 16:50:53 +00:00
PaliC	fc5b0ff2d7	[BE][Hackaday] deprecate legacy cuda docker image (#128859 ) Fixes https://github.com/pytorch/builder/issues/1795 from the pytorch side specifically for the cuda image Pull Request resolved: https://github.com/pytorch/pytorch/pull/128859 Approved by: https://github.com/atalman	2024-06-20 16:30:49 +00:00
Nikita Shulga	b2a9b8d485	[CpuInductor] Enable NEON ISA detection on Linux ARM (#129075 ) Also, cleanup code a bit to use `x in [y, z]` instead of `x == y or x == z` And do not redefine `at_align`, but instead use `alignas(64)` as was suggested in https://github.com/pytorch/pytorch/pull/128686/files#r1639365978 Test plan: `python3 -c "import torch._inductor.codecache as cc; isa = cc.valid_vec_isa_list()[0];print(str(isa), bool(isa))"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129075 Approved by: https://github.com/jansel	2024-06-20 16:22:57 +00:00
Huy Do	e0aa992d73	Fix inductor and deploy jobs timing out (#129108 ) Some trunk and periodic jobs are timing out at the moment, including: * `deploy`. This is because https://github.com/pytorch/pytorch/pull/127952 has removed `deploy` config, but there is one left over in periodic. * [periodic / linux-focal-cuda12.4-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu](https://github.com/pytorch/pytorch/actions/runs/9525590191/job/26260620457). * `inductor`, including `py3.10`, `py3.12`, and `cuda12.1`, `cuda12.4`. The increase comes from this change https://github.com/pytorch/pytorch/pull/128343, so I add another GPU shard. * [inductor / cuda12.1-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9522817887/job/26255069269) * [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9524651902/job/26260009757) * [inductor-cu124 / cuda12.4-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440205869) * [inductor-cu124 / cuda12.4-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440634200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129108 Approved by: https://github.com/malfet	2024-06-20 16:03:11 +00:00
Joel Schlosser	2bb8ee602b	Fix DEBUG=1 asserts with NJT ops (#129014 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-06-20 15:15:28 +00:00
rzou	7178b4e987	[Dynamo x torch_function] fix incorrect source (#128980 ) Fixes https://github.com/pytorch/pytorch/issues/128964 The problem was that we were installing the source for a type incorrectly. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/128980 Approved by: https://github.com/mlazos	2024-06-20 14:54:00 +00:00
Animesh Jain	ea47d542ca	[dynamo][guards] Remove BOOL_FALSE - not needed after C++ guards (#129098 ) PyDict_Size is very fast. Earlier, with Python guards, CPython would go through layers of fluff to finally call PyDict_Size. With C++ guards, it's not needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129098 Approved by: https://github.com/jansel	2024-06-20 14:40:27 +00:00
Oguz Ulgen	54b0006cb2	Evaluate symexprs on load path of cache not write (#128997 ) When caching is enabled, an internal model fails with ``` assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1)) AssertionError: expected size 17==17, stride 57344==54784 at dim=0 ``` looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected. This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path. I have not been able to come up with a unit test repro for this problem. Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997 Approved by: https://github.com/ezyang	2024-06-20 08:55:12 +00:00
Li-Huai (Allan) Lin	799acd31b4	[MPS] Add lu_factor (#99269 ) Generated by Copilot at d75cde1: Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269 Approved by: https://github.com/kulinseth, https://github.com/lezcano	2024-06-20 07:35:29 +00:00
Nikita Shulga	0d25f096c1	[CppInductor] Fix erfinv codegen when non-vectorized isa (#129090 ) Fix erfinv codegen when ISA could not be detected Manual test plan (on MacOS): - Modify `valid_vec_isa_list` to return empty list - Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool` Before this change, abovementioned test will fail with ``` Output: /var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv' auto tmp2 = calc_erfinv(tmp1); ^ 1 error generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-20 06:09:48 +00:00
chilli	6d2b3c90f1	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: <img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb"> After: <img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534"> After fixing strides: ![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-20 05:15:48 +00:00
Will Feng	ad2593cb86	[Animesh's PR #125340 ] [dynamo][fsdp] Track FSDPNNModuleVariable for mutations (#129045 ) This is a copy of Animesh's work in https://github.com/pytorch/pytorch/pull/125340, with very small changes to the unit test. It's needed sooner for the Traceable FSDP2 work, so I copy it here and will work through landing it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129045 Approved by: https://github.com/anijain2305	2024-06-20 04:02:36 +00:00
Li-Huai (Allan) Lin	19f3abcde4	[Docs][MPS] Add mps environment variable table (#129008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008 Approved by: https://github.com/malfet ghstack dependencies: #129006	2024-06-20 03:30:35 +00:00
Huy Do	609ffaf717	Add more shards for slow CPU and ROCm jobs (#128873 ) These jobs have started to time out in trunk `fc2913fb80/1`. Adding one more shard for the slow CPU job is trivial. ROCm runners are harder to find, but I assume that this is ok because slow jobs only run periodically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128873 Approved by: https://github.com/PaliC	2024-06-20 03:13:19 +00:00
Will Feng	d8db074988	[Traceable FSDP2] [Dynamo] Fix OptimizedModule._initialize to allow tracing into FSDP2 module hooks for module from user-defined module class (#129046 ) This is a workaround to allow inplace fully-sharded module to still go into this branch: `3a185778ed/torch/_dynamo/eval_frame.py (L163)` instead of the second branch: `3a185778ed/torch/_dynamo/eval_frame.py (L166)` If we don't do this, `torch.compile(fully_shard(module_from_user_defined_module_class))` will ignore all module hooks which will break FSDP tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129046 Approved by: https://github.com/anijain2305	2024-06-20 00:15:55 +00:00
Peter Bell	859fa183fe	BE: Use future annotations in inductor scheduler and ir (#128892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128892 Approved by: https://github.com/lezcano	2024-06-20 00:10:43 +00:00
chilli	a2b1673dfb	[Horace's PR #126446 ] Prevent partitioner from ever saving views (#129039 ) Most work is done by Horace in https://github.com/pytorch/pytorch/issues/126446, this PR just additionally adds the config for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129039 Approved by: https://github.com/Chillee	2024-06-19 23:21:16 +00:00
leslie-fang-intel	9d06e3783d	[Inductor][CPP] Fix the symbolic size cast issue in GEMM Benchmark (#128824 ) Summary: The symbolic size generated from the size hint (a Python int) differs from the C type `long` of the kernel args, which may cause the benchmark to fail to run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128824 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-19 23:11:53 +00:00
Jithun Nair	a6ac6447b5	Re-enable py3.12 nightly wheel builds and add triton dependency for ROCm (#128525 ) The llnl-hatchet developers have published the py3.12 binaries on [PyPI](https://pypi.org/project/llnl-hatchet/#files). In fact, looking [here](https://download.pytorch.org/whl/nightly/llnl-hatchet), it seems we already have the py3.12 wheels mirrored. This should allow us to re-enable py3.12 binaries for ROCm. This PR reverts commit 9d849d4312cd1e62d97b9e9d58979ec78d36c95f. It also adds the pytorch-triton-rocm dependency for torch wheels on ROCm since pytorch-triton-rocm py3.12 wheels are available now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128525 Approved by: https://github.com/malfet	2024-06-19 21:56:54 +00:00
Sam Larsen	571a0db132	[inductor] Fix logging for run_and_get_cpp_code (#128794 ) Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794 Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel	2024-06-19 21:32:34 +00:00
cyy	277f2914a5	[9/N] Remove unused functions (#128704 ) MKL can not be enabled on aarch64, and as CI compiles code with `-Werror=unused-function` it will fail to compile with ``` /usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/workspace/build/third_party/foxi -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-2.1.0 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. 
-I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/FP16/include -I/var/lib/jenkins/workspace/third_party/tensorpipe -I/var/lib/jenkins/workspace/build/third_party/tensorpipe -I/var/lib/jenkins/workspace/third_party/tensorpipe/third_party/libnop/include -I/var/lib/jenkins/workspace/third_party/fmt/include -I/var/lib/jenkins/workspace/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/third_party/flatbuffers/include -isystem /var/lib/jenkins/workspace/build/third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/workspace/third_party/protobuf/src -isystem /var/lib/jenkins/workspace/third_party/XNNPACK/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/eigen -isystem /var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/workspace/third_party/ideep/include -isystem /var/lib/jenkins/workspace/build/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct 
-Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp:426:15: error: ‘at::Tensor at::native::mkl_linear(const at::Tensor&, const at::Tensor&, const at::Tensor&, const std::optional<at::Tensor>&, int64_t)’ defined but not used [-Werror=unused-function] 426 \| static Tensor mkl_linear( \| ^~~~~~~~~~ ``` Follows #128499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128704 Approved by: https://github.com/malfet	2024-06-19 20:46:45 +00:00
Aleksei Nikiforov	fca408fa29	s390x vectorization: rework operators (#129066 ) Move operators from member functions to free functions. This is needed to fix torch inductor on s390x. This change fixes tests like DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py. This change also fixes a recently introduced build failure on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066 Approved by: https://github.com/malfet	2024-06-19 20:12:41 +00:00
Huy Do	73f5d2b787	Run ET unit tests on PT CI (#128560 ) This is the first PR to add all existing ET unit tests into PT CI. The goal is to improve coverage there to avoid changes in PT that could break ET. With this, any future unit tests on ET will automatically be run on PT CI. The duration of the job is now 40+ minutes, not too bad. This also fixed the failed ET build in https://github.com/pytorch/pytorch/pull/123043. Adding model coverage is a bit more involved and requires adding new shards, so I will follow up on that in separate PRs. [T192117506](https://www.internalfb.com/intern/tasks/?t=192117506), with the failed diffs D58295865 and D58394154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128560 Approved by: https://github.com/guangy10, https://github.com/digantdesai	2024-06-19 20:08:58 +00:00
PyTorch MergeBot	df94d57c0a	Revert "[export] experimental joint graph API. (#128847 )" This reverts commit 0707811286d1846209676435f4f86f2b4b3d1a17. Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))	2024-06-19 19:04:36 +00:00
Aaron Enye Shi	b5d541609d	[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072 ) Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations. Test Plan: CI Pulled By: aaronenyeshi Differential Revision: D55941362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072 Approved by: https://github.com/zdevito	2024-06-19 18:05:41 +00:00
Xu Han	bafd68b4fc	[inductor] fix windows python module ext and func export declaration (#129059 ) I have run the first inductor case on Windows based on the exploration code: https://github.com/pytorch/pytorch/pull/128330 Because a fundamental PR still needs to pass `fb_code`: https://github.com/pytorch/pytorch/pull/128303, this PR lands part of the exploration code: 1. Fix Windows python module ext type: pyd. 2. Add function export declaration for Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129059 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-19 17:51:32 +00:00
Zhengxu Chen	0707811286	[export] experimental joint graph API. (#128847 ) Summary: WARNING: This API is highly unstable and will be subject to change in the future. Add a prototype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph. Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental Differential Revision: D55657917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847 Approved by: https://github.com/tugsbayasgalan	2024-06-19 16:45:27 +00:00
Li-Huai (Allan) Lin	0fc603ece4	[optim] Fused implementation stability table (#129006 ) I'd like to discuss the criteria by which we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6-month period after the commit to regard it as stable. As a result, Adam and AdamW on CUDA would now be considered stable, while the rest are beta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006 Approved by: https://github.com/malfet	2024-06-19 16:29:49 +00:00
Jean Schmidt	1b92bdd0ea	[ALI] [Reland] Use LF runners for Lint (#129071 ) Quick experiment with using LF runners for lint jobs. Picking a set of jobs where infra failures would be obvious to most people (lint) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129071 Approved by: https://github.com/malfet	2024-06-19 16:10:51 +00:00
PaliC	236fbcbdf4	[Split Build] Test split build in pull CI workflow (#126813 ) This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813 Approved by: https://github.com/atalman ghstack dependencies: #127934	2024-06-19 15:57:21 +00:00
PaliC	7d33ff59ba	[Split Build]Use same package (#127934 ) This PR removes the second separate package we were using for the libtorch wheel. In terms of testing that this works, we will use the PRs above this in the stack. As for sanity checking, these are the wheels that are produced by running ``` python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_wheel && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake ``` ``` sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/ (pytorch-3.10) total 677236 drwxr-xr-x 1 sahanp users 188 Jun 4 12:19 ./ drwxr-xr-x 1 sahanp users 1696 Jun 4 12:59 ../ -rw-r--r-- 1 sahanp users 81405742 Jun 4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl -rw-r--r-- 1 sahanp users 612076919 Jun 4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934 Approved by: https://github.com/atalman	2024-06-19 15:57:21 +00:00
lyb	ffb50fb691	[ONNX] Add onnx::Gelu support for version 20 (#128773 ) Fixes https://github.com/pytorch/pytorch/issues/128772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128773 Approved by: https://github.com/justinchuby	2024-06-19 15:39:02 +00:00
Jean Schmidt	3397d5ef90	Revert "[ALI] Use lf runners for Lint" (#129070 ) Reverts pytorch/pytorch#128978 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129070 Approved by: https://github.com/atalman	2024-06-19 14:48:16 +00:00
Xu Zhao	118f9ceb7c	[inductor][ci] Fix torchbench dependency issue with numpy (#128968 ) For some reason, pip will always upgrade the numpy version even when an older version has been installed. We have to lock numpy version to the old version to make this constraint explicit. Torchbench commit: `23512dbebd` Second attempt to fix #128845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128968 Approved by: https://github.com/eellison	2024-06-19 12:10:50 +00:00
FFFrog	e49525275d	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-19 09:06:49 +00:00
Zain Rizvi	7fac03aee9	[ALI] Use lf runners for Lint (#128978 )	2024-06-19 10:59:07 +02:00
Daulet Askarov	50567f7081	Pass device to is_pinned call inside TensorProperties.create_from_tensor (#128896 ) Summary: The default input device for the is_pinned function is CUDA. This can unnecessarily create a CUDA context for CPU tensors when just generating TensorProperties, bloating memory usage. Passing the device to the is_pinned call site inside create_from_tensor solves this issue. This also fixes Model Store test https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 which is currently broken on memory usage assertions. Test Plan: UT Differential Revision: D58695006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896 Approved by: https://github.com/fegin	2024-06-19 08:50:46 +00:00
Frank Lin	d3e8b8bf47	Remove cuda check in the CUDAGraph destructor (#127382 ) Fixes #125804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127382 Approved by: https://github.com/eqy, https://github.com/eellison	2024-06-19 08:09:31 +00:00
Bin Bao	ba92f5277f	[inductor][refactor] Unify the use of generate_kernel_call (#128467 ) Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and simplifies later changes in this PR stack. Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467 Approved by: https://github.com/shunting314	2024-06-19 07:47:25 +00:00
Colin Peppler	3a185778ed	[aotinductor] Add torch.polar fallback op for shim v2 (#128722 ) Compilation error: ``` $ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar /tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’? ``` Steps: 1. Add aten.polar 2. run `python torchgen/gen.py --update-aoti-c-shim`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-19 05:06:58 +00:00
PyTorch MergeBot	a584b2a389	Revert "Add test to xfail_list only for abi_compatible (#128506 )" This reverts commit df85f34a14dd30f784418624b05bd52b12ab8b0b. Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to The failure shows up in trunk `df85f34a14` ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2177744578))	2024-06-19 04:59:10 +00:00
drisspg	fcf2a1378b	Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989 ) # Summary First PR got reverted and needed a redo. This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use: `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this. Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989 Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo	2024-06-19 04:49:39 +00:00
Sam Larsen	2f88597aad	[inductor] For internal, allow multiple workers if the method is "subprocess" (#129002 ) Summary: This does not change the current default behavior in fbcode ("fork" if unspecified and no worker processes if unspecified). But it allows us to more easily test the subprocess-based parallel if we override the start method to subprocess. Test Plan: Set `TORCHINDUCTOR_WORKER_START=subprocess` and locally ran all torchbench models listed [here](https://www.internalfb.com/intern/wiki/PyTorch/Teams/PyTorch_Perf_Infra/TorchBench/#torchbench-internal-mode) Differential Revision: D58755021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129002 Approved by: https://github.com/eellison	2024-06-19 04:28:27 +00:00
Jerry Mannil	1f0a68b572	[ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (#128750 ) Current implementation is very specific to MI100. This is causing performance degradation for other GPUs. Fixes #128631 Benchmarking on MI300X: ``` Before: 1918.5126953125 ms After: 0.8285150527954102 ms ``` Co-authored-by: Jeff Daily <jeff.daily@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128750 Approved by: https://github.com/xw285cornell	2024-06-19 03:56:20 +00:00
Yanbo Liang	acefc5c016	[torch.compile] Enable bwd compilation metrics (#128973 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128973 Approved by: https://github.com/dshi7	2024-06-19 03:45:41 +00:00
chilli	eb9f4da11e	Modified template indexing to broadcast indices to out instead of mask and some other flexattention micro-opts (#128938 ) For headdim=64 and headdim=128 Old: <img width="656" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2c5d1613-96dc-4300-8dc0-dccaef59e73c"> New: <img width="644" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/730004a8-6d5f-46a5-82a0-2594feb5e192"> Note, this does regress headdim=256. We can unregress it by special casing `headdim=256`, but ehh.... we can do it later Pull Request resolved: https://github.com/pytorch/pytorch/pull/128938 Approved by: https://github.com/drisspg	2024-06-19 03:41:22 +00:00
Yifu Wang	8771e3429c	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example
```python
import torch
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backend might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypass SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```
### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels. * __->__ #128582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-19 03:38:58 +00:00
Alnis Murtovi	ed5b8432cd	Enable mixed_mm only if casting from lower-bitwidth type to a higher one (#128899 ) This PR changes the behavior of `cuda_and_enabled_mixed_mm` such that mixed_mm is only enabled if we are casting from a lower-bitwidth type to a higher one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128899 Approved by: https://github.com/eellison	2024-06-19 03:12:18 +00:00
Wu, Chunyuan	df85f34a14	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-19 01:18:37 +00:00
Thanh Ha	4bc90185fb	fix: Print statements causing parse error (#128969 ) The print statements in the get_workflow_type script are problematic because the shell script calling this script expects the output to be only JSON. This PR resolves this by removing all print statements and converting them to a message field in the JSON return output, so that the output remains valid JSON while still giving us the debug data we are looking for. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969 Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi	2024-06-19 01:17:08 +00:00
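The approach described in the commit above, folding debug text into the JSON payload instead of printing it, can be sketched as follows. This is a minimal illustration under assumptions, not the actual get_workflow_type script: the `build_output` helper and the `workflow_type` key are hypothetical, and only the `message` field is named in the commit message.

```python
import json


def build_output(workflow_type, debug_messages):
    """Hypothetical helper: return one JSON document and nothing else.

    Debug text that used to be printed is joined into the "message" field,
    so a caller that parses stdout as JSON never sees stray lines.
    """
    payload = {
        "workflow_type": workflow_type,  # hypothetical key name
        "message": " ".join(debug_messages),
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # The single print below is the only thing written to stdout.
    print(build_output("default", ["Checked labels.", "No opt-in found."]))
```

The calling shell script can then parse stdout as JSON and still read the debug text from the `message` field.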
leslie-fang-intel	eda375a490	[Inductor] Remove min/max from inductor opinfo test (#128925 ) Summary Remove `max.binary, min.binary, maximum, minimum` from `inductor_one_sample` op list as we fix the bool vectorization issue in https://github.com/pytorch/pytorch/pull/126841. Test Plan
```
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_maximum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_minimum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_min_binary
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_max_binary
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128925 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-06-19 01:14:27 +00:00
xinan.lin	2458f79f83	[Inductor UT][Intel GPU] Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU (#128881 ) Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU because it has not implemented reduction kernel split. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128881 Approved by: https://github.com/blaine-rister, https://github.com/EikanWang, https://github.com/malfet	2024-06-19 00:44:57 +00:00
PyTorch MergeBot	b0d2fe6299	Revert "Short-term fix to preserve NJT metadata cache in torch.compile (#122836 )" This reverts commit 2a41fc03903de63270d325bd1886a50faf32d7e4. Reverted https://github.com/pytorch/pytorch/pull/122836 on behalf of https://github.com/jbschlosser due to internal test failures with DEBUG=1 asserts ([comment](https://github.com/pytorch/pytorch/pull/122836#issuecomment-2177298245))	2024-06-19 00:28:53 +00:00
PyTorch MergeBot	5ffb032be6	Revert "Backward support for unbind() with NJT (#128032 )" This reverts commit 5dc4f652bc5c068ef15130c955e3f2ffe11f4b74. Reverted https://github.com/pytorch/pytorch/pull/128032 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128032#issuecomment-2177296325))	2024-06-19 00:26:40 +00:00
Jane Xu	35c78668b4	Improve the debugging message for when foreach mta_called (#128991 ) The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern: - a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called. - then, a later test fails deterministically, usually failing to compare two results. ``` ================== 1 failed, 241 deselected, 2 rerun in 1.76s ================== Got exit code 1 Stopping at first consistent failure The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16'] The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16'] ``` So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally. Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991 Approved by: https://github.com/clee2000	2024-06-19 00:25:09 +00:00
PyTorch MergeBot	99f042d336	Revert "Forward fix to skip ROCm tests for #122836 (#128891 )" This reverts commit 4061b3b8225f522ae0ed6db00111441e7d3cc3d5. Reverted https://github.com/pytorch/pytorch/pull/128891 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128891#issuecomment-2177291249))	2024-06-19 00:21:21 +00:00
Animesh Jain	670b94c9c8	[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484 Approved by: https://github.com/mlazos ghstack dependencies: #128428	2024-06-19 00:06:46 +00:00
Animesh Jain	c5e0b84484	[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428 Approved by: https://github.com/yanboliang, https://github.com/mlazos	2024-06-19 00:06:46 +00:00
cyy	cb5e9183c6	[Caffe2] [2/N] Remove Caffe2 from tests (#128911 ) Follows #128675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128911 Approved by: https://github.com/titaiwangms, https://github.com/r-barnes	2024-06-19 00:05:50 +00:00
Andrew Gu	ac5f565fa7	[FSDP2] Added `set_post_optim_event` (#128975 ) This PR adds `set_post_optim_event` that allows power users to provide their own CUDA event that is recorded after the optimizer step for the FSDP root module to wait the all-gather streams on. ``` def set_post_optim_event(self, event: torch.cuda.Event) -> None: ``` By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models. To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that the user can provide their own CUDA event to wait the all-gather streams on. We include both correctness test (`test_fully_shard_training.py`) and overlap test (`test_fully_shard_overlap.py`). --- One possible way to use the API is to register a post-step hook on the optimizer. For example: `12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975 Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy ghstack dependencies: #128884	2024-06-18 22:26:14 +00:00
Jokeren	d9c294c672	[Inductor] Fix arguments passed to triton kernel launch hooks (#128732 ) `binary.launch_enter_hook` is treated as an instance method and will add a `self` argument to the hooks. `CompiledKernel.launch_enter_hook` is a static method, which matches the hook calling convention of profilers (i.e., a single `LazyDict` argument only). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128732 Approved by: https://github.com/shunting314, https://github.com/bertmaher	2024-06-18 22:06:55 +00:00
Xuehai Pan	a0e1e20c41	[BE][Easy] enable UFMT for `torch/distributed/` (#128870 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870 Approved by: https://github.com/fegin ghstack dependencies: #128868, #128869	2024-06-18 21:49:08 +00:00
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Xuehai Pan	cec31050b4	[BE][Easy] enable UFMT for `torch/distributed/{tensor,_tensor}/` (#128868 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128868 Approved by: https://github.com/fegin	2024-06-18 21:49:02 +00:00
Nikita Shulga	e47603a549	Fix weight_norm decomposition behavior (#128956 ) By upcasting norm to float32 to align with CUDA and CPU behaviors `e6d4451ae8/aten/src/ATen/native/WeightNorm.cpp (L56-L59)` Discovered this when started running OpInfo tests, see https://github.com/pytorch/pytorch/actions/runs/9552858711/job/26332062502#step:20:1060 ``` File "/var/lib/jenkins/workspace/test/test_decomp.py", line 185, in op_assert_ref assert orig.dtype == decomp.dtype, f"{i} Operation: {op}" AssertionError: 1 Operation: aten._weight_norm_interface.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128956 Approved by: https://github.com/albanD ghstack dependencies: #128955	2024-06-18 21:24:12 +00:00
Aaron Enye Shi	2227da4431	[Profiler] Clean up use_mtia to follow standard use_device instead (#126284 ) Summary: use_mtia should instead set use_device='mtia' similar to cuda, xpu, and privateuseone. Avoid an ever-growing list of use_* arguments. Since use_mtia is specific to FBCode, we don't need a deprecation warning. Test Plan: CI. Differential Revision: D57338005 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284 Approved by: https://github.com/fenypatel99	2024-06-18 21:01:03 +00:00
dependabot[bot]	4cc3fb5ee2	Bump urllib3 from 2.2.1 to 2.2.2 in /tools/build/bazel (#128908 ) Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2. - [Release notes](https://github.com/urllib3/urllib3/releases) - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst) - [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2) --- updated-dependencies: - dependency-name: urllib3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-06-18 13:38:22 -07:00
Joel Schlosser	5dc4f652bc	Backward support for unbind() with NJT (#128032 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032 Approved by: https://github.com/soulitzer	2024-06-18 20:29:00 +00:00
PyTorch MergeBot	44722c6b10	Revert "[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453 )" This reverts commit 2b28b107dbafeec18d1095a2002e79511aa241df. Reverted https://github.com/pytorch/pytorch/pull/128453 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
PyTorch MergeBot	1babeddbbf	Revert "[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 )" This reverts commit 1f6e84fa6852805e15ddc9583c5f36c3a7f93df8. Reverted https://github.com/pytorch/pytorch/pull/128484 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
PyTorch MergeBot	5bc9835d64	Revert "[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 )" This reverts commit c52eda896eb3ec7f8d04b6321861f4c5614a40bb. Reverted https://github.com/pytorch/pytorch/pull/128428 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
Li-Huai (Allan) Lin	9a7e2519d3	[MPS] Fused Adam & AdamW (#127242 ) Summary: This PR adds fused Adam and AdamW implementations. Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory: Fast math enabled:
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                       |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |           10  |           100
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |            9  |            89
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |            9  |            90
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |            9  |            83
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |           12  |            94
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |           11  |            88
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |           12  |            90
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |           11  |           100
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |           27  |           100
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |           23  |           100
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |           27  |           100
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |           23  |            98
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |           82  |           480
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |           72  |           450
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |           82  |           450
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |           73  |           420
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |           91  |           500
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |           83  |           400
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |           94  |           500
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |           78  |           400
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |          170  |           500
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |          140  |           600
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |          170  |           600
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |          140  |           500
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |          250  |           890
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |          220  |           850
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |          250  |           830
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |          220  |           770
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |          270  |           870
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |          230  |           840
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |          270  |           810
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |          240  |           800
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |          400  |          1000
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |          360  |          2000
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |          430  |          2000
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |          360  |          1300

Times are in milliseconds (ms).
```
Fast math disabled:
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                       |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |           10  |           100
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |            9  |            84
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |            9  |            84
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |            9  |            79
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |           11  |            93
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |           10  |            90
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |           11  |            91
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |           11  |            81
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |           34  |           100
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |           31  |           100
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |           34  |            95
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |           31  |           100
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |           94  |           500
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |           82  |           430
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |           92  |           430
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |           81  |           390
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |           98  |           500
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |           88  |           430
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |          100  |           500
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |           88  |           400
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |          210  |           500
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |          190  |           610
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |          210  |           510
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |          190  |           500
  amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |          300  |           900
  amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |          260  |           850
  amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |          295  |           900
  amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |          260  |           800
  amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |          320  |           910
  amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |          280  |           900
  amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |          320  |           900
  amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |          300  |           900
  amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |          500  |          2000
  amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |          480  |          2000
  amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |          540  |          1500
  amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |          480  |          1200

Times are in milliseconds (ms).
```
```python
import torch


def profile_fused_adam():
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark
    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        # Wait for the MPS work to finish so the timer measures the full step.
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:
            t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label='Fused Adam',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description=f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242 Approved by: https://github.com/kulinseth, https://github.com/janeyx99	2024-06-18 19:59:50 +00:00
Chien-Chin Huang	fe8558b7aa	[DSD] Add unittest to verify HSDP1 + broadcast_from_rank0 (#128755 ) HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0, so we need a unit test to cover this use case. This test relies on the fix from https://github.com/pytorch/pytorch/pull/128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128755 Approved by: https://github.com/Skylion007, https://github.com/wz337 ghstack dependencies: #128685	2024-06-18 19:42:51 +00:00
Sam Larsen	abde6cab4c	Remove compile_threads=1 in test_inductor_collectives.py (#128580 ) Summary: I believe https://github.com/pytorch/pytorch/issues/125235 should be fixed after switching to subprocess-based parallel compile. Test Plan: Ran locally with python-3.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128580 Approved by: https://github.com/eellison	2024-06-18 19:31:13 +00:00
Boyuan Feng	04a5d3228e	[ts migration] Support prim::tolist and aten::len (#128894 ) Support prim::tolist and aten::len. Add unit tests for prim::min. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128894 Approved by: https://github.com/angelayi	2024-06-18 19:11:07 +00:00
Nikita Shulga	44483972bd	[EZ] Keep weight_norm var name aligned (#128955 ) To keep it aligned with `e6d4451ae8/aten/src/ATen/native/native_functions.yaml (L6484)` I.e. `x`->`v`, `y`->`g` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128955 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-06-18 18:40:59 +00:00
Animesh Jain	bdffd9f0c6	[export] Graph break on nn.Parameter construction (#128935 ) Fixes https://github.com/pytorch/pytorch/issues/126109 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128935 Approved by: https://github.com/angelayi	2024-06-18 18:37:44 +00:00
Chien-Chin Huang	1a527915a6	[DSD] Correctly handle shared parameters for optimizer state_dict (#128685 ) * Fixes https://github.com/pytorch/pytorch/issues/128011 See the discussion in https://github.com/pytorch/pytorch/pull/128076 Current implementation of `set_optimizer_state_dict()` assumes that all the fqns returned by `_get_fqns()` must exist in the optimizer state_dict. This is not true if the model has shared parameters. In such a case, only one fqn of the shared parameters will appear in the optimizer state_dict. This PR addresses the issue. Differential Revision: [D58573487](https://our.internmc.facebook.com/intern/diff/D58573487/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128685 Approved by: https://github.com/LucasLLC	2024-06-18 18:34:32 +00:00
loganthomas	d77a1aaa86	DOC: add note about same sized tensors to dist.gather() (#128676 ) Fixes #103305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128676 Approved by: https://github.com/wconstab	2024-06-18 18:26:07 +00:00
soulitzer	1877b7896c	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) ### bc-breaking for existing users of the private API: - Existing policy functions must now change their return value to be [CheckpointPolicy](`c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)`) Enum instead of bool. - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy. - Policy function now accepts a `ctx` object instead of `mode` for its first argument. - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts `. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? 
- we either want to save all or recompute all random tensors) Tensor object preservation - ~We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ UPDATE : metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". 
Adding this indirection gives us flexibility to add more attrs later if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-18 18:18:50 +00:00
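The migration described in the bc-breaking notes of the commit above can be sketched as a small adapter around an old bool-returning policy. To keep the sketch runnable without PyTorch, `CheckpointPolicy` and `Ctx` below are local stand-ins (assumptions) for `torch.utils.checkpoint.CheckpointPolicy` and the `ctx` object; `adapt_bool_policy` and `old_policy` are hypothetical names, and the old `(mode, func, *args, **kwargs)` signature is assumed from the commit text.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CheckpointPolicy(Enum):
    # Local stand-in mirroring the enum values named in the commit message.
    MUST_SAVE = auto()
    PREFER_SAVE = auto()
    MUST_RECOMPUTE = auto()
    PREFER_RECOMPUTE = auto()


@dataclass
class Ctx:
    # Local stand-in for the ctx object; per the commit it only holds is_recompute.
    is_recompute: bool


def adapt_bool_policy(old_policy, save=CheckpointPolicy.MUST_SAVE):
    """Wrap an old (mode, func, ...) -> bool policy in the new ctx/enum shape."""

    def new_policy(ctx, func, *args, **kwargs):
        # Rebuild the old `mode` string exactly as the commit message suggests.
        mode = "recompute" if ctx.is_recompute else "forward"
        if old_policy(mode, func, *args, **kwargs):
            return save  # pass PREFER_SAVE to let the compiler override
        return CheckpointPolicy.PREFER_RECOMPUTE

    return new_policy


def old_policy(mode, func, *args, **kwargs):
    # Hypothetical legacy policy: save only the op named "matmul".
    return func == "matmul"


policy = adapt_bool_policy(old_policy)
```

With the real API, the adapted function would be handed to `create_selective_checkpoint_contexts` in the same way the old one was handed to `_pt2_selective_checkpoint_context_fn_gen`.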
PyTorch MergeBot	77830d509f	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit 7a39755da28d5a109bf0c37f72b364d3a83137b1. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))	2024-06-18 18:11:43 +00:00
Huy Do	84c86e56bd	Update tracker issues after successfully cherry-picking a PR (#128924 ) This extends the capacity of the cherry-pick bot to automatically update the tracker issue with the information. For this to work, the tracker issue needs to be an open one with a `release tracker` label, i.e. https://github.com/pytorch/pytorch/issues/128436. The version from the release branch, i.e. `release/2.4`, will be matched with the title of the tracker issue, i.e. `[v.2.4.0] Release Tracker` or `[v.2.4.1] Release Tracker` ### Testing `python cherry_pick.py --onto-branch release/2.4 --classification release --fixes "DEBUG DEBUG" --github-actor huydhn 128718` * On the PR https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174846771 * On the tracker issue https://github.com/pytorch/pytorch/issues/128436#issuecomment-2174846757 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128924 Approved by: https://github.com/atalman	2024-06-18 17:48:47 +00:00
eqy	4e03263224	[CUDA][Convolution] Add missing launch bounds to `vol2col_kernel` (#128740 ) Fix "too many resources requested" that can happen with recent toolkits on V100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128740 Approved by: https://github.com/mikaylagawarecki	2024-06-18 17:26:23 +00:00
Kazuaki Ishizaki	26e374e3ca	[EZ] Fix typos in RELEASE.md (#128769 ) This PR fixes typo in `RELEASE.md` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128769 Approved by: https://github.com/yumium, https://github.com/mikaylagawarecki	2024-06-18 17:15:05 +00:00
Guilherme Leobas	9818283da1	re-enable jacrev/jacfwd/hessian after #128028 landed (#128622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128622 Approved by: https://github.com/zou3519	2024-06-18 17:08:58 +00:00
eqy	ec616da518	RNN API cleanup for cuDNN 9.1 (#122011 ) Can potentially avoid a bit of boilerplate if we move directly to cuDNN 9.1's RNN API... Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122011 Approved by: https://github.com/Skylion007	2024-06-18 16:16:38 +00:00
David Berard	108318ad10	[BE][JIT] Handle case where codegen object can be unset (#128951 ) Summary: Unblocks a test that's failing. `codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly. Test Plan: ``` buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add ``` Differential Revision: D58727391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951 Approved by: https://github.com/aaronenyeshi	2024-06-18 15:40:45 +00:00
Isuru Fernando	4817180601	make fallback for aten.argsort.stable (#128907 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128907 Approved by: https://github.com/lezcano ghstack dependencies: #128343	2024-06-18 14:56:35 +00:00
Xuehai Pan	22d258427b	[BE][Easy] enable UFMT for `torch/distributed/_shard/` (#128867 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128867 Approved by: https://github.com/fegin ghstack dependencies: #128866	2024-06-18 14:39:25 +00:00
Xuehai Pan	e6d4451ae8	[BE][Easy] enable UFMT for `torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/` (#128866 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866 Approved by: https://github.com/fegin	2024-06-18 13:51:53 +00:00
Andrew Gu	f2805a0408	[FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884 ) This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively. ``` def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]) -> None def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]) -> None ``` Motivation FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order. However, there may be cases where with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)). The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do. Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. 
record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs). Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884 Approved by: https://github.com/ckluk2, https://github.com/weifengpy	2024-06-18 13:32:57 +00:00
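The "prefetch limit on top of the extension point" idea from the commit above can be sketched in plain Python. This is only an illustration under assumptions: `plan_prefetch` and `apply_forward_prefetch` are hypothetical helpers, `order` is assumed to be a post-forward order the user recorded themselves (for example with forward hooks), and the only PyTorch name used is the `set_modules_to_forward_prefetch` method introduced by the commit.

```python
from typing import Dict, List, Sequence, TypeVar

M = TypeVar("M")


def plan_prefetch(order: Sequence[M], limit: int) -> Dict[int, List[M]]:
    """Map each position in `order` to the next `limit` modules after it."""
    return {i: list(order[i + 1 : i + 1 + limit]) for i in range(len(order))}


def apply_forward_prefetch(order: Sequence[M], limit: int) -> None:
    """Tell every module which later modules it should prefetch in forward."""
    for i, targets in plan_prefetch(order, limit).items():
        if targets:  # the last module has nothing left to prefetch
            order[i].set_modules_to_forward_prefetch(targets)
```

A backward variant would presumably run the same plan over `list(reversed(order))` and call `set_modules_to_backward_prefetch` instead.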
Ahmed Gheith	3dd5f0ecbb	Remove circular import (#128875 ) Summary: A spurious import is causing circular dependency errors Test Plan: phabricator signals Differential Revision: D58685676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875 Approved by: https://github.com/kit1980	2024-06-18 12:30:13 +00:00
leslie-fang-intel	304c934572	Move MKLDNN Specific IR to Separate File (#126504 ) Summary Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file. Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504 Approved by: https://github.com/desertfire, https://github.com/jgong5 ghstack dependencies: #126841, #126940	2024-06-18 09:29:13 +00:00
Chien-Chin Huang	6e43897912	[BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843 ) Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512. Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843 Approved by: https://github.com/wz337	2024-06-18 07:05:31 +00:00
Chien-Chin Huang	60baeee59f	[BE] Skip the test if CUDA is not available (#128885 ) As title Differential Revision: [D58690210](https://our.internmc.facebook.com/intern/diff/D58690210/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128885 Approved by: https://github.com/wz337	2024-06-18 07:02:44 +00:00
Will Feng	e3a39d49a0	[Traceable FSDP][Compiled Autograd] Add queue_callback() support (#126366 ) Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2. Important tests: - `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error` - `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback` - `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366 Approved by: https://github.com/xmfan	2024-06-18 06:22:14 +00:00
Chirag Pandya	f7eae27946	Pass params to dump_nccl_trace_pickle (#128781 ) Summary Pass parameters from request to dump_nccl_trace_pickle handler. The supported parameters + value are all lowercase. includecollectives={true, false} includestacktraces={true, false} onlyactive={true, false} Example post is: /handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true Test Plan: unit tests Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781 Approved by: https://github.com/d4l3k	2024-06-18 03:46:57 +00:00
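The lowercase `true`/`false` parameters described in the commit above can be parsed with the standard library alone. This is a sketch, not the handler's actual implementation: `parse_dump_flags` is a hypothetical helper, and defaulting an absent parameter to `True` is an assumption that the commit message does not confirm.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names taken from the commit message.
KNOWN_FLAGS = ("includecollectives", "includestacktraces", "onlyactive")


def parse_dump_flags(url: str, default: bool = True) -> dict:
    """Turn the request's query string into a dict of booleans."""
    query = parse_qs(urlsplit(url).query)
    flags = {}
    for name in KNOWN_FLAGS:
        values = query.get(name)
        if values is None:
            flags[name] = default  # assumed default, see the note above
        elif values[-1] in ("true", "false"):
            flags[name] = values[-1] == "true"
        else:
            raise ValueError(f"{name} must be 'true' or 'false', got {values[-1]!r}")
    return flags


example = parse_dump_flags(
    "/handler/dump_nccl_trace_pickle"
    "?includecollectives=true&includestacktraces=false&onlyactive=true"
)
```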
Joona Havukainen	d9eaa224f2	Fixes #128429 : NaN in triu op on MPS (#128575 ) Fixes triu op when k > 0 and the lower triangle of the input tensor contains inf leading to NaNs in the computation through complement. Fixed by using select API instead. Fixes #128429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575 Approved by: https://github.com/kulinseth	2024-06-18 03:44:42 +00:00
Tristan Rice	59b4983dc0	DebugPlane: add dump_traceback handler (#128904 ) This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping. Test plan: ``` python test/distributed/elastic/test_control_plane.py -v -k traceback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904 Approved by: https://github.com/c-p-i-o	2024-06-18 03:40:16 +00:00
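The temporary-file buffering mentioned in the commit above follows from `faulthandler.dump_traceback` writing to a file descriptor rather than to an in-memory stream. A minimal standalone sketch of that pattern, where `dump_all_tracebacks` is a hypothetical name and not the handler's real code:

```python
import faulthandler
import tempfile


def dump_all_tracebacks() -> str:
    """Return the tracebacks of all running threads as one string."""
    # faulthandler needs a real file descriptor, so io.StringIO will not work;
    # a temporary file serves as the buffer instead.
    with tempfile.TemporaryFile() as tmp:
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        tmp.seek(0)
        return tmp.read().decode("utf-8", errors="replace")
```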
Xu Han	17abbafdfc	[inductor] Fix some windows cpp builder issue (#128765 ) 1. fix some Windows build args. 2. fix c++20 likely issue on Windows, reference: https://github.com/pytorch/pytorch/pull/124997. 3. remove the compiler return value check: different compilers return different values, so check for an exception to catch errors instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128765 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-18 03:25:20 +00:00
Joel Schlosser	4061b3b822	Forward fix to skip ROCm tests for #122836 (#128891 ) Fixes broken ROCm tests from #122836. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128891 Approved by: https://github.com/huydhn ghstack dependencies: #127007, #128057, #122836	2024-06-18 03:01:19 +00:00
Animesh Jain	c017c97333	[dynamo][inlining-inbuilt-nn-modules] Update test output (#128880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128880 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748, #128877, #128878	2024-06-18 02:18:09 +00:00
Animesh Jain	4e97d37fd9	[inlining-inbuilt-nn-modules][pre-grad] Adjust efficient_conv_bn_eval_graph for inlining (#128878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128878 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748, #128877	2024-06-18 02:18:09 +00:00
Animesh Jain	22f1793c0a	[dynamo][easy] Use LazyVariableTracker for UserDefinedObject var_getattr (#128877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128877 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748	2024-06-18 02:17:56 +00:00
Boyuan Feng	43998711a7	[CUDAGraph] add more docs for cudagraph trees (#127963 ) This PR adds more documentation for CUDAGraph Trees, including - Iteration Support - Input Mutation Support - Dynamic Shape Support - NCCL Support - Reasons for Skipping CUDAGraph Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963 Approved by: https://github.com/eellison	2024-06-18 02:07:07 +00:00
Fuzzkatt	e12fa93b8b	add is_big_gpu(0) check to test_select_algorithm tests in tests/inductor/test_cuda_cpp_wrapper.py (#128652 ) In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`: ``` /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. warnings.warn( W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode frames [('total', 1), ('ok', 1)] stats [('calls_captured', 2), ('unique_graphs', 1)] inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)] aot_autograd [('total', 1), ('ok', 1)] F ====================================================================== FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, kwargs) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test return value(self) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn _, code = test_torchinductor.run_and_get_cpp_code( File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code result = fn(args, *kwargs) File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped return fn(args, *kwargs) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File 
"/usr/lib/python3.10/unittest/mock.py", line 1379, in patched return func(newargs, *newkeywargs) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, **kwds) File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual raise error_metas.pop()[0].to_error( AssertionError: Scalars are not equal! Expected 1 but got 0. Absolute difference: 1 Relative difference: 1.0 ``` Looking into it, we see the failure is from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm ` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. Printing torch.cuda.get_device_properties(0).multi_processor_count returns 16 on the computelab AGX Orin; thus it makes sense that this check is failing, since the min_required_sms is 68, thus not letting it pick the autotune algorithm. Looking at the main for test_select_algorithm.py, we see that these tests should only be run if is_big_gpu(0) is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. Thus this PR adds a similar check to the invocation of these tests in test_cuda_cpp_wrapper.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652 Approved by: https://github.com/soulitzer, https://github.com/eqy	2024-06-18 02:00:04 +00:00
Huy Do	9e8443b56f	Remove dtype from gpt-fast micro benchmark experiments model name (#128789 ) Per comments on https://github.com/pytorch/test-infra/pull/5344, we already have a dtype column with the same information Pull Request resolved: https://github.com/pytorch/pytorch/pull/128789 Approved by: https://github.com/yanboliang	2024-06-18 01:26:45 +00:00
Shangdi Yu	fbc7559ceb	[custom ops] convert string type annotation to real type (#128809 ) Fixes #105157 Bug source: `from __future__ import annotations` converts type annotation to strings to make forwards references easier. However, existing custom ops do not consider strings to be valid types. Fix: We check if the argument and return type annotation is string type. If so, we try to use `eval` to convert it to a type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809 Approved by: https://github.com/zou3519	2024-06-18 00:55:50 +00:00
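The bug source described in the commit above is easy to reproduce without custom ops. The sketch below writes the annotations as explicit strings, which is what `from __future__ import annotations` effectively produces; `scale` and `resolve_annotation` are hypothetical names, and the `eval`-based resolution only mirrors the approach the commit message describes rather than the actual PyTorch code.

```python
import inspect


def scale(x: "float", factor: "int" = 2) -> "float":
    # String annotations, as seen by code inspecting a module that uses
    # `from __future__ import annotations`.
    return x * factor


def resolve_annotation(fn, annotation):
    """Convert a string annotation back to a real type; pass others through."""
    if isinstance(annotation, str):
        # Evaluate in the function's own module namespace.
        return eval(annotation, fn.__globals__)
    return annotation


raw = inspect.signature(scale).parameters["x"].annotation
resolved = resolve_annotation(scale, raw)
```

Here `raw` is the string `"float"`, while `resolved` is the `float` type itself. The standard library's `typing.get_type_hints` performs a similar resolution and is a common alternative to calling `eval` directly.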
leslie-fang-intel	c35ffaf954	[Inductor][CPP] Add ne with VecMask (#126940 ) Summary Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161 which is missing the support of `ne` with `VecMask`. Test Plan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool ``` Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126841	2024-06-18 00:23:03 +00:00
leslie-fang-intel	beb29836cd	[Inductor][CPP] Add Min/Max with VecMask (#126841 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126824 which is missing the support of `min/max` with `VecMask`. TestPlan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool ``` Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-06-18 00:20:32 +00:00
chilli	11ff5345d2	Changed colored logging to only be turned on if printing to interactive terminal (#128874 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128874 Approved by: https://github.com/anijain2305	2024-06-17 23:53:26 +00:00
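The usual way to gate colored output on an interactive terminal is an `isatty()` check on the target stream. A minimal sketch of that technique follows; `supports_color` and `colorize` are hypothetical helpers and are not taken from the commit's diff.

```python
import sys


def supports_color(stream=None) -> bool:
    """True only when the stream is attached to an interactive terminal."""
    stream = sys.stderr if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


def colorize(text: str, stream=None, code: str = "31") -> str:
    """Wrap text in an ANSI color code, but only for interactive terminals."""
    if supports_color(stream):
        return f"\033[{code}m{text}\033[0m"
    return text  # pipes, files and CI logs get plain text
```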
awayzjj	b70440f0a7	Document the torch.cuda.profiler.profile function (#128216 ) Fixes https://github.com/pytorch/pytorch/issues/127901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128216 Approved by: https://github.com/malfet, https://github.com/eqy	2024-06-17 23:42:40 +00:00
Edward Z. Yang	95b5ea9cde	Add mark_unbacked (#128638 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128638 Approved by: https://github.com/IvanKobzarev	2024-06-17 23:39:48 +00:00
Xiaodong Wang	8415a4ba98	Back out "[ROCm] TunableOp for gemm_and_bias (#128143 )" (#128815 ) Summary: Original commit changeset: 35083f04fdae Original Phabricator Diff: D58501726 This PR is bringing a large numerical gap. e.g. for 256 x 4096 x 4096 GEMM, if we enable tunable op + DISABLE_ADDMM_HIP_LT=0, the results are way off. Differential Revision: D58660832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128815 Approved by: https://github.com/mxz297, https://github.com/eqy, https://github.com/malfet	2024-06-17 22:52:27 +00:00
atalman	3b8c9b8ab1	[Docker Release] Test if pytorch was compiled with CUDA before pushing to repo (#128852 ) Related to: https://github.com/pytorch/pytorch/issues/125879 Would check if we are compiled with CUDA before publishing CUDA Docker nightly image Test ``` #18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi #18 1.656 Is torch compiled with cuda: False #18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1 ------ > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi: 1.656 Is torch compiled with cuda: False ------ Dockerfile:80 -------------------- 79 \| RUN /opt/conda/bin/pip install torchelastic 80 \| >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\ 81 \| >>> echo "Is torch compiled with cuda: ${IS_CUDA}"; \ 82 \| >>> if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \ 83 \| >>> exit 1; \ 84 \| >>> fi 85 \| -------------------- ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! 
-z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1 (base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain --platform="linux/amd64" --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" . #0 building with "default" instance using docker driver ``` Please note: it looks like we are installing from the pytorch channel rather than the nightly channel on the PR, hence CUDA 12.4 is failing since it's not in the pytorch channel yet: https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852 Approved by: https://github.com/malfet	2024-06-17 22:51:12 +00:00
Xu Zhao	1835e3beab	Fix the inductor ci (#128879 ) Fix the torchbench+inductor ci on trunk due to recent upgrade to numpy 2.0.0rc1. We have to remove DALLE2_pytorch model, since it depends on embedding-reader, which is not compatible with numpy>2: https://github.com/rom1504/embedding-reader/blob/main/requirements.txt#L3 Fixes #128845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128879 Approved by: https://github.com/eellison	2024-06-17 22:20:33 +00:00
Shengbao Zheng	7baf32b5e7	[c10d] fix p2p group commsplit (#128803 ) Summary: For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent. Fix this by using the right device key of current rank. Differential Revision: D58631639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803 Approved by: https://github.com/shuqiangzhang	2024-06-17 22:07:40 +00:00
Jun Luo	1fd7496ab2	[MTIA] Fix synchronize API (#128714 ) Reviewed By: fenypatel99 Differential Revision: D58590313 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128714 Approved by: https://github.com/aaronenyeshi	2024-06-17 21:58:46 +00:00
cyy	163847b1bb	[1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675 Approved by: https://github.com/r-barnes	2024-06-17 21:25:59 +00:00
Yanbo Liang	8953725e6d	[Inductor][FlexAttention] Tune backwards kernel block sizes (#128853 ) This replaces #128767, which was somehow closed by mistake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128853 Approved by: https://github.com/angelayi	2024-06-17 21:10:55 +00:00
Yanbo Liang	a489792bb2	[GPT-benchmark] Fix memory bandwidth for MoE (#128783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128783 Approved by: https://github.com/Chillee ghstack dependencies: #128768	2024-06-17 21:04:57 +00:00
Yanbo Liang	8c06eae17e	[GPT-benchmark] Add metric: compilation time for GPT models (#128768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128768 Approved by: https://github.com/Chillee	2024-06-17 21:04:57 +00:00
Masaki Kozuki	a59766ee05	replace `AT_ERROR(...)` with `TORCH_CHECK(false, ...)` (#128788 ) as per title. encountered the old-fashioned by chance Pull Request resolved: https://github.com/pytorch/pytorch/pull/128788 Approved by: https://github.com/mikaylagawarecki	2024-06-17 20:50:22 +00:00
Kurman Karabukaev	0f89e66d17	Validate logs are created by default (#128522 ) Summary: Make sure that logs are captured in default settings Test Plan: ci Differential Revision: D58395812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128522 Approved by: https://github.com/d4l3k	2024-06-17 20:07:13 +00:00
Huy Do	1577328ea4	Set bash shell on Windows (#128854 ) Attempt to fix the missing python3 command on the new Windows AMI https://github.com/pytorch/pytorch/actions/runs/9551494945/job/26325922503. I added the logic to copy python to python3 to make the command available; it worked with the previous AMI, but has started to fail now and the cause is not clear (maybe it's not the AMI, but a new GitHub runner version) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128854 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman	2024-06-17 19:24:09 +00:00
Mikayla Gawarecki	b181b58857	Fix Storage.filename to not track the filename when storage was mmap-ed with MAP_PRIVATE (#128725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128725 Approved by: https://github.com/albanD	2024-06-17 18:55:47 +00:00
Catherine Lee	213eba7d2e	Configure mergebot via config (#128840 ) Fixes #ISSUE_NUMBER * Companion to https://github.com/pytorch/test-infra/pull/5312 * See the above for details + possible risks * Without the above PR, this should have no effects Pull Request resolved: https://github.com/pytorch/pytorch/pull/128840 Approved by: https://github.com/huydhn	2024-06-17 18:53:56 +00:00
PyTorch MergeBot	c172b58fe0	Revert "Update DALLE2_pytorch expected accuracy result on CPU (#128718 )" This reverts commit fd27138c4a86bd763a6b8128d940a7c98f951603. Reverted https://github.com/pytorch/pytorch/pull/128718 on behalf of https://github.com/huydhn due to This has reverted back to the previous expected value for some reason `153362fbc9` ([comment](https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174194219))	2024-06-17 18:49:15 +00:00
eellison	5344c41d43	Use forked torchbench branch with pinned numpy (#128856 ) Adds pinned numpy commit to yolov3 dependencies to the existing pinned commit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128856 Approved by: https://github.com/huydhn, https://github.com/PaliC	2024-06-17 18:41:42 +00:00
cyy	d35cdee97f	[Caffe2] Remove caffe2 onnx tests (#128687 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128687 Approved by: https://github.com/r-barnes	2024-06-17 18:17:58 +00:00
Mihir Patel	153362fbc9	Support HSDP + Monolith Checkpointing (#128446 ) Fixes #128444. Rank 0 check should be in the same group as the broadcast Pull Request resolved: https://github.com/pytorch/pytorch/pull/128446 Approved by: https://github.com/fegin	2024-06-17 16:59:41 +00:00
ibartol	c6b180a316	Created docs (and example) for cudart function in torch.cuda (#128741 ) Fixes #127908 ## Description Created docs to document the torch.cuda.cudart function to solve the issue #127908. I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise). Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake. ### Summary of Changes - Added docs for torch.cuda.cudart - Added the cudart function in the autosummary of docs/source/cuda.rst ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741 Approved by: https://github.com/msaroufim	2024-06-17 16:50:37 +00:00
drisspg	fc2913fb80	Remove amax return from _scaled_mm (#128683 ) # Summary The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues: - Tensor arguments as kwarg only - multiple outputs from triton templates If the need for the amax return type arises we can consider either adding it back or, more likely, creating a separate op. In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels. ### Changes: - This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high-precision" (a type with more precision than fp8). This is only relevant when returning in low-precision. - We currently still allow for fp8 returns and scaled result. Perhaps we should also ban this as well... New signature: ```Python def meta_scaled_mm( self: torch.Tensor, mat2: torch.Tensor, scale_a: torch.Tensor, scale_b: torch.Tensor, bias: Optional[torch.Tensor] = None, scale_result: Optional[torch.Tensor] = None, out_dtype: Optional[torch.dtype] = None, use_fast_accum: bool = False, ) -> torch.Tensor: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683 Approved by: https://github.com/vkuzo	2024-06-17 16:48:00 +00:00
Andrew Hoblitzell	73b78d1cbe	Document the torch.nn.parallel.scatter_gather.gather function (#128566 ) Fixes #127899 ### Description Add docstring to `torch/nn/parallel/scatter_gather.py:gather` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128566 Approved by: https://github.com/kwen2501	2024-06-17 16:44:17 +00:00
Jiashen Cao	316b729677	[Fix] TS converter constant to tensor (#128442 ) #### Issue Tensor constants were previously lifted directly as inputs in the fx graph, which results in errors for multiple test cases with tensor constants. This PR introduces a fix to convert a tensor constant to a `GetAttr` in the fx graph. This PR also introduces other fixes to maintain a valid `state_dict` for the exported program when there are tensor constants. In short, after tensor constants are converted to `GetAttr`, they are treated as buffers during retracing. The fix will convert those back from buffer to constant. #### Test Plan Add new test cases that generate tensor constants * `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442 Approved by: https://github.com/angelayi	2024-06-17 16:42:43 +00:00
Xuehai Pan	a87d82abd7	[BE] enable UFMT for `torch/nn/*.py` (#128593 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596, #128594, #128592	2024-06-17 16:29:29 +00:00
Xuehai Pan	f6e6e55fa7	[BE] enable UFMT for `torch/nn/functional.py` (#128592 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596, #128594	2024-06-17 16:29:29 +00:00
Xuehai Pan	95ac2d6482	[BE] enable UFMT for `torch/nn/modules` (#128594 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596	2024-06-17 16:29:25 +00:00
Xuehai Pan	dff6342a0b	[BE][Easy] enable UFMT for `torch/nn/parallel` (#128596 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128596 Approved by: https://github.com/mikaylagawarecki	2024-06-17 16:29:22 +00:00
Zhengxu Chen	bfad0aee44	[export] Preserve requires_grad for export inputs. (#128656 ) Summary: Today meta['val'] on placeholder nodes doesn't preserve requires_grad information consistent with the original inputs. There seems to be no easy way to fix this directly at the proxy tensor layer. This is useful for re-exporting a joint graph. Test Plan: test_preserve_requires_grad_placeholders Differential Revision: D58555651 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656 Approved by: https://github.com/tugsbayasgalan	2024-06-17 16:26:08 +00:00
Joel Schlosser	2a41fc0390	Short-term fix to preserve NJT metadata cache in torch.compile (#122836 ) Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile. For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors. NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing. Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836 Approved by: https://github.com/soulitzer ghstack dependencies: #127007, #128057	2024-06-17 15:25:09 +00:00
Sam Larsen	24443fe16a	[inductor] parallel compile: Print traceback detail when there's an exception in a sub-process (#128775 ) Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this: ``` ... File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: SubprocException: An exception occurred in a subprocess: Traceback (most recent call last): File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job result = SubprocMain.foo() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo SubprocMain.bar() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar SubprocMain.baz() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz raise Exception("an error occurred") Exception: an error occurred ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775 Approved by: https://github.com/jansel	2024-06-17 15:10:47 +00:00
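A minimal sketch of the idea in the commit above (24443fe16a), not the code from the PR: since traceback objects cannot be pickled, the worker formats the traceback to a string and ships that instead. `run_job`, `unwrap`, `failing_job`, and this `SubprocException` class are hypothetical names used only for illustration, and `pickle.dumps`/`pickle.loads` stands in for the real process boundary.

```python
import pickle
import traceback


class SubprocException(Exception):
    """Hypothetical wrapper raised in the parent with the worker's traceback text."""

    def __init__(self, details: str) -> None:
        super().__init__(f"An exception occurred in a subprocess:\n\n{details}")
        self.details = details


def run_job(fn):
    """Worker side: run fn and return a picklable (status, value) pair."""
    try:
        return ("ok", fn())
    except Exception:
        # Format the traceback to plain text here, while it is still available.
        return ("error", traceback.format_exc())


def unwrap(payload: bytes):
    """Parent side: decode the worker's reply and re-raise failures with detail."""
    status, value = pickle.loads(payload)
    if status == "error":
        raise SubprocException(value)
    return value


def failing_job():
    raise ValueError("an error occurred")
```

The parent then sees the worker's frames in the exception message even though no traceback object ever crossed the process boundary.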
Nikita Shulga	e3093849e5	[Docs] Update links (#128795 ) From https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding to https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html And from https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag to https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html Fixes https://github.com/pytorch/pytorch/issues/128774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128795 Approved by: https://github.com/atalman	2024-06-17 14:55:32 +00:00
Ambareesh Shyam Sundar	0f81473d7b	Update fake tensor error checks for bool tensor subtraction (#128492 ) Fixes #127003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128492 Approved by: https://github.com/soulitzer	2024-06-17 13:41:15 +00:00
Animesh Jain	b0282071c4	[dynamo] override torch.nn.modules.activation._is_make_fx_tracing (#128748 ) Discovered while inlining `MultiHeadAttention` nn Module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128748 Approved by: https://github.com/jansel ghstack dependencies: #128315	2024-06-17 08:49:29 +00:00
Xu Han	b40a033c38	[cpp_extension][inductor] Fix sleef windows depends. (#128770 ) # Issue: While working on enabling Inductor on PyTorch Windows, I found a sleef lib dependency issue. # Analysis: After we enabled SIMD on PyTorch Windows (https://github.com/pytorch/pytorch/pull/118980 ), the sleef functions are called from the VEC headers, which makes sleef a dependency. Windows and Linux differ here. ## Linux: Linux exports functions by default, so libtorch_cpu.so statically links to sleef.a and also exports sleef's functions. ## Windows: Windows does not export functions by default and has many limitations on exporting functions; see https://github.com/pytorch/pytorch/issues/80604 . We can't package sleef functions via torch_cpu.dll as on Linux. # Solution: We already package the sleef static lib as part of the release, so we just need to help users link to sleef.lib: 1. Add sleef to cpp_builder for Inductor. 2. Add sleef to cpp_extension for C++ extensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128770 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-17 05:44:34 +00:00
Wang, Eikan	a52c8ace98	[3/N] Non-Tensor: Support string parameter for aten operations (#125831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-06-17 05:11:29 +00:00
cyy	74e11a4210	Enable clang-tidy on torch/csrc/mps (#128782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128782 Approved by: https://github.com/Skylion007	2024-06-17 02:19:48 +00:00
cyy	f9dae86222	Concat namespaces in torch/csrc/utils/* (#128787 ) Concat namespaces in torch/csrc/utils/* Pull Request resolved: https://github.com/pytorch/pytorch/pull/128787 Approved by: https://github.com/Skylion007	2024-06-16 23:51:14 +00:00
Mark Saroufim	6cbdbb6c3c	Remove top lev numpy dependency from fuzzer.py (#128759 ) Test CI This fixes issues like the one below, where the import fails even though I don't intend to use the fuzzer. With this change, numpy is imported only when someone calls functions from the fuzzer; the import no longer happens at the top of the file. ``` >>> import torchao Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module> from torchao.quantization import ( File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module> from .smoothquant import * # noqa: F403 File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module> import torchao.quantization.quant_api as quant_api File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module> from torchao.utils import ( File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module> import torch.utils.benchmark as benchmark File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module> from torch.utils.benchmark.utils.fuzzer import * # noqa: F403 File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module> import numpy as np ModuleNotFoundError: No module named 'numpy' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759 Approved by: https://github.com/Skylion007	2024-06-16 16:34:12 +00:00
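The deferred-import pattern described in the commit above (6cbdbb6c3c) can be sketched as follows. This is an illustration, not the actual fuzzer.py change; `random_lengths` is a hypothetical helper. Defining the function never touches numpy, so importing the enclosing module succeeds on a machine without numpy; the `ModuleNotFoundError` can only surface when the function is called.

```python
def random_lengths(count, seed=0):
    """Hypothetical fuzzer helper; numpy is imported only when this is called."""
    import numpy as np  # deferred: not needed merely to import this module

    rng = np.random.default_rng(seed)
    # Draw `count` integers in [1, 16) and return them as a plain Python list.
    return rng.integers(1, 16, size=count).tolist()
```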
leslie-fang-intel	f8d60e0e0a	[Inductor][CPP] Fix Half data type cse cache issue for CPP Backend (#128498 ) Summary Fixing issue: https://github.com/pytorch/pytorch/issues/128263. After https://github.com/pytorch/pytorch/issues/115260, we cached the higher precision cse variable to avoid duplicate casting between buffers. However, it failed to check the original data type. This means if we convert `int32` to `bf16` for `store` and then convert `bf16` back to `fp32` for `load`, it would incorrectly hit the cache and reuse the `int32` cse var. This PR fixes the issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_128263 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128498 Approved by: https://github.com/jgong5, https://github.com/zhuhaozhe, https://github.com/jerryzh168	2024-06-16 11:27:13 +00:00
Will Feng	979edbbe12	[Traceable FSDP2] Dynamo support FSDP2 use_training_state context manager (#127854 ) Improve Dynamo to support the FSDP2 `use_training_state()` context manager. Test command: ` pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state ` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854 Approved by: https://github.com/yanboliang	2024-06-16 08:48:52 +00:00
Animesh Jain	e4d8aa4d24	[torchbench] Enable some models with inline_inbuilt_nn_modules (#128315 ) Graph breaks/recompiles decrease for all models except drq, where they increase; that increase is legitimate. Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315 Approved by: https://github.com/jansel	2024-06-16 08:37:23 +00:00
xinan.lin	cc518ebd38	[Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147 ) Reuse Inductor test case for Intel GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-06-16 08:07:05 +00:00
Blaine Burton Rister	f1ee3589a1	[Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342 ) Summary Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets. Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must be either: - A power of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1. - A multiple of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size. Note that a slice of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`. 
Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077 Example kernel: ``` @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4096 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK]) tmp1 = tmp0 + tmp0 tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32)) ''', device_str='cuda') ``` Test Plan This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories: - Can we generate strided block pointers for the appropriate shapes? - Powers of 2 - Non-power of 2, but multiple of the maximum block size - Arbitrary leading dimensions, with power of 2 inner dimensions - Weird strides and offsets - Reductions - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo) - Broadcasts (some variables are missing from the indexing expression) - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers? - Unsupported static shapes - Unsupported symbolic shapes - Mixing and matching these cases: - Pointwise and reduction in the same kernel - Sanity check the test harness - Do we raise an exception if the expected number of block pointers and the actual number are different? Follow-ups There are a few important cases which this PR can't handle. 
I'm hoping these can be deferred to follow-up PRs: - Handle non-divisible shapes - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted. - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton. - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead? Differential Revision: D56739375 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-06-16 07:35:57 +00:00
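The block-shape arithmetic in the example kernel above (commit f1ee3589a1) can be sketched in plain Python. This is a reading of the `block_shape` expressions in that kernel, not Inductor's implementation; `ceil_div` and `nd_block_shape` are hypothetical helpers. Each `d * (d <= c) + c * (c < d)` term in the kernel is `min(d, c)`, where `c` is the 1D block size ceil-divided by the number of elements spanned by one step of that dimension.

```python
from math import prod


def ceil_div(x: int, y: int) -> int:
    """Integer division rounding up, for positive y."""
    return -(-x // y)


def nd_block_shape(xblock: int, shape: list) -> list:
    """Map a 1D block size onto an N-D block shape (innermost dimension last)."""
    block = []
    span = 1  # elements covered by one step along the current dimension
    for dim in reversed(shape):
        block.append(min(dim, ceil_div(xblock, span)))
        span *= dim
    return list(reversed(block))


# With power-of-2 dimensions and a power-of-2 block size, the N-D block
# holds exactly as many elements as the 1D block it replaces.
for exp in range(13):
    xblock = 2 ** exp
    assert prod(nd_block_shape(xblock, [32, 16, 8])) == xblock
```

For `shape=[32, 16, 8]` and `XBLOCK=64` this gives `[1, 8, 8]`: the block covers 8 full innermost rows, matching the divisibility argument in the summary.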
Michael Lazos	a61939467a	Enable passing dynamo-traced complex test (#128771 ) Fixes https://github.com/pytorch/pytorch/issues/118159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128771 Approved by: https://github.com/anijain2305	2024-06-16 07:28:09 +00:00
BowenBao	ab13980424	[ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364 ) The following are all constrained under the ONNX exporter project scope. - `person_of_interest.rst` - Moving folks no longer working on the project to emeritus. - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre, who have all made countless contributions to this project. - `CODEOWNERS` - Removing folks no longer working on the project. - Adding new owners who will now be notified of PRs related to the specific file paths. - `merge_rules.yaml` - Removing folks no longer working on the project. 🫡 Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364 Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD	2024-06-16 04:52:16 +00:00
Oguz Ulgen	6079c50910	Make config.fx_graph_remote_cache be three-value switch (#128628 ) Summary: We want to allow for three configurations False: Force off True: Force on None: OFF for OSS and JK config for internal Test Plan: CI Differential Revision: D58535897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128628 Approved by: https://github.com/masnesral, https://github.com/eellison	2024-06-15 17:52:09 +00:00
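The three-value switch described in the commit above (6079c50910) can be sketched as below. This is an illustration of the tri-state pattern only; `resolve_remote_cache` and `environment_default` are hypothetical names, and the real lookup behind `None` (OFF for OSS, a JK config internally) is not reproduced here.

```python
from typing import Callable, Optional


def resolve_remote_cache(
    setting: Optional[bool], environment_default: Callable[[], bool]
) -> bool:
    """Resolve a tri-state flag: True/False force the value, None defers."""
    if setting is None:
        # No explicit choice was made: fall back to the environment's default.
        return environment_default()
    return setting


def oss_default() -> bool:
    """Stand-in for the open-source default, which the commit says is OFF."""
    return False
```

Testing `setting is None` rather than truthiness is what keeps an explicit `False` (force off) distinct from "unset".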
Sam Larsen	94c0dcbe1d	[inductor] Parallel compile: handle crashes in subprocesses (#128757 ) Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool. Test Plan: * New unit test * Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757 Approved by: https://github.com/jansel	2024-06-15 17:35:04 +00:00
David Berard	f0d68120f4	[subclasses] Handle dynamo inputs that are subclass views with (-1) in the view (#128662 ) When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view. Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1. Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662 Approved by: https://github.com/jbschlosser	2024-06-15 14:58:18 +00:00
Wang, Eikan	18634048a1	Separate AOTI Eager utils as a single file (#125819 ) The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819 Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire ghstack dependencies: #125308	2024-06-15 13:42:49 +00:00
Yifu Wang	7a39755da2	Introduce a prototype for SymmetricMemory (#128582 ) This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C._distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of devices identified by a # ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverage the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has a collective semantic and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypass SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void*> get_buffer_ptrs() = 0; virtual std::vector<void*> get_signal_pad_ptrs() = 0; virtual void** get_buffer_ptrs_dev() = 0; virtual void** get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-15 10:20:21 +00:00
Wang, Eikan	60bbdc0b40	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can add support for other parameter types in a more modular way, such as `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by frequency of use. - `Tensor` - `bool` - `int64_t` - `TensorList` - `Scalar` - `c10::SymIntArrayRef` - `::std::optional<Tensor>` - `IntArrayRef` - `double` - `c10::SymInt` - `::std::optional<ScalarType>` - `::std::optional<double>` - `::std::optional<bool>` - `::std::optional<Layout>` - `::std::optional<Device>` - `::std::optional<int64_t>` - `Dimname` - `::std::optional<Generator>` - `c10::string_view` - `::std::optional<c10::string_view>` - `OptionalIntArrayRef` - `::std::optional<Scalar>` - `OptionalSymIntArrayRef` - `::std::optional<MemoryFormat>` - `::std::optional<c10::SymInt>` - `ScalarType` - `ArrayRef<Scalar>` - `DimnameList` - `::std::optional<ArrayRef<double>>` - `::std::array<bool,3>` - `::std::optional<DimnameList>` - `c10::List<::std::optional<Tensor>>` - `::std::array<bool,2>` - `Storage` - `::std::array<bool,4>` - `Device` - `DeviceIndex` - `ITensorListRef` - `Stream` - `Layout` - `MemoryFormat` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-15 09:18:44 +00:00
Michael Lazos	de4f379cf2	run mkldnn test with inlining (#128749 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128749 Approved by: https://github.com/anijain2305	2024-06-15 09:04:08 +00:00
Tristan Rice	b50c0e94c2	TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739 ) This adjusts the settings of the libuv backend to match the older TCPStore. * DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. Going over this limit with `tcp_abort_on_overflow` set results in connections being reset. * TCP_NODELAY: Since TCPStore primarily sends small messages, there's no benefit to using Nagle's algorithm and it may add additional latency for store operations. Test plan: ``` python test/distributed/test_store.py -v -k LibUv ``` Benchmark script: ``` import time import os import torch.distributed as dist rank = int(os.environ["RANK"]) store = dist.TCPStore( host_name="<server>", port=29500, world_size=2, is_master=(rank == 0), use_libuv=True, ) if rank == 1: total_iters = 0 total_dur = 0 for iter in range(10): iters = 500000 start = time.perf_counter() for i in range(iters): store.set(f"key_{i}", f"value_{i}") dur = time.perf_counter() - start print(f"{iter}. {iters} set, qps = {iters/dur}") total_iters += iters total_dur += dur print(f"overall qps = {total_iters/total_dur}") else: print("sleeping") time.sleep(1000000000) ``` The performance difference with and without TCP_NODELAY appears negligible on a single host. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739 Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o	2024-06-15 07:40:18 +00:00
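For readers unfamiliar with the two socket settings named in the commit above (b50c0e94c2), here is a minimal sketch using Python's standard `socket` module rather than libuv. `enable_nodelay` and `open_listener` are hypothetical helpers. One assumption to note: `socket.SOMAXCONN` is the platform's compile-time constant, which is not necessarily the host's runtime somaxconn value that the commit's `-1` backlog selects.

```python
import socket


def enable_nodelay(sock: socket.socket) -> None:
    """Disable Nagle's algorithm so small writes are sent without coalescing."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def open_listener(host: str, port: int) -> socket.socket:
    """Open a TCP listener with a large backlog and TCP_NODELAY set."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_nodelay(server)
    server.bind((host, port))
    # Request the largest backlog the platform constant allows; the kernel
    # may clamp this further to its configured limit.
    server.listen(socket.SOMAXCONN)
    return server
```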
cyy	e4c32d14a8	[3/N] Remove inclusion of c10/util/string_utils.h (#128504 ) Follows #128372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504 Approved by: https://github.com/malfet	2024-06-15 06:38:40 +00:00
Oguz Ulgen	472211c97a	Make assert_size_stride to return all errors (#128764 ) This will help debug some problems I'm encountering, but in general, it is best to show the entire error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128764 Approved by: https://github.com/jansel	2024-06-15 06:32:40 +00:00
Sahdev Zala	4ccbf711e2	Learning Rate Scheduler docstring fix (#128679 ) Fix docstrings in Learning Rate Scheduler. The fix can be verified by running pydocstyle path-to-file --count Related #112593 BEFORE the PR: pydocstyle torch/optim/lr_scheduler.py --count  92  AFTER the PR: pydocstyle torch/optim/lr_scheduler.py --count  0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679 Approved by: https://github.com/janeyx99	2024-06-15 05:30:35 +00:00
Animesh Jain	108adbc726	[dynamo][side effects] Raise assertion error if the object is already tracked for mutation (#128590 ) This issue was pointed out by @tombousso here - https://github.com/pytorch/pytorch/pull/128269#issuecomment-2163755792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128590 Approved by: https://github.com/mlazos ghstack dependencies: #128715, #128269	2024-06-15 05:07:49 +00:00
Xu Han	9ebf77b13b	Fix windows inductor defination issue (#128686 ) Changes: 1. Add memory alignment macro support on Windows. 2. Fix `#pragma unroll` not being supported by the MSVC `cl` compiler. `#pragma unroll` causes an error with the MSVC `cl` compiler but is supported by `clang` on Windows. We should disable it only for the `__msvc_cl__` compiler, so that performance is better when `clang` is used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-15 03:02:00 +00:00
Animesh Jain	7e092a62e6	[dynamo] Support weakref objects (#128533 ) Fixes https://github.com/pytorch/pytorch/issues/125720 I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY; CPython provides no guarantees about when the weakref will be invalidated (even when the user calls del x and x is the last reference). So any code that relies on del x to invalidate the weakref of x right away is BAD code. Therefore we can (ab)use this nuance, and can just ignore DELETE_* or STORE_* on the referent objects. The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so just SKIP_FRAME on such a case. This is rare. CPython notes 1) https://docs.python.org/3/library/weakref.html 2) https://docs.python.org/3/reference/datamodel.html#index-2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533 Approved by: https://github.com/jansel	2024-06-15 02:16:25 +00:00
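A small sketch of the weakref behaviour discussed in the commit above (7e092a62e6), using only the standard library; `Node` and `referent_after_del` are hypothetical names. It calls `gc.collect()` after `del` because, as the commit notes, code should not rely on `del` alone to clear the weak reference right away.

```python
import gc
import weakref


class Node:
    """Plain class; instances of user-defined classes support weak references."""


def referent_after_del():
    """Return (alive_before, referent_after) around dropping the last strong ref."""
    obj = Node()
    ref = weakref.ref(obj)
    alive_before = ref() is obj
    del obj       # drops the only strong reference
    gc.collect()  # ask the collector to run instead of assuming del suffices
    return alive_before, ref()
```

Once the referent has been collected, calling the weak reference returns `None`, which is the "invalidated" state the commit refers to.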
Animesh Jain	62a0e39ced	[dynamo][inlining-nn-modules] Update tests with new expected counts (#128463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128463 Approved by: https://github.com/yanboliang	2024-06-15 02:08:02 +00:00
vasiliy	2d01f87737	Enable torch.empty for float8 dtypes + deterministic mode + cpu (#128744 ) Summary: Enables creating empty float8 tensors for: * cuda when `torch.use_deterministic_algorithms` is set to True * cpu for all settings of `torch.use_deterministic_algorithms` Context for NaN values of float8_e4m3fn and float8_e5m2: https://arxiv.org/pdf/2209.05433, Section 3, Table 1 Context for NaN values of float8_e4m3fnuz and float8_e5m2fnuz: https://arxiv.org/pdf/2206.02915, Section 3.2, "instead of reserving one exponent field to represent Inf and NaN, we reserve only a single codeword (corresponding to negative zero)" Test Plan: ``` python test/test_quantization.py -k test_empty ``` Fixes https://github.com/pytorch/pytorch/issues/128733 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128744 Approved by: https://github.com/malfet, https://github.com/drisspg	2024-06-15 02:05:30 +00:00
PyTorch MergeBot	846bb30e13	Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )" This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5. Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build `bd72e28314`. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))	2024-06-15 01:58:20 +00:00
PyTorch MergeBot	5efe71f134	Revert "[export] Add print_readable to unflattener (#128617 )" This reverts commit 5d9a609b4f6c94fb930188e4d7c99f53d989c022. Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/huydhn due to Sorry for reverting your change but another failed test shows up in trunk inductor/test_flex_attention.py where it needs to be updated `5d9a609b4f`. I guess it is easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2169030779))	2024-06-15 01:46:23 +00:00
Huy Do	f37121bb74	Add model name, quantization and device to gpt_fast micro benchmark output (#128091 ) A small enhancement to https://hud.pytorch.org/benchmark/llms with these columns in the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128091 Approved by: https://github.com/yanboliang	2024-06-15 01:39:48 +00:00
Fuzzkatt	3f47c72268	add multiprocessing checks in test_dataloader.py (#128244 ) Add multiprocessing checks in test_dataloader.py for tests requiring multiprocessing similar to test_multiprocessing.py: https://github.com/pytorch/pytorch/blob/main/test/test_multiprocessing.py#L41-L52. Change all Jetson skips to TEST_CUDA_IPC checks since that is the root cause of the failures on Jetson in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128244 Approved by: https://github.com/eqy, https://github.com/malfet	2024-06-15 01:32:55 +00:00
Yueming Hao	73ba432d32	[custom_op]Fix None return schema (#128667 ) Fixes #125044 If users define a schema that returns `None`, it is parsed to a `torch.NoneType`. Auto functionalization supports `()` as an empty return but not `None`. So a `None` return fails the check for [`can_auto_functionalize`](https://github.com/pytorch/pytorch/blob/findhao/fix_none_return_functionalize/torch/_higher_order_ops/auto_functionalize.py#L71) even though we can treat it as a `()` return. This PR is a fix to skip the check for a None return. I hope it can be fixed at a [deeper level](`31e44c72ca`), but that fix breaks a lot of existing schemas. So it's better to fix this issue in auto_functionalize.py for the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128667 Approved by: https://github.com/zou3519	2024-06-15 00:41:37 +00:00
leslie-fang-intel	6616ad030f	[Inductor] Fix the High Order Op layout issue (#128275 ) Fix the issue: https://github.com/pytorch/pytorch/issues/127995 - In the current implementation of creating `FallbackKernel`, the `device` of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. `921aa194c7/torch/_inductor/ir.py (L5632-L5649)` - If an `ExternalKernel schedulerNode` has a None device, the previous buffer is not flushed before codegen of this `ExternalKernel schedulerNode`, which causes wrong generated code. `ef2b5ed500/torch/_inductor/scheduler.py (L2701-L2709)` Test Plan ``` python -u -m pytest -s -v test/higher_order_ops/test_with_effects.py -k test_compile_inductor_external_op_return_none ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128275 Approved by: https://github.com/eellison	2024-06-15 00:33:21 +00:00
angelayi	5d9a609b4f	[export] Add print_readable to unflattener (#128617 ) Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](`17b45e905a/torch/fx/graph_module.py (L824)`)), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module. Example print from `python test/export/test_unflatten.py -k test_unflatten_nested` ``` class UnflattenedModule(torch.nn.Module): def forward(self, x: "f32[2, 3]"): # No stacktrace found for following nodes rootparam: "f32[2, 3]" = self.rootparam # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None # No stacktrace found for following nodes foo: "f32[2, 3]" = self.foo(mul); mul = None bar: "f32[2, 3]" = self.bar(foo); foo = None return (bar,) class foo(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # No stacktrace found for following nodes child1param: "f32[2, 3]" = self.child1param nested: "f32[2, 3]" = self.nested(mul); mul = None # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None return add class nested(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None return div class bar(torch.nn.Module): def forward(self, add: "f32[2, 3]"): # No stacktrace found for following nodes child2buffer: "f32[2, 3]" = self.child2buffer # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None return sub ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/128617 Approved by: https://github.com/zhxchen17, https://github.com/pianpwk	2024-06-15 00:26:04 +00:00
Sanket Jayant Purandare	d67923b955	Adding kwargs to composable AC API to enable full capabilities (#128516 ) Summary: Firstly, this does not change any existing behaviour, since all the default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call. Secondly, this is needed for unlocking the full potential of composable checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``. Finally, an added benefit is now composable checkpointing can be used under ``FakeTensorMode`` by passing ``preserve_rng_state=False``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516 Approved by: https://github.com/awgu	2024-06-15 00:23:48 +00:00
Brian Hirsh	271852aa7e	inductor: pre-grad bmm pass shouldn't match if output is mutated (#128570 ) This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`: ``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion ``` inductor has a pre-grad pass to swap out multiple `linear` layers with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multiple-out-view ops like unbind). I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. My hope is that: (1) this situation is rare enough that it won't materially impact pattern matching in real world code (2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570 Approved by: https://github.com/eellison, https://github.com/mlazos ghstack dependencies: #127927	2024-06-15 00:08:44 +00:00
Brian Hirsh	ba19ed9a1a	FunctionalTensor: dispatch metadata directly to inner tensor (#127927 ) Fixes https://github.com/pytorch/pytorch/issues/127374 The error in the linked repro is: ``` AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8), device='cuda:0')) ``` Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`. What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper` and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927 Approved by: https://github.com/tugsbayasgalan	2024-06-15 00:08:44 +00:00
dilililiwhy	574a2cbcb7	Enable UFMT on common_device_type.py and common_dtype.py (#128490 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > torch/testing/_internal/common_device_type.py > torch/testing/_internal/common_dtype.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128490 Approved by: https://github.com/ezyang, https://github.com/XuehaiPan	2024-06-15 00:07:42 +00:00
PaliC	0492ec460a	[BE] Remove external testing of torch::deploy (#127952 ) As we don't expect external users of torch::deploy as the library is no longer supported, we will remove external testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127952 Approved by: https://github.com/malfet	2024-06-14 23:32:02 +00:00
cyy	bd72e28314	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang	2024-06-14 23:21:01 +00:00
Tristan Rice	52d4442a00	[c10d] Socket, TCPStore: add better logging (#128673 ) This adds better logging of errors to the socket and TCPStore classes. All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged. It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673 Approved by: https://github.com/c-p-i-o	2024-06-14 23:08:29 +00:00
Yang Chen	4abecd7102	[AOTI] fixed performance issue for AOTI_TORCH_CHECK (#128402 ) We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation time issues. Unfortunately, it caused perf regressions for CPU, as described in issue #126665. After some investigation, it turned out the slow compilation was caused by the use of the builtin function __builtin_expect provided by gcc/clang. Moreover, nuking __builtin_expect doesn't seem to cause any performance penalty, even though its purpose is to improve performance by providing the compiler with branch prediction information. abs latency numbers using the script shared by #126665: T5Small: 1019.055694 before the fix, 917.875027 after the fix; T5ForConditionalGeneration: 1009.825196 before the fix, 916.369239 after the fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402 Approved by: https://github.com/desertfire	2024-06-14 23:03:17 +00:00
Huy Do	fd27138c4a	Update DALLE2_pytorch expected accuracy result on CPU (#128718 ) I suspect that the issue shows up because of the new version of https://pypi.org/project/pyarrow/16.1.0/#history released yesterday. The package is a dependency of DALLE2_pytorch https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/DALLE2_pytorch/install.py#L22. I'll just update the expected accuracy result on CPU benchmark because the model fails to run there anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128718 Approved by: https://github.com/malfet	2024-06-14 22:54:21 +00:00
Catherine Lee	d3a4d9e4fe	Update cu124 dynamo benchmark expected values (#128737 ) Missed one in https://github.com/pytorch/pytorch/pull/128589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128737 Approved by: https://github.com/Skylion007	2024-06-14 22:23:00 +00:00
titaiwangms	bca2cf00ed	[ONNX] Add dynamic axes support to torchscript exporter with dynamo=True (#128371 ) This PR enables specific axes to be dynamic by calling torch.export.export and torch.export.Dim. Features: (1) Turn dynamic_axes to dynamic_shapes (2) Dim constraints remain the same (see test case with hitting constraints). This might give a different user experience, since we didn't have any constraints in torchscript-onnx exporting. (3) If input_names is used in dynamic_axes, ValueError will be raised, as input_names is currently not supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371 Approved by: https://github.com/justinchuby	2024-06-14 21:56:51 +00:00
Isuru Fernando	f103247a14	Run all samples for torchinductor tests (#128343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343 Approved by: https://github.com/lezcano	2024-06-14 21:52:12 +00:00
angelayi	e9c6e8369c	Torchbind call method + effects support (#128397 ) Adds effect token support to torchbind method calls by allowing `with_effects` to take in `torch.ops._higher_order_ops.call_torchbind` as an input. Here is the print from `TORCH_LOGS="aot" python test/export/test_torchbind.py -k test_compile_obj_torchbind_op`: ```python def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2]", arg2_1): # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1266 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos()) cos: "f32[2]" = torch.ops.aten.cos.default(arg1_1) with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, cos); arg0_1 = cos = None getitem: "f32[0]" = with_effects[0]; with_effects = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1267 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos() + 1) cos_1: "f32[2]" = torch.ops.aten.cos.default(arg1_1) add: "f32[2]" = torch.ops.aten.add.Tensor(cos_1, 1); cos_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, add); getitem = add = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1268 in f, code: torch.ops._TorchScriptTesting.queue_pop(tq) with_effects_2 = torch._higher_order_ops.effects.with_effects(getitem_2, torch.ops._TorchScriptTesting.queue_pop.default, arg2_1); getitem_2 = None getitem_4: "f32[0]" = with_effects_2[0]; with_effects_2 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1269 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.sin()) sin: "f32[2]" = torch.ops.aten.sin.default(arg1_1); arg1_1 = None with_effects_3 = torch._higher_order_ops.effects.with_effects(getitem_4, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, sin); getitem_4 = sin = None 
getitem_6: "f32[0]" = with_effects_3[0]; with_effects_3 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1270 in f, code: return tq.pop(), tq.pop() + tq.size(), tq with_effects_4 = torch._higher_order_ops.effects.with_effects(getitem_6, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop'); getitem_6 = None getitem_8: "f32[0]" = with_effects_4[0] getitem_9: "f32[2]" = with_effects_4[1]; with_effects_4 = None with_effects_5 = torch._higher_order_ops.effects.with_effects(getitem_8, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop'); getitem_8 = None getitem_10: "f32[0]" = with_effects_5[0] getitem_11: "f32[2]" = with_effects_5[1]; with_effects_5 = None with_effects_6 = torch._higher_order_ops.effects.with_effects(getitem_10, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'size'); getitem_10 = arg2_1 = None getitem_12: "f32[0]" = with_effects_6[0]; with_effects_6 = None add_1: "f32[2]" = torch.ops.aten.add.Tensor(getitem_11, 0); getitem_11 = None return (getitem_12, getitem_9, add_1) ``` In order to support this, this PR makes the following changes: * Adds `FakeScriptObject` to `CustomObjArgument`, which will be put on the `meta["val"]` of nodes representing torchbind objects. * Adds pickle/deepcopy support to FunctionSchema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128397 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-06-14 21:28:17 +00:00
ibartol	65d3ddcb8b	Add GLIBC requirements for libtorch to solve #113124 (#128135 ) Fixes #113124. ## Description I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions. ### Summary of Changes - Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch. - Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command. ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135 Approved by: https://github.com/jbschlosser	2024-06-14 21:24:53 +00:00
titaiwangms	e9a29aaa4a	[ONNX] Add upsample trilinear to skip decomp (#128259 ) (1) Add upsample trilinear vec to skip decomposition (2) Add tests to make sure that torch.export.export still decomposes them Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259 Approved by: https://github.com/justinchuby	2024-06-14 21:20:44 +00:00
rzou	e6e102cf85	Dynamo testing: add some skips (#128734 ) The following tests are failing consistently for me locally, so we're going to skip them. They're disabled in CI but it looks like they're just always failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128734 Approved by: https://github.com/williamwen42 ghstack dependencies: #128731	2024-06-14 20:53:30 +00:00
rzou	11de50f17c	[Dynamo] skip some TorchScript tests (#128731 ) We don't care about the Dynamo x TorchScript composition, so I'm disabling these tests (so they don't get reported as flaky). Not disabling all of the TorchScript tests yet because they have been useful to catch random bugs. Test Plan: - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731 Approved by: https://github.com/williamwen42	2024-06-14 20:53:30 +00:00
Simon Fan	4b96575a09	[dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196 ) FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched. For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196 Approved by: https://github.com/soulitzer	2024-06-14 20:28:08 +00:00
Animesh Jain	1aafb9eb90	[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 ) Fixes https://github.com/pytorch/pytorch/issues/101168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269 Approved by: https://github.com/jansel ghstack dependencies: #128715	2024-06-14 20:17:03 +00:00
Animesh Jain	9c77332116	[torch.compile][ci] Flaky models in CI (similar to DISABLED_TEST) (#128715 ) These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in inlining work. There is nothing wrong with inlining; I think these models are on the edge of already-broken accuracy measurement, and inlining is just pushing it in a more broken direction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715 Approved by: https://github.com/eellison	2024-06-14 20:17:03 +00:00
Sanket Jayant Purandare	2e5366fbc0	Extended Module Tracker (#128508 ) This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes. 1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``. 2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``. 3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case. 4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508 Approved by: https://github.com/wanchaol	2024-06-14 19:48:46 +00:00
Menglu Yu	d50712e5e3	[PT2] add inductor log for unbind_stack_pass (#128684 ) Summary: Currently, we do not log the pass. To better enable pattern hit inspection, we enable it. Test Plan: see signal Differential Revision: D58571992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128684 Approved by: https://github.com/dshi7	2024-06-14 19:45:55 +00:00
Nikita Shulga	9035fff2de	[BE] Do not test deprecated `torch.nn.utils.weight_norm` (#128727 ) Test `torch.nn.utils.parametrizations.weight_norm` instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727 Approved by: https://github.com/kit1980 ghstack dependencies: #128726	2024-06-14 19:14:44 +00:00
Nikita Shulga	27458cc097	[BE] Refactor repeated code in test_weight_norm (#128726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726 Approved by: https://github.com/kit1980	2024-06-14 19:14:44 +00:00
Colin Peppler	a6bd154a42	[inductor] Support mm decomps for matrices with unbacked sizes (#128655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128655 Approved by: https://github.com/jansel	2024-06-14 18:35:42 +00:00
Nikita Shulga	b94c52dd29	[GHF] Refuse merge to non-default branch (#128710 ) Unless PR is ghstack one Test plan: ``` % GITHUB_TOKEN=$(gh auth token) python3 -c "from trymerge import GitHubPR; pr=GitHubPR('pytorch', 'pytorch', 128591); print(pr.base_ref(), pr.default_branch())" release/2.4 main ``` Fixes: https://github.com/pytorch/test-infra/issues/5339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128710 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-06-14 18:23:25 +00:00
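The merge rule stated in the commit above (refuse a merge into a non-default branch unless the PR is a ghstack one) can be sketched as a small predicate. This is an illustrative sketch only, not the code in `trymerge`: `merge_allowed` and its parameters are hypothetical names, with `base_ref` and `default_branch` standing in for the values printed by `pr.base_ref()` and `pr.default_branch()` in the test plan.

```python
def merge_allowed(base_ref: str, default_branch: str, is_ghstack: bool) -> bool:
    """Hypothetical predicate mirroring the rule in the commit message.

    A PR may merge if it targets the repository's default branch,
    or if it is a ghstack PR (which targets a generated base branch).
    """
    return is_ghstack or base_ref == default_branch


# Values taken from the test plan output: base is release/2.4, default is main.
print(merge_allowed("release/2.4", "main", is_ghstack=False))
```

With the values from the test plan, the predicate rejects the merge, which matches the intent of refusing PR 128591.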
Zhengxu Chen	be0eec9031	[export] Improve static typing in tracer. (#128552 ) Summary: as title. Test Plan: CI Differential Revision: D58485487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128552 Approved by: https://github.com/angelayi	2024-06-14 17:57:37 +00:00
PyTorch MergeBot	2367161e4b	Revert "[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 )" This reverts commit c339efaf023b4af056dad4cb2f11c07930ed8af6. Reverted https://github.com/pytorch/pytorch/pull/127966 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/127966#issuecomment-2168505985))	2024-06-14 17:57:23 +00:00
Peter Bell	d7fc871175	[inductor] Improve superfluous mask handling in triton codegen (#128518 ) This takes the logic from `filter_masks` and factors it out into `_has_constant_mask`. I also improve support for `persistent_reduction` kernels by making use of the static RBLOCK value and potentially XBLOCK too in the `no_x_dim` case. I then use this helper when generating the `xmask` and `rmask`, so we can generate them as constants meaning triton can optimize them even if they are included. e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)` before: ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, rnumel): xnumel = 1024 XBLOCK: tl.constexpr = 1 rnumel = 512 RBLOCK: tl.constexpr = 512 xoffset = tl.program_id(0) * XBLOCK xindex = tl.full([1], xoffset, tl.int32) xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK)[:] roffset = 0 rmask = rindex < rnumel r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0) tmp1 = tl.broadcast_to(tmp0, [RBLOCK]) tmp3 = tl.where(rmask & xmask, tmp1, 0) tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0)) tl.store(out_ptr0 + (x0), tmp4, xmask) ``` after: ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, rnumel): xnumel = 1024 XBLOCK: tl.constexpr = 1 rnumel = 512 RBLOCK: tl.constexpr = 512 xoffset = tl.program_id(0) * XBLOCK xindex = tl.full([1], xoffset, tl.int32) xmask = tl.full([RBLOCK], True, tl.int1) rindex = tl.arange(0, RBLOCK)[:] roffset = 0 rmask = tl.full([RBLOCK], True, tl.int1) r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None) tmp1 = tl.broadcast_to(tmp0, [RBLOCK]) tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0)) tl.store(out_ptr0 + (x0), tmp3, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518 Approved by: https://github.com/lezcano	2024-06-14 17:52:55 +00:00
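Why the masks in the example above can be replaced by constants can be shown with a little arithmetic. The sketch below is a simplified illustration in plain Python and is not Inductor's `_has_constant_mask`; `mask_is_constant_true` and `brute_force_all_true` are hypothetical helpers. It assumes a mask of the form `index < numel`, where each program covers `block` consecutive indices and enough programs are launched to cover `numel`.

```python
def mask_is_constant_true(numel: int, block: int) -> bool:
    """Hypothetical check: the mask `index < numel` is True in every lane of
    every program exactly when the block size divides the element count."""
    return numel % block == 0


def brute_force_all_true(numel: int, block: int) -> bool:
    """Evaluate the mask lane by lane for every launched program (numel > 0)."""
    num_programs = -(-numel // block)  # ceiling division
    return all(
        pid * block + lane < numel
        for pid in range(num_programs)
        for lane in range(block)
    )


# Sizes from the commit's example: xnumel=1024 with XBLOCK=1, rnumel=512 with RBLOCK=512.
print(mask_is_constant_true(1024, 1), mask_is_constant_true(512, 512))
```

Both sizes from the example divide evenly, which is consistent with the generated kernel using constant masks and `None` in `tl.load` and `tl.store`.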
Menglu Yu	2357490524	[PT2] Enable shape_padding multiplier adjustment (#128346 ) Summary: Our experiments demonstrate that the current default value 1.1 may not be the best multiplier, and we thus enable the adjustment of the value to further improve the QPS. context: https://docs.google.com/document/d/10VjpOJkTv5A4sNX7dD6qT7PyhBxn6LSeLAuaqYtoOto/edit Test Plan: # IG_CTR {F1682138315} Differential Revision: D58373261 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128346 Approved by: https://github.com/jackiexu1992	2024-06-14 17:49:24 +00:00
cyy	d4807da802	Various fixes of torch/csrc files (#127252 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127252 Approved by: https://github.com/r-barnes	2024-06-14 17:31:24 +00:00
Aart Bik	089e76cca3	[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 ) The assertEqualMeta() method already tests that the first argument is a FakeTensor https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523 Approved by: https://github.com/huydhn	2024-06-14 17:05:17 +00:00
Yanbo Liang	1fb4effe7a	[GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002 ) Output example: ``` | name | metric | target | actual | |------------------------------|---------------------------|---------|---------| | layer_norm_bfloat16 | memory_bandwidth(GB/s) | 1017 | 1000.01 | | mlp_layer_norm_gelu_bfloat16 | flops_utilization | 0.71 | 0.71 | | gemv_int8 | memory_bandwidth(GB/s) | 990 | 984.06 | | gemv_bfloat16 | memory_bandwidth(GB/s) | 1137 | 1137.92 | | gather_gemv_int8 | memory_bandwidth(GB/s) | 1113 | 1111.09 | | gather_gemv_bfloat16 | memory_bandwidth(GB/s) | 1249 | 1248.15 | ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002 Approved by: https://github.com/Chillee	2024-06-14 17:03:22 +00:00
Laith Sakka	4c84af0f5d	Fix indexing and slicing of ranges in dynamo (#128567 ) Fix https://github.com/pytorch/pytorch/issues/128520 Dynamo does not handle range()[binary subscript] or range()[trinary_subscript] correctly. Right now it calls the get_item function, which applies the subscript operation to the list [start, end, step], which is unrelated to what is expected. In Python, range()[complex subscript] is another range, e.g. range(1, 10, 2)[1:4:1] is range(3, 9, 2). This diff fixes index and slice applications on range. It mimics the implementation in CPython (https://github.com/python/cpython/blob/main/Objects/rangeobject.c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128567 Approved by: https://github.com/anijain2305	2024-06-14 16:49:49 +00:00
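The range semantics described in the commit above can be sketched in plain Python. This is a minimal illustration and not Dynamo's implementation: `slice_range` and `index_range` are hypothetical names, and the sketch leans on the built-in `slice.indices` instead of mirroring CPython's C code.

```python
def slice_range(r: range, s: slice) -> range:
    """Return the range equal to r[s], computed from start/stop/step only."""
    # slice.indices clamps the slice to the sequence length and fills defaults.
    start, stop, step = s.indices(len(r))
    # Map positions within r back to values of r.
    return range(r.start + start * r.step, r.start + stop * r.step, r.step * step)


def index_range(r: range, i: int) -> int:
    """Return r[i], supporting negative indices like the built-in."""
    n = len(r)
    if i < 0:
        i += n
    if not 0 <= i < n:
        raise IndexError("range object index out of range")
    return r.start + i * r.step


print(slice_range(range(1, 10, 2), slice(1, 4, 1)))
```

The key point is that the subscript applies to positions in the range, so the result is computed from `start`, `step`, and the length, never from the three-element list `[start, end, step]`.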
PyTorch MergeBot	f75f5987aa	Revert "Extended Module Tracker (#128508 )" This reverts commit 1f46284f9ed5b60981174e689d750b358b19e4c4. Reverted https://github.com/pytorch/pytorch/pull/128508 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/9515753429/job/26230639980 ([comment](https://github.com/pytorch/pytorch/pull/128508#issuecomment-2168405784))	2024-06-14 16:46:03 +00:00
Aaron Orenstein	732b4e9074	Fix generated vararg types (#128648 ) In the generated files torchgen is incorrectly generating types on the varargs. The changes all look like this (changing `size: _int` to `size: Union[_int, SymInt]`): ``` --- ./torch/_VF.pyi.sav 2024-06-13 20:36:49.189664629 -0700 +++ ./torch/_VF.pyi 2024-06-13 20:36:57.208894614 -0700 @@ -168,17 +168,17 @@ @overload def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... @overload -def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... +def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ... def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ... @overload ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648 Approved by: https://github.com/jamesjwu	2024-06-14 16:04:37 +00:00
Kiuk Chung	8629939a51	[torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967 ) Adds the `C10_UBSAN_ENABLED` macro and uses it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check). Also cleans up the UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED` and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`. > NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967 Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet	2024-06-14 16:01:12 +00:00
PyTorch MergeBot	ee140a198f	Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591 )" This reverts commit 03e8a4cf45ee45611de77b55b515a8936f60ce31. Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))	2024-06-14 15:51:00 +00:00
eellison	c187593418	Prevent expansion of cat indexing to avoid int64 intermediate (#127815 ) Fix for https://github.com/pytorch/pytorch/issues/127652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815 Approved by: https://github.com/shunting314, https://github.com/peterbell10	2024-06-14 15:42:08 +00:00
Andres Lugo-Reyes	c339efaf02	[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 ) Needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560 This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069 unskipping the next batch of tests referenced by the aforementioned issue. No explicit changes were needed for source as they worked immediately after unskipping. The tests previously marked with xfail have now been modified to not expect a failure iff running on ROCm as they now pass. Behavior is unchanged for them on other architectures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966 Approved by: https://github.com/pruthvistony, https://github.com/zou3519	2024-06-14 15:24:28 +00:00
Huamin Li	c76a9d13cb	Revert D56709309 (#128481 ) Summary: potential fw compatibility issue raised from D58397323 Test Plan: Sandcastle Reviewed By: houseroad Differential Revision: D58443190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128481 Approved by: https://github.com/desertfire	2024-06-14 14:57:17 +00:00
rzou	9972e5f447	Rename impl_abstract to register_fake, part 2/2 (#123938 ) This PR renames the implementation details of register_fake to align more with the new name. It is in its own PR because this is risky (torch.package sometimes depends on private library functions and implementation details). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938 Approved by: https://github.com/williamwen42	2024-06-14 14:37:24 +00:00
Zheng, Zhaoqiong	a2d9c430b4	Adding a note for Getting Started with PyTorch on Intel GPUs (#127872 ) Adding a note for Getting Started with PyTorch on Intel GPUs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872 Approved by: https://github.com/svekars	2024-06-14 14:24:28 +00:00
Luca Wehrstedt	dfc4b608e1	Remove leftover warning causing log spew (#128688 ) This warning was left by mistake, and is uninformative (the user is doing nothing wrong) and causing log spew in trainings. See https://github.com/pytorch/pytorch/pull/120750#discussion_r1638430500 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128688 Approved by: https://github.com/drisspg	2024-06-14 14:08:11 +00:00
Nikita Shulga	e1dfc61250	Document CI/CD security philosophy (#128316 ) Namely: - when use of non-ephemeral runners is OK, vs when it is not - Why binary build pipelines should not use distributed caching - Why temporary CI artifacts should not be considered safe Pull Request resolved: https://github.com/pytorch/pytorch/pull/128316 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-06-14 13:47:25 +00:00
cyy	bfd5ea93e0	Enable clang-tidy on c10/util/Float8.h (#120573 ) This PR clears warnings and enables clang-tidy on c10/util/Float8.h. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120573 Approved by: https://github.com/drisspg	2024-06-14 13:47:07 +00:00
Sanket Jayant Purandare	1f46284f9e	Extended Module Tracker (#128508 ) This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes. 1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``. 2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``. 3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case. 4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508 Approved by: https://github.com/wanchaol	2024-06-14 12:01:53 +00:00
Isuru Fernando	e397ad6883	Improve codegen for ops.masked in triton (#128054 ) Fixes https://github.com/pytorch/pytorch/issues/127930 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-06-14 11:52:56 +00:00
Colin Peppler	7e734e2d08	[inductor] Fix nested indirect indexing case for index_propagation (#128378 ) Tries to fix #127677. # Context Just as @peterbell10 pointed out, we have the following scenario: ``` a = ops.indirect_indexing(...) b = ops.index_expr(a, ...) c = ops.indirect_indexing(b, ...) ``` We can repro this as: ``` def forward(self, arg0_1, arg1_1, arg2_1): iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0), repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1); index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]); index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]); return (index_1,) ``` which should generate a JIT py file like this: ``` def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr): ... tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last') tmp1 = ks0 tmp2 = tmp0 + tmp1 tmp3 = tmp0 < 0 tmp4 = tl.where(tmp3, tmp2, tmp0) # check_bounds() tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) \| ~(xmask), "index out of bounds: 0 <= tmp4 < ks0") def call(): arg0_1, arg1_1, arg2_1 = args buf1 = aten.repeat_interleave.Tensor(arg1_1) buf4 = empty_strided_cuda((u0, 64), (64, 1)) triton_poi_fused_index_select_0.run( buf1, arg2_1, buf4, s0, triton_poi_fused_index_select_0_xnumel, grid=grid(triton_poi_fused_index_select_0_xnumel), stream=stream0) ``` # Issue In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`. `3b555ba477/torch/_inductor/ir.py (L8154-L8160)` When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`. # Approach When creating `indirect` symbols from fallback, specify its range to be `[-size, size -1]` to avoid a lookup error with `indirectX`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-14 10:07:06 +00:00
Jason Ansel	99988be423	[halide-backend] Add test shard (#127308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127308 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #128266	2024-06-14 10:02:57 +00:00
Xia, Weiwen	03e8a4cf45	[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591 ) Port #127592 from main to release/2.4 ------ Fixes #127402 - Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py - Add checks of mutation for QLinearPointwiseBinaryPT2E Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592 Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591 Approved by: https://github.com/jgong5, https://github.com/Chillee	2024-06-14 09:31:38 +00:00
PyTorch MergeBot	43ae3073f9	Revert "[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 )" This reverts commit ba3726d02b25dff92762c59d4dffe96a7babfa75. Reverted https://github.com/pytorch/pytorch/pull/128523 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Looks like your changes broke the inductor tests: linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor. [Here you can find more details](`ba3726d02b`). ([comment](https://github.com/pytorch/pytorch/pull/128523#issuecomment-2167518145))	2024-06-14 08:27:05 +00:00
Will Constable	0344f95c2e	Add missing #include <array> to thread_name.cpp (#128664 ) I got local compile errors (using clang 14.0.6) due to this missing include after pulling the latest pytorch main. It's totally puzzling why CI appears to pass without this fix. Hopefully someone else will have an idea if we are missing some CI coverage or if I am using a strange build setup locally. The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664 Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k	2024-06-14 07:49:09 +00:00
Anshul Sinha	03725a0512	[dtensor][example] added MLPStacked example for printing sharding (#128461 ) Summary Currently, the comm_mode_feature_examples does not have an example for printing sharding information for a model with nested modules. While adding the new example to the suite, I recognized a way to refactor existing examples in order to make them more readable for users. The expected output can be found below: <img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92"> Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461 Approved by: https://github.com/XilunWu ghstack dependencies: #128369, #128451	2024-06-14 07:30:31 +00:00
Anshul Sinha	dd3b79a08f	[dtensor][be] improving readability of comm_mode.py and comm_mode_features_example.py (#128451 ) Summary I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases in order to better reflect what they are about. Removed non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run code. Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451 Approved by: https://github.com/XilunWu ghstack dependencies: #128369	2024-06-14 07:30:31 +00:00
Anshul Sinha	e886122e98	[dtensor][debug] add module level tracing and readable display (#128369 ) Summary Currently, CommDebugMode only allows displaying collective tracing at a model level whereas a user may require a more detailed breakdown. In order to make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary keeping track of the module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is: <img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791"> Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369 Approved by: https://github.com/XilunWu	2024-06-14 07:30:31 +00:00
yiliu30	4669c6d3ae	[quant][pt2e][quantizer] Support `set_module_name_qconfig` in X86InductorQuantizer (#126044 ) Summary: Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`. For example, only quantize the `sub`: ```python class M(torch.nn.Module): def __init__(self): super().__init__() self.linear = torch.nn.Linear(5, 5) self.sub = Sub() def forward(self, x): x = self.linear(x) x = self.sub(x) return x m = M().eval() example_inputs = (torch.randn(3, 5),) # Set config for a specific submodule. quantizer = X86InductorQuantizer() quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config()) ``` - Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level. - Unified the annotation process to follow this order: `module_name_qconfig`, `operator_type_qconfig`, and `global_config`. - Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs. - Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers. Test Plan ```bash python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name ``` @Xia-Weiwen @leslie-fang-intel @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2024-06-14 07:13:10 +00:00
Catherine Lee	674be9d3be	Update cu124 dynamo benchmark expected values (#128589 ) I believe this corresponds to changes in https://github.com/pytorch/pytorch/pull/127780 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128589 Approved by: https://github.com/nWEIdia, https://github.com/DanilBaibak	2024-06-14 07:04:34 +00:00
PyTorch MergeBot	18f35d9e12	Revert "Run all samples for torchinductor tests (#128343 )" This reverts commit 41df20c07caecddb6d21d69a125f2998ae9313e8. Reverted https://github.com/pytorch/pytorch/pull/128343 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float16 and other tests `41df20c07c` https://github.com/pytorch/pytorch/actions/runs/9509191526/job/26213490266. I think this might be a landrace ([comment](https://github.com/pytorch/pytorch/pull/128343#issuecomment-2167275337))	2024-06-14 06:08:17 +00:00
David Berard	f48f7615dc	[easy][subclasses] dynamo.reset() in test_subclass_views (#128659 ) When we don't dynamo.reset(), we don't recompile on different dynamic shapes. Also, some of the returned views were tuples - so when we `* 2`, we actually just copy all the inputs twice in the tuple. I changed it so that it would just return one of the values from the return tuple. Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes: ``` File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint return self._str(self._print(expr)) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return printmethod(expr, **kwargs) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add t = self._print(term) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return printmethod(expr, **kwargs) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul a_str = [self.parenthesize(x, prec, strict=False) for x in a] File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp> a_str = [self.parenthesize(x, prec, strict=False) for x in a] File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize return self._print(item) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return 
printmethod(expr, **kwargs) File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol assert self.symbol_to_source.get(expr), ( AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}. If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659 Approved by: https://github.com/YuqingJ	2024-06-14 05:18:07 +00:00
amdfaa	9ac08dab1f	Updates diskspace-cleanup for ROCm CI (#127947 ) Gets the location of the docker directory and outputs how much disk space is being used by docker. This is required since the new Cirrascale CI nodes for ROCm have docker root directory in a different partition. Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127947 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-06-14 04:32:38 +00:00
Huy Do	eff01bce21	Only run inductor A100 perf benchmark smoke test periodically (#128677 ) Attempt to mitigate the long queue on A100 as reported in https://github.com/pytorch/pytorch/issues/128627. From what I see, this change `03467b3fed/1` doubles the job duration from 20+ to 40+ minutes. This, together with https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-cu124.yml and maybe an increased number of PRs with `ciflow/inductor`, are all contributing to the long queue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128677 Approved by: https://github.com/atalman, https://github.com/desertfire	2024-06-14 02:39:33 +00:00
Aart Bik	ba3726d02b	[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 ) The assertEqualMeta() method already tests that the first argument is a FakeTensor https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523 Approved by: https://github.com/soulitzer	2024-06-14 02:34:51 +00:00
Sahdev Zala	685fcfb40d	Fix docstring in autograd (#128657 ) Fix docstrings in autograd files. The fix can be verified by running pydocstyle path-to-file --count Related #112593 BEFORE the PR:  pydocstyle torch/autograd/anomaly_mode.py --count 8 pydocstyle torch/autograd/__init__.py --count 9 AFTER the PR:  pydocstyle torch/autograd/anomaly_mode.py --count 0 pydocstyle torch/autograd/__init__.py --count 0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128657 Approved by: https://github.com/soulitzer	2024-06-14 02:18:59 +00:00
PyTorch MergeBot	0186b386cd	Revert "[ONNX] Add upsample trilinear to skip decomp (#128259 )" This reverts commit b72989a2b5ac4637612e31e325d7c8233fcbd7a1. Reverted https://github.com/pytorch/pytorch/pull/128259 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its ONNX job is failing in trunk `b72989a2b5` ([comment](https://github.com/pytorch/pytorch/pull/128259#issuecomment-2167058937))	2024-06-14 01:44:26 +00:00
anandptl84	f48ca2561d	Document `torch.cuda.profiler.start` (#128098 ) Documents the `start` function of cuda/profiler.py, as requested in https://github.com/pytorch/pytorch/issues/127917. Fixes #127917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128098 Approved by: https://github.com/aaronenyeshi	2024-06-14 01:44:18 +00:00
Isuru Fernando	41df20c07c	Run all samples for torchinductor tests (#128343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343 Approved by: https://github.com/lezcano	2024-06-14 01:28:32 +00:00
PyTorch MergeBot	6895a5804c	Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795 )" This reverts commit c472cec5656b9ffb668af97a02d711bdbdf5ebec. Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))	2024-06-14 01:14:59 +00:00
Mengwei Liu	6564d63e69	Use mv kernel for small M (#128632 ) Previously we were using: * mv kernel for M == 1 * mm kernel for 1 < M < 4 * llama.cpp inspired mm kernel for M >= 4 This PR consolidates them into only 2 kernels, using the same mv kernel for M < 12. Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm Mac M1 Max, input size M x 4128 x 4096 ![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632 Approved by: https://github.com/malfet	2024-06-14 01:06:53 +00:00
Sheng Fu	ae2359638b	Save DOT file of graph instead of SVG for GraphTransformObserver (#128634 ) Summary: GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with CMF model, if the graph is large, GraphViz took forever to convert DOT to SVG. That is NOT acceptable. This DIFF saves a DOT file instead of an SVG file to speed it up. The DOT file is also an order of magnitude smaller than the SVG. To view these graphs, users can run dot -Txxx input.dot to convert DOT to any other format they want. Users can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details. Test Plan: buck2 test mode/dev-sand caffe2/test:fx -- fx.test_fx_xform_observer.TestGraphTransformObserver Differential Revision: D58539182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634 Approved by: https://github.com/mengluy0125	2024-06-14 00:54:22 +00:00
Scott Wolchok	6f181756dc	Use by-column algorithm for fp16/bf16 CPUBlas gemm_transb kernels (#127318 ) Summary: #96074 (D44340826) changed the algorithm for 16-bit types for gemm_notrans_ and gemm_transb_ for the sake of precision. In this diff, we go back to the old algorithm for gemm_transb_, maintaining precision by allocating temporary space equal to (in elements, so actually double since we are accumulating 16-bit types into fp32) the size of `c` to accumulate into. Test Plan: Used https://github.com/malfet/llm_experiments (benchmarks/benchmark_torch_mm.py) to benchmark before and after: before: ``` mv_nt torch.float32 5.47 usec mv_nt torch.float16 8.45 usec mv_nt torch.bfloat16 183.43 usec mv_ta torch.float32 5.70 usec mv_ta torch.float16 24.17 usec mv_ta torch.bfloat16 97.27 usec notrans torch.float32 5.58 usec notrans torch.float16 25.18 usec notrans torch.bfloat16 63.11 usec trans_a torch.float32 5.59 usec trans_a torch.float16 68.94 usec trans_a torch.bfloat16 311.60 usec trans_b torch.float32 5.63 usec trans_b torch.float16 8.76 usec trans_b torch.bfloat16 29.17 usec ``` after: ``` mv_nt torch.float32 5.53 usec mv_nt torch.float16 8.57 usec mv_nt torch.bfloat16 188.17 usec mv_ta torch.float32 5.78 usec mv_ta torch.float16 28.59 usec mv_ta torch.bfloat16 98.45 usec notrans torch.float32 5.71 usec notrans torch.float16 26.08 usec notrans torch.bfloat16 64.06 usec trans_a torch.float32 5.72 usec trans_a torch.float16 32.21 usec trans_a torch.bfloat16 32.10 usec trans_b torch.float32 5.83 usec trans_b torch.float16 9.05 usec trans_b torch.bfloat16 29.66 usec ``` Also expanded coverage to a range of larger matrix-vector and matrix-matrix sizes. 
before: ``` Matrix-vector: m=1024, n=1024, k=1 ==================== notrans torch.float32 24.75 usec notrans torch.float16 258.04 usec notrans torch.bfloat16 245.64 usec trans_a torch.float32 26.94 usec trans_a torch.float16 692.09 usec trans_a torch.bfloat16 1709.53 usec m=4100, n=4100, k=1 ==================== notrans torch.float32 2811.48 usec notrans torch.float16 4192.06 usec notrans torch.bfloat16 4041.01 usec trans_a torch.float32 2778.38 usec trans_a torch.float16 17218.41 usec trans_a torch.bfloat16 27561.21 usec m=16384, n=16384, k=1 ==================== notrans torch.float32 60157.66 usec notrans torch.float16 64121.38 usec notrans torch.bfloat16 65714.65 usec trans_a torch.float32 84975.39 usec trans_a torch.float16 1024223.33 usec trans_a torch.bfloat16 1078683.21 usec Matrix-matrix: m=1024, n=1024, k=256 ==================== notrans torch.float32 302.55 usec notrans torch.float16 172869.06 usec notrans torch.bfloat16 172837.81 usec trans_a torch.float32 250.03 usec trans_a torch.float16 333373.38 usec trans_a torch.bfloat16 432760.00 usec m=4100, n=4100, k=128 ==================== notrans torch.float32 5278.56 usec notrans torch.float16 1426335.29 usec notrans torch.bfloat16 1404249.37 usec trans_a torch.float32 4818.63 usec trans_a torch.float16 2969936.17 usec trans_a torch.bfloat16 3432565.96 usec m=16384, n=16384, k=16 ==================== notrans torch.float32 72225.71 usec notrans torch.float16 1439875.54 usec notrans torch.bfloat16 1443716.33 usec trans_a torch.float32 221130.21 usec trans_a torch.float16 16910654.17 usec trans_a torch.bfloat16 21447377.63 usec ``` after: ``` Matrix-vector: m=1024, n=1024, k=1 ==================== notrans torch.float32 25.11 usec notrans torch.float16 252.76 usec notrans torch.bfloat16 238.58 usec trans_a torch.float32 26.62 usec trans_a torch.float16 167.40 usec trans_a torch.bfloat16 174.08 usec m=4100, n=4100, k=1 ==================== notrans torch.float32 2774.28 usec notrans torch.float16 3991.70 usec 
notrans torch.bfloat16 3945.44 usec trans_a torch.float32 3011.25 usec trans_a torch.float16 2666.85 usec trans_a torch.bfloat16 2686.95 usec m=16384, n=16384, k=1 ==================== notrans torch.float32 58682.15 usec notrans torch.float16 63077.52 usec notrans torch.bfloat16 63319.33 usec trans_a torch.float32 70549.57 usec trans_a torch.float16 42145.45 usec trans_a torch.bfloat16 42270.13 usec Matrix-matrix: m=1024, n=1024, k=256 ==================== notrans torch.float32 289.37 usec notrans torch.float16 179704.87 usec notrans torch.bfloat16 173490.33 usec trans_a torch.float32 330.89 usec trans_a torch.float16 42466.26 usec trans_a torch.bfloat16 42811.19 usec m=4100, n=4100, k=128 ==================== notrans torch.float32 4793.33 usec notrans torch.float16 1407557.04 usec notrans torch.bfloat16 1388212.17 usec trans_a torch.float32 4714.20 usec trans_a torch.float16 359406.58 usec trans_a torch.bfloat16 350419.42 usec m=16384, n=16384, k=16 ==================== notrans torch.float32 65757.08 usec notrans torch.float16 1427715.71 usec notrans torch.bfloat16 1440883.00 usec trans_a torch.float32 202263.44 usec trans_a torch.float16 1387522.33 usec trans_a torch.bfloat16 1762253.92 usec ``` We are improving, but still have a lot of room for improvement compared to float32 BLAS. Full disclosure: applying this same method to gemm_notrans (which does correspond to notrans in the benchmark's nomenclature) does not improve performance across the board; the 16KB x 16KB x 16 matmul regresses and I haven't figured out why yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127318 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-06-14 00:39:18 +00:00
Alnis Murtovi	18f5357f4f	Introduce heuristic for mixed_mm on A100 (#128232 ) This PR introduces a heuristic for tuned_mixed_mm. The heuristic is only enabled on an A100, because it has only been tested on an A100, and it is only enabled if force_mixed_mm="heuristic". I compared the heuristic to the aten fallback implementation and triton+autotune: Geometric mean speedup: 2.51 ``` m n k triton + autotune (GB/s) aten (GB/s) heuristic (GB/s) used_heuristic speedup (heuristic/aten) 1 4096 4096 456.95 134.59 459.37 True 3.41 1 4096 8192 523.93 138.29 553.50 True 4.00 1 4096 16394 233.70 161.62 234.14 True 1.45 1 8192 4096 633.25 140.64 574.86 True 4.09 1 8192 8192 737.54 147.41 690.26 True 4.68 1 8192 16394 413.67 175.88 408.68 True 2.32 1 16394 4096 717.22 167.22 665.36 True 3.98 1 16394 8192 812.69 177.17 815.90 True 4.61 1 16394 16394 473.17 178.58 435.11 True 2.44 4 4096 4096 479.46 134.80 486.74 True 3.61 4 4096 6333 174.27 106.74 171.64 True 1.61 4 4096 8192 567.14 138.32 571.09 True 4.13 4 4096 12313 179.65 105.91 180.03 True 1.70 4 4096 16394 222.96 145.54 222.81 True 1.53 4 6333 4096 491.78 126.37 473.20 True 3.74 4 6333 6333 268.79 143.40 269.75 True 1.88 4 6333 8192 783.80 135.12 796.23 True 5.89 4 6333 12313 286.35 142.37 287.30 True 2.02 4 6333 16394 362.47 139.66 361.47 True 2.59 4 8192 4096 642.73 140.53 641.88 True 4.57 4 8192 6333 287.65 137.63 287.38 True 2.09 4 8192 8192 738.42 150.16 721.59 True 4.81 4 8192 12313 301.27 146.18 302.31 True 2.07 4 8192 16394 415.37 167.66 393.41 True 2.35 4 12313 4096 823.66 141.81 745.40 True 5.26 4 12313 6333 433.92 148.17 429.83 True 2.90 4 12313 8192 984.60 149.30 988.95 True 6.62 4 12313 12313 452.00 150.87 452.50 True 3.00 4 12313 16394 609.88 159.20 609.71 True 3.83 4 16394 4096 779.44 157.46 777.10 True 4.94 4 16394 6333 402.93 139.50 309.47 True 2.22 4 16394 8192 950.38 175.49 949.67 True 5.41 4 16394 12313 414.62 153.99 315.95 True 2.05 4 16394 16394 497.56 174.97 461.77 True 2.64 16 4096 4096 475.92 134.45 
478.57 True 3.56 16 4096 6333 146.36 112.50 145.35 True 1.29 16 4096 8192 560.00 138.22 557.19 True 4.03 16 4096 12313 152.02 105.06 151.27 True 1.44 16 4096 16394 222.48 156.72 222.88 True 1.42 16 6333 4096 692.41 122.14 696.88 True 5.71 16 6333 6333 220.74 140.90 225.41 True 1.60 16 6333 8192 813.56 140.21 820.28 True 5.85 16 6333 12313 232.48 131.19 232.55 True 1.77 16 6333 16394 367.39 134.93 361.87 True 2.68 16 8192 4096 665.54 140.29 266.24 True 1.90 16 8192 6333 254.77 136.65 240.12 True 1.76 16 8192 8192 750.63 146.26 736.93 True 5.04 16 8192 12313 266.61 127.13 251.81 True 1.98 16 8192 16394 397.25 160.42 390.76 True 2.44 16 12313 4096 857.48 141.36 851.36 True 6.02 16 12313 6333 423.21 132.40 357.55 True 2.70 16 12313 8192 1021.24 145.68 1024.60 True 7.03 16 12313 12313 370.12 143.94 383.52 True 2.66 16 12313 16394 608.52 141.03 608.48 True 4.31 16 16394 4096 826.48 155.94 826.74 True 5.30 16 16394 6333 420.38 144.09 265.23 True 1.84 16 16394 8192 988.07 156.21 984.63 True 6.30 16 16394 12313 431.40 146.92 265.49 True 1.81 16 16394 16394 497.39 167.86 461.79 True 2.75 23 4096 4096 344.43 132.84 338.64 True 2.55 23 4096 6333 195.34 118.48 195.31 True 1.65 23 4096 8192 389.83 140.02 376.62 True 2.69 23 4096 12313 204.49 137.96 204.80 True 1.48 23 4096 16394 242.48 148.99 242.74 True 1.63 23 6333 4096 429.25 126.52 517.75 True 4.09 23 6333 6333 295.56 133.51 296.14 True 2.22 23 6333 8192 594.88 137.05 581.78 True 4.25 23 6333 12313 315.18 131.67 314.64 True 2.39 23 6333 16394 386.46 141.45 386.54 True 2.73 23 8192 4096 553.52 142.05 568.35 True 4.00 23 8192 6333 215.58 139.01 210.86 True 1.52 23 8192 8192 609.21 154.85 528.76 True 3.41 23 8192 12313 220.38 142.93 233.54 True 1.63 23 8192 16394 402.63 158.39 403.21 True 2.55 23 12313 4096 723.54 131.58 581.94 True 4.42 23 12313 6333 307.90 131.58 307.90 True 2.34 23 12313 8192 893.36 129.97 623.72 True 4.80 23 12313 12313 322.40 134.84 317.80 True 2.36 23 12313 16394 512.97 142.31 409.45 True 2.88 23 16394 
4096 703.66 154.54 643.53 True 4.16 23 16394 6333 305.55 127.55 293.17 True 2.30 23 16394 8192 768.12 154.60 681.53 True 4.41 23 16394 12313 311.61 140.92 307.01 True 2.18 23 16394 16394 467.24 171.07 467.29 True 2.73 32 4096 4096 344.71 132.30 338.62 True 2.56 32 4096 6333 206.48 107.59 205.55 True 1.91 32 4096 8192 387.24 137.82 353.12 True 2.56 32 4096 12313 216.35 120.61 214.50 True 1.78 32 4096 16394 242.05 149.92 241.94 True 1.61 32 6333 4096 525.50 127.12 518.02 True 4.08 32 6333 6333 300.50 118.41 296.55 True 2.50 32 6333 8192 600.92 136.99 601.94 True 4.39 32 6333 12313 316.13 136.45 316.03 True 2.32 32 6333 16394 386.11 141.34 386.10 True 2.73 32 8192 4096 546.18 140.18 341.14 True 2.43 32 8192 6333 218.40 130.65 263.42 True 2.02 32 8192 8192 608.29 147.16 542.12 True 3.68 32 8192 12313 225.60 135.04 225.23 True 1.67 32 8192 16394 434.75 160.42 401.28 True 2.50 32 12313 4096 787.80 136.28 583.60 True 4.28 32 12313 6333 316.66 125.76 323.35 True 2.57 32 12313 8192 891.38 128.88 639.50 True 4.96 32 12313 12313 326.11 132.37 325.88 True 2.46 32 12313 16394 521.64 139.47 395.69 True 2.84 32 16394 4096 625.55 158.46 651.16 True 4.11 32 16394 6333 304.14 131.13 284.55 True 2.17 32 16394 8192 767.79 162.95 704.34 True 4.32 32 16394 12313 310.74 137.68 303.39 True 2.20 32 16394 16394 465.92 171.43 465.37 True 2.71 43 4096 4096 345.05 133.87 196.47 True 1.47 43 4096 6333 148.64 99.92 148.97 True 1.49 43 4096 8192 386.50 135.39 214.00 True 1.58 43 4096 12313 190.39 109.36 156.27 True 1.43 43 4096 16394 203.63 150.24 204.05 True 1.36 43 6333 4096 421.35 106.04 132.25 True 1.25 43 6333 6333 224.75 113.01 224.97 True 1.99 43 6333 8192 471.11 117.61 327.39 True 2.78 43 6333 12313 234.55 115.61 234.74 True 2.03 43 6333 16394 311.56 132.24 312.01 True 2.36 43 8192 4096 400.73 140.12 269.11 True 1.92 43 8192 6333 167.32 119.13 168.84 True 1.42 43 8192 8192 435.45 146.98 286.21 True 1.95 43 8192 12313 161.05 127.82 162.78 True 1.27 43 8192 16394 207.16 156.40 208.90 True 
1.34 43 12313 4096 484.01 120.10 313.35 True 2.61 43 12313 6333 234.54 106.63 232.85 True 2.18 43 12313 8192 515.34 130.23 411.70 True 3.16 43 12313 12313 239.39 130.04 239.03 True 1.84 43 12313 16394 316.02 137.39 316.29 True 2.30 43 16394 4096 475.60 152.57 340.97 True 2.23 43 16394 6333 241.21 132.49 208.59 True 1.57 43 16394 8192 499.34 157.43 361.61 True 2.30 43 16394 12313 246.25 132.31 211.68 True 1.60 43 16394 16394 302.90 158.56 277.05 True 1.75 64 4096 4096 280.48 126.82 195.97 True 1.55 64 4096 6333 150.94 101.63 150.48 True 1.48 64 4096 8192 305.47 135.06 211.03 True 1.56 64 4096 12313 158.12 110.06 158.15 True 1.44 64 4096 16394 206.68 136.21 201.28 True 1.48 64 6333 4096 409.11 105.10 296.07 True 2.82 64 6333 6333 229.98 108.46 230.59 True 2.13 64 6333 8192 469.32 112.24 330.58 True 2.95 64 6333 12313 245.02 117.16 244.84 True 2.09 64 6333 16394 317.78 125.80 318.37 True 2.53 64 8192 4096 323.42 139.92 267.31 True 1.91 64 8192 6333 167.51 118.45 167.56 True 1.41 64 8192 8192 341.13 146.71 284.88 True 1.94 64 8192 12313 172.21 123.42 171.97 True 1.39 64 8192 16394 217.22 153.18 216.99 True 1.42 64 12313 4096 482.19 123.32 311.82 True 2.53 64 12313 6333 238.73 123.88 238.66 True 1.93 64 12313 8192 516.32 122.11 330.50 True 2.71 64 12313 12313 248.73 125.32 296.82 True 2.37 64 12313 16394 314.98 134.06 320.31 True 2.39 64 16394 4096 476.59 154.58 340.84 True 2.20 64 16394 6333 240.54 119.60 214.82 True 1.80 64 16394 8192 501.36 149.02 359.45 True 2.41 64 16394 12313 244.65 126.01 222.47 True 1.77 64 16394 16394 302.48 160.36 283.66 True 1.77 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128232 Approved by: https://github.com/Chillee	2024-06-14 00:31:22 +00:00
cyy	9ebec1f345	Enable Wunused-function in torch_cpu (#128576 ) Follows #128499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128576 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-14 00:12:58 +00:00
Jane Xu	6767e38267	Fix manual licensing (#128630 ) It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for: clog - BSD-3 eigen - MPL-2.0 ffnvcodec - LGPL-2.1 -> hungarian - Permissive (free to use) irrlicht - The Irrlicht Engine License (zlib/libpng) -> pdcurses - Public Domain for core -> sigslot - Public Domain test - BSD-3 Vulkan - Apache-2.0 or MIT fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967 This PR addresses the manual licensing mismatches mentioned above (the two bolded ones; one is being addressed in #128085). Everything else is generated by pulling through other files, so I did not address those. It is unclear what needs to be updated for the remaining entries to be accurate, or whether they are inaccurate today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630 Approved by: https://github.com/malfet	2024-06-14 00:12:09 +00:00
Yidi Wu	afdaa7fc95	[while_loop] expose it as torch.while_loop (#128562 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128562 Approved by: https://github.com/zou3519	2024-06-13 23:44:10 +00:00
chilli	c486e2ab64	Add coloring to fx graph print out (#128476 ) Note: Won't land immediately, at least I'll need to add a color option to the field. But curious if any tests fail. Old: <img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6"> New: <img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476 Approved by: https://github.com/ezyang	2024-06-13 23:39:04 +00:00
rzou	61421c42c0	[custom_op] don't invoke autograd.Function when unnecessary (#127976 ) This matches our autograd logic for pytorch native operators. There's no need to invoke an autograd.Function if we're under a torch.no_grad() or if none of the inputs have requires_grad=True (invoking an autograd.Function results in (noticeable) overhead). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976 Approved by: https://github.com/williamwen42	2024-06-13 23:38:23 +00:00
titaiwangms	b72989a2b5	[ONNX] Add upsample trilinear to skip decomp (#128259 ) (1) Add upsample trilinear vec to skip decomposition (2) Add tests to make sure that torch.export.export still decomposes them Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259 Approved by: https://github.com/justinchuby	2024-06-13 23:31:34 +00:00
Jane Xu	8c20f53a5e	Try seeding individual foreach tests (#128220 ) A first easy attempt to deflake foreach Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220 Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn	2024-06-13 22:42:16 +00:00
Animesh Jain	865d7b3424	[Reland][dynamo] Enable some inlining inbuilt nn module tests (#128440 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-13 22:39:22 +00:00
Shangdi Yu	3a0006ef22	Remove global variable SIZE, and fix linter warning (#128559 ) - Resolve a TODO by removing global variable `SIZE`. - Fix a linter warning in `test/test_nestedtensor.py`. `pytest pytorch/test/test_sort_and_select.py` and `pytest test/test_nestedtensor.py` pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128559 Approved by: https://github.com/kit1980, https://github.com/Skylion007	2024-06-13 22:09:51 +00:00
Andrew Hoblitzell	6211e67e49	Document `torch.jit.frontend.get_default_args` (#128408 ) Fixes #127896 ### Description Add docstring to `torch/jit/frontend.py:get_default_args` function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408 Approved by: https://github.com/malfet	2024-06-13 21:49:16 +00:00
Andrew Gu	bf8a05f483	[FSDP2] Included module FQN in `FSDPParamGroup` `record_function`s (#128624 ) This PR adds the module FQN into the `FSDPParamGroup` `record_function`s for improved clarity in profiler traces. Differential Revision: [D58544809](https://our.internmc.facebook.com/intern/diff/D58544809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128624 Approved by: https://github.com/ckluk2	2024-06-13 21:35:33 +00:00
PyTorch MergeBot	c8e9656a12	Revert "Add test to xfail_list only for abi_compatible (#128506 )" This reverts commit 49366b2640df1cba5a3b40bedd31b57b08529612. Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk `49366b2640` ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))	2024-06-13 21:30:07 +00:00
Jing Xu	8763d44bf1	add xpu to torch.compile (#127279 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.compile doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279 Approved by: https://github.com/dvrogozh, https://github.com/svekars	2024-06-13 21:15:09 +00:00
Yifu Wang	790138fdc7	Add profiler annotation for fused_all_gather_matmul and fused_matmul_reduce_scatter (#127556 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127556 Approved by: https://github.com/awgu ghstack dependencies: #127454, #127455	2024-06-13 20:52:46 +00:00
Yifu Wang	3b28dc6c9d	Improve the scheduling for fused_matmul_reduce_scatter (#127455 ) In fused_all_gather_matmul, each rank copies its shard into its local p2p buffer, performs a barrier, then performs (copy -> matmul) for each remote shard. The (copy -> matmul)s for remote shards run on two streams without synchronization. This not only allows for computation/communication overlapping, but also computation/computation overlapping, which alleviates the wave quantization effect caused by computation decomposition. However, the synchronization-free approach doesn't work well with fused_matmul_reduce_scatter, in which there's a barrier in every step. Without synchronization between the two streams, a matmul in one stream can delay a barrier in the other stream, further delaying the copy waiting for the barrier. This PR addresses the issue by adding synchronization between the two streams such that the matmul of step i can only start after the barrier of step i-1 completes. With this approach, we lose the computation/computation overlapping, but avoid the slowdown due to the delayed barrier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455 Approved by: https://github.com/Chillee ghstack dependencies: #127454	2024-06-13 20:52:46 +00:00
Arun Pa	c0b40ab42e	doc string for torch.jit.frontend.get_jit_class_def method (#128391 ) Fixes #127904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128391 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-06-13 19:51:02 +00:00
James Wu	a3af32c2fb	Add functionality to make ViewAndMutationData (slightly more) cache safe (#127618 ) This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from `__tensor_flatten__`. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, only store the result of `__tensor_flatten__()` on any FakeTensors representing subclasses. That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of `__tensor_flatten__`, we won't save to the cache. To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of `__tensor_flatten__()` instead of an entire FakeTensor. Let me know if we should change the name of the function. By doing this, we can now run the dynamic shapes cache test with autograd turned on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618 Approved by: https://github.com/bdhirsh	2024-06-13 19:45:33 +00:00
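The "skip the cache when pickling fails" behaviour described in #127618 can be sketched in plain Python. This is only an illustration under stated assumptions, not the AOTAutograd code: `try_save_to_cache` and the dict-backed `cache` are hypothetical names, and the payloads merely stand in for whatever `__tensor_flatten__()` returns.

```python
import pickle
import threading


def try_save_to_cache(cache, key, payload):
    """Hypothetical helper: store the pickled payload, or skip it if it cannot be pickled."""
    try:
        blob = pickle.dumps(payload)
    except (pickle.PicklingError, TypeError, AttributeError):
        # Unserializable metadata: leave the cache untouched instead of failing.
        return False
    cache[key] = blob
    return True


cache = {}
# Stand-in for a serializable __tensor_flatten__() result.
saved = try_save_to_cache(cache, "graph_a", (["inner"], {"dtype": "f32"}))
# Stand-in for a result holding something pickle rejects (a lock).
skipped = try_save_to_cache(cache, "graph_b", {"lock": threading.Lock()})
print(saved, skipped, sorted(cache))
```

The design choice mirrored here is that an unpicklable entry degrades to a cache miss rather than an error.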
Sam Larsen	39193b10e8	[inductor] fx graph cache: memoize devices to make cache key calculation more predictable (#128366 ) Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this: - We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key. - Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID. - That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects. The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation? I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-06-13 19:25:14 +00:00
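The pickle behaviour described in #128366 can be reproduced with the standard library alone. In this sketch, plain tuples stand in for `torch.device` objects, and `make_device`, `cache_key`, and `memoize_devices` are hypothetical helpers, not the inductor implementation.

```python
import hashlib
import pickle


def make_device(kind, index):
    # Build a fresh tuple on every call so that equal "devices" are distinct objects.
    return tuple([kind, index])


def cache_key(obj):
    # Hash the pickled bytes, as the commit message describes for the FX graph cache.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


def memoize_devices(devices):
    # Replace every device by the first equal instance seen, so object identity
    # no longer depends on how the devices were created.
    seen = {}
    return [seen.setdefault(d, d) for d in devices]


cpu_a = make_device("cpu", 0)
cpu_b = make_device("cpu", 0)

shared = [cpu_a, cpu_a]    # the same object twice
distinct = [cpu_a, cpu_b]  # equal but distinct objects

print(shared == distinct)
print(cache_key(shared) == cache_key(distinct))
print(cache_key(memoize_devices(shared)) == cache_key(memoize_devices(distinct)))
```

The two lists compare equal, yet their keys differ because pickle emits a memo reference for the repeated object in `shared` and a second full tuple in `distinct`. After memoizing, both inputs pickle to the same bytes.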
Shunting Zhang	c54e358bdb	enable comprehensive padding internally (#128555 ) Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now that it has been on in OSS for quite a while and we have verified that it has no NE impact on CMF, we want to update the unit test and enable the feature. Test Plan: ``` time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus' ``` Differential Revision: D58425432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555 Approved by: https://github.com/eellison	2024-06-13 19:20:00 +00:00
Isuru Fernando	cdc37e4bff	Add a shape property to IR nodes (#127818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127818 Approved by: https://github.com/peterbell10	2024-06-13 19:11:52 +00:00
Xuehai Pan	5a80d2df84	[BE] enable UFMT for `torch/nn/utils` (#128595 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128595 Approved by: https://github.com/Skylion007	2024-06-13 18:34:57 +00:00
Bin Bao	9f55c80a9f	[AOTI] Fix a minimal_arrayref_interface test failure (#128613 ) Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter. Test Plan: CI Differential Revision: D58528575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613 Approved by: https://github.com/hl475	2024-06-13 18:25:04 +00:00
vasiliy	a265556362	inductor fusion logs: make it easier to attribute to aten graph (#127159 ) Summary: I want to be able to look at inductor fusion logs and reason about which parts of the aot_autograd aten graph were fused / not fused. This PR adds a short description of each buffer to the fusion logs. Example for forward of `Float8Linear`: ``` torch._inductor.scheduler.__fusion: ===== attempting fusion (1/10): 13 nodes ===== torch._inductor.scheduler.__fusion: fuse_nodes_once, candidates: torch._inductor.scheduler.__fusion: SchedulerNode(name='buf0'), Reduction(['[254201]', 'max', 'origins={abs_1, max_1}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf3'), Reduction(['[114688]', 'max', 'origins={abs_2, max_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf6'), Pointwise(['[]', 'origins={reciprocal_1, convert_element_type_6, clamp_min_2, mul_2, copy_1, reciprocal_3, convert_element_type_5}']) torch._inductor.scheduler.__fusion: ExternKernelSchedulerNode(name='buf10') torch._inductor.scheduler.__fusion: SchedulerNode(name='buf2'), Pointwise(['[]', 'origins={full_default}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf8'), Pointwise(['[8192, 7168]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf4'), Reduction(['[512]', 'max', 'origins={abs_2, max_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf13'), Pointwise(['[8192, 7168]', 'origins={clone_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf7'), Pointwise(['[16384, 8192]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}']) torch._inductor.scheduler.__fusion: 
ExternKernelSchedulerNode(name='buf9') torch._inductor.scheduler.__fusion: SchedulerNode(name='buf1'), Reduction(['[528]', 'max', 'origins={abs_1, max_1}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf5'), Pointwise(['[]', 'origins={convert_element_type, clamp_min, convert_element_type_1, copy, reciprocal_2, mul, reciprocal}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf12'), Pointwise(['[8192, 16384]', 'origins={clone_1}']) torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: numel/rnumel mismatch (reduce) (528, 1), (254201, 528) torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf1: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf12 with buf1: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf7: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf12: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf8: intermediate nodes between node1 & node2 torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf13: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf4: numel/rnumel mismatch (reduce) (512, 1), (114688, 512) torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf4: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf13 with buf4: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf8: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf13: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf9: node2 is extern or nop 
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf9 with buf10: node1 is extern or nop torch._inductor.scheduler.__fusion: found 4 possible fusions torch._inductor.scheduler.__fusion: fusing buf7 with buf12 torch._inductor.scheduler.__fusion: fusing buf8 with buf13 torch._inductor.scheduler.__fusion: fusing buf4 with buf6 torch._inductor.scheduler.__fusion: fusing buf1 with buf5 torch._inductor.scheduler.__fusion: completed fusion round (1/10): fused 13 nodes into 9 nodes ``` Test Plan: will add tests once we align on a version of this that can land Pull Request resolved: https://github.com/pytorch/pytorch/pull/127159 Approved by: https://github.com/mlazos	2024-06-13 18:22:02 +00:00
JJ Asghar	de9a072ac4	Updating the `sigslot` license to Public Domain (#128085 ) It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085 Approved by: https://github.com/janeyx99	2024-06-13 18:13:54 +00:00
Thanh Ha	8733c4f4be	docs: Add link to test-infra issue (#128608 ) It's not immediately obvious from this file that the issue being referred to is in another repo. Add that detail and link to make it easier for folks reading this code to jump to the correct issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128608 Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/ZainRizvi	2024-06-13 18:00:53 +00:00
PyTorch MergeBot	dd19c9150c	Revert "[aota] compiled forward outputs requires_grad alignment with eager (#128016 )" This reverts commit b459713ca75f6ab7c8a59acec0258e0f77904ada. Reverted https://github.com/pytorch/pytorch/pull/128016 on behalf of https://github.com/bdhirsh due to fix torchbench regression ([comment](https://github.com/pytorch/pytorch/pull/128016#issuecomment-2166446841))	2024-06-13 17:56:42 +00:00
Yifu Wang	52f529105d	force_stride_order on fused_all_gather_matmul/fused_matmul_reduce_scatter's operands to avoid a copy due to layout transformation (#127454 ) When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation. This copy can be avoided if the lhs operand already has the following stride order: lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride() In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454 Approved by: https://github.com/Chillee	2024-06-13 17:52:37 +00:00
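The stride order named in #127454, `lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()`, can be worked out by hand. The sketch below does the same arithmetic in plain Python without importing torch; `forced_strides` is a hypothetical name, and the code assumes a dense tensor with positive sizes.

```python
def forced_strides(shape, gather_dim):
    """Strides of x.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim)."""
    ndim = len(shape)
    # movedim(gather_dim, 0): gather_dim comes first, the other dims keep their order.
    order = [gather_dim] + [d for d in range(ndim) if d != gather_dim]
    moved_shape = [shape[d] for d in order]
    # contiguous(): row-major strides for the moved shape.
    moved_strides = [1] * ndim
    for i in range(ndim - 2, -1, -1):
        moved_strides[i] = moved_strides[i + 1] * moved_shape[i + 1]
    # movedim(0, gather_dim): each dim returns to its original position and keeps its stride.
    strides = [0] * ndim
    for pos, d in enumerate(order):
        strides[d] = moved_strides[pos]
    return tuple(strides)


print(forced_strides((4, 8, 16), 0))
print(forced_strides((4, 8, 16), 1))
```

For `gather_dim == 0` the result is the ordinary contiguous layout, which matches the commit message's statement that the copy only matters when `gather_dim/scatter_dim != 0`.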
Joel Schlosser	d5780396c7	Skip debug asserts for mixed dense, subclass views in autograd_not_implemented_fallback (#128057 ) Fixes #125503 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128057 Approved by: https://github.com/albanD, https://github.com/soulitzer ghstack dependencies: #127007	2024-06-13 17:13:02 +00:00
Joel Schlosser	9a8917fdbd	Naive CPU kernels for jagged <-> padded dense conversions (#127007 ) This PR introduces naive CPU impls for: * `_jagged_to_padded_dense_forward()` * `_padded_dense_to_jagged_forward()` On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007 Approved by: https://github.com/davidberard98	2024-06-13 17:13:02 +00:00
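To make the jagged <-> padded dense conversions from #127007 concrete, here is a naive list-based sketch of the idea. It is not the ATen kernel: `jagged_to_padded` and `padded_to_jagged` are hypothetical names, the data is one-dimensional per row, and the offsets-based layout is an assumption made for illustration.

```python
def jagged_to_padded(values, offsets, max_len, padding_value=0.0):
    """Expand a flat values list plus row offsets into equal-length rows."""
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        row = list(values[start:end][:max_len])            # truncate overly long rows
        row += [padding_value] * (max_len - len(row))      # pad short rows
        rows.append(row)
    return rows


def padded_to_jagged(padded, offsets):
    """Drop the padding again, keeping only the first (end - start) items of each row."""
    values = []
    for row, (start, end) in zip(padded, zip(offsets, offsets[1:])):
        values.extend(row[: end - start])
    return values


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0, 2, 3, 6]  # rows of length 2, 1 and 3
padded = jagged_to_padded(values, offsets, max_len=3)
print(padded)
print(padded_to_jagged(padded, offsets))
```

The round trip recovers the original values only when no row is longer than `max_len`; longer rows are truncated in the padded form.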
Animesh Jain	a0604193a2	handle call_function with Parameter args in DDPOptimizer splitting (#128034 ) When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph. For example, ``` class GraphModule(torch.nn.Module): def forward(self, L_x_: "f32[1024, 1024]"): l_x_ = L_x_ # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x) l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_); l_x_ = None l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0); l__self___layers_0 = None return (l__self___layers_1,) ``` becomes ``` class GraphModule(torch.nn.Module): def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"): l_self_layers_0_weight = L_self_layers_0_weight l_self_layers_0_bias = L_self_layers_0_bias l_x_ = L_x_ l_self_layers_1_weight = L_self_layers_1_weight l_self_layers_1_bias = L_self_layers_1_bias # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias) input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias); l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias); input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None return (input_2,) ``` When splitting, the DDP optimizer does not handle the inlined graph, because it does not handle function calls: previously there were no function calls with params as inputs (only calls to modules). This diff addresses that by using the example_value in the arguments to determine the Parameter arguments of a function call and the Parameter properties. 
This addresses https://github.com/pytorch/pytorch/issues/127552. Running the optimizer on the code above with inlining yields the following splitting: ``` ---submod_0 graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_] %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight] %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias] %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {}) return linear ---submod_1 graph--- graph(): %input_1 : [num_users=1] = placeholder[target=input_1] %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight] %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias] %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {}) return linear ---final graph--- graph(): %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight] %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias] %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_] %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight] %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias] %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {}) %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {}) return (submod_1,) --------------- ``` whereas without inlining it 
used to be ``` ---submod_0 graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_] %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {}) return l__self___layers_0 /data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. warnings.warn( ---submod_1 graph--- graph(): %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0] %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {}) return l__self___layers_1 ---final graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_] %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {}) %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {}) return (submod_1,) --------------- ``` TESTING: (1) Running ``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 pytest test/distributed/test_dynamo_distributed.py -k ``` results in a reduction in failures from 6 to 2 with this PR. The two remaining failures are FSDP-related, do not look trivial, and involve many details; they are left for future work. Co-authored-by: Animesh Jain <anijain@umich.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034 Approved by: https://github.com/anijain2305, https://github.com/wconstab	2024-06-13 17:07:27 +00:00
lezcano	3e3435678c	Remove some implications from the static_eval pattern matcher (#128500 ) We should be able to remove this as, with the new canonicalisation, we have that `a < b` and `-a > -b` should be canonicalised to the same expression (if SymPy does not interfere too much). nb. I thought this would cut further the compilation time, but I was running the benchmarks wrong (not removing triton's cache oops). It turns out that after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500 Approved by: https://github.com/ezyang ghstack dependencies: #128410, #128411	2024-06-13 16:50:00 +00:00
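The claim in #128500 that `a < b` and `-a > -b` should canonicalise to the same expression can be illustrated with a toy canonicaliser. This is not PyTorch's or SymPy's implementation: `canonicalise` is a hypothetical function, and each side is modelled as a dict of integer coefficients per symbol, with no constant terms.

```python
def canonicalise(lhs, rel, rhs):
    """lhs, rhs: dicts of symbol name -> integer coefficient; rel is '<' or '>'."""
    # Bring everything to one side: lhs - rhs rel 0.
    diff = dict(lhs)
    for sym, coeff in rhs.items():
        diff[sym] = diff.get(sym, 0) - coeff
    # Keep positive terms on the left and move negative terms to the right,
    # so no side ever carries a -1 factor.
    left = {s: c for s, c in diff.items() if c > 0}
    right = {s: -c for s, c in diff.items() if c < 0}
    # Normalise the direction: always express the relation with '<'.
    if rel == ">":
        left, right, rel = right, left, "<"
    return (tuple(sorted(left.items())), rel, tuple(sorted(right.items())))


a_lt_b = canonicalise({"a": 1}, "<", {"b": 1})
neg_form = canonicalise({"a": -1}, ">", {"b": -1})
b_lt_a = canonicalise({"b": 1}, "<", {"a": 1})
print(a_lt_b == neg_form, a_lt_b == b_lt_a)
```

Once both spellings map to one canonical form, a pattern matcher no longer needs a separate implication linking them, which is the motivation the commit message gives for removing those implications.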
lezcano	0fdd8d84fa	Do not generate -1* in SymPy expressions when canonicalising (#128411 ) Partially addresses https://github.com/pytorch/pytorch/issues/128150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128411 Approved by: https://github.com/ezyang ghstack dependencies: #128410	2024-06-13 16:49:59 +00:00
lezcano	bdeb9225b0	Do not call `get_implications` unnecessarily (#128410 ) This should improve compilation times. With this PR and the patch in the original issue, I get a compilation time of `Compilation time: 307.30 second`. Fixes https://github.com/pytorch/pytorch/issues/128398 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128410 Approved by: https://github.com/Chillee	2024-06-13 16:49:55 +00:00
cyy	e2a72313e8	Concat namespaces of torch/csrc/profiler code and other fixes (#128606 ) Improve namespaces and modernize codebase of torch/csrc/profiler code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128606 Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi	2024-06-13 16:46:34 +00:00
Tristan Rice	7c370d2fb0	expose set_thread_name to Python and set thread names (#128448 ) This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process. Threads named: * torchrun/elastic * PyTorch dataloader worker processes + pin memory thread * TCPStore * ProcessGroupNCCL background threads * WorkerServer httpserver thread Test plan: ``` $ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL \| grep pt_' 3264281 3264281 pts/45 00:00:02 pt_elastic 3264281 3267950 pts/45 00:00:00 pt_elastic ``` dataloading ```py import torch import time from torch.utils.data import ( DataLoader, Dataset, ) class NoopDataset(Dataset): def __getitem__(self, index): return index def __len__(self): return 10 dataloader = DataLoader(NoopDataset(), num_workers=2) for i, x in enumerate(dataloader): print(i, x) time.sleep(10000) ``` ``` $ python3 ~/scripts/dataloader_test.py $ ps -eL \| grep pt_ 1228312 1228312 pts/45 00:00:02 pt_main_thread 1228312 1230058 pts/45 00:00:00 pt_main_thread 1228312 1230059 pts/45 00:00:00 pt_main_thread 1230052 1230052 pts/45 00:00:00 pt_data_worker 1230052 1230198 pts/45 00:00:00 pt_data_worker 1230052 1230740 pts/45 00:00:00 pt_data_worker 1230055 1230055 pts/45 00:00:00 pt_data_worker 1230055 1230296 pts/45 00:00:00 pt_data_worker 1230055 1230759 pts/45 00:00:00 pt_data_worker ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448 Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro	2024-06-13 16:38:23 +00:00
Zain Rizvi	b05b8d3989	[EZ][ALI Migration] Add logging for workflow type determination (#128619 ) To help figure out what went wrong when the wrong label appears to have been set Pull Request resolved: https://github.com/pytorch/pytorch/pull/128619 Approved by: https://github.com/zxiiro, https://github.com/clee2000	2024-06-13 16:37:07 +00:00
Yidi Wu	e9b81e4edf	Fakify torch bind input by default (#128454 ) Summary: Try a reland of https://github.com/pytorch/pytorch/pull/127116 after some fixes landed Differential Revision: D58418251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128454 Approved by: https://github.com/angelayi	2024-06-13 16:25:11 +00:00
PyTorch MergeBot	c63ccead5e	Revert "[dynamo] Enable some inlining inbuilt nn module tests (#128440 )" This reverts commit 1602c7d0c861a4382746ccb18c76d8703a636f4e. Reverted https://github.com/pytorch/pytorch/pull/128440 on behalf of https://github.com/clee2000 due to new test broke internally D58501220 ([comment](https://github.com/pytorch/pytorch/pull/128440#issuecomment-2166127531))	2024-06-13 16:14:37 +00:00
Oguz Ulgen	17b45e905a	Fix get output code when caching is enabled (#128445 ) Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits. Test Plan: ci Differential Revision: D58429602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral	2024-06-13 16:00:30 +00:00
Aaron Gokaslan	93a14aba6e	[BE]: Update mypy to 1.10.0 (#127717 ) Updates mypy to the latest and greatest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717 Approved by: https://github.com/ezyang	2024-06-13 15:57:13 +00:00
Wu, Chunyuan	49366b2640	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-13 15:32:15 +00:00
xinan.lin	cf7adc2fa1	[Inductor] Update Intel GPU Triton commit pin. (#124842 ) Update Intel triton for Pytorch 2.4 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124842 Approved by: https://github.com/EikanWang	2024-06-13 14:34:37 +00:00
Tom Ritchford	edb45dce85	Add OpInfo entry for as_strided_copy (#127231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231 Approved by: https://github.com/lezcano	2024-06-13 13:58:47 +00:00
rzou	7cc07a3eb1	[custom_op] stop using nonlocals to store information (#128547 ) Fixes https://github.com/pytorch/pytorch/issues/128544 Fixes https://github.com/pytorch/pytorch/issues/128535 We had a problem with multithreading where the nonlocals were being clobbered. In the first place, we stored these nonlocals because we wanted to ferry information from an autograd.Function.apply to autograd.Function.forward. Our new approach is: - pass the information directly as an input to the autograd.Function.apply. This means that the autograd.Function.forward will receive the information too. - this messes up ctx.needs_input_grad, which has an element per input to forward. The user should not see the additional information we passed. We fix this by temporarily overriding ctx.needs_input_grad to the right thing. - this exposed a bug in that ctx.needs_input_grad wasn't correct for TensorList inputs. This PR fixes that too. Test Plan: - existing and new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547 Approved by: https://github.com/williamwen42, https://github.com/soulitzer	2024-06-13 13:36:39 +00:00
IvanKobzarev	2b9465d62a	[aota] Allow some mutations in backward (#128409 ) https://github.com/pytorch/pytorch/issues/127572 Allow mutations in backward on forward inputs if: 1/ the mutation does not mutate metadata (enforced at compilation time); 2/ when create_graph=True, the mutated input does not require grad (enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()). Adds input_joint_info to track mutations of inputs during joint tracing. It is a separate field in ViewAndMutationMeta because it is filled only after joint fn tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409 Approved by: https://github.com/bdhirsh	2024-06-13 12:09:08 +00:00
Laith Sakka	d0c08926d1	allow inlining functions in _python_dispatch and _is_make_fx_tracing (#128485 ) This fixes graph breaks in the torch_multimodal_clip benchmark. Co-authored-by: Animesh Jain <anijain@umich.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128485 Approved by: https://github.com/anijain2305 ghstack dependencies: #128428	2024-06-13 09:56:39 +00:00
Jiong Gong	1fd2cd26a0	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel	2024-06-13 09:46:22 +00:00
Jason Ansel	c897651392	[inductor] Add BackendFeature gating (#128266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266 Approved by: https://github.com/shunting314	2024-06-13 07:31:51 +00:00
Yu, Guangye	88974fedd0	Clean up xpu ut to make CI happy (#128383 ) # Motivation Before #127611 merged, the xpu-specific UT `test/test_xpu.py` was skipped temporarily. This PR aims to fix the UT bug introduced by #127741. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128383 Approved by: https://github.com/EikanWang	2024-06-13 07:06:41 +00:00
Eddie Yan	ce79b09415	[CUDA][Sparse] Change comparison function of `test_sparse_semi_structured.py` and bump tolerances for `sp24_matmuls` (#128553 ) Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553 Approved by: https://github.com/jcaip	2024-06-13 06:58:07 +00:00
Nikita Shulga	0678742924	[MPS] Add Metal implementation of exp op (#128421 ) To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor) Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU Fix bug in non-contiguous tensors handling Fixes https://github.com/pytorch/pytorch/issues/84936 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421 Approved by: https://github.com/kulinseth ghstack dependencies: #128373, #128375	2024-06-13 06:53:17 +00:00
Wang, Eikan	14c9eb5ed2	Add XPU code owners (#128486 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128486 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-13 06:33:45 +00:00
Catherine Lee	518c9e6455	Forward fix lint (#128587 ) merge at will After https://github.com/pytorch/pytorch/pull/125968 and https://github.com/pytorch/pytorch/pull/127693 landrace Pull Request resolved: https://github.com/pytorch/pytorch/pull/128587 Approved by: https://github.com/huydhn	2024-06-13 06:19:03 +00:00
Animesh Jain	c52eda896e	[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428 Approved by: https://github.com/yanboliang, https://github.com/mlazos ghstack dependencies: #126578, #128440, #128470, #128453, #128484	2024-06-13 06:08:56 +00:00
Animesh Jain	1f6e84fa68	[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484 Approved by: https://github.com/mlazos ghstack dependencies: #126578, #128440, #128470, #128453	2024-06-13 06:08:56 +00:00
Shaz Qadeer	ea541dd965	SymIntify cross_entropy_loss_prob_target numel call (#128141 ) This PR replaces call to ```numel``` with ```sym_numel``` in cross_entropy_loss_prob_target. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128141 Approved by: https://github.com/ezyang	2024-06-13 05:37:17 +00:00
Mengwei Liu	ade3d07483	GGML inspired int8 MM Metal shader (#127646 ) ## Context This PR ported GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into ATen library. llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE ## Key Changes Made the following changes to the original code: * Memory layout of weight and scales is different than llama.cpp. * Weight dequantization (scales multiplication) is done after MM is finished. * Following PyTorch naming convention (M, K, N and assuming row major). ## Benchmark When M = 1, mv shader improves existing ATen int8mm by 40%. When M > 4, mm shader outperforms existing ATen int8mm up to 10x for a large M, as shown below. ![image](https://github.com/pytorch/pytorch/assets/8188269/fd9eff71-c538-4263-a7b5-f96fe479ae9d) Hence the kernel chooses different shaders based on M. ## Test Plan Tests are passing: ``` ❯ python test/test_mps.py -v -k _int8_ /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so Expected in: <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? warn( test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... 
ok test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok ---------------------------------------------------------------------- Ran 12 tests in 1.180s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646 Approved by: https://github.com/malfet	2024-06-13 05:23:56 +00:00
Michael Lazos	b86b4ace88	Invalidate eager params when inlining and freezing nn modules (#128543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128543 Approved by: https://github.com/anijain2305	2024-06-13 04:50:17 +00:00
Xuehai Pan	83bb9b7c53	[BE] explicitly export subpackage `torch.utils` (#128342 ) Resolves #126401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128342 Approved by: https://github.com/Skylion007 ghstack dependencies: #127707	2024-06-13 04:39:16 +00:00
Edward Z. Yang	2229884102	Introduce int_oo (#127693 ) In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range. After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better. But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. test/test_sympy_utils.py describes some basic properties of the number, and torch/utils/_sympy/numbers.py has the actual implementation. The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments. Fixes https://github.com/pytorch/pytorch/issues/127396 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693 Approved by: https://github.com/lezcano ghstack dependencies: #126905	2024-06-13 04:08:20 +00:00
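The trade-off described in the int_oo commit above can be illustrated with a small pure-Python sketch. This is not the implementation in torch/utils/_sympy/numbers.py; the class name `IntInfinity` and the singleton `INT_OO` are hypothetical stand-ins, and only a few operations are modeled. It shows that doubling a `sys.maxsize`-based bound produces an ever larger concrete integer, while a symbolic integer infinity absorbs the same operation and still advertises `is_integer`.

```python
import sys


class IntInfinity:
    """Hypothetical stand-in for an integer-typed positive infinity."""

    is_integer = True  # unlike a float infinity, this claims to be an integer

    def __mul__(self, other):
        # Multiplying by a positive int leaves infinity unchanged.
        if isinstance(other, int) and other > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__

    def __add__(self, other):
        # Adding any finite int leaves infinity unchanged.
        if isinstance(other, int):
            return self
        return NotImplemented

    __radd__ = __add__

    def __gt__(self, other):
        # Greater than every finite int (also used for reflected `int < INT_OO`).
        return isinstance(other, int)

    def __lt__(self, other):
        return False

    def __repr__(self):
        return "int_oo"


INT_OO = IntInfinity()

# A concrete bound grows past the 64-bit range after one doubling.
concrete_upper = sys.maxsize - 1
doubled_concrete = concrete_upper * 2

# The symbolic bound stays the same object under the same operation.
doubled_symbolic = INT_OO * 2
```

With a concrete bound, every size-increasing step yields a new, larger Python integer; with the symbolic bound, the result is the same object, so no arbitrary precision arithmetic is needed.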
Shivam Raikundalia	d3b8230639	Fix profiler_kineto Clang errors (#128464 ) Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense. Test Plan: Make sure all on Phabricator all tests under static_tests/lint_root pass Differential Revision: D58431005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464 Approved by: https://github.com/aaronenyeshi	2024-06-13 03:10:50 +00:00
PyTorch MergeBot	d630e1e838	Revert "[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 )" This reverts commit f2d7f235a684c593f5a1ff2ca0b47b47274bfe85. Reverted https://github.com/pytorch/pytorch/pull/128269 on behalf of https://github.com/anijain2305 due to incorrect ([comment](https://github.com/pytorch/pytorch/pull/128269#issuecomment-2164267320))	2024-06-13 03:04:26 +00:00
Jing Xu	7fe9ab9ccc	update amp example to device-agnostic (#127278 ) As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic. Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars	2024-06-13 02:01:16 +00:00
cyy	3f9b8446cf	[8/N] Remove unused functions (#128499 ) Follows #128407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499 Approved by: https://github.com/malfet	2024-06-13 01:15:11 +00:00
Xu Han	ede74940a1	optimize vec isa check dispatch logical. (#128320 ) Optimize the CPU vec ISA check dispatch by architecture, which makes the code easier to read and maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128320 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-13 01:06:34 +00:00
Yidi Wu	c1cd946818	[cond] add a set_ and data mutation expected failure test (#128457 ) A follow up of the discussion in https://github.com/pytorch/pytorch/pull/126936. Cond errors out early because of a graph break triggered by DelayGraphBreakVariable, which is created due to `aten.set_` [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/tensor.py#L366-L376). We might need to see what happened to this test if we allow graph break in higher order op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128457 Approved by: https://github.com/zou3519	2024-06-13 00:16:59 +00:00
soulitzer	c472cec565	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to error if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? - we either want to save all or recompute all random tensors) Tensor object preservation - We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors, which care about the object identity of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. 
The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is a documented part of the public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx is an object encapsulating info that may be useful to the user; it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. "bc-breaking" for existing users of the private API: - Existing policy functions must now change their return value to use the Enum. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-12 23:57:33 +00:00
pradeepfn	25b7537a27	doc comment typo fixes and improvements (#128512 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128512 Approved by: https://github.com/LucasLLC	2024-06-12 23:55:09 +00:00
Huamin Li	eb1db6702f	[2nd try][AOTI] Switch to use shim v2 (#128521 ) Test Plan: Sandcastle Differential Revision: D58470269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128521 Approved by: https://github.com/desertfire	2024-06-12 23:44:24 +00:00
Andrey Talman	4423e1bbdc	[release] Increase version 2.4.0->2.5.0 (#128514 ) Same as https://github.com/pytorch/pytorch/pull/121974 Branch cut for 2.4.0 completed hence advance main version to 2.5.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128514 Approved by: https://github.com/malfet	2024-06-12 23:40:01 +00:00
Angela Yi	3bc2004f91	[ts_converter] Fix prim::dtype (#128517 ) Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This causes some control flow to behave incorrectly, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch. Test Plan: 7/22 internal models are now convertible and runnable in eager and sigmoid! P1410243909 Reviewed By: jiashenC Differential Revision: D58469232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517 Approved by: https://github.com/jiashenC	2024-06-12 23:02:50 +00:00
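The type mismatch described in the prim::dtype commit above can be reproduced without PyTorch. The sketch below is only an illustration: the `DType` enum stands in for `torch.dtype` objects, the functions `prim_dtype_old` and `prim_dtype_fixed` are hypothetical names, and the code table lists only the values the commit message mentions (3, 5, 7) plus 6 for float32, which is an assumption based on the enum order in ScalarType.h.

```python
from enum import Enum


class DType(Enum):
    """Stand-in for torch.dtype objects (not the real PyTorch type)."""

    INT32 = "int32"
    FLOAT16 = "float16"
    FLOAT32 = "float32"
    FLOAT64 = "float64"


# Integer codes: 3, 5, 7 come from the commit message; 6 is an assumption.
SCALAR_TYPE_CODE = {
    DType.INT32: 3,
    DType.FLOAT16: 5,
    DType.FLOAT32: 6,
    DType.FLOAT64: 7,
}


def prim_dtype_old(dtype):
    # Old conversion: hands back the dtype object itself.
    return dtype


def prim_dtype_fixed(dtype):
    # Fixed conversion: returns the integer code, matching `(Tensor a) -> int`.
    return SCALAR_TYPE_CODE[dtype]


# The membership check from the commit message, under both conversions.
old_result = prim_dtype_old(DType.FLOAT64) in [3, 5, 7]
fixed_result = prim_dtype_fixed(DType.FLOAT64) in [3, 5, 7]
```

Under the old conversion the check is False for every dtype, because a dtype object never compares equal to an integer; under the fixed conversion the check depends on the actual dtype.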
Edward Z. Yang	2fa6f80b13	Perform reciprocal optimization with foreach_div (#128433 ) Fixes https://github.com/pytorch/pytorch/issues/114165 Internal xref https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433 Approved by: https://github.com/awgu	2024-06-12 22:57:03 +00:00
Shaz Qadeer	8db4a41973	Use computeStorageNbytesContiguous if possible (#128515 ) ```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515 Approved by: https://github.com/ezyang	2024-06-12 22:53:06 +00:00
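The dispatch described in the computeStorageNbytesContiguous commit above can be sketched in plain Python. This is a simplified model and not the ATen code: the function names are hypothetical, overflow checks are omitted, and the formulas are my reading of what the two helpers compute. The point it illustrates is that the contiguous path needs only the element count, while the general path inspects each size and stride.

```python
import math


def contiguous_strides(sizes):
    """Row-major strides for the given sizes."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= max(size, 1)
    return tuple(reversed(strides))


def storage_nbytes_contiguous(sizes, itemsize, storage_offset=0):
    # Only the total element count is needed.
    return itemsize * (storage_offset + math.prod(sizes))


def storage_nbytes_strided(sizes, strides, itemsize, storage_offset=0):
    # Branches on the value of each size, which is data-dependent
    # when sizes are symbolic.
    if any(size == 0 for size in sizes):
        return 0
    span = 1 + sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return itemsize * (storage_offset + span)


def storage_nbytes(sizes, strides, itemsize, storage_offset=0):
    # Take the cheaper path when the layout is known to be contiguous.
    if tuple(strides) == contiguous_strides(sizes):
        return storage_nbytes_contiguous(sizes, itemsize, storage_offset)
    return storage_nbytes_strided(sizes, strides, itemsize, storage_offset)
```

For example, a 2x3 float32 tensor with strides (3, 1) takes the contiguous path, while the same shape with strides (6, 2) falls back to the strided computation.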
Prachi Gupta	e2610240f9	[ROCm] Enable several inductor UTs (#127761 ) Fixes #ISSUE_NUMBER Needs https://github.com/pytorch/pytorch/pull/125396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761 Approved by: https://github.com/peterbell10, https://github.com/pruthvistony	2024-06-12 22:47:45 +00:00
Joel Schlosser	bb3cf8a339	Lift inductor lowerings for jagged <-> padded dense kernels (#125968 ) This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops: * `_jagged_to_padded_dense_forward()` * `_padded_dense_to_jagged_forward()` * NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968 Approved by: https://github.com/davidberard98	2024-06-12 22:46:09 +00:00
Sam Larsen	b4a7b543e5	Add targeted unit tests for guards-related functions used in the codecache (#128482 ) Summary: Add a few unit tests that exercise `produce_guards_expression` and `evaluate_guards_expression` (and specifically "ToFloat" "FloatTrueDiv" added in https://github.com/pytorch/pytorch/pull/128418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128482 Approved by: https://github.com/ezyang ghstack dependencies: #128418	2024-06-12 22:41:50 +00:00
Wang, Eikan	1f302d6885	Support aten operations with out tensor (#124926 ) This PR intends to support the aten operations with the `out` tensor. Currently, the AOT compile always does NOT keep input tensor mutations. According to the comments, this is because it has not encountered such a use case. > For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to. However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph. Take `clamp` as an example as follows. ```python out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0) inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0) min_tensor = inp_tensor - 0.05 max_tensor = inp_tensor + 0.05 torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor) ``` W/O this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None return (clamp_max, clamp_max) ``` W/ this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); 
clamp_min = arg2_1 = None copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None return (copy_,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi	2024-06-12 22:31:59 +00:00
Shengbao Zheng	f4edd67fe7	[c10d] fix OSS commSplit bug (#128459 ) Summary: D56907877 modified OSS commSplit. However, commSplit requires every rank to call it, even ranks that are no-color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code can throw an error like `NCCL WARN CommUserRank : comm argument is NULL`. Revert this change from D56907877. Test Plan: CI Differential Revision: D58436088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459 Approved by: https://github.com/shuqiangzhang	2024-06-12 22:29:01 +00:00
rzou	f39ab8a0fe	Fix side effect pruning (#128028 ) Summary: The previous side effect pruning algorithm would keep many dead cell variables alive. For example, in https://github.com/pytorch/pytorch/issues/125078, the compiled function has one return but there were three in the Dynamo graph due to two dead cell variables not being pruned away. This PR adds a corrected algorithm. "new cell variables" are alive if they can be reached from one of the following: 1. any of the tx.symbolic_locals or tx.stack (that is, if they are involved in a return from the function or intermediate variable during a graph break). Example: an alive NestedUserFunctionVariable 2. "mutations to pre-existing objects". Example: appending a NestedUserFunctionVariable to a global list The new algorithm reflects this, but please let me know if there are more cases to handle. Test Plan: - existing tests (afaict, test/dynamo/test_python_autograd is the best SideEffects test case we have) - see in test/dynamo/test_higher_order_ops that the expecttests changed -- the functorch dynamo graphs no longer return dead cellvars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028 Approved by: https://github.com/jansel	2024-06-12 22:25:37 +00:00
cyy	3008644297	[Caffe2] Remove remaining unused perfkernels (#128477 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-12 22:19:36 +00:00
Sam Larsen	55a6b38f52	[inductor] enable fx graph cache on torchbench (#128239 ) Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239 Approved by: https://github.com/oulgen	2024-06-12 22:15:02 +00:00
Huy Do	6206da55ef	Fix lint after #119459 (#128558 ) TSIA Pull Request resolved: https://github.com/pytorch/pytorch/pull/128558 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet	2024-06-12 22:11:37 +00:00
Animesh Jain	2b28b107db	[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128453 Approved by: https://github.com/yf225 ghstack dependencies: #126578, #128440, #128470	2024-06-12 22:03:45 +00:00
James Wu	6aef2052ea	Save backward graphs lazily to cache (#126999 ) This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999 Approved by: https://github.com/bdhirsh ghstack dependencies: #126791	2024-06-12 21:58:34 +00:00
rzou	87072dcfdb	Change Dynamo's custom ops warning message to be less spammy (#128456 ) This is a short-term fix (for 2.4). In the longer term we should fix https://github.com/pytorch/pytorch/issues/128430 The problem is that warnings.warn calls inside Dynamo print every time. Python warnings are supposed to print once, unless their cache is reset: Dynamo ends up resetting that cache every time it runs. As a workaround we provide our own warn_once cache that is keyed on the warning msg. I am not worried about this increasing memory usage because that's effectively what python's warnings.warn cache does. Test Plan: - fix tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456 Approved by: https://github.com/anijain2305	2024-06-12 21:57:12 +00:00
haozhe.zhu	c53d65b3d3	[inductor] fix linear add bias pattern (#128473 ) Fix https://github.com/pytorch/pytorch/issues/128287. Previously the assertions in `linear_add_bias` were brittle ``` assert packed_weight_node.name == "_reorder_linear_weight" assert transpose_weight_node.name == "permute_default" ``` because the `name` can be changed to `_reorder_linear_weight_id, permute_default_id` if we have more than 1 reorder/permute. Checking `target` instead of `name` solves this issue. The UT is also updated to match more than 1 `linear_add_bias` pattern to cover this case. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473 Approved by: https://github.com/jgong5	2024-06-12 21:55:35 +00:00
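The name-versus-target problem from the linear add bias commit above can be shown with a toy graph node. This is a sketch under stated assumptions: the `Node` dataclass is a simplified stand-in for a graph node (real graph nodes carry a callable target, not a string), and the `_1` suffix is only an example of how a graph might make a repeated name unique.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str    # unique within a graph, so repeats receive a suffix
    target: str  # the operation being called; identical for all such nodes


def matches_by_name(node):
    # Brittle check: only the first occurrence keeps the bare name.
    return node.name == "_reorder_linear_weight"


def matches_by_target(node):
    # Robust check: every occurrence shares the same target.
    return node.target == "_reorder_linear_weight"


# Two reorder nodes in one graph; the second gets a suffixed name.
first = Node(name="_reorder_linear_weight", target="_reorder_linear_weight")
second = Node(name="_reorder_linear_weight_1", target="_reorder_linear_weight")

by_name = [matches_by_name(n) for n in (first, second)]
by_target = [matches_by_target(n) for n in (first, second)]
```

The name check accepts only the first node, so an assertion built on it fails as soon as a second matching pattern appears in the same graph, while the target check accepts both.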
Kurman Karabukaev	bb13fad7aa	Share TCPStore by default when using c10d rdzv handler (#128096 ) Summary: A number of features rely on TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, and this can create a race condition when rank0 exits (on error or graceful exit) and any other ranks reading/writing will fail. Solution: TCPStore server should outlive all the trainer processes. By moving the ownership of TCPStore to the torchelastic agent it naturally fixes the lifecycle of the server. Static rendezvous in torchelastic does already support sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler. Any handler that would like to manage the TCPStore has to: - Return true on `use_agent_store` property - `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call) Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change) Test Plan: `cat ~/workspace/dist-demo/stores.py` ~~~ import torch import logging import sys import torch.distributed as dist import torch import os import time logger = logging.getLogger(__name__) logger.addHandler(logging.StreamHandler(sys.stderr)) logger.setLevel(logging.INFO) def _run_test(store): if dist.get_rank() == 1: logger.info("Rank %s is sleeping", dist.get_rank()) time.sleep(5) key = "lookup_key" logger.info("Checking key %s in store on rank %s", key, dist.get_rank()) store.check([key]) else: logger.info("rank %s done", dist.get_rank()) def main() -> None: use_gpu = torch.cuda.is_available() dist.init_process_group(backend="nccl" if use_gpu else "gloo") dist.barrier() logger.info(f"Hello World from rank {dist.get_rank()}") host = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] world_size = os.environ['WORLD_SIZE'] logger.info("testing 
TCPStore") store = dist.TCPStore( host_name=host, port=int(port), world_size=int(world_size), ) _run_test(store) if __name__ == "__main__": main() ~~~ With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option) ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: *************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ************************************* Hello World from rank 1 Hello World from rank 2 Hello World from rank 0 testing TCPStore testing TCPStore testing TCPStore rank 2 done Rank 1 is sleeping rank 0 done Checking key lookup_key in store on rank 1 ~~~ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: ************************************* Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*************************************** Hello World from rank 0 Hello World from rank 2 Hello World from rank 1 testing TCPStore testing TCPStore testing TCPStore rank 0 done rank 2 done Rank 1 is sleeping Checking key lookup_key in store on rank 1 [rank1]: Traceback (most recent call last): [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module> [rank1]: main() [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main [rank1]: _run_test(store) [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test [rank1]: store.check([key]) [rank1]: torch.distributed.DistNetworkError: Connection reset by peer E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python Traceback (most recent call last): File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module> main() File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main run(args) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run elastic_launch( File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ 
/home/kurman/workspace/dist-demo/stores.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-06-05_17:40:22 host : devgpu011.cln5.facebook.com rank : 1 (local_rank: 1) exitcode : 1 (pid: 2279237) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ ~~~ Differential Revision: D58180193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096 Approved by: https://github.com/shuqiangzhang	2024-06-12 21:49:42 +00:00
Michael Lazos	c0ea8fc3a3	Disable inlining nn modules on static inputs tests (#128529 ) With inlining NN modules, these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529 Approved by: https://github.com/anijain2305 ghstack dependencies: #128528	2024-06-12 21:40:29 +00:00
Michael Lazos	ff3ba99320	Disable inline nn modules on unstable ptr test (#128528 ) With inlining NN modules, these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528 Approved by: https://github.com/anijain2305	2024-06-12 21:40:29 +00:00
Andrea Frittoli	1026b7cfbe	Add docstring for the torch.typename function (#128129 ) Fixes: #127885 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128129 Approved by: https://github.com/malfet	2024-06-12 21:34:20 +00:00
Aaron Orenstein	cba840fde9	Fix accidental variable shadow (#128460 ) Fixes #128322 We should probably crank up clang's warning levels... Test: ``` import torch def addmv_slice(input, mat, vec, slice_op): vec = vec[slice_op] res = torch.addmv(input, mat, vec) # traced line: 25 return res torch._dynamo.reset() model_opt = torch.compile(addmv_slice) input = torch.empty(size=[11]).uniform_(-1, 1) mat = torch.empty([11, 128]).uniform_(-10.0, 20.0) vec = torch.empty([256]).uniform_(-10.0, 20.0) slice_op = slice(None, None, 2) out = model_opt(input, mat, vec, slice_op) vec = torch.empty([384]).uniform_(-10.0, 20.0) slice_op = slice(None, None, 3) out = model_opt(input, mat, vec, slice_op) ``` Before this change, the test fails with: ``` torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function getitem>(*(FakeTensor(..., size=(s0,)), slice(None, None, s1)), **{}): slice step cannot be zero ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128460 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 21:14:04 +00:00
Zhengxu Chen	0444e89931	[export] Remove replace_sym_size_ops_pass (#128443 ) Summary: Not needed anymore. Test Plan: CI Differential Revision: D58429458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128443 Approved by: https://github.com/angelayi	2024-06-12 21:03:06 +00:00
Joel Schlosser	67e6c76a18	Support apply_(callable) sugar for CPU NJTs (#125416 ) Example: ```python nt.apply_(lambda x: x * 2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125416 Approved by: https://github.com/soulitzer	2024-06-12 20:30:57 +00:00
Xuehai Pan	dd143d44cc	[BE] enable UFMT for top-level files `torch/*.py` (#127707 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127707 Approved by: https://github.com/ezyang	2024-06-12 20:15:05 +00:00
James Wu	cc231a8e2b	First version of AOTAutogradCache (#126791 ) This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry. Each AOTAutogradCacheEntry has: - A CompiledForward and optionally a CompiledBackward - A bunch of metadata. CompiledForward and CompiledBackward each save the key to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit. On cache miss: - Run AOTAutograd, up to AOTAutogradDispatch.post_compile. - Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we always compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario. - Return the resulting object On cache hit: - Run AOTAutogradCacheEntry.post_compile() on the cache key. - This attempts to load the forward and backward graphs from FXGraphCache - As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata. For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off. V0 Guards behavior: FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does not mean that the tensor values passed in are the same: only that their symints are). 
That is, AOTAutograd and Inductor never create new guards based on symints with different sources than those passed to it by inductor. We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff. Testing: We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791 Approved by: https://github.com/bdhirsh	2024-06-12 20:04:44 +00:00
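The miss/hit flow described in the commit above can be sketched as a two-level cache. This is a simplified illustration under stated assumptions, not the AOTAutogradCache implementation: `CacheEntry`, `inner_cache`, `outer_cache`, `inner_compile`, and `load_or_compile` are hypothetical names, the "compiled graph" is just a closure, and guards, backward graphs, and post-compile wrappers are left out.

```python
import hashlib
import pickle
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class CacheEntry:
    """Outer entry: stores keys into the inner cache plus metadata."""
    forward_key: str
    backward_key: Optional[str]
    metadata: Dict[str, Any]


inner_cache: Dict[str, Callable[[int], int]] = {}  # stands in for FXGraphCache
outer_cache: Dict[str, CacheEntry] = {}            # stands in for AOTAutogradCache
stats = {"hit": 0, "miss": 0}


def cache_key(inputs: Any) -> str:
    """Derive a stable key from the (picklable) inputs."""
    return hashlib.sha256(pickle.dumps(inputs)).hexdigest()


def inner_compile(scale: int) -> str:
    """'Compile' a graph into the inner cache and return its key."""
    key = f"fx-{scale}"
    inner_cache.setdefault(key, lambda x: x * scale)
    return key


def load_or_compile(scale: int) -> Callable[[int], int]:
    key = cache_key(("scale", scale))
    entry = outer_cache.get(key)
    if entry is not None:
        # It only counts as a hit if the inner cache can still serve the graph.
        compiled = inner_cache.get(entry.forward_key)
        if compiled is not None:
            stats["hit"] += 1
            return compiled
    # Miss: compile, then save an entry that records the inner cache key.
    stats["miss"] += 1
    forward_key = inner_compile(scale)
    outer_cache[key] = CacheEntry(forward_key, None, {"scale": scale})
    return inner_cache[forward_key]


first = load_or_compile(3)   # miss: compiles and saves an entry
second = load_or_compile(3)  # hit: reloads the same object via the saved key
```

The outer entry deliberately stores only a key, mirroring the description that each compiled forward/backward saves the key of its FXGraphCache entry rather than the compiled artifact itself.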
Wanchao Liang	7775fee10f	[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 ) As titled, this PR refactors the PrepareModuleInput style to have a common method, prepare_input_arg, allowing both args and kwargs to reuse this logic. This also fixes https://github.com/pytorch/pytorch/issues/128365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431 Approved by: https://github.com/awgu	2024-06-12 19:16:33 +00:00
Joel Schlosser	ec1fdda196	Fix jagged NT softmax semantics (#119459 ) Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong) After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459 Approved by: https://github.com/soulitzer	2024-06-12 19:12:03 +00:00
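The `dim` adjustment mentioned in the commit above can be illustrated with a small helper. This is a sketch of one plausible rule, not the code from the PR: `to_values_dim` is a hypothetical name, and it assumes a jagged layout where the batch dim and the ragged dim of the outer NT are packed into the first dim of `_values`, so the values buffer has one dimension fewer.

```python
def to_values_dim(dim: int, outer_ndim: int) -> int:
    """Map a dim of the outer nested tensor to the matching dim of its values.

    Assumption: an outer shape (B, j, D, ...) is stored as values of shape
    (sum(j), D, ...), so every dim after the ragged one shifts left by one.
    """
    if not -outer_ndim <= dim < outer_ndim:
        raise IndexError(f"dim {dim} out of range for {outer_ndim} dims")
    wrapped = dim % outer_ndim  # normalize negative dims against the OUTER ndim
    if wrapped < 2:
        # The batch and ragged dims have no dim of their own in the values buffer.
        raise ValueError("reduction over the batch or ragged dim is not covered by this sketch")
    return wrapped - 1


# For an outer shape (B, j, D): the last dim maps to dim 1 of the (sum(j), D) values.
last_dim_in_values = to_values_dim(-1, outer_ndim=3)
```

Normalizing against the outer dimensionality first is the important step: passing a positive `dim` straight through to the values buffer would address the wrong axis.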
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit 4c971932e839fc5da2b91906ad028d4654932bca. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
DanilBaibak	6d1b1ddd3e	Select Runner Label Dynamically (#127287 ) Updated `get_workflow_type.py` logic to dynamically select a prefix for the runner label. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127287 Approved by: https://github.com/ZainRizvi	2024-06-12 18:47:47 +00:00
PyTorch MergeBot	7db501ba2b	Revert "[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350 )" This reverts commit 45dccfddcd8fce804f50075484421ade27f1f021. Reverted https://github.com/pytorch/pytorch/pull/128350 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128350#issuecomment-2163669538))	2024-06-12 18:35:18 +00:00
mori360	d71f92213c	[DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004 ) Fixes #126950 `ptd_state_dict` with `broadcast_from_rank0=False` might miss 2 condition checks in `set_optimizer_state_dict`. Here we add another condition, `full_state_dict=True`, with the corresponding tensor distribution without broadcasting if `broadcast_from_rank0=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004 Approved by: https://github.com/fegin	2024-06-12 18:14:56 +00:00
Tharindu Patabandi	624e8ae491	Documentation for is_dependent function (#128197 ) Docstring for torch.distributions.constraints.is_dependent Fixes #127900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128197 Approved by: https://github.com/fritzo, https://github.com/malfet	2024-06-12 17:50:41 +00:00
Shashank Shekhar	a70a7337d2	Update torch.nanmean() docstring to mention input dtype requirement (#128155 ) Fixes #120570 ## Description Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex. Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated. ## Checklist - [X] The issue that is being fixed is referred in the description. - [X] Only one issue is addressed in this pull request. - [x] Labels from the issue that this PR is fixing are added to this pull request. - [X] No unnecessary issues are included into this pull request. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155 Approved by: https://github.com/malfet	2024-06-12 17:46:36 +00:00
anandptl84	0f52dc7e51	Document `torch.cuda.profiler.stop` (#128196 ) Fixes #127918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128196 Approved by: https://github.com/malfet, https://github.com/eqy	2024-06-12 17:39:43 +00:00
PyTorch MergeBot	5001f41b90	Revert "Make TraceUtils.h to be device-agnostic (#126969 )" This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5. Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))	2024-06-12 16:32:57 +00:00
PyTorch MergeBot	f89574fa23	Revert "Pass params to dump_nccl_trace_pickle (#128307 )" This reverts commit eb567b1f40233667b982f81e3a75deec0fdfd9ca. Reverted https://github.com/pytorch/pytorch/pull/128307 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert 126969 ([comment](https://github.com/pytorch/pytorch/pull/128307#issuecomment-2163459399))	2024-06-12 16:29:51 +00:00
PyTorch MergeBot	81e4e12f02	Revert "Support aten operations with out tensor (#124926 )" This reverts commit cba195c8edd6c7149036ef0767772d11fff5390e. Reverted https://github.com/pytorch/pytorch/pull/124926 on behalf of https://github.com/clee2000 due to newly added test broke in internal D58444103. Test passed in OSS CI though ([comment](https://github.com/pytorch/pytorch/pull/124926#issuecomment-2163441547))	2024-06-12 16:20:04 +00:00
PyTorch MergeBot	c5172b8de8	Revert "[AOTI] Switch to use shim v2 (#127674 )" This reverts commit 9a38cae299e5ffd8143182bec878c28f96cfd72a. Reverted https://github.com/pytorch/pytorch/pull/127674 on behalf of https://github.com/clee2000 due to tests failed internally D56709309 ([comment](https://github.com/pytorch/pytorch/pull/127674#issuecomment-2163436728))	2024-06-12 16:17:07 +00:00
Xu Han	9e39c62908	correct avx512_vnni isa name. (#128318 ) `x86` currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`. This PR corrects the function name to `avx512_vnni`. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-06-12 16:12:49 +00:00
PyTorch MergeBot	f2dcbe89d6	Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815 )" This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d. Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153. Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))	2024-06-12 16:09:22 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds a way for users to query the recommended max working set size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print ("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
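The unit conversion in the usage snippet above divides a byte count by 1024 three times. A minimal sketch of that arithmetic follows; `bytes_to_gib` and `example_bytes` are illustrative names that are not part of the commit, and the example value is made up so the block runs without an MPS device.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (GiB)."""
    return num_bytes / (1024 * 1024 * 1024)


# On a Mac with Metal, the byte count would come from
# torch.mps.recommended_max_memory(), as in the commit message.
example_bytes = 16 * 1024**3  # made-up value for illustration
print("Recommended Max Memory :", bytes_to_gib(example_bytes), "GB")
```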
Jeff Daily	b19c2319e4	[ROCm] TunableOp for gemm_and_bias (#128143 ) Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap. This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked. CC @xw285cornell Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143 Approved by: https://github.com/Skylion007	2024-06-12 15:53:39 +00:00
Aaron Orenstein	3c971d2ef3	Flip default value for mypy disallow_untyped_defs [final] (#127836 ) Not requiring all functions to have types allows a lot of 'Any' types to slip in, which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types. The preceding stack of PRs (cut up simply to keep the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts, it will probably be necessary to make several passes before landing this final PR, which turns the option on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 15:28:42 +00:00
PyTorch MergeBot	15ab636007	Revert "Fix side effect pruning (#128028 )" This reverts commit a55d0d9718c11eb2897423c78eff18b168dd0a06. Reverted https://github.com/pytorch/pytorch/pull/128028 on behalf of https://github.com/clee2000 due to broke test in internal D58443816. Test exists in external too though ([comment](https://github.com/pytorch/pytorch/pull/128028#issuecomment-2163249251))	2024-06-12 14:55:57 +00:00
Wu, Chunyuan	5ef70faaa7	Revert "Make torch_geometric models compatible with export (#123403 )" (#128377 ) This reverts commit d78991a7381adb3df5e9b63c365db4506643edce. This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377 Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire	2024-06-12 14:53:01 +00:00
PyTorch MergeBot	71f491554c	Revert "First version of AOTAutogradCache (#126791 )" This reverts commit abc3eec22d38079bee855fbcb75da62a9558284c. Reverted https://github.com/pytorch/pytorch/pull/126791 on behalf of https://github.com/DanilBaibak due to The changes broke a number of linux jobs ([comment](https://github.com/pytorch/pytorch/pull/126791#issuecomment-2163081643))	2024-06-12 13:59:29 +00:00
James Wu	abc3eec22d	First version of AOTAutogradCache (#126791 ) This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry. Each AOTAutogradCacheEntry has: - A CompiledForward and optionally a CompiledBackward - A bunch of metadata. CompiledForward and CompiledBackward each save the key to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit. On cache miss: - Run AOTAutograd, up to AOTAutogradDispatch.post_compile. - Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we always compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario. - Return the resulting object On cache hit: - Run AOTAutogradCacheEntry.post_compile() on the cache key. - This attempts to load the forward and backward graphs from FXGraphCache - As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata. For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off. V0 Guards behavior: FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does not mean that the tensor values passed in are the same: only that their symints are). 
That is, AOTAutograd and Inductor never create new guards based on symints with different sources than those passed to it by inductor. We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff. Testing: We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791 Approved by: https://github.com/bdhirsh	2024-06-12 13:44:30 +00:00
Xia, Weiwen	2e065f2486	[Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#127592 ) Fixes #127402 - Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py - Add checks of mutation for QLinearPointwiseBinaryPT2E Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592 Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee	2024-06-12 10:49:16 +00:00
Xuehai Pan	46a35a1ed4	[BE] enable UFMT for `torch/__init__.py` (#127710 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127710 Approved by: https://github.com/ezyang ghstack dependencies: #127703, #127708, #127709	2024-06-12 10:40:23 +00:00
Xuehai Pan	26433b86de	[BE][Easy] sort `__all__` in `torch/__init__.py` (#127709 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127709 Approved by: https://github.com/ezyang ghstack dependencies: #127703, #127708	2024-06-12 10:21:36 +00:00
Tom Ritchford	2386045e4f	Add OpInfo entry for alias_copy (#127232 ) (#128142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142 Approved by: https://github.com/lezcano	2024-06-12 09:39:58 +00:00
Jiong Gong	1edcb31d34	[RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 (#128472 ) reland for https://github.com/pytorch/pytorch/pull/126068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128472 Approved by: https://github.com/desertfire	2024-06-12 08:37:16 +00:00
Animesh Jain	ebb00a92bd	[dynamo] Skip freezing expect failure for inlining inbuilt nn modules (#128470 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128470 Approved by: https://github.com/mlazos ghstack dependencies: #126578, #128440	2024-06-12 08:21:50 +00:00
Animesh Jain	1602c7d0c8	[dynamo] Enable some inlining inbuilt nn module tests (#128440 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #126578	2024-06-12 08:21:50 +00:00
Xuehai Pan	04037f3d22	[BE] sort imports in `torch/__init__.py` (#127708 ) ---- - Sort import via `usort` - Change relative import `from . import xxx` to absolute import `from torch import xxx` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127708 Approved by: https://github.com/ezyang ghstack dependencies: #127703	2024-06-12 08:03:54 +00:00
Eddie Yan	0b331fd5d7	[CUDA] Abate `SoftMax.cu` compiler warning spam (#128468 ) Avoids excessively spammy warnings such as ``` pytorch/aten/src/ATen/native/cuda/SoftMax.cu(844): warning #191-D: type qualifier is meaningless on cast type [&] { const auto& the_type = input.scalar_type(); constexpr const char* at_dispatch_name = "host_softmax"; at::ScalarType _st = ::detail::scalar_type(the_type); ; switch (_st) { case at::ScalarType::Double: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Double)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Double), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Double>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = 
(at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= 
(!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Float: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Float)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? 
If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Float), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Float>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), 
"/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Half: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Half)) { do { 
::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Half), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Half>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, 
input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); 
c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::BFloat16: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::BFloat16)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::BFloat16), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::BFloat16>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const 
uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, 
scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } default: do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str('"', at_dispatch_name, "\" not implemented for '", toString(_st), "'")))); }; } while (false); } }() ``` and ``` SoftMax.cu:844: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘long unsigned int’ [-Wsign-compare] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128468 Approved by: https://github.com/valentinandrei	2024-06-12 07:47:14 +00:00
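The expanded dispatch code in the commit message above walks `outer_size` rows in chunks of `(1 << 30) / dim_size` rows per launch. A minimal Python sketch of that chunking arithmetic follows; `chunk_ranges` and its `max_elements` parameter are hypothetical names used only for illustration and are not part of PyTorch.

```python
def chunk_ranges(outer_size, dim_size, max_elements=1 << 30):
    """Yield (row_offset, rows) pairs so that each chunk covers at most
    max_elements elements. Hypothetical helper mirroring the loop shape
    shown in the commit message, not PyTorch code."""
    chunk_size = max_elements // dim_size  # rows per chunk
    if chunk_size < 1:
        raise ValueError("dim_size exceeds max_elements")
    offset = 0
    remaining = outer_size
    while remaining > 0:
        rows = min(remaining, chunk_size)
        yield offset, rows
        offset += rows
        remaining -= rows


if __name__ == "__main__":
    # Small numbers so the chunk boundaries are easy to see.
    print(list(chunk_ranges(outer_size=10, dim_size=4, max_elements=16)))
```

In the expanded code this loop runs only when `dim_size <= 1024`, so the real chunk size is at least 2^20 rows.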
Sam Larsen	8b3daf1768	Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418 ) Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this: ``` Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ... ``` It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any? Test Plan: `TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418 Approved by: https://github.com/ezyang	2024-06-12 06:26:43 +00:00
PyTorch MergeBot	a421699998	Revert "[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 )" This reverts commit 089f9a116ac8b2c14d6351b52614b529caba126b. Reverted https://github.com/pytorch/pytorch/pull/128431 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Your changes broke the linter. Here you can find more details - `089f9a116a` ([comment](https://github.com/pytorch/pytorch/pull/128431#issuecomment-2162197858))	2024-06-12 06:25:53 +00:00
Xuehai Pan	dcc0093dba	[BE][Easy] export explicitly imported public submodules (#127703 ) Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703 Approved by: https://github.com/ezyang	2024-06-12 05:52:18 +00:00
diwei sun	62311257ad	Add 1 test case for Convtranspose1D in op microbenchmark (#127216 ) Operator ConvTranspose1d suffers a performance regression with a specific shape, #120982. This PR adds that shape to the op-level benchmark. I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary: Hardware info: Intel SPR8480-56cores per socket with frequency=2.1G. Performance comparison between torch 1.13 vs. torch 2.2 Benchmarking PyTorch1.13: ConvTranspose1d Mode: Eager Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu Forward Execution Time (s) : 0.96s Benchmarking PyTorch2.2: ConvTranspose1d Mode: Eager Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu Forward Execution Time (s) : 7.988s Also benchmarking for 7 rounds to check the variance. \| Round1 \| Round2 \| Round3 \| Round4 \| Round5 \| Round6 \| Round7 \| Normalized Variance -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Pytorch1.13 \| 0.971 \| 0.972 \| 0.969 \| 0.970 \| 0.972 \| 0.970 \| 0.971 \| 0.0002% Pytorch 2.2 \| 8.064 \| 8.053 \| 8.027 \| 7.927 \| 7.971 \| 7.929 \| 7.902 \| 0.0059% Ratio v2.2 vs. v1.13(Lower is better) \| 8.31 \| 8.28 \| 8.29 \| 8.18 \| 8.20 \| 8.18 \| 8.14 \| Reproduce script: numactl -N 0 python -m pt.conv_test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman	2024-06-12 05:33:54 +00:00
Wanchao Liang	089f9a116a	[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 ) as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic This also fixes https://github.com/pytorch/pytorch/issues/128365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431 Approved by: https://github.com/awgu	2024-06-12 05:22:24 +00:00
Natalia Gimelshein	77a0ca66e4	Add threadfence to 2-stage reduction for correct writes visibility (#128455 ) Final block accumulating 2-stage reduction result has to complete acquire pattern to make sure the writes of all other blocks are visible to it, see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=atom#release-and-acquire-patterns Pull Request resolved: https://github.com/pytorch/pytorch/pull/128455 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-12 04:13:36 +00:00
Animesh Jain	c0b87afcad	[RELAND2][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Fixes https://github.com/pytorch/pytorch/issues/111837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel	2024-06-12 04:09:23 +00:00
loganthomas	02e7519ac3	DOC: strip inaccurate either float32 or float64 statement from set_default_type (#128192 ) Fixes #126647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128192 Approved by: https://github.com/malfet	2024-06-12 03:57:48 +00:00
cyy	8cf302dce4	[5/N] Change static functions in headers to inline (#128406 ) Follows #128286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128406 Approved by: https://github.com/ezyang	2024-06-12 03:25:54 +00:00
Kazuaki Ishizaki	86b5df3e71	Documenting the torch.fx.annotate.annotate function (#128337 ) Fixes #127903 This PR adds docstring to the `torch.fx.annotate.annotate` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128337 Approved by: https://github.com/malfet	2024-06-12 03:06:32 +00:00
Tuan Trieu	7c2058338a	Improve convert fp32 to fp16 fx pass (#127829 ) Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion. Test Plan: ``` buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder ``` Differential Revision: D57803843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829 Approved by: https://github.com/Skylion007	2024-06-12 02:50:37 +00:00
PyTorch MergeBot	3ddec713b8	Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177 )" This reverts commit cac7a22b92478d897488688010e562b7bd36b97f. Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests `cac7a22b92` https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913. Probably a landrace, test ran on the PR and succeed ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))	2024-06-12 02:20:15 +00:00
William Wen	85eeb90d2c	[dynamo] Fix graph breaks related to HF ModelOutput (#127780 ) Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027. Changes: - Support building `CustomizedDictVariable` in `VariableBuilder` (but only for HF `ModelOutput` subclasses) - Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead) - Support side effects for `CustomizedDictVariable` - Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-06-12 02:16:24 +00:00
Sam Larsen	7f6daf289b	[inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376 ) Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run` Differential Revision: D58371264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376 Approved by: https://github.com/eellison	2024-06-12 01:55:53 +00:00
Jiashen Cao	3d55d84ec2	[Fix] Check tensor dtype before using torch.allclose in _trace log (#128438 ) #### Issue `torch.allclose` errors out during logging due to different dtypes. #### Test * `pytest test/test_jit.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438 Approved by: https://github.com/angelayi	2024-06-12 01:52:09 +00:00
Wei Chen	bb2a995529	Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 )" (#128432 ) Summary: Original commit changeset: c7d2e6b13922 Original Phabricator Diff: D57618942 Differential Revision: D58383241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432 Approved by: https://github.com/ezyang, https://github.com/Yuzhen11	2024-06-12 01:34:32 +00:00
cyy	9538bf4e7c	[2/N] Remove inclusion of c10/util/string_utils.h (#128372 ) Follows #128300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372 Approved by: https://github.com/aaronenyeshi	2024-06-12 01:18:20 +00:00
cyy	219da29dfd	[7/N] Remove unused functions (#128407 ) Follows #128309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407 Approved by: https://github.com/ezyang	2024-06-12 01:10:33 +00:00
cyy	fb013ecb24	Remove unused private List::ptr_to_first_element (#128405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405 Approved by: https://github.com/ezyang	2024-06-12 01:07:14 +00:00
Kurman Karabukaev	6af4c6acad	Migrate test to internal base class, fixes (#128367 ) Summary: ## Remove etcd deps Converted tests to a non-etcd-based rdzv handler so that the tests don't depend on an etcd server ## Adopt pytorch test conventions - test starts with `test_TESTS.py` - Test base class is torch.testing._internal.common_utils.TestCase - include __main__ handler ## reduce test timing (used to take > 300 seconds): 3.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic 2.59s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic 2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception 2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path 2.30s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations 2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined 2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched 2.17s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents 2.12s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic 2.08s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations 1.32s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc 1.04s call 
test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash 1.03s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc 0.04s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined 0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception 0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown Test Plan: pytest --durations=0 test/distributed/launcher/run_test.py Differential Revision: D58388182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367 Approved by: https://github.com/d4l3k	2024-06-12 01:03:40 +00:00
Bin Bao	786c24a4cd	[inductor] Always realize sigmoid for CPU (#128339 ) Summary: Currently the CPU backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was repeated many times in an inner loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339 Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10	2024-06-12 00:46:33 +00:00
PyTorch MergeBot	5d8c7f39d4	Revert "Introduce int_oo (#127693 )" This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c. Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))	2024-06-11 23:36:08 +00:00
PyTorch MergeBot	c9c1fed065	Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 )" This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63. Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))	2024-06-11 23:34:03 +00:00
Andrew Hoblitzell	94fea82d66	init sub comment (#128082 ) Fixes #127905 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082 Approved by: https://github.com/titaiwangms	2024-06-11 22:42:35 +00:00
Andrea Frittoli	447173198b	Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139 ) Fixes: #127916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139 Approved by: https://github.com/SherlockNoMad	2024-06-11 22:42:11 +00:00
angelayi	b79d056e76	[export] Fix unflattener for preserving modules containing unused inputs (#128260 ) Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs. This also fixes unflattener issues in D57829276. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260 Approved by: https://github.com/pianpwk	2024-06-11 22:32:08 +00:00
Chirag Pandya	eb567b1f40	Pass params to dump_nccl_trace_pickle (#128307 ) Summary: Pass parameters from request to dump_nccl_trace_pickle handler. The supported parameters + value are all lowercase. includecollectives={true, false} includestacktraces={true, false} onlyactive={true, false} Example post is: /handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true Test Plan: unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307 Approved by: https://github.com/d4l3k ghstack dependencies: #128191	2024-06-11 22:28:53 +00:00
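The three lowercase query parameters described in the commit above can be parsed from a request path as in the following sketch. It uses only the Python standard library. The helper name `parse_dump_flags` and the default of `True` for omitted parameters are assumptions made for illustration, and none of this is the handler's actual implementation.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names as listed in the commit message.
FLAG_NAMES = ("includecollectives", "includestacktraces", "onlyactive")


def parse_dump_flags(url, default=True):
    """Return {flag_name: bool} parsed from the query string of `url`.

    Hypothetical helper: the `default` used for missing parameters is an
    assumption, not documented behavior of the real handler.
    """
    query = parse_qs(urlsplit(url).query)
    flags = {}
    for name in FLAG_NAMES:
        values = query.get(name)
        if values is None:
            flags[name] = default
            continue
        value = values[-1]  # last occurrence wins
        if value not in ("true", "false"):
            raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
        flags[name] = value == "true"
    return flags


if __name__ == "__main__":
    example = ("/handler/dump_nccl_trace_pickle"
               "?includecollectives=true&includestacktraces=false&onlyactive=true")
    print(parse_dump_flags(example))
```

Rejecting any value other than lowercase `true` or `false` follows the commit's note that the supported values are all lowercase.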
Chirag Pandya	1dd2431f86	[Test] Add test for only_active flag (#128191 ) Summary: Add a unit test for the only_active flag to _dump_nccl_trace API call. With this flag, we only expect active records to be returned. Test Plan: Unit test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191 Approved by: https://github.com/d4l3k	2024-06-11 22:26:01 +00:00
Andrew Hoblitzell	5fcb5f0c8b	init reshape_from_tensor_shape comment (#128171 ) Fixes #127897 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171 Approved by: https://github.com/titaiwangms	2024-06-11 21:56:33 +00:00
rzou	a55d0d9718	Fix side effect pruning (#128028 ) Summary: The previous side effect pruning algorithm would keep many dead cell variables alive. For example, in https://github.com/pytorch/pytorch/issues/125078, the compiled function has one return but there were three in the Dynamo graph due to two dead cell variables not being pruned away. This PR adds a corrected algorithm. "new cell variables" are alive if they can be reached from one of the following: 1. any of the tx.symbolic_locals or tx.stack (that is, if they are involved in a return from the function or intermediate variable during a graph break). Example: an alive NestedUserFunctionVariable 2. "mutations to pre-existing objects". Example: appending a NestedUserFunctionVariable to a global list The new algorithm reflects this, but please let me know if there are more cases to handle. Test Plan: - existing tests (afaict, test/dynamo/test_python_autograd is the best SideEffects test case we have) - see in test/dynamo/test_higher_order_ops that the expecttests changed -- the functorch dynamo graphs no longer return dead cellvars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028 Approved by: https://github.com/jansel	2024-06-11 21:40:48 +00:00
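The liveness rule described in the commit above (a new cell variable is alive if it is reachable from the symbolic locals, the stack, or a mutation to a pre-existing object) is a graph reachability question. The sketch below shows that idea on a toy reference graph. The function `live_cells`, the string node names, and the `edges` dictionary are hypothetical and do not correspond to Dynamo's data structures.

```python
from collections import deque


def live_cells(roots, edges, new_cells):
    """Return the subset of `new_cells` reachable from `roots`.

    roots: nodes that are alive by definition (returned values, stack
           entries, mutated pre-existing objects).
    edges: mapping from a node to the nodes it references.
    Toy model only; not the Dynamo implementation.
    """
    seen = set()
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(edges.get(node, ()))
    return seen & set(new_cells)


if __name__ == "__main__":
    edges = {
        "return_value": ["nested_fn"],   # a returned nested function ...
        "nested_fn": ["cell_a"],         # ... that closes over cell_a
        "global_list": ["cell_b"],       # mutation of a pre-existing object
    }
    alive = live_cells(["return_value", "global_list"], edges,
                       {"cell_a", "cell_b", "cell_c"})
    print(sorted(alive))
```

In this toy graph `cell_c` is referenced by nothing reachable from a root, so it would be pruned.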
Andrew Gu	8c1247cffb	[BE] Fixed CPU autocast warning (#127774 ) This PR fixes ``` /data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774 Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l	2024-06-11 21:33:35 +00:00
Will Feng	70a1e85718	[Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856 ) Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856 Approved by: https://github.com/awgu	2024-06-11 20:15:03 +00:00
PyTorch MergeBot	adb699189b	Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 )" This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9. Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084. Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))	2024-06-11 19:41:41 +00:00
eqy	45dccfddcd	[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350 ) CC @vedaanta-nvidia @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350 Approved by: https://github.com/Skylion007	2024-06-11 19:22:21 +00:00
yuqingj	3e09123797	Enable UFMT on test_nestedtensor.py (#128359 ) split it into two PRs since it is more than 2k lines of change Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359 Approved by: https://github.com/davidberard98	2024-06-11 19:14:04 +00:00
BowenBao	61f922c2ca	Fix 'get_real_value' on placeholder nodes (#127698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698 Approved by: https://github.com/jansel ghstack dependencies: #127695, #127696	2024-06-11 18:57:25 +00:00
BowenBao	984b1a8c35	Fix 'get_attr' call in dynamo 'run_node' (#127696 ) Fixes #124858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696 Approved by: https://github.com/jansel ghstack dependencies: #127695	2024-06-11 18:57:25 +00:00
Jing Xu	205410cb44	add xpu to torch.tensors (#127280 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280 Approved by: https://github.com/svekars	2024-06-11 18:13:01 +00:00
Eddie Yan	cac7a22b92	[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177 ) Similar in spirit to #125790, hopefully addresses failures seen for cuDNN 9.1 upgrade: https://github.com/pytorch/pytorch/pull/128166 CC @nWEIdia @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-06-11 18:09:25 +00:00
Wanchao Liang	8a09940a54	[inductor] fix compile time regression by caching get_gpu_type (#128363 ) We recently observed a significant compile time regression in torchtitan when turning on 2D parallel + torch.compile, so I decided to get a deeper understanding of why. It turns out this affects all training jobs that have functional collectives captured in the graph, not only 2D parallel (2D parallel was just the job that happened to have collectives captured in the TP region). The root cause is that during inductor lowering we call the comm analysis pass to get an estimated collective time for each collective node in the graph. For each collective node checked, we call `get_gpu_type()`, which under the hood calls `torch.utils.collect_env.run` to get the GPU info. However, this call is very expensive: it effectively spawns a new process and calls `nvidia-smi` to get the GPU info, so the cost is linear in the number of collective nodes in the graph. See https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75 The fix is to add an LRU cache to the function, so that we only call it once and reuse the cached result afterwards. torchtitan benchmark shows: * before this fix: 2D parallel + fp8 compile time: 6min + * after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement) There is more room to improve the compile time, but this PR fixes the biggest regression I have found so far. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363 Approved by: https://github.com/yf225	2024-06-11 18:02:13 +00:00
PyTorch MergeBot	1d233b8f50	Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704 )" This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165. Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
PyTorch MergeBot	491c4a5dcb	Revert "Make sure #126704 is BC for torch.save-ed `nn.Module` (#128344 )" This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb. Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
Angela Yi	4345d98663	[dynamo] Fix for #127696 (#128358 ) Test Plan: `buck2 test @//mode/dev-nosan //executorch/exir/backend/...` https://www.internalfb.com/intern/testinfra/testrun/12666373989243932 Differential Revision: D58384518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358 Approved by: https://github.com/ydwu4	2024-06-11 16:43:15 +00:00
ankurneog	a838e90964	Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970 ) ### Motivation The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the PyTorch framework UTs. However, because it is an out-of-tree device, we face challenges in adapting the device to natively run the existing PyTorch UTs under pytorch/test. The UTs are nevertheless a good indicator of the device stack health, and as such we run them regularly with adaptations. Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes, skipping, and overriding opInfo. With significant changes introduced in every PyTorch release, maintaining these adaptations becomes difficult and time consuming. Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded. The eventual goal is to make Gaudi out-of-tree support equivalent to in-tree devices. ### Changes Add HPUTestBase of type DeviceTypeTestBase, specifying appropriate attributes for Gaudi/HPU. Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests. ### Additional Context Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970 Approved by: https://github.com/albanD	2024-06-11 16:35:17 +00:00
David Berard	29081059b6	[Static Runtime] Fix & run gen_static_runtime_ops (#128299 ) gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise. I added a number of ops to the blocklist: ``` + "_nested_tensor_storage_offsets", + "_nested_get_values", # no CPU backend + "_nested_get_values_copy", # no CPU backend + "_nested_view_from_jagged", # testing needs to be patched + "_nested_view_from_jagged_copy", # testing needs to be patched + "_nested_view_from_buffer", # testing needs to be patched + "_nested_view_from_buffer_copy", # testing needs to be patched + "_int_mm", # testing needs to be patched + "_to_sparse_csc", # testing needs to be patched + "_to_sparse_csr", # testing needs to be patched + "segment_reduce", # testing needs to be patched ``` Most of these are added just because testing doesn't work right now. Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though. Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299 Approved by: https://github.com/YuqingJ	2024-06-11 16:27:39 +00:00
Nikita Shulga	f8c45996d5	[MPS] Make erfinv compilable for bfloat16 (#128375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375 Approved by: https://github.com/Skylion007 ghstack dependencies: #128373	2024-06-11 16:04:11 +00:00
Aaron Orenstein	c13e03c874	Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374 Approved by: https://github.com/Skylion007	2024-06-11 15:58:28 +00:00
Nikita Shulga	053930e194	[MPS][BE] Remove code duplication (#128373 ) Use `scalarToMetalTypeString` instead of `getMetalType` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373 Approved by: https://github.com/Skylion007	2024-06-11 15:58:04 +00:00
Huamin Li	9a38cae299	[AOTI] Switch to use shim v2 (#127674 ) Differential Revision: D56709309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674 Approved by: https://github.com/desertfire	2024-06-11 15:01:25 +00:00
kareem mohiddeen shaik	55901fb3da	[fx] Preserve Fx graph node order in partitioner across runs (#115621 ) Fixes #ISSUE_NUMBER partitioner generates different graph in recompilation on each run Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621 Approved by: https://github.com/ezyang	2024-06-11 14:04:52 +00:00
IvanKobzarev	fc77fdca6f	[guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224 Approved by: https://github.com/ezyang	2024-06-11 14:01:34 +00:00
FFFrog	648625b230	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-11 08:38:07 +00:00
Peter Bell	207c2248a8	[inductor] Fix lowering full with SymBool value (#128213 ) Fixes #128161, fixes #128095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213 Approved by: https://github.com/lezcano	2024-06-11 08:33:35 +00:00
Colin L Reliability Rice	a206dcc79e	fb_memcache: Move to fbcode from thirdparty (#128174 ) Summary: The fb_memcache injections location and path is changing. Test Plan: Existing tests should pass. Reviewed By: bertmaher, oulgen Differential Revision: D57973772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174 Approved by: https://github.com/oulgen	2024-06-11 07:46:12 +00:00
Animesh Jain	f2d7f235a6	[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 ) Fixes https://github.com/pytorch/pytorch/issues/101168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269 Approved by: https://github.com/jansel ghstack dependencies: #128295, #126578, #128268, #128254	2024-06-11 07:09:04 +00:00
Michael Lazos	402b289f3b	Properly register parameter for binary folding test (#128356 ) This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by dynamo, with inlining NN modules, this is no longer the case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356 Approved by: https://github.com/anijain2305 ghstack dependencies: #128355	2024-06-11 06:48:26 +00:00

4577 changed files with 178081 additions and 112859 deletions

									
										10

.ci/docker/README.md
									
												
				@ -1,4 +1,4 @@

				-# Docker images for GitHub CI

				+# Docker images for GitHub CI and CD

				This directory contains everything needed to build the Docker images

				that are used in our CI.

				@ -12,7 +12,7 @@ each image as the `BUILD_ENVIRONMENT` environment variable.

				See `build.sh` for valid build environments (it's the giant switch).

				-## Contents

				+## Docker CI builds

				* `build.sh` -- dispatch script to launch all builds

				* `common` -- scripts used to execute individual Docker build stages

				@ -21,6 +21,12 @@ See `build.sh` for valid build environments (it's the giant switch).

				* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support

				* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support

				+### Docker CD builds

				+* `conda` - Dockerfile and build.sh to build Docker images used in nightly conda builds

				+* `manywheel` - Dockerfile and build.sh to build Docker images used in nightly manywheel builds

				+* `libtorch` - Dockerfile and build.sh to build Docker images used in nightly libtorch builds

				## Usage

				```bash

6

.ci/docker/aotriton_version.txt


 @@ -1,5 +1,5 @@
  .6b
  manylinux_2_17
 -rocm6
 -b5df8c8123f90cba3ede7e971e6fbc6040d506
 -db6ecbc915893ff967abd6e1b43bd5f54949868873be60dc802086c3863e648
 +rocm6.1
 +f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
 +c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

									
										24

.ci/docker/build.sh
									
												
				@ -373,6 +373,13 @@ case "$image" in

				    CONDA_CMAKE=yes

				    EXECUTORCH=yes

				    ;;

				+  pytorch-linux-jammy-py3.12-halide)

				+    CUDA_VERSION=12.4

				+    ANACONDA_PYTHON_VERSION=3.12

				+    GCC_VERSION=11

				+    CONDA_CMAKE=yes

				+    HALIDE=yes

				+    ;;

				  pytorch-linux-focal-linter)

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

				    # We will need to update mypy version eventually, but that's for another day. The task

				@ -400,6 +407,22 @@ case "$image" in

				    # from pytorch/llvm:9.0.1 is x86 specific

				    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				    ;;

				+  pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks)

				+    ANACONDA_PYTHON_VERSION=3.10

				+    GCC_VERSION=11

				+    ACL=yes

				+    PROTOBUF=yes

				+    DB=yes

				+    VISION=yes

				+    CONDA_CMAKE=yes

				+    # snadampal: skipping sccache due to the following issue

				+    # https://github.com/pytorch/pytorch/issues/121559

				+    SKIP_SCCACHE_INSTALL=yes

				+    # snadampal: skipping llvm src build install because the current version

				+    # from pytorch/llvm:9.0.1 is x86 specific

				+    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				+    INDUCTOR_BENCHMARKS=yes

				+    ;;

				  *)

				    # Catch-all for builds that are not hardcoded.

				    PROTOBUF=yes

				@ -490,6 +513,7 @@ docker build \

				       --build-arg "DOCS=${DOCS}" \

				       --build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

				       --build-arg "EXECUTORCH=${EXECUTORCH}" \

				+       --build-arg "HALIDE=${HALIDE}" \

				       --build-arg "XPU_VERSION=${XPU_VERSION}" \

				       --build-arg "ACL=${ACL:-}" \

				       --build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \

2

.ci/docker/ci_commit_pins/executorch.txt


 @@ -1 +1 @@
 -d4b3e5cc607e97afdba79dc90f8ef968142f347c
 +91298923a0076c1b41059efb6dad2876426e4b03

1

.ci/docker/ci_commit_pins/halide.txt Normal file


				`@ -0,0 +1 @@`
				`340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d`

2

.ci/docker/ci_commit_pins/triton-rocm.txt


 @@ -1 +1 @@
 -cbe5045a6898c9a925f01435c8277b2fe6afcc
 +eae954efa5bf584da70324b640288c3ee7aede

2

.ci/docker/ci_commit_pins/triton-xpu.txt


 @@ -1 +1 @@
 -b8c64f64c18d8cac598b3adb355c21e7439c21de
 +b2f15840e0d70eec50d84c7a0575cb835524def

2

.ci/docker/ci_commit_pins/triton.txt


 @@ -1 +1 @@
 -fff310c891f5a92d55445adf8cc9d29df5841e
 +dedb7bdf339a3546896d4820366ca562c586bfa0

5

.ci/docker/common/aotriton_version.txt Normal file


 @ -0,0 +1,5 @@
 .6b
 manylinux_2_17
 rocm6.1
 b5df8c8123f90cba3ede7e971e6fbc6040d506
 c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

									
										2

.ci/docker/common/install_aotriton.sh
									
												
				@ -9,7 +9,7 @@ TARBALL='aotriton.tar.bz2'

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				-AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}.tar.bz2"

				+AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

									
										2

.ci/docker/common/install_conda.sh
									
												
				@ -85,7 +85,7 @@ fi

				  else

				    CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"

				-    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then

				+    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then

				      conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}

				    else

				      conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}

									
										20

.ci/docker/common/install_conda_docker.sh
									
										Executable file
									
												
				@ -0,0 +1,20 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				# Anaconda

				# Latest anaconda is using openssl-3 which is incompatible with all currently published versions of git

				# Which are using openssl-1.1.1, see https://anaconda.org/anaconda/git/files?version=2.40.1 for example

				MINICONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-py311_23.5.2-0-Linux-x86_64.sh

				wget -q $MINICONDA_URL

				# NB: Manually invoke bash per https://github.com/conda/conda/issues/10431

				bash $(basename "$MINICONDA_URL") -b -p /opt/conda

				rm $(basename "$MINICONDA_URL")

				export PATH=/opt/conda/bin:$PATH

				# See https://github.com/pytorch/builder/issues/1473

				# Pin conda to 23.5.2 as it's the last one compatible with openssl-1.1.1

				conda install -y conda=23.5.2 conda-build anaconda-client git ninja

				# The cmake version here needs to match with the minimum version of cmake

				# supported by PyTorch (3.18). There is only 3.18.2 on anaconda

				/opt/conda/bin/pip3 install cmake==3.18.2

				conda remove -y --force patchelf

									
										95

.ci/docker/common/install_cpython.sh
									
										Executable file
									
												
				@ -0,0 +1,95 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -uex -o pipefail

				PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python

				PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads

				GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py

				# Python versions to be installed in /opt/$VERSION_NO

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}

				function check_var {

				    if [ -z "$1" ]; then

				        echo "required variable not defined"

				        exit 1

				    fi

				}

				function do_cpython_build {

				    local py_ver=$1

				    local py_folder=$2

				    check_var $py_ver

				    check_var $py_folder

				    tar -xzf Python-$py_ver.tgz

				    pushd $py_folder

				    local prefix="/opt/_internal/cpython-${py_ver}"

				    mkdir -p ${prefix}/lib

				    if [[ -n $(which patchelf) ]]; then

				        local shared_flags="--enable-shared"

				    else

				        local shared_flags="--disable-shared"

				    fi

				    if [[ -z  "${WITH_OPENSSL+x}" ]]; then

				        local openssl_flags=""

				    else

				        local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"

				    fi

				    # -Wformat added for https://bugs.python.org/issue17547 on Python 2.6

				    CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null

				    make -j40 > /dev/null

				    make install > /dev/null

				    if [[ "${shared_flags}" == "--enable-shared" ]]; then

				        patchelf --set-rpath '$ORIGIN/../lib' ${prefix}/bin/python3

				    fi

				    popd

				    rm -rf $py_folder

				    # Some python's install as bin/python3. Make them available as

				    # bin/python.

				    if [ -e ${prefix}/bin/python3 ]; then

				        ln -s python3 ${prefix}/bin/python

				    fi

				    ${prefix}/bin/python get-pip.py

				    if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then

				        ln -s pip3 ${prefix}/bin/pip

				    fi

				    ${prefix}/bin/pip install wheel==0.34.2

				    local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")

				    ln -s ${prefix} /opt/python/${abi_tag}

				}

				function build_cpython {

				    local py_ver=$1

				    check_var $py_ver

				    check_var $PYTHON_DOWNLOAD_URL

				    local py_ver_folder=$py_ver

				    if [ "$py_ver" = "3.13.0" ]; then

				        PY_VER_SHORT="3.13"

				        check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH

				        wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

				        do_cpython_build $py_ver cpython-$PY_VER_SHORT

				    else

				        wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz

				        do_cpython_build $py_ver Python-$py_ver

				    fi

				    rm -f Python-$py_ver.tgz

				}

				function build_cpythons {

				    check_var $GET_PIP_URL

				    curl -sLO $GET_PIP_URL

				    for py_ver in $@; do

				        build_cpython $py_ver

				    done

				    rm -f get-pip.py

				}

				mkdir -p /opt/python

				mkdir -p /opt/_internal

				build_cpythons $CPYTHON_VERSIONS

									
										239

.ci/docker/common/install_cuda.sh
									
										Normal file
									
												
				@ -0,0 +1,239 @@

				#!/bin/bash

				set -ex

				NCCL_VERSION=v2.21.5-1

				CUDNN_VERSION=9.1.0.70

				function install_cusparselt_040 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz

				    tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz

				    cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_052 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_118 {

				    echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"

				    rm -rf /usr/local/cuda-11.8 /usr/local/cuda

				    # install CUDA 11.8.0 in the same container

				    wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run

				    chmod +x cuda_11.8.0_520.61.05_linux.run

				    ./cuda_11.8.0_520.61.05_linux.run --toolkit --silent

				    rm -f cuda_11.8.0_520.61.05_linux.run

				    rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn && cd tmp_cudnn

				    wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz

				    tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf tmp_cudnn

				    # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				    # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				    git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				    cd nccl && make -j src.build

				    cp -a build/include/* /usr/local/cuda/include/

				    cp -a build/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf nccl

				    install_cusparselt_040

				    ldconfig

				}

				function install_121 {

				    echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				    rm -rf /usr/local/cuda-12.1 /usr/local/cuda

				    # install CUDA 12.1.0 in the same container

				    wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run

				    chmod +x cuda_12.1.1_530.30.02_linux.run

				    ./cuda_12.1.1_530.30.02_linux.run --toolkit --silent

				    rm -f cuda_12.1.1_530.30.02_linux.run

				    rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn && cd tmp_cudnn

				    wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf tmp_cudnn

				    # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				    # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				    git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				    cd nccl && make -j src.build

				    cp -a build/include/* /usr/local/cuda/include/

				    cp -a build/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf nccl

				    install_cusparselt_052

				    ldconfig

				}

				function install_124 {

				  echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

				  chmod +x cuda_12.4.0_550.54.14_linux.run

				  ./cuda_12.4.0_550.54.14_linux.run --toolkit --silent

				  rm -f cuda_12.4.0_550.54.14_linux.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_052

				  ldconfig

				}

				function prune_118 {

				    echo "Pruning CUDA 11.8 and cuDNN"

				    #####################################################################################

				    # CUDA 11.8 prune static libs

				    #####################################################################################

				    export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"

				    export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"

				    export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    if [[ -n "$OVERRIDE_GENCODE" ]]; then

				        export GENCODE=$OVERRIDE_GENCODE

				    fi

				    # all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)

				    ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				    # prune CuDNN and CuBLAS

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				    #####################################################################################

				    # CUDA 11.8 prune visual tools

				    #####################################################################################

				    export CUDA_BASE="/usr/local/cuda-11.8/"

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/

				}

				function prune_121 {

				  echo "Pruning CUDA 12.1"

				  #####################################################################################

				  # CUDA 12.1 prune static libs

				  #####################################################################################

				    export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"

				    export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"

				    export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    if [[ -n "$OVERRIDE_GENCODE" ]]; then

				        export GENCODE=$OVERRIDE_GENCODE

				    fi

				    # all CUDA libs except CuDNN and CuBLAS

				    ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				    # prune CuDNN and CuBLAS

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				    #####################################################################################

				    # CUDA 12.1 prune visual tools

				    #####################################################################################

				    export CUDA_BASE="/usr/local/cuda-12.1/"

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				  # CUDA 12.4 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then

				      export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.4 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.4/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    11.8) install_118; prune_118

				        ;;

				    12.1) install_121; prune_121

				        ;;

				    12.4) install_124; prune_124

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac

				    shift

				done
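The `ls | grep | grep -v` chain in the prune functions above selects which static archives are handed to `nvprune`. A minimal sketch of that selection step on its own, assuming the archive names arrive on stdin: the function name `select_prunable_libs` is hypothetical, and the `\.a$` anchor is slightly stricter than the script's `grep "\.a"`, which matches `.a` anywhere in the name.

```shell
#!/bin/bash
# select_prunable_libs is a hypothetical helper, not part of install_cuda.sh.
# It reads file names on stdin and keeps static archives that the script
# would pass to nvprune with $GENCODE, skipping the excluded libraries.
select_prunable_libs() {
    # Keep names ending in ".a", then drop the libraries the script excludes.
    grep '\.a$' | grep -v -E 'culibos|cudart|cudnn|cublas|metis'
}

# Example: list the candidates without pruning anything.
printf '%s\n' libcufft_static.a libcudart_static.a libcublas_static.a libnvrtc.so \
    | select_prunable_libs
```

Running the filter by itself like this is a cheap way to check which archives a new CUDA release would expose to pruning before touching the files.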

									
										93

.ci/docker/common/install_cuda_aarch64.sh
									
										Normal file
									
												
				@ -0,0 +1,93 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				NCCL_VERSION=v2.21.5-1

				function install_cusparselt_052 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz

				    tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz

				    cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_124 {

				  echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run

				  chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run

				  ./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.4.0_550.54.14_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz

				  tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz

				  cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_052

				  ldconfig

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				  # CUDA 12.4 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.4 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.4/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    12.4) install_124; prune_124

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac

				    shift

				done
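The `while`/`case` loop above consumes one positional argument per iteration, so several versions can in principle be requested in one invocation and the first unknown one aborts the run. A dry-run sketch of that dispatch, assuming stand-in functions that only print: `dispatch` and the echoing bodies are hypothetical, and `return 1` replaces the script's `exit 1` so the function can be called safely from an interactive shell.

```shell
#!/bin/bash
# Hypothetical stand-ins that only print what the real functions would do.
install_124() { echo "install 12.4"; }
prune_124()   { echo "prune 12.4"; }

# dispatch mirrors the argument loop: handle "$1", then shift to the next one.
dispatch() {
    while test $# -gt 0; do
        case "$1" in
        12.4) install_124; prune_124
            ;;
        *) echo "bad argument $1"; return 1
            ;;
        esac
        shift
    done
}

dispatch 12.4
```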

									
										14

.ci/docker/common/install_executorch.sh
									
												
				@ -37,6 +37,9 @@ install_conda_dependencies() {

				install_pip_dependencies() {

				  pushd executorch/.ci/docker

				  # Install the PyTorch CPU build beforehand to avoid installing the much bigger CUDA

				  # binaries later; ExecuTorch only needs the CPU build

				  pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

				  # Install all Python dependencies

				  pip_install -r requirements-ci.txt

				  popd

				@ -44,13 +47,14 @@ install_pip_dependencies() {

				setup_executorch() {

				  pushd executorch

				  source .ci/scripts/utils.sh

				  # Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate

				  as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh

				  install_flatc_from_source

				  pip_install .

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  # Make sure that all the newly generated files are owned by Jenkins

				  chown -R jenkins .

				  as_jenkins .ci/scripts/setup-linux.sh cmake

				  popd

				}

									
										46

.ci/docker/common/install_halide.sh
									
										Normal file
									
												
				@ -0,0 +1,46 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				COMMIT=$(get_pinned_commit halide)

				test -n "$COMMIT"

				# activate conda to populate CONDA_PREFIX

				test -n "$ANACONDA_PYTHON_VERSION"

				eval "$(conda shell.bash hook)"

				conda activate py_$ANACONDA_PYTHON_VERSION

				if [ -n "${UBUNTU_VERSION}" ];then

				    apt update

				    apt-get install -y lld liblld-15-dev libpng-dev libjpeg-dev libgl-dev \

				                  libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev

				fi

				conda_install numpy scipy imageio cmake ninja

				git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git

				cmake -DCMAKE_BUILD_TYPE=Release \

				        -DLLVM_ENABLE_PROJECTS="clang" \

				        -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \

				        -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \

				        -DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \

				        -S llvm-project/llvm -B llvm-build -G Ninja

				cmake --build llvm-build

				cmake --install llvm-build --prefix llvm-install

				export LLVM_ROOT=`pwd`/llvm-install

				export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config

				git clone https://github.com/halide/Halide.git

				pushd Halide

				git checkout ${COMMIT} && git submodule update --init --recursive

				pip_install -r requirements.txt

				cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build

				cmake --build build

				test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3

				cmake --install build --prefix ${CONDA_PREFIX}

				chown -R jenkins ${CONDA_PREFIX}

				popd

				rm -rf Halide llvm-build llvm-project llvm-install

				python -c "import halide"  # check for errors

									
										23

.ci/docker/common/install_libpng.sh
									
										Normal file
									
												
				@ -0,0 +1,23 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				LIBPNG_VERSION=1.6.37

				mkdir -p libpng

				pushd libpng

				wget http://download.sourceforge.net/libpng/libpng-$LIBPNG_VERSION.tar.gz

				tar -xvzf libpng-$LIBPNG_VERSION.tar.gz

				pushd libpng-$LIBPNG_VERSION

				./configure

				make

				make install

				popd

				popd

				rm -rf libpng

									
										29

.ci/docker/common/install_magma.sh
									
										Normal file
									
												
				@ -0,0 +1,29 @@

				#!/usr/bin/env bash

				# Script used only in CD pipeline

				set -euo pipefail

				MAGMA_VERSION="2.5.2"

				function do_install() {

				    cuda_version=$1

				    cuda_version_nodot=${1/./}

				    MAGMA_VERSION="2.6.1"

				    magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"

				    cuda_dir="/usr/local/cuda-${cuda_version}"

				    (

				        set -x

				        tmp_dir=$(mktemp -d)

				        pushd ${tmp_dir}

				        curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}

				        tar -xvf "${magma_archive}"

				        mkdir -p "${cuda_dir}/magma"

				        mv include "${cuda_dir}/magma/include"

				        mv lib "${cuda_dir}/magma/lib"

				        popd

				    )

				}

				do_install $1
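`do_install` derives the archive name from the CUDA version with the bash expansion `${1/./}`, which removes the first dot only. A small sketch of that naming step in isolation, assuming the archive pattern shown above: the helper `magma_archive_name` and its two-argument form are hypothetical.

```shell
#!/bin/bash
# magma_archive_name is a hypothetical helper, not part of install_magma.sh.
# Arguments: <cuda_version> <magma_version>
magma_archive_name() {
    local cuda_version="$1" magma_version="$2"
    # ${var/./} deletes the first "." so "12.4" becomes "124".
    local cuda_version_nodot="${cuda_version/./}"
    echo "magma-cuda${cuda_version_nodot}-${magma_version}-1.tar.bz2"
}

magma_archive_name 12.4 2.6.1
```

Because only the first dot is removed, a three-component version such as `12.4.1` would yield `124.1`, so the sketch, like the script, expects a `major.minor` argument.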

									
										134

.ci/docker/common/install_miopen.sh
									
										Normal file
									
												
				@ -0,0 +1,134 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				ROCM_VERSION=$1

				if [[ -z $ROCM_VERSION ]]; then

				    echo "missing ROCM_VERSION"

				    exit 1;

				fi

				# To make version comparison easier, create an integer representation.

				save_IFS="$IFS"

				IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})

				IFS="$save_IFS"

				if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then

				    ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}

				    ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}

				    ROCM_VERSION_PATCH=0

				elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then

				    ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}

				    ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}

				    ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}

				else

				    echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"

				    exit 1

				fi

				ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))

				# Install custom MIOpen + COMgr for ROCm >= 4.0.1

				if [[ $ROCM_INT -lt 40001 ]]; then

				    echo "ROCm version < 4.0.1; will not install custom MIOpen"

				    exit 0

				fi

				# Function to retry functions that sometimes timeout or have flaky failures

				retry () {

				    "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") || (sleep 4 && "$@") || (sleep 8 && "$@")

				}

				# Build custom MIOpen to use comgr for offline compilation.

				## Need a sanitized ROCM_VERSION without patchlevel; patchlevel version 0 must be added to paths.

				ROCM_DOTS=$(echo ${ROCM_VERSION} | tr -d -c '.' | wc -c)

				if [[ ${ROCM_DOTS} == 1 ]]; then

				    ROCM_VERSION_NOPATCH="${ROCM_VERSION}"

				    ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}.0"

				else

				    ROCM_VERSION_NOPATCH="${ROCM_VERSION%.*}"

				    ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"

				fi

				# MIOPEN_USE_HIP_KERNELS is a Workaround for COMgr issues

				MIOPEN_CMAKE_COMMON_FLAGS="

				-DMIOPEN_USE_COMGR=ON

				-DMIOPEN_BUILD_DRIVER=OFF

				"

				# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version

				if [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then

				    echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"

				    exit 0

				elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then

				    echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"

				    exit 0

				elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then

				    echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"

				    exit 0

				elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then

				    MIOPEN_BRANCH="release/rocm-rel-5.6-staging"

				elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then

				    MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"

				elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then

				    MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"

				    MIOPEN_BRANCH="release/rocm-rel-5.4-staging"

				elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then

				    MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"

				    MIOPEN_BRANCH="release/rocm-rel-5.3-staging"

				elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then

				    MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"

				    MIOPEN_BRANCH="release/rocm-rel-5.2-staging"

				elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then

				    MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"

				    MIOPEN_BRANCH="release/rocm-rel-5.1-staging"

				elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then

				    MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"

				    MIOPEN_BRANCH="release/rocm-rel-5.0-staging"

				else

				    echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"

				    exit 1

				fi

				yum remove -y miopen-hip

				git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}

				pushd MIOpen

				# remove .git to save disk space since CI runner was running out

				rm -rf .git

				# Don't build MLIR to save docker build time

				# since we are disabling MLIR backend for MIOpen anyway

				if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then

				    sed -i '/rocMLIR/d' requirements.txt

				elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then

				    sed -i '/llvm-project-mlir/d' requirements.txt

				fi

				## MIOpen minimum requirements

				cmake -P install_deps.cmake --minimum

				# clean up since CI runner was running out of disk space

				rm -rf /tmp/*

				yum clean all

				rm -rf /var/cache/yum

				rm -rf /var/lib/yum/yumdb

				rm -rf /var/lib/yum/history

				## Build MIOpen

				mkdir -p build

				cd build

				PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \

				    ${MIOPEN_CMAKE_COMMON_FLAGS} \

				    ${MIOPEN_CMAKE_DB_FLAGS} \

				    -DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"

				make MIOpen -j $(nproc)

				# Build MIOpen package

				make -j $(nproc) package

				# clean up since CI runner was running out of disk space

				rm -rf /usr/local/cget

				yum install -y miopen-*.rpm

				popd

				rm -rf MIOpen
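The script encodes `ROCM_VERSION` as `major * 10000 + minor * 100 + patch` so that version ranges can be compared with `-ge` and `-lt`. A compact sketch of the same encoding as a reusable function, assuming two or three purely numeric components without leading zeros: the name `version_to_int` is hypothetical, and unlike the script it does not reject other shapes.

```shell
#!/bin/bash
# version_to_int is a hypothetical helper, not part of install_miopen.sh.
# It maps "major.minor[.patch]" to major*10000 + minor*100 + patch.
version_to_int() {
    local major minor patch
    # Split on "." for this read only; a missing patch level is left empty.
    IFS=. read -r major minor patch <<< "$1"
    echo $(( major * 10000 + ${minor:-0} * 100 + ${patch:-0} ))
}

# Example: the same range test the script uses for the ROCm 5.6 branch.
v=$(version_to_int 5.6.1)
if [[ $v -ge 50600 ]] && [[ $v -lt 50700 ]]; then
    echo "release/rocm-rel-5.6-staging"
fi
```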

									
										16

.ci/docker/common/install_mkl.sh
									
										Normal file
									
												
				@ -0,0 +1,16 @@

				#!/bin/bash

				set -ex

				# MKL

				MKL_VERSION=2024.2.0

				MKLROOT=/opt/intel

				mkdir -p ${MKLROOT}

				pushd /tmp

				python3 -mpip install wheel

				python3 -mpip download -d . mkl-static==${MKL_VERSION}

				python3 -m wheel unpack mkl_static-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl

				python3 -m wheel unpack mkl_include-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl

				mv mkl_static-${MKL_VERSION}/mkl_static-${MKL_VERSION}.data/data/lib ${MKLROOT}

				mv mkl_include-${MKL_VERSION}/mkl_include-${MKL_VERSION}.data/data/include ${MKLROOT}

									
										13

.ci/docker/common/install_mnist.sh
									
										Normal file
									
												
				@ -0,0 +1,13 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				mkdir -p /usr/local/mnist/

				cd /usr/local/mnist

				for img in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do

				  wget -q https://ossci-datasets.s3.amazonaws.com/mnist/$img

				  gzip -d $img

				done

									
										4

.ci/docker/common/install_onnx.sh
									
												
				@ -33,7 +33,9 @@ pip_install coloredlogs packaging

				pip_install onnxruntime==1.18

				pip_install onnx==1.16.0

				# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps

				pip_install onnxscript==0.1.0.dev20240523 --no-deps

				pip_install onnxscript==0.1.0.dev20240613 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

									
										22

.ci/docker/common/install_openblas.sh
									
										Normal file
									
												
				@ -0,0 +1,22 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				cd /

				git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules

				OPENBLAS_BUILD_FLAGS="

				NUM_THREADS=128

				USE_OPENMP=1

				NO_SHARED=0

				DYNAMIC_ARCH=1

				TARGET=ARMV8

				CFLAGS=-O3

				"

				OPENBLAS_CHECKOUT_DIR="OpenBLAS"

				make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}

				make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}

									
										16

.ci/docker/common/install_patchelf.sh
									
										Normal file
									
												
				@ -0,0 +1,16 @@

				#!/bin/bash

				# Script used only in CD pipeline

				set -ex

				# Pin the version to the latest release, 0.17.2; building newer commits

				# fails on the current image

				git clone -b 0.17.2 --single-branch https://github.com/NixOS/patchelf

				cd patchelf

				sed -i 's/serial/parallel/g' configure.ac

				./bootstrap.sh

				./configure

				make

				make install

				cd ..

				rm -rf patchelf

									
										150

.ci/docker/common/install_rocm_drm.sh
									
										Normal file
									
												
				@ -0,0 +1,150 @@

				#!/bin/bash

				# Script used only in CD pipeline

				###########################

				### prereqs

				###########################

				# Install Python packages depending on the base OS

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  ubuntu)

				    apt-get update -y

				    apt-get install -y libpciaccess-dev pkg-config

				    apt-get clean

				    ;;

				  centos)

				    yum install -y libpciaccess-devel pkgconfig

				    ;;

				  *)

				    echo "Unable to determine OS..."

				    exit 1

				    ;;

				esac

				python3 -m pip install meson ninja

				###########################

				### clone repo

				###########################

				GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git

				pushd drm

				###########################

				### patch

				###########################

				patch -p1 <<'EOF'

				diff --git a/amdgpu/amdgpu_asic_id.c b/amdgpu/amdgpu_asic_id.c

				index a5007ffc..13fa07fc 100644

				--- a/amdgpu/amdgpu_asic_id.c

				+++ b/amdgpu/amdgpu_asic_id.c

				@@ -22,6 +22,13 @@

				  *

				  */

				+#define _XOPEN_SOURCE 700

				+#define _LARGEFILE64_SOURCE

				+#define _FILE_OFFSET_BITS 64

				+#include <ftw.h>

				+#include <link.h>

				+#include <limits.h>

				+

				 #include <ctype.h>

				 #include <stdio.h>

				 #include <stdlib.h>

				@@ -34,6 +41,19 @@

				 #include "amdgpu_drm.h"

				 #include "amdgpu_internal.h"

				+static char *amdgpuids_path = NULL;

				+static const char* amdgpuids_path_msg = NULL;

				+

				+static int check_for_location_of_amdgpuids(const char *filepath, const struct stat *info, const int typeflag, struct FTW *pathinfo)

				+{

				+	if (typeflag == FTW_F && strstr(filepath, "amdgpu.ids")) {

				+		amdgpuids_path = strdup(filepath);

				+		return 1;

				+	}

				+

				+	return 0;

				+}

				+

				 static int parse_one_line(struct amdgpu_device *dev, const char *line)

				 {

				 	char *buf, *saveptr;

				@@ -113,10 +133,46 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)

				 	int line_num = 1;

				 	int r = 0;

				+	// attempt to find typical location for amdgpu.ids file

				 	fp = fopen(AMDGPU_ASIC_ID_TABLE, "r");

				+

				+	// if it doesn't exist, search

				+	if (!fp) {

				+

				+	char self_path[ PATH_MAX ];

				+	ssize_t count;

				+	ssize_t i;

				+

				+	count = readlink( "/proc/self/exe", self_path, PATH_MAX );

				+	if (count > 0) {

				+		self_path[count] = '\0';

				+

				+		// remove '/bin/python' from self_path

				+		for (i=count; i>0; --i) {

				+			if (self_path[i] == '/') break;

				+			self_path[i] = '\0';

				+		}

				+		self_path[i] = '\0';

				+		for (; i>0; --i) {

				+			if (self_path[i] == '/') break;

				+			self_path[i] = '\0';

				+		}

				+		self_path[i] = '\0';

				+

				+		if (1 == nftw(self_path, check_for_location_of_amdgpuids, 5, FTW_PHYS)) {

				+			fp = fopen(amdgpuids_path, "r");

				+			amdgpuids_path_msg = amdgpuids_path;

				+		}

				+	}

				+

				+	}

				+	else {

				+		amdgpuids_path_msg = AMDGPU_ASIC_ID_TABLE;

				+	}

				+

				+	// both hard-coded location and search have failed

				 	if (!fp) {

				-		fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,

				-			strerror(errno));

				+		fprintf(stderr, "amdgpu.ids: No such file or directory\n");

				 		return;

				 	}

				@@ -132,7 +188,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)

				 			continue;

				 		}

				-		drmMsg("%s version: %s\n", AMDGPU_ASIC_ID_TABLE, line);

				+		drmMsg("%s version: %s\n", amdgpuids_path_msg, line);

				 		break;

				 	}

				@@ -150,7 +206,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)

				 	if (r == -EINVAL) {

				 		fprintf(stderr, "Invalid format: %s: line %d: %s\n",

				-			AMDGPU_ASIC_ID_TABLE, line_num, line);

				+			amdgpuids_path_msg, line_num, line);

				 	} else if (r && r != -EAGAIN) {

				 		fprintf(stderr, "%s: Cannot parse ASIC IDs: %s\n",

				 			__func__, strerror(-r));

				EOF

				###########################

				### build

				###########################

				meson builddir --prefix=/opt/amdgpu

				pushd builddir

				ninja install

				popd

				popd

									
										13

.ci/docker/common/install_rocm_magma.sh
									
												
				@ -1,7 +1,11 @@

				#!/bin/bash

				# Script used in CI and CD pipeline

				set -ex

				MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}

				# "install" hipMAGMA into /opt/rocm/magma by copying after build

				git clone https://bitbucket.org/icl/magma.git

				pushd magma

				@ -11,7 +15,10 @@ git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6

				cp make.inc-examples/make.inc.hip-gcc-mkl make.inc

				echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

				echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc

				if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then

				    echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc

				fi

				echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc

				echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc

				export PATH="${PATH}:/opt/rocm/bin"

				if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

				@ -25,7 +32,7 @@ done

				# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition

				sed -i 's/^FOPENMP/#FOPENMP/g' make.inc

				make -f make.gen.hipMAGMA -j $(nproc)

				LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION

				make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION

				LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"

				make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"

				popd

				mv magma /opt/rocm

									
										110

.ci/docker/common/install_xpu.sh
									
												
				@ -1,6 +1,6 @@

				#!/bin/bash

				set -xe

				# Script used in CI and CD pipeline

				# Intel® software for general purpose GPU capabilities.

				# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html

				@ -8,19 +8,23 @@ set -xe

				# Users should update to the latest version as it becomes available

				function install_ubuntu() {

				    . /etc/os-release

				    if [[ ! " jammy " =~ " ${VERSION_CODENAME} " ]]; then

				        echo "Ubuntu version ${VERSION_CODENAME} not supported"

				        exit

				    fi

				    apt-get update -y

				    apt-get install -y gpg-agent wget

				    # Set up the repository. To do this, download the key to the system keyring

				    # To add the online network package repository for the GPU Driver LTS releases

				    wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \

				        | gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg

				    wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

				        | gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg

				    # Add the signed entry to APT sources and configure the APT client to use the Intel repository

				        | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg

				    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \

				        https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \

				        | tee /etc/apt/sources.list.d/intel-gpu-jammy.list

				        https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \

				        | tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list

				    # To add the online network package repository for the Intel Support Packages

				    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

				        | gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg

				    echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \

				        https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \

				        | tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list

				@ -97,6 +101,86 @@ EOF

				    rm -rf /var/lib/yum/history

				}

				function install_rhel() {

				    . /etc/os-release

				    if [[ "${ID}" == "rhel" ]]; then

				        if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then

				            echo "RHEL version ${VERSION_ID} not supported"

				            exit

				        fi

				    elif [[ "${ID}" == "almalinux" ]]; then

				        # Workaround for almalinux8, which is used by quay.io/pypa/manylinux_2_28_x86_64

				        VERSION_ID="8.6"

				    fi

				    dnf install -y 'dnf-command(config-manager)'

				    # To add the online network package repository for the GPU Driver LTS releases

				    dnf config-manager --add-repo \

				        https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo

				    # To add the online network package repository for the Intel Support Packages

				    tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF

				[intel-for-pytorch-gpu-dev]

				name=Intel for Pytorch GPU dev repository

				baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev

				enabled=1

				gpgcheck=1

				repo_gpgcheck=1

				gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB

				EOF

				    # The xpu-smi packages

				    dnf install -y xpu-smi

				    # Compute and Media Runtimes

				    dnf install -y \

				        intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\

				        level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \

				        mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \

				        mesa-libxatracker libvpl-tools intel-metrics-discovery \

				        intel-metrics-library intel-igc-core intel-igc-cm \

				        libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc

				    # Development packages

				    dnf install -y --refresh \

				        intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \

				        level-zero-devel

				    # Install Intel Support Packages

				    yum install -y intel-for-pytorch-gpu-dev intel-pti-dev

				    # Cleanup

				    dnf clean all

				    rm -rf /var/cache/yum

				    rm -rf /var/lib/yum/yumdb

				    rm -rf /var/lib/yum/history

				}

				function install_sles() {

				    . /etc/os-release

				    VERSION_SP=${VERSION_ID//./sp}

				    if [[ ! " 15sp4 15sp5 " =~ " ${VERSION_SP} " ]]; then

				        echo "SLES version ${VERSION_ID} not supported"

				        exit

				    fi

				    # To add the online network package repository for the GPU Driver LTS releases

				    zypper addrepo -f -r \

				        https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo

				    rpm --import https://repositories.intel.com/gpu/intel-graphics.key

				    # To add the online network package repository for the Intel Support Packages

				    zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev

				    rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB

				    # The xpu-smi packages

				    zypper install -y lsb-release flex bison xpu-smi

				    # Compute and Media Runtimes

				    zypper install -y intel-level-zero-gpu level-zero intel-gsc intel-opencl intel-ocloc \

				        intel-media-driver libigfxcmrt7 libvpl2 libvpl-tools libmfxgen1 libmfx1

				    # Development packages

				    zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel

				    # Install Intel Support Packages

				    zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev

				}

				# The installation depends on the base OS

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				@ -107,6 +191,12 @@ case "$ID" in

				    centos)

				        install_centos

				    ;;

				    rhel|almalinux)

				        install_rhel

				    ;;

				    sles)

				        install_sles

				    ;;

				    *)

				        echo "Unable to determine OS..."

				        exit 1
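The dispatch above keys off the `ID` field of `/etc/os-release`, and `install_sles` additionally turns `VERSION_ID` into a service-pack string before checking it against an allow-list. A minimal sketch of both steps follows. The helper names (`os_release_id`, `sles_version_sp`, `sles_version_supported`) and the optional file argument are assumptions added here for illustration, and the sketch sources the file (as `install_sles` does) instead of using the script's `grep -oP` lookbehind.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the ID field of an os-release style file.
# Sourcing inside a subshell keeps the file's variables out of the caller.
os_release_id() {
    local file="${1:-/etc/os-release}"
    ( . "$file" && printf '%s\n' "${ID:-}" )
}

# Hypothetical helper: map a SLES VERSION_ID such as "15.5" to "15sp5",
# using the same ${VAR//./sp} substitution as install_sles.
sles_version_sp() {
    local version_id="$1"
    printf '%s\n' "${version_id//./sp}"
}

# Hypothetical helper: succeed only for the service packs the script accepts.
sles_version_supported() {
    local version_sp
    version_sp="$(sles_version_sp "$1")"
    [[ " 15sp4 15sp5 " == *" ${version_sp} "* ]]
}
```

Padding both the list and the candidate with spaces is what keeps a partial string such as `15sp` from matching an entry by accident.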

									
										100

.ci/docker/conda/Dockerfile
									
										Normal file
									
				@ -0,0 +1,100 @@

				ARG CUDA_VERSION=10.2

				ARG BASE_TARGET=cuda${CUDA_VERSION}

				FROM centos:7 as base

				ENV LC_ALL en_US.UTF-8

				ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				ARG DEVTOOLSET_VERSION=9

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum update -y

				RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip

				# Just add everything as a safe.directory for git since these will be used in multiple places with git

				RUN git config --global --add safe.directory '*'

				RUN yum install -y yum-utils centos-release-scl

				RUN yum-config-manager --enable rhel-server-rhscl-7-rpms

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils

				# EPEL for cmake

				RUN yum --enablerepo=extras install -y epel-release

				# cmake

				RUN yum install -y cmake3 && \

				    ln -s /usr/bin/cmake3 /usr/bin/cmake

				ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH

				ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH

				RUN yum install -y autoconf aclocal automake make sudo

				RUN rm -rf /usr/local/cuda-*

				FROM base as patchelf

				# Install patchelf

				ADD ./common/install_patchelf.sh install_patchelf.sh

				RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf

				FROM base as openssl

				# Install openssl

				ADD ./common/install_openssl.sh install_openssl.sh

				RUN bash ./install_openssl.sh && rm install_openssl.sh

				FROM base as conda

				# Install Anaconda

				ADD ./common/install_conda_docker.sh install_conda.sh

				RUN bash ./install_conda.sh && rm install_conda.sh

				# Install CUDA

				FROM base as cuda

				ARG CUDA_VERSION=10.2

				RUN rm -rf /usr/local/cuda-*

				ADD ./common/install_cuda.sh install_cuda.sh

				ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}

				# Preserve CUDA_VERSION for the builds

				ENV CUDA_VERSION=${CUDA_VERSION}

				# Make things in our path by default

				ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH

				FROM cuda as cuda11.8

				RUN bash ./install_cuda.sh 11.8

				ENV DESIRED_CUDA=11.8

				FROM cuda as cuda12.1

				RUN bash ./install_cuda.sh 12.1

				ENV DESIRED_CUDA=12.1

				FROM cuda as cuda12.4

				RUN bash ./install_cuda.sh 12.4

				ENV DESIRED_CUDA=12.4

				# Install MNIST test data

				FROM base as mnist

				ADD ./common/install_mnist.sh install_mnist.sh

				RUN bash ./install_mnist.sh

				FROM base as all_cuda

				COPY --from=cuda11.8  /usr/local/cuda-11.8 /usr/local/cuda-11.8

				COPY --from=cuda12.1  /usr/local/cuda-12.1 /usr/local/cuda-12.1

				COPY --from=cuda12.4  /usr/local/cuda-12.4 /usr/local/cuda-12.4

				# Final step

				FROM ${BASE_TARGET} as final

				COPY --from=openssl            /opt/openssl           /opt/openssl

				COPY --from=patchelf           /patchelf              /usr/local/bin/patchelf

				COPY --from=conda              /opt/conda             /opt/conda

				# Add jni.h for java host build.

				COPY ./common/install_jni.sh install_jni.sh

				COPY ./java/jni.h jni.h

				RUN bash ./install_jni.sh && rm install_jni.sh

				ENV  PATH /opt/conda/bin:$PATH

				COPY --from=mnist  /usr/local/mnist /usr/local/mnist

				RUN rm -rf /usr/local/cuda

				RUN chmod o+rw /usr/local

				RUN touch /.condarc && \

				    chmod o+rw /.condarc && \

				    chmod -R o+rw /opt/conda

									
										76

.ci/docker/conda/build.sh
									
										Executable file
									
				@ -0,0 +1,76 @@

				#!/usr/bin/env bash

				# Script used only in CD pipeline

				set -eou pipefail

				image="$1"

				shift

				if [ -z "${image}" ]; then

				  echo "Usage: $0 IMAGE"

				  exit 1

				fi

				DOCKER_IMAGE_NAME="pytorch/${image}"

				export DOCKER_BUILDKIT=1

				TOPDIR=$(git rev-parse --show-toplevel)

				CUDA_VERSION=${CUDA_VERSION:-12.1}

				case ${CUDA_VERSION} in

				  cpu)

				    BASE_TARGET=base

				    DOCKER_TAG=cpu

				    ;;

				  all)

				    BASE_TARGET=all_cuda

				    DOCKER_TAG=latest

				    ;;

				  *)

				    BASE_TARGET=cuda${CUDA_VERSION}

				    DOCKER_TAG=cuda${CUDA_VERSION}

				    ;;

				esac

				(

				  set -x

				  docker build \

				    --target final \

				    --progress plain \

				    --build-arg "BASE_TARGET=${BASE_TARGET}" \

				    --build-arg "CUDA_VERSION=${CUDA_VERSION}" \

				    --build-arg "DEVTOOLSET_VERSION=9" \

				    -t ${DOCKER_IMAGE_NAME} \

				    $@ \

				    -f "${TOPDIR}/.ci/docker/conda/Dockerfile" \

				    ${TOPDIR}/.ci/docker/

				)

				if [[ "${DOCKER_TAG}" =~ ^cuda ]]; then

				  # Test that we're using the right CUDA compiler

				  (

				    set -x

				    docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"

				  )

				fi

				GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}

				GIT_BRANCH_NAME=${GITHUB_REF##*/}

				GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}

				DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}

				DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}

				if [[ "${WITH_PUSH:-}" == true ]]; then

				  (

				    set -x

				    docker push "${DOCKER_IMAGE_NAME}"

				    if [[ -n ${GITHUB_REF} ]]; then

				        docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}

				        docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}

				        docker push "${DOCKER_IMAGE_BRANCH_TAG}"

				        docker push "${DOCKER_IMAGE_SHA_TAG}"

				    fi

				  )

				fi
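The tagging logic at the end of this script derives two extra image tags from the git ref and the commit SHA. A minimal sketch of that derivation, assuming a hypothetical helper name (`image_tags`) and an example image name (`pytorch/example`) that do not appear in the script:

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the branch tag and the SHA tag for an image,
# given the image name, a git ref, and a commit SHA.
image_tags() {
    local image="$1" ref="$2" sha="$3"
    # ${ref##*/} removes the longest prefix ending in "/", so
    # "refs/heads/main" becomes "main".
    local branch="${ref##*/}"
    printf '%s\n' "${image}-${branch}" "${image}-${sha}"
}
```

For example, `image_tags pytorch/example refs/heads/main abc123` prints `pytorch/example-main` and `pytorch/example-abc123`. Because the expansion strips everything up to the last `/`, a ref such as `refs/heads/release/2.4` yields the branch part `2.4`, not `release/2.4`.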

									
										107

.ci/docker/libtorch/Dockerfile
									
										Normal file
									
				@ -0,0 +1,107 @@

				ARG BASE_TARGET=base

				ARG GPU_IMAGE=ubuntu:20.04

				FROM ${GPU_IMAGE} as base

				ENV DEBIAN_FRONTEND=noninteractive

				RUN apt-get clean && apt-get update

				RUN apt-get install -y curl locales g++ git-all autoconf automake make cmake wget unzip sudo

				# Just add everything as a safe.directory for git since these will be used in multiple places with git

				RUN git config --global --add safe.directory '*'

				RUN locale-gen en_US.UTF-8

				ENV LC_ALL en_US.UTF-8

				ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				# Install openssl

				FROM base as openssl

				ADD ./common/install_openssl.sh install_openssl.sh

				RUN bash ./install_openssl.sh && rm install_openssl.sh

				# Install python

				FROM base as python

				ADD common/install_cpython.sh install_cpython.sh

				RUN apt-get update -y && \

				    apt-get install build-essential gdb lcov libbz2-dev libffi-dev \

				        libgdbm-dev liblzma-dev libncurses5-dev libreadline6-dev \

				        libsqlite3-dev libssl-dev lzma lzma-dev tk-dev uuid-dev zlib1g-dev -y && \

				    bash ./install_cpython.sh && \

				    rm install_cpython.sh && \

				    apt-get clean

				FROM base as conda

				ADD ./common/install_conda_docker.sh install_conda.sh

				RUN bash ./install_conda.sh && rm install_conda.sh

				FROM base as cpu

				# Install Anaconda

				COPY --from=conda /opt/conda /opt/conda

				# Install python

				COPY --from=python /opt/python    /opt/python

				COPY --from=python /opt/_internal /opt/_internal

				ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH

				# Install MKL

				ADD ./common/install_mkl.sh install_mkl.sh

				RUN bash ./install_mkl.sh && rm install_mkl.sh

				FROM cpu as cuda

				ADD ./common/install_cuda.sh install_cuda.sh

				ADD ./common/install_magma.sh install_magma.sh

				ENV CUDA_HOME /usr/local/cuda

				FROM cuda as cuda11.8

				RUN bash ./install_cuda.sh 11.8

				RUN bash ./install_magma.sh 11.8

				RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda

				FROM cuda as cuda12.1

				RUN bash ./install_cuda.sh 12.1

				RUN bash ./install_magma.sh 12.1

				RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda

				FROM cuda as cuda12.4

				RUN bash ./install_cuda.sh 12.4

				RUN bash ./install_magma.sh 12.4

				RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda

				FROM cpu as rocm

				ARG PYTORCH_ROCM_ARCH

				ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				ENV MKLROOT /opt/intel

				# Add the ROCM_PATH env var so that LoadHip.cmake (even with its logic updated for ROCm 6.0)

				# can still find HIP on ROCm 5.7. Not needed for ROCm 6.0 and above.

				# Remove this when ROCm 5.7 is no longer in the support matrix.

				ENV ROCM_PATH /opt/rocm

				# No need to install ROCm as base docker image should have full ROCm install

				#ADD ./common/install_rocm.sh install_rocm.sh

				ADD ./common/install_rocm_drm.sh install_rocm_drm.sh

				ADD ./common/install_rocm_magma.sh install_rocm_magma.sh

				# gfortran and python needed for building magma from source for ROCm

				RUN apt-get update -y && \

				    apt-get install gfortran -y && \

				    apt-get install python -y && \

				    apt-get clean

				RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				# Install AOTriton

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/aotriton_version.txt aotriton_version.txt

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				FROM ${BASE_TARGET} as final

				COPY --from=openssl            /opt/openssl           /opt/openssl

				# Install patchelf

				ADD ./common/install_patchelf.sh install_patchelf.sh

				RUN bash ./install_patchelf.sh && rm install_patchelf.sh

				# Install Anaconda

				COPY --from=conda /opt/conda /opt/conda

				# Install python

				COPY --from=python /opt/python    /opt/python

				COPY --from=python /opt/_internal /opt/_internal

				ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH

									
										93

.ci/docker/libtorch/build.sh
									
										Executable file
									
				@ -0,0 +1,93 @@

				#!/usr/bin/env bash

				# Script used only in CD pipeline

				set -eou pipefail

				image="$1"

				shift

				if [ -z "${image}" ]; then

				  echo "Usage: $0 IMAGE"

				  exit 1

				fi

				DOCKER_IMAGE="pytorch/${image}"

				TOPDIR=$(git rev-parse --show-toplevel)

				GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}

				GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

				WITH_PUSH=${WITH_PUSH:-}

				DOCKER=${DOCKER:-docker}

				case ${GPU_ARCH_TYPE} in

				    cpu)

				        BASE_TARGET=cpu

				        DOCKER_TAG=cpu

				        GPU_IMAGE=ubuntu:20.04

				        DOCKER_GPU_BUILD_ARG=""

				        ;;

				    cuda)

				        BASE_TARGET=cuda${GPU_ARCH_VERSION}

				        DOCKER_TAG=cuda${GPU_ARCH_VERSION}

				        GPU_IMAGE=ubuntu:20.04

				        DOCKER_GPU_BUILD_ARG=""

				        ;;

				    rocm)

				        BASE_TARGET=rocm

				        DOCKER_TAG=rocm${GPU_ARCH_VERSION}

				        GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"

				        ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"

				        if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then

				            ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))

				        else

				            echo "ERROR: rocm regex failed"

				            exit 1

				        fi

				        if [[ $ROCM_VERSION_INT -ge 60000 ]]; then

				            PYTORCH_ROCM_ARCH+=";gfx942"

				        fi

				        DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"

				        ;;

				    *)

				        echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"

				        exit 1

				        ;;

				esac

				(

				    set -x

				    DOCKER_BUILDKIT=1 ${DOCKER} build \

				         --target final \

				        ${DOCKER_GPU_BUILD_ARG} \

				        --build-arg "GPU_IMAGE=${GPU_IMAGE}" \

				        --build-arg "BASE_TARGET=${BASE_TARGET}" \

				        -t "${DOCKER_IMAGE}" \

				        $@ \

				        -f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \

				        "${TOPDIR}/.ci/docker/"

				)

				GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}

				GIT_BRANCH_NAME=${GITHUB_REF##*/}

				GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}

				DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}

				DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}

				if [[ "${WITH_PUSH}" == true ]]; then

				  (

				    set -x

				    ${DOCKER} push "${DOCKER_IMAGE}"

				    if [[ -n ${GITHUB_REF} ]]; then

				        ${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}

				        ${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}

				        ${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"

				        ${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"

				    fi

				  )

				fi
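The `rocm)` branch of this script turns a dotted ROCm version into one integer so that it can be compared with `-ge`. A minimal sketch of that conversion follows, assuming a hypothetical function name (`rocm_version_int`). Unlike the script, this sketch anchors the regex at the start of the string and forces base-10 arithmetic; both are additions made here, not part of the original.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert "MAJOR.MINOR[.PATCH]" into a single integer
# (MAJOR*10000 + MINOR*100 + PATCH) so versions can be compared numerically.
rocm_version_int() {
    local version="$1"
    local regex='^([0-9]+)\.([0-9]+)\.?([0-9]*)'
    if [[ $version =~ $regex ]]; then
        # 10# forces base 10 so a component like "08" is not read as octal.
        local major=$((10#${BASH_REMATCH[1]}))
        local minor=$((10#${BASH_REMATCH[2]}))
        # An absent patch component leaves BASH_REMATCH[3] empty; treat it as 0.
        local patch=$((10#${BASH_REMATCH[3]:-0}))
        echo $((major * 10000 + minor * 100 + patch))
    else
        echo "ERROR: cannot parse ROCm version: ${version}" >&2
        return 1
    fi
}
```

With this helper, the `gfx942` check could be written as `[[ $(rocm_version_int "$GPU_ARCH_VERSION") -ge 60000 ]]`, which is true for `6.0` and later and false for `5.7.1`.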

									
										2

.ci/docker/linter-cuda/Dockerfile
									
				@ -29,7 +29,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/re

				# Install cuda and cudnn

				ARG CUDA_VERSION

				-RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh

				+COPY ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh

				ENV DESIRED_CUDA ${CUDA_VERSION}

				ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

									
										202

.ci/docker/manywheel/Dockerfile
									
										Normal file
									
				@ -0,0 +1,202 @@

				# syntax = docker/dockerfile:experimental

				ARG ROCM_VERSION=3.7

				ARG BASE_CUDA_VERSION=11.8

				ARG GPU_IMAGE=centos:7

				FROM centos:7 as base

				ENV LC_ALL en_US.UTF-8

				ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				ARG DEVTOOLSET_VERSION=9

				# Note: This patch is required because CentOS has reached EOL;

				# otherwise any yum install step will fail

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel

				# Just add everything as a safe.directory for git since these will be used in multiple places with git

				RUN git config --global --add safe.directory '*'

				RUN yum install -y yum-utils centos-release-scl

				RUN yum-config-manager --enable rhel-server-rhscl-7-rpms

				# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms

				# the patch is required once again. Somehow this step adds mirror.centos.org back

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils

				ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH

				ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH

				RUN yum --enablerepo=extras install -y epel-release

				# cmake-3.18.4 from pip

				RUN yum install -y python3-pip && \

				    python3 -mpip install cmake==3.18.4 && \

				    ln -s /usr/local/bin/cmake /usr/bin/cmake

				RUN yum install -y autoconf aclocal automake make sudo

				FROM base as openssl

				# Install openssl (this must precede `build python` step)

				# (In order to have a proper SSL module, Python is compiled

				# against a recent openssl [see env vars above], which is linked

				# statically. We delete openssl afterwards.)

				ADD ./common/install_openssl.sh install_openssl.sh

				RUN bash ./install_openssl.sh && rm install_openssl.sh

				# EPEL for cmake

				FROM base as patchelf

				# Install patchelf

				ADD ./common/install_patchelf.sh install_patchelf.sh

				RUN bash ./install_patchelf.sh && rm install_patchelf.sh

				RUN cp $(which patchelf) /patchelf

				FROM patchelf as python

				# build python

				COPY manywheel/build_scripts /build_scripts

				ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh

				RUN bash build_scripts/build.sh && rm -r build_scripts

				FROM base as cuda

				ARG BASE_CUDA_VERSION=10.2

				# Install CUDA

				ADD ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh

				FROM base as intel

				# MKL

				ADD ./common/install_mkl.sh install_mkl.sh

				RUN bash ./install_mkl.sh && rm install_mkl.sh

				FROM base as magma

				ARG BASE_CUDA_VERSION=10.2

				# Install magma

				ADD ./common/install_magma.sh install_magma.sh

				RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh

				FROM base as jni

				# Install java jni header

				ADD ./common/install_jni.sh install_jni.sh

				ADD ./java/jni.h jni.h

				RUN bash ./install_jni.sh && rm install_jni.sh

				FROM base as libpng

				# Install libpng

				ADD ./common/install_libpng.sh install_libpng.sh

				RUN bash ./install_libpng.sh && rm install_libpng.sh

				FROM ${GPU_IMAGE} as common

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				ENV LC_ALL en_US.UTF-8

				ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				RUN yum install -y \

				        aclocal \

				        autoconf \

				        automake \

				        bison \

				        bzip2 \

				        curl \

				        diffutils \

				        file \

				        git \

				        make \

				        patch \

				        perl \

				        unzip \

				        util-linux \

				        wget \

				        which \

				        xz \

				        yasm

				RUN yum install -y \

				    https://repo.ius.io/ius-release-el7.rpm \

				    https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm

				RUN yum swap -y git git236-core

				# git236+ would refuse to run git commands in repos owned by other users

				# Which causes version check to fail, as pytorch repo is bind-mounted into the image

				# Override this behaviour by treating every folder as safe

				# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327

				RUN git config --global --add safe.directory "*"

				ENV SSL_CERT_FILE=/opt/_internal/certs.pem

				# Install LLVM version

				COPY --from=openssl            /opt/openssl                          /opt/openssl

				COPY --from=python             /opt/python                           /opt/python

				COPY --from=python             /opt/_internal                        /opt/_internal

				COPY --from=python             /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel

				COPY --from=intel              /opt/intel                            /opt/intel

				COPY --from=patchelf           /usr/local/bin/patchelf               /usr/local/bin/patchelf

				COPY --from=jni                /usr/local/include/jni.h              /usr/local/include/jni.h

				COPY --from=libpng             /usr/local/bin/png*                   /usr/local/bin/

				COPY --from=libpng             /usr/local/bin/libpng*                /usr/local/bin/

				COPY --from=libpng             /usr/local/include/png*               /usr/local/include/

				COPY --from=libpng             /usr/local/include/libpng*            /usr/local/include/

				COPY --from=libpng             /usr/local/lib/libpng*                /usr/local/lib/

				COPY --from=libpng             /usr/local/lib/pkgconfig              /usr/local/lib/pkgconfig

				FROM common as cpu_final

				ARG BASE_CUDA_VERSION=10.1

				ARG DEVTOOLSET_VERSION=9

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum install -y yum-utils centos-release-scl

				RUN yum-config-manager --enable rhel-server-rhscl-7-rpms

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo

				RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo

				RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils

				ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH

				ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH

				# cmake is already installed inside the rocm base image, so remove if present

				RUN rpm -e cmake || true

				# cmake-3.18.4 from pip

				RUN yum install -y python3-pip && \

				    python3 -mpip install cmake==3.18.4 && \

				    ln -s /usr/local/bin/cmake /usr/bin/cmake

				# ninja

				RUN yum install -y ninja-build

				FROM cpu_final as cuda_final

				RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}

				COPY --from=cuda     /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}

				COPY --from=magma    /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}

				RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda

				ENV PATH=/usr/local/cuda/bin:$PATH

				FROM cpu_final as rocm_final

				ARG ROCM_VERSION=3.7

				ARG PYTORCH_ROCM_ARCH

				ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				# Add the ROCM_PATH env var so that LoadHip.cmake (even with its logic updated for ROCm 6.0)

				# can still find HIP on ROCm 5.7. Not needed for ROCm 6.0 and above.

				# Remove this when ROCm 5.7 is no longer in the support matrix.

				ENV ROCM_PATH /opt/rocm

				ENV MKLROOT /opt/intel

				# No need to install ROCm as base docker image should have full ROCm install

				#ADD ./common/install_rocm.sh install_rocm.sh

				#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh

				ADD ./common/install_rocm_drm.sh install_rocm_drm.sh

				RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

				# cmake3 is needed for the MIOpen build

				RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3

				ADD ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

				# Install AOTriton

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/aotriton_version.txt aotriton_version.txt

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
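The three `sed` lines repeated throughout this Dockerfile repoint yum at `vault.centos.org` after the CentOS EOL. A minimal sketch of what they do to a repo file, assuming a hypothetical helper name (`patch_centos_repo`) that filters stdin to stdout rather than editing `/etc/yum.repos.d/*.repo` in place:

```shell
#!/usr/bin/env bash

# Hypothetical helper: apply the three repo rewrites to text on stdin.
patch_centos_repo() {
    # 1. Point any mirror.centos.org URL at the vault archive.
    # 2. Uncomment baseurl lines that ship commented out.
    # 3. Comment out mirrorlist lines so yum uses baseurl instead.
    sed \
        -e 's/mirror.centos.org/vault.centos.org/g' \
        -e 's/^#.*baseurl=http/baseurl=http/g' \
        -e 's/^mirrorlist=http/#mirrorlist=http/g'
}
```

Feeding it `#baseurl=http://mirror.centos.org/centos/7/os/x86_64/` yields `baseurl=http://vault.centos.org/centos/7/os/x86_64/`, while a `mirrorlist=http://...` line comes out commented. The expressions are copied from the Dockerfile unchanged, so the dots remain unescaped regex wildcards.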

153

.ci/docker/manywheel/Dockerfile_2014 Normal file

 @ -0,0 +1,153 @@
 # syntax = docker/dockerfile:experimental
 ARG ROCM_VERSION=3.7
 ARG BASE_CUDA_VERSION=10.2
 ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
 FROM quay.io/pypa/manylinux2014_x86_64 as base
 ENV LC_ALL en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US.UTF-8
 RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
 RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
 RUN yum install -y yum-utils centos-release-scl sudo
 RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
 RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
 ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
 # cmake
 RUN yum install -y cmake3 && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 # Remove unnecessary Python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 FROM base as cuda
 ARG BASE_CUDA_VERSION=10.2
 # Install CUDA
 ADD ./common/install_cuda.sh install_cuda.sh
 RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
 FROM base as intel
 # MKL
 ADD ./common/install_mkl.sh install_mkl.sh
 RUN bash ./install_mkl.sh && rm install_mkl.sh
 FROM base as magma
 ARG BASE_CUDA_VERSION=10.2
 # Install magma
 ADD ./common/install_magma.sh install_magma.sh
 RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
 FROM base as jni
 # Install java jni header
 ADD ./common/install_jni.sh install_jni.sh
 ADD ./java/jni.h jni.h
 RUN bash ./install_jni.sh && rm install_jni.sh
 FROM base as libpng
 # Install libpng
 ADD ./common/install_libpng.sh install_libpng.sh
 RUN bash ./install_libpng.sh && rm install_libpng.sh
 FROM ${GPU_IMAGE} as common
 RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
 ENV LC_ALL en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US.UTF-8
 RUN yum install -y \
         aclocal \
         autoconf \
         automake \
         bison \
         bzip2 \
         curl \
         diffutils \
         file \
         git \
         make \
         patch \
         perl \
         unzip \
         util-linux \
         wget \
         which \
         xz \
         yasm
 RUN yum install -y \
     https://repo.ius.io/ius-release-el7.rpm \
     https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
 RUN yum swap -y git git236-core
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 # Install LLVM version
 COPY --from=openssl            /opt/openssl                          /opt/openssl
 COPY --from=base               /opt/python                           /opt/python
 COPY --from=base               /opt/_internal                        /opt/_internal
 COPY --from=base               /usr/local/bin/auditwheel             /usr/local/bin/auditwheel
 COPY --from=intel              /opt/intel                            /opt/intel
 COPY --from=base               /usr/local/bin/patchelf               /usr/local/bin/patchelf
 COPY --from=libpng             /usr/local/bin/png*                   /usr/local/bin/
 COPY --from=libpng             /usr/local/bin/libpng*                /usr/local/bin/
 COPY --from=libpng             /usr/local/include/png*               /usr/local/include/
 COPY --from=libpng             /usr/local/include/libpng*            /usr/local/include/
 COPY --from=libpng             /usr/local/lib/libpng*                /usr/local/lib/
 COPY --from=libpng             /usr/local/lib/pkgconfig              /usr/local/lib/pkgconfig
 COPY --from=jni                /usr/local/include/jni.h              /usr/local/include/jni.h
 FROM common as cpu_final
 ARG BASE_CUDA_VERSION=10.2
 RUN yum install -y yum-utils centos-release-scl
 RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
 RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
 ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
 # cmake
 RUN yum install -y cmake3 && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 # ninja
 RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
 RUN yum install -y ninja-build
 FROM cpu_final as cuda_final
 RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=cuda     /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=magma    /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 FROM common as rocm_final
 ARG ROCM_VERSION=3.7
 # Install ROCm
 ADD ./common/install_rocm.sh install_rocm.sh
 RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
 # cmake is already installed inside the rocm base image, but both 2 and 3 exist
 # cmake3 is needed for the later MIOpen custom build, so that step is last.
 RUN yum install -y cmake3 && \
     rm -f /usr/bin/cmake && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 ADD ./common/install_miopen.sh install_miopen.sh
 RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

153

.ci/docker/manywheel/Dockerfile_2_28 Normal file

 @ -0,0 +1,153 @@
 # syntax = docker/dockerfile:experimental
 ARG ROCM_VERSION=3.7
 ARG BASE_CUDA_VERSION=11.8
 ARG GPU_IMAGE=amd64/almalinux:8
 FROM quay.io/pypa/manylinux_2_28_x86_64 as base
 ENV LC_ALL en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US.UTF-8
 ARG DEVTOOLSET_VERSION=11
 RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
 ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
 # cmake-3.18.4 from pip
 RUN yum install -y python3-pip && \
     python3 -mpip install cmake==3.18.4 && \
     ln -s /usr/local/bin/cmake /usr/bin/cmake3
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 # remove unnecessary python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 FROM base as cuda
 ARG BASE_CUDA_VERSION=11.8
 # Install CUDA
 ADD ./common/install_cuda.sh install_cuda.sh
 RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
 FROM base as intel
 # MKL
 ADD ./common/install_mkl.sh install_mkl.sh
 RUN bash ./install_mkl.sh && rm install_mkl.sh
 FROM base as magma
 ARG BASE_CUDA_VERSION=10.2
 # Install magma
 ADD ./common/install_magma.sh install_magma.sh
 RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
 FROM base as jni
 # Install java jni header
 ADD ./common/install_jni.sh install_jni.sh
 ADD ./java/jni.h jni.h
 RUN bash ./install_jni.sh && rm install_jni.sh
 FROM base as libpng
 # Install libpng
 ADD ./common/install_libpng.sh install_libpng.sh
 RUN bash ./install_libpng.sh && rm install_libpng.sh
 FROM ${GPU_IMAGE} as common
 ARG DEVTOOLSET_VERSION=11
 ENV LC_ALL=en_US.UTF-8
 ENV LANG=en_US.UTF-8
 ENV LANGUAGE=en_US.UTF-8
 RUN yum -y install epel-release
 RUN yum -y update
 RUN yum install -y \
         autoconf \
         automake \
         bison \
         bzip2 \
         curl \
         diffutils \
         file \
         git \
         make \
         patch \
         perl \
         unzip \
         util-linux \
         wget \
         which \
         xz \
         gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
         glibc-langpack-en
 RUN yum install -y \
     https://repo.ius.io/ius-release-el7.rpm \
     https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
 RUN yum swap -y git git236-core
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 # Install LLVM version
 COPY --from=openssl            /opt/openssl                          /opt/openssl
 COPY --from=base               /opt/python                           /opt/python
 COPY --from=base               /opt/_internal                        /opt/_internal
 COPY --from=base               /usr/local/bin/auditwheel             /usr/local/bin/auditwheel
 COPY --from=intel              /opt/intel                            /opt/intel
 COPY --from=base               /usr/local/bin/patchelf               /usr/local/bin/patchelf
 COPY --from=libpng             /usr/local/bin/png*                   /usr/local/bin/
 COPY --from=libpng             /usr/local/bin/libpng*                /usr/local/bin/
 COPY --from=libpng             /usr/local/include/png*               /usr/local/include/
 COPY --from=libpng             /usr/local/include/libpng*            /usr/local/include/
 COPY --from=libpng             /usr/local/lib/libpng*                /usr/local/lib/
 COPY --from=libpng             /usr/local/lib/pkgconfig              /usr/local/lib/pkgconfig
 COPY --from=jni                /usr/local/include/jni.h              /usr/local/include/jni.h
 FROM common as cpu_final
 ARG BASE_CUDA_VERSION=11.8
 ARG DEVTOOLSET_VERSION=11
 # Ensure the expected devtoolset is used
 ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
 # cmake-3.18.4 from pip
 RUN yum install -y python3-pip && \
     python3 -mpip install cmake==3.18.4 && \
     ln -s /usr/local/bin/cmake /usr/bin/cmake3
 FROM cpu_final as cuda_final
 RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=cuda     /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=magma    /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 FROM common as rocm_final
 ARG ROCM_VERSION=3.7
 # Install ROCm
 ADD ./common/install_rocm.sh install_rocm.sh
 RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
 # cmake is already installed inside the rocm base image, but both 2 and 3 exist
 # cmake3 is needed for the later MIOpen custom build, so that step is last.
 RUN yum install -y cmake3 && \
     rm -f /usr/bin/cmake && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 ADD ./common/install_miopen.sh install_miopen.sh
 RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
 FROM cpu_final as xpu_final
 # cmake-3.28.4 from pip
 RUN python3 -m pip install --upgrade pip && \
     python3 -mpip install cmake==3.28.4
 ADD ./common/install_xpu.sh install_xpu.sh
 RUN bash ./install_xpu.sh && rm install_xpu.sh
 RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

57

.ci/docker/manywheel/Dockerfile_2_28_aarch64 Normal file

 @ -0,0 +1,57 @@
 FROM quay.io/pypa/manylinux_2_28_aarch64 as base
 # Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
 ARG GCCTOOLSET_VERSION=11
 # Language variables
 ENV LC_ALL=en_US.UTF-8
 ENV LANG=en_US.UTF-8
 ENV LANGUAGE=en_US.UTF-8
 # Install needed OS packages. This is to support all
 # the binary builds (torch, vision, audio, text, data)
 RUN yum -y install epel-release
 RUN yum -y update
 RUN yum install -y \
   autoconf \
   automake \
   bison \
   bzip2 \
   curl \
   diffutils \
   file \
   git \
   less \
   libffi-devel \
   libgomp \
   make \
   openssl-devel \
   patch \
   perl \
   unzip \
   util-linux \
   wget \
   which \
   xz \
   yasm \
   zstd \
   sudo \
   gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
 # Ensure the expected devtoolset is used
 ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 FROM base as final
 # remove unnecessary python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6

94

.ci/docker/manywheel/Dockerfile_aarch64 Normal file

 @ -0,0 +1,94 @@
 FROM quay.io/pypa/manylinux2014_aarch64 as base
 # Graviton needs GCC 10 for the build
 ARG DEVTOOLSET_VERSION=10
 # Language variables
 ENV LC_ALL=en_US.UTF-8
 ENV LANG=en_US.UTF-8
 ENV LANGUAGE=en_US.UTF-8
 # Install needed OS packages. This is to support all
 # the binary builds (torch, vision, audio, text, data)
 RUN yum -y install epel-release
 RUN yum -y update
 RUN yum install -y \
   autoconf \
   automake \
   bison \
   bzip2 \
   curl \
   diffutils \
   file \
   git \
   make \
   patch \
   perl \
   unzip \
   util-linux \
   wget \
   which \
   xz \
   yasm \
   less \
   zstd \
   libgomp \
   sudo \
   devtoolset-${DEVTOOLSET_VERSION}-gcc \
   devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
   devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
   devtoolset-${DEVTOOLSET_VERSION}-binutils
 # Ensure the expected devtoolset is used
 ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 ###############################################################################
 # libgfortran.a hack
 #
 # libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
 # This causes __stack_chk_guard@@GLIBC_2.17 errors during the PyTorch build. To solve this, get
 # Ubuntu's libgfortran.a, which is compiled with -fPIC.
 # NOTE: Need a better way to get this library, as Ubuntu's package can be removed or changed by the vendor
 ###############################################################################
 RUN cd ~/ \
   && curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
   && ar x ~/libgfortran-10-dev.deb \
   && tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
   && cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
 # install cmake
 RUN yum install -y cmake3 && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 FROM base as openblas
 # Install openblas
 ADD ./common/install_openblas.sh install_openblas.sh
 RUN bash ./install_openblas.sh && rm install_openblas.sh
 FROM openssl as final
 # remove unnecessary python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 COPY --from=openblas     /opt/OpenBLAS/  /opt/OpenBLAS/
 ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

91

.ci/docker/manywheel/Dockerfile_cuda_aarch64 Normal file

 @ -0,0 +1,91 @@
 FROM quay.io/pypa/manylinux_2_28_aarch64 as base
 # Cuda ARM build needs gcc 11
 ARG DEVTOOLSET_VERSION=11
 # Language variables
 ENV LC_ALL=en_US.UTF-8
 ENV LANG=en_US.UTF-8
 ENV LANGUAGE=en_US.UTF-8
 # Install needed OS packages. This is to support all
 # the binary builds (torch, vision, audio, text, data)
 RUN yum -y install epel-release
 RUN yum -y update
 RUN yum install -y \
   autoconf \
   automake \
   bison \
   bzip2 \
   curl \
   diffutils \
   file \
   git \
   make \
   patch \
   perl \
   unzip \
   util-linux \
   wget \
   which \
   xz \
   yasm \
   less \
   zstd \
   libgomp \
   sudo \
   gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
 # Ensure the expected devtoolset is used
 ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 FROM openssl as final
 # remove unnecessary python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 FROM base as cuda
 ARG BASE_CUDA_VERSION
 # Install CUDA
 ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
 RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
 FROM base as magma
 ARG BASE_CUDA_VERSION
 # Install magma
 ADD ./common/install_magma.sh install_magma.sh
 RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
 FROM base as openblas
 # Install openblas
 ADD ./common/install_openblas.sh install_openblas.sh
 RUN bash ./install_openblas.sh && rm install_openblas.sh
 FROM final as cuda_final
 ARG BASE_CUDA_VERSION
 RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=cuda     /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=magma    /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=openblas     /opt/OpenBLAS/  /opt/OpenBLAS/
 RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
 ENV PATH=/usr/local/cuda/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

71

.ci/docker/manywheel/Dockerfile_cxx11-abi Normal file

 @ -0,0 +1,71 @@
 FROM centos:8 as base
 ENV LC_ALL=en_US.UTF-8
 ENV LANG=en_US.UTF-8
 ENV LANGUAGE=en_US.UTF-8
 ENV PATH=/opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 # change to a valid repo
 RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
 # enable to install ninja-build
 RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
 RUN yum -y update
 RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
 RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
 FROM base as openssl
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 # Install python
 FROM base as python
 RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
 ADD common/install_cpython.sh install_cpython.sh
 RUN bash ./install_cpython.sh && rm install_cpython.sh
 FROM base as conda
 ADD ./common/install_conda_docker.sh install_conda.sh
 RUN bash ./install_conda.sh && rm install_conda.sh
 RUN /opt/conda/bin/conda install -y cmake
 FROM base as intel
 # Install MKL
 COPY --from=python             /opt/python                           /opt/python
 COPY --from=python             /opt/_internal                        /opt/_internal
 COPY --from=conda              /opt/conda                            /opt/conda
 ENV PATH=/opt/conda/bin:$PATH
 ADD ./common/install_mkl.sh install_mkl.sh
 RUN bash ./install_mkl.sh && rm install_mkl.sh
 FROM base as patchelf
 ADD ./common/install_patchelf.sh install_patchelf.sh
 RUN bash ./install_patchelf.sh && rm install_patchelf.sh
 RUN cp $(which patchelf) /patchelf
 FROM base as jni
 ADD ./common/install_jni.sh install_jni.sh
 ADD ./java/jni.h jni.h
 RUN bash ./install_jni.sh && rm install_jni.sh
 FROM base as libpng
 ADD ./common/install_libpng.sh install_libpng.sh
 RUN bash ./install_libpng.sh && rm install_libpng.sh
 FROM base as final
 COPY --from=openssl            /opt/openssl                          /opt/openssl
 COPY --from=python             /opt/python                           /opt/python
 COPY --from=python             /opt/_internal                        /opt/_internal
 COPY --from=intel              /opt/intel                            /opt/intel
 COPY --from=conda              /opt/conda                            /opt/conda
 COPY --from=patchelf           /usr/local/bin/patchelf               /usr/local/bin/patchelf
 COPY --from=jni                /usr/local/include/jni.h              /usr/local/include/jni.h
 COPY --from=libpng             /usr/local/bin/png*                   /usr/local/bin/
 COPY --from=libpng             /usr/local/bin/libpng*                /usr/local/bin/
 COPY --from=libpng             /usr/local/include/png*               /usr/local/include/
 COPY --from=libpng             /usr/local/include/libpng*            /usr/local/include/
 COPY --from=libpng             /usr/local/lib/libpng*                /usr/local/lib/
 COPY --from=libpng             /usr/local/lib/pkgconfig              /usr/local/lib/pkgconfig
 RUN yum install -y ninja-build

73

.ci/docker/manywheel/Dockerfile_s390x Normal file

 @ -0,0 +1,73 @@
 FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
 # Language variables
 ENV LC_ALL=C.UTF-8
 ENV LANG=C.UTF-8
 ENV LANGUAGE=C.UTF-8
 # Install needed OS packages. This is to support all
 # the binary builds (torch, vision, audio, text, data)
 RUN apt update ; apt upgrade -y
 RUN apt install -y \
   build-essential \
   autoconf \
   automake \
   bzip2 \
   curl \
   diffutils \
   file \
   git \
   make \
   patch \
   perl \
   unzip \
   util-linux \
   wget \
   which \
   xz-utils \
   less \
   zstd \
   cmake \
   python3 \
   python3-dev \
   python3-setuptools \
   python3-yaml \
   python3-typing-extensions \
   libblas-dev \
   libopenblas-dev \
   liblapack-dev \
   libatlas-base-dev
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 # EPEL for cmake
 FROM base as patchelf
 # Install patchelf
 ADD ./common/install_patchelf.sh install_patchelf.sh
 RUN bash ./install_patchelf.sh && rm install_patchelf.sh
 RUN cp $(which patchelf) /patchelf
 FROM patchelf as python
 # build python
 COPY manywheel/build_scripts /build_scripts
 ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
 RUN bash build_scripts/build.sh && rm -r build_scripts
 FROM openssl as final
 COPY --from=python             /opt/python                           /opt/python
 COPY --from=python             /opt/_internal                        /opt/_internal
 COPY --from=python             /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
 COPY --from=patchelf           /usr/local/bin/patchelf               /usr/local/bin/patchelf
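
The build script that follows selects one of the Dockerfiles above by appending a suffix to the name `Dockerfile`. A minimal sketch of that resolution, assuming a hypothetical helper named `resolve_dockerfile` (it is not part of the repository; the script does this inline with the `MANY_LINUX_VERSION` and `DOCKERFILE_SUFFIX` variables):

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only: mirrors how build.sh derives
# the Dockerfile name from MANY_LINUX_VERSION and DOCKERFILE_SUFFIX.
resolve_dockerfile() {
    local many_linux_version="${1:-}"
    local dockerfile_suffix="${2:-}"
    # An explicit suffix wins; otherwise fall back to _<MANY_LINUX_VERSION>.
    if [[ -n ${many_linux_version} && -z ${dockerfile_suffix} ]]; then
        dockerfile_suffix="_${many_linux_version}"
    fi
    echo "Dockerfile${dockerfile_suffix}"
}

resolve_dockerfile "2_28" ""                  # Dockerfile_2_28
resolve_dockerfile "aarch64" "_cuda_aarch64"  # Dockerfile_cuda_aarch64
resolve_dockerfile "" ""                      # Dockerfile
```

Under this reading, the `cuda-aarch64` case reaches `Dockerfile_cuda_aarch64` only because it sets the suffix explicitly; its `MANY_LINUX_VERSION` alone would have pointed at `Dockerfile_aarch64`.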

									
154

.ci/docker/manywheel/build.sh Executable file

				@ -0,0 +1,154 @@

				#!/usr/bin/env bash

				# Script used only in CD pipeline

				set -euo pipefail

				TOPDIR=$(git rev-parse --show-toplevel)

				image="${1:-}"

				if [ -z "${image}" ]; then

				  echo "Usage: $0 IMAGE"

				  exit 1

				fi

				shift

				DOCKER_IMAGE="pytorch/${image}"

				DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"

				GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}

				GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

				MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}

				DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}

				WITH_PUSH=${WITH_PUSH:-}

				case ${GPU_ARCH_TYPE} in

				    cpu)

				        TARGET=cpu_final

				        DOCKER_TAG=cpu

				        GPU_IMAGE=centos:7

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"

				        ;;

				    cpu-manylinux_2_28)

				        TARGET=cpu_final

				        DOCKER_TAG=cpu

				        GPU_IMAGE=amd64/almalinux:8

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"

				        MANY_LINUX_VERSION="2_28"

				        ;;

				    cpu-aarch64)

				        TARGET=final

				        DOCKER_TAG=cpu-aarch64

				        GPU_IMAGE=arm64v8/centos:7

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"

				        MANY_LINUX_VERSION="aarch64"

				        ;;

				    cpu-aarch64-2_28)

				        TARGET=final

				        DOCKER_TAG=cpu-aarch64

				        GPU_IMAGE=arm64v8/almalinux:8

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"

				        MANY_LINUX_VERSION="2_28_aarch64"

				        ;;

				    cpu-cxx11-abi)

				        TARGET=final

				        DOCKER_TAG=cpu-cxx11-abi

				        GPU_IMAGE=""

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"

				        MANY_LINUX_VERSION="cxx11-abi"

				        ;;

				    cpu-s390x)

				        TARGET=final

				        DOCKER_TAG=cpu-s390x

				        GPU_IMAGE=redhat/ubi9

				        DOCKER_GPU_BUILD_ARG=""

				        MANY_LINUX_VERSION="s390x"

				        ;;

				    cuda)

				        TARGET=cuda_final

				        DOCKER_TAG=cuda${GPU_ARCH_VERSION}

				        # Keep this up to date with the minimum version of CUDA we currently support

				        GPU_IMAGE=centos:7

				        DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"

				        ;;

				    cuda-manylinux_2_28)

				        TARGET=cuda_final

				        DOCKER_TAG=cuda${GPU_ARCH_VERSION}

				        GPU_IMAGE=amd64/almalinux:8

				        DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"

				        MANY_LINUX_VERSION="2_28"

				        ;;

				    cuda-aarch64)

				        TARGET=cuda_final

				        DOCKER_TAG=cuda${GPU_ARCH_VERSION}

				        GPU_IMAGE=arm64v8/centos:7

				        DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"

				        MANY_LINUX_VERSION="aarch64"

				        DOCKERFILE_SUFFIX="_cuda_aarch64"

				        ;;

				    rocm)

				        TARGET=rocm_final

				        DOCKER_TAG=rocm${GPU_ARCH_VERSION}

				        GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"

				        ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"

				        if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then

				            ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))

				        else

				            echo "ERROR: rocm regex failed"

				            exit 1

				        fi

				        if [[ $ROCM_VERSION_INT -ge 60000 ]]; then

				            PYTORCH_ROCM_ARCH+=";gfx942"

				        fi

				        DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"

				        ;;

				    xpu)

				        TARGET=xpu_final

				        DOCKER_TAG=xpu

				        GPU_IMAGE=amd64/almalinux:8

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"

				        MANY_LINUX_VERSION="2_28"

				        ;;

				    *)

				        echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"

				        exit 1

				        ;;

				esac

				IMAGES=''

				if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then

				    DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}

				fi

				(

				    set -x

				    DOCKER_BUILDKIT=1 docker build \

				        ${DOCKER_GPU_BUILD_ARG} \

				        --build-arg "GPU_IMAGE=${GPU_IMAGE}" \

				        --target "${TARGET}" \

				        -t "${DOCKER_IMAGE}" \

				        "$@" \

				        -f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \

				        "${TOPDIR}/.ci/docker/"

				)

				GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}

				GIT_BRANCH_NAME=${GITHUB_REF##*/}

				GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}

				DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}

				DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}

				if [[ "${WITH_PUSH}" == true ]]; then

				    (

				        set -x

				        docker push "${DOCKER_IMAGE}"

				        if [[ -n ${GITHUB_REF} ]]; then

				            docker tag "${DOCKER_IMAGE}" "${DOCKER_IMAGE_BRANCH_TAG}"

				            docker tag "${DOCKER_IMAGE}" "${DOCKER_IMAGE_SHA_TAG}"

				            docker push "${DOCKER_IMAGE_BRANCH_TAG}"

				            docker push "${DOCKER_IMAGE_SHA_TAG}"

				        fi

				    )

				fi
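
The `rocm` branch above folds a dotted ROCm version into a single integer so that it can be compared with `-ge`. A minimal sketch of that arithmetic, wrapped in a hypothetical function `rocm_version_int` (the function name is an assumption for illustration; the regex and formula are copied from the script):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the version arithmetic used in the rocm case.
rocm_version_int() {
    local version="$1"
    local regex='([0-9]+)\.([0-9]+)[\.]?([0-9]*)'
    if [[ ${version} =~ ${regex} ]]; then
        # major*10000 + minor*100 + patch (patch defaults to 0 when absent)
        echo $(( ${BASH_REMATCH[1]} * 10000 + ${BASH_REMATCH[2]} * 100 + ${BASH_REMATCH[3]:-0} ))
    else
        echo "ERROR: rocm regex failed" >&2
        return 1
    fi
}

rocm_version_int 5.7.1   # 50701
rocm_version_int 6.0     # 60000, which meets the -ge 60000 check for gfx942
```

One caveat of this approach: bash arithmetic treats a component with a leading zero as octal, so a component such as `08` would raise an error rather than evaluate to 8.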

									
131

.ci/docker/manywheel/build_scripts/build.sh Normal file

				@ -0,0 +1,131 @@

				#!/bin/bash

				# Top-level build script called from Dockerfile

				# Script used only in CD pipeline

				# Stop at any error, show all commands

				set -ex

				# openssl version to build, with expected sha256 hash of .tar.gz

				# archive

				OPENSSL_ROOT=openssl-1.1.1l

				OPENSSL_HASH=0b7a3e5e59c34827fe0c3a74b7ec8baef302b98fa80088d7f9153aa16fa76bd1

				DEVTOOLS_HASH=a8ebeb4bed624700f727179e6ef771dafe47651131a00a78b342251415646acc

				PATCHELF_HASH=d9afdff4baeacfbc64861454f368b7f2c15c44d245293f7587bbf726bfe722fb

				CURL_ROOT=curl-7.73.0

				CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131

				AUTOCONF_ROOT=autoconf-2.69

				AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969

				# Get build utilities

				MY_DIR=$(dirname "${BASH_SOURCE[0]}")

				source $MY_DIR/build_utils.sh

				if [ "$(uname -m)" != "s390x" ] ; then

				    # Dependencies for compiling Python that we want to remove from

				    # the final image after compiling Python

				    PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"

				    # Libraries that are allowed as part of the manylinux1 profile

				    MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel  mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"

				    # Development tools and libraries

				    yum -y install bzip2 make git patch unzip bison yasm diffutils \

				        automake which file cmake28 \

				        kernel-devel-`uname -r` \

				        ${PYTHON_COMPILE_DEPS}

				else

				    # Dependencies for compiling Python that we want to remove from

				    # the final image after compiling Python

				    PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"

				    # Libraries that are allowed as part of the manylinux1 profile

				    MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"

				    # Development tools and libraries

				    apt install -y bzip2 make git patch unzip diffutils \

				        automake which file cmake \

				        linux-headers-virtual \

				        ${PYTHON_COMPILE_DEPS}

				fi

				# Install newest autoconf

				build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH

				autoconf --version

				# Compile the latest Python releases.

				# (In order to have a proper SSL module, Python is compiled

				# against a recent openssl [see env vars above], which is linked

				# statically. We delete openssl afterwards.)

				build_openssl $OPENSSL_ROOT $OPENSSL_HASH

				/build_scripts/install_cpython.sh

				PY39_BIN=/opt/python/cp39-cp39/bin

				# Our openssl doesn't know how to find the system CA trust store

				#   (https://github.com/pypa/manylinux/issues/53)

				# And it's not clear how up-to-date that is anyway

				# So let's just use the same one pip and everyone uses

				$PY39_BIN/pip install certifi

				ln -s $($PY39_BIN/python -c 'import certifi; print(certifi.where())') \

				      /opt/_internal/certs.pem

				# If you modify this line you also have to modify the versions in the

				# Dockerfiles:

				export SSL_CERT_FILE=/opt/_internal/certs.pem

				# Install newest curl

				build_curl $CURL_ROOT $CURL_HASH

				rm -rf /usr/local/include/curl /usr/local/lib/libcurl* /usr/local/lib/pkgconfig/libcurl.pc

				hash -r

				curl --version

				curl-config --features

				# Install patchelf (latest with unreleased bug fixes)

				curl -sLOk https://nixos.org/releases/patchelf/patchelf-0.10/patchelf-0.10.tar.gz

				# check_sha256sum patchelf-0.9njs2.tar.gz $PATCHELF_HASH

				tar -xzf patchelf-0.10.tar.gz

				(cd patchelf-0.10 && ./configure && make && make install)

				rm -rf patchelf-0.10.tar.gz patchelf-0.10

				# Install latest pypi release of auditwheel

				$PY39_BIN/pip install auditwheel

				ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel

				# Clean up development headers and other unnecessary stuff for

				# final image

				if [ "$(uname -m)" != "s390x" ] ; then

				    yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \

				        avahi freetype bitstream-vera-fonts \

				        ${PYTHON_COMPILE_DEPS} > /dev/null 2>&1 || true

				    yum -y install ${MANYLINUX1_DEPS}

				    yum -y clean all > /dev/null 2>&1

				    yum list installed

				else

				    apt purge -y ${PYTHON_COMPILE_DEPS} > /dev/null 2>&1 || true

				fi

				# we don't need libpython*.a, and they're many megabytes

				find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f

				# Strip what we can -- and ignore errors, because this just attempts to strip

				# *everything*, including non-ELF files:

				find /opt/_internal -type f -print0 \

				    | xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true

				# We do not need the Python test suites, or indeed the precompiled .pyc and

				# .pyo files. Partially cribbed from:

				#    https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile

				find /opt/_internal \

				     \( -type d -a -name test -o -name tests \) \

				  -o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \

				  -print0 | xargs -0 rm -f

				for PYTHON in /opt/python/*/bin/python; do

				    # Smoke test to make sure that our Pythons work, and do indeed detect as

				    # being manylinux compatible:

				    $PYTHON $MY_DIR/manylinux1-check.py

				    # Make sure that SSL cert checking works

				    $PYTHON $MY_DIR/ssl-check.py

				done

				# Fix libc headers to remain compatible with C99 compilers.

				find /usr/include/ -type f -exec sed -i 's/\bextern _*inline_*\b/extern __inline __attribute__ ((__gnu_inline__))/g' {} +

				# Now we can delete our built SSL

				rm -rf /usr/local/ssl
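
The hashes declared at the top of this script are checked by `check_sha256sum` from `build_utils.sh`. As a rough sketch of the same idea, the hypothetical function below (`verify_sha256` is an illustrative name, not a repository helper) feeds the expected line to `sha256sum -c` on standard input instead of writing a temporary `.sha256` file:

```shell
#!/usr/bin/env bash
# Hypothetical variant of check_sha256sum that avoids the temporary file.
verify_sha256() {
    local fname="$1"
    local expected="$2"
    # sha256sum -c expects "<hash>  <file>" lines; "-" reads them from stdin.
    echo "${expected}  ${fname}" | sha256sum -c - > /dev/null 2>&1
}

# Usage: hash a scratch file, then verify it against a right and a wrong hash.
tmpfile=$(mktemp)
echo "hello" > "${tmpfile}"
good_hash=$(sha256sum "${tmpfile}" | cut -d' ' -f1)
bad_hash=$(printf '%064d' 0)   # 64 zeros: well-formed but incorrect
verify_sha256 "${tmpfile}" "${good_hash}" && echo "hash ok"
verify_sha256 "${tmpfile}" "${bad_hash}" || echo "hash mismatch"
rm -f "${tmpfile}"
```

The function returns the exit status of `sha256sum -c`, so under `set -e` a mismatch would stop the build unless the caller handles it, which matches how the repository helper is used.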

									
										91

.ci/docker/manywheel/build_scripts/build_utils.sh
									
										Executable file
									
												
				@ -0,0 +1,91 @@

				#!/bin/bash

				# Helper utilities for build

				# Script used only in CD pipeline

				OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/

				CURL_DOWNLOAD_URL=https://curl.askapache.com/download

				AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

				function check_var {

				    if [ -z "$1" ]; then

				        echo "required variable not defined"

				        exit 1

				    fi

				}

				function do_openssl_build {

				    ./config no-ssl2 no-shared -fPIC --prefix=/usr/local/ssl > /dev/null

				    make > /dev/null

				    make install > /dev/null

				}

				function check_sha256sum {

				    local fname=$1

				    check_var ${fname}

				    local sha256=$2

				    check_var ${sha256}

				    echo "${sha256}  ${fname}" > ${fname}.sha256

				    sha256sum -c ${fname}.sha256

				    rm -f ${fname}.sha256

				}

				function build_openssl {

				    local openssl_fname=$1

				    check_var ${openssl_fname}

				    local openssl_sha256=$2

				    check_var ${openssl_sha256}

				    check_var ${OPENSSL_DOWNLOAD_URL}

				    curl -sLO ${OPENSSL_DOWNLOAD_URL}/${openssl_fname}.tar.gz

				    check_sha256sum ${openssl_fname}.tar.gz ${openssl_sha256}

				    tar -xzf ${openssl_fname}.tar.gz

				    (cd ${openssl_fname} && do_openssl_build)

				    rm -rf ${openssl_fname} ${openssl_fname}.tar.gz

				}

				function do_curl_build {

				    LIBS=-ldl ./configure --with-ssl --disable-shared > /dev/null

				    make > /dev/null

				    make install > /dev/null

				}

				function build_curl {

				    local curl_fname=$1

				    check_var ${curl_fname}

				    local curl_sha256=$2

				    check_var ${curl_sha256}

				    check_var ${CURL_DOWNLOAD_URL}

				    curl -sLO ${CURL_DOWNLOAD_URL}/${curl_fname}.tar.bz2

				    check_sha256sum ${curl_fname}.tar.bz2 ${curl_sha256}

				    tar -jxf ${curl_fname}.tar.bz2

				    (cd ${curl_fname} && do_curl_build)

				    rm -rf ${curl_fname} ${curl_fname}.tar.bz2

				}

				function do_standard_install {

				    ./configure > /dev/null

				    make > /dev/null

				    make install > /dev/null

				}

				function build_autoconf {

				    local autoconf_fname=$1

				    check_var ${autoconf_fname}

				    local autoconf_sha256=$2

				    check_var ${autoconf_sha256}

				    check_var ${AUTOCONF_DOWNLOAD_URL}

				    curl -sLO ${AUTOCONF_DOWNLOAD_URL}/${autoconf_fname}.tar.gz

				    check_sha256sum ${autoconf_fname}.tar.gz ${autoconf_sha256}

				    tar -zxf ${autoconf_fname}.tar.gz

				    (cd ${autoconf_fname} && do_standard_install)

				    rm -rf ${autoconf_fname} ${autoconf_fname}.tar.gz

				}
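
The three `build_*` helpers above share one download, verify, extract pattern, and `check_sha256sum` is the step that guards it. As a rough illustration only (it is not part of this change), the same check could be sketched in Python. `sha256_matches` is a hypothetical name, and the demo hashes a throwaway file rather than a real tarball.

```python
import hashlib
import os
import tempfile


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file at `path` hashes to `expected_hex`.

    Hypothetical helper for illustration; the build script itself
    relies on `sha256sum -c`.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large tarball is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


if __name__ == "__main__":
    # Demo on a temporary file instead of a downloaded archive.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    try:
        expected = hashlib.sha256(b"hello").hexdigest()
        print(sha256_matches(tmp.name, expected))   # matching digest
        print(sha256_matches(tmp.name, "0" * 64))   # deliberately wrong digest
    finally:
        os.remove(tmp.name)
```

Unlike the shell function, this sketch returns a boolean instead of exiting, so a caller would still need to abort the build on a mismatch.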

									
										60

.ci/docker/manywheel/build_scripts/manylinux1-check.py
									
										Normal file
									
												
				@ -0,0 +1,60 @@

				# Logic copied from PEP 513

				def is_manylinux1_compatible():

				    # Only Linux, and only x86-64 / i686 / s390x

				    from distutils.util import get_platform

				    if get_platform() not in ["linux-x86_64", "linux-i686", "linux-s390x"]:

				        return False

				    # Check for presence of _manylinux module

				    try:

				        import _manylinux

				        return bool(_manylinux.manylinux1_compatible)

				    except (ImportError, AttributeError):

				        # Fall through to heuristic check below

				        pass

				    # Check glibc version. CentOS 5 uses glibc 2.5.

				    return have_compatible_glibc(2, 5)

				def have_compatible_glibc(major, minimum_minor):

				    import ctypes

				    process_namespace = ctypes.CDLL(None)

				    try:

				        gnu_get_libc_version = process_namespace.gnu_get_libc_version

				    except AttributeError:

				        # Symbol doesn't exist -> therefore, we are not linked to

				        # glibc.

				        return False

				    # Call gnu_get_libc_version, which returns a string like "2.5".

				    gnu_get_libc_version.restype = ctypes.c_char_p

				    version_str = gnu_get_libc_version()

				    # py2 / py3 compatibility:

				    if not isinstance(version_str, str):

				        version_str = version_str.decode("ascii")

				    # Parse string and check against requested version.

				    version = [int(piece) for piece in version_str.split(".")]

				    assert len(version) == 2

				    if major != version[0]:

				        return False

				    if minimum_minor > version[1]:

				        return False

				    return True

				import sys

				if is_manylinux1_compatible():

				    print(f"{sys.executable} is manylinux1 compatible")

				    sys.exit(0)

				else:

				    print(f"{sys.executable} is NOT manylinux1 compatible")

				    sys.exit(1)
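
The comparison in `have_compatible_glibc` requires an exact major-version match and a minor version at or above the minimum. Below is a minimal sketch of just that comparison, assuming the version string has already been obtained. The helper name `glibc_is_compatible` is hypothetical, and the sketch does not call into libc.

```python
def glibc_is_compatible(version_str, major, minimum_minor):
    """Apply the major/minor rule to a version string such as "2.17".

    Hypothetical helper: the script above fetches the string through
    ctypes and uses an assert where this sketch raises ValueError.
    """
    pieces = [int(piece) for piece in version_str.split(".")]
    if len(pieces) != 2:
        raise ValueError(f"expected MAJOR.MINOR, got {version_str!r}")
    # The major version must match exactly; the minor is a lower bound.
    return pieces[0] == major and pieces[1] >= minimum_minor


print(glibc_is_compatible("2.17", 2, 5))  # True: same major, newer minor
print(glibc_is_compatible("2.4", 2, 5))   # False: minor is too old
print(glibc_is_compatible("3.0", 2, 5))   # False: major differs
```

The last call shows that a hypothetical glibc 3.x would be rejected, because the major version is compared for equality rather than as a lower bound.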

									
										35

.ci/docker/manywheel/build_scripts/ssl-check.py
									
										Normal file
									
												
				@ -0,0 +1,35 @@

				# cf. https://github.com/pypa/manylinux/issues/53

				GOOD_SSL = "https://google.com"

				BAD_SSL = "https://self-signed.badssl.com"

				import sys

				print("Testing SSL certificate checking for Python:", sys.version)

				if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):

				    print("This version never checks SSL certs; skipping tests")

				    sys.exit(0)

				if sys.version_info[0] >= 3:

				    from urllib.request import urlopen

				    EXC = OSError

				else:

				    from urllib import urlopen

				    EXC = IOError

				print(f"Connecting to {GOOD_SSL} should work")

				urlopen(GOOD_SSL)

				print("...it did, yay.")

				print(f"Connecting to {BAD_SSL} should fail")

				try:

				    urlopen(BAD_SSL)

				    # If we get here then we failed:

				    print("...it DIDN'T!!!!!11!!1one!")

				    sys.exit(1)

				except EXC:

				    print("...it did, yay.")
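
A note on the version guard in `ssl-check.py`: as written, every version below `(2, 7)` is also below `(3, 4)`, so the two-part condition behaves like the single test `< (3, 4)`. The sketch below checks this on a few sample tuples. `skips_ssl_tests` is a hypothetical wrapper around the condition, not a function in the script.

```python
def skips_ssl_tests(version):
    """The guard as written in ssl-check.py, applied to a (major, minor) tuple."""
    return version < (2, 7) or version < (3, 4)


for version in [(2, 6), (2, 7), (3, 3), (3, 4), (3, 9)]:
    # The disjunction is equivalent to the second comparison alone.
    assert skips_ssl_tests(version) == (version < (3, 4))
    print(version, skips_ssl_tests(version))
```

On these samples only `(3, 4)` and `(3, 9)` reach the certificate checks, which means Python 2.7 is skipped by this guard as well.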

10

.ci/docker/requirements-ci.txt


 @ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
-mypy==1.9.0
+mypy==1.10.0
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
-#Pinned versions: 1.9.0
+#Pinned versions: 1.10.0
 #test that import: test_typing.py, test_type_hints.py
 networkx==2.8.8
 @ -134,9 +134,9 @@ opt-einsum==3.3
 #Pinned versions: 3.3
 #test that import: test_linalg.py
-optree==0.11.0
+optree==0.12.1
 #Description: A library for tree manipulation
-#Pinned versions: 0.11.0
+#Pinned versions: 0.12.1
 #test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
 #test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
 #common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
 @ -306,7 +306,7 @@ pywavelets==1.5.0 ; python_version >= "3.12"
 #Pinned versions: 1.4.1
 #test that import:
-lxml==5.0.0.
+lxml==5.0.0
 #Description: This is a requirement of unittest-xml-reporting
 # Python-3.9 binaries

									
										8

.ci/docker/ubuntu-cuda/Dockerfile
									
												
				@ -103,6 +103,14 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				ARG HALIDE

				# Build and install halide

				COPY ./common/install_halide.sh install_halide.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/halide.txt halide.txt

				RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi

				RUN rm install_halide.sh common_utils.sh halide.txt

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

									
										10

.ci/docker/ubuntu/Dockerfile
									
												
				@ -50,7 +50,7 @@ RUN  bash ./install_lcov.sh && rm install_lcov.sh

				# Install cuda and cudnn

				ARG CUDA_VERSION

				RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh

				COPY ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh

				ENV DESIRED_CUDA ${CUDA_VERSION}

				ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

				@ -155,6 +155,14 @@ COPY ci_commit_pins/executorch.txt executorch.txt

				RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi

				RUN rm install_executorch.sh common_utils.sh executorch.txt

				ARG HALIDE

				# Build and install halide

				COPY ./common/install_halide.sh install_halide.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/halide.txt halide.txt

				RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi

				RUN rm install_halide.sh common_utils.sh halide.txt

				ARG ONNX

				# Install ONNX dependencies

				COPY ./common/install_onnx.sh ./common/common_utils.sh ./

									
										41

.ci/pytorch/README.md
									
												
				@ -1,42 +1 @@

				This directory contains scripts for our continuous integration.

				One important thing to keep in mind when reading the scripts here is

				that they are all based off of Docker images, which we build for each of

				the various system configurations we want to run on Jenkins.  This means

				it is very easy to run these tests yourself:

				1. Figure out what Docker image you want.  The general template for our

				   images looks like:

				   ``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``,

				   where ``$BUILD_ENVIRONMENT`` is one of the build environments

				   enumerated in

				   [pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)

				2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and

				   run one of the scripts in this directory.

				The Docker images are designed so that any "reasonable" build commands

				will work; if you look in [build.sh](build.sh) you will see that it is a

				very simple script.  This is intentional.  Idiomatic build instructions

				should work inside all of our Docker images.  You can tweak the commands

				however you need (e.g., in case you want to rebuild with DEBUG, or rerun

				the build with higher verbosity, etc.).

				We have to do some work to make this so.  Here is a summary of the

				mechanisms we use:

				- We install binaries to directories like `/usr/local/bin` which

				  are automatically part of your PATH.

				- We add entries to the PATH using Docker ENV variables (so

				  they apply when you enter Docker) and `/etc/environment` (so they

				  continue to apply even if you sudo), instead of modifying

				  `PATH` in our build scripts.

				- We use `/etc/ld.so.conf.d` to register directories containing

				  shared libraries, instead of modifying `LD_LIBRARY_PATH` in our

				  build scripts.

				- We reroute well known paths like `/usr/bin/gcc` to alternate

				  implementations with `update-alternatives`, instead of setting

				  `CC` and `CXX` in our implementations.

									
										29

.ci/pytorch/build.sh
									
												
				@ -230,6 +230,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]

				  export BUILD_STATIC_RUNTIME_BENCHMARK=ON

				fi

				if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then

				  export CMAKE_BUILD_TYPE=RelWithAssert

				fi

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				@ -284,12 +288,26 @@ else

				        # Which should be backward compatible with Numpy-1.X

				        python -mpip install --pre numpy==2.0.0rc1

				      fi

				      WERROR=1 python setup.py bdist_wheel

				      WERROR=1 python setup.py clean

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel

				        BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake

				      else

				        WERROR=1 python setup.py bdist_wheel

				      fi

				    else

				      python setup.py clean

				      if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then

				        source .ci/pytorch/install_cache_xla.sh

				      fi

				      python setup.py bdist_wheel

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        echo "USE_SPLIT_BUILD cannot be used with xla or rocm"

				        exit 1

				      else

				        python setup.py bdist_wheel

				      fi

				    fi

				    pip_install_whl "$(echo dist/*.whl)"

				@ -328,9 +346,10 @@ else

				    CUSTOM_OP_TEST="$PWD/test/custom_operator"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$CUSTOM_OP_BUILD"

				    pushd "$CUSTOM_OP_BUILD"

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -343,7 +362,7 @@ else

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$JIT_HOOK_BUILD"

				    pushd "$JIT_HOOK_BUILD"

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -355,7 +374,7 @@ else

				    python --version

				    mkdir -p "$CUSTOM_BACKEND_BUILD"

				    pushd "$CUSTOM_BACKEND_BUILD"

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

									
										46

.ci/pytorch/common_utils.sh
									
												
				@ -56,9 +56,29 @@ function assert_git_not_dirty() {

				function pip_install_whl() {

				  # This is used to install PyTorch and other build artifacts wheel locally

				  # without using any network connection

				  python3 -mpip install --no-index --no-deps "$@"

				  # Convert the input arguments into an array

				  local args=("$@")

				  # Check if the first argument contains multiple paths separated by spaces

				  if [[ "${args[0]}" == *" "* ]]; then

				    # Split the string by spaces into an array

				    IFS=' ' read -r -a paths <<< "${args[0]}"

				    # Loop through each path and install individually

				    for path in "${paths[@]}"; do

				      echo "Installing $path"

				      python3 -mpip install --no-index --no-deps "$path"

				    done

				  else

				    # Loop through each argument and install individually

				    for path in "${args[@]}"; do

				      echo "Installing $path"

				      python3 -mpip install --no-index --no-deps "$path"

				    done

				  fi

				}

				function pip_install() {

				  # retry 3 times

				  # old versions of pip don't have the "--progress-bar" flag

				@ -188,28 +208,6 @@ function clone_pytorch_xla() {

				  fi

				}

				function checkout_install_torchdeploy() {

				  local commit

				  commit=$(get_pinned_commit multipy)

				  pushd ..

				  git clone --recurse-submodules https://github.com/pytorch/multipy.git

				  pushd multipy

				  git checkout "${commit}"

				  python multipy/runtime/example/generate_examples.py

				  BUILD_CUDA_TESTS=1 pip install -e .

				  popd

				  popd

				}

				function test_torch_deploy(){

				 pushd ..

				 pushd multipy

				 ./multipy/runtime/build/test_deploy

				 ./multipy/runtime/build/test_deploy_gpu

				 popd

				 popd

				}

				function checkout_install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				@ -224,6 +222,8 @@ function checkout_install_torchbench() {

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				}
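
The reworked `pip_install_whl` shown earlier in this file accepts either several arguments or one argument holding several space-separated paths, which is what `"$(echo dist/*.whl)"` produces when more than one wheel is built. The path-selection logic can be sketched in Python as follows. `wheel_paths` is a hypothetical name, the wheel file names are made up, and no `pip` call is made.

```python
def wheel_paths(args):
    """Return the paths pip_install_whl would install, one pip call each.

    Hypothetical helper that mirrors only the argument handling of the
    shell function.
    """
    if args and " " in args[0]:
        # The first argument carries several paths; as in the shell
        # version, any further arguments are ignored in this branch.
        return [path for path in args[0].split(" ") if path]
    # Otherwise every argument is treated as a single path.
    return list(args)


# One string holding two paths, as a glob expanded inside "$(echo ...)" yields.
print(wheel_paths(["dist/torch-example.whl dist/libtorch-example.whl"]))
# Two separate arguments.
print(wheel_paths(["dist/a.whl", "dist/b.whl"]))
```

A consequence visible in the sketch is that a wheel path containing a space would be split into pieces, so the approach assumes space-free paths.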

									
										1

.ci/pytorch/create_test_cert.py
									
												
				@ -6,6 +6,7 @@ from cryptography.hazmat.primitives import hashes, serialization

				from cryptography.hazmat.primitives.asymmetric import rsa

				from cryptography.x509.oid import NameOID

				temp_dir = mkdtemp()

				print(temp_dir)

									
										6

.ci/pytorch/multigpu-test.sh
									
												
				@ -18,8 +18,9 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_nccl

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl

				time python test/run_test.py --verbose -i distributed/test_cuda_p2p

				time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering

				time python test/run_test.py --verbose -i distributed/test_store

				time python test/run_test.py --verbose -i distributed/test_symmetric_memory

				time python test/run_test.py --verbose -i distributed/test_pg_wrapper

				time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent

				# FSDP tests

				@ -54,6 +55,9 @@ time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_full

				# Pipelining composability tests

				time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py

				# ND composability tests

				time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability

				# Other tests

				time python test/run_test.py --verbose -i test_cuda_primary_ctx

				time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu

									
										1

.ci/pytorch/perf_test/compare_with_baseline.py
									
												
				@ -3,6 +3,7 @@ import json

				import math

				import sys

				parser = argparse.ArgumentParser()

				parser.add_argument(

				    "--test-name", dest="test_name", action="store", required=True, help="test name"

									
										1

.ci/pytorch/perf_test/get_stats.py
									
												
				@ -3,6 +3,7 @@ import sys

				import numpy

				sample_data_list = sys.argv[1:]

				sample_data_list = [float(v.strip()) for v in sample_data_list]

									
										1

.ci/pytorch/perf_test/update_commit_hash.py
									
												
				@ -1,6 +1,7 @@

				import json

				import sys

				data_file_path = sys.argv[1]

				commit_hash = sys.argv[2]

									
										1

.ci/pytorch/print_sccache_log.py
									
												
				@ -1,5 +1,6 @@

				import sys

				log_file_path = sys.argv[1]

				with open(log_file_path) as f:

									
										364

.ci/pytorch/test.sh
									
												
				@ -249,9 +249,7 @@ fi

				# This tests that the debug asserts are working correctly.

				if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then

				    echo "We are in debug mode: $BUILD_ENVIRONMENT. Expect the python assertion to fail"

				    # TODO: Enable the check after we setup the build to run debug asserts without having

				    #       to do a full (and slow) debug build

				    # (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")

				    (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")

				elif [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				    # Noop when debug is disabled. Skip bazel jobs because torch isn't available there yet.

				    echo "We are not in debug mode: $BUILD_ENVIRONMENT. Expect the assertion to pass"

				@ -264,18 +262,6 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then

				  export ATEN_CPU_CAPABILITY=avx2

				fi

				# temp workarounds for https://github.com/pytorch/pytorch/issues/126692, remove when fixed

				if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				  pushd test

				  CUDA_VERSION=$(python -c "import torch; print(torch.version.cuda)")

				  if [ "$CUDA_VERSION" == "12.4" ]; then

				    ISCUDA124="cu124"

				  else

				    ISCUDA124=""

				  fi

				  popd

				fi

				test_python_legacy_jit() {

				  time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose

				  assert_git_not_dirty

				@ -289,6 +275,9 @@ test_python_shard() {

				  # Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly

				  # shellcheck disable=SC2086

				  # modify LD_LIBRARY_PATH to ensure it has the conda env.

				  # This set of tests has been shown to be buggy without it for the split-build

				  time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION

				  assert_git_not_dirty

				@ -328,13 +317,14 @@ test_inductor_distributed() {

				  python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose

				  python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose

				  python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose

				  python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_save_load --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose

				@ -347,17 +337,31 @@ test_inductor_distributed() {

				  assert_git_not_dirty

				}

				test_inductor() {

				  python tools/dynamo/verify_dynamo.py

				  python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose

				  # Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose

				test_inductor_shard() {

				  if [[ -z "$NUM_TEST_SHARDS" ]]; then

				    echo "NUM_TEST_SHARDS must be defined to run a Python test shard"

				    exit 1

				  fi

				  python tools/dynamo/verify_dynamo.py

				  python test/run_test.py --inductor \

				    --include test_modules test_ops test_ops_gradients test_torch \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				  # Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state

				  python test/run_test.py \

				    --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				}

				test_inductor_aoti() {

				  # docker build uses bdist_wheel which does not work with test_aot_inductor

				  # TODO: need a faster way to build

				  if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				      BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop

				      CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				    BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop

				    CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				  fi

				}

				@ -376,7 +380,7 @@ test_inductor_cpp_wrapper_abi_compatible() {

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_timm_training.csv"

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				}

				# "Global" flags for inductor benchmarking controlled by TEST_CONFIG

				@ -401,7 +405,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)

				fi

				if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--device cpu)

				else

				  DYNAMO_BENCHMARK_FLAGS+=(--device cuda)

				@ -425,6 +429,18 @@ test_perf_for_dashboard() {

				  # TODO: All the accuracy tests can be skipped once the CI accuracy checking is stable enough

				  local targets=(accuracy performance)

				  local device=cuda

				  if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				    if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then

				      device=cpu_x86

				    elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then

				      device=cpu_aarch64

				    fi

				    test_inductor_set_cpu_affinity

				  elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then

				    device=cuda_a10g

				  fi

				  for mode in "${modes[@]}"; do

				    if [[ "$mode" == "inference" ]]; then

				      dtype=bfloat16

				@ -440,56 +456,56 @@ test_perf_for_dashboard() {

				      fi

				      if [[ "$DASHBOARD_TAG" == *default-true* ]]; then

				        python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then

				        python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then

				        python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \

				            --dynamic-batch-only "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then

				        TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \

				        TORCHINDUCTOR_CPP_WRAPPER=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then

				        python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then

				        TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \

				        TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then

				        TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \

				        TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then

				        TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \

				        TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				            --output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then

				        # TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.

				        # The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data

				        # to fill the dashboard.

				        python "benchmarks/dynamo/$suite.py" \

				        $TASKSET python "benchmarks/dynamo/$suite.py" \

				          "${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \

				          --output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true

				          --output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true

				        # Copy cudagraph results as mock data, easiest choice?

				        cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \

				          "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"

				        cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv" \

				          "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv"

				      fi

				    done

				  done

				@ -526,11 +542,16 @@ test_single_dynamo_benchmark() {

				    test_perf_for_dashboard "$suite" \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"

				  else

				    if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				    if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then

				      # Test AOTInductor with the ABI-compatible mode on CI

				      # This can be removed once the ABI-compatible mode becomes default.

				      # For CPU device, we prefer non ABI-compatible mode on CI when testing AOTInductor.

				      export TORCHINDUCTOR_ABI_COMPATIBLE=1

				    fi

				    if [[ "${TEST_CONFIG}" == *_avx2* ]]; then

				      TEST_CONFIG=${TEST_CONFIG::-5}

				    fi

				    python "benchmarks/dynamo/$suite.py" \

				      --ci --accuracy --timing --explain \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" \

				@ -538,10 +559,10 @@ test_single_dynamo_benchmark() {

				      --output "$TEST_REPORTS_DIR/${name}_${suite}.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				    python benchmarks/dynamo/check_graph_breaks.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				  fi

				}

				@ -550,6 +571,11 @@ test_inductor_micro_benchmark() {

				  python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"

				}

				test_inductor_halide() {

				  python test/run_test.py --include inductor/test_halide.py --verbose

				  assert_git_not_dirty

				}

				test_dynamo_benchmark() {

				  # Usage: test_dynamo_benchmark huggingface 0

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				@ -564,11 +590,15 @@ test_dynamo_benchmark() {

				  elif [[ "${TEST_CONFIG}" == *perf* ]]; then

				    test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"

				  else

				    if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				    if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				      local dt="float32"

				      if [[ "${TEST_CONFIG}" == *amp* ]]; then

				        dt="amp"

				      fi

				      if [[ "${TEST_CONFIG}" == *freezing* ]]; then

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 --freezing "$@"

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" --freezing "$@"

				      else

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" "$@"

				      fi

				    elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				@ -592,7 +622,7 @@ test_inductor_torchbench_smoketest_perf() {

				    --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_torchbench_inference.csv"

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				  python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \

				    --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \

				@ -607,13 +637,8 @@ test_inductor_torchbench_smoketest_perf() {

				  # https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,

				  # and thus we lower its threshold to reduce flakiness. If this continues to be a problem,

				  # we switch to use some other model.

				  # Use 4.7 for cuda 12.4, change back to 4.9 after fixing https://github.com/pytorch/pytorch/issues/126692

				  if [ "$CUDA_VERSION" == "12.4" ]; then

				    THRESHOLD=4.7

				  else

				    THRESHOLD=4.9

				  fi

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t $THRESHOLD

				  # lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7

				  # Check memory compression ratio for a few models

				  for test in hf_Albert timm_vision_transformer; do

				@ -632,52 +657,76 @@ test_inductor_torchbench_smoketest_perf() {

				      --only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_huggingface_training.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"

				  done

				}

				test_inductor_get_core_number() {

				  if [[ "${TEST_CONFIG}" == *aarch64 ]]; then

				    echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"

				  else

				    echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"

				  fi

				}

				test_inductor_set_cpu_affinity(){

				  #set jemalloc

				  JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"

				  IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"

				  export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"

				  export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

				  export KMP_AFFINITY=granularity=fine,compact,1,0

				  export KMP_BLOCKTIME=1

				  cores=$(test_inductor_get_core_number)

				  export OMP_NUM_THREADS=$cores

				  end_core=$((cores-1))

				  export TASKSET="taskset -c 0-$end_core"

				}

				test_inductor_torchbench_cpu_smoketest_perf(){

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  #set jemalloc

				  JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

				  IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"

				  export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"

				  export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

				  export KMP_AFFINITY=granularity=fine,compact,1,0

				  export KMP_BLOCKTIME=1

				  CORES=$(lscpu | grep Core | awk '{print $4}')

				  export OMP_NUM_THREADS=$CORES

				  end_core=$(( CORES-1 ))

				  test_inductor_set_cpu_affinity

				  MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv

				  grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg

				  do

				    local model_name=${model_cfg[0]}

				    local data_type=${model_cfg[1]}

				    local speedup_target=${model_cfg[4]}

				    if [[ ${model_cfg[3]} == "cpp" ]]; then

				    local data_type=${model_cfg[2]}

				    local speedup_target=${model_cfg[5]}

				    local backend=${model_cfg[1]}

				    if [[ ${model_cfg[4]} == "cpp" ]]; then

				      export TORCHINDUCTOR_CPP_WRAPPER=1

				    else

				      unset TORCHINDUCTOR_CPP_WRAPPER

				    fi

				    local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"

				    if [[ ${model_cfg[2]} == "dynamic" ]]; then

				      taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \

				    if [[ ${model_cfg[3]} == "dynamic" ]]; then

				      $TASKSET python benchmarks/dynamo/torchbench.py \

				        --inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \

				        --dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"

				        --dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"

				    else

				      taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \

				      $TASKSET python benchmarks/dynamo/torchbench.py \

				        --inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \

				        --freezing --timeout 9000 --backend=inductor --output "$output_name"

				        --freezing --timeout 9000 --"$backend" --output "$output_name"

				    fi

				    cat "$output_name"

				    # The threshold value needs to be actively maintained to make this check useful.

				    python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"

				  done

				  # Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \

				    --bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \

				    --output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \

				    --bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \

				    --output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"

				}

				test_torchbench_gcp_smoketest(){

				@ -991,11 +1040,113 @@ test_xla() {

				  assert_git_not_dirty

				}

				function check_public_api_test_fails {

				    test_name=$1

				    invalid_item_name=$2

				    invalid_item_desc=$3

				    echo "Running public API test '${test_name}'..."

				    test_output=$(python test/test_public_bindings.py -k "${test_name}" 2>&1) && ret=$? || ret=$?

				    # Ensure test fails correctly.

				    if [ "$ret" -eq 0 ]; then

				        cat << EOF

				Expected the public API test '${test_name}' to fail after introducing

				${invalid_item_desc}, but it succeeded! Check test/test_public_bindings.py

				for any changes that may have broken the test.

				EOF

				        return 1

				    fi

				    # Ensure invalid item is in the test output.

				    echo "${test_output}" | grep -q "${invalid_item_name}" && ret=$? || ret=$?

				    if [ $ret -ne 0 ]; then

				        cat << EOF

				Expected the public API test '${test_name}' to identify ${invalid_item_desc}, but

				it didn't! It's possible the test may not have run. Check test/test_public_bindings.py

				for any changes that may have broken the test.

				EOF

				        return 1

				    fi

				    echo "Success! '${test_name}' identified ${invalid_item_desc} ${invalid_item_name}."

				    return 0

				}

				# Do NOT run this test before any other tests, like test_python_shard, etc.

				# Because this function uninstalls the torch built from branch and installs

				# the torch built on its base commit.

				test_forward_backward_compatibility() {

				  set -x

				  # First, validate public API tests in the torch built from branch.

				  # Step 1. Make sure the public API test "test_correct_module_names" fails when a new file

				  # introduces an invalid public API function.

				  new_filename=$(mktemp XXXXXXXX.py -p "${TORCH_INSTALL_DIR}")

				  BAD_PUBLIC_FUNC=$(

				  cat << 'EOF'

				def new_public_func():

				  pass

				# valid public API functions have __module__ set correctly

				new_public_func.__module__ = None

				EOF

				  )

				  echo "${BAD_PUBLIC_FUNC}" >> "${new_filename}"

				  invalid_api="torch.$(basename -s '.py' "${new_filename}").new_public_func"

				  echo "Created an invalid public API function ${invalid_api}..."

				  check_public_api_test_fails \

				      "test_correct_module_names" \

				      "${invalid_api}" \

				      "an invalid public API function" && ret=$? || ret=$?

				  rm -v "${new_filename}"

				  if [ "$ret" -ne 0 ]; then

				      exit 1

				  fi

				  # Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing

				  # file is modified to introduce an invalid public API function.

				  EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"

				  cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"

				  echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"

				  invalid_api="torch.nn.parameter.new_public_func"

				  echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."

				  check_public_api_test_fails \

				      "test_correct_module_names" \

				      "${invalid_api}" \

				      "an invalid public API function" && ret=$? || ret=$?

				  mv -v "${EXISTING_FILEPATH}.orig" "${EXISTING_FILEPATH}"

				  if [ "$ret" -ne 0 ]; then

				      exit 1

				  fi

				  # Step 3. Make sure that the public API test "test_modules_can_be_imported" fails when a module

				  # cannot be imported.

				  new_module_dir=$(mktemp XXXXXXXX -d -p "${TORCH_INSTALL_DIR}")

				  echo "invalid syntax garbage" > "${new_module_dir}/__init__.py"

				  invalid_module_name="torch.$(basename "${new_module_dir}")"

				  check_public_api_test_fails \

				      "test_modules_can_be_imported" \

				      "${invalid_module_name}" \

				      "a non-importable module" && ret=$? || ret=$?

				  rm -rv "${new_module_dir}"

				  if [ "$ret" -ne 0 ]; then

				      exit 1

				  fi

				  # Next, build torch from the merge base.

				  REPO_DIR=$(pwd)

				  if [[ "${BASE_SHA}" == "${SHA1}" ]]; then

				    echo "On trunk, we should compare schemas with torch built from the parent commit"

				@ -1169,15 +1320,21 @@ test_executorch() {

				  pushd /executorch

				  # NB: We need to build ExecuTorch runner here and not inside the Docker image

				  # because it depends on PyTorch

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  # NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch

				  # from the PR

				  # shellcheck disable=SC1091

				  source .ci/scripts/utils.sh

				  build_executorch_runner "cmake"

				  source .ci/scripts/setup-linux.sh cmake

				  echo "Run ExecuTorch unit tests"

				  pytest -v -n auto

				  # shellcheck disable=SC1091

				  LLVM_PROFDATA=llvm-profdata-12 LLVM_COV=llvm-cov-12 bash test/run_oss_cpp_tests.sh

				  echo "Run ExecuTorch regression tests for some models"

				  # NB: This is a sample model, more can be added here

				  export PYTHON_EXECUTABLE=python

				  # TODO(huydhn): Add more coverage here using ExecuTorch's gather models script

				  # shellcheck disable=SC1091

				  source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''

				@ -1215,7 +1372,7 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze

				  (cd test && python -c "import torch; print(torch.__config__.show())")

				  (cd test && python -c "import torch; print(torch.__config__.parallel_info())")

				fi

				if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then

				if [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then

				  test_linux_aarch64

				elif [[ "${TEST_CONFIG}" == *backward* ]]; then

				  test_forward_backward_compatibility

				@ -1237,11 +1394,10 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_rpc

				  fi

				elif [[ "$TEST_CONFIG" == deploy ]]; then

				  checkout_install_torchdeploy

				  test_torch_deploy

				elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

				  test_inductor_halide

				elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then

				  test_inductor_micro_benchmark

				elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then

				@ -1253,13 +1409,14 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then

				  id=$((SHARD_NUMBER-1))

				  test_dynamo_benchmark timm_models "$id"

				elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				  if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				    install_torchaudio cpu

				  else

				    install_torchaudio cuda

				  fi

				  install_torchtext

				  install_torchvision

				  TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git

				  id=$((SHARD_NUMBER-1))

				  # https://github.com/opencv/opencv-python/issues/885

				  pip_install opencv-python==4.8.0.74

				@ -1267,9 +1424,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				    checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then

				    checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \

				    checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \

				      llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \

				      shufflenet_v2_x1_0 hf_GPT2

				      functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then

				    checkout_install_torchbench

				@ -1278,7 +1435,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				    checkout_install_torchbench

				    # Do this after checkout_install_torchbench to ensure we clobber any

				    # nightlies that torchbench may pull in

				    if [[ "${TEST_CONFIG}" != *cpu_inductor* ]]; then

				    if [[ "${TEST_CONFIG}" != *cpu* ]]; then

				      install_torchrec_and_fbgemm

				    fi

				    PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"

				@ -1286,17 +1443,22 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then

				  install_torchvision

				  test_inductor_cpp_wrapper_abi_compatible

				elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  install_torchvision

				  test_dynamo_shard 1

				  test_aten

				elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  test_inductor_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.8-gcc11-build ]]; then

				      # Temporarily skip test_inductor_aoti due to https://github.com/pytorch/pytorch/issues/130311

				      test_inductor_aoti

				      test_inductor_distributed

				    fi

				  fi

				elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then

				  install_torchvision

				  test_dynamo_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_aten

				  fi

				elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then

				  install_torchvision

				  test_python_shard "$SHARD_NUMBER"
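
The core-count and affinity logic in `test_inductor_get_core_number` and `test_inductor_set_cpu_affinity` above multiplies the socket count by the cores per socket and pins the benchmark to cores `0..N-1`. A minimal sketch of that arithmetic follows. It is not the CI script's code: the helper name `count_physical_cores` and the sample `lscpu` text are hypothetical, and the sketch reads canned input so it can run on any machine.

```shell
#!/bin/sh
# Hypothetical helper: read lscpu-style text on stdin and print
# sockets * cores-per-socket.
count_physical_cores() {
  awk -F: '
    /^Socket\(s\)/          { gsub(/[ \t]/, "", $2); sockets = $2 }
    /^Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); per_socket = $2 }
    END { print sockets * per_socket }
  '
}

# Canned sample standing in for real `lscpu` output (an assumption).
cores=$(printf 'Socket(s):            2\nCore(s) per socket:   16\n' | count_physical_cores)

# Pin to cores 0..cores-1, mirroring how the diff builds $TASKSET.
end_core=$((cores - 1))
taskset_prefix="taskset -c 0-$end_core"
echo "$taskset_prefix"
```

Matching on the field label with `awk -F:` avoids depending on the column position of the value, which is what `awk '{print $2}'` and `awk '{print $4}'` rely on in the script.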

									
										1

.ci/pytorch/win-test-helpers/run_python_nn_smoketests.py
									
												
				@ -4,6 +4,7 @@ import os

				import subprocess

				import sys

				COMMON_TESTS = [

				    (

				        "Checking that torch is available",

									
										1

.circleci/codegen_validation/normalize_yaml_fragment.py
									
												
				@ -5,6 +5,7 @@ import sys

				import yaml

				# Need to import modules that lie on an upward-relative path

				sys.path.append(os.path.join(sys.path[0], ".."))

									
										27

.circleci/scripts/binary_linux_test.sh
									
												
				@ -46,14 +46,12 @@ if [[ "\$python_nodot" = *310* ]]; then

				  PROTOBUF_PACKAGE="protobuf>=3.19.0"

				fi

				if [[ "\$python_nodot" = *39*  ]]; then

				if [[ "\$python_nodot" = *39* ]]; then

				  # There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20

				  # we set a lower boundary here just to be safe

				  NUMPY_PIN=">=1.20"

				fi

				# Move debug wheels out of the package dir so they don't get installed

				mkdir -p /tmp/debug_final_pkgs

				mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"

				@ -83,7 +81,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				      "numpy\${NUMPY_PIN}" \

				      mkl>=2018 \

				      ninja \

				      sympy \

				      sympy>=1.12 \

				      typing-extensions \

				      ${PROTOBUF_PACKAGE}

				    if [[ "$DESIRED_CUDA" == 'cpu' ]]; then

				@ -97,8 +95,16 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				  )

				elif [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then

				    pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				    retry pip install -q numpy protobuf typing-extensions

				    if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				      pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"

				      pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"

				      # todo: after folder is populated use the pypi_pkg channel instead

				      pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"

				      retry pip install -q numpy protobuf typing-extensions

				    else

				      pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				      retry pip install -q numpy protobuf typing-extensions

				    fi

				  else

				    pip install "\$pkg"

				    retry pip install -q numpy protobuf typing-extensions

				@ -110,9 +116,18 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  cd /tmp/libtorch

				fi

				if [[ "$GPU_ARCH_TYPE" == xpu ]]; then

				  # Workaround for __mkl_tmp_MOD unbound variable issue; see https://github.com/pytorch/pytorch/issues/130543

				  set +u

				  source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh

				fi

				# Test the package

				/builder/check_binary.sh

				# Clean temp files

				cd /builder && git clean -ffdx

				# =================== The above code will be executed inside Docker container ===================

				EOL

				echo

									
										49

.circleci/scripts/binary_populate_env.sh
									
												
				@ -33,9 +33,9 @@ if [[ -z "$DOCKER_IMAGE" ]]; then

				  if [[ "$PACKAGE_TYPE" == conda ]]; then

				    export DOCKER_IMAGE="pytorch/conda-cuda"

				  elif [[ "$DESIRED_CUDA" == cpu ]]; then

				    export DOCKER_IMAGE="pytorch/manylinux-cpu"

				    export DOCKER_IMAGE="pytorch/manylinux:cpu"

				  else

				    export DOCKER_IMAGE="pytorch/manylinux-cuda${DESIRED_CUDA:2}"

				    export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"

				  fi

				fi

				@ -75,9 +75,9 @@ export PYTORCH_BUILD_NUMBER=1

				TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)

				# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT

				TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				  # Only linux Python < 3.13 are supported wheels for triton

				  TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

				  TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				  if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				      TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

				@ -87,11 +87,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:

				fi

				# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)

				        TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"

				        TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

				@ -100,30 +100,18 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B

				    fi

				fi

				JAVA_HOME=

				BUILD_JNI=OFF

				if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  POSSIBLE_JAVA_HOMES=()

				  POSSIBLE_JAVA_HOMES+=(/usr/local)

				  POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)

				  POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)

				  # Add the Windows-specific JNI path

				  POSSIBLE_JAVA_HOMES+=("$PWD/pytorch/.circleci/windows-jni/")

				  for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do

				    if [[ -e "$JH/include/jni.h" ]] ; then

				      # Skip if we're not on Windows but haven't found a JAVA_HOME

				      if [[ "$JH" == "$PWD/pytorch/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then

				        break

				      fi

				      echo "Found jni.h under $JH"

				      JAVA_HOME="$JH"

				      BUILD_JNI=ON

				      break

				# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)

				        TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

				    else

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"

				    fi

				  done

				  if [ -z "$JAVA_HOME" ]; then

				    echo "Did not find jni.h"

				  fi

				fi

				cat >"$envfile" <<EOL

				@ -136,6 +124,7 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"

				export DESIRED_CUDA="$DESIRED_CUDA"

				export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"

				export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"

				export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"

				if [[ "${OSTYPE}" == "msys" ]]; then

				  export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"

				  if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then

				@ -159,8 +148,6 @@ export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'

				export ANACONDA_USER='pytorch'

				export USE_FBGEMM=1

				export JAVA_HOME=$JAVA_HOME

				export BUILD_JNI=$BUILD_JNI

				export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"

				export DOCKER_IMAGE="$DOCKER_IMAGE"
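
The triton hunks above build a requirement string with an environment marker and then either set `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` or append to it with a ` | ` separator. A minimal sketch of that set-or-append pattern is below. The helper `append_requirement` and the package names `example-pkg` and `example-triton` are hypothetical and chosen only for illustration; the marker text is copied from the diff.

```shell
#!/bin/sh
# Hypothetical helper: return $2 alone if $1 is empty, otherwise
# join the two with the " | " separator used in the diff.
append_requirement() {
  existing="$1"
  new="$2"
  if [ -z "$existing" ]; then
    printf '%s' "$new"
  else
    printf '%s | %s' "$existing" "$new"
  fi
}

# Environment marker copied from TRITON_CONSTRAINT in the diff.
marker="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

reqs=""
reqs=$(append_requirement "$reqs" "example-pkg==1.0")
reqs=$(append_requirement "$reqs" "example-triton==3.0.0; $marker")
echo "$reqs"
```

Keeping the marker in one variable, as the new side of the diff does with `TRITON_CONSTRAINT`, lets every triton flavor that appends it share the same platform and Python-version restriction.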

									
										9

.circleci/scripts/binary_upload.sh
									
												
				@ -25,6 +25,15 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then

				  AWS_S3_CP="aws s3 cp"

				fi

				if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"

				fi

				# This is a special build with all dependencies packaged

				if [[ ${BUILD_NAME} == *-full* ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"

				fi

				# Sleep 5 minutes between retries for conda upload

				retry () {

				  "$@"  || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")
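
The `retry` helper in this hunk chains five attempts with a fixed sleep between them. An equivalent loop with the attempt count and delay as parameters might look like the sketch below. It is not the script's implementation: `retry_with_delay` and the `flaky` test command are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2
# seconds between failed attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry_with_delay() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Illustrative command that fails twice and succeeds on the third call.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

retry_with_delay 5 0 flaky && echo "succeeded after $calls calls"
```

Because the command runs in the current shell rather than in a subshell, side effects such as the `calls` counter stay visible to the caller; the `(sleep 5m && "$@")` form in the hunk runs its retries in subshells.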

									
										1

.circleci/scripts/trigger_azure_pipeline.py
									
												
				@ -8,6 +8,7 @@ import time

				import requests

				AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"

				AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")

				PIPELINE_ID = "911"

									
										2

.devcontainer/scripts/install-dev-tools.sh
									
												
				@ -5,7 +5,7 @@ git submodule sync

				git submodule update --init --recursive

				# This takes some time

				make setup_lint

				make setup-lint

				# Add CMAKE_PREFIX_PATH to bashrc

				echo 'export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}' >> ~/.bashrc

7

.flake8


 @ -2,12 +2,12 @@
 # NOTE: **Mirror any changes** to this file to the [tool.ruff] config in pyproject.toml
 # before we can fully move to use ruff
 enable-extensions = G
 select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
 select = B,C,E,F,G,P,SIM1,SIM911,T4,W,B9,TOR0,TOR1,TOR2,TOR9
 max-line-length = 120
 # C408 ignored because we like the dict keyword argument syntax
 # E501 is not flexible enough, we're using B950 instead
 ignore =
     E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
     E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
     # shebang has extra meaning in fbcode lints, so I think it's not worth trying
     # to line this up with executable bit
     EXE001,
 @ -55,6 +55,9 @@ per-file-ignores =
     torch/distributed/_functional_collectives.py: TOR901
     torch/distributed/_spmd/data_parallel.py: TOR901
     torch/distributed/_tensor/_collective_utils.py: TOR901
     # This is a full package that happen to live within the test
     # folder, so ok to skip
     test/cpp_extensions/open_registration_extension/pytorch_openreg/__init__.py: TOR901
 optional-ascii-coding = True
 exclude =
     ./.git,

4

.git-blame-ignore-revs

 @ -40,3 +40,7 @@ e6ec0efaf87703c5f889cfc20b29be455885d58d
 a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
 # 2024-01-02 clangformat: fused adam #116583
 dc68d1aa9e554d09344a10fff69f7b50b2d23a0
 # 2024-06-28 enable UFMT in `torch/storage.py`
 d80939e5e9337e8078f11489afefec59fd42f93b
 # 2024-06-28 enable UFMT in `torch.utils.data`
 cf0b90e49689d45be91aa539fdf54cf2ea8a9a3

									
										30

.github/actionlint.yaml
									
										vendored
									
				@ -9,14 +9,17 @@ self-hosted-runner:

				    - linux.large

				    - linux.2xlarge

				    - linux.4xlarge

				    - linux.9xlarge.ephemeral

				    - linux.12xlarge

				    - linux.12xlarge.ephemeral

				    - linux.24xlarge

				    - linux.arm64.2xlarge

				    - linux.arm64.m7g.4xlarge

				    - linux.4xlarge.nvidia.gpu

				    - linux.8xlarge.nvidia.gpu

				    - linux.16xlarge.nvidia.gpu

				    - linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners on Linux Foundation account

				    # Pytorch/pytorch AWS Linux Runners on Linux Foundation account

				    - lf.linux.large

				    - lf.linux.2xlarge

				    - lf.linux.4xlarge

				@ -27,6 +30,29 @@ self-hosted-runner:

				    - lf.linux.8xlarge.nvidia.gpu

				    - lf.linux.16xlarge.nvidia.gpu

				    - lf.linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners with new Amazon 2023 AMI

				    - amz2023.linux.large

				    - amz2023.linux.2xlarge

				    - amz2023.linux.4xlarge

				    - amz2023.linux.12xlarge

				    - amz2023.linux.24xlarge

				    - amz2023.linux.arm64.2xlarge

				    - amz2023.linux.arm64.m7g.4xlarge

				    - amz2023.linux.4xlarge.nvidia.gpu

				    - amz2023.linux.8xlarge.nvidia.gpu

				    - amz2023.linux.16xlarge.nvidia.gpu

				    - amz2023.linux.g5.4xlarge.nvidia.gpu

				    # Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account

				    - amz2023.lf.linux.large

				    - amz2023.lf.linux.2xlarge

				    - amz2023.lf.linux.4xlarge

				    - amz2023.lf.linux.12xlarge

				    - amz2023.lf.linux.24xlarge

				    - amz2023.lf.linux.arm64.2xlarge

				    - amz2023.lf.linux.4xlarge.nvidia.gpu

				    - amz2023.lf.linux.8xlarge.nvidia.gpu

				    - amz2023.lf.linux.16xlarge.nvidia.gpu

				    - amz2023.lf.linux.g5.4xlarge.nvidia.gpu

				    # Repo-specific IBM hosted S390x runner

				    - linux.s390x

				    # Organization wide AWS Windows runners

				@ -47,3 +73,5 @@ self-hosted-runner:

				    - macos-latest-xlarge

				    - macos-13-xlarge

				    - macos-14-xlarge

				    # Organization-wide Intel hosted XPU runners

				    - linux.idc.xpu

									
										6

.github/actions/diskspace-cleanup/action.yml
									
										vendored
									
				@ -14,12 +14,14 @@ runs:

				    - name: Cleans up diskspace

				      shell: bash

				      run: |

				        set -ex

				        diskspace_cutoff=${{ inputs.diskspace-cutoff }}

				        diskspace=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')

				        docker_root_dir=$(docker info -f '{{.DockerRootDir}}')

				        diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')

				        msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"

				        if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then

				            docker system prune -af

				            diskspace_new=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')

				            diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')

				            if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then

				                echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."

				                echo "$msg"
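The hunk above switches the measurement from `/` to the filesystem that holds Docker's root directory. A minimal Python sketch of the same threshold check follows. The helper names `percent_used` and `needs_cleanup` are hypothetical, and `shutil.disk_usage` works from raw byte counts, so the percentage only approximates what `df --output=pcent` prints:

```python
import shutil


def percent_used(path: str) -> int:
    """Return used space on the filesystem containing `path`, as a whole percent."""
    usage = shutil.disk_usage(path)
    # Ceiling division so a nearly full disk is not under-reported.
    return -(-usage.used * 100 // usage.total)


def needs_cleanup(path: str, cutoff: int) -> bool:
    """Mirror the `-ge` comparison in the action: clean up at or above the cutoff."""
    return percent_used(path) >= cutoff
```

In the action itself the path would come from `docker info -f '{{.DockerRootDir}}'`; any existing directory can stand in for it when trying the sketch locally.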

									
										3

.github/actions/filter-test-configs/action.yml
									
										vendored
									
				@ -41,6 +41,9 @@ outputs:

				  ci-verbose-test-logs:

				    description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.

				    value: ${{ steps.filter.outputs.ci-verbose-test-logs }}

				  ci-test-showlocals:

				    description: True if ci-test-showlocals label was on PR or [ci-test-showlocals] in PR body.

				    value: ${{ steps.filter.outputs.ci-test-showlocals }}

				  ci-no-test-timeout:

				    description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.

				    value: ${{ steps.filter.outputs.ci-no-test-timeout }}

									
										207

.github/actions/linux-build/action.yml
									
										vendored
									
				@ -1,207 +0,0 @@

				name: linux-build

				inputs:

				  build-environment:

				    required: true

				    description: Top-level label for what's being built/tested.

				  docker-image-name:

				    required: true

				    description: Name of the base docker image to build with.

				  build-generates-artifacts:

				    required: false

				    default: "true"

				    description: If set, upload generated build artifacts.

				  build-with-debug:

				    required: false

				    default: "false"

				    description: If set, build in debug mode.

				  sync-tag:

				    required: false

				    default: ""

				    description: |

				      If this is set, our linter will use this to make sure that every other

				      job with the same `sync-tag` is identical.

				  cuda-arch-list:

				    required: false

				    default: "5.2"

				    description: Runner label to select worker type

				  runner:

				    required: false

				    default: "linux.2xlarge"

				    description: |

				      List of CUDA architectures CI build should target.

				  test-matrix:

				    required: false

				    type: string

				    description: |

				      An option JSON description of what test configs to run later on. This

				      is moved here from the Linux test workflow so that we can apply filter

				      logic using test-config labels earlier and skip unnecessary builds

				  s3-bucket:

				    description: S3 bucket to download artifact

				    required: false

				    default: "gha-artifacts"

				  aws-role-to-assume:

				    description: role to assume for downloading artifacts

				    required: false

				    default: ""

				  GITHUB_TOKEN:

				    description: GitHub token

				    required: true

				  HUGGING_FACE_HUB_TOKEN:

				    description: Hugging Face Hub token

				    required: false

				    default: ""

				outputs:

				  docker-image:

				    value: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    description: The docker image containing the built PyTorch.

				  test-matrix:

				    value: ${{ steps.filter.outputs.test-matrix }}

				    description: An optional JSON description of what test configs to run later on.

				runs:

				  using: composite

				  steps:

				    - name: Setup Linux

				      uses: ./.github/actions/setup-linux

				    - name: configure aws credentials

				      uses: aws-actions/configure-aws-credentials@v3

				      if: ${{ inputs.aws-role-to-assume != '' }}

				      with:

				        role-to-assume: ${{ inputs.aws-role-to-assume }}

				        role-session-name: gha-linux-build

				        role-duration-seconds: 10800

				        aws-region: us-east-1

				    - name: Calculate docker image

				      id: calculate-docker-image

				      uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				      with:

				        docker-image-name: ${{ inputs.docker-image-name }}

				    - name: Use following to pull public copy of the image

				      id: print-ghcr-mirror

				      env:

				        ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      shell: bash

				      run: |

				        tag=${ECR_DOCKER_IMAGE##*/}

				        echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"

				    - name: Pull docker image

				      uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				      with:

				        docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    - name: Parse ref

				      id: parse-ref

				      shell: bash

				      run: .github/scripts/parse_ref.py

				    - name: Get workflow job id

				      id: get-job-id

				      uses: ./.github/actions/get-workflow-job-id

				      if: always()

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				    # Apply the filter logic to the build step too if the test-config label is already there

				    - name: Select all requested test configurations (if the test matrix is available)

				      id: filter

				      uses: ./.github/actions/filter-test-configs

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				        test-matrix: ${{ inputs.test-matrix }}

				        job-name: ${{ steps.get-job-id.outputs.job-name }}

				    - name: Download pytest cache

				      uses: ./.github/actions/pytest-cache-download

				      continue-on-error: true

				      with:

				        cache_dir: .pytest_cache

				        job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}

				        s3_bucket: ${{ inputs.s3-bucket }}

				    - name: Build

				      if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''

				      id: build

				      env:

				        BUILD_ENVIRONMENT: ${{ inputs.build-environment }}

				        BRANCH: ${{ steps.parse-ref.outputs.branch }}

				        # TODO duplicated

				        AWS_DEFAULT_REGION: us-east-1

				        PR_NUMBER: ${{ github.event.pull_request.number }}

				        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}

				        XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

				        PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}

				        TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}

				        DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				        XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}

				        DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}

				        OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				        HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}

				      shell: bash

				      run: |

				        # detached container should get cleaned up by teardown_ec2_linux

				        container_name=$(docker run \

				          -e BUILD_ENVIRONMENT \

				          -e MAX_JOBS="$(nproc --ignore=2)" \

				          -e AWS_DEFAULT_REGION \

				          -e PR_NUMBER \

				          -e SHA1 \

				          -e BRANCH \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_S3_KEY_PREFIX \

				          -e XLA_CUDA \

				          -e XLA_CLANG_CACHE_S3_BUCKET_NAME \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          -e TORCH_CUDA_ARCH_LIST \

				          -e PR_LABELS \

				          -e OUR_GITHUB_JOB_ID \

				          -e HUGGING_FACE_HUB_TOKEN \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				          --tty \

				          --detach \

				          --user jenkins \

				          -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \

				          -w /var/lib/jenkins/workspace \

				          "${DOCKER_IMAGE}"

				        )

				        docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'

				    - name: Archive artifacts into zip

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'

				      shell: bash

				      run: |

				        zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files

				    - name: Store PyTorch Build Artifacts on S3

				      uses: seemethere/upload-artifact-s3@v5

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'

				      with:

				        name: ${{ inputs.build-environment }}

				        retention-days: 14

				        if-no-files-found: error

				        path: artifacts.zip

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Upload sccache stats

				      if: steps.build.outcome != 'skipped'

				      uses: seemethere/upload-artifact-s3@v5

				      with:

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 365

				        if-no-files-found: warn

				        path: sccache-stats-*.json

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Teardown Linux

				      uses: pytorch/test-infra/.github/actions/teardown-linux@main

				      if: always()
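The `print-ghcr-mirror` step in the removed action derives a public image tag with two shell parameter expansions: `${ECR_DOCKER_IMAGE##*/}` drops everything up to the last `/`, and `${tag/:/-}` replaces the first `:` with `-`. A minimal Python sketch of that transformation, assuming a hypothetical helper name `ghcr_pull_command` and a made-up image reference:

```python
def ghcr_pull_command(ecr_docker_image: str) -> str:
    """Build the `docker pull` line that the print-ghcr-mirror step echoes."""
    # Equivalent of ${ECR_DOCKER_IMAGE##*/}: keep what follows the last slash.
    tag = ecr_docker_image.rsplit("/", 1)[-1]
    # Equivalent of ${tag/:/-}: replace only the first colon.
    tag = tag.replace(":", "-", 1)
    return f"docker pull ghcr.io/pytorch/ci-image:{tag}"


# The registry host and image name below are illustrative, not real values.
print(ghcr_pull_command("registry.example.invalid/pytorch/some-image:abc123"))
```

An input without any `/` passes through unchanged apart from the colon replacement, which matches the behavior of the shell expansion.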

									
										1

.github/actions/linux-test/action.yml
									
										vendored
									
				@ -167,6 +167,7 @@ runs:

				        REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}

				        CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}

				        VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}

				        TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}

				        NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}

				        NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}

				        TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

									
										11

.github/actions/test-pytorch-binary/action.yml
									
										vendored
									
				@ -26,6 +26,7 @@ runs:

				          -e PYTORCH_FINAL_PACKAGE_DIR \

				          -e PYTORCH_ROOT \

				          -e SKIP_ALL_TESTS \

				          -e USE_SPLIT_BUILD \

				          --tty \

				          --detach \

				          -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

				@ -35,7 +36,8 @@ runs:

				          "${DOCKER_IMAGE}"

				        )

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then

				        echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" && "${GPU_ARCH_TYPE}" != "xpu" ]]; then

				          # Propagate download.pytorch.org IP to container. This is only needed on Linux non aarch64 runner

				          grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"

				        fi

				@ -46,10 +48,9 @@ runs:

				        docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"

				    - name: Cleanup docker

				      if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'

				      if: always() && (env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel' || env.GPU_ARCH_TYPE == 'xpu')

				      shell: bash

				      run: |

				        # on s390x stop the container for clean worker stop

				        # ignore expansion of "docker ps -q" since it could be empty

				        # on s390x or xpu stop the container for clean worker stop

				        # shellcheck disable=SC2046

				        docker stop $(docker ps -q) || true

				        docker stop "${{ env.CONTAINER_NAME }}" || true

2

.github/ci_commit_pins/audio.txt vendored

 @ -1 +1 @@
 b829e936f7cc61b48149f5f957a451a38bf2a178
 b3f6f511f2a1082bd56b13a3f6794e7fc3ba4862

2

.github/ci_commit_pins/torchbench.txt vendored

 @ -1 +1 @@
 d6015d42d9a1834bc7595c4bd6852562fb80b30b
 dbebd44a11eb84afbf53c3c071dd105297e

2

.github/ci_commit_pins/xla.txt vendored

 @ -1 +1 @@
 f0b61e5d782913a0fc7743812f2a8e522189111
 ea4535f0699f366adb554183a65ebf7dc34a8be

									
										227

.github/lf-canary-scale-config.yml
									
										vendored
									
				@ -1,13 +1,23 @@

				# Defines runner types that will be provisioned by by LF Self-hosted

				# runners for pytorch/pytorch-canary and their labels.

				# This file is generated by .github/scripts/validate_scale_config.py in test-infra

				# It defines runner types that will be provisioned by by LF Self-hosted runners

				# scale-config.yml:

				#   Powers what instance types are available for GHA auto-scaled

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# Default values:

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls

				#                     to avoid RequestLimitExceeded issues

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				# NOTE: Default values,

				#

				# runner_types:

				#   runner_label: # label to specify in the Github Actions workflow

				#   runner_label:

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				@ -21,17 +31,29 @@ runner_types:

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.c.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				  lf.c.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.c.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.c.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 20

				    os: linux

				  lf.c.linux.12xlarge.ephemeral:

				    disk_size: 200

				@ -43,7 +65,7 @@ runner_types:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.c.linux.24xlarge:

				    disk_size: 150

				@ -67,7 +89,7 @@ runner_types:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    max_available: 1000

				    os: linux

				  lf.c.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				@ -79,19 +101,19 @@ runner_types:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    max_available: 250

				    os: linux

				  lf.c.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    max_available: 300

				    os: linux

				  lf.c.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    max_available: 200

				    os: linux

				  lf.c.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				@ -103,9 +125,16 @@ runner_types:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    max_available: 2400

				    os: linux

				  lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				@ -116,11 +145,17 @@ runner_types:

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.c.linux.arm64.m7g.2xlarge:

				  lf.c.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 20

				    max_available: 200

				    os: linux

				  lf.c.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				  lf.c.windows.4xlarge:

				    disk_size: 256

				@ -138,7 +173,7 @@ runner_types:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    max_available: 300

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				@ -152,3 +187,159 @@ runner_types:

				    is_ephemeral: false

				    max_available: 250

				    os: windows

				  ### Setup runner types to test the Amazon Linux 2023 AMI

				  lf.c.amz2023.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				  lf.c.amz2023.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				  lf.c.amz2023.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
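The `amz2023` runner types added above appear to repeat existing Linux entries with `amz2023` inserted before the `linux` label segment and an `ami` key appended (the arm64 AMI for `arm64` labels, the x86_64 AMI otherwise). The Python sketch below illustrates that pattern only. The helper `amz2023_variant` is hypothetical, and nothing here is taken from the actual generator script in test-infra:

```python
X86_AMI = "al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64"
ARM_AMI = "al2023-ami-2023.5.20240701.0-kernel-6.1-arm64"


def amz2023_variant(label: str, spec: dict) -> tuple[str, dict]:
    """Derive the Amazon Linux 2023 twin of a Linux runner type (hypothetical helper)."""
    if spec.get("os") != "linux":
        raise ValueError(f"{label} is not a Linux runner")
    # Insert "amz2023" directly before the "linux" segment of the label.
    new_label = label.replace(".linux.", ".amz2023.linux.", 1)
    # Labels containing an arm64 segment get the arm64 AMI, all others x86_64.
    ami = ARM_AMI if ".arm64." in label else X86_AMI
    return new_label, {**spec, "ami": ami}


base = {
    "disk_size": 256,
    "instance_type": "m7g.metal",
    "is_ephemeral": False,
    "max_available": 100,
    "os": "linux",
}
label, spec = amz2023_variant("lf.c.linux.arm64.m7g.metal", base)
print(label, spec["ami"])
```

Copying the base spec with `{**spec, "ami": ami}` leaves the original entry untouched, so both variants can sit side by side in the same `runner_types` mapping.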

									
										227

.github/lf-scale-config.yml
									
										vendored
									
				@ -1,13 +1,23 @@

				# Defines runner types that will be provisioned by by LF Self-hosted

				# runners for pytorch/pytorch and their labels.

				# This file is generated by .github/scripts/validate_scale_config.py in test-infra

				# It defines runner types that will be provisioned by by LF Self-hosted runners

				# scale-config.yml:

				#   Powers what instance types are available for GHA auto-scaled

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# Default values:

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls

				#                     to avoid RequestLimitExceeded issues

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				# NOTE: Default values,

				#

				# runner_types:

				#   runner_label: # label to specify in the Github Actions workflow

				#   runner_label:

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				@ -21,17 +31,29 @@ runner_types:

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				  lf.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 20

				    os: linux

				  lf.linux.12xlarge.ephemeral:

				    disk_size: 200

				@ -43,7 +65,7 @@ runner_types:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    max_available: 150

				    os: linux

				  lf.linux.24xlarge:

				    disk_size: 150

				@ -67,7 +89,7 @@ runner_types:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    max_available: 1000

				    os: linux

				  lf.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				@ -79,19 +101,19 @@ runner_types:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    max_available: 250

				    os: linux

				  lf.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    max_available: 300

				    os: linux

				  lf.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    max_available: 200

				    os: linux

				  lf.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				@ -103,9 +125,16 @@ runner_types:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    max_available: 2400

				    os: linux

				  lf.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				@ -116,11 +145,17 @@ runner_types:

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.linux.arm64.m7g.2xlarge:

				  lf.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 20

				    max_available: 200

				    os: linux

				  lf.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				  lf.windows.4xlarge:

				    disk_size: 256

				@ -138,7 +173,7 @@ runner_types:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    max_available: 300

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				@ -152,3 +187,159 @@ runner_types:

				    is_ephemeral: false

				    max_available: 250

				    os: windows

				  ### Setup runner types to test the Amazon Linux 2023 AMI

				  lf.amz2023.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				  lf.amz2023.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				  lf.amz2023.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

									
.github/merge_rules.yaml
												
				@ -27,11 +27,9 @@

				  - third_party/onnx

				  - caffe2/python/onnx/**

				  approved_by:

				  - BowenBao

				  - justinchuby

				  - liqunfu

				  - shubhambhokare1

				  - thiagocrepaldi

				  - titaiwangms

				  - wschin

				  - xadupre

				@ -244,6 +242,7 @@

				  - torch/csrc/xpu/**

				  - torch/xpu/**

				  - test/xpu/**

				  - test/test_xpu.py

				  - third_party/xpu.txt

				  - .ci/docker/ci_commit_pins/triton-xpu.txt

				  approved_by:

				@ -287,6 +286,7 @@

				  - test/cpp/dist_autograd/**

				  - test/cpp/rpc/**

				  approved_by:

				  - wconstab

				  - mrshenli

				  - pritamdamania87

				  - zhaojuanmao

				@ -313,6 +313,25 @@

				  - Lint

				  - pull

				- name: DCP

				  patterns:

				  - torch/distributed/checkpoint/**

				  approved_by:

				  - LucasLLC

				  - fegin

				  - wz337

				  - saumishr

				  - daulet-askarov

				  - pradeepdfb

				  - kirtiteja

				  - mhorowitz

				  - saiteja64

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				- name: IDEEP

				  patterns:

				  - third_party/ideep

				@ -376,13 +395,21 @@

				- name: CPU inductor

				  patterns:

				  - torch/_inductor/mkldnn_ir.py

				  - torch/_inductor/mkldnn_lowerings.py

				  - torch/_inductor/fx_passes/mkldnn_fusion.py

				  - torch/_inductor/fx_passes/quantization.py

				  - torch/_inductor/codegen/cpp_prefix.h

				  - torch/_inductor/codegen/cpp.py

				  - torch/_inductor/codegen/cpp_utils.py

				  - torch/_inductor/codegen/cpp_micro_gemm.py

				  - torch/_inductor/codegen/cpp_template_kernel.py

				  - torch/_inductor/codegen/cpp_template.py

				  - torch/_inductor/codegen/cpp_gemm_template.py

				  - test/inductor/test_mkldnn_pattern_matcher.py

				  - test/inductor/test_cpu_repo.py

				  - test/inductor/test_cpu_repro.py

				  - test/inductor/test_cpu_cpp_wrapper.py

				  - test/inductor/test_cpu_select_algorithm.py

				  - aten/src/ATen/cpu/**

				  - aten/src/ATen/native/quantized/cpu/**

				  - test/quantization/core/test_quantized_op.py

				@ -496,6 +523,13 @@

				  - Skylion007

				  - ngimel

				  - peterbell10

				  - eqy

				  - jansel

				  - jeffdaily

				  - eellison

				  - anijain2305

				  - bdhirsh

				  - zou3519

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				@ -510,6 +544,8 @@

				  - ezyang

				  - dzhulgakov

				  - malfet

				  - albanD

				  - ptrblck

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

									
.github/pytorch-probot.yml
												
				@ -6,6 +6,7 @@ ciflow_push_tags:

				- ciflow/binaries_libtorch

				- ciflow/binaries_wheel

				- ciflow/inductor

				- ciflow/inductor-rocm

				- ciflow/inductor-perf-compare

				- ciflow/inductor-micro-benchmark

				- ciflow/inductor-cu124

				@ -26,3 +27,4 @@ retryable_workflows:

				- windows-binary

				labeler_config: labeler.yml

				label_to_label_config: label_to_label.yml

				mergebot: True

.github/requirements/pip-requirements-iOS.txt

 @ -1,4 +1,4 @@
 # iOS simulator requirements
 coremltools==5.0b5
 protobuf==3.20.2
 optree==0.11.0
 optree==0.12.1

.github/requirements/pip-requirements-macOS.txt

 @ -17,16 +17,16 @@ pytest-xdist==3.3.1
 pytest-rerunfailures==10.3
 pytest-flakefinder==1.1.0
 scipy==1.10.1
 sympy==1.11.1
 sympy==1.12.1 ; python_version == "3.8"
 sympy>=1.13.0 ; python_version >= "3.9"
 unittest-xml-reporting<=3.2.0,>=2.0.0
 xdoctest==1.1.0
 filelock==3.6.0
 sympy==1.11.1
 pytest-cpp==2.3.0
 rockset==1.0.3
 z3-solver==4.12.2.0
 tensorboard==2.13.0
 optree==0.11.0
 optree==0.12.1
 # NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
 # which the stringify metadata is wrong when escaping double quote
 protobuf==3.20.2

									
.github/scripts/amd/package_triton_wheel.sh
												
				@ -93,6 +93,8 @@ done

				# Copy Include Files

				cp -r $ROCM_HOME/include/hip $TRITON_ROCM_DIR/include

				cp -r $ROCM_HOME/include/roctracer $TRITON_ROCM_DIR/include

				cp -r $ROCM_HOME/include/hsa $TRITON_ROCM_DIR/include

				# Copy linker

				mkdir -p $TRITON_ROCM_DIR/llvm/bin

									
.github/scripts/build_triton_wheel.py
												
				@ -1,4 +1,5 @@

				#!/usr/bin/env python3

				import os

				import shutil

				import sys

				@ -7,12 +8,17 @@ from subprocess import check_call

				from tempfile import TemporaryDirectory

				from typing import Optional

				SCRIPT_DIR = Path(__file__).parent

				REPO_DIR = SCRIPT_DIR.parent.parent

				def read_triton_pin(rocm_hash: bool = False) -> str:

				    triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"

				def read_triton_pin(device: str = "cuda") -> str:

				    triton_file = "triton.txt"

				    if device == "rocm":

				        triton_file = "triton-rocm.txt"

				    elif device == "xpu":

				        triton_file = "triton-xpu.txt"

				    with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:

				        return f.read().strip()

				@ -49,7 +55,7 @@ def build_triton(

				    version: str,

				    commit_hash: str,

				    build_conda: bool = False,

				    build_rocm: bool = False,

				    device: str = "cuda",

				    py_version: Optional[str] = None,

				    release: bool = False,

				) -> Path:

				@ -69,11 +75,14 @@ def build_triton(

				        triton_basedir = Path(tmpdir) / "triton"

				        triton_pythondir = triton_basedir / "python"

				        triton_repo = "https://github.com/openai/triton"

				        if build_rocm:

				        if device == "rocm":

				            triton_pkg_name = "pytorch-triton-rocm"

				        elif device == "xpu":

				            triton_pkg_name = "pytorch-triton-xpu"

				            triton_repo = "https://github.com/intel/intel-xpu-backend-for-triton"

				        else:

				            triton_pkg_name = "pytorch-triton"

				        check_call(["git", "clone", triton_repo], cwd=tmpdir)

				        check_call(["git", "clone", triton_repo, "triton"], cwd=tmpdir)

				        if release:

				            ver, rev, patch = version.split(".")

				            check_call(

				@ -140,7 +149,7 @@ def build_triton(

				            expected_version=None,

				        )

				        if build_rocm:

				        if device == "rocm":

				            check_call(

				                [f"{SCRIPT_DIR}/amd/package_triton_wheel.sh"],

				                cwd=triton_basedir,

				@ -155,7 +164,7 @@ def build_triton(

				        whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))

				        shutil.copy(whl_path, Path.cwd())

				        if build_rocm:

				        if device == "rocm":

				            check_call(

				                [f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],

				                cwd=triton_basedir,

				@ -170,17 +179,19 @@ def main() -> None:

				    parser = ArgumentParser("Build Triton binaries")

				    parser.add_argument("--release", action="store_true")

				    parser.add_argument("--build-conda", action="store_true")

				    parser.add_argument("--build-rocm", action="store_true")

				    parser.add_argument(

				        "--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]

				    )

				    parser.add_argument("--py-version", type=str)

				    parser.add_argument("--commit-hash", type=str)

				    parser.add_argument("--triton-version", type=str, default=read_triton_version())

				    args = parser.parse_args()

				    build_triton(

				        build_rocm=args.build_rocm,

				        device=args.device,

				        commit_hash=args.commit_hash

				        if args.commit_hash

				        else read_triton_pin(args.build_rocm),

				        else read_triton_pin(args.device),

				        version=args.triton_version,

				        build_conda=args.build_conda,

				        py_version=args.py_version,
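The hunks above replace the boolean `rocm_hash` / `build_rocm` switches with a single `device` string. As a rough illustration only, the same selection could be written as a table lookup. The names `DEVICE_TABLE`, `pin_file_for` and `package_name_for` are hypothetical and do not appear in the script; the file and package names are the ones shown in the diff, and unknown devices fall back to the CUDA defaults, as the `else` branches above do.

```python
from pathlib import Path
from typing import Dict, Tuple

# device -> (commit-pin file name, wheel package name), as listed in the diff.
# The table itself is an illustration; the script uses if/elif branches.
DEVICE_TABLE: Dict[str, Tuple[str, str]] = {
    "cuda": ("triton.txt", "pytorch-triton"),
    "rocm": ("triton-rocm.txt", "pytorch-triton-rocm"),
    "xpu": ("triton-xpu.txt", "pytorch-triton-xpu"),
}


def pin_file_for(device: str = "cuda") -> Path:
    """Hypothetical helper: repo-relative path of the pin file for a device."""
    pin_name, _ = DEVICE_TABLE.get(device, DEVICE_TABLE["cuda"])
    return Path(".ci") / "docker" / "ci_commit_pins" / pin_name


def package_name_for(device: str = "cuda") -> str:
    """Hypothetical helper: name of the wheel package built for a device."""
    _, pkg_name = DEVICE_TABLE.get(device, DEVICE_TABLE["cuda"])
    return pkg_name


print(pin_file_for("xpu"))
print(package_name_for("rocm"))
```

A lookup table keeps the pin file and package name for each device in one place, at the cost of hiding the XPU-specific repository URL, which the real script sets in the same branch.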

									
.github/scripts/check_labels.py
												
				@ -5,7 +5,6 @@ import sys

				from typing import Any

				from github_utils import gh_delete_comment, gh_post_pr_comment

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from label_utils import has_required_labels, is_label_err_comment, LABEL_ERR_MSG

				from trymerge import GitHubPR

									
.github/scripts/cherry_pick.py
												
				@ -3,12 +3,10 @@

				import json

				import os

				import re

				from typing import Any, Optional

				from typing import Any, cast, Dict, List, Optional

				from urllib.error import HTTPError

				from github_utils import gh_fetch_url, gh_post_pr_comment

				from github_utils import gh_fetch_url, gh_post_pr_comment, gh_query_issues_by_labels

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from trymerge import get_pr_commit_sha, GitHubPR

				@ -19,6 +17,7 @@ REQUIRES_ISSUE = {

				    "critical",

				    "fixnewfeature",

				}

				RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")

				def parse_args() -> Any:

				@ -58,6 +57,33 @@ def get_merge_commit_sha(repo: GitRepo, pr: GitHubPR) -> Optional[str]:

				    return commit_sha if pr.is_closed() else None

				def get_release_version(onto_branch: str) -> Optional[str]:

				    """

				    Return the release version if the target branch is a release branch

				    """

				    m = re.match(RELEASE_BRANCH_REGEX, onto_branch)

				    return m.group("version") if m else ""

				def get_tracker_issues(

				    org: str, project: str, onto_branch: str

				) -> List[Dict[str, Any]]:

				    """

				    Find the tracker issue from the repo. The tracker issue needs to have the title

				    like [VERSION] Release Tracker following the convention on PyTorch

				    """

				    version = get_release_version(onto_branch)

				    if not version:

				        return []

				    tracker_issues = gh_query_issues_by_labels(org, project, labels=["release tracker"])

				    if not tracker_issues:

				        return []

				    # Figure out the tracker issue from the list by looking at the title

				    return [issue for issue in tracker_issues if version in issue.get("title", "")]

				def cherry_pick(

				    github_actor: str,

				    repo: GitRepo,

				@ -77,17 +103,49 @@ def cherry_pick(

				    )

				    try:

				        org, project = repo.gh_owner_and_name()

				        cherry_pick_pr = ""

				        if not dry_run:

				            org, project = repo.gh_owner_and_name()

				            cherry_pick_pr = submit_pr(repo, pr, cherry_pick_branch, onto_branch)

				            msg = f"The cherry pick PR is at {cherry_pick_pr}"

				            if fixes:

				                msg += f" and it is linked with issue {fixes}"

				            elif classification in REQUIRES_ISSUE:

				                msg += f" and it is recommended to link a {classification} cherry pick PR with an issue"

				        tracker_issues_comments = []

				        tracker_issues = get_tracker_issues(org, project, onto_branch)

				        for issue in tracker_issues:

				            issue_number = int(str(issue.get("number", "0")))

				            if not issue_number:

				                continue

				            post_comment(org, project, pr.pr_num, msg)

				            res = cast(

				                Dict[str, Any],

				                post_tracker_issue_comment(

				                    org,

				                    project,

				                    issue_number,

				                    pr.pr_num,

				                    cherry_pick_pr,

				                    classification,

				                    fixes,

				                    dry_run,

				                ),

				            )

				            comment_url = res.get("html_url", "")

				            if comment_url:

				                tracker_issues_comments.append(comment_url)

				        msg = f"The cherry pick PR is at {cherry_pick_pr}"

				        if fixes:

				            msg += f" and it is linked with issue {fixes}."

				        elif classification in REQUIRES_ISSUE:

				            msg += f" and it is recommended to link a {classification} cherry pick PR with an issue."

				        if tracker_issues_comments:

				            msg += " The following tracker issues are updated:\n"

				            for tracker_issues_comment in tracker_issues_comments:

				                msg += f"* {tracker_issues_comment}\n"

				        post_pr_comment(org, project, pr.pr_num, msg, dry_run)

				    finally:

				        if current_branch:

				@ -159,7 +217,9 @@ def submit_pr(

				        raise RuntimeError(msg) from error

				def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:

				def post_pr_comment(

				    org: str, project: str, pr_num: int, msg: str, dry_run: bool = False

				) -> List[Dict[str, Any]]:

				    """

				    Post a comment on the PR itself that points to the cherry-pick PR on success

				    or reports the error on failure

				@ -182,7 +242,35 @@ def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:

				    comment = "\n".join(

				        (f"### Cherry picking #{pr_num}", f"{msg}", "", f"{internal_debugging}")

				    )

				    gh_post_pr_comment(org, project, pr_num, comment)

				    return gh_post_pr_comment(org, project, pr_num, comment, dry_run)

				def post_tracker_issue_comment(

				    org: str,

				    project: str,

				    issue_num: int,

				    pr_num: int,

				    cherry_pick_pr: str,

				    classification: str,

				    fixes: str,

				    dry_run: bool = False,

				) -> List[Dict[str, Any]]:

				    """

				    Post a comment on the tracker issue (if any) to record the cherry pick

				    """

				    comment = "\n".join(

				        (

				            "Link to landed trunk PR (if applicable):",

				            f"* https://github.com/{org}/{project}/pull/{pr_num}",

				            "",

				            "Link to release branch PR:",

				            f"* {cherry_pick_pr}",

				            "",

				            "Criteria Category:",

				            " - ".join((classification.capitalize(), fixes.capitalize())),

				        )

				    )

				    return gh_post_pr_comment(org, project, issue_num, comment, dry_run)

				def main() -> None:

				@ -214,7 +302,7 @@ def main() -> None:

				    except RuntimeError as error:

				        if not args.dry_run:

				            post_comment(org, project, pr_num, str(error))

				            post_pr_comment(org, project, pr_num, str(error))

				        else:

				            raise error
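The tracker lookup added in this file has two steps: extract the version from a `release/<version>` branch name, then keep the issues whose title contains that version. Below is a minimal offline sketch of those two steps, assuming the issues have already been fetched. `SAMPLE_ISSUES` is made-up data, `filter_tracker_issues` is a hypothetical name, and this sketch returns `None` for non-release branches where the diff returns an empty string (both are falsy, so the caller behaves the same).

```python
import re
from typing import Any, Dict, List, Optional

RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")

# Made-up stand-in for the result of a "release tracker" label query.
SAMPLE_ISSUES: List[Dict[str, Any]] = [
    {"number": 1, "title": "[v2.4.1] Release Tracker"},
    {"number": 2, "title": "[v2.3.0] Release Tracker"},
]


def release_version(onto_branch: str) -> Optional[str]:
    """Return the version part of a release branch name, else None."""
    m = RELEASE_BRANCH_REGEX.match(onto_branch)
    return m.group("version") if m else None


def filter_tracker_issues(
    issues: List[Dict[str, Any]], onto_branch: str
) -> List[Dict[str, Any]]:
    """Keep the issues whose title mentions the branch's release version."""
    version = release_version(onto_branch)
    if not version:
        return []
    return [issue for issue in issues if version in issue.get("title", "")]


print(filter_tracker_issues(SAMPLE_ISSUES, "release/2.4"))
print(filter_tracker_issues(SAMPLE_ISSUES, "main"))
```

Because the match is a plain substring test on the title, a branch such as `release/2.4` selects every tracker whose title contains `2.4`, including patch releases.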

									
.github/scripts/close_nonexistent_disable_issues.py
												
				@ -10,6 +10,7 @@ import requests

				import rockset  # type: ignore[import]

				from gitutils import retries_decorator

				LOGS_QUERY = """

				with

				    shas as (

									
.github/scripts/collect_ciflow_labels.py
												
				@ -1,10 +1,12 @@

				#!/usr/bin/env python3

				import sys

				from pathlib import Path

				from typing import Any, cast, Dict, List, Set

				import yaml

				GITHUB_DIR = Path(__file__).parent.parent

									
.github/scripts/convert_lintrunner_annotations_to_github.py
												
				@ -1,7 +1,6 @@

				import json

				import subprocess

				import sys

				from enum import Enum

				from pathlib import Path

				from typing import NamedTuple, Optional

									
.github/scripts/delete_old_branches.py
												
				@ -9,6 +9,7 @@ from typing import Any, Callable, Dict, List, Set

				from github_utils import gh_fetch_json_dict, gh_graphql

				from gitutils import GitRepo

				SEC_IN_DAY = 24 * 60 * 60

				CLOSED_PR_RETENTION = 30 * SEC_IN_DAY

				NO_PR_RETENTION = 1.5 * 365 * SEC_IN_DAY

.github/scripts/drci_mocks.json.gz

Binary file not shown.

									
.github/scripts/ensure_actions_will_cancel.py
												
				@ -1,7 +1,6 @@

				#!/usr/bin/env python3

				import sys

				from pathlib import Path

				import yaml

									
.github/scripts/export_pytorch_labels.py
												
				@ -14,7 +14,6 @@ import json

				from typing import Any

				import boto3  # type: ignore[import]

				from label_utils import gh_get_labels

									
.github/scripts/filter_test_configs.py
												
				@ -15,6 +15,7 @@ from urllib.request import Request, urlopen

				import yaml

				REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"

				PREFIX = "test-config/"

				@ -504,6 +505,9 @@ def perform_misc_tasks(

				        "ci-verbose-test-logs",

				        check_for_setting(labels, pr_body, "ci-verbose-test-logs"),

				    )

				    set_output(

				        "ci-test-showlocals", check_for_setting(labels, pr_body, "ci-test-showlocals")

				    )

				    set_output(

				        "ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")

				    )
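The `REENABLE_TEST_REGEX` shown in the first hunk of this file matches closing keywords followed by an issue reference. The hunk does not show how the script consumes the matches, so the following is only a sketch of what the pattern accepts. `referenced_issue_numbers` is a hypothetical helper, and the sample text is invented.

```python
import re
from typing import List

# Pattern copied from the hunk above.
REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"


def referenced_issue_numbers(text: str) -> List[int]:
    """Hypothetical helper: issue numbers named after a closing keyword.

    Group 6 of the pattern is the run of digits after '#' or the issue URL.
    """
    return [int(m.group(6)) for m in re.finditer(REENABLE_TEST_REGEX, text)]


sample = (
    "Fixes #123 and resolved "
    "https://github.com/pytorch/pytorch/issues/456, but fix 789 is not a reference."
)
print(referenced_issue_numbers(sample))
```

The keyword match is case-insensitive because of the leading `(?i)`, and a bare number without `#` or the full issue URL is ignored.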

									
.github/scripts/generate_binary_build_matrix.py
												
				@ -8,11 +8,13 @@ architectures:

				    * CPU

				    * Latest CUDA

				    * Latest ROCM

				    * Latest XPU

				"""

				import os

				from typing import Dict, List, Optional, Tuple

				CUDA_ARCHES = ["11.8", "12.1", "12.4"]

				@ -24,6 +26,7 @@ CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}

				ROCM_ARCHES = ["6.0", "6.1"]

				XPU_ARCHES = ["xpu"]

				CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]

				@ -48,7 +51,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.1": (

				@ -61,7 +64,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.4": (

				@ -74,7 +77,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				@ -132,6 +135,8 @@ def arch_type(arch_version: str) -> str:

				        return "cuda"

				    elif arch_version in ROCM_ARCHES:

				        return "rocm"

				    elif arch_version in XPU_ARCHES:

				        return "xpu"

				    elif arch_version in CPU_CXX11_ABI_ARCH:

				        return "cpu-cxx11-abi"

				    elif arch_version in CPU_AARCH64_ARCH:

				@ -156,6 +161,7 @@ WHEEL_CONTAINER_IMAGES = {

				        gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"

				        for gpu_arch in ROCM_ARCHES

				    },

				    "xpu": f"pytorch/manylinux2_28-builder:xpu-{DEFAULT_TAG}",

				    "cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",

				    "cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",

				    "cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",

				@ -209,7 +215,7 @@ LIBTORCH_CONTAINER_IMAGES: Dict[Tuple[str, str], str] = {

				    ("cpu", CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cpu-{DEFAULT_TAG}",

				}

				FULL_PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11", "3.12"]

				FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12"]

				def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

				@ -221,6 +227,7 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

				        "cuda": f"cu{gpu_arch_version.replace('.', '')}",

				        "cuda-aarch64": "cu124",

				        "rocm": f"rocm{gpu_arch_version}",

				        "xpu": "xpu",

				    }.get(gpu_arch_type, gpu_arch_version)

				@ -325,13 +332,13 @@ def generate_wheels_matrix(

				        package_type = "manywheel"

				    if python_versions is None:

				        python_versions = FULL_PYTHON_VERSIONS

				        python_versions = FULL_PYTHON_VERSIONS + ["3.13"]

				    if arches is None:

				        # Define default compute architectures

				        arches = ["cpu"]

				        if os == "linux":

				            arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES

				            arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES

				        elif os == "windows":

				            arches += CUDA_ARCHES

				        elif os == "linux-aarch64":

				@ -347,10 +354,6 @@ def generate_wheels_matrix(

				    for python_version in python_versions:

				        for arch_version in arches:

				            gpu_arch_type = arch_type(arch_version)

				            # Disable py3.12 builds for ROCm because of triton dependency

				            # on llnl-hatchet, which doesn't have py3.12 wheels available

				            if gpu_arch_type == "rocm" and python_version == "3.12":

				                continue

				            gpu_arch_version = (

				                ""

				                if arch_version == "cpu"

				@ -358,9 +361,16 @@ def generate_wheels_matrix(

				                or arch_version == "cpu-aarch64"

				                or arch_version == "cpu-s390x"

				                or arch_version == "cuda-aarch64"

				                or arch_version == "xpu"

				                else arch_version

				            )

				            # TODO: Enable python 3.13 on rocm, xpu, aarch64, windows

				            if (

				                gpu_arch_type in ["rocm", "xpu"] or os != "linux"

				            ) and python_version == "3.13":

				                continue

				            # 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install

				            if (

				                arch_version in ["12.4", "12.1", "11.8"]

				@ -390,6 +400,49 @@ def generate_wheels_matrix(

				                        ),

				                    }

				                )

				                if arch_version != "cuda-aarch64":

				                    ret.append(

				                        {

				                            "python_version": python_version,

				                            "gpu_arch_type": gpu_arch_type,

				                            "gpu_arch_version": gpu_arch_version,

				                            "desired_cuda": translate_desired_cuda(

				                                gpu_arch_type, gpu_arch_version

				                            ),

				                            "use_split_build": "True",

				                            "devtoolset": "",

				                            "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                            "package_type": package_type,

				                            "pytorch_extra_install_requirements": (

				                                PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                                if os != "linux-aarch64"

				                                else ""

				                            ),

				                            "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace(  # noqa: B950

				                                ".", "_"

				                            ),

				                        }

				                    )

				                    # Special build to use on Colab: Python 3.10 for CUDA 12.1

				                    if python_version == "3.10" and arch_version == "12.1":

				                        ret.append(

				                            {

				                                "python_version": python_version,

				                                "gpu_arch_type": gpu_arch_type,

				                                "gpu_arch_version": gpu_arch_version,

				                                "desired_cuda": translate_desired_cuda(

				                                    gpu_arch_type, gpu_arch_version

				                                ),

				                                "use_split_build": "False",

				                                "devtoolset": "",

				                                "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                                "package_type": package_type,

				                                "pytorch_extra_install_requirements": "",

				                                "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace(  # noqa: B950

				                                    ".", "_"

				                                ),

				                            }

				                        )

				            else:

				                ret.append(

				                    {

				@@ -400,7 +453,9 @@ def generate_wheels_matrix(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "devtoolset": (

				-                           "cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""

				+                           "cxx11-abi"

				+                           if arch_version in ["cpu-cxx11-abi", "xpu"]

				+                           else ""

				                        ),

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

									
2 .github/scripts/generate_ci_workflows.py vendored

				@@ -8,9 +8,9 @@ from typing import Dict, Iterable, List, Literal, Set

				from typing_extensions import TypedDict  # Python 3.11+

				import generate_binary_build_matrix  # type: ignore[import]

				import jinja2

				Arch = Literal["windows", "linux", "macos"]

				GITHUB_DIR = Path(__file__).resolve().parent.parent

									
1 .github/scripts/generate_docker_release_matrix.py vendored

				@@ -16,6 +16,7 @@ from typing import Dict, List

				import generate_binary_build_matrix

				DOCKER_IMAGE_TYPES = ["runtime", "devel"]

									
2 .github/scripts/generate_pytorch_version.py vendored

				@@ -4,11 +4,11 @@ import argparse

				import os

				import re

				import subprocess

				from datetime import datetime

				from distutils.util import strtobool

				from pathlib import Path

				LEADING_V_PATTERN = re.compile("^v")

				TRAILING_RC_PATTERN = re.compile("-rc[0-9]*$")

				LEGACY_BASE_VERSION_SUFFIX_PATTERN = re.compile("a0$")

									
1 .github/scripts/get_workflow_job_id.py vendored

				@@ -11,7 +11,6 @@ import sys

				import time

				import urllib

				import urllib.parse

				from typing import Any, Callable, Dict, List, Optional, Tuple

				from urllib.request import Request, urlopen

Compare commits

2506 Commits cslpull80 ... bf/cg-remo

10 .ci/docker/README.md
6 .ci/docker/aotriton_version.txt
24 .ci/docker/build.sh
2 .ci/docker/ci_commit_pins/executorch.txt
1 .ci/docker/ci_commit_pins/halide.txt Normal file
2 .ci/docker/ci_commit_pins/triton-rocm.txt
2 .ci/docker/ci_commit_pins/triton-xpu.txt
2 .ci/docker/ci_commit_pins/triton.txt
5 .ci/docker/common/aotriton_version.txt Normal file
2 .ci/docker/common/install_aotriton.sh
2 .ci/docker/common/install_conda.sh
20 .ci/docker/common/install_conda_docker.sh Executable file
95 .ci/docker/common/install_cpython.sh Executable file
239 .ci/docker/common/install_cuda.sh Normal file
93 .ci/docker/common/install_cuda_aarch64.sh Normal file
14 .ci/docker/common/install_executorch.sh
46 .ci/docker/common/install_halide.sh Normal file
23 .ci/docker/common/install_libpng.sh Normal file
29 .ci/docker/common/install_magma.sh Normal file
134 .ci/docker/common/install_miopen.sh Normal file
16 .ci/docker/common/install_mkl.sh Normal file
13 .ci/docker/common/install_mnist.sh Normal file
4 .ci/docker/common/install_onnx.sh
22 .ci/docker/common/install_openblas.sh Normal file
16 .ci/docker/common/install_patchelf.sh Normal file
150 .ci/docker/common/install_rocm_drm.sh Normal file
13 .ci/docker/common/install_rocm_magma.sh
110 .ci/docker/common/install_xpu.sh
100 .ci/docker/conda/Dockerfile Normal file
76 .ci/docker/conda/build.sh Executable file
107 .ci/docker/libtorch/Dockerfile Normal file
93 .ci/docker/libtorch/build.sh Executable file
2 .ci/docker/linter-cuda/Dockerfile
202 .ci/docker/manywheel/Dockerfile Normal file
153 .ci/docker/manywheel/Dockerfile_2014 Normal file
153 .ci/docker/manywheel/Dockerfile_2_28 Normal file
57 .ci/docker/manywheel/Dockerfile_2_28_aarch64 Normal file
94 .ci/docker/manywheel/Dockerfile_aarch64 Normal file
91 .ci/docker/manywheel/Dockerfile_cuda_aarch64 Normal file
71 .ci/docker/manywheel/Dockerfile_cxx11-abi Normal file
73 .ci/docker/manywheel/Dockerfile_s390x Normal file
154 .ci/docker/manywheel/build.sh Executable file
131 .ci/docker/manywheel/build_scripts/build.sh Normal file
91 .ci/docker/manywheel/build_scripts/build_utils.sh Executable file
60 .ci/docker/manywheel/build_scripts/manylinux1-check.py Normal file
35 .ci/docker/manywheel/build_scripts/ssl-check.py Normal file
10 .ci/docker/requirements-ci.txt
8 .ci/docker/ubuntu-cuda/Dockerfile
10 .ci/docker/ubuntu/Dockerfile
41 .ci/pytorch/README.md
29 .ci/pytorch/build.sh
46 .ci/pytorch/common_utils.sh
1 .ci/pytorch/create_test_cert.py
6 .ci/pytorch/multigpu-test.sh
1 .ci/pytorch/perf_test/compare_with_baseline.py
1 .ci/pytorch/perf_test/get_stats.py
1 .ci/pytorch/perf_test/update_commit_hash.py
1 .ci/pytorch/print_sccache_log.py
364 .ci/pytorch/test.sh
1 .ci/pytorch/win-test-helpers/run_python_nn_smoketests.py
1 .circleci/codegen_validation/normalize_yaml_fragment.py
27 .circleci/scripts/binary_linux_test.sh
49 .circleci/scripts/binary_populate_env.sh
9 .circleci/scripts/binary_upload.sh
1 .circleci/scripts/trigger_azure_pipeline.py
2 .devcontainer/scripts/install-dev-tools.sh
7 .flake8
4 .git-blame-ignore-revs
30 .github/actionlint.yaml vendored
6 .github/actions/diskspace-cleanup/action.yml vendored
3 .github/actions/filter-test-configs/action.yml vendored
207 .github/actions/linux-build/action.yml vendored
1 .github/actions/linux-test/action.yml vendored
11 .github/actions/test-pytorch-binary/action.yml vendored
2 .github/ci_commit_pins/audio.txt vendored
2 .github/ci_commit_pins/torchbench.txt vendored
2 .github/ci_commit_pins/xla.txt vendored
227 .github/lf-canary-scale-config.yml vendored

227 .github/lf-scale-config.yml vendored