Helion relies on torch/fx/experimental's fake_tensor tracing but does its own dtype checking, which conflicts with the dtype checking that some meta kernels already perform. This PR adds a config so that those dtype checks in meta kernels can be skipped, relying on the calling system to do the dtype checking instead.
Currently it only applies to `baddbmm`, but I expect similar changes will need to be made to other meta kernels in the future.
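A minimal sketch of the idea, assuming a hypothetical toggle named `skip_meta_dtype_checks` (the actual config name and meta-kernel wiring live in the PR):

```python
import torch

# Assumed toggle for illustration; the real config in the PR may differ.
skip_meta_dtype_checks = False

def baddbmm_meta(self, batch1, batch2, *, beta=1, alpha=1):
    if not skip_meta_dtype_checks:
        # Default path: the meta kernel enforces dtype agreement itself.
        assert self.dtype == batch1.dtype == batch2.dtype, "dtype mismatch"
    # Either way, shape propagation still happens; callers like Helion
    # perform their own dtype checking.
    b, m, _ = batch1.shape
    _, _, n = batch2.shape
    return self.new_empty((b, m, n))
```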
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
As titled, this PR improves the device selection logic when the user has not set a device before calling the DeviceMesh constructor; as a device manager, DeviceMesh should try to set the device for users in a sensible way.
The set_device behavior before this PR:
* If the user calls init_process_group to init a world process group, we assume the user has already called set_device, and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.

This is OK, but sometimes the set_device heuristic doesn't work well (e.g., if the user sets CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to the following (sketched after this list):
* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation earlier and therefore has already selected the device themselves.
* Otherwise, we check whether the environment has "LOCAL_RANK" and "WORLD_SIZE" set by the launcher (i.e., torchrun); if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
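A minimal sketch of that decision order, assuming `fallback_heuristic` stands in for the pre-existing heuristic (the real DeviceMesh code differs in structure and detail):

```python
import os
import warnings
import torch

def _select_device(fallback_heuristic):
    if torch.cuda.is_initialized():
        # A CUDA context already exists, so the user has run a CUDA op and,
        # we assume, already picked their device: do nothing.
        return
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Launched via torchrun (or similar): bind this process to its
        # local rank, which also sidesteps CUDA_VISIBLE_DEVICES issues.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return
    # No signal either way: warn and fall back to the old heuristic.
    warnings.warn(
        "No device was set and no launcher env vars were found; "
        "falling back to the old device-selection heuristic."
    )
    fallback_heuristic()
```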
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turned out to saturate only 57 GB/s of bandwidth.
This PR adds more blocks for the intra-node case, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL on Hopper.
It also uses a simple input-size-based tuning to avoid jumping to 8 CTAs directly (i.e., stepping through 1, 2, 4, then 8).
For inter-node, we keep the cap at 8 blocks, since 57 GB/s (456 Gb/s) already exceeds regular NIC bandwidths (400 Gb/s).
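As a rough illustration of the tuning idea (the thresholds below are made-up placeholders, not the values used in the PR):

```python
def blocks_per_peer(bytes_per_peer, intra_node=True,
                    thresholds=(1 << 20, 4 << 20, 16 << 20)):
    # Inter-node traffic is NIC-bound, so more blocks would not help;
    # the total block count stays capped at 8 elsewhere.
    if not intra_node:
        return 1
    # Intra-node: double the block count (1 -> 2 -> 4 -> 8) as the
    # per-peer input size crosses each (assumed) threshold.
    blocks = 1
    for t in thresholds:
        if bytes_per_peer >= t:
            blocks *= 2
    return blocks
```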

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).
I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed to the underlying function.
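A usage sketch of what this enables (mirroring the existing fused Adam behavior):

```python
import torch

model = torch.nn.Linear(4, 4)
# With this change, a 0-dim tensor LR works with the fused CPU Adagrad.
opt = torch.optim.Adagrad(model.parameters(), lr=torch.tensor(0.01), fused=True)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()  # internally, lr.item() is cast to a double and passed along
```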
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
Updates the heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location.
- The goal of the consolidation is to improve the maintainability and readability of the heuristic logic. Instead of having parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where heuristic-based selection for mkldnn already exists.
- The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True

BATCH_SIZES = [1, 8, 32, 64, 128, 256]
M_DIMS = [4, 8, 16, 32, 64, 256, 512]
N_DIMS = [4, 8, 16, 32, 64, 256, 512]
K_DIMS = [4, 8, 16, 32, 64, 256, 512]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m * n * k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b * b * m * n * k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    # Warm up before timing.
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```
From the script we see that the cumulative mean time across all test cases (at 16 threads) is:
- 1.6195 s for the old heuristic
- 0.7012 s for the new heuristic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
In #117066, shutdown of the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down when a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).
#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case, but that fix only takes effect when the agent restarts the training, not when the rendezvous was already shut down beforehand.
Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
Fixes #150916
Fixes #147064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, since it automatically takes care of error handling. In the original implementation we missed releasing the GIL on every possible error path, which could put the program in a deadlock.
Test Plan: Induced error manually and saw that GIL was released
Differential Revision: D74593564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
Summary:
This diff adds a justknobs check for the static cuda launcher. In particular, it supports a fractional rollout where each mast job/version is consistently enrolled with the config on or off.
It also adds a set_feature_use call so we can track whether the static cuda launcher is enabled on a given dynamo compile.
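As a hedged sketch of what "consistently enrolled" means here (the hashing scheme and names below are illustrative, not Meta's justknobs internals):

```python
import hashlib

def enrolled(job_name: str, job_version: str, rollout_fraction: float) -> bool:
    # Hash a stable job identity into [0, 10_000) so the same mast
    # job/version always lands on the same side of the rollout fraction.
    key = f"{job_name}:{job_version}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000
    return bucket < rollout_fraction * 10_000
```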
Test Plan: Existing unit tests. The justknobs in question are currently set to disabled, so this diff does not yet launch the feature.
Differential Revision: D74599203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
This appears to be slow in production (potentially a quadratic explosion), and logging it explicitly in pt2_compile_events and wait_counters makes it much easier to see how bad the issue is.
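For illustration, a hedged sketch of how such a region typically gets surfaced (the wrapped function name is hypothetical, and `dynamo_timed`'s keyword arguments vary across PyTorch versions):

```python
from torch._dynamo.utils import dynamo_timed

def update_node_users(graph):
    # Time the suspect (potentially quadratic) region so it shows up in
    # compile-time telemetry rather than being buried in total compile time.
    with dynamo_timed("update_node_users", log_pt2_compile_event=True):
        ...  # the work being measured
```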
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w