Helion relies on `torch/fx/experimental`'s fake_tensor tracing but does its own dtype checking, which conflicts with the dtype checking that some meta kernels already perform. This PR adds a config so that we can skip those dtype checks in meta kernels and rely on the calling system to do the dtype checking.
Currently it only applies to `baddbmm`, but I expect similar changes will be needed for other meta kernels in the future.
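As a rough sketch of the pattern (the flag and meta function below are hypothetical stand-ins, not the actual config introduced by this PR), the idea is for a meta registration to gate its dtype assertions on a config switch:

```python
# Hypothetical sketch: gate a meta kernel's dtype checks behind a config flag.
# The flag name and function are illustrative, not the real PyTorch config.
import torch

skip_meta_dtype_checks = False  # a caller like Helion could flip this on

def meta_baddbmm_sketch(self, batch1, batch2, *, beta=1, alpha=1):
    if not skip_meta_dtype_checks:
        # Only enforce dtype agreement when the calling system has not taken
        # over dtype checking itself.
        torch._check(
            batch1.dtype == batch2.dtype == self.dtype,
            lambda: "baddbmm: expected all inputs to have the same dtype",
        )
    B, M, _ = batch1.shape
    _, _, N = batch2.shape
    # Shape propagation still happens unconditionally.
    return self.new_empty((B, M, N))
```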
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
As titled, this PR improves the device selection logic when the user has not set the device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for the user in a sensible way.
The previous set_device behavior:
* If the user calls init_process_group to init a world process group, we assume they have already called set_device, and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (i.e. if the user sets CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to:
* If the default CUDA context has already been initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation earlier and therefore has selected the device themselves.
* Otherwise, we check whether the environment variables from the launcher (i.e. torchrun) contain "LOCAL_RANK" and "WORLD_SIZE"; if so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic (see the sketch after this list).
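A minimal sketch of that selection order (standalone pseudologic with an illustrative fallback, not the actual DeviceMesh implementation):

```python
# Sketch of the device-selection order described above; illustrative only,
# not the actual DeviceMesh code.
import os
import warnings

import torch
import torch.distributed as dist

def _select_cuda_device_sketch() -> None:
    if torch.cuda.is_initialized():
        # A CUDA context already exists, so the user has presumably chosen a
        # device themselves (e.g. via torch.cuda.set_device or a prior CUDA op).
        return
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None and "WORLD_SIZE" in os.environ:
        # Launched by torchrun (or a similar launcher): bind each process to
        # its local rank, the standard practice.
        torch.cuda.set_device(int(local_rank))
        return
    # Neither signal is available: warn and fall back to the old heuristic.
    warnings.warn(
        "Could not determine the device to use; falling back to the "
        "rank % device_count heuristic."
    )
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.cuda.set_device(rank % torch.cuda.device_count())
```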
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
Previously, we launched the a2av kernel with at most 8 blocks in the intra-node case, which turns out to saturate only 57 GB/s of bandwidth.
This PR adds more blocks for the intra-node case, up to 8 per peer, increasing data parallelism. The kernel now achieves 350 GB/s SOL on Hopper. See figure.
It also uses simple input-size-based tuning to avoid jumping directly to 8 CTAs per peer (i.e. 1, 2, 4, then 8); a sketch follows below.
For inter-node, we keep the cap at 8 blocks, since 57 GB/s already exceeds regular NIC bandwidths (400 Gb/s, i.e. 50 GB/s).
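A rough sketch of that size-based tuning (the byte thresholds and function name are made-up assumptions, not the values used by the real launcher):

```python
# Illustrative block-count tuning based on input size; thresholds are
# made-up examples, not the values in the actual a2av launcher.
def num_blocks_sketch(bytes_per_peer: int, num_peers: int, intra_node: bool) -> int:
    if not intra_node:
        # Inter-node traffic is NIC-bound, so keep the old cap of 8 blocks total.
        return 8
    # Intra-node: step through 1 -> 2 -> 4 -> 8 blocks per peer as the payload
    # grows, instead of always launching the maximum.
    if bytes_per_peer < 64 * 1024:
        per_peer = 1
    elif bytes_per_peer < 256 * 1024:
        per_peer = 2
    elif bytes_per_peer < 1024 * 1024:
        per_peer = 4
    else:
        per_peer = 8
    return per_peer * num_peers
```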

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
In #117066, shutting down the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)), but it should not be shut down when a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).
#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But that fix only triggers when the agent restarts the training, not when the rendezvous was already shut down beforehand.
Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
Fixes #150916
Fixes #147064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
Summary: Use `pybind11::gil_scoped_acquire` instead of the old implementation, as it automatically takes care of error handling. In the original implementation we missed releasing the GIL on some possible error paths, which could put the program in a deadlock.
Test Plan: Induced an error manually and confirmed that the GIL was released.
Differential Revision: D74593564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
Summary:
This diff adds a justknobs check for the static cuda launcher. In particular, it supports a fractional rollout where each MAST job/version is consistently enrolled with the config either on or off.
It also adds a `set_feature_use` call so we can track whether the static cuda launcher is enabled on a given dynamo compile.
Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.
Differential Revision: D74599203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
This appears to be slow in production (potentially a quadratic explosion), and logging it explicitly in pt2_compile_events and wait_counters makes it much easier to see how bad the issue is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
Adds `create_graph` support if you don't compile, or compile only with `torch.compile(backend="eager")`.
Using a backend that goes through AOTDispatch produces a post-dispatch AOT backward, whose double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
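For reference, a minimal sketch of the now-supported pattern, a double backward through a function compiled with the eager backend (illustrative usage, not a test from this PR; the compiled_autograd flag is the one referenced below):

```python
# Illustrative double backward with the eager backend; not a test from this PR.
import torch

# Compiled autograd flag referenced elsewhere in these notes.
torch._dynamo.config.compiled_autograd = True

@torch.compile(backend="eager")
def f(x):
    return (x.sin() ** 2).sum()

x = torch.randn(4, requires_grad=True)
# create_graph=True keeps the graph so we can differentiate a second time.
(grad,) = torch.autograd.grad(f(x), x, create_graph=True)
(grad2,) = torch.autograd.grad(grad.sum(), x)  # double backward
print(grad2)
```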
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
Toggling on `torch._dynamo.config.compiled_autograd = True` was causing export to error (`optimize_assert` didn't have `rebuild_ctx` defined). Separately, add a way to `rebuild_ctx` for `optimize_assert`, since it is a public API.
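A minimal sketch of the combination that used to error (the toy module is hypothetical; the config flag is quoted from the PR and `torch.export.export` is a real API):

```python
# Sketch of the previously failing combination: compiled autograd enabled
# while exporting. The module is a made-up toy example.
import torch

torch._dynamo.config.compiled_autograd = True

class Toy(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

# Before this fix, export could fail because optimize_assert had no rebuild_ctx.
ep = torch.export.export(Toy(), (torch.randn(3),))
print(ep)
```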
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
Summary:
For MTIA on-demand mode, we are not going through the torch Python module; the data upload happens in C++, which doesn't support pickle.
Thus, we store the data as JSON at the end, and the visualizer needs to be updated to support it.
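A small sketch of how a loader can accept both formats (illustrative only, not the actual visualizer code):

```python
# Illustrative loader accepting either a pickled snapshot or the new
# JSON-serialized form; not the actual memory visualizer code.
import json
import pickle

def load_snapshot(path: str):
    with open(path, "rb") as f:
        data = f.read()
    try:
        return pickle.loads(data)
    except Exception:
        # Fall back to JSON, e.g. for snapshots uploaded from C++.
        return json.loads(data.decode("utf-8"))
```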
Test Plan: Check Test plan in D74179606
Differential Revision: D74406209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153166
Approved by: https://github.com/sraikund16
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning, because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark(), which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote autotuning, I'll be using the TritonBenchmarkRequest directly, so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".
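To illustrate the keyword mismatch being avoided (the class below is a simplified stand-in, not the real TritonBenchmarkRequest interface):

```python
# Simplified stand-in showing why the parameter name matters; this is not the
# real TritonBenchmarkRequest API.
import torch

class BenchmarkRequestSketch:
    # Naming the parameter `out` lets callers that pass `out=` work directly.
    def benchmark(self, *inputs, out):
        out.copy_(sum(inputs))
        return 0.0  # placeholder timing

req = BenchmarkRequestSketch()
buf = torch.zeros(4)
req.benchmark(torch.ones(4), torch.ones(4), out=buf)
# If the parameter were named `output_tensor`, the same call would raise:
# TypeError: benchmark() got an unexpected keyword argument 'out'
```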
Test Plan: Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
By invalidating all variables created during the loop, except for the contents of the iterator_cache (since stores can happen inside the reduction loop), and by clearing the `IteratorRangeEntry` codegen cache.
This results in the following kernel for `x / x.sum()` when x has 2048 elements and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```
Fixes the compilation errors reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`.
Fixes https://github.com/pytorch/pytorch/issues/152155
Some Inductor tests are still failing, though, so the variable invalidation needs further refinement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
During FR dumps, for reasons not yet understood, we see CUDA errors when querying events, and this causes the whole FR dump to fail (when trying to get the entries). So we wrap the query in a try-catch instead of letting it fail the whole process.
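The actual change lives in the C++ Flight Recorder code, but the pattern is simply a guarded query; a Python-style sketch with illustrative names:

```python
# Python-style sketch of the guarded event query; the real change is in the
# C++ Flight Recorder, and `entries` / `entry.end_event` are illustrative names.
def collect_entries(entries):
    dumped = []
    for entry in entries:
        try:
            completed = entry.end_event.query()  # may raise a CUDA error
        except RuntimeError:
            # Don't let one bad event query take down the whole dump;
            # record the entry without completion info instead.
            completed = None
        dumped.append({"op": entry.op_name, "completed": completed})
    return dumped
```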
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k