I'm cleaning this PR up as a proper way of disabling functionalization via config in AOTDispatcher. I removed the non-functionalization-related changes from the original version:
(1) preventing proxy mode (and functionalization) from incorrectly decomposing CIA ops (Ed has a PR for it here: https://github.com/pytorch/pytorch/pull/164939)
(2) preventing python-dispatcher-based decomps above autograd from running. I'm not doing this for now; I will likely do it in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164577
Approved by: https://github.com/ezyang
ghstack dependencies: #165372
Commit f4d8bc46c7706f872abcb4ec41f0b32207d5d826 added TF32 support for x86 CPUs,
which causes build failures on PowerPC systems with mkldnn.
This patch disables TF32 paths on PowerPC while keeping x86 TF32 support intact,
allowing PyTorch to build successfully on PowerPC.
I have run the mkldnn test case on PowerPC, and it passed successfully.
`pytest test/test_mkldnn.py`
87 passed, 2 skipped in 1709.02s (0:28:29)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163454
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary:
Currently `get_c2_fbandroid_xplat_compiler_flags()` is reading the `caffe2.strip_glog` buckconfig, which we want to get rid of.
This diff removes the `fbandroid_compiler_flags` arg and merges it into `compiler_flags` using a nested select (and the select version of the method).
The goal is to get rid of all the usages of `get_c2_fbandroid_xplat_compiler_flags()` so that we can get rid of the `caffe2.strip_glog` buckconfig.
Test Plan: CI
Differential Revision: D84626885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165558
Approved by: https://github.com/malfet
We have an issue when using fx_traceback.annotate with HOPs that trace joint graphs. HOPs have bodies that have already been traced by Dynamo and, after Animesh's PR, those bodies do have the annotations. But when we lower that Dynamo HOP body to aten in either pre-dispatch or post-dispatch, we need to propagate the annotations to the aten nodes.
AOTAutograd does this indirectly by piggybacking off the `PropagateUnbackedSymInts` fx.Interpreter. I'm not sure whether all HOPs should be using it to trace their joints or not. This PR adds an interpreter to local_map's implementation.
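As a rough illustration of the interpreter pattern involved (not local_map's actual code; the `annotation` meta key and the toy graph below are stand-ins), an fx.Interpreter re-executes a traced GraphModule node by node, so each node's metadata is in scope at the exact point where that node is lowered and can be stamped onto whatever a tracer records underneath:
```
import torch
import torch.fx as fx

class AnnotationAwareInterpreter(fx.Interpreter):
    """Re-runs a GraphModule node by node; each node's metadata is visible at
    the moment the node executes, so it could be copied onto any nodes a
    tracer creates underneath this call."""
    def run_node(self, n: fx.Node):
        ann = n.meta.get("annotation")   # stand-in key, for illustration only
        if ann is not None:
            print(f"running {n.format_node()} with annotation {ann}")
        return super().run_node(n)

def f(x):
    return (x * 2).sum()

gm = fx.symbolic_trace(f)
for node in gm.graph.nodes:
    if node.op in ("call_function", "call_method"):
        node.meta["annotation"] = {"stage": "demo"}   # pretend Dynamo recorded this
AnnotationAwareInterpreter(gm).run(torch.randn(4))
```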
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165336
Approved by: https://github.com/yushangdi
**Summary**
The load-balancing problem can be modeled as the [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for
implementing load balancing, and in this PR we start by adding a Round-Robin solution as an example,
along with verification. This can easily be adapted to other solutions such as Shortest-processing-time-first/
Longest-processing-time-first with extra padding added for collectives.
- Added a new `_LoadBalancer` implementation, `_PTRRLoadBalancer`, designed for
`flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and performs
Round-Robin assignment (unlike traditional Round-Robin, which proceeds in circular order, we assign in zig-zag order; a standalone sketch of the zig-zag idea follows this list).
- Made `_context_parallel_buffers` and `context_parallel_unshard` handle a batched load-balance
index (previously they could only handle a non-batched load-balance index), like in `create_cp_block_mask`.
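A standalone sketch of the zig-zag Round-Robin assignment (illustrative only; the actual `_PTRRLoadBalancer` uses the `BlockMask` sparsity info rather than the fixed per-chunk cost assumed below):
```
# Zig-zag (boustrophedon) Round-Robin, as opposed to the circular order
# 0, 1, ..., W-1, 0, 1, ...  The per-chunk "cost" is a toy stand-in for
# the BlockMask sparsity information.
def zigzag_round_robin(num_chunks: int, world_size: int) -> list[int]:
    """Return the rank assigned to each chunk index."""
    assignment = []
    for i in range(num_chunks):
        round_idx, pos = divmod(i, world_size)
        # even rounds go left-to-right, odd rounds go right-to-left
        rank = pos if round_idx % 2 == 0 else world_size - 1 - pos
        assignment.append(rank)
    return assignment

# With causal masking, later chunks attend to more keys and are more expensive,
# so zig-zag pairs a cheap chunk with an expensive one on each rank.
world_size, num_chunks = 4, 8
cost = list(range(1, num_chunks + 1))                 # toy per-chunk cost: 1..8
assign = zigzag_round_robin(num_chunks, world_size)   # [0, 1, 2, 3, 3, 2, 1, 0]
per_rank = [sum(c for c, r in zip(cost, assign) if r == rank)
            for rank in range(world_size)]
print(assign, per_rank)                               # each rank's total cost is 9
```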
**Test**
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163617
Approved by: https://github.com/fegin
Not sure what exactly we want to have in the message, but that's easy to adjust. I tried to find a reliable test to reproduce this message (it happens only when a guard fails right after it's created), but I ended up mocking a `guard_manager.check` function to return `False` to trigger this behavior. I think that's fine: any other case we might pick (like `datetime.now()`) is something we'd want to patch one day anyway, so every time we make the next patch we'd need to chase down another repro test.
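For illustration, the mocking pattern is roughly the following, using a stand-in object rather than Dynamo's real guard manager:
```
from unittest import mock

class FakeGuardManager:
    def check(self, f_locals):
        return True   # a freshly created guard is normally expected to pass

guard_manager = FakeGuardManager()
# Force the just-created guard to report failure, which is the situation
# the new error message covers:
with mock.patch.object(guard_manager, "check", return_value=False):
    assert guard_manager.check({}) is False
```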
@williamwen42
Fixes #164990
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165242
Approved by: https://github.com/williamwen42
Summary:
Moves the function used to load CuTeDSL Jinja templates up one level, out of the flex attention folder, so that it can be used for more general Inductor templates in the future.
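For context, the generic pattern for loading Jinja templates from a shared directory looks roughly like this (a plain `jinja2` sketch with placeholder paths and names, not Inductor's actual helper):
```
from pathlib import Path
from jinja2 import Environment, FileSystemLoader, StrictUndefined

# Hypothetical template directory and template name, for illustration only.
TEMPLATE_DIR = Path(__file__).parent / "templates"
env = Environment(loader=FileSystemLoader(TEMPLATE_DIR), undefined=StrictUndefined)

def load_template(name: str):
    """Load a Jinja template by file name from the shared template directory."""
    return env.get_template(f"{name}.py.jinja")

# Example (commented out because the template file is hypothetical):
# kernel_src = load_template("cutedsl_grouped_mm").render(BLOCK_M=128, BLOCK_N=128)
```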
Test Plan: `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8`
Reviewed By: drisspg
Differential Revision: D84527470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165576
Approved by: https://github.com/jananisriram
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup.
The changes are from Yu-Yun <YuYun.Chang@amd.com>.
# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X
- The original "scatter-add" approach
- Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
- Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
- Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`).
- Each thread processed one output pixel.
- The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`).
- Each thread is responsible for calculating the final gradient for a single input pixel.
- The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function for calculating the range of output pixels that could possibly have used the input pixel (`input_pos`) during the forward-pass interpolation
- This is essentially the mathematical inverse of the forward pass.
- This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation approach switching from "scatter-add" to "gather-sum"
- Scatter-add
- For each output pixel, the thread calculated 4 gradient contributions and used `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
- Gather-sum (a simplified 1-D sketch of this pattern follows this list)
- A thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value.
- The thread iterates through this region, and for each output pixel in the region, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
- All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`).
- Without any global memory access, this accumulation is extremely fast.
- When the loops are done, the thread performs a single, direct write (non-atomic) of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
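Below is a simplified, self-contained 1-D analogue of the two approaches (plain Python with the align_corners=False index mapping; the real kernel is the separable 2-D HIP/CUDA version with `compute_output_range` on the device):
```
import math
import random

def gather_sum_backward_1d(grad_out, in_size, scale):
    """New approach: each input position gathers from the output window that
    could have used it, then does a single non-atomic write."""
    out_size = len(grad_out)
    grad_in = [0.0] * in_size
    for i in range(in_size):
        # Conservative inverse of the forward mapping (the role of
        # compute_output_range): outputs whose source coordinate lands near i.
        lo = max(0, int(math.floor((i - 0.5) / scale - 0.5)))
        hi = min(out_size - 1, int(math.ceil((i + 1.5) / scale - 0.5)))
        acc = 0.0
        for dst in range(lo, hi + 1):
            src = max((dst + 0.5) * scale - 0.5, 0.0)   # align_corners=False
            i0 = int(math.floor(src))
            i1 = min(i0 + 1, in_size - 1)
            w1 = src - i0
            if i0 == i:
                acc += (1.0 - w1) * grad_out[dst]       # re-computed weight
            if i1 == i:
                acc += w1 * grad_out[dst]
        grad_in[i] = acc                                # one plain write per input
    return grad_in

def scatter_add_backward_1d(grad_out, in_size, scale):
    """Old approach: each output position scatters into the input gradient
    (the += below is what the kernel did with atomicAdd)."""
    grad_in = [0.0] * in_size
    for dst in range(len(grad_out)):
        src = max((dst + 0.5) * scale - 0.5, 0.0)
        i0 = int(math.floor(src))
        i1 = min(i0 + 1, in_size - 1)
        w1 = src - i0
        grad_in[i0] += (1.0 - w1) * grad_out[dst]
        grad_in[i1] += w1 * grad_out[dst]
    return grad_in

in_size, out_size = 8, 21
scale = in_size / out_size
g = [random.random() for _ in range(out_size)]
gather = gather_sum_backward_1d(g, in_size, scale)
scatter = scatter_add_backward_1d(g, in_size, scale)
assert all(abs(a - b) < 1e-9 for a, b in zip(gather, scatter))
```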
# Why performance gets boosted
- Analysis of the root cause of performance drop
- Ref. (internal only) - https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
- Many parallel threads frequently called `atomicAdd`, attempting to update the exact same memory location in the input gradient tensor at the same time.
- The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallel execution at those contention points.
- The MI300X/MI325X chiplet-based CDNA 3 architecture amplified the issue.
- When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
- The implementation change eliminates hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd`.
- Improved memory access pattern and locality
- Write coalescing
- The regular sum writes `idata[index] = static_cast<scalar_t>(grad_sum);` can be perfectly coalesced by GPUs.
- Read locality
- Even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
- The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X.
- Removal of atomic operations avoids expensive memory synchronization.
---
Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164572
Approved by: https://github.com/jeffdaily
Bundling a number of smallish improvements:
- Account for bucketing in the overlap calculation: if an in-flight collective exists with the same bucket key, reduce the new collective's estimated time by its latency time (a toy sketch follows this list).
- Update compute domination so we order based on compute index, as opposed to compute depth, so we never reorder compute. This makes it a bit easier to reason about memory and pre-fetching, although we can explore reordering in the future.
- When we wait on a collective, force all collectives on the same process group that were enqueued prior to it to wait as well.
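A toy sketch of the first point (hypothetical data structures, not the scheduler's real code): when a collective with the same bucket key is already in flight, only the bandwidth term of the new collective is counted, since the latency is assumed to be paid once per bucket.
```
from dataclasses import dataclass

@dataclass
class Collective:
    bucket_key: str        # e.g. (process group, collective type, dtype)
    latency: float         # fixed startup cost, paid once per bucket
    bandwidth_time: float  # size-dependent transfer cost

def estimated_time(new: Collective, in_flight: list[Collective]) -> float:
    """If an in-flight collective shares the bucket key, assume the new one
    rides in the same bucket and skip its latency term."""
    if any(c.bucket_key == new.bucket_key for c in in_flight):
        return new.bandwidth_time
    return new.latency + new.bandwidth_time

ag = Collective("pg0:all_gather:bf16", latency=20.0, bandwidth_time=35.0)
print(estimated_time(ag, in_flight=[]))                                               # 55.0
print(estimated_time(ag, in_flight=[Collective("pg0:all_gather:bf16", 20.0, 80.0)]))  # 35.0
```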
Better Memory Handling:
- Pre-fetch limiting: when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which are typically memory-reducing).
- When we are above peak memory, schedule waits.
TODO:
- For each compute node, we know its original memory in the graph; we could limit pre-fetching that goes across peak memory.
- By scheduling off-path collectives for overlap, we reduce memory, but if there isn't enough compute for overlap, we need to proactively schedule them. Not an issue yet on the examples.
- Make some hard-coded constants configurable and clean up enablement (can do in a subsequent PR).
On small Llama 2D backward:
578 of 618 potentially hideable collectives hidden
original mem 14.4GB, rescheduled mem 15.9GB
On forward:
254/256 potentially hideable collectives hidden
original mem 5.8GB, rescheduled mem 5.8GB
WIP: adding tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165318
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945, #165059
For `test_graph_partition_with_memory_plan_reuse`, before this PR, when using graph partition, it would error ([P1992728479](https://www.internalfb.com/phabricator/paste/view/P1992728479)):
```
def partition_0(args):
...
del buf0
return (buf3, buf4, buf5, buf2, primals_4, )
...
File "/tmp/torchinductor_boyuan/ww/cwwc7ukfqscg2vy6ankby2fizdb377tvgyx3fwdgddrxe3g47jg6.py", line 132, in partition_0
return (buf3, buf4, buf5, buf2, primals_4, )
^^^^
NameError: name 'buf2' is not defined. Did you mean: 'buf0'?
```
When not using graph partition, it would work and give the following code ([P1992997521](https://www.internalfb.com/phabricator/paste/view/P1992997521)):
```
def call(self, args):
...
buf2 = buf0; del buf0 # reuse
...
```
Note that the issue is that buf0 is not reused for buf2 when using graph partition.
Why? Because codegen runs `run_wrapper_ir_passes` and `memory_plan_reuse`, which pops trailing `MemoryPlanningLine`s unless the buffer is a graph output, checked via `V.graph.get_output_names()`. However, for graph partition, we should check the outputs of the current partition instead of the outputs of the graph before partitioning.
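A minimal sketch of that distinction (made-up structures, not Inductor's actual pass):
```
def prune_trailing_planning_lines(lines, output_names):
    """Drop trailing memory-planning lines whose buffer is not an output."""
    pruned = list(lines)
    while (pruned and pruned[-1]["kind"] == "memory_planning"
           and pruned[-1]["buffer"] not in output_names):
        pruned.pop()
    return pruned

graph_outputs = {"buf3", "buf4", "buf5", "primals_4"}   # whole-graph view: no buf2
partition_outputs = graph_outputs | {"buf2"}            # partition_0 also returns buf2

lines = [
    {"kind": "code", "buffer": None},
    {"kind": "memory_planning", "buffer": "buf2"},      # "buf2 = buf0; del buf0  # reuse"
]

# Checking the whole graph's outputs drops the reuse line, so buf2 is never defined:
assert prune_trailing_planning_lines(lines, graph_outputs) == lines[:1]
# Checking the current partition's outputs keeps it:
assert prune_trailing_planning_lines(lines, partition_outputs) == lines
```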
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165514
Approved by: https://github.com/ProExpertProg, https://github.com/eellison