Commit Graph

93424 Commits

f9fa138a39 [BE] Delete all pre py-3.10 checks (#163653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
2025-09-23 23:22:53 +00:00
f3f67ff43a Fix warn message (#163578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163578
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0
2025-09-23 22:46:51 +00:00
6b5ad5f211 [Kineto] Add list of string parsing for profiler (#163593)
Summary:
We add parsing for lists of strings. This is needed by AOTInductor
profiling to record input information for Triton kernels.

Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
2025-09-23 22:45:49 +00:00
20149080f2 [MPS] Compute offset2bag/bag_size/max_indices in _embedding_bag (#163281)
Part of #162270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163281
Approved by: https://github.com/malfet
2025-09-23 22:30:48 +00:00
b879ef7c0d [ROCm][CI] skip TestCudaPrimaryCtx.test_set_device_0 (#163693)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 22:15:10 +00:00
c63e417c79 use reduction hint for aggressive rblock (#163371)
I had been using tiling scores to essentially check whether this is an inner reduction. Since tiling scores are not fully rolled out for dynamic shapes, fall back to the reduction hint when they are not available.
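
A minimal sketch of the fallback, for illustration only (the `tiling_scores` dict and the inner-reduction test are hypothetical; the enum is a local stand-in for Inductor's `ReductionHint`):

```python
# Illustrative sketch only -- not the actual Inductor heuristic.
from enum import Enum, auto

class ReductionHint(Enum):  # stand-in for Inductor's ReductionHint enum
    INNER = auto()
    OUTER = auto()
    OUTER_TINY = auto()
    DEFAULT = auto()

def looks_like_inner_reduction(tiling_scores, reduction_hint):
    if tiling_scores is not None:
        # Preferred path: tiling scores tell us how well each dimension coalesces.
        return tiling_scores["r"] >= tiling_scores["x"]
    # Dynamic shapes: tiling scores may be unavailable, so fall back to the hint.
    return reduction_hint == ReductionHint.INNER

print(looks_like_inner_reduction(None, ReductionHint.INNER))  # True
```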

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163371
Approved by: https://github.com/PaulZhang12
2025-09-23 22:04:22 +00:00
c3d9f089d9 [torchfuzz] introduce multi process fuzzer (#163560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163560
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557, #163558
2025-09-23 22:00:51 +00:00
29af25844b Less aggressive persistent reduction when it could induce large masking with dynamic shapes (#163365)
As per comment in source code:
```
            # If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
            # (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
            # quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
            # and upper are large, for the lower values we will be masking off a large % of reads/writes,
            # when we could expand the coalescing xblock instead.
```

For the test case in question, this PR improves perf from 0.8573521325143717 -> 0.043151492193814305, because we were egregiously masking off rblock values (58/64 values).
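
For intuition, a back-of-the-envelope sketch of the masking cost (assumed numbers, not the actual heuristic):

```python
import math

def masked_fraction(rnumel_runtime: int, rnumel_upper_bound: int) -> float:
    # Persistent reduction fixes RBLOCK to cover the largest possible rnumel,
    # so every lane beyond the runtime rnumel is masked off.
    rblock = 2 ** math.ceil(math.log2(rnumel_upper_bound))
    return (rblock - rnumel_runtime) / rblock

# Upper bound of 64 but a runtime rnumel of 6: 58 of 64 lanes are wasted on masking.
print(masked_fraction(6, 64))  # 0.90625
```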

Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
2025-09-23 21:58:57 +00:00
8c8416b021 Update pytorch.org links in docs/conf.py (#163682)
Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
2025-09-23 21:40:11 +00:00
b182365660 [ez] use list initializer syntax in fill_diagonal_ (#163607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163607
Approved by: https://github.com/Skylion007
ghstack dependencies: #163485
2025-09-23 21:27:12 +00:00
5ca563ea09 symintify fill_diagonal_ (#163485)
Fixes #162271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163485
Approved by: https://github.com/Skylion007
2025-09-23 21:27:12 +00:00
e671dcc969 Update tests to check for more robust pattern (#163107)
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.

Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object in Python land, as this is created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept the frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as shown in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one more frame up, so there is no access to the frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which is different from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"

My initial attempt at solving this was taking the inner closures and putting them into f_locals for the frame I am constructing, which turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce nicer names.

Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
2025-09-23 21:11:48 +00:00
fc84743707 Implement CUDA stream protocol (#163614)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163614
Approved by: https://github.com/eqy
2025-09-23 21:02:08 +00:00
2a9745de3c [multi-kernel] shape-similarity kernel selection (#163090)
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from multi-kernel hints, based on similarity between hint and runtime input & output shapes (L1 distance in log2 space).

Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as a future extension.

Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />

Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
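
A minimal sketch of the selection metric (the kernel container here is hypothetical; only the log2-space L1 distance reflects the description above):

```python
import math

def log2_l1_distance(hint_shape, runtime_shape):
    # L1 distance between shapes in log2 space, summed over all dimensions.
    return sum(abs(math.log2(h) - math.log2(r)) for h, r in zip(hint_shape, runtime_shape))

def pick_kernel(pregenerated, runtime_shape):
    # `pregenerated` is a hypothetical list of (hint_shape, compiled_kernel) pairs.
    return min(pregenerated, key=lambda entry: log2_l1_distance(entry[0], runtime_shape))[1]

# Kernels pre-generated from size hints 64 and 256; a runtime shape of 96 is
# closer to 64 in log2 space, so the 64-hint kernel is selected.
pregenerated = [((64, 64), "kernel_hint_64"), ((256, 256), "kernel_hint_256")]
print(pick_kernel(pregenerated, (96, 96)))  # kernel_hint_64
```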

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
2025-09-23 21:00:47 +00:00
22c5e8c17c Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
bcb893acb0 [ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)
Fixes build timeouts >4h on libtorch build jobs: 75e7f49f9c/1

Brings back code to narrow down CK compilation targets from 69a25f6888 (diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777)

gfx942 supports fp8

Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738

Validation:
[rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162648
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 18:55:35 +00:00
8e6b0c71fb [Inductor] Remove no_type_check annotation on properties (#163570)
Some properties with `cache_on_self` were previously annotated with `no_type_check` to get around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, to enable type checking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
2025-09-23 18:20:04 +00:00
0696a4b0b8 [EZ] Perma-ignore UP038 (#163649)
As it has been removed, see https://docs.astral.sh/ruff/rules/non-pep604-isinstance/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163649
Approved by: https://github.com/Skylion007
ghstack dependencies: #163648
2025-09-23 17:58:18 +00:00
ca35dc2fdd [EZ] Fix UP041 violations (#163648)
I.e. use `TimeoutError` instead of `socket.timeout`
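
For context, `socket.timeout` has been an alias of the builtin `TimeoutError` since Python 3.10, so catching the builtin is equivalent; a small sketch (the connect target is purely illustrative):

```python
import socket

# Since Python 3.10, socket.timeout is an alias of the builtin TimeoutError.
assert socket.timeout is TimeoutError

sock = socket.socket()
sock.settimeout(0.01)
try:
    sock.connect(("10.255.255.1", 80))  # unroutable address, should time out
except TimeoutError:  # previously written as `except socket.timeout:`
    print("timed out")
finally:
    sock.close()
```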
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163648
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-09-23 17:58:18 +00:00
649ceda8a5 [export] handling NamedTuple inputs (#162959)
Fixes #160547
### Summary:
Bug reproducer:
```
    def test_namedtuple(self):
        from collections import namedtuple
        Point = namedtuple('Point', 'x y')

        class M(torch.nn.Module):
            def forward(self, x, y):
                return x + y

        inp = Point(torch.ones(3), torch.ones(3))
        print(M()(*inp))

        # errors
        ep = torch.export.export(M(), inp, strict=False)
        print(ep)

        # succeeds
        ep = torch.export.export(M(), inp, strict=True)
        print(ep)

        # workaround could be to convert namedtuple to a kwarg
        inp_kwargs =  {field: getattr(inp, field) for field in inp._fields}
        ep = torch.export.export(M(), (), inp_kwargs)
        print(ep)
```
Fix:
`namedtuple` is a subclass of `tuple`, but a named tuple was not expected here, so this change handles the named-tuple case.

I have added a 🧪 test case for this as well.
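
A minimal sketch of the detection idiom (illustrative only, not the actual export code): a named tuple is still a `tuple` subclass, but it can be recognized via its `_fields` attribute and flattened into plain positional args.

```python
from collections import namedtuple

def is_namedtuple(obj) -> bool:
    # Named tuples are tuple subclasses whose type defines a `_fields` attribute.
    return isinstance(obj, tuple) and hasattr(type(obj), "_fields")

def normalize_args(args):
    # Flatten a named tuple into a plain tuple of positional args.
    return tuple(args) if is_namedtuple(args) else args

Point = namedtuple("Point", "x y")
print(is_namedtuple(Point(1, 2)))       # True
print(is_namedtuple((1, 2)))            # False
print(normalize_args(Point("a", "b")))  # ('a', 'b')
```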
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <angelayi@meta.com>
2025-09-23 17:43:50 +00:00
2aadcea05c [ROCm] Improve perf for elementwise broadcast with mixed dtype (#163562)
* Unroll loops manually to hide memory access latency

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163562
Approved by: https://github.com/jeffdaily
2025-09-23 17:42:48 +00:00
fde929c8a8 [AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
This should fix the AOTI UTs in `test_aot_inductor_package.py`; these cases were failing at `compile_so`.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor_package.py -v -k test_multiple_methods
```
<img width="1262" height="95" alt="image" src="https://github.com/user-attachments/assets/49458536-1cfe-498e-a12a-2bfd8da67a9e" />

The major fix is in `get_cpp_compile_command`. The code is aligned with the cpp_builder frontend code:  3ef1bef36c/torch/_inductor/cpp_builder.py (L1780-L1790)
3ef1bef36c/torch/_inductor/cpp_builder.py (L1959-L1976)

Fixed on Windows:
<img width="1261" height="89" alt="Image" src="https://github.com/user-attachments/assets/9bf43b11-aac1-4161-a625-e602e313a299" />

Also validated on Linux:
<img width="1039" height="81" alt="Image" src="https://github.com/user-attachments/assets/46063e16-6cf1-4a28-8466-0496871b8619" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163561
Approved by: https://github.com/jansel
2025-09-23 17:38:18 +00:00
134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, the dequantization process was producing numerically incorrect results due to improper block-to-slice coordinate mapping. The previous implementation calculated block boundaries relative to the sliced tensor dimensions instead of the original full tensor dimensions, causing scale factors to be applied to wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

2. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.
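
A 1-D sketch of the block/slice intersection logic in global coordinates (hypothetical helper, not the actual DCP code):

```python
def blocks_for_slice(slice_start: int, slice_stop: int, block_size: int):
    """Yield (block_index, global_start, global_stop) for every quantization
    block that intersects the requested slice, using full-tensor coordinates."""
    first = slice_start // block_size
    last = (slice_stop - 1) // block_size
    for block in range(first, last + 1):
        start = max(slice_start, block * block_size)
        stop = min(slice_stop, (block + 1) * block_size)
        yield block, start, stop

# A slice [100, 300) of a dimension quantized in blocks of 128 touches blocks 0, 1 and 2;
# each block's scale factor is applied only to its intersection with the slice.
print(list(blocks_for_slice(100, 300, 128)))
# [(0, 100, 128), (1, 128, 256), (2, 256, 300)]
```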

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to full tensor shape
3. Scale factors are applied to correct tensor regions
4. Tensor shapes map is built efficiently using only metadata

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
221ac81043 Revert "[precompile] Add option to disable guard check on aot-compiled function. (#163432)"
This reverts commit 539e84e289fa7563032410706ede50a4eaa7a15d.

Reverted https://github.com/pytorch/pytorch/pull/163432 on behalf of https://github.com/Camyll due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/163432#issuecomment-3324757069))
2025-09-23 16:31:30 +00:00
6e5dddba64 Use accelerator API in common_dtensor (#163498)
Fixes #ISSUE_NUMBER

Try to unify the device checks in common_dtensor (a testing module) via the accelerator API.
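
A rough sketch of the idea, assuming the `torch.accelerator` query API available in recent PyTorch builds:

```python
import torch

# Ask the accelerator API for the active backend instead of hard-coding "cuda".
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator().type  # e.g. "cuda", "xpu"
else:
    device_type = "cpu"

print(f"running dtensor tests on: {device_type}")
```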

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163498
Approved by: https://github.com/albanD, https://github.com/H-Huang
2025-09-23 16:30:20 +00:00
ebddbe787a [ROCm][CI] skip test_sparse_triangular_solve (#163651)
Need more time to debug, but we also need a clean CI signal. The test was unskipped by #163495, but had been skipped on ROCm prior to that.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 15:55:51 +00:00
5f0c7cb4aa Add B200 smoke test (#159494)
Running test_max_autotune locally on B200 makes for a horrible read; for now, to get something landed, I am focusing on test_matmul_cuda.py and test_fp8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159494
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
ghstack dependencies: #163460, #163537, #163552
2025-09-23 15:45:05 +00:00
b3cf5c79dd Skip on sm100 later since tests are non-deterministic (#163552)
This is tracked https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think that this is overkill, but feel free to disagree. Also, the one test I ran that failed earlier up in the stack failed with a 1 ULP difference, so I think this is kind of an edge case in how we do testing (will write up an issue with my thoughts later).

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Okay, the bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
720a7b2887 [export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
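
A small illustration of the behavior being worked around (not the export serialization code itself): for a non-contiguous slice, `.contiguous()` materializes only the visible elements, so the raw bytes no longer match the original storage size.

```python
import torch

base = torch.randn(4, 4)  # 64 bytes of float32 storage
view = base[:, :2]        # a non-contiguous slice that still shares base's storage

print(view.untyped_storage().nbytes())               # 64 -- the full original storage
print(view.contiguous().untyped_storage().nbytes())  # 32 -- only the sliced elements remain
```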

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing

Differential Revision: D83016250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
2025-09-23 15:44:56 +00:00
49e7b2f69d [inductor] Fix error from custom CUDA allocators (#163422)
Fixes #163257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163422
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412
2025-09-23 15:37:45 +00:00
6ef74879f6 [dynamo] Fix TorchFunctionMode handling with get_rng_state (#163412)
Fixes #162624
Fixes #162586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163412
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393
2025-09-23 15:37:45 +00:00
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
ed84e808f0 [inductor] Freeze layouts in FlexAttention (#163434)
Fixes #163300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163434
Approved by: https://github.com/drisspg
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419
2025-09-23 15:37:29 +00:00
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
4264fd34ec Add basic tests for torch.distributed.tensor._utils.compute_global_tensor_info (#162968)
Next PR writes a C++ implementation. Seems good to have tests first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162968
Approved by: https://github.com/ezyang
ghstack dependencies: #161695, #162508
2025-09-23 14:56:32 +00:00
e05c9c0c84 [ROCm][CI] cudagraph trees ut fixes (#163592)
Fixes #162125.
Fixes #160719.
Fixes #157901.
Fixes #157871.
Fixes #157761.
Fixes #157723.
Fixes #157643.
Fixes #157616.
Fixes #157556.
Fixes #157533.
Fixes #157449.
Fixes #157428.
Fixes #157413.
Fixes #157367.
Fixes #157350.
Fixes #157339.
Fixes #157312.
Fixes #157280.
Fixes #157258.
Fixes #157173.
Fixes #157143.
Fixes #157112.
Fixes #157086.
Fixes #157058.
Fixes #157035.
Fixes #156984.
Fixes #156957.
Fixes #156954.
Fixes #156922.
Fixes #156886.
Fixes #156838.
Fixes #156808.
Fixes #156801.
Fixes #156778.
Fixes #156755.
Fixes #156735.
Fixes #156693.
Fixes #152561.
Fixes #130749.
Fixes #100074.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163592
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 14:45:00 +00:00
aff76c046d Revert "Add fake_impl for _native_multi_head_attention (#163167)"
This reverts commit 27164b6788cab6e6d8095012839e51c958a819d6.

Reverted https://github.com/pytorch/pytorch/pull/163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see 1a42656d6c/1 ([comment](https://github.com/pytorch/pytorch/pull/163167#issuecomment-3324302026))
2025-09-23 14:36:45 +00:00
1a42656d6c [Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314

In particular, this addresses **Bug 1: H=None Broadcasting Produces Incorrect Results**.

This fixes a shape bug when slicing a BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose this shape, even though the mask_mod remains "interpretable", the kernel's stride math reads the wrong offsets. As a result we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.

The B=None, H=None case only works by accident: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer, and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with the wrong strides, which causes silent errors.
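
The underlying indexing behavior is easy to see on a plain tensor (an illustration of the dimension collapse only, not the BlockMask code): an integer index drops the dimension, while a length-1 slice keeps it.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)

print(t[:, :, 1].shape)    # torch.Size([2, 3])    -- int index collapses the dim
print(t[:, :, 1:2].shape)  # torch.Size([2, 3, 1]) -- slice keeps the dim
```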

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
2025-09-23 13:01:51 +00:00
bda9ab291d [inductor] fix as_strided lowering with .view(dtype) inputs (#163319)
FIXES https://github.com/pytorch/pytorch/issues/163286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163319
Approved by: https://github.com/eellison
2025-09-23 12:50:57 +00:00
3c64b2abab CUDA 13.0 Warning update for supported architectures (#163585)
Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display the correct warning:
```
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet
2025-09-23 11:27:11 +00:00
5d749ceb92 Remove test conditions for CUDA<12 (#163495)
Because CUDA >= 12 is now required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163495
Approved by: https://github.com/janeyx99
2025-09-23 07:52:00 +00:00
8d81564df5 [pt2][cache] rework cache for true generic usage + better tests (#163488)
Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd, since we'd probably want to be able to store more than just str keys with bytes values. So, TL;DR: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.

With this change there were also some improvements made to the testing: mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there. For example, we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).
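
A minimal sketch of that design under the stated constraints (pickle-able keys/values, errors wrapped in CacheError); the names and structure here are illustrative, not the actual implementation:

```python
import pickle
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class CacheError(Exception):
    """Wraps any error raised inside the cache machinery."""

class InMemoryCache(Generic[K, V]):
    # Under the hood every cache can be a bytes -> bytes cache: keys and
    # values are pickled on the way in and unpickled on the way out.
    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}

    def get(self, key: K) -> Optional[V]:
        try:
            blob = self._store.get(pickle.dumps(key))
            return None if blob is None else pickle.loads(blob)
        except Exception as err:
            raise CacheError(f"lookup failed for {key!r}") from err

    def put(self, key: K, value: V) -> None:
        try:
            self._store[pickle.dumps(key)] = pickle.dumps(value)
        except Exception as err:
            raise CacheError(f"store failed for {key!r}") from err

cache: InMemoryCache[tuple, dict] = InMemoryCache()
cache.put(("kernel", 3), {"latency_ms": 0.42})
print(cache.get(("kernel", 3)))  # {'latency_ms': 0.42}
```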

Putting this PR up now for feedback. In the process of generalizing the code I removed the documentation since it was becoming outdated, but I will add that back in after the PR is green.

I have the next PR ready as well (it implements a fresh-cache context manager); I will export it once this lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-09-23 07:31:48 +00:00
b426ba1d5e [torchfuzz] introduce tensor and scalar pointwise ops (#163558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163558
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557
2025-09-23 06:20:13 +00:00
375f3e3a61 [OpenReg][Docs] Correct docs about openreg usage example. (#163235)
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).

Since I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install `gcc`, I believe it's necessary to slightly modify the command and add `-I ./` to explicitly indicate the header-file search path.

## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-09-23 06:16:45 +00:00
45d9dcccc5 Update Kineto Submodule (#162222)
Summary: Update

Test Plan:
CI

Differential Revision: D81727392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162222
Approved by: https://github.com/sanrise
2025-09-23 06:08:55 +00:00
309fe03f4b [torchfuzz] remove unneeded try catch (#163557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163557
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556
2025-09-23 06:05:08 +00:00
1545bb1c00 [torchfuzz] shuffle compatible ops (#163556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163556
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555
2025-09-23 05:53:44 +00:00
d5e51d34f7 [torchfuzz] decompose -> fuzz_inputs_specs (#163555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163555
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554
2025-09-23 05:44:59 +00:00
08c5efde5f [torchfuzz] cache operators (#163554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163554
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553
2025-09-23 05:28:07 +00:00