Commit Graph

93424 Commits

f9fa138a39 [BE] Delete all pre py-3.10 checks (#163653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
2025-09-23 23:22:53 +00:00
f3f67ff43a Fix warn message (#163578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163578
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0
2025-09-23 22:46:51 +00:00
6b5ad5f211 [Kineto] Add list of string parsing for profiler (#163593)
Summary:
We add parsing for lists of strings. This is needed by AOTInductor
profiling to record input information for Triton kernels.

Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
2025-09-23 22:45:49 +00:00
20149080f2 [MPS] Compute offset2bag/bag_size/max_indices in _embedding_bag (#163281)
Part of #162270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163281
Approved by: https://github.com/malfet
2025-09-23 22:30:48 +00:00
b879ef7c0d [ROCm][CI] skip TestCudaPrimaryCtx.test_set_device_0 (#163693)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 22:15:10 +00:00
c63e417c79 use reduction hint for aggressive rblock (#163371)
I had been using tiling scores to essentially check whether this is an inner reduction. Since tiling scores are not fully rolled out for dynamic shapes, fall back to the reduction hint when they are not available.
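
A minimal sketch of the fallback, for illustration only (the `tiling_scores` dict and the inner-reduction test are hypothetical; the enum is a local stand-in for Inductor's `ReductionHint`):

```python
# Illustrative sketch only -- not the actual Inductor heuristic.
from enum import Enum, auto

class ReductionHint(Enum):  # stand-in for Inductor's ReductionHint enum
    INNER = auto()
    OUTER = auto()
    OUTER_TINY = auto()
    DEFAULT = auto()

def looks_like_inner_reduction(tiling_scores, reduction_hint):
    if tiling_scores is not None:
        # Preferred path: tiling scores tell us how well each dimension coalesces.
        return tiling_scores["r"] >= tiling_scores["x"]
    # Dynamic shapes: tiling scores may be unavailable, so fall back to the hint.
    return reduction_hint == ReductionHint.INNER

print(looks_like_inner_reduction(None, ReductionHint.INNER))  # True
```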

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163371
Approved by: https://github.com/PaulZhang12
2025-09-23 22:04:22 +00:00
c3d9f089d9 [torchfuzz] introduce multi process fuzzer (#163560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163560
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557, #163558
2025-09-23 22:00:51 +00:00
29af25844b Less aggressive persistent reduction when it could induce large masking with dynamic shapes (#163365)
As per comment in source code:
```
            # If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
            # (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
            # quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
            # and upper are large, for the lower values we will be masking off a large % of reads/writes,
            # when we could expand the coalescing xblock instead.
```

For the test case in question, this PR improves perf from 0.8573521325143717 -> 0.043151492193814305, because we were egregiously masking off rblock values (58/64 values).
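
For intuition, a back-of-the-envelope sketch of the masking cost (assumed numbers, not the actual heuristic):

```python
import math

def masked_fraction(rnumel_runtime: int, rnumel_upper_bound: int) -> float:
    # Persistent reduction fixes RBLOCK to cover the largest possible rnumel,
    # so every lane beyond the runtime rnumel is masked off.
    rblock = 2 ** math.ceil(math.log2(rnumel_upper_bound))
    return (rblock - rnumel_runtime) / rblock

# Upper bound of 64 but a runtime rnumel of 6: 58 of 64 lanes are wasted on masking.
print(masked_fraction(6, 64))  # 0.90625
```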

Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
2025-09-23 21:58:57 +00:00
8c8416b021 Update pytorch.org links in docs/conf.py (#163682)
Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
2025-09-23 21:40:11 +00:00
b182365660 [ez] use list initializer syntax in fill_diagonal_ (#163607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163607
Approved by: https://github.com/Skylion007
ghstack dependencies: #163485
2025-09-23 21:27:12 +00:00
5ca563ea09 symintify fill_diagonal_ (#163485)
Fixes #162271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163485
Approved by: https://github.com/Skylion007
2025-09-23 21:27:12 +00:00
e671dcc969 Update tests to check for more robust pattern (#163107)
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.

Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object in Python land, as this is created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept the frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as shown in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one more frame up, so there is no access to the frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which is different from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"

My initial attempt at solving this was taking the inner closures and putting them into f_locals for the frame I am constructing, which turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce nicer names.

Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
2025-09-23 21:11:48 +00:00
fc84743707 Implement CUDA stream protocol (#163614)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163614
Approved by: https://github.com/eqy
2025-09-23 21:02:08 +00:00
2a9745de3c [multi-kernel] shape-similarity kernel selection (#163090)
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from multi-kernel hints, based on similarity between hint and runtime input & output shapes (L1 distance in log2 space).

Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as a future extension.

Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />

Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
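
A minimal sketch of the selection metric (the kernel container here is hypothetical; only the log2-space L1 distance reflects the description above):

```python
import math

def log2_l1_distance(hint_shape, runtime_shape):
    # L1 distance between shapes in log2 space, summed over all dimensions.
    return sum(abs(math.log2(h) - math.log2(r)) for h, r in zip(hint_shape, runtime_shape))

def pick_kernel(pregenerated, runtime_shape):
    # `pregenerated` is a hypothetical list of (hint_shape, compiled_kernel) pairs.
    return min(pregenerated, key=lambda entry: log2_l1_distance(entry[0], runtime_shape))[1]

# Kernels pre-generated from size hints 64 and 256; a runtime shape of 96 is
# closer to 64 in log2 space, so the 64-hint kernel is selected.
pregenerated = [((64, 64), "kernel_hint_64"), ((256, 256), "kernel_hint_256")]
print(pick_kernel(pregenerated, (96, 96)))  # kernel_hint_64
```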

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
2025-09-23 21:00:47 +00:00
22c5e8c17c Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
bcb893acb0 [ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)
Fixes build timeouts >4h on libtorch build jobs: 75e7f49f9c/1

Brings back code to narrow down CK compilation targets from 69a25f6888 (diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777)

gfx942 supports fp8

Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738

Validation:
[rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162648
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 18:55:35 +00:00
8e6b0c71fb [Inductor] Remove no_type_check annotation on properties (#163570)
Some properties with `cache_on_self` were previously annotated with `no_type_check` to get around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, to enable type checking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
2025-09-23 18:20:04 +00:00
0696a4b0b8 [EZ] Perma-ignore UP038 (#163649)
As it has been removed, see https://docs.astral.sh/ruff/rules/non-pep604-isinstance/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163649
Approved by: https://github.com/Skylion007
ghstack dependencies: #163648
2025-09-23 17:58:18 +00:00
ca35dc2fdd [EZ] Fix UP041 violations (#163648)
I.e. use `TimeoutError` instead of `socket.timeout`
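
For context, `socket.timeout` has been an alias of the builtin `TimeoutError` since Python 3.10, so catching the builtin is equivalent; a small sketch (the connect target is purely illustrative):

```python
import socket

# Since Python 3.10, socket.timeout is an alias of the builtin TimeoutError.
assert socket.timeout is TimeoutError

sock = socket.socket()
sock.settimeout(0.01)
try:
    sock.connect(("10.255.255.1", 80))  # unroutable address, should time out
except TimeoutError:  # previously written as `except socket.timeout:`
    print("timed out")
finally:
    sock.close()
```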
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163648
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-09-23 17:58:18 +00:00
649ceda8a5 [export] handling NamedTuple inputs (#162959)
Fixes #160547
### Summary:
Bug reproducer:
```
    def test_namedtuple(self):
        from collections import namedtuple
        Point = namedtuple('Point', 'x y')

        class M(torch.nn.Module):
            def forward(self, x, y):
                return x + y

        inp = Point(torch.ones(3), torch.ones(3))
        print(M()(*inp))

        # errors
        ep = torch.export.export(M(), inp, strict=False)
        print(ep)

        # succeeds
        ep = torch.export.export(M(), inp, strict=True)
        print(ep)

        # workaround could be to convert namedtuple to a kwarg
        inp_kwargs =  {field: getattr(inp, field) for field in inp._fields}
        ep = torch.export.export(M(), (), inp_kwargs)
        print(ep)
```
Fix:
`namedtuple` is a subclass of `tuple`, but a named tuple was not expected here, so this change handles the named-tuple case.

I have added a 🧪 test case for this as well.
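
A minimal sketch of the detection idiom (illustrative only, not the actual export code): a named tuple is still a `tuple` subclass, but it can be recognized via its `_fields` attribute and flattened into plain positional args.

```python
from collections import namedtuple

def is_namedtuple(obj) -> bool:
    # Named tuples are tuple subclasses whose type defines a `_fields` attribute.
    return isinstance(obj, tuple) and hasattr(type(obj), "_fields")

def normalize_args(args):
    # Flatten a named tuple into a plain tuple of positional args.
    return tuple(args) if is_namedtuple(args) else args

Point = namedtuple("Point", "x y")
print(is_namedtuple(Point(1, 2)))       # True
print(is_namedtuple((1, 2)))            # False
print(normalize_args(Point("a", "b")))  # ('a', 'b')
```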
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <angelayi@meta.com>
2025-09-23 17:43:50 +00:00
2aadcea05c [ROCm] Improve perf for elementwise broadcast with mixed dtype (#163562)
* Unroll loops manually to hide memory access latency

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163562
Approved by: https://github.com/jeffdaily
2025-09-23 17:42:48 +00:00
fde929c8a8 [AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
This should fix the AOTI UTs in `test_aot_inductor_package.py`; these cases were failing at `compile_so`.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor_package.py -v -k test_multiple_methods
```
<img width="1262" height="95" alt="image" src="https://github.com/user-attachments/assets/49458536-1cfe-498e-a12a-2bfd8da67a9e" />

The major fix is in `get_cpp_compile_command`. The code is aligned with the cpp_builder frontend code:  3ef1bef36c/torch/_inductor/cpp_builder.py (L1780-L1790)
3ef1bef36c/torch/_inductor/cpp_builder.py (L1959-L1976)

Fixed on Windows:
<img width="1261" height="89" alt="Image" src="https://github.com/user-attachments/assets/9bf43b11-aac1-4161-a625-e602e313a299" />

Also validated on Linux:
<img width="1039" height="81" alt="Image" src="https://github.com/user-attachments/assets/46063e16-6cf1-4a28-8466-0496871b8619" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163561
Approved by: https://github.com/jansel
2025-09-23 17:38:18 +00:00
134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, the dequantization process was producing numerically incorrect results due to improper block-to-slice coordinate mapping. The previous implementation calculated block boundaries relative to the sliced tensor dimensions instead of the original full tensor dimensions, causing scale factors to be applied to wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

2. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.
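
A 1-D sketch of the block/slice intersection logic in global coordinates (hypothetical helper, not the actual DCP code):

```python
def blocks_for_slice(slice_start: int, slice_stop: int, block_size: int):
    """Yield (block_index, global_start, global_stop) for every quantization
    block that intersects the requested slice, using full-tensor coordinates."""
    first = slice_start // block_size
    last = (slice_stop - 1) // block_size
    for block in range(first, last + 1):
        start = max(slice_start, block * block_size)
        stop = min(slice_stop, (block + 1) * block_size)
        yield block, start, stop

# A slice [100, 300) of a dimension quantized in blocks of 128 touches blocks 0, 1 and 2;
# each block's scale factor is applied only to its intersection with the slice.
print(list(blocks_for_slice(100, 300, 128)))
# [(0, 100, 128), (1, 128, 256), (2, 256, 300)]
```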

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to full tensor shape
3. Scale factors are applied to correct tensor regions
4. Tensor shapes map is built efficiently using only metadata

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
221ac81043 Revert "[precompile] Add option to disable guard check on aot-compiled function. (#163432)"
This reverts commit 539e84e289fa7563032410706ede50a4eaa7a15d.

Reverted https://github.com/pytorch/pytorch/pull/163432 on behalf of https://github.com/Camyll due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/163432#issuecomment-3324757069))
2025-09-23 16:31:30 +00:00
6e5dddba64 Use accelerator API in common_dtensor (#163498)
Fixes #ISSUE_NUMBER

Try to unify the device checks in common_dtensor (a testing module) via the accelerator API.
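
A rough sketch of the idea, assuming the `torch.accelerator` query API available in recent PyTorch builds:

```python
import torch

# Ask the accelerator API for the active backend instead of hard-coding "cuda".
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator().type  # e.g. "cuda", "xpu"
else:
    device_type = "cpu"

print(f"running dtensor tests on: {device_type}")
```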

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163498
Approved by: https://github.com/albanD, https://github.com/H-Huang
2025-09-23 16:30:20 +00:00
ebddbe787a [ROCm][CI] skip test_sparse_triangular_solve (#163651)
Need more time to debug, but we also need a clean CI signal. The test was unskipped by #163495, but had been skipped on ROCm prior to that.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 15:55:51 +00:00
5f0c7cb4aa Add B200 smoke test (#159494)
Running test_max_autotune locally on B200 makes for a horrible read; for now, to get something landed, I am focusing on test_matmul_cuda.py and test_fp8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159494
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
ghstack dependencies: #163460, #163537, #163552
2025-09-23 15:45:05 +00:00
b3cf5c79dd Skip on sm100 later since tests are non-deterministic (#163552)
This is tracked https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think that this is overkill, but feel free to disagree. Also, the one test I ran that failed earlier up in the stack failed with a 1 ULP difference, so I think this is kind of an edge case in how we do testing (will write up an issue with my thoughts later).

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Okay, the bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
720a7b2887 [export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
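
A small illustration of the behavior being worked around (not the export serialization code itself): for a non-contiguous slice, `.contiguous()` materializes only the visible elements, so the raw bytes no longer match the original storage size.

```python
import torch

base = torch.randn(4, 4)  # 64 bytes of float32 storage
view = base[:, :2]        # a non-contiguous slice that still shares base's storage

print(view.untyped_storage().nbytes())               # 64 -- the full original storage
print(view.contiguous().untyped_storage().nbytes())  # 32 -- only the sliced elements remain
```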

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing

Differential Revision: D83016250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
2025-09-23 15:44:56 +00:00
49e7b2f69d [inductor] Fix error from custom CUDA allocators (#163422)
Fixes #163257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163422
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412
2025-09-23 15:37:45 +00:00
6ef74879f6 [dynamo] Fix TorchFunctionMode handling with get_rng_state (#163412)
Fixes #162624
Fixes #162586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163412
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393
2025-09-23 15:37:45 +00:00
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
ed84e808f0 [inductor] Freeze layouts in FlexAttention (#163434)
Fixes #163300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163434
Approved by: https://github.com/drisspg
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419
2025-09-23 15:37:29 +00:00
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
4264fd34ec Add basic tests for torch.distributed.tensor._utils.compute_global_tensor_info (#162968)
Next PR writes a C++ implementation. Seems good to have tests first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162968
Approved by: https://github.com/ezyang
ghstack dependencies: #161695, #162508
2025-09-23 14:56:32 +00:00
e05c9c0c84 [ROCm][CI] cudagraph trees ut fixes (#163592)
Fixes #162125.
Fixes #160719.
Fixes #157901.
Fixes #157871.
Fixes #157761.
Fixes #157723.
Fixes #157643.
Fixes #157616.
Fixes #157556.
Fixes #157533.
Fixes #157449.
Fixes #157428.
Fixes #157413.
Fixes #157367.
Fixes #157350.
Fixes #157339.
Fixes #157312.
Fixes #157280.
Fixes #157258.
Fixes #157173.
Fixes #157143.
Fixes #157112.
Fixes #157086.
Fixes #157058.
Fixes #157035.
Fixes #156984.
Fixes #156957.
Fixes #156954.
Fixes #156922.
Fixes #156886.
Fixes #156838.
Fixes #156808.
Fixes #156801.
Fixes #156778.
Fixes #156755.
Fixes #156735.
Fixes #156693.
Fixes #152561.
Fixes #130749.
Fixes #100074.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163592
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 14:45:00 +00:00
aff76c046d Revert "Add fake_impl for _native_multi_head_attention (#163167)"
This reverts commit 27164b6788cab6e6d8095012839e51c958a819d6.

Reverted https://github.com/pytorch/pytorch/pull/163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see 1a42656d6c/1 ([comment](https://github.com/pytorch/pytorch/pull/163167#issuecomment-3324302026))
2025-09-23 14:36:45 +00:00
1a42656d6c [Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314

In particular, this addresses **Bug 1: H=None Broadcasting Produces Incorrect Results**.

This fixes a shape bug when slicing a BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose this shape, even though the mask_mod remains "interpretable", the kernel's stride math reads the wrong offsets. As a result we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.

The B=None, H=None case only works by accident: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer, and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with the wrong strides, which causes silent errors.
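
The underlying indexing behavior is easy to see on a plain tensor (an illustration of the dimension collapse only, not the BlockMask code): an integer index drops the dimension, while a length-1 slice keeps it.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)

print(t[:, :, 1].shape)    # torch.Size([2, 3])    -- int index collapses the dim
print(t[:, :, 1:2].shape)  # torch.Size([2, 3, 1]) -- slice keeps the dim
```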

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
2025-09-23 13:01:51 +00:00
bda9ab291d [inductor] fix as_strided lowering with .view(dtype) inputs (#163319)
FIXES https://github.com/pytorch/pytorch/issues/163286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163319
Approved by: https://github.com/eellison
2025-09-23 12:50:57 +00:00
3c64b2abab CUDA 13.0 Warning update for supported architectures (#163585)
Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display the correct warning:
```
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet
2025-09-23 11:27:11 +00:00
5d749ceb92 Remove test conditions for CUDA<12 (#163495)
Because CUDA >= 12 is now required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163495
Approved by: https://github.com/janeyx99
2025-09-23 07:52:00 +00:00
8d81564df5 [pt2][cache] rework cache for true generic usage + better tests (#163488)
Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd, since we'd probably want to be able to store more than just str keys with bytes values. So, TL;DR: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.

With this change there were also some improvements made to the testing: mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there. For example, we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).
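
A minimal sketch of that design under the stated constraints (pickle-able keys/values, errors wrapped in CacheError); the names and structure here are illustrative, not the actual implementation:

```python
import pickle
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class CacheError(Exception):
    """Wraps any error raised inside the cache machinery."""

class InMemoryCache(Generic[K, V]):
    # Under the hood every cache can be a bytes -> bytes cache: keys and
    # values are pickled on the way in and unpickled on the way out.
    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}

    def get(self, key: K) -> Optional[V]:
        try:
            blob = self._store.get(pickle.dumps(key))
            return None if blob is None else pickle.loads(blob)
        except Exception as err:
            raise CacheError(f"lookup failed for {key!r}") from err

    def put(self, key: K, value: V) -> None:
        try:
            self._store[pickle.dumps(key)] = pickle.dumps(value)
        except Exception as err:
            raise CacheError(f"store failed for {key!r}") from err

cache: InMemoryCache[tuple, dict] = InMemoryCache()
cache.put(("kernel", 3), {"latency_ms": 0.42})
print(cache.get(("kernel", 3)))  # {'latency_ms': 0.42}
```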

Putting this PR up now for feedback. In the process of generalizing the code I removed the documentation since it was becoming outdated, but I will add that back in after the PR is green.

I have the next PR ready as well (it implements a fresh-cache context manager); I will export it once this lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-09-23 07:31:48 +00:00
b426ba1d5e [torchfuzz] introduce tensor and scalar pointwise ops (#163558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163558
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557
2025-09-23 06:20:13 +00:00
375f3e3a61 [OpenReg][Docs] Correct docs about openreg usage example. (#163235)
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).

Since I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install `gcc`, I believe it's necessary to slightly modify the command and add `-I ./` to explicitly indicate the header-file search path.

## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-09-23 06:16:45 +00:00
45d9dcccc5 Update Kineto Submodule (#162222)
Summary: Update

Test Plan:
CI

Differential Revision: D81727392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162222
Approved by: https://github.com/sanrise
2025-09-23 06:08:55 +00:00
309fe03f4b [torchfuzz] remove unneeded try catch (#163557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163557
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556
2025-09-23 06:05:08 +00:00
1545bb1c00 [torchfuzz] shuffle compatible ops (#163556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163556
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555
2025-09-23 05:53:44 +00:00
d5e51d34f7 [torchfuzz] decompose -> fuzz_inputs_specs (#163555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163555
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554
2025-09-23 05:44:59 +00:00
08c5efde5f [torchfuzz] cache operators (#163554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163554
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553
2025-09-23 05:28:07 +00:00