Commit Graph

36094 Commits

Author SHA1 Message Date
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables bf16 group gemm alternative path as described in #161366
Fast path will be enabled in future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
86d34a43f5 NamedTuple: Allow side effects for dynamic attributes (#161645)
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.

The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.

Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.

Fixes #161610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
2025-09-09 19:42:02 +00:00
8508651477 Fix flaky AOTFxirTestCase (#162472)
Fixes https://github.com/pytorch/pytorch/issues/162357
Fixes https://github.com/pytorch/pytorch/issues/160970
Fixes https://github.com/pytorch/pytorch/issues/161038
Fixes https://github.com/pytorch/pytorch/issues/160951
Fixes https://github.com/pytorch/pytorch/issues/161698

These tests were introduced in https://github.com/pytorch/pytorch/pull/160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option).  The issue could be reproduced by running them locally multiple times.  For example,

```
pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add
(output logs at P1938386961)
...
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
=============================================================================================================================================== 3 failed, 7 passed in 12.89s ===============================================================================================================================================
```

Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from https://github.com/pytorch/pytorch/pull/160765.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162472
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-09-09 19:39:24 +00:00
82f1eb9b03 Revert "[MPS] mps sparse mul op implementation (#162349)"
This reverts commit 3ea686804925f1291de57ffdb3394da0b46deb54.

Reverted https://github.com/pytorch/pytorch/pull/162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](https://github.com/pytorch/pytorch/pull/162349#issuecomment-3271783442))
2025-09-09 18:14:16 +00:00
5eb35d2ab8 [CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387
Approved by: https://github.com/Skylion007
2025-09-09 17:04:06 +00:00
f03d635dc6 [ROCm][CI] skip test_max_autotune until resolved (#162496)
many tests taking >30 min and causing timeouts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162496
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 16:34:01 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skippped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
3ea6868049 [MPS] mps sparse mul op implementation (#162349)
Implements mps sparse mul operation as well as enables other operations such as:
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-09 15:45:37 +00:00
4dd73e659a Revert "fix torch.sparse.log_softmax on CPU (#161959)"
This reverts commit 002e59440afe8711019e68df500f5e18b9a43f3c.

Reverted https://github.com/pytorch/pytorch/pull/161959 on behalf of https://github.com/davidberard98 due to test failure: test_sparse.py::TestSparseMPS::test_log_softmax_float_mps_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17573794461/job/49915138287) [HUD commit link](002e59440a) ([comment](https://github.com/pytorch/pytorch/pull/161959#issuecomment-3270509418))
2025-09-09 12:33:25 +00:00
002e59440a fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007
2025-09-09 06:25:16 +00:00
d85392a88e Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-09 05:42:19 +00:00
60d009267e Revert "testing infra and some fixes (#162183)"
This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d.

Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))
2025-09-09 05:26:32 +00:00
4590438329 [fx] fix qualified name for methods of torch.Tensor (#162407)
This fixes an error in the previous PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2025-09-09 05:14:43 +00:00
065c446193 [SymmMem] Use global pe for put and get (#162394)
NVSHMEM put/get APIs take global PE instead of local counterpart. So we'd need to do a translation within the kernel.

Also added a sub-group test for dispatch and combine mimic'ing the Expert Parallel cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
2025-09-09 03:58:48 +00:00
4c45090cf7 [DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798)
Fixes #159590

This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).

This also handles cases such as #159590  where dynamo is disabled during tracing by entering the Python Dispatcher ahead of the sharding propogation during compile. Tests are added/modified to handle these, and the list/tuple inputs with the cat op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
2025-09-09 03:52:05 +00:00
1641606aa4 Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a.

Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))
2025-09-09 03:33:36 +00:00
7b8a64557d [inductor] fix 3d tiled online softmax (#162341)
The online_softmax_reduce runtime helper previously assumes the input tl.Tensor's are 2d tensors. But with tiled reduction, they can be 3d (y, x, r).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162311
2025-09-09 02:59:52 +00:00
d8b6622bb6 testing infra and some fixes (#162183)
This PR is quite large in that it covers most of rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
2025-09-09 02:42:11 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
6eb14ac60f [Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447)
This PR fixes bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior.

Fixes: #140457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447
Approved by: https://github.com/mlazos
2025-09-09 01:07:02 +00:00
ed77e23b68 Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482)"
This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7.

Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))
2025-09-09 00:21:05 +00:00
8485aac873 [precompile] Fix inlined source tracking with generators. (#162389)
Summary:
When compiled code has generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location.

In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.

Test Plan:
test_package.py -k test_code_with_generator

Rollback Plan:

Differential Revision: D81929751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
2025-09-09 00:13:54 +00:00
07f07309c6 [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/huydhn
2025-09-08 23:30:11 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous()
but want to find those call sites to handle this properly by calling  is_contiguous_or_false() and not is_contiguous() explitly when appropriate.
I had to fix one issue after removing the implicit size oblivious reasoning. here is context

we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format);

This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning.
once we removed that implicit size oblivious reasoning, the right thing we want is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get DDE even if the caller is doing sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

  out_only = flex_attention(query, key, value, score_mod)
  out_max, aux_max = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(max_scores=True),
  )
  out_both, aux_both = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True),
        )
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kinda of annoying here since we end up with a combinatorial problem where if we need to add any more return vals we need to new kwargs that gate if they get returned by the function and need to support the 2**N additional args possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added kwarg only now

Maybe we make a `ExtraReturns` type kwarg that can grow and we don't need to keep adding new top level args.

We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors.

### Req Grad
I currently dont return a max_scores that supports backproping grads. I think this might be feasible  but since max is essentially 1 hot 	on the inputs and a reduction we would either need to save another `max_location` from the forward or find the max_score but also only apply to first occurence if there is multiple equivalent scores (need to check if thats we define for vanilla max op in torch).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable for flex_decode. Since at least initially the motivation is for training. I also more hard than it should be to have ops return nuns or optional tensors, If return max is at the false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
711c8c821e shape guards (#161178)
Summary: This PR introduces shape guards to export. Previously only value ranges,  equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program.

Test Plan:
updated several tests

Rollback Plan:

Differential Revision: D80713603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
2025-09-08 22:44:09 +00:00
fecd9686f5 Graph split event tracker (#159795)
Summary:
A tool to track events in graph split, specifically on how nodes being end up in acc or cpu subgraphs.

Usage: use env var to specify a mode and necessary arguments.

FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspect in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recusrively and dump to DUMP_PREFIX_nodex.txt
"4:: In addition to events dump, track all nodes with more than 1 event recusrively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".

Test Plan: New unit test

Reviewed By: georgiaphillips

Differential Revision: D79203595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
2025-09-08 21:30:17 +00:00
5d819f3faf Revert "[associative_scan] Autograd separated (#139939)"
This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8.

Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))
2025-09-08 20:42:47 +00:00
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841 where sdpa returned NaNs, because during the computation intermediate values were cast back to fp16 before normalization, which was fixed by https://github.com/pytorch/pytorch/pull/161999 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
26a1b9cce2 [dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
Fixes https://github.com/pytorch/pytorch/issues/162313

Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305
2025-09-08 20:26:24 +00:00
8ec01f34e9 [bucketing] custom_ops mode to hide inductor copies overhead (#161499)
Adding "_custom_ops" bucketing to temporary fallback to eager execution of for_each,
to workaround too many generated kernels on inductor side.

This PR also reverts parts of bucketing changes for cycles detection that resulted in accuracy problems.

Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499
Approved by: https://github.com/eellison
2025-09-08 20:03:08 +00:00
4e50651c5f [DTensor] fix F.one_hot (#162307)
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).

Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
2025-09-08 19:37:08 +00:00
a0d026688c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
d80297a684 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-08 19:10:36 +00:00
fbcabb4fbd Handle f([]) vs. f() in fake tensor caching (#162284)
Fixes https://github.com/pytorch/pytorch/issues/162279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162284
Approved by: https://github.com/manuelcandales, https://github.com/aorenste
2025-09-08 18:28:05 +00:00
ebd29a13fe [inductor] fuse for scalar shared data (#162311)
LOAF previously may skip these fusion opportunities and cause some tests fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
2025-09-08 17:20:46 +00:00
f044fa2902 [AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
The async tp result and regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error is from numerical error of low precision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
2025-09-08 16:12:52 +00:00
a92773eeb1 Revert "Use vectorized stores for all dtypes in cat (#161649)"
This reverts commit 377033757ae5ca524ea842f1b0a5f446ed3d8fe0.

Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3266963044))
2025-09-08 15:58:58 +00:00
25c170b72e [inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (#161405)
During comms reordering , sink wait iterative observed previous runtime estimations pretty off for collectives and mms.

Adding optional usage of:
- c10d.time_estimator for collectives, which is based on NCCL estimator

Benchmark mode only for matmuls, as they are highly dependent on mm backend

- The logic mostly copied from Ruisi's PRs for inductor simple_fsdp https://github.com/pytorch/pytorch/pull/157572

This estimations corrections are in default `BaseSchedulerNode.estimate_runtime()`

Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405
Approved by: https://github.com/eellison
2025-09-08 14:33:19 +00:00
e101411b9f Update slow tests (#161395)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161395
Approved by: https://github.com/pytorchbot
2025-09-08 13:33:32 +00:00
5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to be 1, which cause the out (C) being added to the the output. Thus if the output is not initialized as zero beforehand, the output can be incorrect.

Removing the alpha and beta fixes the issue.

Thanks @ngimel to figure out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
1e0656f063 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit de893e96c775023aa3be895060848fac3296772c.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
29e09a6545 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
103f725afa [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/ydwu4
2025-09-08 03:21:17 +00:00
5babb4d5c0 Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-07 23:37:31 +00:00
ec2e3687c7 [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
2025-09-07 21:55:29 +00:00
ff2de5d522 Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473)"
This reverts commit 040d00af048967dde7938d358d7f5988cbd18388.

Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983))
2025-09-07 21:06:38 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
e246a85b76 Revert "[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)"
This reverts commit 5c473e9f5ee0ef0fc38e6cf34a95b547f8cdc8d5.

Reverted https://github.com/pytorch/pytorch/pull/159118 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473 ([comment](https://github.com/pytorch/pytorch/pull/159118#issuecomment-3264037799))
2025-09-07 21:00:29 +00:00
7a83cf430e Revert " [while_loop][autograd] support autograd_key of while_loop (#160483)"
This reverts commit 2b8a83901c58a0858ea9e4ce00055f48e6ed164c.

Reverted https://github.com/pytorch/pytorch/pull/160483 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but some trunk tests are failing either from this PR or the previous one in the stack ([comment](https://github.com/pytorch/pytorch/pull/160483#issuecomment-3263597325))
2025-09-07 08:50:49 +00:00