Compare commits

4710 Commits

Author SHA1 Message Date
83349ae64d [async_tp] Base support ag-transpose-mm(mat_B) case
ghstack-source-id: edd51b9c46e46e8eca0c45e0ea53c1b26b375c01
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163069
2025-09-19 08:35:51 -07:00
bf08b164dc [async_tp] Support ag+mm with gather_dim lastdim of mat_A
ghstack-source-id: 8de8acdc31566643d4b8370f27006002b05cdd61
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068
2025-09-16 04:42:16 -07:00
da0b6aea11 [async_tp] Support mm+rs with scatter_dim matmul K by sharding B
ghstack-source-id: dee5390f82c6899af543adc6b91b5954097077ad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162794
2025-09-12 04:34:10 -07:00
4840a1a591 Run vLLM tests on all trunk commits before 2.9 branch cut (#161797)
This makes it easier to bisect issues now, given that we don't have much time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161797
Approved by: https://github.com/yangw-dev
2025-09-09 05:56:41 +00:00
d49205fe1f Add more tests for vllm and clean out the old vllm test (#162292)
Test failure coverage from pytorch 2.8 release issues
[internal access only](https://docs.google.com/document/d/1zvK1eUAHubHGGHg9jKxd-QlP89fzgfqOBvE2m9mUs90/edit?tab=t.0)

See coverage mapping
| Given test / pattern | Suite ID (from config) |
|---|---|
| pytest -v -s basic_correctness/test_cumem.py | vllm_basic_correctness_test |
| pytest -v -s entrypoints/openai/test_sleep.py | vllm_entrypoints_test |
| pytest -v -s entrypoints/openai/test_translation_validation.py::test_long_audio_request | vllm_entrypoints_test |
| pytest -v -s lora/test_quant_model.py | vllm_lora_28_failure_test |
| pytest -v -s -x tests/lora/test_llama_tp.py | vllm_lora_tp_test_distributed |
| pytest -v -s distributed/test_sequence_parallel.py -k test_tp_sp_generation | vllm_distributed_test_28_failure_test |
| pytest -v -s distributed/test_sequence_parallel.py::test_tp_sp_generation[...] | vllm_distributed_test_28_failure_test |
| pytest models/language/generation/test_mistral.py::test_models[...] | vllm_languagde_model_test_extended_generation_28_failure_test |
| pytest models/multimodal/pooling/test_jinavl_reranker.py::test_model_text_image[...] | vllm_multi_model_test_28_failure_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen25vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora_beam_search | vllm_lora_test |
| tests/lora/test_phi.py::test_phi2_lora | DIDN'T FIND IT IN VLLM |
| models/multimodal/generation/test_voxtral.py::test_models_with_multiple_audios[5-128-half] | vllm_multi_model_test_28_failure_test |
| models/test_initialization.py::test_can_initialize[VoxtralForConditionalGeneration] | vllm_basic_models_test |
| pytest -v -s -x lora/test_chatglm3_tp.py -k test_chatglm3_lora_tp4_fully_sharded_loras | vllm_lora_tp_test_distributed |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162292
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-09-09 05:53:46 +00:00
d85392a88e Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-09 05:42:19 +00:00
7feb8fc589 [SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not available (#162142)
Summary:
As we have multiple backends, _SymmetricMemory should not be imported together with NVSHMEM related modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162142
Approved by: https://github.com/dcci, https://github.com/kwen2501
2025-09-09 05:37:43 +00:00
60d009267e Revert "testing infra and some fixes (#162183)"
This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d.

Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))
2025-09-09 05:26:32 +00:00
4590438329 [fx] fix qualified name for methods of torch.Tensor (#162407)
This fixes an error in the previous PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2025-09-09 05:14:43 +00:00
8494afb837 Add missing fstream include to fix std::ofstream compilation error (#162421)
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).

## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.

## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
2025-09-09 05:14:32 +00:00
7ad40de60e [audio hash update] update the pinned audio hash (#162437)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162437
Approved by: https://github.com/pytorchbot
2025-09-09 04:41:34 +00:00
607327beae [vllm hash update] update the pinned vllm hash (#162356)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162356
Approved by: https://github.com/pytorchbot
2025-09-09 04:40:25 +00:00
f216d64bfe [SymmMem] Better tuning of A2AV based on accurate node boundary (#162003)
Use `world_within_direct_access()` to distinguish intra- vs inter- node.
Previously we assumed a fixed node size of 8, which is not true for NVL72.

Also added env var `TORCH_SYMMMEM_NBLOCKS` for control.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162003
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-09-09 04:18:17 +00:00
847d7f21af [CUDA-13] Implement workaround for cudaErrorNotSupported (#162412)
See https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162412
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-09 04:12:10 +00:00
065c446193 [SymmMem] Use global pe for put and get (#162394)
NVSHMEM put/get APIs take global PE instead of local counterpart. So we'd need to do a translation within the kernel.

Also added a sub-group test for dispatch and combine, mimicking the Expert Parallel cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
2025-09-09 03:58:48 +00:00
98ecc0f374 [SymmMem] Add team pool to hold duplicated teams for the same rank group (#162320)
When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires each call to be made on a separate team struct; see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html?highlight=nvshmem_barrier_all#collective-operations-scopes-and-active-sets).

This PR adds a util `get_n_teams` that creates duplicated NVSHMEM teams for the same rank group (i.e. a team pool) so that we can use them on the device side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162320
Approved by: https://github.com/ngimel
2025-09-09 03:58:48 +00:00
4c45090cf7 [DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798)
Fixes #159590

This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).

This also handles cases such as #159590, where dynamo is disabled during tracing, by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these, as well as list/tuple inputs with the cat op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
2025-09-09 03:52:05 +00:00
1641606aa4 Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a.

Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))
2025-09-09 03:33:36 +00:00
7b8a64557d [inductor] fix 3d tiled online softmax (#162341)
The online_softmax_reduce runtime helper previously assumed that the input tl.Tensors are 2d tensors. But with tiled reduction, they can be 3d (y, x, r).
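
For orientation, here is a minimal pure-Python sketch of an online softmax reduction that works over the innermost axis whether the input is 2d or 3d. It is not the Triton helper from this PR; the function names `combine`, `online_softmax_denominator`, and `reduce_last_axis` are made up for the illustration.

```python
import math


def combine(a, b):
    """Merge two partial (running_max, running_sum) softmax states."""
    (m_a, s_a), (m_b, s_b) = a, b
    m = max(m_a, m_b)
    # Rescale each partial sum to the new shared maximum before adding.
    return m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m)


def online_softmax_denominator(row):
    """Reduce one row to (max, sum(exp(x - max))) in a single pass."""
    state = (row[0], 1.0)
    for x in row[1:]:
        state = combine(state, (x, 1.0))
    return state


def reduce_last_axis(block):
    """Apply the reduction over the innermost (r) axis of a nested list."""
    if isinstance(block[0], (int, float)):
        return online_softmax_denominator(block)
    return [reduce_last_axis(sub) for sub in block]


# A toy tile with layout (y=2, x=2, r=3); the values are arbitrary.
tile_3d = [
    [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]],
    [[-1.0, 4.0, 0.0], [2.0, 2.0, 1.0]],
]
result = reduce_last_axis(tile_3d)
```

Because the recursion only looks at the innermost axis, the same code handles a 2d input such as `tile_3d[0]` unchanged.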

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162311
2025-09-09 02:59:52 +00:00
d8b6622bb6 testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
2025-09-09 02:42:11 +00:00
a965f09793 [export] Update PT2 archive docs (#162308)
Summary: Minor updates based on the recent refactoring for weight saving and loading

Test Plan:
doc change only

Rollback Plan:

Differential Revision: D81821994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162308
Approved by: https://github.com/angelayi
2025-09-09 02:08:13 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
e025c0f459 Dynamo: set_eval_frame microoptimization (#162220)
Optimize for common case and remove a pair of refcount operations (see new comments.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162220
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219
2025-09-09 01:10:06 +00:00
a8a187b2cf Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef (#162219)
Avoids requiring vector allocation to call this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162219
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692
2025-09-09 01:10:06 +00:00
12db2a7889 Call checkLong in is_int_or_symint, completing TODO (#161692)
Calling this first minimizes overhead for plain old ints, making cheap things cheap.

Differential Revision: [D81530098](https://our.internmc.facebook.com/intern/diff/D81530098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161692
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634
2025-09-09 01:10:06 +00:00
eab2afeff7 fastpath type Tensor in THPVariable_NewWithVar (#161634)
It is cheap to do an exact check against Tensor and much faster when it works (PyType_IsSubtype does not have this fastpath, I checked [source](9ee0214b5d/Objects/typeobject.c (L2889))). Spot-checked in perf on detach-DTensor-in-a-loop benchmark; small win but clear.

Differential Revision: [D81530101](https://our.internmc.facebook.com/intern/diff/D81530101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161634
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #161591, #161595, #161633
2025-09-09 01:10:06 +00:00
a951f435fd Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function (#161633)
py::args::size() calls PyTuple_GetSize. Compiler can't know the two calls will always return the same result, so we have to consolidate them ourselves.

Differential Revision: [D81530096](https://our.internmc.facebook.com/intern/diff/D81530096)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161633
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595
2025-09-09 01:10:06 +00:00
6eb14ac60f [Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447)
This PR fixes a bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior.

Fixes: #140457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447
Approved by: https://github.com/mlazos
2025-09-09 01:07:02 +00:00
ed77e23b68 Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482)"
This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7.

Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))
2025-09-09 00:21:05 +00:00
897c4e70a7 Move to small wheel approach for CUDA SBSA wheel (#160720)
https://github.com/pytorch/pytorch/issues/160673

Use download.pytorch.org's dependencies, like the x86 build, instead of bundling libs into the wheel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160720
Approved by: https://github.com/atalman
2025-09-09 00:18:43 +00:00
8485aac873 [precompile] Fix inlined source tracking with generators. (#162389)
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location.

In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.
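
The mismatch can be seen with the standard library alone. The sketch below is only an illustration (the `SRC` snippet and the helper `iter_nested_code` are hypothetical, not code from this PR): it compiles a small function containing a generator expression and collects `co_firstlineno` for each nested code object.

```python
import types

# Hypothetical source used only for this illustration.
SRC = """\
def outer(xs):
    return sum(
        x * 2
        for x in xs
    )
"""


def iter_nested_code(code):
    """Yield every code object nested inside `code`, depth first."""
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield const
            yield from iter_nested_code(const)


module_code = compile(SRC, "<demo>", "exec")
first_lines = {c.co_name: c.co_firstlineno for c in iter_nested_code(module_code)}

# The generator expression is its own code object, and its first line is
# later than the first line of the function that encloses it.
print(first_lines)
```

Anything that keys source locations on the inner code object's `co_firstlineno` will therefore disagree with the location of the enclosing function.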

Test Plan:
test_package.py -k test_code_with_generator

Rollback Plan:

Differential Revision: D81929751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
2025-09-09 00:13:54 +00:00
c0fc86b511 Fix aarch64 wheel pack (#159481)
PR that introduced the change: https://github.com/pytorch/builder/pull/1775
Use wheel pack instead of zip to repack the wheel.
It should regenerate the RECORD file and update all the hashes correctly.

TODO:
Apply wheel pack instead of zip to the rest of the builds
Add a validation test to make sure the wheel contents match the RECORD file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159481
Approved by: https://github.com/malfet
2025-09-08 23:36:50 +00:00
07f07309c6 [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/huydhn
2025-09-08 23:30:11 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context.

In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

When people call is_contiguous, we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
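
A toy Python model of the two entry points may help. Every name here (`ToySymBool`, `ToyTensor`, `DataDependentError`) is hypothetical and only mimics the behavior described above; none of it is PyTorch code.

```python
class DataDependentError(RuntimeError):
    """Stand-in for the data-dependent error (DDE) mentioned above."""


class ToySymBool:
    """Hypothetical symbolic boolean: True, False, or unknown (None)."""

    def __init__(self, value):
        self.value = value

    def guard_bool(self):
        # Must produce a concrete answer; an unknown value is an error.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity symbolically")
        return self.value

    def guard_or_false(self):
        # Never raises; an unknown value falls back to False.
        return False if self.value is None else self.value


class ToyTensor:
    def __init__(self, contiguity):
        self._contiguity = contiguity

    def sym_is_contiguous(self):
        # Returns a symbolic result and never raises by itself.
        return ToySymBool(self._contiguity)

    def is_contiguous(self):
        return self.sym_is_contiguous().guard_bool()

    def is_contiguous_or_false(self):
        return self.sym_is_contiguous().guard_or_false()


unknown = ToyTensor(None)
safe_answer = unknown.is_contiguous_or_false()
try:
    unknown.is_contiguous()
    raised = False
except DataDependentError:
    raised = True
```

In this model only `is_contiguous` can raise, which is why call sites that can tolerate a conservative answer are steered toward `is_contiguous_or_false`.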

One issue that was not handled well was this path:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).

This went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to do is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

So I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
5fd6b6a2db [refactor] add helper sizevars function, is_size_one, for size==1 checks (#162189)
## Summary
- document guard behavior in `SizeVarAllocator.is_size_one`
- use `is_size_one` for broadcast/expand checks.
- This diff is a no-op since we'd use `shape_env.evaluate_expr(... fallback_value=False)`

a4f9132a17/torch/_inductor/sizevars.py (L450-L453)

------
https://chatgpt.com/codex/tasks/task_e_68b8d0d1f2c48328b2d38c00e738bc8c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162189
Approved by: https://github.com/laithsakka
2025-09-08 22:48:16 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

# Example usage
out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need to add new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added it as kwarg-only for now.

Maybe we make an `ExtraReturns` type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward or find the max_score but apply the gradient only to the first occurrence if there are multiple equal scores (need to check if that's what we define for the vanilla max op in torch).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
711c8c821e shape guards (#161178)
Summary: This PR introduces shape guards to export. Previously, only value ranges, equalities, and specializations were tracked for symbolic expressions, and we had a forward hook to check them. Now we instead create a function that checks shape guards and call it in the exported program.

Test Plan:
updated several tests

Rollback Plan:

Differential Revision: D80713603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
2025-09-08 22:44:09 +00:00
2c538c9acf rewrite __maybe_broadcast should_expand check for unbacked (#162109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162109
Approved by: https://github.com/aorenste
ghstack dependencies: #162084, #162099
2025-09-08 22:41:18 +00:00
85fe94e933 make should_swap more dde friendly (#162099)
Unblock customers for common cases with DDE until @pianpwk lands the change to should_swap https://github.com/pytorch/pytorch/pull/160473.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162099
Approved by: https://github.com/aorenste
ghstack dependencies: #162084
2025-09-08 22:41:18 +00:00
fecd9686f5 Graph split event tracker (#159795)
Summary:
A tool to track events in graph split, specifically how nodes end up in acc or cpu subgraphs.

Usage: use env var to specify a mode and necessary arguments.

FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information is available by setting breakpoints and inspecting visually in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recursively and dump to DUMP_PREFIX_nodex.txt
"4": In addition to events dump, track all nodes with more than 1 event recursively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
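
As a rough sketch of how these variables might be set from Python: the helper `tracker_env` and the dump path are made up for illustration, and the format expected for the tracked-nodes value is not specified above, so it is passed through as an opaque string. Setting them before the splitter runs is assumed to be sufficient.

```python
import os

VALID_MODES = {"0", "1", "2", "3", "4"}


def tracker_env(mode, dump_path=None, tracked_nodes=None):
    """Build the tracker's environment variables as a plain dict."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown tracker mode: {mode!r}")
    env = {"FX_NET_ACC_SPLITTER_TRACKER_MODE": mode}
    if dump_path:
        env["FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH"] = dump_path
    if tracked_nodes:
        # Assumption: passed through as-is; the exact format is not
        # documented in the commit message.
        env["FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES"] = tracked_nodes
    return env


# Mode "2" dumps all events. The path below is a placeholder.
settings = tracker_env("2", dump_path="/tmp/splitter_dump")
os.environ.update(settings)
```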

Test Plan: New unit test

Reviewed By: georgiaphillips

Differential Revision: D79203595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
2025-09-08 21:30:17 +00:00
dd44faa9d9 Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)"
This reverts commit 0af70e2353e1dcda83175fd4834ecb7b63e009e0.

Reverted https://github.com/pytorch/pytorch/pull/162103 on behalf of https://github.com/jithunnair-amd due to Cirrascale network outage resolved. Reverting back to running per commit to aid in triage and CI health ([comment](https://github.com/pytorch/pytorch/pull/162103#issuecomment-3267977825))
2025-09-08 20:53:05 +00:00
5d819f3faf Revert "[associative_scan] Autograd separated (#139939)"
This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8.

Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))
2025-09-08 20:42:47 +00:00
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841, where sdpa returned NaNs because intermediate values were cast back to fp16 before normalization during the computation (fixed by https://github.com/pytorch/pytorch/pull/161999).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
26a1b9cce2 [dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
Fixes https://github.com/pytorch/pytorch/issues/162313

Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305
2025-09-08 20:26:24 +00:00
8f114650eb Add std::any_of to ConvParams struct (#162334)
Removes some for-loops that didn't short-circuit in favor of std::any_of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162334
Approved by: https://github.com/Skylion007
2025-09-08 20:12:20 +00:00
ec2c1371af [BE]: Update cudnn frontend submodule to 1.14.1 (#162347)
Fixes a few bugs introduced in CUDNN 1.11 that affect all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header-only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-08 20:03:23 +00:00
8ec01f34e9 [bucketing] custom_ops mode to hide inductor copies overhead (#161499)
Adds a "_custom_ops" bucketing mode that temporarily falls back to eager execution of for_each,
to work around too many kernels being generated on the inductor side.

This PR also reverts parts of bucketing changes for cycles detection that resulted in accuracy problems.

Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499
Approved by: https://github.com/eellison
2025-09-08 20:03:08 +00:00
9c991b63ff [CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 build (#162364)
https://github.com/pytorch/pytorch/issues/159779

Add the full CUDA support matrix to sbsa build (12.6, 12.8)
Same arch support as x86 build
Remove 12.9 sbsa build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162364
Approved by: https://github.com/atalman
2025-09-08 20:00:25 +00:00
4e50651c5f [DTensor] fix F.one_hot (#162307)
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).
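
To make the arange-plus-eq formulation mentioned above concrete, here is a minimal pure-Python sketch. It uses plain lists instead of tensors, and `one_hot_via_arange` is a hypothetical name for illustration, not the ATen implementation.

```python
def one_hot_via_arange(labels, num_classes):
    """Build one-hot rows by comparing each label against range(num_classes)."""
    classes = list(range(num_classes))  # plays the role of the arange call
    # The comparison below plays the role of the eq call that consumes arange.
    return [[int(label == c) for c in classes] for label in labels]


print(one_hot_via_arange([0, 2, 1], 3))
```

In the DTensor case, the `classes` value is the freshly created plain Tensor that has to be implicitly replicated before it can be compared against the DTensor labels.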

Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
2025-09-08 19:37:08 +00:00
a0d026688c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
d80297a684 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-08 19:10:36 +00:00
fbcabb4fbd Handle f([]) vs. f() in fake tensor caching (#162284)
Fixes https://github.com/pytorch/pytorch/issues/162279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162284
Approved by: https://github.com/manuelcandales, https://github.com/aorenste
2025-09-08 18:28:05 +00:00
314d47a210 [audio hash update] update the pinned audio hash (#162315)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315
Approved by: https://github.com/pytorchbot
2025-09-08 18:26:33 +00:00
bc4176c92a CD Windows CUDA 13.0 build - fix packaging of cuda dlls (#162383)
Trying to fix https://github.com/pytorch/pytorch/issues/162333

The CUDA 13.0 file structure changed. Instead of keeping most of the DLLs in the bin folder, they are now in ``bin\x64``, except for the cudnn DLL. See the attached pictures:
<img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" />
<img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162383
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
2025-09-08 17:57:22 +00:00
eqy
de5dc1f038 [cuDNN][SDPA][Nested Tensor] add forward/backward caching support for cuDNN SDPA Nested tensor/varlen (#161434)
Don't recompile every time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161434
Approved by: https://github.com/drisspg
2025-09-08 17:51:13 +00:00
72e6717d00 Avoid crash with release_available_cached_blocks (#162269)
Updated release behavior for cached blocks.
Fixes #159567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162269
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-09-08 17:46:43 +00:00
ebd29a13fe [inductor] fuse for scalar shared data (#162311)
LOAF (loop ordering after fusion) could previously skip these fusion opportunities and cause some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
2025-09-08 17:20:46 +00:00
5793dd7875 [Intel GPU] Integrate OneDNN SDPA training forward and backward (#161058)
This PR is the first split PR of https://github.com/pytorch/pytorch/pull/156272, only contains the OneDNN code. Please help review.

Pending on OneDNN v3.9 commit update. Don't merge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161058
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-09-08 17:07:31 +00:00
49c446c617 Add C++ function for torch.distributed.tensor._op_schema.is_view_op (#161595)
This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: #161466, #161586, #161590, #161591
2025-09-08 16:28:08 +00:00
8e076d889c Don't call check_has_torch_dispatch in THPVariable_NewWithVar if we already know (#161591)
We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161591
Approved by: https://github.com/ezyang
ghstack dependencies: #161466, #161586, #161590
2025-09-08 16:28:08 +00:00
f044fa2902 [AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
The async tp result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This suggests that the mismatch comes from low-precision numerical error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
2025-09-08 16:12:52 +00:00
a92773eeb1 Revert "Use vectorized stores for all dtypes in cat (#161649)"
This reverts commit 377033757ae5ca524ea842f1b0a5f446ed3d8fe0.

Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3266963044))
2025-09-08 15:58:58 +00:00
53297f6ad0 Revert "[audio hash update] update the pinned audio hash (#162315)"
This reverts commit c9ac8c25ef9ad020542898ab569910a9d0cd1f7e.

Reverted https://github.com/pytorch/pytorch/pull/162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](https://github.com/pytorch/pytorch/pull/162315#issuecomment-3266932718))
2025-09-08 15:52:30 +00:00
25c170b72e [inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (#161405)
During comms reordering (sink wait iterative), we observed that the previous runtime estimations were quite far off for collectives and mms.

Adding optional usage of:
- c10d.time_estimator for collectives, which is based on NCCL estimator

Benchmark mode only for matmuls, as they are highly dependent on mm backend

- The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp https://github.com/pytorch/pytorch/pull/157572

These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`

Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405
Approved by: https://github.com/eellison
2025-09-08 14:33:19 +00:00
3f5993316e [upstream triton] update triton pin to triton 3.5 (#162278)
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)

Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
2025-09-08 14:29:24 +00:00
e101411b9f Update slow tests (#161395)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161395
Approved by: https://github.com/pytorchbot
2025-09-08 13:33:32 +00:00
32911ff541 [xla hash update] update the pinned xla hash (#162372)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162372
Approved by: https://github.com/pytorchbot
2025-09-08 11:31:16 +00:00
5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which causes the out (C) buffer to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.

Removing the alpha and beta fixes the issue.
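
The effect of beta can be illustrated with a tiny pure-Python GEMM of the usual form `alpha * (A @ B) + beta * C`. This is only a sketch on list-of-lists matrices; the `gemm` helper is hypothetical, and it assumes that dropping alpha and beta is equivalent to alpha = 1 and beta = 0.

```python
def gemm(a, b, c, alpha=1.0, beta=0.0):
    """Return alpha * (a @ b) + beta * c for small list-of-lists matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so a @ b == a
stale_out = [[100.0, 100.0], [100.0, 100.0]]  # stands in for an uninitialized output

with_beta_one = gemm(a, b, stale_out, beta=1.0)   # stale values leak into the result
with_beta_zero = gemm(a, b, stale_out, beta=0.0)  # result depends only on a and b
print(with_beta_one, with_beta_zero)
```

With beta = 1 the stale contents of the output buffer are accumulated into the product, which matches the incorrect results described above.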

Thanks to @ngimel for figuring out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
31d5c67539 [inductor][triton] support static cuda launcher after triton # 7866 (#162309)
Fixes static cuda launcher after https://github.com/triton-lang/triton/pull/7866.

Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains.
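
A rough sketch of an emptiness check that accepts both conventions is shown below. The `HookChain` class here is a hypothetical stand-in with an assumed `calls` attribute, not Triton's real class, and `hook_is_empty` is an illustrative name rather than the function this PR changes.

```python
class HookChain:
    """Hypothetical stand-in for a hook container; not Triton's real class."""

    def __init__(self, calls=None):
        self.calls = list(calls or [])


def hook_is_empty(hook):
    """Treat both None (old convention) and an empty chain (new convention) as empty."""
    if hook is None:
        return True
    if isinstance(hook, HookChain):
        return len(hook.calls) == 0
    # Any other object is assumed to be a real, user-provided hook.
    return False


print(hook_is_empty(None), hook_is_empty(HookChain()), hook_is_empty(HookChain([print])))
```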

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162309
Approved by: https://github.com/aakhundov
ghstack dependencies: #162244
2025-09-08 07:57:48 +00:00
fb0afa853e [inductor][triton] more JITCallable._hash_lock support (#162244)
Follow-up to #161768.

Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished.

#161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()`, which is not pickleable.

It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case - the MultiKernel case uses the hash lock.

To do this, we add `restore_after_unpickle()` which will restore fields (or if the old fields are not provided, initialize just the hash_lock)
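
The strip-then-restore pattern can be sketched with the standard library alone. The function names mirror the ones mentioned above, but the signatures and the `SimpleNamespace` stand-in for a kernel function are assumptions made for illustration, not the inductor code.

```python
import pickle
import threading
from types import SimpleNamespace


def prepare_for_pickle(fn):
    """Strip the un-pickleable lock and return it so the caller can restore it."""
    old_lock = fn._hash_lock
    fn._hash_lock = None
    return old_lock


def restore_after_unpickle(fn, old_lock=None):
    """Put the old lock back, or create a fresh one if none was provided."""
    fn._hash_lock = old_lock if old_lock is not None else threading.RLock()


# Hypothetical stand-in for a JITCallable-like object that owns an RLock.
fn = SimpleNamespace(name="kernel", _hash_lock=threading.RLock())

saved = prepare_for_pickle(fn)
payload = pickle.dumps(fn)         # succeeds because the lock was stripped
restore_after_unpickle(fn, saved)  # sender side: original lock restored

clone = pickle.loads(payload)
restore_after_unpickle(clone)      # receiver side: no old fields, so a new lock is made
with clone._hash_lock:
    print(clone.name)
```

The receiving process never sees the original lock, so creating a fresh one there is the only option. That corresponds to the "initialize just the hash_lock" branch described above.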

Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162244
Approved by: https://github.com/atalman
2025-09-08 07:57:48 +00:00
1e0656f063 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit de893e96c775023aa3be895060848fac3296772c.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
29e09a6545 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
c9ac8c25ef [audio hash update] update the pinned audio hash (#162315)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315
Approved by: https://github.com/pytorchbot
2025-09-08 04:17:23 +00:00
103f725afa [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/ydwu4
2025-09-08 03:21:17 +00:00
5babb4d5c0 Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-07 23:37:31 +00:00
eb9073a6b7 [easy] [precompile] Convert CompileArtifacts to callable (#162169)
The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module.
Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162169
Approved by: https://github.com/zhxchen17
2025-09-07 23:37:31 +00:00
ec2e3687c7 [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
2025-09-07 21:55:29 +00:00
ff2de5d522 Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473)"
This reverts commit 040d00af048967dde7938d358d7f5988cbd18388.

Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983))
2025-09-07 21:06:38 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
e246a85b76 Revert "[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)"
This reverts commit 5c473e9f5ee0ef0fc38e6cf34a95b547f8cdc8d5.

Reverted https://github.com/pytorch/pytorch/pull/159118 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473 ([comment](https://github.com/pytorch/pytorch/pull/159118#issuecomment-3264037799))
2025-09-07 21:00:29 +00:00
df59c21768 Revert "[BE] Cleanup stale comments/copy from gemm (#162001)"
This reverts commit 6087ef41e54c2494b117ffd923faf20f515a6806.

Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to breaks internal ads signal, see [D81845017](https://www.internalfb.com/diff/D81845017) ([comment](https://github.com/pytorch/pytorch/pull/162001#issuecomment-3264034312))
2025-09-07 20:53:16 +00:00
093ab5f477 Revert "[inductor] add kernel template choice (ktc) (#161347)"
This reverts commit 9a8d454c464c0b811fc4586ff104424bccf1da0c.

Reverted https://github.com/pytorch/pytorch/pull/161347 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))
2025-09-07 20:39:39 +00:00
4348db0b92 Revert "[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)"
This reverts commit c32111149921b48bfef909293f1049e21619ed76.

Reverted https://github.com/pytorch/pytorch/pull/161348 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))
2025-09-07 20:39:39 +00:00
9ad5e8edb1 Improve typing of ONNX decorators with ParamSpec (#162332)
## Summary
This PR improves typing in ONNX-related modules by replacing TypeVar bound to Callable[..., Any] with ParamSpec to preserve parameter types and avoid type erasure in decorator functions.

## Changes
- `torch/onnx/_internal/exporter/_flags.py`: Replace TCallable TypeVar with ParamSpec
- `torch/onnx/ops/_impl.py`: Replace _T TypeVar with ParamSpec for _onnx_op decorator
- `torch/onnx/_internal/exporter/_torchlib/_torchlib_registry.py`: Replace _T TypeVar with ParamSpec

## Motivation
The previous implementation used TypeVar bound to Callable which erased parameter type information to Any. ParamSpec preserves the exact parameter types and return types, providing better type safety and IDE support.

## Testing
- Verified all changes compile and import correctly
- Created comprehensive test suite to validate ParamSpec functionality
- No linting errors introduced
- Maintains backward compatibility

Fixes #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162332
Approved by: https://github.com/Skylion007
2025-09-07 18:06:03 +00:00
7a83cf430e Revert " [while_loop][autograd] support autograd_key of while_loop (#160483)"
This reverts commit 2b8a83901c58a0858ea9e4ce00055f48e6ed164c.

Reverted https://github.com/pytorch/pytorch/pull/160483 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but some trunk tests are failing either from this PR or the previous one in the stack ([comment](https://github.com/pytorch/pytorch/pull/160483#issuecomment-3263597325))
2025-09-07 08:50:49 +00:00
ada43ed39c Revert "[inductor] pdl inductor option (disabled by default) (#160928)"
This reverts commit 9458d1ac3bd70c2af316a8ba95d2c6c9c1199c9c.

Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))
2025-09-07 07:37:37 +00:00
93fb23d6fa Build vLLM nightly wheels (#162000)
This uses the same approach as building triton wheel where we publish a nightly wheel for vLLM whenever its pinned commit is updated.  The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes on the vLLM Dockerfile used by lumen_cli

1. `pytorch/manylinux2_28-builder` is RedHat instead of Debian-based, so no apt-get
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist`
4. In vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has landed in vLLM

We need to prepare 3 wheels: vLLM, xformers, and flashinfer-python. I rename them using the same convention as PyTorch nightlies, `MAJOR.MINOR.PATCH.devYYYYMMDD`, so that vLLM nightlies will work with torch nightlies from the same date.
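
The naming convention and the per-CUDA index URL can be sketched as follows. Both helper names are hypothetical and are not part of the build scripts; they only reproduce the string formats quoted in this description.

```python
from datetime import date


def nightly_version(base: str, day: date) -> str:
    """Format MAJOR.MINOR.PATCH.devYYYYMMDD from a base version and a build date."""
    return f"{base}.dev{day:%Y%m%d}"


def nightly_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '12.9' to the matching nightly index URL."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/nightly/cu{major}{minor}"


print(nightly_version("2.9.0", date(2025, 9, 3)))
print(nightly_index_url("12.9"))
```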

### Usage

* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```

* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
  vllm==1.0.0.dev20250903 \
  xformers==0.0.33.dev20250903 \
  flashinfer_python==0.2.14.dev20250903 \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
2025-09-07 06:09:17 +00:00
104f2680e0 Revert "Add return-max-scores to flex-attention (#161667)"
This reverts commit 486b20b73cfcf32a773a4301b1b97f91c157ce76.

Reverted https://github.com/pytorch/pytorch/pull/161667 on behalf of https://github.com/huydhn due to Sorry for reverting your change but reverting https://github.com/pytorch/pytorch/pull/161730 does not seem to fix all trunk failures ([comment](https://github.com/pytorch/pytorch/pull/161667#issuecomment-3263512642))
2025-09-07 06:00:55 +00:00
eac3d6f04c Revert "[inductor] fuse for scalar shared data (#162311)"
This reverts commit 2a45837e98c63cae9d1a2e2133a727b829e549d5.

Reverted https://github.com/pytorch/pytorch/pull/162311 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is breaking lint ([comment](https://github.com/pytorch/pytorch/pull/162311#issuecomment-3263511162))
2025-09-07 05:57:43 +00:00
fea20775ad [vllm hash update] update the pinned vllm hash (#162314)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162314
Approved by: https://github.com/pytorchbot
2025-09-07 04:29:23 +00:00
2a45837e98 [inductor] fuse for scalar shared data (#162311)
LOAF (loop ordering after fusion) could previously skip these fusion opportunities and cause some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
ghstack dependencies: #162028, #162221, #162303
2025-09-07 01:48:45 +00:00
b919560c4a [nativert] AOTI lowering and packaging as NativeRT delegate (#162285)
Summary:
A demo for creating AOTI delegate for NativeRT in OSS.

- It supports full graph lowering only.
- It leverages `executorch_call_delegate` HOP but doesn't rely on `executorch`.
- The delegate graph is obtained by tracing a `LoweredBackendModule` whose forward function calls `executorch_call_delegate`.
- The main difference between `executorch_call_delegate` and `aoti_call_delegate` is that the delegate graph from `executorch_call_delegate` doesn't have weights lifted as inputs.
- original_ep and delegate_ep are treated as flat EP dictionary and there is no nested structure.
- The naming contract is enforced by `model_name` and `backend_id`

Test Plan:
CI

Rollback Plan:

Differential Revision: D81641157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162285
Approved by: https://github.com/dolpm
2025-09-07 01:29:54 +00:00
e3068cdb44 [dynamo] Use relaxed CLOSURE_MATCH guard then ID_MATCH (#162247)
I am unable to write a test that would fail here. The reason is that when we do _dynamo.disable(fn) in the compiled frame, the id of the disabled function changes, but we currently guard on the original function `fn`, whose id does not change. This PR still guards on `fn.__code__` just to be more precise.

Thanks to @thenumberouscode for pointing this out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162247
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-09-07 01:25:52 +00:00
5211f1f908 [export] Move example inputs in move_to_device_pass (#162301)
Summary:
If I have an EP that's exported on CPU and want to AOTI compile it for CUDA, I need to use `move_to_device_pass`.

But in `torch._inductor.aoti_compile_and_package()`, it directly uses the `example_inputs` attached to the EP, so we should move the example inputs as well if applicable.

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_move_device_example_inputs

Rollback Plan:

Differential Revision: D81812366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162301
Approved by: https://github.com/angelayi
2025-09-06 23:54:54 +00:00
2b8a83901c [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160467
2025-09-06 21:26:33 +00:00
48e3be3ab6 [while_loop][autograd] add hop while_loop_stack_output (#160467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160467
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-09-06 21:26:33 +00:00
5927a70934 NLLLoss: validate target is 0D when input is 1D (#161412)
Add a shape check in nll_loss_forward to error out when both input and target are 1D. Added a unit test to cover the incompatible 1D/1D case.
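
The rule in the title can be written down as a small shape check. This is a sketch on shape tuples only: `check_nll_loss_shapes` is a hypothetical helper, and the exception type and message are assumptions rather than what nll_loss_forward actually raises.

```python
def check_nll_loss_shapes(input_shape, target_shape):
    """Reject a non-0D target when the input is 1D (a single sample of class scores)."""
    if len(input_shape) == 1 and len(target_shape) != 0:
        raise ValueError(
            f"expected a 0D target for a 1D input, got target shape {tuple(target_shape)}"
        )


check_nll_loss_shapes((5,), ())  # 1D input with 0D target: accepted
try:
    check_nll_loss_shapes((5,), (5,))  # 1D input with 1D target: rejected
except ValueError as err:
    print(err)
```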

Fixes #157420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161412
Approved by: https://github.com/ngimel
2025-09-06 20:58:42 +00:00
1a588ace46 [inductor] rename deps during refreshing (#162303)
Skipping renaming causes wrong dependencies when mutations are involved.

Test:

CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap

Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
2025-09-06 20:38:28 +00:00
541aa23de5 [inductor] fix TemplateBuffer.extract_read_writes (#162221)
Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162221
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162028
2025-09-06 20:38:28 +00:00
047603d35b New export implementation with flat inp/out (#162167)
This is my first attempt at building a new export API. The main thing it addresses is correctly getting input and output relations. Subsequent diffs will add functionality for dynamic shapes, nn_module_stack, etc.

Differential Revision: [D81793205](https://our.internmc.facebook.com/intern/diff/D81793205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162167
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2025-09-06 20:03:52 +00:00
ae0edc133e [3/N] Enable 6 fsdp test on Intel GPU (#161601)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR is based on PR https://github.com/pytorch/pytorch/pull/158533 and https://github.com/pytorch/pytorch/pull/159473 and covers some test files under test/distributed/fsdp. We enable Intel GPU with the following methods while trying our best to keep the original code style:

1. Add allow_xpu=True in instantiate_device_type_tests() if needed.
2. Use "torch.accelerator.current_accelerator()" to determine the accelerator backend.
3. Enable XPU for some test paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161601
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-06 16:47:13 +00:00
b6d0a9ea90 MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209)
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
    - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
    - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
    - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
    - Bump FBGEMM third party submodule to include:
          - https://github.com/pytorch/FBGEMM/pull/4816
          - https://github.com/pytorch/FBGEMM/pull/4820
          - https://github.com/pytorch/FBGEMM/pull/4821
          - https://github.com/pytorch/FBGEMM/pull/4823

#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`

## Test plan

#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)

#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...

test/test_matmul_cuda.py .........                                                                                                                        [100%]

============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
2025-09-06 15:25:30 +00:00
eqy
5985e28912 [CUDA 13][cuDNN][Windows] Roll back cuDNN upgrade from 9.13 to 9.12 on Windows (#162322)
Forward fix for #162268

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162322
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2025-09-06 13:32:07 +00:00
9aedb3cd87 [AOTI-FX] Support registering custom FX backends (#162317)
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.

# Test plan
Added a CI test registering a custom FX backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
2025-09-06 07:32:03 +00:00
0ff8eabf13 Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670)"
This reverts commit 146371483318e17929daefd37c8e459d9d6d47bb.

Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379267 and https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379271 ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3261241229))
2025-09-06 06:18:57 +00:00
28f4ab0737 Add -Wno-ctad-maybe-unsupported compiler flag (#162223)
When running a Bazel build, we (Google) run into the following error.
In certain cases the `-Wctad-maybe-unsupported` warning is promoted to an error and breaks the build.
We therefore propose suppressing the warning so that Bazel builds go through.

This is the error message we got:
```
c10/util/IntrusiveList.h:166:12: error: 'std::reverse_iterator' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
  166 |     return std::reverse_iterator{end()};
      |            ^
c10/test/util/IntrusiveList_test.cpp:24:18: note: in instantiation of member function 'c10::IntrusiveList<(anonymous namespace)::ListItem>::rbegin' requested here
   24 |     auto it = c1.rbegin();
      |                  ^
c10/test/util/IntrusiveList_test.cpp:43:5: note: in instantiation of function template specialization '(anonymous namespace)::check_containers_equal<(anonymous namespace)::ListItem>' requested here
   43 |     check_containers_equal(l, v);
      |     ^
libcxx/include/__iterator/reverse_iterator.h:51:7: note: add a deduction guide to suppress this warning
   51 | class reverse_iterator
      |       ^
1 error generated.

```

@haifeng-jin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162223
Approved by: https://github.com/ezyang
2025-09-06 06:11:37 +00:00
c98ddaca6d Fixed comment to match logic in distributed_c10d.py (#162158)
The comment was inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162158
Approved by: https://github.com/wconstab
2025-09-06 05:37:49 +00:00
bc505977fb torch.zeros bound checks for symint (#161976)
Fixes #161490

I added a bounds check for negative symints to create a better error message.
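
A minimal sketch of the kind of check described above, written in plain Python rather than against the real SymInt code path. `check_nonnegative_sizes` and its error text are hypothetical and only illustrate validating sizes up front so the failure message names the offending dimension.

```python
from typing import Sequence


def check_nonnegative_sizes(sizes: Sequence[int]) -> None:
    """Raise a descriptive error if any requested dimension is negative.

    Hypothetical helper: it mirrors the idea of the bounds check,
    not the actual torch.zeros implementation.
    """
    for dim, size in enumerate(sizes):
        if size < 0:
            raise ValueError(
                f"Trying to create a tensor with negative dimension {size} "
                f"at index {dim}: {list(sizes)}"
            )


check_nonnegative_sizes([2, 3])  # valid sizes pass silently

try:
    check_nonnegative_sizes([2, -1])
except ValueError as err:
    message = str(err)  # names both the bad value and its position
```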

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161976
Approved by: https://github.com/ezyang
2025-09-06 05:37:42 +00:00
aac1a50a19 Add api info for torch._C._nn.pyi (#162148)
Fix part of #148404

The APIs involved are as follows:

- cross_entropy_loss
- hardsigmoid_
- hardswish
- hardswish_
- huber_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162148
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2025-09-06 05:21:40 +00:00
20b47acef8 [fx] fix qualified name for methods of torch.Tensor (#162224)
Fixes #160077, #154721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162224
Approved by: https://github.com/ezyang
2025-09-06 05:16:19 +00:00
da4db4b33d Fix DeviceMesh._flatten docstring example (#162277)
Fix the usage example in the `DeviceMesh._flatten` docstring. An alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`.

(I verified the fix using the `gloo` backend)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162277
Approved by: https://github.com/ezyang
2025-09-06 05:00:00 +00:00
a3e5466002 Revert "Resize to 0 if not going to be used (#161730)"
This reverts commit 081cab045472ce045634548cc6c14a4870641e23.

Reverted https://github.com/pytorch/pytorch/pull/161730 on behalf of https://github.com/davidberard98 due to functorch/test_aotdispatch.py::TestAOTModuleSimplified::test_flex_attn_noncontiguous_tangents [GH job link](https://github.com/pytorch/pytorch/actions/runs/17506617662/job/49731934012) [HUD commit link](081cab0454) ([comment](https://github.com/pytorch/pytorch/pull/161730#issuecomment-3260492575))
2025-09-06 04:17:08 +00:00
c0983e6cc0 [Graph Partition] interface for custom cg wrapper (#162207)
This PR adds an interface to allow users to specify custom cudagraph wrapper. User example: [vllm](https://github.com/vllm-project/vllm/pull/24281)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162207
Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg
2025-09-06 03:13:01 +00:00
b2b4add0e7 Docs on export joint with descriptors (#159006)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159006
Approved by: https://github.com/SherlockNoMad
2025-09-06 03:02:58 +00:00
20629b1619 Add contiguous subgraph transformation threshold (#162192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162192
Approved by: https://github.com/coconutruben
2025-09-06 02:48:00 +00:00
c3ceca2995 codebase structure documentation to include torchgen (#162261)
📚 The doc update

Adds a description of the torchgen folder to the codebase structure guide.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162261
Approved by: https://github.com/ezyang
2025-09-06 02:10:57 +00:00
145a3a7bda [CUDA 13][cuDNN] Bump CUDA 13 to cuDNN 9.13.0 (#162268)
Fixes some `d_qk` != `d_v` cases on Hopper that are broken by cuDNN 9.11-9.12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162268
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-09-06 01:59:03 +00:00
291cd11f2d [inductor] estimate peak memory in codegen only when buffer reuse (#162300)
As titled, this PR ensures peak memory is estimated only when buffer reuse is enabled. Without this config, some nodes' successor nodes are eliminated from memory estimation after inductor bucketing, which can cause errors.

The original codegen peak memory estimation code is from this PR: https://github.com/pytorch/pytorch/pull/159530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162300
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-09-06 01:30:38 +00:00
7f4ff79210 remove deprecated vllm test (#162306)
Fixes https://github.com/pytorch/pytorch/issues/162274

The test was removed on the vLLM side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162306
Approved by: https://github.com/malfet
2025-09-06 01:27:13 +00:00
0f45aaf441 Disable autocast when running joint graph passes (#162304)
Fixes #159469. See https://github.com/pytorch/pytorch/issues/159469#issuecomment-3221474027 for root-cause analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162304
Approved by: https://github.com/bdhirsh, https://github.com/zou3519, https://github.com/eellison
2025-09-06 00:57:58 +00:00
4f72d932fe re-land triton runtime implementation" (#162217)
Summary: original pr - https://github.com/pytorch/pytorch/pull/161798

Test Plan:
ci

Rollback Plan:

Differential Revision: D81724234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162217
Approved by: https://github.com/SherlockNoMad
2025-09-06 00:52:29 +00:00
1463714833 [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.  Fixes #161562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-06 00:04:57 +00:00
081cab0454 Resize to 0 if not going to be used (#161730)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #161730
*  #161667

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```

Vs

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((0, ), (1, ), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```
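
A back-of-the-envelope sketch of what the change saves for `buf1` in the two listings above, assuming dense float32 storage at 4 bytes per element. `buffer_bytes` is a helper written for this example only.

```python
import math

FLOAT32_BYTES = 4  # size of one torch.float32 element


def buffer_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed for a dense buffer of the given shape."""
    return math.prod(shape) * itemsize


before = buffer_bytes((2, 32, 1024))  # buf1 in the first listing
after = buffer_bytes((0,))            # buf1 in the second listing
saved_kib = (before - after) / 1024
```

For this shape the unused buffer accounts for 256 KiB; the saving scales with batch size, head count, and sequence length.
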
<img width="428" height="145" alt="Screenshot 2025-08-28 at 12 37 11 PM" src="https://github.com/user-attachments/assets/240a7bca-97e1-40c4-bf93-f075fdc1a40d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161730
Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng
ghstack dependencies: #161667
2025-09-05 23:21:46 +00:00
486b20b73c Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

# Usage (assumes query, key, value, and score_mod are already defined)
out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here: every additional return value needs a new kwarg that gates whether it is returned, so with N optional outputs we would have to support 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for future expansion. For now I made the new argument keyword-only.

Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
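
To make the request/response-struct idea concrete, here is a toy sketch in plain Python. `toy_attention` is a hypothetical stand-in that works on lists of floats instead of tensors and is not the real `flex_attention`; the two NamedTuples copy the field names from the API shown above.

```python
import math
from typing import List, NamedTuple, Optional


class AuxRequest(NamedTuple):
    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    lse: Optional[List[float]] = None
    max_scores: Optional[List[float]] = None


def toy_attention(scores, return_aux=None):
    """Hypothetical stand-in: one row of scores per query."""
    out = [sum(row) / len(row) for row in scores]  # placeholder "output"
    if return_aux is None:
        return out  # the original single-return signature is unchanged
    aux = AuxOutput(
        lse=[math.log(sum(math.exp(s) for s in row)) for row in scores]
        if return_aux.lse
        else None,
        max_scores=[max(row) for row in scores] if return_aux.max_scores else None,
    )
    return out, aux


scores = [[1.0, 3.0], [2.0, 0.0]]
out_only = toy_attention(scores)
out, aux = toy_attention(scores, return_aux=AuxRequest(max_scores=True))
```

Adding another auxiliary output then means adding one field to each struct rather than another top-level keyword argument.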

### Req Grad
I currently don't return a `max_scores` that supports backpropagating gradients. I think this might be feasible, but since max is essentially one-hot on the inputs followed by a reduction, we would either need to save another `max_location` from the forward or recompute the max score and apply the gradient only to the first occurrence when several scores are equal (need to check whether that is how the vanilla max op in torch is defined).
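
The "one-hot on the inputs" remark can be sketched in plain Python as follows. Breaking ties in favor of the first occurrence is an assumption made for this example; as noted above, it is not confirmed that this matches the vanilla max op. `max_with_grad` is a made-up name.

```python
def max_with_grad(scores):
    """Return (max value, gradient of the max w.r.t. each input).

    The gradient is one-hot at the first position that attains the maximum.
    """
    max_location = 0
    for i, s in enumerate(scores):
        # Strict comparison keeps the first occurrence on ties.
        if s > scores[max_location]:
            max_location = i
    grad = [1.0 if i == max_location else 0.0 for i in range(len(scores))]
    return scores[max_location], grad


value, grad = max_with_grad([0.5, 2.0, 2.0, -1.0])
```

A backward pass would therefore need `max_location` (saved or recomputed) to know which single score receives the incoming gradient.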

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return None or optional tensors. If returning max scores is set to false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
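
For reading the tables, the `delta` and `pct_delta` columns appear to be computed as below. The formula is inferred from the printed rows, not taken from the benchmark script, so treat it as an assumption; `tflops_delta` is a helper written for this example.

```python
def tflops_delta(base, with_max):
    """Return (delta, pct_delta) as they appear in the tables above.

    Inferred definition: delta = base - max, pct_delta = 100 * delta / max.
    """
    delta = base - with_max
    return delta, 100.0 * delta / with_max


# First row of the first table: causal, (4, 16, 2048, 16, 2048, 64)
delta, pct = tflops_delta(249.514658, 243.078974)

# First row of the negative table: noop, (4, 16, 1024, 16, 1024, 64)
neg_delta, neg_pct = tflops_delta(244.052884, 248.65129)
```

Under this reading a positive delta means the baseline reached higher TFlops than the run that also returns max scores, and a negative delta means the opposite.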

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-05 23:21:46 +00:00
4d4abec80f allow user to pass in custom partitioner function (#157580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157580
Approved by: https://github.com/bdhirsh
2025-09-05 22:49:39 +00:00
9c03d6be87 [CD][BE] Delete Python-3.9 case (#162265)
And raise error when building for an unsupported version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162265
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162297
2025-09-05 22:46:36 +00:00
8d50355d97 [CD][EZ] Update libtorch python version to 3.10 (#162297)
Not sure why it was at 3.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162297
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-09-05 22:46:36 +00:00
e0a62b266c [aot-precompile] default-filter global guards (#162090)
If the user doesn't provide their own guard filter function, we should filter global guards by default.

pytest test/dynamo/test_aot_compile.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162090
Approved by: https://github.com/zhxchen17
2025-09-05 22:44:55 +00:00
01ab325cc2 [DCP][Quantization] Fix the issue when scale vector is in a different SafeTensors file (#162214)
Summary: The current dequantization implementation assumes that the weight and scale tensors are in the same SafeTensors file. This diff adds support for the case where they are in different files.

Test Plan:
buck test fbcode//caffe2/test/distributed/checkpoint:test_quantized_hf_storage

Buck UI: https://www.internalfb.com/buck2/532bf151-bb40-41fd-b080-ff898675afe2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15199648851011082

Rollback Plan:

Differential Revision: D81718598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162214
Approved by: https://github.com/wwwjn
2025-09-05 22:43:58 +00:00
79fcd5247a symbolic cpp channels_last_contiguous (#160402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160402
Approved by: https://github.com/aorenste
2025-09-05 21:40:32 +00:00
70d36e047d Making batching rule for F.embedding DTensor-aware (#162117)
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.

This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.

In order to accomplish this I move the DTensor implicit replication flag
to C++ (most batching rules are in C++).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
2025-09-05 21:40:14 +00:00
a00cdc1e41 [CD][BE] Get rid of SETUPTOOLS and PYYAML extra pins (#162266)
As those weren't really pins to begin with, and requirements.txt
already has those
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162266
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263, #162264
2025-09-05 21:32:52 +00:00
c10195e723 [C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#156633)
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
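
The PR text does not spell out the mechanism, so the following is only a sketch of one common approach: treat each complex value as a pair of reals, reduce the reals, and reassemble. This is valid for sum-like reductions but not for operations such as max, which is why a per-operation support check is needed. All function names here are hypothetical, and the "all-reduce" is simulated in-process.

```python
from typing import List


def as_real_pairs(values: List[complex]) -> List[float]:
    """Flatten complex numbers into [re0, im0, re1, im1, ...]."""
    flat: List[float] = []
    for v in values:
        flat.extend((v.real, v.imag))
    return flat


def from_real_pairs(flat: List[float]) -> List[complex]:
    """Inverse of as_real_pairs."""
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def allreduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Toy all-reduce: element-wise sum over the buffers of every rank."""
    return [sum(column) for column in zip(*per_rank)]


# Two simulated ranks, each holding two complex values.
rank_buffers = [[1 + 2j, 3 - 1j], [0.5 + 0j, -3 + 4j]]
reduced = from_real_pairs(allreduce_sum([as_real_pairs(b) for b in rank_buffers]))
```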

Fixes #156632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
2025-09-05 21:24:36 +00:00
771f369448 [Inductor] Improve RoPE (#161420)
This PR fuses ROPE from 2 kernels into 1 kernel.

Shape:
```
q: [B, Hq, S, D]
k: [B, Hkv, S, D]
```

`Hq=32, Hkv=8, D=128` following Llama3 setting.

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420
Approved by: https://github.com/shunting314
2025-09-05 20:55:20 +00:00
92a43025e0 [cutlass backend] Add FP8 tests for multiple linears (#160782)
Adding a test that is closer to a real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.

We still have to skip the AOTI and dynamic case due to accuracy issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782
Approved by: https://github.com/mlazos
2025-09-05 20:23:25 +00:00
2fa0520a64 [BE][pytree] cleanup parameterized pytree tests (#160842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160842
Approved by: https://github.com/Skylion007
2025-09-05 20:15:29 +00:00
01edcd4df8 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-05 20:15:11 +00:00
de893e96c7 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-05 20:15:11 +00:00
6087ef41e5 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-05 19:59:51 +00:00
a3c7f77e50 [EZ][CD] Update MacOS deployment platform to 11.0 (#162264)
Fixes following warning
```
MACOSX_DEPLOYMENT_TARGET is set to a lower value (10.15) than the version on which the Python interpreter was compiled (11.0)
```
Update deployment platform in `README.MD` as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162264
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263
2025-09-05 19:58:04 +00:00
3771380f83 [ONNX] Hide draft export under a flag (#162225)
Use `TORCH_ONNX_ENABLE_DRAFT_EXPORT` to control whether draft_export should be used as a strategy in onnx export.
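
A minimal sketch of gating a strategy on this environment variable. Only the variable name comes from the PR; the helper `draft_export_enabled` and the rule that the value `"1"` turns the flag on are assumptions, and the real parsing in the exporter may differ.

```python
import os


def draft_export_enabled(environ=os.environ) -> bool:
    """Hypothetical reader for the flag; treating "1" as on is an assumption."""
    return environ.get("TORCH_ONNX_ENABLE_DRAFT_EXPORT", "0") == "1"


# Passing a mapping explicitly keeps the example independent of the shell.
enabled = draft_export_enabled({"TORCH_ONNX_ENABLE_DRAFT_EXPORT": "1"})
disabled = draft_export_enabled({})
```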

Follow up of https://github.com/pytorch/pytorch/pull/161454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162225
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
2025-09-05 19:54:50 +00:00
adae7f66aa Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit c37103234afc832dcad307e9016230810957c9d5.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
70f865ac9b Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
88d94d17e8 Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further (#161590)
This seems to be a (very, very roughly) ~8% improvement on a DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161590
Approved by: https://github.com/albanD
ghstack dependencies: #161466, #161586
2025-09-05 18:43:41 +00:00
c321111499 [inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)
# why

- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

# what

- rather than yielding the ChoiceCaller, just return the list of all
  valid ChoiceCallers
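
The shape of this change can be sketched with toy data in plain Python. `iter_choices` and `list_choices` are illustrative names, and the dictionaries stand in for whatever produces a ChoiceCaller; none of this is the inductor code itself.

```python
def iter_choices(configs):
    """Old style: lazily yield one choice per valid config."""
    for cfg in configs:
        if cfg["valid"]:
            yield cfg["name"]


def list_choices(configs):
    """New style: return every valid choice at once."""
    return [cfg["name"] for cfg in configs if cfg["valid"]]


configs = [
    {"name": "a", "valid": True},
    {"name": "b", "valid": False},
    {"name": "c", "valid": True},
]

from_generator = list(iter_choices(configs))  # call sites drained it immediately
from_list = list_choices(configs)             # same result, simpler contract
```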

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346, #161347
2025-09-05 18:02:53 +00:00
9a8d454c46 [inductor] add kernel template choice (ktc) (#161347)
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
  the scene
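
One way to picture the holding class is the sketch below. It is a toy model, not the class added by the PR: `ToyKernelTemplateChoice`, its fields, and `make_caller` are hypothetical names, and the point is only that the pieces are collected first and the caller is built later, on demand.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToyKernelTemplateChoice:
    """Hypothetical holder for the pieces needed to build a choice later."""

    template_uid: str
    input_names: List[str]
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def build(self, make_caller: Callable[..., Any]) -> Any:
        # Only here is the (possibly expensive) caller materialized.
        return make_caller(self.template_uid, self.input_names, **self.kwargs)


built = []


def make_caller(uid, inputs, **kwargs):
    """Stand-in for constructing a ChoiceCaller; records what was built."""
    built.append(uid)
    return (uid, tuple(inputs), kwargs)


pending = [
    ToyKernelTemplateChoice("triton::mm", ["a", "b"], {"BLOCK_M": 64}),
    ToyKernelTemplateChoice("aten::mm", ["a", "b"]),
]

# An override could drop or replace entries here without building anything.
kept = [c for c in pending if c.template_uid.startswith("aten::")]
callers = [c.build(make_caller) for c in kept]
```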

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346
2025-09-05 18:02:53 +00:00
e02e9edb55 [inductor] V.choice.get_mm_configs takes a stack of templates (#161346)
# why

- enables us to just gather relevant templates and get all
  choices at once
- that in turns allows us to make op wide override decisions

# what

- V.choice.get_mm_configs takes a stack of templates
- all callsites just provide a stack of size 1 right now
  but do not merge everything yet (other features pending)

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520583](https://our.internmc.facebook.com/intern/diff/D81520583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161346
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345
2025-09-05 18:02:46 +00:00
d63ad53a99 [inductor][ez] return choicecallers directly (#161345)
# why

- remove repeat patterns
- we have everything to make the choicecallers
  - templates
  - input_nodes
  - layouts
  - all the kwargs

# what

- yield a choicecaller directly from V.choices.get_mm_configs

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344
2025-09-05 18:02:38 +00:00
031d79cb51 [inductor] move max-autotune logic inside V.choices.get_mm_configs (#161344)
# why

- heuristics providers now decide whether to add choices (and which ones)
  in the max-autotune case
- enables an eventual override point to gracefully fallback to the
  standard behavior

# what

- max-autotune is determined inside V.choices.get_mm_configs
  because it's mm only right now, we can just do
  `config.max_autotune or config.max_autotune_gemm`
  a TODO indicates that this can change in the future when this
  expands to more templates

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343
2025-09-05 18:02:30 +00:00
a301dc3b60 [inductor][ez] pass template rather than template.uid (#161343)
# why

- simpler interface
- enables future of extracting more things out of the template e.g. a
  hash

# what

V.choices.get_mm_configs now takes the whole template rather than just
the template.uid

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520576](https://our.internmc.facebook.com/intern/diff/D81520576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161343
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342
2025-09-05 18:02:22 +00:00
af590cb729 [inductor][aten] treat like a template in GEMMs (#161342)
# why

- central point to analyze and override all generated choices

# what

- add a pseudo heuristic for aten that just yields a single, empty
  kwargs
- add a pseudo heuristic with the bias_addmm logic for it
- add an addmm specific heuristic that yields a single choice, but
  also expands it with alpha and beta kwargs

- replace all the aten.bind calls with V.choices.get_mm_configs
  using the now matching API for aten
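
The pseudo heuristics above can be pictured roughly as follows. This is an illustrative sketch only: the function names `aten_mm_kwargs` and `aten_addmm_kwargs` are hypothetical and do not mirror the actual inductor classes.

```python
from typing import Any, Iterator


def aten_mm_kwargs() -> Iterator[dict[str, Any]]:
    # Pseudo heuristic: exactly one choice, with no extra kwargs.
    yield {}


def aten_addmm_kwargs(alpha: float, beta: float) -> Iterator[dict[str, Any]]:
    # addmm variant: still a single choice, expanded with alpha and beta.
    for kwargs in aten_mm_kwargs():
        yield {**kwargs, "alpha": alpha, "beta": beta}
```

Because aten choices now come out of the same generator-style interface as template choices, a single call site can collect and filter all of them.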

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341
2025-09-05 18:02:10 +00:00
4902c76c65 [inductor][ez] add template/externchoice uid (#161341)
# why

- to have a central registry of templates/externkernelchoice
  to match them to heuristics etc, they need unique names
- mm is both the triton template name and the aten_mm name

# what

- add a uid() to KernelTemplate/ExternKernelChoice that returns name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"

This id is just used to find template heuristics, so it has no other
impact
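
As a rough illustration of why the prefix matters, here is a sketch with simplified stand-in classes (not the real inductor definitions) showing how a shared name such as `mm` stays unambiguous in a registry keyed by `uid()`:

```python
class KernelTemplate:
    """Simplified stand-in: the base uid is just the name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def uid(self) -> str:
        return self.name


class TritonTemplate(KernelTemplate):
    def uid(self) -> str:
        # Prefix keeps triton's "mm" distinct from aten's "mm".
        return "triton::" + self.name


class ExternKernelChoice:
    """Simplified stand-in for an extern (aten) kernel choice."""

    def __init__(self, name: str) -> None:
        self.name = name

    def uid(self) -> str:
        return "aten::" + self.name


# Hypothetical registry mapping uid -> heuristic label.
registry = {
    TritonTemplate("mm").uid(): "triton mm heuristic",
    ExternKernelChoice("mm").uid(): "aten mm heuristic",
}
```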

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075, #161340
2025-09-05 18:01:58 +00:00
9602590b15 [inductor] move scaled_mm input nodes logic (#161340)
# why

- a step towards a unified interface for all choices, where any
  adjustment to nodes (e.g. unsqueezing) happens as part of
  choice specific preprocessing, behind a common point

# what

- move the unsqueeze logic for triton nodes for scaled_mm inside
  the new hookup for adjusting the kernel inputs for template
  heuristics

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale"
```

Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075
2025-09-05 18:01:44 +00:00
2ef665ae19 [inductor][contigous mm] mild refactor (#162075)
# why

- use the new heuristics logic better to handle kwargs

# what

- move all checks into the heuristics to yield a single choice, or no
  choices if the decomposition should not be used
- fix `hip` device type, which should be `cuda`
- let heuristics handle the kwarg passing

# testing

in ci

Differential Revision: [D81706776](https://our.internmc.facebook.com/intern/diff/D81706776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162075
Approved by: https://github.com/exclamaforte, https://github.com/jansel
2025-09-05 18:01:07 +00:00
b18bb6796f Add const to stable amax (#162082)
Fixes https://github.com/pytorch/pytorch/issues/161826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162082
Approved by: https://github.com/soulitzer
2025-09-05 17:37:49 +00:00
d711f27845 Revert "[ROCm] [CK] Composable Kernel integration for inductor backend (#158747)"
This reverts commit 019fed39aa6b2dd8c69347378d53423e5efae8d4.

Reverted https://github.com/pytorch/pytorch/pull/158747 on behalf of https://github.com/jithunnair-amd due to Broke linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test: 019fed39aa/1 ... PR didn't have this job run successfully due to CI outage ([comment](https://github.com/pytorch/pytorch/pull/158747#issuecomment-3259212343))
2025-09-05 17:27:45 +00:00
261a84a176 [CD][BE] Remove unnecessary checks for XCode version (#162263)
None of them have worked for a while, PyTorch for Mac is built with
XCode-15.4
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162263
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-05 17:02:36 +00:00
98374612fc [Intel GPU] Update Intel triton commit pin to Triton 3.5.x (#161777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161777
Approved by: https://github.com/EikanWang
2025-09-05 16:55:47 +00:00
c2a3024617 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905 I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated and this would cause `float8` tests to fail. Also it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so removing that check here as well...

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-05 16:55:09 +00:00
b2c7b9ad2d [Intel GPU][FlexAttention] Enable TMA path on Intel GPU (#162138)
The existing `can_use_tma` has some conditions that are unnecessary for Intel GPUs.
We have removed these useless conditions on the Intel GPU path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162138
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2025-09-05 16:54:51 +00:00
f3cebec39e Revert "Rename propagate_tensor_meta to make private again (#161744)"
This reverts commit 734ce8eba9c69381f187359bf0fef1d71d84cd20.

Reverted https://github.com/pytorch/pytorch/pull/161744 on behalf of https://github.com/jeanschmidt due to seems to break internal tests, see D81657000 for more details ([comment](https://github.com/pytorch/pytorch/pull/161744#issuecomment-3258934519))
2025-09-05 16:20:29 +00:00
06da7c0730 [DCP][Quantization] Fix for FP8 multiplication during dequantization (#162202)
Summary:
Weight vector needs to be upcasted since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13

We will use FP32 for the scale vector multiplication and convert to the target dtype.

Upcasting helps with the following:

1.  **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2.  **Numerical stability**: `float32` provides more precision during intermediate calculations
3.  **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions

Test Plan:
UTs

Rollback Plan:

Differential Revision: D81711093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
2025-09-05 16:06:21 +00:00
2dd529df00 A basic CLAUDE.md based on bad things I see claude code doing (#162163)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162163
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-05 14:52:36 +00:00
a714437093 [ez][inductor] add a few outer dimension reduction cases for LOAF (#162028)
For the failure-to-fuse issue reported in https://github.com/pytorch/pytorch/issues/93718, LOAF can fuse the outer-dimension softmax into a single kernel, which brings a 1.87x speedup for the example shape mentioned in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162028
Approved by: https://github.com/jansel, https://github.com/eellison
2025-09-05 09:30:13 +00:00
bffc7dd1f3 [CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916)
Related to https://github.com/pytorch/pytorch/issues/159779

Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956
Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-09-05 07:47:54 +00:00
5c473e9f5e [1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We can enable Intel GPU with the following methods while trying our best to keep the original code style:

- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- skip some test cases which Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-05 05:52:15 +00:00
5da573c42c [PGO] handle PGO profile merges (#162097)
Avoid merges from extra PGO key, if same source has different rank. Unlikely to happen (needs code hash match & source variable type to change), but being safe.

Differential Revision: D81299840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097
Approved by: https://github.com/bobrenjc93
2025-09-05 04:58:15 +00:00
494878a11b [audio hash update] update the pinned audio hash (#162114)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162114
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:16 +00:00
3bbc2e3e4f [vllm hash update] update the pinned vllm hash (#162226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162226
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:08 +00:00
b67c410398 [BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409)
Summary: When running coordinate descent tuning the logging is difficult to parse if the results are parallelized at all. This includes the kernel name in each step so post-processing can unify the results, even if run in parallel.
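
To illustrate why a per-line kernel name helps, here is a small sketch of the post-processing this enables. The log line layout (`[kernel] field: old -> new (time)`) is an assumption made up for this example, not the format the PR actually emits.

```python
from collections import defaultdict


def format_step(kernel_name: str, field: str, old: int, new: int, ms: float) -> str:
    # Hypothetical log format: every step carries the kernel name.
    return f"[{kernel_name}] {field}: {old} -> {new} ({ms:.3f} ms)"


def group_by_kernel(lines: list[str]) -> dict[str, list[str]]:
    """Regroup interleaved log lines from parallel tuning runs by kernel."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        name, _, rest = line.partition("] ")
        grouped[name.lstrip("[")].append(rest)
    return dict(grouped)
```

With the name on every line, interleaved output from several workers can be separated again without relying on ordering.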

Test Plan:
NFC. Just a logging change.

Rollback Plan:

Differential Revision: D80942794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409
Approved by: https://github.com/PaulZhang12
2025-09-05 02:53:13 +00:00
be5b03dde9 Allow for using a dedicated binary for the torch subproc pool. (#162093)
Summary:
The binary that torch is running inside of can be larger than needed, and in certain
situations this can cause a loss of memory.

Test Plan:
We've manually run tests via
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0
make mc8-train-publish-cint-datafm-toy -C
minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out
```
and overriding the binary used to be the built fbpkg in /packages.

We've also kicked off manual runs at
```
fire-feid-20250903-1051-ae8c6827
```

Which do show the binary running -  https://fburl.com/scuba/procprint/e6lwv32m

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:subproc_worker_binary
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: ''

Differential Revision: D81616624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093
Approved by: https://github.com/masnesral
2025-09-05 01:43:46 +00:00
73eb4511fb [B200][NVFP4] Fix argument passing in test_blockwise_mxfp8_nvfp4_mxfp4_numerics_ (#162185)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-05 01:24:59 +00:00
29280864d9 Add new parameter for gen_pyi.py to make it more configureable. (#161772)
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.

From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to the pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it failed to resolve since tools/autograd is a package from blaze perspective. Any file under a blaze package needs a proper blaze target to be accessed.
"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD

Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
2025-09-05 00:48:15 +00:00
5c67426d68 [dynamo] Add support for const prop on .item (#162204)
Fixes some of the errors in https://fb.workplace.com/groups/1028545332188949/permalink/1303030824740397/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162204
Approved by: https://github.com/williamwen42
2025-09-05 00:28:49 +00:00
d2d4c8e9b2 [BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
Followup after https://github.com/pytorch/pytorch/pull/154012

Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999
Approved by: https://github.com/drisspg
2025-09-04 23:35:27 +00:00
c7e41071a0 [B200][MXFP8] Fix regex in test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda (#162180)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia
2025-09-04 23:29:10 +00:00
9499c8761c [Inductor][Intel GPU] Register triton template heuristic for addmm tma. (#162132)
Fixes #162048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162132
Approved by: https://github.com/jansel
2025-09-04 23:01:57 +00:00
3a207816cc Forward fix for user defined triton kernel grid calc (#162162)
Summary:

This change fixes the test: inductor:fxir_backend - test_custom_triton_autotune_dynamic which was broken by https://github.com/pytorch/pytorch/pull/160997

Test Plan:
inductor:fxir_backend - test_custom_triton_autotune_dynamic

Rollback Plan:

Differential Revision: D81679217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162162
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-04 22:51:23 +00:00
09be1890d7 [export] Fix torch.export.load with storage offset (#162172)
Summary: As titled

Test Plan:
CI

Rollback Plan:

Differential Revision: D81687701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162172
Approved by: https://github.com/angelayi
2025-09-04 22:50:33 +00:00
0d84ff3b78 [PGO] log add_extra_remote PGO to tlparse (#161751)
Summary: log when additional PGO profile is merged in, from added read key

Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D81284190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161751
Approved by: https://github.com/bobrenjc93
2025-09-04 22:47:03 +00:00
1ec2c15914 Revert "Fix Arm64 OSS pytorch build with FBGEMM (#161527)"
This reverts commit dbec08729fb9848bebed6048c63831b87170d061.

Reverted https://github.com/pytorch/pytorch/pull/161527 on behalf of https://github.com/malfet due to This breaks all Mac builds, see b04e922712/1 ([comment](https://github.com/pytorch/pytorch/pull/161527#issuecomment-3256034443))
2025-09-04 22:29:38 +00:00
b04e922712 Fix memory leak in AOTI when calling aoti_torch_as_strided (#162118)
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`

If you have something like `AtenTensorHandle buf_handle;` and you allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory because even when the program ends, there's still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```

Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`

Differential Revision: D81640339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
2025-09-04 22:17:06 +00:00
0d71a9dd5b fix incorrect interaction between DDPOptimizer and donated buffers (#160745)
This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test.

You can see more of a description of the problem in the code comments. But the TLDR is that:

* When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graph becomes N AOT/inductor artifacts
* We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per **AOT** graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i
* This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during **backward compilation**. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have started compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donating and writing into incorrect input buffers

The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward.
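
The clobbering and the shape of the fix can be shown with a toy model. Everything here is a simplified stand-in: the real `TracingContext` and `fw_metadata` objects are far richer, and the `fw_metadata_by_graph` attribute is a hypothetical name for the extra DDPOptimizer metadata.

```python
class TracingContext:
    """Toy stand-in for dynamo's tracing context."""

    def __init__(self) -> None:
        self.fw_metadata = None          # single slot: the last writer wins
        self.fw_metadata_by_graph = {}   # hypothetical per-subgraph storage


def compile_forward(ctx: TracingContext, graph_id: int) -> None:
    meta = {"graph": graph_id, "donated": [f"buf{graph_id}"]}
    ctx.fw_metadata = meta                     # clobbers the previous subgraph
    ctx.fw_metadata_by_graph[graph_id] = meta  # survives later compiles


ctx = TracingContext()
for graph_id in range(3):  # one dynamo graph split into 3 AOT subgraphs
    compile_forward(ctx, graph_id)

# Backwards compile lazily, last subgraph first.
stale = [ctx.fw_metadata["graph"] for _ in reversed(range(3))]
fixed = [ctx.fw_metadata_by_graph[g]["graph"] for g in reversed(range(3))]
```

In this model `stale` sees the last subgraph's metadata for every backward, while `fixed` looks up the metadata that belongs to the subgraph being compiled.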

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-04 21:57:27 +00:00
89d41d3f61 [SymmMem] Feed tensor.data_ptr instead of handle.buffer_ptr into kernels (#162193)
After MemPool support, `get_buffer_ptrs` points to base address of allocation segment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162193
Approved by: https://github.com/ngimel
2025-09-04 21:26:05 +00:00
9bdcee01f8 [SymmMem] Add root argument to broadcast op (#161090)
It was missing earlier. Also added range check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090
Approved by: https://github.com/fegin
2025-09-04 21:09:54 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
d5b38410b5 Revert "[SymmMem] Add root argument to broadcast op (#161090)"
This reverts commit 3c0ff1b569c45cfa6935ad8031a9d4cf1551aa3f.

Reverted https://github.com/pytorch/pytorch/pull/161090 on behalf of https://github.com/jeanschmidt due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/161090#issuecomment-3255574093))
2025-09-04 20:42:31 +00:00
48bedd753d Revert "Fix usage of forwarding references (#161094)"
This reverts commit 1ebd70d0c0d562d3be9abdee2a21906584af7d99.

Reverted https://github.com/pytorch/pytorch/pull/161094 on behalf of https://github.com/jeanschmidt due to checking if revert will fix https://github.com/pytorch/pytorch/actions/runs/17470601839/job/49621447581 ([comment](https://github.com/pytorch/pytorch/pull/161094#issuecomment-3255541480))
2025-09-04 20:35:41 +00:00
a3d72b09ae Apply Triton tensor descriptor for flex-decoding for performance (#161643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161643
Approved by: https://github.com/drisspg
2025-09-04 20:10:41 +00:00
ef3be6726f Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-04 20:05:50 +00:00
95ee0bfea9 Revert "[nativert] triton runtime implementation (#161798)"
This reverts commit 3dde5d7f9bf80dd6623a712bc429e9e4302464b5.

Reverted https://github.com/pytorch/pytorch/pull/161798 on behalf of https://github.com/jeanschmidt due to introducing linting failures ([comment](https://github.com/pytorch/pytorch/pull/161798#issuecomment-3255412085))
2025-09-04 20:05:24 +00:00
dbec08729f Fix Arm64 OSS pytorch build with FBGEMM (#161527)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4775

Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error.
Undefined symbols for architecture arm64:
  "fbgemm::FindMinMax(float const*, float*, float*, long long)", referenced from:
      at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o
      at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o
      PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
      PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
      at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o
      at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o
ld: symbol(s) not found for architecture arm64

This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc.

Test Plan:
With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled.

Rollback Plan:

Reviewed By: q10

Differential Revision: D81052327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527
Approved by: https://github.com/q10
2025-09-04 20:01:13 +00:00
c3d54dea9f Revert "[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)"
This reverts commit 02c83f13348631d80aa23f57aaff6b7d1223bbdd.

Reverted https://github.com/pytorch/pytorch/pull/161999 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))
2025-09-04 19:56:48 +00:00
afa6e5604d Revert "[BE] Cleanup stale comments/copy from gemm (#162001)"
This reverts commit b40d9432be44a6b5974ee62e7d19c3c61c5ece37.

Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))
2025-09-04 19:56:48 +00:00
9e5247f51d Revert "[MPS] enable cat op for sparse (#162007)"
This reverts commit 2c03f0acc53ed13fe8ebfe809129f25996e009a0.

Reverted https://github.com/pytorch/pytorch/pull/162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](https://github.com/pytorch/pytorch/pull/162007#issuecomment-3255357336))
2025-09-04 19:49:44 +00:00
c37103234a Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-04 19:43:17 +00:00
3dde5d7f9b [nativert] triton runtime implementation (#161798)
Summary:
att
Test Plan:
ci
Rollback Plan:

Reviewed By: minjang

Differential Revision: D80828148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161798
Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
2025-09-04 19:00:15 +00:00
1f51056bd6 [BE]: Update cpp-httplib submodule to 0.26.0 (#162181)
Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162181
Approved by: https://github.com/jansel
2025-09-04 18:59:32 +00:00
6b1900c22f [dynamo][hops] Remove const outputs from the speculated subgraph (#161355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161355
Approved by: https://github.com/zou3519
2025-09-04 18:52:01 +00:00
9480cdc0b6 Modified the docs to add example for torch.is_floating_point and torc… (#161951)
…h.is_complex.

The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.

Fixes #161859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951
Approved by: https://github.com/zou3519
2025-09-04 18:50:19 +00:00
6f7608d603 [cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073)
for 2.9
🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162073
Approved by: https://github.com/drisspg
2025-09-04 18:46:28 +00:00
d1a15abfdc export: add explicit decomposition for aten.expand_copy and unit test (#161688)
Fixes #161080
torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) registers aten.expand as the decomposition path for aten.expand_copy, which doesn’t accept the implicit argument.

I have added an explicit decomposition for aten.expand_copy in torch/_decomp/decompositions.py to ignore the implicit argument, and a simple unit test to demonstrate the bug being fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
2025-09-04 18:16:56 +00:00
33028597bf [dynamo] Make the MRO walk more narrow (#162105)
I don't have a failing test case, but I saw an extra guard somewhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162105
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-09-04 17:54:33 +00:00
9eadb37cdd enable float32 and float16 in torch._grouped_mm fallback (#162059)
Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
2025-09-04 17:48:52 +00:00
61fb632cfb move _grouped_mm fallback to composite explicit autograd (#161717)
Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
2025-09-04 17:48:52 +00:00
8a736fa1ea create torch._grouped_mm fallback path with for loops / bmm (#161407)
Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu
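
For intuition, the naive for-loop fallback for the 2D x 3D case can be sketched in plain Python on nested lists. This is not the ATen implementation; it assumes `offs` holds cumulative end-row offsets for each group, which is an assumption about the op's convention rather than something stated in this message.

```python
def matmul(a, b):
    # Plain nested-list matrix multiply: (m x k) @ (k x n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_mm_2d_3d(mat_a, mat_b, offs):
    """Loop over groups: rows mat_a[start:end] are multiplied by mat_b[group]."""
    out, start = [], 0
    for group, end in enumerate(offs):
        out.extend(matmul(mat_a[start:end], mat_b[group]))
        start = end
    return out
```

Usage: with `mat_a` of shape 3 x 2, two 2 x 2 matrices in `mat_b`, and `offs = [1, 3]`, the first row uses group 0 and the remaining two rows use group 1.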

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-09-04 17:48:44 +00:00
8bb213b6d5 [SymmMem] Increase signal pad size for NVL72 (#162026)
so that the signal calls do not step on each other's toes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162026
Approved by: https://github.com/ngimel
2025-09-04 17:41:38 +00:00
869cbcc16e [SymmMem] Add a helper API to distinguish intra- and inter- node (#161984)
Added a helper API to tell if the world is entirely within a P2P domain or crosses the network.
This is mainly for nblocks tuning purposes (in later PRs).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161984
Approved by: https://github.com/ngimel
ghstack dependencies: #161983
2025-09-04 17:37:59 +00:00
0c0e056a9e [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.
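
The per-stream check described above can be sketched in plain Python. This is a toy model, not the allocator's C++ code: `CaptureGraph`, `is_predecessor`, and `reusable_on_stream` are hypothetical names, and the graph is a hand-built dependency map rather than one obtained from `cudaStreamGetCaptureInfo`.

```python
from collections import defaultdict


class CaptureGraph:
    """Toy stand-in for the capturing graph: nodes plus dependency edges."""

    def __init__(self):
        self.parents = defaultdict(set)  # node -> its direct dependencies

    def add_node(self, node, deps=()):
        self.parents[node].update(deps)
        return node

    def is_predecessor(self, a, b):
        """True if `a` is `b` itself or is reachable from `b` via dependency edges.

        A marker that *is* the terminal counts as ordered, because any new op
        on the stream attaches after the terminal.
        """
        stack, seen = [b], set()
        while stack:
            cur = stack.pop()
            if cur == a:
                return True
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(self.parents[cur])
        return False


def reusable_on_stream(graph, free_markers, terminal_set):
    """Per-stream rule: every free marker precedes every node in the terminal set."""
    return all(
        graph.is_predecessor(marker, terminal)
        for marker in free_markers
        for terminal in terminal_set
    )


g = CaptureGraph()
g.add_node("use_s1")
g.add_node("use_s2")
g.add_node("free_s1", ["use_s1"])  # free marker on stream 1
g.add_node("free_s2", ["use_s2"])  # free marker on stream 2
markers = ["free_s1", "free_s2"]

# Stream 1's terminal is its own marker; stream 2's marker is unordered.
unordered = reusable_on_stream(g, markers, ["free_s1"])

# Stream 1 then waits on stream 2, so its new terminal depends on both markers.
g.add_node("join_s1", ["free_s1", "free_s2"])
after_join = reusable_on_stream(g, markers, ["join_s1"])

print(unordered, after_join)  # False True
```

The first query corresponds to the unordered situation (the block stays deferred); the second corresponds to stream 1's terminal advancing past both frees.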

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. We must therefore wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-04 17:21:26 +00:00
f36f285953 [dynamo] change error_on_graph_break/fullgraph semantics (#161747)
This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile` setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time
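
The precedence between the two settings can be summarized as a small decision function. This is only an illustration of the three modes listed above; `graph_break_mode` is a hypothetical helper and not part of `torch.compile` or Dynamo.

```python
def graph_break_mode(fullgraph: bool, error_on_graph_break: bool) -> str:
    """Hypothetical helper: resolve what happens when a graph break is hit."""
    if fullgraph:
        # fullgraph has priority; error_on_graph_break does nothing here.
        return "error: single graph enforced"
    if error_on_graph_break:
        # Errors on graph breaks, but does not guarantee a single graph.
        return "error: toggleable"
    return "resume tracing"


for fullgraph in (True, False):
    for error_on_graph_break in (True, False):
        mode = graph_break_mode(fullgraph, error_on_graph_break)
        print(f"{fullgraph=}, {error_on_graph_break=} -> {mode}")
```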

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747
Approved by: https://github.com/mlazos
ghstack dependencies: #161739
2025-09-04 17:10:17 +00:00
ba7f546ccc Update torch-xpu-ops commit pin (#162062)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](83c5a5a551), which includes:

- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062
Approved by: https://github.com/EikanWang
2025-09-04 17:05:33 +00:00
43b7c86a2c Add dependency-groups.dev to pyproject.toml (#161216)
[PEP 735](https://peps.python.org/pep-0735) introduces the
[dependency-groups] table for a number of use-cases one of
which includes specifying development dependencies for projects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161216
Approved by: https://github.com/seemethere
2025-09-04 16:51:36 +00:00
019fed39aa [ROCm] [CK] Composable Kernel integration for inductor backend (#158747)
This is part of our effort to integrate the Composable Kernel library into the Inductor backend. Currently we have a submodule, but would prefer to have commit pin control over the library, as with Triton. We intentionally avoid putting all installation logic in CI scripts so that locally built versions can have this functionality.

The idea is to have CK as a pytorch dependency in pytorch 2.9 release to allow people to use it with inductor and AOT inductor and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files.

This PR is a remake of https://github.com/pytorch/pytorch/pull/156192, recreated due to a branch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158747
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-04 16:51:06 +00:00
81aeefa657 Add torch.compile support for triton.constexpr_function (#162106)
Fixes #161868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162106
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-09-04 16:46:55 +00:00
248355faf5 Don't require FakeStore to be passed into fake backend (#162164)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162164
Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab
2025-09-04 16:43:49 +00:00
1ebd70d0c0 Fix usage of forwarding references (#161094)
I found a number of places that seem to want forwarding
references but the type signature does not reflect that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161094
Approved by: https://github.com/malfet
2025-09-04 16:34:39 +00:00
cc5bdd1240 Keep default CMAKE_PREFIX_PATH in test_aot_inductor_package (#161907)
`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites it with a single path, so dependencies such as protobuf or Abseil are not found.

Instead prepend the path to the existing value.
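
A minimal sketch of the "prepend instead of overwrite" idea follows. The helper name `prepend_prefix_path` and the example paths are made up for illustration; the separator is a parameter because the list separator depends on how the value is passed (the platform path separator for the environment variable, `;` for a CMake list).

```python
import os
from typing import Optional


def prepend_prefix_path(new_path: str, existing: Optional[str], sep: str = os.pathsep) -> str:
    """Put `new_path` first while keeping every previously configured entry."""
    if not existing:
        return new_path
    return sep.join([new_path, existing])


# Hypothetical paths, joined with ":" as on a UNIX-like system.
merged = prepend_prefix_path("/tmp/torch/share/cmake", "/opt/abseil:/opt/protobuf", sep=":")
print(merged)  # /tmp/torch/share/cmake:/opt/abseil:/opt/protobuf
```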

This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
>    self.assertTrue(so_path.exists())
> AssertionError: False is not true

Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161907
Approved by: https://github.com/Skylion007
2025-09-04 16:27:57 +00:00
3a20a20e70 Fix largeTensorTest malfunction on XPU (#161988)
# Motivation
https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161988
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-09-04 16:10:03 +00:00
6b8b3ac440 Revert "[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)"
This reverts commit cd529b686d54bbaa443f5b310140de48422d96c7.

Reverted https://github.com/pytorch/pytorch/pull/162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](https://github.com/pytorch/pytorch/pull/162044#issuecomment-3254427869))
2025-09-04 16:06:30 +00:00
601ae8e483 [CUDAGraph] add config to error on skipping cudagraph (#161862)
Many users want a config that forces all CUDA ops to be captured by cudagraph. When that is not possible, pt2 should error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: `False`). It also adds an environment variable, `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR`, to control it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-09-04 15:52:39 +00:00
b7dad7dd49 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit 90b08643c3a6eb1f3265b7d1388bd76660759f46.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358))
2025-09-04 15:25:07 +00:00
e532c9d4f1 Relax tolerance for test_quick_baddbmm_cpu_complex64 (#152424)
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.

I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
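
To see why the reported numbers trip the check, here is a small sketch assuming the usual closeness criterion `|actual - expected| <= atol + rtol * |expected|`. Because both maxima are reported at the same index, the magnitude of the expected value can be recovered from them. The relaxed `atol` below is a made-up value for illustration, not the one chosen in this PR.

```python
def within_tolerance(abs_diff: float, expected_mag: float, atol: float, rtol: float) -> bool:
    """Closeness test of the form |actual - expected| <= atol + rtol * |expected|."""
    return abs_diff <= atol + rtol * expected_mag


abs_diff = 3.43852152582258e-05     # greatest absolute difference from the log
rel_diff = 3.6034286949870875e-06   # greatest relative difference, same index
expected_mag = abs_diff / rel_diff  # |expected| at index (1, 2, 1), roughly 9.54

default_ok = within_tolerance(abs_diff, expected_mag, atol=1e-05, rtol=1.3e-06)
relaxed_ok = within_tolerance(abs_diff, expected_mag, atol=1e-04, rtol=1.3e-06)  # hypothetical atol
print(default_ok, relaxed_ok)  # False True
```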
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424
Approved by: https://github.com/malfet
2025-09-04 13:26:42 +00:00
34aa78274d Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))
2025-09-04 13:13:52 +00:00
040d00af04 [2/N]Port several test files under test/distributed to Intel GPU (#159473)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR covers some test files under test/distributed. We enable Intel GPU with the following methods while trying to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backend, for example:
  Added xpu for Backend.backend_capability and dist.Backend.register_backend()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-04 12:53:17 +00:00
9c957723a0 Replace setup.py develop with pip install -e (#156710)
#156027 already replaced most use of `python setup.py develop`. This PR only adds a few more occurrences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156710
Approved by: https://github.com/atalman
2025-09-04 11:07:44 +00:00
acece97c3a [Intel GPU] Upgrade OneDNN XPU Tag to v3.9.1 (#161932)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161932
Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey
2025-09-04 11:05:10 +00:00
ea1883dfd3 Fixes #154982: add missing to_result_dtype in vector_norm (#155111)
Fixes #154982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155111
Approved by: https://github.com/isuruf, https://github.com/eellison
2025-09-04 10:49:08 +00:00
d67c29ad22 [inductor] Fix int64 from MutationOutput Buffer (#162020)
Summary:
When we have a user defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.

When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it.

To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput.
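
The idea can be sketched with toy data structures. Everything here is hypothetical (`Buf`, `size_hint`, `index_dtype`, and the exact int32 threshold); it only illustrates resolving a size-less mutation output through the buffers named in `mutation_names`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

INT32_MAX = 2**31 - 1  # assumed threshold for 32-bit indexing


@dataclass
class Buf:
    name: str
    numel: Optional[int] = None  # None models a layout that carries no size
    mutation_names: List[str] = field(default_factory=list)


def size_hint(buf: Buf, by_name: Dict[str, Buf]) -> int:
    """Element count of `buf`, falling back to the buffers it mutates."""
    if buf.numel is not None:
        return buf.numel
    return max((size_hint(by_name[n], by_name) for n in buf.mutation_names), default=0)


def index_dtype(buffers: List[Buf], by_name: Dict[str, Buf]) -> str:
    """Pick int64 if any involved buffer is too large for 32-bit indexing."""
    largest = max((size_hint(b, by_name) for b in buffers), default=0)
    return "int64" if largest > INT32_MAX else "int32"


big = Buf("arg0_1", numel=2**31 + 8)
mutated = Buf("buf3", mutation_names=["arg0_1"])  # stands in for a MutationOutput
small = Buf("arg1_1", numel=1024)
by_name = {b.name: b for b in (big, mutated, small)}

dtype = index_dtype([small, mutated], by_name)
print(dtype)  # int64
```

Skipping `mutated` in this example would leave only the 1024-element buffer and select int32, which is the failure mode described above.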

Test Plan:
```
buck run mode/opt  fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```

Differential Revision: D81530083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-09-04 09:47:57 +00:00
09587daf8c Adding missing example of torch.full_like Issue#161899 (#162051)
Fixes #161899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162051
Approved by: https://github.com/zou3519
2025-09-04 08:45:49 +00:00
c024b1f5a1 [AMD] [Reland] Fix AMD User Defined Kernel Autotune (#161521)
Summary: This is a reland of D80285441, fixed the unit test.

Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR

```
will succeed after this diff.

Rollback Plan:

Differential Revision: D80971224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
2025-09-04 08:41:18 +00:00
8fd3c9ce91 Optimize AMP custom_backend_name error message (#162037)
Print the AMP target dtype in the warning so that custom backends can more easily find the expected dtype during integration.

## Test Result

### Before
```python
In [1]: import torch
   ...: import torch_openreg
   ...:
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(4, 2)
   ...: with torch.autocast("openreg", dtype=torch.float16):
   ...:     torch.mm(a, b)
   ...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype is not supported. Disabling autocast.
 openreg Autocast only supports dtypes of torch.float32 currently.
  warnings.warn(error_message
```

### After
```python
In [1]: import torch
   ...: import torch_openreg
   ...:
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(4, 2)
   ...: with torch.autocast("openreg", dtype=torch.float16):
   ...:     torch.mm(a, b)
   ...:

/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype torch.float16 is not supported. Disabling autocast.
 openreg Autocast only supports dtypes of torch.float32 currently.
  warnings.warn(error_message)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162037
Approved by: https://github.com/zou3519
2025-09-04 08:27:56 +00:00
e19e02c84c port distributed tensor test files for Intel GPU (#161604)
In this PR, we port the test/distributed/tensor test files to Intel GPU.
We enable Intel GPU with the following methods while trying to keep the original code style:

- Use torch.accelerator for general GPU
- Skip cases with known issues when running on XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161604
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-04 07:49:25 +00:00
69a25f6888 [ROCm] Enable USE_FBGEMM_GENAI (#160676)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable the support for the new FBGEMM backed FP8 _scaled_grouped_mm on ROCm. For now we only enable support for `gfx942` as that is what we have thoroughly tested performance and correctness on.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](9491d289b3/.ci/docker/libtorch/build.sh (L48))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160676
Approved by: https://github.com/drisspg
2025-09-04 07:13:17 +00:00
890626632d [DLPACK] Optimize toDLPack Conversion Speed (#162111)
Previously in gh-83069, the toDLPack converter introduces a normalization step that changes the strides to 1 when shape[i] == 1

This step, however, calls as_strided during toDLPack and can slow toDLPack down by about 3x, raising the overhead of PyTorch's DLPack conversion from under 0.2 us to around 0.6 us per call.

This PR updates the logic by adding a need_normalize_strides check that first confirms whether stride normalization is necessary. In most common cases, when the tensor is contiguous, such normalization is not necessary.

We confirmed that this additional step brings toDLPack back below 0.2 us and can significantly speed up eager mode integration of DLPack with PyTorch.

If we detect that normalization is needed, the older path is invoked.
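
A rough Python sketch of the check-then-normalize flow, assuming the rule stated above (strides become 1 wherever `shape[i] == 1`). Only the name `need_normalize_strides` comes from the PR; the other helpers are hypothetical, and the real code is C++ operating on tensors rather than tuples.

```python
from typing import List, Sequence


def need_normalize_strides(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True if some size-1 dimension carries a stride other than 1."""
    return any(size == 1 and stride != 1 for size, stride in zip(shape, strides))


def normalize_strides(shape: Sequence[int], strides: Sequence[int]) -> List[int]:
    """Slow path: rewrite the stride of every size-1 dimension to 1."""
    return [1 if size == 1 else stride for size, stride in zip(shape, strides)]


def export_strides(shape: Sequence[int], strides: Sequence[int]) -> List[int]:
    """Cheap check first; only fall back to normalization when required."""
    if not need_normalize_strides(shape, strides):
        return list(strides)  # fast path, nothing to rewrite
    return normalize_strides(shape, strides)


fast = export_strides((2, 3), (3, 1))
slow = export_strides((4, 1, 3), (3, 3, 1))
print(fast, slow)  # [3, 1] [3, 1, 1]
```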

Fixes #162113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162111
Approved by: https://github.com/msaroufim
2025-09-04 05:27:05 +00:00
480c739112 Capture TypeError in CONTAINS_OP (#161069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161069
Approved by: https://github.com/anijain2305
2025-09-04 04:49:09 +00:00
66f3b4a682 Contiguous subgraph decomposition (#161241)
## Summary

Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`. It also serves as an alternative on AMD (transposed inputs only) to `split-k`, which does not currently support AMD.

## Background

On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
  ))
```
is a lot slower than:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
  ))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
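
For reference, the difference between the two layouts above is whether `stride` matches the row-major strides implied by `size`. A small sketch, using hypothetical helper names and ignoring the special handling of size-1 dimensions:

```python
from typing import List, Sequence


def contiguous_strides(size: Sequence[int]) -> List[int]:
    """Row-major strides for a dense tensor of the given size."""
    strides, running = [], 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def is_contiguous(size: Sequence[int], stride: Sequence[int]) -> bool:
    return list(stride) == contiguous_strides(size)


size_b = [178176, 6144]
slow_layout = is_contiguous(size_b, [1, 178176])  # transposed B from the first example
fast_layout = is_contiguous(size_b, [6144, 1])    # contiguous B from the second example
print(slow_layout, fast_layout)  # False True
```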

## Data

I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
  addmm_16448x512x2048: +0.14%
  addmm_128x2048x2048: +0.01%
  addmm_128x768x1000: +0.75%
  addmm_12672x3072x768: +1.08%
  addmm_512x768x32000: +0.62%
  addmm_12608x384x384: +0.00%
  addmm_4160x1024x4096: +0.90%
  addmm_16x768x2: +0.56%
  addmm_12608x3072x768: +0.09%
  addmm_64x4096x1000: +2.77%
  addmm_256x1024x512: +1.99%
  addmm_30x256x256: +1.12%
  addmm_100480x128x384: +0.91%
  addmm_6400x2048x512: +0.25%
  addmm_61568x1024x256: +0.08%
  addmm_1x768x768: +0.93%
  addmm_12544x384x384: +0.19%
  addmm_128x512x1000: +0.77%
  addmm_2048x128x128: +1.32%
  addmm_128x3072x1000: +0.24%
  addmm_7936x512x2048: +0.07%
  addmm_8192x512x2048: +0.33%
  addmm_64x1024x1000: +1.43%
  addmm_128x2304x1000: +0.01%
  addmm_32768x256x2: +0.75%
  addmm_64x384x1152: +0.79%
  addmm_64x640x1000: +0.01%
  addmm_100480x128x128: +0.87%
  addmm_1152x3072x768: +1.13%
  addmm_8192x256x2048: +1.40%
  addmm_4096x128x768: +0.01%
  addmm_128x2560x1000: +0.01%
  addmm_12544x2048x512: +0.43%
  addmm_200704x24x96: +0.14%
  addmm_8448x512x2048: +0.96%
  addmm_50176x256x1024: +0.62%
  addmm_4160x4096x1024: +0.22%
  addmm_4096x768x768: +0.32%
  addmm_220x2048x512: +0.56%
  addmm_8x2048x1000: +1.12%
  addmm_256x197951x512: +26.99%
  addmm_401536x64x192: +0.60%
  addmm_2040x2048x512: +0.47%
  addmm_512x1024x256: +1.32%
  addmm_128x4096x1000: +1.67%
  addmm_12672x768x768: +0.34%
  addmm_128x368x1000: +0.77%
  addmm_96x1280x1000: +0.01%
  addmm_12544x512x2048: +0.41%
  addmm_6272x320x1280: +0.76%
  addmm_12544x3072x768: +0.09%
  addmm_64x384x1000: +0.39%
mm improvements when best:
  mm_200704x128x512: +1.29%
  mm_663552x16x16: +0.80%
  mm_4096x768x768: +0.51%
  mm_131072x64x31: +0.24%
  mm_12544x1152x384: +0.11%
  mm_128x2048x2: +0.46%
  mm_262144x16x23: +0.62%
  mm_50176x576x192: +0.37%
  mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%

Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)

Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%

```

## Results

The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
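
The +26.99% reported for this shape can be reproduced from the two fastest rows of the non-transposed autotune, assuming "improvement" means the relative gap between the subgraph choice and the next-best choice. That definition is inferred from the numbers rather than stated in the PR, and `improvement_pct` is a hypothetical helper.

```python
def improvement_pct(subgraph_time: float, best_other_time: float) -> float:
    """Relative time saved by the subgraph choice versus the next-best choice."""
    return (best_other_time - subgraph_time) / best_other_time * 100.0


# Timings copied from the Subgraph and Extern rows for 256x197951x512.
pct = improvement_pct(0.5024129748344421, 0.6881489753723145)
print(f"{pct:+.2f}%")  # +26.99%
```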

It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
  addmm_128x16384x128: +0.18%
  addmm_128x262144x256: +38.24%
  addmm_128x200000x512: +14.76%
  addmm_256x800000x128: +0.06%
  addmm_131072x128x256: +0.27%
  addmm_128x256x131072: +0.25%
  addmm_2048x200000x64: +12.45%
mm improvements when best:
  mm_128x16384x128: +0.18%
  mm_128x262144x256: +38.05%
  mm_128x200000x512: +9.47%
  mm_256x800000x128: +0.99%
  mm_512x6400000x256: +3.17%
  mm_524288x64x64: +0.29%
  mm_2048x200000x64: +11.19%
  mm_8192x1000000x256: +34.14%
  mm_128x4096x100000: +0.40%
  mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%

Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%

```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and gives a very large improvement on shapes with low `M`, low `N`, and high `K`.

Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866

## Test Plan:
New unit tests.

Differential Revision: D80771648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
2025-09-04 04:43:58 +00:00
302df2ac5d [vllm hash update] update the pinned vllm hash (#162115)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162115
Approved by: https://github.com/pytorchbot
2025-09-04 04:26:34 +00:00
dec72ea4b0 [reland] Add inductor provenance mapping for cpp extern kernel (#161656) (#162069)
Summary:

Add inductor provenance mapping for cpp extern kernel

Test Plan:
```
buck run fbcode//caffe2/test/inductor:provenance_tracing --  -r test_cpu_extern_kernel
```

Differential Revision: D81598857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162069
Approved by: https://github.com/angelayi
2025-09-04 04:18:43 +00:00
8975cda252 [pt] strip error messages in profile builds (#162076)
Summary: Profile builds should match production builds, and error messages result in large static initializers running. Omit them for profile builds too.

Test Plan:
Before:
```
$ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output
$ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a | grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9
0000000000003234 T __ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9_
```

After:
```
$ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output
$ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a | grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9
```

Rollback Plan:

Reviewed By: yury-dymov, abashyam

Differential Revision: D81599582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162076
Approved by: https://github.com/swolchok
2025-09-04 04:18:27 +00:00
d636c181f9 Fix range.__getitem__() (#161804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161804
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801, #161802, #161803
2025-09-04 02:33:03 +00:00
c8255c67cd redirect iter(range) to range.__iter__() (#161803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161803
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801, #161802
2025-09-04 02:33:03 +00:00
485a7bd82e Add range_count and range.__contains__ (#161802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161802
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801
2025-09-04 02:33:03 +00:00
1ef7efa592 Add range_equals (#161801)
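As background, Python compares `range` objects as sequences, so two ranges with different `stop` values can still be equal. The sketch below spells out that rule and checks it against the builtin; `range_equals_reference` is a hypothetical name and is not the function added by this commit.

```python
def range_equals_reference(a: range, b: range) -> bool:
    """Sequence-style equality for ranges, mirroring the builtin `==`."""
    if len(a) != len(b):
        return False
    if len(a) == 0:
        return True  # all empty ranges are equal
    if a.start != b.start:
        return False
    # With one element the step is irrelevant; otherwise steps must match.
    return len(a) == 1 or a.step == b.step


pairs = [
    (range(0), range(2, 1)),           # both empty
    (range(0, 3, 2), range(0, 4, 2)),  # both yield 0, 2
    (range(0, 1, 5), range(0, 1, 7)),  # both yield only 0
    (range(0, 3, 2), range(0, 5, 2)),  # second also yields 4
]
results = [range_equals_reference(a, b) for a, b in pairs]
print(results)  # [True, True, True, False]
```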
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161801
Approved by: https://github.com/anijain2305
2025-09-04 02:33:03 +00:00
57278d45f0 [Quant][Inductor][CPU] add qconv int8-mixed-bf16 patterns (#161487)
Summary:
Expand the patterns supported by qconv weight prepack. Specifically, expand the conv patterns of the int8-mixed-bf16 datatype to support the following two cases:
Case 1:
the `out_dtype` of `dequantize_per_tensor` is `torch.float32`

```
    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
            \          /
             Conv2d
```

Case 2:
the `out_dtype` of `dequantize_per_tensor` is `torch.bfloat16`

```
    dq_per_tensor  dq_per_channel
         \               |
                      to_bf16
                       /
             Conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161487
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
ghstack dependencies: #161486
2025-09-04 02:01:34 +00:00
cec0ff1228 [Quant][Inductor][CPU] add qlinear int8-mixed-bf16 patterns (#161486)
Summary:
Expand the patterns supported by qlinear weight prepack. Specifically, expand the linear patterns of the int8-mixed-bf16 datatype to support the following two cases:
Case 1:
the `out_dtype` of `dequantize_per_tensor` is `torch.float32`

    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
         |               |
     OPT(reshape)     permute
            \          /
             addmm/mm
                    |
           OPT(reshape)

or

    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
         |               |
       expand         permute
          \              |
                      expand
                       /
               bmm
                |
            OPT(add)

Case 2:
the `out_dtype` of `dequantize_per_tensor` is `torch.bfloat16`

    dq_per_tensor  dq_per_channel
         |               |
                       to_bf16
                         |
     OPT(reshape)   permute
            \          /
             addmm/mm
                    |
           OPT(reshape)

or

    dq_per_tensor  dq_per_channel
         |                |
                        to_bf16
                          |
       expand          permute
          \               |
                        expand
                        /
               bmm
                |
            OPT(add)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161486
Approved by: https://github.com/Xia-Weiwen, https://github.com/jansel
2025-09-04 01:53:02 +00:00
65985937d9 expose number of outputs in native runtime for unified runtime (#161723)
This is only user outputs, which is what we want. Spoke to @zhxchen17 though, and it seems like nativeRT might have some bugs in propagating updates for things like input mutation or buffer mutation. Something to take a look at in a follow-up.

Also I have no idea where the nativeRT tests are. Any pointers @zhxchen17  @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161723
Approved by: https://github.com/zhxchen17
2025-09-04 01:20:31 +00:00
fbf3d2027d use sym_or instead of any to avoid dde in calc_conv_nd_return_shape (#162084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162084
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-09-04 01:20:22 +00:00
8678d831c4 [dynamo] rename set_fullgraph to error_on_graph_break (#161739)
Renaming `set_fullgraph` to `error_on_graph_break` for now. There are no semantic differences yet. In a followup PR, we will introduce a new `torch.compile` option `error_on_graph_break` that has lower priority than `fullgraph` so that `fullgraph` really returns 1 graph.

I could keep `set_fullgraph` as a deprecated alias for `error_on_graph_break` for now, but I'm hoping that won't be necessary since it's still private API (there are no internal callsites yet, and there are no significant OSS callsites yet).

 cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos @guilhermeleobas @xmfan as primary users for `set_fullgraph`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161739
Approved by: https://github.com/xmfan, https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/mlazos
2025-09-04 01:15:06 +00:00
1281470155 [DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682)
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced:
- Configuration of the target DType and the block size.
- Multi-threaded dequantization for efficiency

Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D80174674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
2025-09-04 01:09:53 +00:00
9458d1ac3b [inductor] pdl inductor option (disabled by default) (#160928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928
Approved by: https://github.com/eellison
2025-09-04 00:35:23 +00:00
3c45af079a kill allow_complex_guards_as_runtime_asserts (#161794)
Summary:
[reland]
Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D81334984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794
Approved by: https://github.com/zhxchen17
2025-09-04 00:17:01 +00:00
aad96a2022 Revert "Contiguous subgraph decomposition (#161241)"
This reverts commit d64718503728001a1e78168fd7f2d4ff23e57285.

Reverted https://github.com/pytorch/pytorch/pull/161241 on behalf of https://github.com/jeffdaily due to breaks rocm mi300 tests ([comment](https://github.com/pytorch/pytorch/pull/161241#issuecomment-3251185098))
2025-09-04 00:14:22 +00:00
5f3cbc9442 fixed typo error (#162055)
Fixes #162054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162055
Approved by: https://github.com/RajeshvShiyal, https://github.com/malfet
2025-09-04 00:06:58 +00:00
a918bbad6a [inductor] fix test output path 2 (#162085)
Fix test_output_path_2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162085
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-09-04 00:03:47 +00:00
8ec551bb35 [aot-compile] strip internal tracebacks for non-verbose graph breaks + include user file/lineno (#162005)
pytest test/dynamo/test_aot_compile.py -k test_aot_compile_graph_break_error_fmt

before
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
    aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 717, in aot_compile
    return aot_compile_fullgraph(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/aot_compile.py", line 132, in aot_compile_fullgraph
    capture_output = convert_frame.fullgraph_capture(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 947, in fullgraph_capture
    dynamo_output = compile_frame(
                    ^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 1020, in compile_frame
    bytecode, tracer_output = transform_code_object(code, transform)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/bytecode_transformation.py", line 1592, in transform_code_object
    tracer_output = transformations(instructions, code_options)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 992, in transform
    tracer_output = trace_frame(
                    ^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 312, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 821, in trace_frame
    run_tracer()
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 803, in run_tracer
    tracer.run()
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1472, in run
    while self.step():
          ^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1342, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 902, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3364, in CALL
    self._call(inst)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3358, in _call
    self.call_function(fn, args, kwargs)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1260, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/variables/functions.py", line 1513, in call_function
    unimplemented_v2(
  File "/data/users/$USER/pytorch/torch/_dynamo/exc.py", line 596, in unimplemented_v2
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
```
after
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
    aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 737, in aot_compile
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html

from user code:
   File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
consistent w/ std torch.compile
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 16, in <module>
    res = compiled(*example_inputs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 850, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html

from user code:
   File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
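
The shortened traceback above comes from the `raise e.with_traceback(None) from e.__cause__` line visible in `eval_frame.py`. As a minimal sketch of that idiom in plain Python (the `Unsupported` class, `compile_entry`, and the `_internal_*` helpers are hypothetical stand-ins, not Dynamo code), dropping the traceback before re-raising hides the internal frames while a verbose mode keeps them:

```python
import traceback


class Unsupported(RuntimeError):
    """Hypothetical stand-in for a compiler error type."""


def _internal_step():
    # Deepest "internal" frame; this is where the error originates.
    raise Unsupported("Call to graph_break()")


def _internal_driver():
    _internal_step()


def compile_entry(verbose=False):
    """Hypothetical entry point that hides internal frames unless verbose."""
    try:
        _internal_driver()
    except Unsupported as e:
        if verbose:
            raise  # keep the full internal stack
        # Drop the accumulated traceback; frames are re-collected from here up.
        raise e.with_traceback(None) from e.__cause__


def frames_of(exc):
    """Return the function names recorded in an exception's traceback."""
    return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]


def capture(verbose):
    try:
        compile_entry(verbose=verbose)
    except Unsupported as e:
        return frames_of(e)


short_frames = capture(verbose=False)
full_frames = capture(verbose=True)
```

In this sketch `short_frames` no longer contains the `_internal_*` frames, while `full_frames` still does, which mirrors the difference between the "before" and "after" outputs.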

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162005
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2025-09-03 23:19:47 +00:00
36d207fcaa [CI] viable strict upgrade: Explicitly name which linux binary wheels should block (#162100)
Reason:
rocm binary builds should not block viable strict upgrade.  It is queuing/canceled so viable strict is 1.2 days old

Tested by mangling the workflow file to get to the actual call of the python script `python ../test-infra/tools/scripts/fetch_latest_green_commit.py --required-checks '["pull", "trunk", "lint", "^linux-binary-manywheel$", "^linux-binary-libtorch-release$", "linux-aarch64"]' --viable-strict-branch viable/strict --main-branch master`, which I then ran locally where I have credentials.  It returned d64718503728001a1e78168fd7f2d4ff23e57285 which is green.  Without this change, it returns 5e5870e858f60ff4bf87d03f3592097e934a9580, which is pretty old

The other solution would have been to mark it as unstable I think

Side note, why is it master and how is it working like that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162100
Approved by: https://github.com/huydhn
2025-09-03 22:38:32 +00:00
99f356fa58 [ROCm] revamp miopen integration (#161687)
Update sources under ATen/miopen and ATen/native/miopen to align with best practices. Avoid reshape_ calls inside backward operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161687
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 22:28:09 +00:00
0af70e2353 Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)
To mitigate queueing on MI2xx runners since Cirrascale runners are offline. Match cron schedule of periodic.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162103
Approved by: https://github.com/jeffdaily, https://github.com/seemethere
2025-09-03 21:51:03 +00:00
b1bb98ddeb [ROCm] TunableOp should use HIP version, not ROCm version (#162067)
Fixes #160874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162067
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 21:42:23 +00:00
abc447174c [PP] Add profiling to schedule execution (#160753)
Profiling title will be `str(action)`

<img width="1545" height="694" alt="image" src="https://github.com/user-attachments/assets/60b3506b-b8d6-4ae0-8b32-0d51d45fa2f0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160753
Approved by: https://github.com/wconstab
2025-09-03 21:31:50 +00:00
734ce8eba9 Rename propagate_tensor_meta to make private again (#161744)
Rename the wrapper `propagate_tensor_meta` added in #161334 to make it clearly private, and rename the existing LRU function to accommodate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161744
Approved by: https://github.com/bdhirsh
2025-09-03 21:11:45 +00:00
98efc9e93d [ROCm] Bump AOTriton to 0.11b (#161754)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns a natural-base `logsumexp` tensor, matching CUDA's behavior
  - PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with new packaging scheme of
AOTriton 0.11

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses the pre-compiled runtime library that exactly
  matches the ROCM in the build environment. For PyTorch builds with ROCm
  versions not listed in the file, the build process will build AOTriton
  runtime without GPU images from source
  - This avoids any further ABI breaks like ROCM 6.4 -> 7.0
  - recursive git clone is disabled since building AOTriton runtime does not
    require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-09-03 20:45:44 +00:00
994f2a5dbc [SymmMem][CI] Make sure group names are consistent (#162035)
Unblocking #161741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162035
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-09-03 20:40:24 +00:00
d1706d9128 [Symmetric memory] set handle type for ROCm (#161741)
Fixes #161722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161741
Approved by: https://github.com/kwen2501
2025-09-03 20:33:35 +00:00
1aa7476885 fix to segmentation fault when empty tensor is passed to choose_qpara… (#161966)
…ms_optimized

Fixes #153326

Minimal code to reproduce error:
```
import torch

tensor = torch.tensor([])

torch.choose_qparams_optimized(
    tensor,
    0,
    200,
    0.16,
    8
)
```

Previous Output:
`Segmentation fault`

Now Output:
```
Traceback (most recent call last):
  File "/home/amaitra/work/tests/issue_153326.py", line 5, in <module>
    torch.choose_qparams_optimized(
RuntimeError: input tensor is empty and has no data
```

Caused because `const float* input_row =input_tensor.const_data_ptr<float>();` becomes null
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161966
Approved by: https://github.com/Skylion007
2025-09-03 20:26:26 +00:00
8e23a1227b [ROCm/Windows] Fix build failures and support some BLAS calls (#161981)
* Support getrsBatched/geqrfBatched/gelsBatched on Windows ROCm (fixes https://github.com/ROCm/TheRock/issues/1367)
* Fix windows pytorch build with USE_DISTRIBUTED=ON by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161981
Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 20:26:14 +00:00
850e1382a9 [hipify] Replace cudaStreamCaptureStatusNone (#161992)
Replace additional CUDA symbols with HIP symbols

Differential Revision: D81420086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161992
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-09-03 20:23:32 +00:00
3c0ff1b569 [SymmMem] Add root argument to broadcast op (#161090)
It was missing earlier. Also added range check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090
Approved by: https://github.com/fegin
2025-09-03 20:17:45 +00:00
c465b3d52c [2/n][export] Refactor PT2 Archive weight saving and loading (#161520)
Summary:
The saving (serialization) part of PT2 archive weight refactoring.
The loading (deserialization part) has been landed in D80035490

Test Plan:
CI

Rollback Plan:

Differential Revision: D80970931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161520
Approved by: https://github.com/SherlockNoMad
2025-09-03 20:12:49 +00:00
f4c33cd44a [pt2e] Avoid getting model device once per node (#159901)
**Summary:** Previously, we called `assert_and_get_unique_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we should call it once at the beginning and reuse the same model device across all the nodes.

**Test Plan:**
python test/test_quantization.py -k TestQuantizePT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901
Approved by: https://github.com/jerryzh168
2025-09-03 19:29:00 +00:00
92576a594b Prototype for building non-strict leak detector (#160456)
Summary:
Our strategy for detecting fake tensor leakage in non-strict for outside scope (side effects happening outside of model.forward) is:
1. We do gc.collect() before export and get the alive fake tensors
2. We dump the proxy to fake tensor map from make_fx tracer
3. We query gc again to get alive fake tensors
4. We take the delta between (1) and (3)
5. Filter out fake tensors that are:
    1. Associated with `TrackedFake` (input tracking thing in symbolic_shapes)
    2. Associated with `gm.meta`
6. Do ID match with the proxies and emit their stacktraces.

We rely on (https://github.com/pytorch/pytorch/pull/159923) for other sources of leakages such as:
1. We failed to proxy an operator (like param.data)
2. We cache some tensor in model.forward (https://github.com/pytorch/pytorch/issues/155114)

In general, we notice that `gc.collect()` and querying gc for live objects are fairly slow, so this feature is only turned on via an environment variable. We should note in the public-facing export documentation that users who run into unexpected fake tensor errors should turn on this environment variable for further analysis.
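
Steps 1, 3, and 4 of the strategy above (snapshot live objects, run the traced code, diff the snapshots) can be sketched in plain Python. This is an illustration only: `FakeTensorStandIn`, `live_instances`, and `leaked_during` are hypothetical names, and the real detector also filters `TrackedFake` and `gm.meta` entries and matches against proxies, which is not shown here.

```python
import gc


class FakeTensorStandIn:
    """Hypothetical stand-in for a fake tensor; not the real class."""

    def __init__(self, name):
        self.name = name


def live_instances(cls):
    """Collect garbage, then map id -> object for every live instance of cls."""
    gc.collect()
    return {id(obj): obj for obj in gc.get_objects() if type(obj) is cls}


def leaked_during(fn, cls):
    """Return instances of cls that exist after fn() but did not exist before."""
    before = set(live_instances(cls))
    fn()
    after = live_instances(cls)
    # Note: id-based diffing can miss an object that reuses a freed address.
    return [obj for oid, obj in after.items() if oid not in before]


_module_cache = []


def traced_function():
    local = FakeTensorStandIn("local")  # freed when the function returns
    _module_cache.append(FakeTensorStandIn("leaked"))  # outlives the call


leaks = leaked_during(traced_function, FakeTensorStandIn)
leaked_names = [obj.name for obj in leaks]
```

The two `gc.collect()` passes and the full scan of `gc.get_objects()` are the slow parts, which is consistent with gating the feature behind an environment variable.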

Test Plan:
Test plan

Rollback Plan:

Differential Revision: D80003204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160456
Approved by: https://github.com/pianpwk
2025-09-03 19:21:27 +00:00
cd529b686d [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)
### Motivation

* MI250 Cirrascale runners are currently experiencing network timeouts, leading to heavy queueing of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />

* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa

* Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR.

### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 18:34:07 +00:00
62c3f9a97f [inductor] Follow integer overflow rules in TypedExpr (#161922)
Fixes https://github.com/pytorch/pytorch/issues/161763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161922
Approved by: https://github.com/jansel
2025-09-03 18:33:18 +00:00
8076a185c8 Offload set method execution to CPython when possible (#160763)
Reduces CPython `test_set.py` runtime from 63.477s to 40.298s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160763
Approved by: https://github.com/anijain2305
2025-09-03 18:26:05 +00:00
f00445b43e [inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)
# why

- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
  for codegen and heuristics

- unified place where we can just adjust them for the template

# what

- inside get_mm_configs, return not the passed in kernel inputs,
  but allow the template heuristic to adjust them if necessary

- the default implementation right now just passes them back

this diff just adds the functionality, but does not exercise it
other than the default (passthrough)
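
A minimal sketch of the hook shape described above, under stated assumptions: every class and method name here (`KernelInputs`, `TemplateHeuristic`, `ScaledMMHeuristic`, `adjust_kernel_inputs`) is a hypothetical stand-in chosen for illustration, and the real Inductor types differ. It shows a passthrough default plus one override that turns 1-D shapes into 2-D ones.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KernelInputs:
    """Hypothetical container holding only input shapes."""

    shapes: Tuple[Tuple[int, ...], ...]


class TemplateHeuristic:
    def adjust_kernel_inputs(self, inputs: KernelInputs) -> KernelInputs:
        # Default behaviour: hand the inputs back unchanged (passthrough).
        return inputs


class ScaledMMHeuristic(TemplateHeuristic):
    def adjust_kernel_inputs(self, inputs: KernelInputs) -> KernelInputs:
        # Illustrative override: "unsqueeze" any 1-D shape to (n, 1).
        adjusted = tuple(
            (shape[0], 1) if len(shape) == 1 else shape for shape in inputs.shapes
        )
        return KernelInputs(shapes=adjusted)


def get_mm_configs(inputs: KernelInputs, heuristic: TemplateHeuristic) -> KernelInputs:
    # Return the heuristic-adjusted inputs rather than the ones passed in.
    return heuristic.adjust_kernel_inputs(inputs)


original = KernelInputs(shapes=((128, 64), (64,)))
passthrough = get_mm_configs(original, TemplateHeuristic())
unsqueezed = get_mm_configs(original, ScaledMMHeuristic())
```

Keeping the adjustment behind a single method means callers always consume whatever the heuristic returns, so templates that need no change pay nothing beyond the passthrough call.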

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
2025-09-03 18:23:22 +00:00
3559c354ce stop suggesting using guard_size_oblivious on data dependent errors (#160510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160510
Approved by: https://github.com/ezyang
2025-09-03 18:07:59 +00:00
71992dd805 S390x: build nightly binaries for new pythons (#161920)
Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries

Fixes #161515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161920
Approved by: https://github.com/malfet
2025-09-03 17:38:38 +00:00
d647185037 Contiguous subgraph decomposition (#161241)
## Summary

Adds a subgraph decomposition for `addmm` and `mm` that performs well when `K` is large compared to `M` and `N`. It also works as an alternative to `split-k` on AMD (transposed inputs only), since `split-k` does not currently support AMD.

## Background

On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
  ))
```
is a lot slower than:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
  ))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
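
The two layouts above differ only in the strides of `B`. The following pure-Python sketch (helper names are hypothetical, and the contiguity check is simplified relative to PyTorch's, which special-cases size-1 dimensions) reproduces the stride arithmetic for the shapes quoted in this message:

```python
def contiguous_strides(size):
    """Row-major strides (in elements) for a tensor of the given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def is_contiguous(size, stride):
    """Simplified check: strides must equal the row-major strides."""
    return list(stride) == contiguous_strides(size)


def transpose_2d(size, stride):
    """Swap the two dimensions of a 2-D view without moving any data."""
    return [size[1], size[0]], [stride[1], stride[0]]


# B from the slow case: a transposed view of a contiguous [6144, 178176] tensor.
b_size, b_stride = transpose_2d([6144, 178176], contiguous_strides([6144, 178176]))

# B from the fast case: the same logical shape laid out row-major.
b_contig_stride = contiguous_strides(b_size)
```

In PyTorch terms, going from the first layout to the second corresponds to calling `B.contiguous()`, which copies the data; the decomposition lets autotuning decide whether that copy pays for itself.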

## Data

I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
  addmm_16448x512x2048: +0.14%
  addmm_128x2048x2048: +0.01%
  addmm_128x768x1000: +0.75%
  addmm_12672x3072x768: +1.08%
  addmm_512x768x32000: +0.62%
  addmm_12608x384x384: +0.00%
  addmm_4160x1024x4096: +0.90%
  addmm_16x768x2: +0.56%
  addmm_12608x3072x768: +0.09%
  addmm_64x4096x1000: +2.77%
  addmm_256x1024x512: +1.99%
  addmm_30x256x256: +1.12%
  addmm_100480x128x384: +0.91%
  addmm_6400x2048x512: +0.25%
  addmm_61568x1024x256: +0.08%
  addmm_1x768x768: +0.93%
  addmm_12544x384x384: +0.19%
  addmm_128x512x1000: +0.77%
  addmm_2048x128x128: +1.32%
  addmm_128x3072x1000: +0.24%
  addmm_7936x512x2048: +0.07%
  addmm_8192x512x2048: +0.33%
  addmm_64x1024x1000: +1.43%
  addmm_128x2304x1000: +0.01%
  addmm_32768x256x2: +0.75%
  addmm_64x384x1152: +0.79%
  addmm_64x640x1000: +0.01%
  addmm_100480x128x128: +0.87%
  addmm_1152x3072x768: +1.13%
  addmm_8192x256x2048: +1.40%
  addmm_4096x128x768: +0.01%
  addmm_128x2560x1000: +0.01%
  addmm_12544x2048x512: +0.43%
  addmm_200704x24x96: +0.14%
  addmm_8448x512x2048: +0.96%
  addmm_50176x256x1024: +0.62%
  addmm_4160x4096x1024: +0.22%
  addmm_4096x768x768: +0.32%
  addmm_220x2048x512: +0.56%
  addmm_8x2048x1000: +1.12%
  addmm_256x197951x512: +26.99%
  addmm_401536x64x192: +0.60%
  addmm_2040x2048x512: +0.47%
  addmm_512x1024x256: +1.32%
  addmm_128x4096x1000: +1.67%
  addmm_12672x768x768: +0.34%
  addmm_128x368x1000: +0.77%
  addmm_96x1280x1000: +0.01%
  addmm_12544x512x2048: +0.41%
  addmm_6272x320x1280: +0.76%
  addmm_12544x3072x768: +0.09%
  addmm_64x384x1000: +0.39%
mm improvements when best:
  mm_200704x128x512: +1.29%
  mm_663552x16x16: +0.80%
  mm_4096x768x768: +0.51%
  mm_131072x64x31: +0.24%
  mm_12544x1152x384: +0.11%
  mm_128x2048x2: +0.46%
  mm_262144x16x23: +0.62%
  mm_50176x576x192: +0.37%
  mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%

Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)

Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%

```

## Results

The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```

It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
  addmm_128x16384x128: +0.18%
  addmm_128x262144x256: +38.24%
  addmm_128x200000x512: +14.76%
  addmm_256x800000x128: +0.06%
  addmm_131072x128x256: +0.27%
  addmm_128x256x131072: +0.25%
  addmm_2048x200000x64: +12.45%
mm improvements when best:
  mm_128x16384x128: +0.18%
  mm_128x262144x256: +38.05%
  mm_128x200000x512: +9.47%
  mm_256x800000x128: +0.99%
  mm_512x6400000x256: +3.17%
  mm_524288x64x64: +0.29%
  mm_2048x200000x64: +11.19%
  mm_8192x1000000x256: +34.14%
  mm_128x4096x100000: +0.40%
  mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%

Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%

```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and yields a very large improvement on low `M`, low `N`, and high `K` shapes.

Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866

## Test Plan:
New unit tests.

Differential Revision: D80771648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
2025-09-03 17:02:59 +00:00
eb18d32bda Add range_iterator (#161800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161800
Approved by: https://github.com/anijain2305
ghstack dependencies: #161799
2025-09-03 16:55:04 +00:00
889f01eb73 Add CPython test test_range (#161799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161799
Approved by: https://github.com/anijain2305
2025-09-03 16:55:04 +00:00
451ed93156 [inductor] fix split_aot_inductor_output_path on Windows. (#162058)
fix split_aot_inductor_output_path on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162058
Approved by: https://github.com/angelayi
2025-09-03 16:53:38 +00:00
9491d289b3 Support generic dynamic shape with padding (#160997)
Summary:
Inductor has the following configurations:

config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold

With static shapes, enabling these three options makes Inductor generate code for Flexible layout tensors that tries to pad every stride above config.padding_stride_threshold up to a multiple of config.padding_alignment_bytes. With dynamic shapes enabled, no padding is done today.
This PR introduces the following configuration, which lets the user request padded strides even for dynamic shape operations. It is opt-in so that the previous behaviour of not padding dynamic shape use cases is not broken. config.padding_stride_threshold does not apply here, since the values of the strides are dynamic.

config.pad_dynamic_shapes
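As a rough illustration of the stride padding described above, the sketch below pads each stride that exceeds a threshold up to a multiple of the alignment. It is a simplified model under stated assumptions: the function `padded_strides` and its parameters are hypothetical, the alignment is converted from bytes to elements by dividing by the item size, and Inductor's real implementation may differ in detail.

```python
def ceildiv(numer, denom):
    """Ceiling division for non-negative integers."""
    return -(-numer // denom)


def padded_strides(sizes, itemsize, alignment_bytes, stride_threshold):
    """Return row-major strides, padding large strides up to the alignment.

    Hypothetical helper: strides are counted in elements, so the byte
    alignment is first converted to an element count.
    """
    align = alignment_bytes // itemsize
    strides = [0] * len(sizes)
    running = 1
    # Walk from the innermost dimension outwards.
    for dim in reversed(range(len(sizes))):
        if running > stride_threshold and running % align != 0:
            running = ceildiv(running, align) * align
        strides[dim] = running
        running *= sizes[dim]
    return strides


# A 4 x 1025 float32 tensor: the row stride 1025 is padded up to 1056.
print(padded_strides([4, 1025], itemsize=4, alignment_bytes=128, stride_threshold=1024))
```

With dynamic shapes the stride values are symbolic, which is why a fixed threshold like the one in this sketch cannot be applied there.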

In addition to this a new mode "python_slow" has been added to launch grid calculation which achieves the same ceildiv behaviour that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic.
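For reference, ceildiv can be written with floor division alone, which works for arbitrary Python integers. The sketch below shows that form applied to a launch grid size; whether it matches the "python_slow" mode exactly is an assumption, and `launch_grid` is a hypothetical helper.

```python
def ceildiv(numer, denom):
    """Ceiling division expressed through floor division."""
    return -(-numer // denom)


def launch_grid(numel, block_size):
    """Number of blocks needed to cover numel elements (hypothetical helper)."""
    return ceildiv(numel, block_size)


print(launch_grid(1025, 256))
```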

Test Plan:
CI

Rollback Plan:

Differential Revision: D80468808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
2025-09-03 15:58:18 +00:00
c157cf6488 port distributed tensor parallel test files for Intel GPU (#161261)
In this PR, we port four test files from test/distributed/parallel and one test file from test/distributed/debug to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as far as possible:

1. Use torch.accelerator for general gpu
2. Skip the case if running on xpu which has known issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161261
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-03 15:03:32 +00:00
bb950284c7 Revert "[inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)"
This reverts commit 90f50f7e68e120d9574e6e3189e37b4280010ad9.

Reverted https://github.com/pytorch/pytorch/pull/161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](https://github.com/pytorch/pytorch/pull/161339#issuecomment-3249600885))
2025-09-03 14:56:02 +00:00
f27985b7e7 Revert "[CUDAGraph] add config to error on skipping cudagraph (#161862)"
This reverts commit 204697f0e695d82894c5010fbec664c4391f90cc.

Reverted https://github.com/pytorch/pytorch/pull/161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](https://github.com/pytorch/pytorch/pull/161862#issuecomment-3249582583))
2025-09-03 14:50:44 +00:00
0cd6c56bdf Revert "test: ensure editable cached wrapper is respected (#160943)"
This reverts commit bbedc71fd3267c639c38b4ec25eaa22f973d9c4d.

Reverted https://github.com/pytorch/pytorch/pull/160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](https://github.com/pytorch/pytorch/pull/160943#issuecomment-3249565671))
2025-09-03 14:46:35 +00:00
b40d9432be [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-03 14:31:09 +00:00
02c83f1334 [BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
Followup after https://github.com/pytorch/pytorch/pull/154012

Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999
Approved by: https://github.com/drisspg
2025-09-03 14:31:08 +00:00
aed33a8fcb [Inductor][Tritonparse] Get Inductor kernel params (#161953)
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse.

Reviewed By: davidberard98, FindHao

Differential Revision: D80994791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161953
Approved by: https://github.com/FindHao
2025-09-03 14:11:27 +00:00
b16d3f4c8c [AOTI] Fix a bug from load_constants (#161887)
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
        num_constants - num_folded_constants);
```

but the for loop does not consider it
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
even in the for loop, it does
```
bool from_folded = this->constant_from_folded(i);
      if (from_folded) {
        continue;
      }
```
but `i` could still be wrong
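The indexing problem can be modelled in a few lines of Python. This is only a sketch of the pattern, not the actual fix in the PR: `build_internal_offsets` and its arguments are hypothetical, and it shows one possible remedy, a separate counter for the compact array.

```python
def build_internal_offsets(sizes, is_folded):
    """Compute byte offsets for the non-folded constants only.

    The offsets list has one slot per non-folded constant, so it must be
    indexed by a separate counter rather than by the loop index i.
    """
    num_constants = len(sizes)
    num_folded = sum(is_folded)
    offsets = [0] * (num_constants - num_folded)
    slot = 0      # index into the compact offsets list
    running = 0   # running byte offset
    for i in range(num_constants):
        if is_folded[i]:
            continue
        # Using offsets[i] here could index past the end of the list.
        offsets[slot] = running
        running += sizes[i]
        slot += 1
    return offsets


print(build_internal_offsets([16, 32, 64, 8], [False, True, False, False]))
```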

Rollback Plan:

Differential Revision: D81425007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887
Approved by: https://github.com/angelayi
2025-09-03 07:45:16 +00:00
4ae57d448c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-03 07:33:55 +00:00
90b08643c3 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-03 07:33:55 +00:00
b0a3e58dd7 Add inline fast paths for SymInt operators (#161586)
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from #160580
and Linux perf.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
2025-09-03 06:54:47 +00:00
fa1514acf1 Outline SymInt::maybe_as_int_slow_path (#161466)
Keeps SymInt::maybe_as_int small enough to inline.

Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161466
Approved by: https://github.com/ezyang
2025-09-03 06:54:47 +00:00
827f0d4054 Using get_paths() to get correct installation path for PYTHONPATH (#161947)
As the title stated.
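For context, a minimal sketch of querying installation paths with the standard library is shown below. It assumes the `get_paths()` in the title refers to `sysconfig.get_paths()`. Which key the PR actually uses is not stated here, so the choice of `purelib` is an assumption.

```python
import sysconfig

# Mapping of installation scheme names to directories for this interpreter.
paths = sysconfig.get_paths()

# "purelib" is the directory for pure-Python packages (site-packages).
site_packages = paths["purelib"]
print(site_packages)
```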
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161947
Approved by: https://github.com/albanD
ghstack dependencies: #161845, #161903
2025-09-03 06:38:03 +00:00
2c03f0acc5 [MPS] enable cat op for sparse (#162007)
Enable cat op for sparse on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162007
Approved by: https://github.com/malfet
2025-09-03 06:31:35 +00:00
f8ffa9194e Perf nitpicks on python_arg_parser's is_int_or_symint_list (#161998)
This function has come up in DTensor perf work, and I had a nitpick on #160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161998
Approved by: https://github.com/ezyang
2025-09-03 05:38:30 +00:00
50fc22dedf [Intel GPU] Fix XPU SDPA default priority_order UT fail (#161690)
Fixes #161483

When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` can pass because other xpu cases would call SDPA so that the priority order is set by eec876deb6/aten/src/ATen/native/mkldnn/xpu/Attention.cpp (L98-L112)

However, when the case `test_default_priority_order` is run separately, the priority order is unset so that this case would fail. This PR fixes this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161690
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-09-03 04:43:27 +00:00
e381d4b020 [DTensor] forbid view ops to redistribute when local split is impossible (#161950)
This PR is a followup to https://github.com/pytorch/pytorch/pull/149764.

That PR only forbids illegal views caused by `Flatten`; this PR also forbids illegal views caused by `Split`.

This PR also updates the error message to be less about internal implementation details, which users may find confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161950
Approved by: https://github.com/ezyang
2025-09-03 04:40:11 +00:00
8875d6e394 [vllm hash update] update the pinned vllm hash (#161929)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161929
Approved by: https://github.com/pytorchbot
2025-09-03 04:26:38 +00:00
00636e0171 [Reland][Inductor] Prune configs that require more shared memory than the hardware limit. (#161996)
Summary:
This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs.

This diff removes configurations that exceed the hardware shared memory limit, which otherwise cause the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```

Test Plan:
```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996
Approved by: https://github.com/coconutruben
2025-09-03 04:23:09 +00:00
09d2f1b631 [audio hash update] update the pinned audio hash (#161928)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161928
Approved by: https://github.com/pytorchbot
2025-09-03 04:22:55 +00:00
dac8a4b91c Using pip3 install instead of python setup.py develop/install (#161903)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161903
Approved by: https://github.com/ezyang
ghstack dependencies: #161845
2025-09-03 03:12:18 +00:00
d789451ff6 [OpenReg] Migrate Accelerator Document from source/notes into source/accelerator (#161845)
As the title states.

As the document grows, it will accumulate more and more content. To make it easier for users to read and for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
2025-09-03 03:12:18 +00:00
0447f2d99b build: Add fallback commands to setup.py (#162009)
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop

Ideally these should just work and should provide backwards compat.

Thought process here is that multiple people rely on these commands and just because setuptools wants to drop support for this I don't think a lot of our downstream users who build from source are expecting these to be gone.

This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-09-03 02:56:10 +00:00
d5643e8f3a [dynamo, nested graph breaks] support nested graph breaks that cause skipped frames (#160470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160470
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138, #159786
2025-09-03 02:47:07 +00:00
9b81fe281d [c10d] Lessen density of barrier warning (#162015)
Warnings are great, but too dense when there are many ranks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162015
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
2025-09-03 02:20:54 +00:00
90f50f7e68 [inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)
# why

- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
  for codegen and heuristics

- unified place where we can just adjust them for the template

# what

- inside get_mm_configs, return not the passed in kernel inputs,
  but allow the template heuristic to adjust them if necessary

- the default implementation right now just passes them back

this diff just adds the functionality, but does not exercise it
other than the default (passthrough)
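The shape of such a hook can be sketched as follows. All names here (`TemplateHeuristic`, `adjust_kernel_inputs`, `UnsqueezeHeuristic`) are hypothetical and are not the identifiers used in Inductor. Input nodes are modelled as plain shape tuples to keep the example self-contained.

```python
class TemplateHeuristic:
    """Base heuristic: the default hook passes the inputs back unchanged."""

    def adjust_kernel_inputs(self, kernel_inputs):
        return kernel_inputs


class UnsqueezeHeuristic(TemplateHeuristic):
    """Example override: turn 1-D shapes into 2-D column shapes."""

    def adjust_kernel_inputs(self, kernel_inputs):
        return [
            shape + (1,) if len(shape) == 1 else shape
            for shape in kernel_inputs
        ]


inputs = [(128, 64), (64, 256), (128,)]
print(TemplateHeuristic().adjust_kernel_inputs(inputs))
print(UnsqueezeHeuristic().adjust_kernel_inputs(inputs))
```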

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
2025-09-03 01:03:57 +00:00
877062c9d3 [inductor][choices][ez] pass through layout and input_nodes (#161338)
# why

- params already available in get_mm_configs
- simplifies the code
- adds a possibility to edit the nodes/layout in
  a centralized place

# what

- add layout and input_nodes into extra_kwargs
- no other modifications

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520575](https://our.internmc.facebook.com/intern/diff/D81520575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161338
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #161123, #161124, #161125, #161126, #161336
2025-09-03 01:03:57 +00:00
c31dee6fa5 [inductor][ez] ExternChoice with maybe_append_choice (#161336)
# why

- make the API for ExternChoice the same as KernelTemplate
- make it possible to use the same retrieval point as templates

# what

- add a maybe_append_choice to ExternChoice that under the hood
  invokes self.bind

This pr does not actuate the new path, but just exposes it

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py
```

Differential Revision: [D81520578](https://our.internmc.facebook.com/intern/diff/D81520578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161336
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126
2025-09-03 01:03:57 +00:00
6cb13dd3cc [inductor] move scaled_mm template args into heuristics (#161126)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

- use the infrastructure for template heuristics to provide extra kwargs
  that are fixed for a template/op pair to provide the suffix args
  and epilogue function/fn for scaled_mm

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125
2025-09-03 01:03:57 +00:00
cbf01c11ff [inductor] move addmm/baddbmm template args into heuristics (#161125)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

- use the infrastructure for template heuristics to provide extra kwargs
  that are fixed for a template/op pair to provide the prefix args
  and epilogue function/fn for addmm/baddbmm

- expand kernelinputs to also be able to shuttle around non tensor
  inputs (scalars) as is needed for alpha and beta

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm
```

Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124
2025-09-03 01:03:57 +00:00
7cdfa520a6 [inductor] move tma workspace in heuristics (#161124)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the workspace_arg for
all the tma templates

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma
```

Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124
Approved by: https://github.com/jansel
ghstack dependencies: #161123
2025-09-03 01:03:57 +00:00
1485ac3264 [inductor] add notion of extra_kwargs for mm_configs (#161123)
# why

- some kwargs are not choice dependent, but are
  always the same for a specific op or template
- this enables us to track those separately from the
  choice ones, and thus makes intercepting them
  cleaner
- maybe_append_choices can then be simplified to
  just pass through the kwargs

# what

- hookup for template heuristics to have per template/op extra
  kwargs that are always the same, for all choices
- hookup for the caller of get_mm_configs to provide template/op
  kwargs to override some of the template/choice kwargs

this pr does not use the new machinery, and everything is empty
for now. subsequent prs start using it to simplify ops
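A minimal sketch of how the three kwarg sources could be combined is shown below. The function name `build_template_kwargs`, the example keys, and the precedence order (caller overrides last) are assumptions for illustration, not the PR's implementation.

```python
def build_template_kwargs(choice_kwargs, extra_kwargs, overrides=None):
    """Merge per-choice kwargs with per-template/op kwargs.

    Assumed precedence: choice kwargs first, then template/op extras,
    then any overrides supplied by the caller.
    """
    merged = {**choice_kwargs, **extra_kwargs}
    merged.update(overrides or {})
    return merged


choice = {"BLOCK_M": 64, "BLOCK_N": 128, "num_stages": 3}
extras = {"epilogue": "identity"}            # same for every choice
caller = {"num_stages": 2}                   # caller-provided override
print(build_template_kwargs(choice, extras, caller))
```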

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123
Approved by: https://github.com/jansel
2025-09-03 01:03:57 +00:00
c5b8a10be5 Fix compiler errors in 3.14 stub definitions (#161792)
The functions here expect to return pointers, but currently aren't returning anything.  Make them return NULL.

The properties array wants an extra set of braces.  One pair for the array, another for the first item in the array.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792
Approved by: https://github.com/Skylion007
2025-09-03 00:58:41 +00:00
a02ee4a816 [SymmMem] Use non-blocking version of getmem (#162006)
As titled, so that the `getmem` calls in the loop are non-blocking, so that we max out the issuance rate.
Also added a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006
Approved by: https://github.com/ngimel
2025-09-02 23:55:22 +00:00
81b7b16618 Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)" (#161949)
This PR relands #161142, which was reverted only so that another PR could be reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949
Approved by: https://github.com/jansel
2025-09-02 23:43:27 +00:00
4cdaf8265d Revert "Update Kineto submodule (#161572)"
This reverts commit d33840c542b387ab08ba49aa6c45aa9567fd9be7.

Reverted https://github.com/pytorch/pytorch/pull/161572 on behalf of https://github.com/seemethere due to This appears as though its causing downstream build failures in inductor workflows and for developers working locally. Going to revert out of an abundance of caution. ([comment](https://github.com/pytorch/pytorch/pull/161572#issuecomment-3247121981))
2025-09-02 23:28:19 +00:00
874069fbe4 Log Const Folded Node (#161827)
Summary: Log folded nodes for easier debugging.

Test Plan:
sandcastle.

Rollback Plan:

Reviewed By: henryoier

Differential Revision: D81352098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161827
Approved by: https://github.com/henryoier, https://github.com/yewentao256
2025-09-02 23:23:51 +00:00
ab643e4dbb [SymmMem] Increase minimum nthreads to cover sync needs in NVL72 (#161983)
`sync_remote_blocks` maps threads to peers. Previously the minimum nthreads was the warp size, which is too small to cover NVL72. Bumping it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161983
Approved by: https://github.com/ngimel
2025-09-02 23:18:08 +00:00
5a2da090ed [SymmMem] Make sure CUDA runtime is initialized before NVSHMEM init (#161232)
Previously, without calling `torch.empty` before NVSHMEM init, we see error below:
```
src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting
src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
```
Fixing it by calling a `cudaFree(nullptr)` to make sure CUDA runtime is initialized before NVSHMEM init.
Removing all `torch.empty(1)` calls from tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232
Approved by: https://github.com/ngimel
ghstack dependencies: #161214
2025-09-02 22:53:28 +00:00
bd39e47fee [ONNX] Default to dynamo export (#159646)
Set dynamo=True and enable fallback.

1. Implemented the compatible behavior where a BytesIO object is accepted as `f`
2. Update tests to explicitly set dynamo=False

#151693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646
Approved by: https://github.com/titaiwangms
2025-09-02 22:45:55 +00:00
e4bd0ff4f8 [aot precompile] Handle closure variables. (#161990)
We previously assumed aot precompile should only work on non-closures. This is hard to enforce in practice because we will see a lot of cases with decorators (e.g. Hugging Face models)
```
def check_inputs(fn):
    # Wrap fn so that every positional argument is validated before the call.
    def _fn(*args, **kwargs):
        for arg in args:
            assert arg.shape[0] > 1

        return fn(*args, **kwargs)
    return _fn

@check_inputs
def foo(x, y):
    a = x + x
    b = y + y
    c = a + b
    return c
```
It doesn't make sense to not support these cases since they are straightforward to handle.

This PR adds the logic to handle closures and make sure they can be precompiled properly.
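To see why a decorated function is a closure, the sketch below inspects the free variables of the wrapper using standard Python attributes. It uses plain lists instead of tensors so that it runs without torch, and it does not show how the PR itself handles closures.

```python
def check_inputs(fn):
    def _fn(*args, **kwargs):
        for arg in args:
            assert len(arg) > 1
        return fn(*args, **kwargs)
    return _fn


@check_inputs
def foo(x, y):
    return [a + b for a, b in zip(x, y)]


# After decoration, the name foo refers to _fn, which captures fn in a cell.
free_vars = foo.__code__.co_freevars
captured = {
    name: cell.cell_contents
    for name, cell in zip(free_vars, foo.__closure__)
}
print(free_vars, captured["fn"].__name__)
```

Any tool that serializes or precompiles `foo` therefore has to account for the captured `fn` cell as well as the wrapper's own code.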

Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990
Approved by: https://github.com/angelayi
2025-09-02 22:26:04 +00:00
15c77a8cfd Revert "Add inductor provenance mapping for cpp extern kernel (#161656)"
This reverts commit 5e5870e858f60ff4bf87d03f3592097e934a9580.

Reverted https://github.com/pytorch/pytorch/pull/161656 on behalf of https://github.com/jeffdaily due to causing failures on ROCm MI300, will add label to PR ([comment](https://github.com/pytorch/pytorch/pull/161656#issuecomment-3246965676))
2025-09-02 22:19:19 +00:00
791eff96c8 [MPS] Add igamma/igammac ops (#161927)
Fixes #161725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161927
Approved by: https://github.com/malfet
2025-09-02 20:52:02 +00:00
80dd397f19 Argsort doc stable kwargs (#161986)
Fixes #129311

Updated torch.argsort documentation to reflect that the 'stable' parameter is a keyword argument and not a normal parameter.

@albanD, @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161986
Approved by: https://github.com/soulitzer
2025-09-02 20:42:53 +00:00
a75e8cd270 Add api info for torch._C._nn.pyi (#161958)
Fix part of #148404

The APIs involved are as follows:

- max_pool2d_with_indices
- max_pool3d_with_indices
- elu
- glu
- max_unpool2d
- max_unpool3d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161958
Approved by: https://github.com/ezyang
2025-09-02 20:39:20 +00:00
4e42aa8ffc Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit b7034e9c924412bfbe8ee25a22d7e95239b5ca65.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))
2025-09-02 20:28:42 +00:00
420c52ecf3 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))
2025-09-02 20:24:01 +00:00
82f63c8f6d Revert "[HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946)"
This reverts commit 5561e45758d59c94605873d5db48ed459c004c3b.

Reverted https://github.com/pytorch/pytorch/pull/161946 on behalf of https://github.com/jeanschmidt due to Need to be reverted so https://github.com/pytorch/pytorch/pull/159889 can be ([comment](https://github.com/pytorch/pytorch/pull/161946#issuecomment-3246663376))
2025-09-02 20:18:52 +00:00
b4ad38279b [AOTI] Add Windows-compatible implementation of the mmap-related funcs (#161805)
Add Windows-compatible implementation of the mmap-related functions.

This code was validated in a small development project: https://github.com/xuhancn/cross_os_mmap?tab=readme-ov-file#cross_os_mmap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161805
Approved by: https://github.com/angelayi
2025-09-02 20:07:41 +00:00
ef8aabd424 [CD][CUDA13][ARM] aarch64 binary seems to be missing Triton dependency (#161833)
Requires: filelock, fsspec, jinja2, networkx, setuptools, sympy, typing-extensions

Seems to be missing Triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161833
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman
2025-09-02 19:31:14 +00:00
dcf385395d [MPS] Move sparsemps testing from test_mps to test_sparse (#161852)
Moves Sparse MPS testing from test_mps to test_sparse. Lots of skips now but I expect to remove them iteratively once ops are implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161852
Approved by: https://github.com/malfet
2025-09-02 19:04:11 +00:00
600c25e9a1 [dynamo] Graph break on torch.cuda.synchronize (#161925)
Today, AOTDispatcher ignores cuda.synchronize. Even if we wrap it in
some HOP, we need it to be a barrier op to prevent any inductor
reordering. So we graph break on it instead.

Fixes https://github.com/pytorch/pytorch/issues/160751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161925
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/mlazos
2025-09-02 19:00:21 +00:00
f981a7fa52 [SymmMem] Add device guard before alloc (#161214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161214
Approved by: https://github.com/ngimel
2025-09-02 18:53:45 +00:00
b7e207ca9f Make error message descriptive (#150627) (#159423)
Summary:

Adding the number of local shards to error messages makes them easier to debug.

Test Plan: UT

Differential Revision: D72396478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159423
Approved by: https://github.com/Saiteja64
2025-09-02 17:54:39 +00:00
5e5870e858 Add inductor provenance mapping for cpp extern kernel (#161656)
Summary: Add inductor provenance mapping for cpp extern kernel

Test Plan:

```
buck run fbcode//caffe2/test/inductor:provenance_tracing --  -r test_cpu_extern_kernel
```

Differential Revision: D81161751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161656
Approved by: https://github.com/angelayi
2025-09-02 17:54:04 +00:00
a99d8d39bc Update torch-xpu-ops commit pin (#161919)
# Motivation
1. Fallback some linalg functionality such as `linalg_eig`, `linalg_householder_product`, `linalg_solve_triangular` to CPU;
2. Fix codegen dependency bug.

# Additional Context
This PR aims to fix https://github.com/pytorch/pytorch/issues/161498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161919
Approved by: https://github.com/EikanWang
2025-09-02 17:09:07 +00:00
d6b74568e2 Revert "Add __init__.pyi to torch/linalg (#160750)"
This reverts commit 9a665ca3c472384e9d722bddba79e5a7680f1abd.

Reverted https://github.com/pytorch/pytorch/pull/160750 on behalf of https://github.com/jeanschmidt due to Seems that those errors are legitimate, and there is no test plan. I'll be proceeding with a revert ([comment](https://github.com/pytorch/pytorch/pull/160750#issuecomment-3246095383))
2025-09-02 16:53:55 +00:00
d33840c542 Update Kineto submodule (#161572)
Differential Revision: D81087601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161572
Approved by: https://github.com/cyyever, https://github.com/aaronenyeshi
2025-09-02 16:31:55 +00:00
f0c391102b [ONNX] Remove private members from torch.onnx (#161546)
Remove the import of two private functions into the `torch.onnx` namespace:

- _run_symbolic_function
- _run_symbolic_method

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161546
Approved by: https://github.com/titaiwangms
ghstack dependencies: #161323, #161449
2025-09-02 16:31:23 +00:00
a8d6943d36 ROCm: Enable overload tests from test_matmul_cuda (#161540)
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812
Rocblas backend tests are not enabled yet, WIP.

Test command
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
2025-09-02 16:27:42 +00:00
d11720efdb [ONNX] Remove unused logic from internal verification module (#161449)
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161449
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
ghstack dependencies: #161323
2025-09-02 16:22:49 +00:00
9a1c5c0a07 Detect torch function in lists as well (#160256)
We basically follow the same pattern we use for tensor arguments. The major downside is that we now have to traverse the entirety of the int list (and similar list arguments), which previously we didn't have to do. Benchmarks suggest a 2% regression for the affected calls.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160256
Approved by: https://github.com/albanD
2025-09-02 16:22:42 +00:00
524b78d4f6 [ONNX] Refactor torchscript based exporter (#161323)
Refactor the TorchScript-based exporter logic, moving it to a single (private) location for better code management. The original public module and method APIs are preserved.

- Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly
- Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private and unused

@albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and
`torch/autograd/_functions/utils.py`? Thanks!

## BC Breaking
- **Deprecated members in `torch.onnx.verification` are removed**

Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323
Approved by: https://github.com/titaiwangms, https://github.com/angelayi
2025-09-02 16:10:30 +00:00
793fc12aff [CD] Fix setup-xpu action issue (#161934)
Fix XPU CD test failure; see https://github.com/pytorch/pytorch/actions/runs/17370923627/job/49315624191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161934
Approved by: https://github.com/atalman
2025-09-02 16:03:44 +00:00
204697f0e6 [CUDAGraph] add config to error on skipping cudagraph (#161862)
Many users want a config that forces all CUDA ops to be captured by cudagraph. When that is not possible, pt2 should raise an error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: `False`). It also adds an environment variable, `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR`, to control it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang
2025-09-02 15:28:22 +00:00
789d494212 Defer loading hipify until it is needed (#160824)
Saves roughly 0.6 seconds (1.497s down to 0.909s) when running a single test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```
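
The general technique of deferring an import until first use can be sketched as follows. This is only an illustration under stated assumptions, not the change made in this PR: `load_heavy_module` and `convert` are hypothetical helpers, and the standard-library module `colorsys` stands in for `hipify`.

```python
import functools
import importlib


@functools.cache
def load_heavy_module():
    """Import the module on first use and cache the result.

    `colorsys` is only a stand-in for a module that is expensive to import.
    """
    return importlib.import_module("colorsys")


def convert(r, g, b):
    # The import cost is paid here, on the first call, not at start-up.
    return load_heavy_module().rgb_to_hsv(r, g, b)
```

Code paths that never call `convert` never pay for the import, which is where a saving like the one measured above would come from.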

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-02 15:27:37 +00:00
bc4db2c27f CUDA 13 -- sm_120 -- Nvidia 5090 -- ptxas warning : Value of threads … (#161380)
Bug fix:

I have opened an issue ( https://github.com/pytorch/pytorch/issues/161376 ) and I suggest this bug fix.

With this method, it compiles fine.

Fixes #161376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161380
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-09-02 13:27:57 +00:00
e304ea4e69 Revert "[BE] Update xpu driver repo for CD used almalinux 8.10 (#157356)"
This reverts commit c78bbdf4102d2c13bf6aa1abe4352aa7bca401ca.

Reverted https://github.com/pytorch/pytorch/pull/157356 on behalf of https://github.com/chuanqi129 due to This PR has performance regression on some workloads ([comment](https://github.com/pytorch/pytorch/pull/157356#issuecomment-3245319046))
2025-09-02 13:20:38 +00:00
1f820de639 [ci] Increase shards for linux-jammy-py3.10-clang18-asan on pull.yml to 7 (#161968)
[ci] Increase shards for linux-jammy-py3.10-clang18-asan to 7
2025-09-02 14:08:47 +02:00
fca2601c9d Improve error message for unsupported padding config (#160866)
Fixes #160053

The previous error message `Only 2D, 3D, 4D, 5D padding with non-constant  padding are supported for now` was not clear.

Now we have:

```
python3
Python 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
... import torch.nn.functional as F
... a = torch.empty(2,2,2,2)
... F.pad(a, (1,1), mode="circular")
...
Traceback (most recent call last):
  File "<python-input-0>", line 4, in <module>
    F.pad(a, (1,1), mode="circular")
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad
    return torch._C._nn.pad(input, pad, mode, value)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Padding size 2 is not supported for 4D input tensor.
Supported combinations for non-constant padding:
  - 2D or 3D input: padding size = 2 (pads last dimension)
  - 3D or 4D input: padding size = 4 (pads last 2 dimensions)
  - 4D or 5D input: padding size = 6 (pads last 3 dimensions)
>>>
```
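
The supported combinations listed in the new message can be written down as a small lookup. This is a sketch transcribed from the message text only; `is_supported_padding` is a hypothetical helper, not a PyTorch function, and it does not reproduce the actual check in the C++ implementation.

```python
# Supported (padding size -> input dimensionalities) for non-constant padding,
# transcribed from the error message above.
SUPPORTED_INPUT_DIMS = {
    2: {2, 3},  # padding size 2 pads the last dimension
    4: {3, 4},  # padding size 4 pads the last 2 dimensions
    6: {4, 5},  # padding size 6 pads the last 3 dimensions
}


def is_supported_padding(input_dim: int, pad_size: int) -> bool:
    """Return True if the combination appears in the table above."""
    return input_dim in SUPPORTED_INPUT_DIMS.get(pad_size, set())
```

For the failing call above, `is_supported_padding(4, 2)` is `False`, which matches the reported error for a 4D input with padding size 2.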

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866
Approved by: https://github.com/mikaylagawarecki
2025-09-02 07:15:59 +00:00
f8746b878d Add uuid to XPU device properties (#161392)
# Motivation
Fix https://github.com/intel/torch-xpu-ops/issues/1955
Refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-uuid, `ext::intel::info::device::uuid` returns `std::array<unsigned char, 16>` as the UUID.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161392
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-09-02 06:41:32 +00:00
8703debf66 [DTensor] select strategy with no redistribute when redistribute cost is 0 (#161882)
Before this PR, the `_select_strategy` always selects the first strategy with minimum redistribute cost. This causes unexpected behavior when
- multiple strategies have 0 redistribute costs
- the first one with 0 redistribute cost may perform local chunking

E.g. in memory-efficient SDPA, the default order of candidate strategies places a `Shard(2)` strategy before the `Replicate()` one. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_matrix_ops.py#L500-L512
When the input is `Replicate()`, `_select_strategy` will pick the `Shard(2)` strategy and do local chunking first, before local computation. This is clearly unexpected to users.

In this PR, we improve `_select_strategy` so that when multiple strategies have 0 redistribute cost, we prioritize the one which keeps input unchanged.
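
The tie-breaking idea can be sketched as below. This is a simplified illustration, not the DTensor implementation: `Candidate`, its fields, and `select_strategy` are hypothetical names, and the real `_select_strategy` works on DTensor strategy objects rather than on this toy class.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    redistribute_cost: float
    keeps_input_unchanged: bool  # assumed flag: no local chunking needed


def select_strategy(candidates):
    """Pick the cheapest candidate; on a zero-cost tie, prefer one that
    leaves the input placement unchanged."""
    min_cost = min(c.redistribute_cost for c in candidates)
    cheapest = [c for c in candidates if c.redistribute_cost == min_cost]
    if min_cost == 0:
        for c in cheapest:
            if c.keeps_input_unchanged:
                return c
    # Otherwise fall back to the first minimum-cost candidate.
    return cheapest[0]
```

With a zero-cost `Shard(2)` candidate listed before a zero-cost `Replicate()` candidate, this sketch returns the `Replicate()` one when the input is already replicated.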

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161882
Approved by: https://github.com/ezyang
2025-09-02 05:41:56 +00:00
1aeb421c34 Make pattern matcher resilient to ddes (#161843)
Motivated by the following discord support chat: https://discord.com/channels/1189498204333543425/1409578286186758195

```
import torch
@torch.compile(fullgraph=True, mode='reduce-overhead')
def get_mask(W: torch.Tensor, percentage_nonzeros: torch.Tensor):
    total_elements = W.numel()
    k = int(total_elements * percentage_nonzeros)
    top_k_indices = torch.topk(torch.abs(W).flatten(), k)[1]
    mask = torch.zeros(total_elements, dtype=torch.bool, device=W.device)
    mask.scatter_(0, top_k_indices, True)
    mask = mask.view(W.shape)
    return mask

x = torch.randn((128, 64), device='cuda')
p = torch.tensor(0.50, device='cuda')
get_mask(x, p)
```

Results in

```
InductorError: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(TruncToInt(zuf0), 1) (unhinted: Eq(TruncToInt(zuf0), 1)).  (Size-like symbols: none)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161843
Approved by: https://github.com/ezyang
2025-09-02 05:16:13 +00:00
5561e45758 [HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161946
Approved by: https://github.com/msaroufim
2025-09-02 05:01:46 +00:00
8171d6052e Clear custom autograd Function ctx.to_save earlier (#161171)
Fixes https://github.com/pytorch/pytorch/issues/161186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161171
Approved by: https://github.com/albanD
2025-09-02 03:26:31 +00:00
d5e0f4202b Fixes broken memory_viz link in CUDA memory docs (#161426)
Fixes #161375

The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely, the same syntax worked earlier on the page, as the issue reporter mentioned, but in this spot it was rendered as a broken link.

I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426
Approved by: https://github.com/soulitzer
2025-09-02 02:06:54 +00:00
13d66e2a66 [BE][Easy] restore #157584 after #158288 (#158541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158541
Approved by: https://github.com/ezyang
2025-09-02 02:06:50 +00:00
bbedc71fd3 test: ensure editable cached wrapper is respected (#160943)
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160943
Approved by: https://github.com/xmfan
2025-09-02 01:48:30 +00:00
e9481b6617 [dynamo] Prevent unnecessary recompile on disabled functions in the compiled frame (#161883)
Trying out a re-impl of https://github.com/pytorch/pytorch/pull/160934

The above PR led to OOM, most likely because the cache held on to a nested function (which would otherwise have been garbage collected), and that function holds on to CUDA tensors in its closure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161883
Approved by: https://github.com/jansel
2025-09-02 01:13:48 +00:00
1c1b28d5b6 Fix slice scatter dtype consistency (#160851)
Fixes #147842
Fix torch.slice_scatter type inconsistency issue. I noticed previous PRs on this have stalled, so I'm opening this new PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160851
Approved by: https://github.com/soulitzer
2025-09-02 01:08:26 +00:00
2a5c0785e2 [AOTI] split too long string to smaller pieces when its length larger than 16000, fix msvc c2026. (#161850)
Split overly long strings into smaller pieces when their length exceeds 16000, fixing MSVC C2026.

Reproducer:
```cmd
pytest test\inductor\test_aot_inductor.py -v -k test_runtime_checks_large_cpu
```

Error message:
<img width="1660" height="174" alt="image" src="https://github.com/user-attachments/assets/56fcd9be-24cb-484b-bfdc-f719ff2650b8" />

For MSVC c2026:
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2026?view=msvc-170

Splitting an overly long string into smaller pieces fixes this issue.
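
The splitting step itself can be sketched in a few lines. This is an illustration only, not the code added to AOTI: `split_into_chunks` is a hypothetical helper, and the 16000 limit is taken from the description above.

```python
def split_into_chunks(text: str, max_len: int = 16000) -> list[str]:
    """Split `text` into consecutive pieces of at most `max_len` characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

A code generator could then emit each piece as its own string literal and rely on the compiler to concatenate adjacent literals. A real generator would also need to avoid cutting an escape sequence in half, which this sketch does not handle.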

Validated locally:
<img width="1122" height="232" alt="image" src="https://github.com/user-attachments/assets/cac54cc9-be51-4a5d-b408-06755a4debd5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161850
Approved by: https://github.com/jansel
2025-09-02 00:09:01 +00:00
626cb7df81 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined because a backend is not enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-01 23:00:21 +00:00
b7034e9c92 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-01 23:00:21 +00:00
13b65196db Revert "Defer loading hipify until it is needed (#160824)"
This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d.

Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](403a3a393c) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))
2025-09-01 21:34:13 +00:00
403a3a393c Defer loading hipify until it is needed (#160824)
Saves roughly 0.6 seconds (1.497s down to 0.909s) when running a single test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-01 20:57:41 +00:00
cbfb005f7c Fix type checking for persistent loads in the weights-only unpickler (#161661)
The error message here implies that we can only call `self.persistent_load(...)` for ints or tuples, but due to the second part of the type check being inverted, weights-only unpickler will throw an exception iff `pid` is an int.
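
The effect of the inverted check can be reproduced in isolation. The two functions below are a reconstruction from the description above, with hypothetical names; they are not copied from the unpickler source.

```python
def check_pid_buggy(pid):
    # The second test is inverted: `not (type(pid) is not int)` is true
    # exactly when pid is an int, so only ints are rejected.
    if type(pid) is not tuple and not (type(pid) is not int):
        raise TypeError("persistent_load id must be tuple or int")


def check_pid_fixed(pid):
    # Reject anything that is neither a tuple nor an int.
    if type(pid) is not tuple and type(pid) is not int:
        raise TypeError("persistent_load id must be tuple or int")
```

Under these assumptions, the buggy version raises for `5` but silently accepts `"abc"`, while the fixed version accepts `5` and `(1, 2)` and rejects `"abc"`.
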
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161661
Approved by: https://github.com/Skylion007
2025-09-01 19:57:19 +00:00
d232a95d4a [BE] Consolidate inductor benchmark Docker images and rename jobs (#161536)
We have 4 different versions of the inductor benchmark Docker image used in CI at the moment:

1. `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks` is used by almost all inductor jobs including nightly benchmark
2. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.12
3. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.13
4. `pytorch-linux-jammy-py3-gcc11-inductor-benchmarks` runs inductor unit tests on CPU

My proposal here is to clean up (2) and (3) and to keep (1) under the same setup from https://ghcr.io/pytorch/torchbench.  Simplicity is the key here as inductor workflows are getting more and more complex:
1. Unit tests for Python variants like 3.12 and 3.13 were useful when they were first added to CI.  They are much less useful now.  [Flambeau](https://hud.pytorch.org/flambeau/s/3876ec7b-43f0-42c6-bfbf-899035e5bb77) shows a 0.97 correlation between them.  And we are also moving to 3.14 nowadays.  I want to choose 3.12 for (1), but will do this separately.  This is also what TorchBench and vLLM are using on CI.
1. We are gradually cleaning up 3.9 on CI https://github.com/pytorch/pytorch/issues/161167

Another BE change here is to rename the jobs in various inductor workflows, because I think names like `linux-jammy-cuda12_8-py3_10-gcc9-inductor-build` are too long and confusing to look at; it is better to use human-friendly names like `inductor-build`.  The other information is already spelled out in the build environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161536
Approved by: https://github.com/zou3519
2025-09-01 19:07:08 +00:00
17fa8eec4a Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)"
This reverts commit 4b4cdcfe3af10df624878985caac4e595fbab54c.

Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/atalman due to need to revert due to merge conflicts, please feel free to merge it back in once conflicts are resolved ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3242945661))
2025-09-01 17:08:27 +00:00
54e275e0d8 Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)"
This reverts commit c83cbd2f2a2de2e3258f07de77d8740743df6d2d.

Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))
2025-09-01 17:03:50 +00:00
63a9c23fe9 Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)"
This reverts commit 190c391a28845a14df26abb228d26aa813efb20c.

Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))
2025-09-01 16:27:03 +00:00
fefee08164 [CD] Add CUDA 13.0 Windows build (#161663)
Test CUDA 13.0 windows build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161663
Approved by: https://github.com/malfet, https://github.com/atalman
2025-09-01 15:27:17 +00:00
21fae99c18 Revert "[cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)"
This reverts commit 55c289d5c104c4959cc125c0fb4fb50c9fc71102.

Reverted https://github.com/pytorch/pytorch/pull/161305 on behalf of https://github.com/atalman due to Broke test_matmul_cuda.py::TestFP8MatmulCUDA::test_float8_error_messages_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17309011599/job/49140215634) [HUD commit link](1190b7f73e) ([comment](https://github.com/pytorch/pytorch/pull/161305#issuecomment-3242652672))
2025-09-01 14:56:47 +00:00
2ba65472dd [xla hash update] update the pinned xla hash (#161396)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161396
Approved by: https://github.com/pytorchbot
2025-09-01 11:43:03 +00:00
190c391a28 [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
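
The per-stream rule is a reachability test on the capturing graph, which can be sketched as follows. This is a toy model, not the allocator code: the graph is a plain dict of successor lists, `reaches` and `reusable_on_stream` are hypothetical helpers, and the sketch assumes that a marker which is itself a terminal node counts as ordered before it.

```python
def reaches(successors, src, dst):
    """Return True if `dst` is `src` or can be reached from it along edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(successors.get(node, ()))
    return False


def reusable_on_stream(successors, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal node of S."""
    return all(
        reaches(successors, marker, terminal)
        for marker in free_markers
        for terminal in terminals
    )
```

For example, with edges `m1 -> a`, `m2 -> a`, and `a -> t1`, both markers precede terminal `t1`, so the block would be reusable on that stream; if `m2` only leads to another stream's node, the check fails and the block stays deferred.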

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Instead, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-01 09:25:01 +00:00
20bfb2539d Skip compilation when FX graph has no calls and returns empty (#160536)
Fixes #160437

Summary:
This PR avoids compiling empty FX graphs generated during graph breaks. If there are no calls in the graph, we can just return the empty list of instructions.

More precisely, in `compile_and_call_fx_graph`, if the FX graph contains no calls (`count_calls(self.graph) == 0`) and the return value list is empty, we now return an empty instruction list immediately.

Impact:
module: dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160536
Approved by: https://github.com/Lucaskabela
2025-09-01 08:32:22 +00:00
dd2519abe8 ci: Update sphinx, disable google search by default (#161793)
Includes fixes from https://github.com/pytorch/pytorch_sphinx_theme/pull/207

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161793
Approved by: https://github.com/malfet, https://github.com/albanD
2025-09-01 07:43:39 +00:00
2f6b4b1ad3 [4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

`get_remote_tensor`: returns a symmetric tensor given a peer rank.

The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.

As a refactoring, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level, as their code is common to all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
2025-09-01 07:02:06 +00:00
6737e2c996 update supported OS for Intel client GPU (#161699)
update supported OS for Intel client GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161699
Approved by: https://github.com/chuanqi129, https://github.com/malfet
2025-09-01 05:45:09 +00:00
67c31dcd36 [vllm hash update] update the pinned vllm hash (#161867)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161867
Approved by: https://github.com/pytorchbot
2025-09-01 04:37:13 +00:00
cb1e31362c Remove background thread UT on XPU to fix CI (#161844)
# Motivation
Because we reverted `torch._C._set_allocator_settings` in https://github.com/pytorch/pytorch/pull/161626, this UT becomes invalid.
Fix https://github.com/pytorch/pytorch/issues/161697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161844
Approved by: https://github.com/gujinghui
2025-09-01 03:45:26 +00:00
9a665ca3c4 Add __init__.pyi to torch/linalg (#160750)
Fixes #149639

In an effort to improve the type checking coverage, added a stub file for the torch/linalg directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160750
Approved by: https://github.com/Skylion007
2025-08-31 22:39:05 +00:00
d9d6dde0f4 Leak Python filenames so that we can give good dispatcher errors. (#160418)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160418
Approved by: https://github.com/zou3519
2025-08-31 22:31:39 +00:00
68738beff7 PythonArgs::toBool: order cheap mutually exclusive checks first (#161455)
SymBools are not identical to `Py_True` or `Py_False`, so we can do those cheap checks first and at least get plain old bools to go fast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161455
Approved by: https://github.com/Skylion007
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329, #161432
2025-08-31 21:35:48 +00:00
25f4aaed9e [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-31 18:08:57 +00:00
61e18b5304 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hook the SymmetricMemory allocator into MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`. This also lets our ops have functional variants that create `out` tensors on symmetric memory.

To end users, this PR supports a python UI as follows:
```
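import torch
import torch.distributed._symmetric_memory as symm_mem

# `device`, `numel`, and `dtype` are assumed to be defined by the caller.
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)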

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-31 18:08:57 +00:00
e92cd94153 removed duplicate imports (#161685)
Fixes #161684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161685
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-31 16:21:49 +00:00
0d421ace32 fix spelling of word - when (#160185)
just found a typo while understanding the codebase while working on another PR

This fixes typo in word `when` in files

```
native/cpu/PaddingKernel.cpp
native/cpu/batch_norm_kernel.cpp
```

@eqy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160185
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-08-31 13:38:23 +00:00
91f0bcf43f [c10d][nvshmem] add nvshmem build rules and dependency for libtorch_cuda (#159562)
Summary:
Add guarded build option for nvshmem-related c10d code with `-c fbcode.caffe2_use_nvshmem`

The guarded clause includes the nvshmem device and host code (statically linked) plus these 2 files:
- `torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu`
- `torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159562
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-08-31 12:56:51 +00:00
75bc23cfc3 [CPU][Inductor] Improve performance of A16W8 GEMM template (#161148)
**Summary**
This PR improves the performance of A16W8 GEMM template by
- Removing the config with block_n=48 & block_m=16 as it is not very efficient.
- Using AMX microkernel when M >= 5 so that we use AMX instead of AVX512 for M=5~31.
- Converting int8 values to bf16 with intrinsics instead of `at::vec::convert`, as the latter does not have an optimized implementation for this case.

We saw up to >10% performance gain in various cases of running Llama-3.1-8b-instruct.

**Test plan**
Already covered by UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161148
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-31 09:56:29 +00:00
377033757a Use vectorized stores for all dtypes in cat (#161649)
resurrecting #151818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649
Approved by: https://github.com/Skylion007
2025-08-31 05:42:41 +00:00
f612045ce1 [vllm hash update] update the pinned vllm hash (#161835)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161835
Approved by: https://github.com/pytorchbot
2025-08-31 04:24:04 +00:00
ad7b748686 [AOTI] fix ut, add extension file type for Windows. (#161851)
fix ut, add extension file type for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161851
Approved by: https://github.com/ezyang
2025-08-31 01:13:29 +00:00
f3697b033e [MPS] add bunch of unary funcs for sparse tensors (#161846)
adds bunch of unary functions for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161846
Approved by: https://github.com/malfet
2025-08-30 21:13:05 +00:00
2d31c3d99d Pass shared_ptr by value (#161834)
The way AsyncAllreduceCUDADeviceWork is currently implemented,
using it will force a copy of `shared_ptr<gloo::Context>`
because `std::move` does nothing for a const ref.

This PR changes the param type to shared_ptr<> instead of the
const ref. This allows more efficient parameter passing.

Here's an example that demonstrates the issue:

```cpp
#include <memory>
#include <iostream>

struct Foo {};

void useFoo_ref(const std::shared_ptr<Foo>& f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

void useFoo_val(std::shared_ptr<Foo> f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

int main() {
    std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
    useFoo_ref(std::move(f1)); // prints "use_count: 2"

    std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
    useFoo_val(std::move(f2)); // prints "use_count: 1"
}
```

This also aligns well with [C++ Core Guidelines][1] for handling
smart pointers.

[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
2025-08-30 18:00:37 +00:00
fb2d5ea697 Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471)"
This reverts commit b291dc9684d00396239a0c7786b7aac71bf69c05.

Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3239283585))
2025-08-30 14:00:29 +00:00
2e1345a0f8 Revert "[3/N][SymmMem] Expose offset field from handle (#161532)"
This reverts commit ff9533970ad76ed1905b90df6515aca50354c193.

Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3239282308))
2025-08-30 13:57:50 +00:00
684ae48c16 Revert "[4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)"
This reverts commit 95516ad7e6d92ed131fb6057b29ec52e73190e3c.

Reverted https://github.com/pytorch/pytorch/pull/161533 on behalf of https://github.com/atalman due to Multiple internal failures on PR #[161471](https://github.com/pytorch/pytorch/pull/161471) will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161533#issuecomment-3239278635))
2025-08-30 13:51:22 +00:00
b93f87d67b [OpenReg] Integrate Event&Stream from OpenReg Backend into PyTorch (#160100)
We integrated the openreg backend’s `Stream` and `Event` into PyTorch, all of which are similar
to other accelerators like `CUDA`, `XPUs`, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160100
Approved by: https://github.com/albanD
ghstack dependencies: #161603, #160099, #161773
2025-08-30 13:21:28 +00:00
6284881b2a [OpenReg] Add tests of device and memory for OpenReg (#161773)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161773
Approved by: https://github.com/albanD
ghstack dependencies: #161603, #160099
2025-08-30 13:21:28 +00:00
aae9cbb6c0 [OpenReg] Add Event&Stream Support for OpenReg Backend (#160099)
Referring to the signatures and functions of `Stream` and `Event` in CUDA, we use CPU multithreading
and condition variables to implement equivalent capabilities as the underlying foundation of torch_openreg.

**Changes:**

- Add stream capabilities for OpenReg
- Add event capabilities for OpenReg
- Add kernel launch entrypoint for OpenReg
- Add testcases about stream and event for OpenReg
- Add example for OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160099
Approved by: https://github.com/albanD
ghstack dependencies: #161603
2025-08-30 13:21:21 +00:00
dad2e50ac5 [OpenReg] Rename cpu_fallback_blacklist to cpu_fallback_blocklist (#161603)
As the title stated.

Related Infos: https://github.com/pytorch/pytorch/pull/158644#discussion_r2301460839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161603
Approved by: https://github.com/albanD
2025-08-30 13:21:15 +00:00
37da7b777b Fix _scaled_grouped_mm not reported as unsupported on SM100. (#161780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161780
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy
2025-08-30 12:33:51 +00:00
c83cbd2f2a [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)
Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142
Approved by: https://github.com/jansel
2025-08-30 11:09:07 +00:00
b994f6e3b3 [inductor] check block options after broadcasting and singleton dims have been removed (#161602)
This allows more cases to use tensor descriptors. For example, before this change the following block params would not match
because the innermost dimension does not have stride 1:
```python
block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0])
```
After broadcasting dimensions and singleton dimensions are removed:
```python
block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)])
```
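
The before/after block params above can be reproduced with a small filter. This is only a sketch of the idea, not Inductor's implementation: `drop_trivial_dims` is a hypothetical helper, and the symbolic expressions are kept as plain strings rather than sympy objects.

```python
def drop_trivial_dims(shape, block_shape, strides, offsets):
    """Keep only dims that are neither broadcast (stride 0) nor singleton (size 1)."""
    keep = [
        i
        for i, (size, stride) in enumerate(zip(shape, strides))
        if size != 1 and stride != 0
    ]

    def pick(seq):
        return [seq[i] for i in keep]

    return pick(shape), pick(block_shape), pick(strides), pick(offsets)


# Values copied from the "before" BlockParameters; expressions are strings here.
shape, block_shape, strides, offsets = drop_trivial_dims(
    shape=[64, 4, 1, 1],
    block_shape=["((XBLOCK + 3)//4)", "Min(4, XBLOCK)", 1, 1],
    strides=[0, 1, 0, 0],
    offsets=["(xoffset//4)", "ModularIndexing(xoffset, 1, 4)", 0, 0],
)
print(shape, block_shape, strides, offsets)
```

Only the dimension of size 4 with stride 1 survives, so the innermost remaining dimension has stride 1 and the block-options check can pass.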

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602
Approved by: https://github.com/jansel
2025-08-30 08:10:51 +00:00
f44ad54bc6 Update torch-xpu-ops commit pin (#161152)
Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](8b58040ee3), includes:

- Add vectorization path on maxpool forward channel last
- Add FlightRecorder support for ProcessGroupXCCL
- Fix random build failure on codegen
- Suppress dllexport warning on Windows
- Make torch-xpu-ops build depend on ATen XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-30 07:19:24 +00:00
4d3ab2669b Stop trying to intern arguments in PyObject_FastGetAttrString (#161432)
If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see b222eb66c7 (diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163))

Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161432
Approved by: https://github.com/ezyang
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329
2025-08-30 06:55:43 +00:00
0ee8a4e281 Fix accidental copy in pushPyOutToStack (#161329)
`auto` forces a copy. Confirmed this did something noticeable with perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161329
Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328
2025-08-30 06:55:43 +00:00
eb9526ae35 Avoid double hash lookup in torch._library.simple_registry (#161328)
Not a huge cost, but free win is free.
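
For illustration, a double lookup and one way to avoid it might look like the following. This is a generic sketch: the `Registry` class and its method names are hypothetical and are not the actual `torch._library.simple_registry` code.

```python
class Registry:
    """Toy registry that lazily creates one entry per qualified name."""

    def __init__(self):
        self._data = {}

    def find_double_lookup(self, qualname):
        # Hashes `qualname` twice on a hit: once for `in`, once for `[]`.
        if qualname not in self._data:
            self._data[qualname] = object()
        return self._data[qualname]

    def find_single_lookup(self, qualname):
        # Hashes `qualname` once on a hit; the miss path is rare.
        entry = self._data.get(qualname)
        if entry is None:
            entry = self._data[qualname] = object()
        return entry


registry = Registry()
first = registry.find_single_lookup("mylib::my_op")
second = registry.find_single_lookup("mylib::my_op")
print(first is second)
```

Both methods return the same entry for the same name; they differ only in how many dictionary probes the common hit path performs.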
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161328
Approved by: https://github.com/Skylion007
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317
2025-08-30 06:55:43 +00:00
302d860157 Improve assert perf in _python_dispatch._correct_storage_aliasing (#161317)
This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly.
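
The general pattern, sketched with hypothetical stand-ins (the classes and `is_wrapper_slow` below are not the real `is_traceable_wrapper_subclass` check), is to put a cheap test on the left of a short-circuiting `or`:

```python
calls = {"slow": 0}


class PlainTensor:
    pass


class WrapperTensor(PlainTensor):
    is_wrapper = True


def is_wrapper_slow(obj):
    # Stand-in for an expensive predicate; counts how often it runs.
    calls["slow"] += 1
    return getattr(type(obj), "is_wrapper", False)


def check_precondition(obj):
    # The exact-type test is cheap and usually true, so the slow
    # predicate only runs for the uncommon case.
    assert type(obj) is PlainTensor or is_wrapper_slow(obj)


check_precondition(PlainTensor())
print(calls["slow"])
```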
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161317
Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308, #161315
2025-08-30 06:55:42 +00:00
0c459f2921 Fix pybind enum efficiency issue in return_and_correct_aliasing (#161315)
Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation.

This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161315
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308
2025-08-30 06:55:42 +00:00
b96bcb9fdb Optimize _python_dispatch.return_and_correct_aliasing.get_write_alias (#161308)
- Empty containers are Falsey
- Hoist cheap checks first
- Microbenchmarked single-element set access method

Benchmark code:
```
import timeit

to_test = [
    ('list(x)', 'x = set([3])'),
    ('x[0]', 'x = [3]'),
    ('list(x)[0]', 'x = set([3])'),
    ('next(iter(x))', 'x = set([3])'),
]

for (stmt, setup) in to_test:
    res = timeit.timeit(stmt=stmt, setup=setup)
    print(f"Time for `{stmt}`: {res}")
```

Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux)
```
Time for `list(x)`: 0.03418
Time for `x[0]`: 0.00852
Time for `list(x)[0]`: 0.03561
Time for `next(iter(x))`: 0.02278
```

FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304
2025-08-30 06:55:42 +00:00
2089ed3d5e Use is, not ==, to check exact type matches in _python_dispatch (#161304)
`is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check.
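
A small standalone illustration of the difference (the classes are made up for this example): `==` on types can be redirected through a metaclass `__eq__`, while `is` always compares identity.

```python
class Base:
    pass


class Sub(Base):
    pass


class AlwaysEqualMeta(type):
    # A metaclass may override ==, so == on types is not guaranteed
    # to be an identity check.
    def __eq__(cls, other):
        return True

    __hash__ = type.__hash__


class Weird(metaclass=AlwaysEqualMeta):
    pass


x = Sub()
exact_match = type(x) is Sub          # exact type only
base_match = type(x) is Base          # subclasses do not match
subclass_ok = isinstance(x, Base)     # isinstance accepts subclasses
eq_fooled = Weird == int              # goes through AlwaysEqualMeta.__eq__
is_not_fooled = Weird is int          # identity cannot be overridden
print(exact_match, base_match, subclass_ok, eq_fooled, is_not_fooled)
```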
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161304
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292
2025-08-30 06:55:42 +00:00
1a64bf2636 Stop accessing func._schema in _python_dispatch.correct_storage_aliasing (#161292)
func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161292
Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301
2025-08-30 06:55:42 +00:00
5d35b49ba7 Fix forced copying def_property_readonly for FunctionSchema & friends (#161301)
This took me a bit to figure out and I'm pretty sure I've looked at
this code before. Pybind uses
`return_value_policy::reference_internal` for `def_property`, which
[causes the owning object to be kept alive for the lifespan of the
return
value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html),
allowing the getter to safely avoid copying the property
value. However, lambdas act like they return `auto`, not
`decltype(auto)`, so our lambdas themselves were forcing copies!

Testing: observed std::vector<Argument> copying disappear in Linux
perf profile of someOpInfo._schema.arguments/returns (in
_python_dispatch.correct_storage_aliasing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161301
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
2025-08-30 06:55:42 +00:00
db622842bc [Inductor][CPP] Optimize config selecting for micro gemm when number of mxn blocks can not occupy all the threads (#161144)
If the number of mxn blocks cannot occupy all the threads, using a smaller register block size gives better performance since the computing size per thread is smaller.
It may yield a ~20% performance improvement for the real case `m1_n512_k4096`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161144
Approved by: https://github.com/leslie-fang-intel
2025-08-30 05:53:49 +00:00
77d8e98e1b [Inductor] update exp codegen for better precision (#161829)
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592

TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367

Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```

Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.

Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`, so `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
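
For reference, the constant `1.4426950408889634` in the workaround is log2(e), so `exp2(x * log2(e))` is mathematically `exp(x)`. A CPU-only, double-precision sketch of that identity is shown below; it does not reproduce the float32 GPU precision differences measured above.

```python
import math

LOG2_E = 1.4426950408889634  # log2(e), the constant used in the exp2 workaround


def exp_via_exp2(x):
    # exp(x) == 2 ** (x * log2(e))
    return 2.0 ** (x * LOG2_E)


for x in (-3.0, 0.0, 0.5, 10.0):
    print(x, math.exp(x), exp_via_exp2(x))
```

The extra multiply before `exp2` introduces one more rounding step, which is one plausible reason the two forms can differ slightly in low precision.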

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
2025-08-30 04:56:51 +00:00
2fed4fb464 [FlexAttn] Fix Paged Attention Accuracy via Upper Mask Mod and Prevent Invalid Memory Access (#160861)
Fixes #159247
Issue 1: Accuracy Problem with Non-Divisible KV Sequences
---------------------------------------------------------

### Background

Paged attention in flex decoding produced inaccurate results when KV sequence length is not divisible by block size. For example, when `KV_S = 64` and `block_size = 128`, the output didn't match standard attention accuracy.

### Root Cause
The current paged attention does not apply an upper mask mod when converting from the logical to the physical mask mod. Instead, it uses a noop_mask by default, which leaves all values unmasked and leads to an accuracy mismatch. Adding an upper mask mod according to the original, actual kv_len (64 in this test case) resolves the issue.

### Solution

*   **Applied proper upper bound masking**: Updated all calls to `convert_logical_block_mask` to pass `kv_len` as a tensor with proper shape `[B, KV_S]` to provide information of actual batched KV sequence length. The function now correctly applies upper bound checks using the actual KV sequence lengths for each batch
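
The effect of the upper bound can be sketched in plain Python. This is an illustration under simplifying assumptions, not the code in `_paged_attention.py`: `with_upper_bound` is a hypothetical helper, `kv_len` is a per-batch list rather than a tensor, and the logical-to-physical page mapping is not modeled.

```python
def noop_mask(b, h, q_idx, kv_idx):
    # Leaves every position unmasked.
    return True


def with_upper_bound(mask_mod, kv_len):
    """Wrap a mask_mod so positions at or beyond kv_len[b] are masked out."""

    def bounded(b, h, q_idx, kv_idx):
        return kv_idx < kv_len[b] and mask_mod(b, h, q_idx, kv_idx)

    return bounded


block_size = 128
kv_len = [64]  # actual KV length for batch 0, as in the failing case

unbounded_visible = sum(noop_mask(0, 0, 0, kv) for kv in range(block_size))
bounded_mask = with_upper_bound(noop_mask, kv_len)
bounded_visible = sum(bounded_mask(0, 0, 0, kv) for kv in range(block_size))
print(unbounded_visible, bounded_visible)
```

Without the bound, all 128 slots of the block count as valid keys; with it, only the first 64 do.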

### Files Modified
*   `torch/nn/attention/experimental/_paged_attention.py`: Added `kv_len` parameter as a tensor to `get_mask_mod` and applied upper mask to the new mask mod.
*   `test/inductor/test_flex_attention.py`: Fixed all related `kv_len` parameter calls in the tests
*   `test/inductor/test_flex_decoding.py`: Fixed all related `kv_len` parameter calls in the tests

Issue 2: Invalid Memory Access (IMA) in Triton Kernels
------------------------------------------------------

### Background

The Triton kernel for flex attention was experiencing invalid memory access errors when running with compute sanitizers, particularly with short KV sequences and small batch sizes.

### Root Cause

*   Kernel launches CTAs (Cooperative Thread Arrays) proportional to GPU's multi-processor count (108 via `SPLIT_KV`)
*   With small workloads, many CTAs remain idle but still attempt to access `kv_indices` with invalid `indices_idx` values
*   This caused out-of-bounds memory access violations

### Solution

Implemented boundary checks with early exit:

1.  **Added `MAX_VALID_KV_IDX` parameter** in `torch/_inductor/kernel/flex/flex_decoding.py`

    *   Calculate maximum valid KV index based on actual `kv_indices` tensor size and pass it to Triton template
2.  **Added early exit logic** in `torch/_inductor/kernel/flex/templates/flex_decode.py.jinja`

    *   Boundary checks before accessing `kv_indices` in both normal and full blocks
    *   Idle CTAs with invalid `indices_idx` skip computation entirely

This prevents invalid memory access while reducing wasted computation on idle thread blocks.
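
The early-exit idea can be modeled on the CPU as follows. This is a toy sketch, not the Triton template: `run_cta` and the one-to-one mapping from CTA id to `indices_idx` are assumptions made for illustration, and a Python list would raise `IndexError` where a GPU kernel would silently read out of bounds.

```python
def run_cta(cta_id, kv_indices, max_valid_kv_idx):
    # Assumption for this sketch: each CTA handles exactly one index.
    indices_idx = cta_id
    if indices_idx >= max_valid_kv_idx:
        return None  # idle CTA: exit before touching kv_indices
    return kv_indices[indices_idx]


kv_indices = [7, 2, 5]              # small workload
max_valid_kv_idx = len(kv_indices)  # derived from the actual tensor size
num_ctas = 8                        # more CTAs launched than there is work

results = [run_cta(c, kv_indices, max_valid_kv_idx) for c in range(num_ctas)]
print(results)
```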

Testing & Validation
--------------------

### Accuracy Tests

*   Added comprehensive test cases covering KV sequences not divisible by block sizes
*   Verified output matches standard attention for various sequence length combinations

### Sanitizer Results

`========= COMPUTE-SANITIZER Starting standalone test_max_autotune... Running test_max_autotune on device: cuda max_autotune config: True test_max_autotune completed successfully! Test passed! ========= ERROR SUMMARY: 0 errors`

**Before**: More than 13720 invalid memory access errors with sanitizers
**After**: Clean execution with 0 errors

Both fixes work together to ensure paged attention produces accurate results while running safely without memory access violations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160861
Approved by: https://github.com/BoyuanFeng
2025-08-30 04:50:23 +00:00
76f81b56d3 [audio hash update] update the pinned audio hash (#161836)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161836
Approved by: https://github.com/pytorchbot
2025-08-30 04:23:04 +00:00
82d2d23e85 Add batch option for send/recv_object_list (#160342)
`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized.

This adds an option `use_batch` which will call the send/recv with `batch_isend_irecv` which will re-use the communicators already initialized for collectives in the group.

---

Batched P2P ops create (or reuse) a communicator keyed by device index.
Regular P2P ops create (or reuse) dedicated 2-rank communicators keyed by “rank1:rank2”.
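
A toy model of the two keying schemes is sketched below. It is not the c10d API: `CommunicatorCache`, `p2p_key`, and `device_key` are hypothetical names, and ordering the rank pair inside the key is an assumption.

```python
class CommunicatorCache:
    """Stand-in for a process group's communicator map."""

    def __init__(self):
        self.comms = {}
        self.created = 0

    def get_or_create(self, key):
        comm = self.comms.get(key)
        if comm is None:
            self.created += 1
            comm = self.comms[key] = f"comm#{self.created}"
        return comm


def p2p_key(rank_a, rank_b):
    # Assumption: the pair is ordered so both peers compute the same key.
    lo, hi = sorted((rank_a, rank_b))
    return f"{lo}:{hi}"


def device_key(device_index):
    return str(device_index)


cache = CommunicatorCache()
collective_comm = cache.get_or_create(device_key(0))  # set up by an earlier collective
batched_comm = cache.get_or_create(device_key(0))     # batched P2P reuses it
regular_comm = cache.get_or_create(p2p_key(3, 1))     # regular P2P adds a 2-rank comm
print(cache.created, sorted(cache.comms))
```

In this model the batched path never grows the cache once collectives have run, which is the reuse that `use_batch` relies on.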

See:

c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342
Approved by: https://github.com/wconstab
2025-08-30 03:29:09 +00:00
e015de1969 Revert "Use vectorized stores for all dtypes (#161649)"
This reverts commit f0a517e333d6204f560d8061a4f70523060c93bf.

Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to buggy ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3238895967))
2025-08-30 03:13:40 +00:00
0af56fc33e Cleanup stale submodule directories after checkout (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-30 01:30:44 +00:00
8627a19adf [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-30 01:09:00 +00:00
ebfee60101 [WIP] more aggressive persistent reduction (#161055)
Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335).

Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option of looped reductions by setting RBLOCK == rnumel, so that we can still fall back to looped reductions as needed.

As criteria:

- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input)
- we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved).
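
The relationship between the two reduction styles can be sketched on the CPU. This is a pure-Python illustration rather than generated Triton: `looped_sum` is a hypothetical helper, and "persistent" here simply means the block covers the whole reduction dimension so the loop body runs once.

```python
def looped_sum(row, rblock):
    """Reduce `row` in chunks of `rblock`; returns (total, number of loop iterations)."""
    acc = 0.0
    iterations = 0
    for start in range(0, len(row), rblock):
        acc += sum(row[start:start + rblock])
        iterations += 1
    return acc, iterations


row = [float(i) for i in range(32)]  # rnumel == 32

looped_total, looped_iters = looped_sum(row, rblock=8)
# Persistent case: RBLOCK == rnumel, so the whole row is handled in one pass.
persistent_total, persistent_iters = looped_sum(row, rblock=len(row))
print(looped_total, looped_iters, persistent_total, persistent_iters)
```

Because the persistent case is just one setting of the same loop, falling back to a smaller `rblock` stays possible when holding the whole row at once is too expensive.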

Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055
Approved by: https://github.com/jansel
2025-08-30 01:08:45 +00:00
6db872fa2c Revert "Cleanup stale submodule directories after checkout (#161748)"
This reverts commit 0e45023cf9cbe1cf18279c1b0d391ea9464e7731.

Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I still see the same failures, and could not understand, from the log whether those checks are running or not ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3238791895))
2025-08-30 01:04:11 +00:00
7c30a9d7fc [MPS] Add slow version of kthvalue (#161817)
The implementation borrows heavily from `topk`.
As this method is non-deterministic, the logic for the cpu-ops indices comparison was changed to just an equality statement, since the random numbers picked by default for the input tensor allow for quite a lot of overlaps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817
Approved by: https://github.com/dcci
2025-08-30 00:44:29 +00:00
c1e504ec2f [SymmMEM] Move AsyncTP tests to a seperate test class (#161820)
We move AsyncTP tests to a separate test suite because 1) Async TP ops are not the core symmetric memory APIs, they are more like applications, 2) MultiProcContinuousTest will skip all the following tests if a test fails (we should fix this too). We still want to get the test signals for the core
symmetric memory APIs when Async TP ops fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
2025-08-30 00:40:40 +00:00
4ad9fbc83a Unify TypeAlias definitions in optimizer.py (#161493)
Fixes #160834

This PR unifies TypeAlias definitions in [optimizer.py](https://github.com/pytorch/pytorch/blob/main/torch/optim/optimizer.py)

This ensures the following:

- Consistency and Standardization
- Enhanced IDE support
- Prevents runtime confusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161493
Approved by: https://github.com/Skylion007
2025-08-30 00:35:02 +00:00
0f81e7f640 [CI] Fix XPU ci test permission issue (#161389)
The permission issue is due to the new test runners; see https://github.com/pytorch/pytorch/actions/runs/17161094208/job/48694776064#step:2:124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161389
Approved by: https://github.com/atalman
2025-08-30 00:03:59 +00:00
3daf20f8e1 [MPS] fix empty input in posneg functions (#161824)
fix empty posneg function for mps:
```python
import torch

input_tensor = torch.empty(0, device="mps")
out_pos = torch.isposinf(input_tensor)
```

Gives:
```
RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/Irakli_Salia/Desktop/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":551, please report a bug to PyTorch. Placeholder tensor is empty!
```

on main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161824
Approved by: https://github.com/malfet
2025-08-29 23:12:04 +00:00
3e459491b5 Enable XPU path for FlexAttention (#143553)
[#RFC153024](https://github.com/pytorch/pytorch/issues/153024)

**Motivation**

1. Attention has been the critical performance bottleneck in current LLM models, and FlexAttention is a good choice to cover the broad variants in the transformer model family. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on XPU devices. Besides, it also provides a candidate for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on XPU devices.
2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flexattention kernel and a flexdecoding kernel to cover both compute-bound and memory-bound GEMM computation, and different shapes should also be supported to serve LLM inference, e.g. head_dim=64, 96, 128, 256.

**What does this PR do?**

 1. Enable the device type for the FlexAttention kernel and UTs to ensure all important UTs pass on XPU devices.
 2. For E2E model inference, ensure that LLM model inference with FlexAttention is functional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553
Approved by: https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Mao Yunfei <yunfei.mao@intel.com>
Co-authored-by: Xingyuan Li <xingyuan.li@intel.com>
Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: Xiao, Wang <wang.xiao@intel.com>
2025-08-29 23:10:58 +00:00
0e2c8af5a6 [CI/CD] Windows set git config --global core.ignorecase false (#161813)
Make sure git on Windows has core.ignorecase set to false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161813
Approved by: https://github.com/malfet
2025-08-29 23:04:43 +00:00
ea27464a79 [inductor][decompose k] disable on everything other than cuda (#161795)
# why

- untested so far

# what

- add an empty config heuristic for all devices for decompose k
- the cuda heuristic, because it is more specific, will still be picked
  up
- add notes explaining how to enable on other devices

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "decompose_k"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161795
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #161767
2025-08-29 22:41:27 +00:00
45eccf414f [inductor][heuristics registry] missing heuristic is not an error anymore, cross device heuristics (#161767)
# why

- not having a heuristic is an error but should not crash, just provide 0 configs
- some heuristics are cross device type
- cleaner to be explicit about being cross device type than having to
  enumerate every possible device type

# what

- on registration, supply device_type=None (explicitly) to say this
  heuristic is cross device
- test to guard the heuristics hierarchies

# testing

```
python3 -bb -m pytest test/inductor/test_template_heuristics_registry.py
```
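
The lookup order described above can be sketched as follows. This is a minimal illustration, not the Inductor registry itself: `register`, `get_configs`, `_REGISTRY`, and the `k_split` values are hypothetical names invented for the example. It assumes a device-specific heuristic is preferred, a cross-device one (registered with `device_type=None`) is the fallback, and a missing heuristic yields zero configs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry keyed by (template_name, device_type).
# device_type=None marks a heuristic as cross-device.
_REGISTRY: Dict[Tuple[str, Optional[str]], Callable[[], List[dict]]] = {}


def register(template_name: str, device_type: Optional[str]):
    """Register a config heuristic; pass device_type=None explicitly for cross-device."""
    def decorator(fn):
        _REGISTRY[(template_name, device_type)] = fn
        return fn
    return decorator


def get_configs(template_name: str, device_type: str) -> List[dict]:
    """Prefer the device-specific heuristic, then the cross-device one, else no configs."""
    for key in ((template_name, device_type), (template_name, None)):
        if key in _REGISTRY:
            return _REGISTRY[key]()
    return []  # a missing heuristic yields zero configs instead of raising


@register("decompose_k", None)
def _decompose_k_default() -> List[dict]:
    return []  # empty cross-device heuristic: disabled by default


@register("decompose_k", "cuda")
def _decompose_k_cuda() -> List[dict]:
    return [{"k_split": 2}, {"k_split": 4}]  # illustrative values only
```

With this layout, `get_configs("decompose_k", "cuda")` picks the more specific CUDA entry, while any other device type falls back to the empty cross-device entry.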

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161767
Approved by: https://github.com/PaulZhang12
2025-08-29 22:41:27 +00:00
037f3bd475 [CI] Migrate XPU build and test to python 3.10 (#161708)
Follow #161167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708
Approved by: https://github.com/malfet
2025-08-29 22:31:39 +00:00
6e548c1a87 Revert "[CI] Migrate XPU build and test to python 3.10 (#161708)"
This reverts commit 2a70d98abf8256d3d768eff028fca20198579824.

Reverted https://github.com/pytorch/pytorch/pull/161708 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing rocm jobs to fail. See: test/inductor/test_max_autotune.py::TestMaxAutotuneSubproc::test_max_autotune_addmm_search_space_EXHAUSTIVE_dynamic_True [GH job link](https://github.com/pytorch/pytorch/actions/runs/17303310877/job/49125664617) [HUD commit link](2a70d98abf) ([comment](https://github.com/pytorch/pytorch/pull/161708#issuecomment-3238359944))
2025-08-29 21:49:15 +00:00
eb78757708 [inductor] Lift fw_compiler and bw_compiler as toplevel functions. (#161762)
This is a no-op refactor to compiler_fx which lifts the logic of fw_compiler and bw_compiler to toplevel, so that they can be reused in a different stack (e.g. precompile).

Differential Revision: [D81292968](https://our.internmc.facebook.com/intern/diff/D81292968/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161762
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-08-29 21:46:55 +00:00
05eeb29976 [inductor][triton] support JITCallable._hash_lock (#161768)
Fixes #161618

Triton # 7974 introduces a threading.RLock() in JITCallable, which is not pickle-able. This PR adds this field to the list of un-pickleable fields that need to be handled specially.
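
For context, a lock object cannot be pickled, so any object that holds one needs special handling. The sketch below shows one common way to do that with `__getstate__`/`__setstate__`. The `Kernel` class is a hypothetical stand-in, not Triton's `JITCallable`, and this is not necessarily the mechanism the PR uses (the PR adds the field to an existing list of specially handled fields).

```python
import pickle
import threading


class Kernel:
    """Hypothetical stand-in for an object that holds a lock."""

    def __init__(self, name: str):
        self.name = name
        self._hash_lock = threading.RLock()  # lock objects cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_hash_lock", None)  # drop the un-pickleable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._hash_lock = threading.RLock()  # recreate a fresh lock on load


# Round trip succeeds because the lock is excluded from the pickled state.
restored = pickle.loads(pickle.dumps(Kernel("add_kernel")))

# Pickling a bare lock fails with TypeError.
try:
    pickle.dumps(threading.RLock())
    bare_lock_pickles = True
except TypeError:
    bare_lock_pickles = False
```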

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161768
Approved by: https://github.com/xuzhao9
2025-08-29 21:20:02 +00:00
18b4fdde8f Add MTIA to floor_divide op (#161575)
Summary: Missed file in op registration resulting in fallback during test

Reviewed By: andyanwang, srsuryadev

Differential Revision: D81085615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161575
Approved by: https://github.com/albanD, https://github.com/malfet
2025-08-29 20:39:29 +00:00
f6368e934e Revert "[MPS] sparse add unary funcs + add for sparse tensors (#160839)"
This reverts commit 93c5112f46a978a029644ae599979416ead5c917.

Reverted https://github.com/pytorch/pytorch/pull/160839 on behalf of https://github.com/atalman due to test_sparse_csr.py::TestSparseCompressedCPU::test_consistency_SparseCSR_asinh_cpu_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17329155095/job/49201551217) [HUD commit link](93c5112f46) ([comment](https://github.com/pytorch/pytorch/pull/160839#issuecomment-3238093296))
2025-08-29 19:55:39 +00:00
bf6aaba0f7 [while_loop] avoid aliasing when body_fn never executes (#160670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160670
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160669
2025-08-29 19:36:37 +00:00
456493f7ed [while_loop][inductor] remove offset check for while_loop (#160669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160669
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-08-29 19:36:37 +00:00
c74e301455 Bump TorchBench version (#161461)
To include the latest fixes from TorchBench. I'll set up a nightly commit hash update for this next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161461
Approved by: https://github.com/malfet
2025-08-29 19:21:07 +00:00
67457dbb9d Fix non-const reference arguments in torch/csrc/jit/python/init.cpp (#161300)
Shouldn't be any generated code impact, just fixing bad practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161300
Approved by: https://github.com/wconstab, https://github.com/malfet
ghstack dependencies: #161286
2025-08-29 19:01:32 +00:00
e9bbd28f22 make einsum produce contiguous inputs in more cases (#161755)
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications: we flatten all right-only and left-only dimensions, so if right and left operand dimensions are interleaved in the output, the current algorithm cannot produce contiguous output. For common cases like the one in the linked issue, however, it works. Let's see what CI says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-29 18:50:46 +00:00
348d781055 [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-29 18:31:22 +00:00
303f514d5b [CI] Add basic CUDA 13.0 periodic test (#161013)
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161013
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-08-29 17:56:33 +00:00
f532f99822 [AOTI] normalize_path_separator zip file path (#161781)
normalize_path_separator zip file path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161781
Approved by: https://github.com/angelayi
2025-08-29 17:53:41 +00:00
93c5112f46 [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and `add`. Enables the unary-function tests in test_sparse but does not enable other tests yet; more ops are needed before we fully migrate to testing SparseMPS with `test_sparse.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-29 16:28:58 +00:00
0f6a08a029 [inductor] Fix SubgraphInfo round trip (#161779)
Currently `numels` is not specific to a created subgraph since it is not retrieved by `dataclasses.fields(SubgraphInfo)` due to it not being type annotated, see [ref](https://docs.python.org/3/library/dataclasses.html#module-dataclasses:~:text=The%20%40dataclass%20decorator%20examines%20the%20class%20to%20find%20fields.%20A%20field%20is%20defined%20as%20a%20class%20variable%20that%20has%20a%20type%20annotation.%20With%20two%20exceptions%20described%20below%2C%20nothing%20in%20%40dataclass%20examines%20the%20type%20specified%20in%20the%20variable%20annotation.).

So for example the following would happen:

```
self.numels = {"x": sympy.Integer(5)}
subgraph_name = "<x>"
with self.create_subgraph_body(subgraph_name):
    self.numels = {"x": sympy.Integer(7)}
# this would print that x has size 7, not the original value of 5
print(self.numels)
# numels would be None because dataclasses.fields(SubgraphInfo) does not include numels
# since it is not type annotated
print(self.subgraph_bodies[subgraph_name])
```
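
The underlying `dataclasses` behavior can be reproduced in isolation. The sketch below uses a hypothetical `Info` class (not the real `SubgraphInfo`) to show that a class attribute without a type annotation is not reported by `dataclasses.fields`.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class Info:
    body: Optional[str] = None  # annotated, so it is a dataclass field
    numels = None               # no annotation, so it is a plain class attribute


# Only annotated attributes are returned by dataclasses.fields().
field_names = [f.name for f in dataclasses.fields(Info)]
print(field_names)
```

Any save/restore logic that iterates over `dataclasses.fields(...)` therefore skips the unannotated attribute.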

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161779
Approved by: https://github.com/eellison
2025-08-29 16:27:29 +00:00
c8fa907e74 Check commit order (#161560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161560
Approved by: https://github.com/malfet
ghstack dependencies: #161558, #161637
2025-08-29 16:22:58 +00:00
b99a112688 Update optional tag for interpolation in torch.quantile() (#161706)
Fixes #146156

Re-fix the issue, including the extra fix that was needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161706
Approved by: https://github.com/soulitzer
2025-08-29 16:21:14 +00:00
cd6d63f453 [SymmMEM] Fix test_empty_strided_p2p_persistent (#161677)
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
2025-08-29 16:11:58 +00:00
0e45023cf9 Cleanup stale submodule directories after checkout (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-29 14:07:06 +00:00
823a329984 Revert "Cleanup stale submodule directories in checkout action (#161748)"
This reverts commit f3c5a82139539c63e6f08966e268c4160e138320.

Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I put the check in the wrong place ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3237080419))
2025-08-29 13:40:21 +00:00
f0a65cd6d6 Add pg argument to consolidate_safetensors_files_on_every_rank (#161421)
Summary: Based on feedback on https://github.com/pytorch/torchtitan/pull/1625, adding a pg argument to consolidate_safetensors_files_on_every_rank so that we don't infer the pg and users can supply one if needed.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D80954339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161421
Approved by: https://github.com/fegin
2025-08-29 13:31:11 +00:00
627decb0ed [DTensor] fix DTensorTestCase.destroy_pg() when device_type is "cpu" but CUDA device is available (#161015)
**Summary**
When `device_id` is not None, barrier() will choose the accelerator with the highest
priority, which means that if the test specifies CPU for testing while CUDA is
available on the host, barrier() will use CUDA. To avoid this and better respect
`self.device_type`, we add a branch that forces barrier() to use CPU when
`self.device_type` is CPU and another accelerator is also available.

**Test**
`pytest test/distributed/tensor/test_dtensor_testbase.py`

**Debugging Output**
```
# from init_process_group()
init pg: backend=gloo, device_id = None
default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')]

# from barrier()
barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015
Approved by: https://github.com/tianyu-l
2025-08-29 12:47:11 +00:00
448a7e7e31 Fix SequentialLR deprecate warning about invoke step(epoch) (#149392)
Fixes #116776 #76113 #113222 #67958
## Changes

- Refactor the `LRScheduler.step` method, keeping the `epoch` check logic in the public `step` method
- Move the `lr` update logic to a new `_update_lr` method
- Make `SequentialLR` use `_update_lr` to avoid the unnecessary warning message
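
The split between a public `step` and an internal `_update_lr` can be sketched with toy classes. This is not the `torch.optim.lr_scheduler` code: `ToyScheduler`, `ToySequential`, the decay formula, and the warning text are all assumptions made for illustration.

```python
import warnings


class ToyScheduler:
    """Toy exponential-decay scheduler (hypothetical, not torch's LRScheduler)."""

    def __init__(self, base_lr: float, gamma: float):
        self.base_lr = base_lr
        self.gamma = gamma
        self.last_epoch = 0
        self.lr = base_lr

    def step(self, epoch=None):
        # Public entry point: the deprecation check lives only here.
        if epoch is not None:
            warnings.warn("passing epoch to step() is deprecated", UserWarning)
        self._update_lr(epoch)

    def _update_lr(self, epoch=None):
        # Internal update: no warning, so other schedulers can call it safely.
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
        self.lr = self.base_lr * self.gamma ** self.last_epoch


class ToySequential:
    """Switches from `first` to `second` at `milestone` via _update_lr."""

    def __init__(self, first: ToyScheduler, second: ToyScheduler, milestone: int):
        self.first, self.second, self.milestone = first, second, milestone
        self.last_epoch = 0

    def step(self):
        self.last_epoch += 1
        if self.last_epoch < self.milestone:
            self.first._update_lr()
        elif self.last_epoch == self.milestone:
            self.second._update_lr(0)  # explicit epoch, but no warning is raised
        else:
            self.second._update_lr()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    seq = ToySequential(ToyScheduler(1.0, 0.5), ToyScheduler(1.0, 0.1), milestone=2)
    for _ in range(3):
        seq.step()
    n_internal = len(caught)              # warnings raised through _update_lr
    ToyScheduler(1.0, 0.5).step(epoch=3)  # direct public call with an epoch
    n_public = len(caught) - n_internal   # warnings raised through step(epoch)
```

Only the direct `step(epoch=3)` call goes through the deprecation check; the composite scheduler resets its child at the milestone without triggering it.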

## Test Result

```bash
pytest test/optim/test_lrscheduler.py -vv
```

![image](https://github.com/user-attachments/assets/e1c5527e-193e-4328-bf95-023139ea0416)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149392
Approved by: https://github.com/janeyx99
2025-08-29 11:45:11 +00:00
ed370ae4b0 [unflatten] Fix test by supporting both MappingKey and GetAttrKey (#161599)
Summary: As title

Test Plan:
Run internal tests

Rollback Plan:

Differential Revision: D81115712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161599
Approved by: https://github.com/tugsbayasgalan
2025-08-29 10:08:38 +00:00
5859edf113 [BE][inductor] replace "and" -> "logical_and" in bucketize_binary_search (#160941)
Get rid of these warnings:
```
/home/dberard/local/pytorch-env7/pytorch/torch/_inductor/runtime/triton_helpers.py:317: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors;
 please use '&' or '|' instead
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160941
Approved by: https://github.com/malfet, https://github.com/jingsh
2025-08-29 09:27:13 +00:00
5b701a6bb2 [AOTI][Intel GPU] Add XPU quantization ops to AOT Inductor. (#156572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156572
Approved by: https://github.com/EikanWang, https://github.com/angelayi
ghstack dependencies: #157430
2025-08-29 09:19:44 +00:00
48679ef966 [Refactor][XPU] Refactor XPU quantization op and add header files. (#157430)
This PR refactors the XPU quantization ops to align their code structure with the CPU implementation for consistency. It also adds necessary header files to enable future integration with AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157430
Approved by: https://github.com/angelayi
2025-08-29 09:19:44 +00:00
0ca3a6085d use host+device_id to make sure devices are unique in rendezvous request (#161756)
Per title. This is needed for NVL72 systems, where devices with the same indices on multiple hosts are within the same NVLink domain.
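
A small sketch of why the device index alone is not enough: when several hosts share one domain, indices repeat across hosts, while a combined host and index key stays unique. The `rendezvous_keys` helper, the key format, and the host names are hypothetical; the real rendezvous request format is not shown here.

```python
from typing import List, Tuple


def rendezvous_keys(devices: List[Tuple[str, int]]) -> List[str]:
    """Build one key per device from its (hostname, device index) pair."""
    return [f"{host}:{index}" for host, index in devices]


# Two hypothetical hosts, each exposing device indices 0 and 1.
devices = [("host-a", 0), ("host-a", 1), ("host-b", 0), ("host-b", 1)]

indices_only = [index for _, index in devices]
keys = rendezvous_keys(devices)

print(len(set(indices_only)))  # distinct values when only the index is used
print(len(set(keys)))          # distinct values when host and index are combined
```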

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161756
Approved by: https://github.com/kwen2501
2025-08-29 09:09:45 +00:00
a55d2beb50 [export] Support complex constant in serde (#161517)
Summary:

Fixes #160749

For a model like
```
class M(torch.nn.Module):
    def forward(self, x):
        s = torch.sin(x)
        z = 1j * s
        return z
```
Its graph will be
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%x,), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%sin, 1j), kwargs = {})
    return (mul,)
```

`1j` will appear as a constant complex argument in the `aten.mul`
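
As a rough illustration of what serializing such a constant involves: Python's `json` module cannot encode `complex` values directly, so a serializer has to split them into real and imaginary parts. The payload layout and the `encode_constant`/`decode_constant` helpers below are hypothetical and are not the actual export schema.

```python
import json


def encode_constant(value):
    """Encode a Python scalar as a JSON-friendly dict (hypothetical layout)."""
    if isinstance(value, complex):
        return {"type": "complex", "real": value.real, "imag": value.imag}
    return {"type": "scalar", "value": value}


def decode_constant(payload):
    """Inverse of encode_constant."""
    if payload["type"] == "complex":
        return complex(payload["real"], payload["imag"])
    return payload["value"]


# Round trip the constant from the graph above through a JSON string.
text = json.dumps(encode_constant(1j))
restored = decode_constant(json.loads(text))

# Without the explicit encoding, json refuses complex values.
try:
    json.dumps(1j)
    plain_json_works = True
except TypeError:
    plain_json_works = False
```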

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_complex_constant

Rollback Plan:

Differential Revision: D80672323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161517
Approved by: https://github.com/angelayi
2025-08-29 08:13:21 +00:00
d8a0bdb0d3 [BE][SymmMEM] Change Optional to the shorthand expression for symmetric memory modules (#161676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161676
Approved by: https://github.com/Skylion007
2025-08-29 07:31:16 +00:00
a7c949089a [vllm hash update] update the pinned vllm hash (#161752)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161752
Approved by: https://github.com/pytorchbot
2025-08-29 04:54:31 +00:00
a6456bfa85 [audio hash update] update the pinned audio hash (#161753)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161753
Approved by: https://github.com/pytorchbot
2025-08-29 04:52:58 +00:00
f3c5a82139 Cleanup stale submodule directories in checkout action (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-29 03:21:31 +00:00
5c306c3ccb [fx] Add lru_cache to warning (#161721)
Summary: Added lru_cache to the warning message to avoid flooding logs
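
The general technique can be sketched as below: wrapping the warning call in `functools.lru_cache` means each distinct message is emitted once, and repeated calls hit the cache. The `warn_once` helper and the messages are hypothetical, not the actual fx code.

```python
import functools
import warnings


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Emit `message` as a warning; repeat calls with the same text are cached."""
    warnings.warn(message, UserWarning, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # disable Python's own de-duplication
    for _ in range(3):
        warn_once("node has no meta")    # body runs only on the first call
    warn_once("a different message")     # new cache key, so it is emitted
    emitted_count = len(caught)
```

One consequence of this design is that the cache is keyed on the arguments, so messages that embed varying details (for example, a node name) are still emitted once per distinct string.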

Test Plan:
CI

Rollback Plan:

Differential Revision: D81245618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161721
Approved by: https://github.com/pianpwk
2025-08-29 02:25:45 +00:00
c1cb1cb26e fix tests caused by has_triton (#161737)
Summary: `has_triton` will now only be called when we are serializing a Triton HOP. A few tests do unusual mocking that this function doesn't like, so this prevents it from being called there.

Test Plan:
att

Rollback Plan:

Differential Revision: D81261486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161737
Approved by: https://github.com/angelayi
2025-08-29 02:25:35 +00:00
5cb1d71e59 [Flex] Fix float16 default config 128 headdim (#161647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161647
Approved by: https://github.com/v0i0
2025-08-29 01:48:06 +00:00
d153af713e [ez] Improve formatting in error messages for dynamic shapes (#161573)
Show the repr of `dim` to make the message clearer. Example: before `but got batch instead`, after `but got "batch" instead`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161573
Approved by: https://github.com/angelayi
2025-08-28 23:52:58 +00:00
9b67d8e344 Revert "[RELAND] Close some sources of fake tensor leakage (#161589)"
This reverts commit 5790b009751e6ebba35d3e6d05e7c1b135553eee.

Reverted https://github.com/pytorch/pytorch/pull/161589 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/17305150611/job/49128381649) [HUD commit link](5790b00975) ([comment](https://github.com/pytorch/pytorch/pull/161589#issuecomment-3235224249))
2025-08-28 23:19:36 +00:00
47742081c9 Revert "kill allow_complex_guards_as_runtime_asserts (#160198)"
This reverts commit 69d91b94ba5366f4444d8cb8fd3dab4de4f04d3d.

Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/jeffdaily due to let's revert again instead of waiting for forward fix, see earlier comments ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3235165462))
2025-08-28 22:50:37 +00:00
fffa62fa12 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated.

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf.

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-28 22:43:59 +00:00
c0ed87c82d [Dynamo] Fix weakref.proxy error when torch.compile (#161508)
Fixes #159258

The error occurs when we attempt to create a weak reference from a weak reference proxy.
e9d42b3880/torch/_dynamo/guards.py (L2910-L2915)

In fact, we shouldn't create a weak reference from another weak reference or proxy; CPython itself checks for this.
f60f8225ed/Objects/weakrefobject.c (L410-L418)

However, `__weakrefoffset__` is not equal to **0** when the `guarded_object` is in `weakref.ProxyTypes`, so a weak reference is wrongly created for the `weakref.ProxyTypes` object. I think this could be a bug in CPython, but we can prevent it by adding more weakref type checks here (`weakref.ProxyTypes` contains `weakref.ProxyType` and `weakref.CallableProxyType`).
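The kind of type check described above can be sketched with the standard library alone. This is a minimal illustration, not the actual guard code in `torch/_dynamo/guards.py`; the helper name `is_weakref_like` and the `Payload` class are hypothetical.

```python
import weakref


def is_weakref_like(obj):
    """Return True if obj is already a weak reference or a weakref proxy.

    Hypothetical helper: a caller would skip weakref.ref(obj) when this is True.
    """
    return isinstance(obj, (weakref.ReferenceType, *weakref.ProxyTypes))


class Payload:
    """Plain class whose instances support weak references."""


def greet():
    return "hi"


target = Payload()
results = {
    "plain object": is_weakref_like(target),
    "weakref.ref": is_weakref_like(weakref.ref(target)),
    "weakref.proxy": is_weakref_like(weakref.proxy(target)),
    # Proxies of callables have type weakref.CallableProxyType.
    "callable proxy": is_weakref_like(weakref.proxy(greet)),
}
print(results)
```

Only the plain object should be reported as safe to wrap; the other three are already weak references or proxies.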

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161508
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/malfet
2025-08-28 22:34:18 +00:00
1069a08dac Enable more nightly tests on s390x (#160893)
Enable more nightly tests on s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160893
Approved by: https://github.com/malfet
2025-08-28 22:20:55 +00:00
1190b7f73e Support Triton kernels in SAC region (#161541)
SAC interaction with triton kernel:
- In eager, Triton ops are not dispatchable, so they are always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/xmfan
2025-08-28 21:15:25 +00:00
f46e4bcf43 Revert "Add ciflow/vllm to vLLM commit hash update PR(s) (#161678)"
This reverts commit 0e358050304c6a350dae2bce497bd1867ecc3c9f.

Reverted https://github.com/pytorch/pytorch/pull/161678 on behalf of https://github.com/yangw-dev due to we want to keep the vllm pin updated now; right now we have some failures ([comment](https://github.com/pytorch/pytorch/pull/161678#issuecomment-3234876332))
2025-08-28 20:42:19 +00:00
496052faf6 [inductor][decompose-k] make part of template heuristics (#161098)
# why

- enable it to go through the common template heuristics point
- make it easier to use in common extension points, e.g. the lookup table

# what

- break template heuristic into base + triton
- move k_split generation logic into a template heuristic for decompose k
- register through normal mechanism

- to make testing work, add a context manager to temporarily set
  template heuristics for a template/op to empty (effectively skipping
  it). This is used for decompose k test to disable triton choices
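The "temporarily set heuristics to empty" idea can be sketched as a generic context manager over a registry. Everything here is an assumption made for illustration: the registry `_HEURISTICS`, the keys, and the function names are hypothetical and are not Inductor's API.

```python
from contextlib import contextmanager

# Hypothetical registry: maps (template_name, op_name) to a config generator.
_HEURISTICS = {
    ("triton_mm", "mm"): lambda: ["cfg_a", "cfg_b"],
    ("decompose_k", "mm"): lambda: ["k_split=2", "k_split=4"],
}


def get_choices(template, op):
    """Return the configs for template/op, or [] if nothing is registered."""
    generator = _HEURISTICS.get((template, op))
    return list(generator()) if generator else []


@contextmanager
def override_with_empty(template, op):
    """Temporarily make `template` produce no choices for `op`."""
    key = (template, op)
    missing = object()
    saved = _HEURISTICS.get(key, missing)
    _HEURISTICS[key] = lambda: []
    try:
        yield
    finally:
        # Restore the previous state even if the body raised.
        if saved is missing:
            del _HEURISTICS[key]
        else:
            _HEURISTICS[key] = saved


with override_with_empty("triton_mm", "mm"):
    inside = (get_choices("triton_mm", "mm"), get_choices("decompose_k", "mm"))
after = get_choices("triton_mm", "mm")
print(inside, after)
```

Inside the `with` block only the decompose-k entries remain available, which mirrors how a test could isolate one template from the others.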

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670918](https://our.internmc.facebook.com/intern/diff/D80670918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161098
Approved by: https://github.com/jansel
ghstack dependencies: #161026, #161097
2025-08-28 20:14:48 +00:00
f641effe19 [inductor][ez] move template heuristics into dir (#161097)
# why

- simplify the expansion of heuristics beyond just triton (e.g.
  decomposeK)

# what

- move template heuristics and registry into its own folder
- adjust imports accordingly

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670917](https://our.internmc.facebook.com/intern/diff/D80670917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161097
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
ghstack dependencies: #161026
2025-08-28 20:14:48 +00:00
688acf0b83 [inductor][mm] restructure decompose k (#161026)
# why

- make it easier to integrate into lookup table later

# what

- current version generates templates on the fly and uses them
  to generate a single choice
- lookup table and performance model work best when there is a
  stable set of templates (with predictable names) and those
  are then parametrized
- this change makes it so that there is a single DecomposeK template
  with a stable name, and the k split is the only parametrization we do
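For readers unfamiliar with the technique, the arithmetic behind a k split can be sketched in plain Python: the K dimension is cut into `k_split` chunks, each chunk is multiplied independently, and the partial products are summed. This is only an illustration of the math under the assumption that `k_split` divides K evenly; it is not the Inductor template, and the function names are hypothetical.

```python
def matmul(a, b):
    """Reference matmul on lists of rows: (M x K) @ (K x N)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def decompose_k_matmul(a, b, k_split):
    """Split K into k_split equal chunks, multiply each chunk, then sum."""
    k = len(b)
    if k % k_split != 0:
        raise ValueError("k_split must divide K evenly in this sketch")
    chunk = k // k_split
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for s in range(k_split):
        lo, hi = s * chunk, (s + 1) * chunk
        a_part = [row[lo:hi] for row in a]  # M x chunk slice of A
        b_part = b[lo:hi]                   # chunk x N slice of B
        partial = matmul(a_part, b_part)
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
print(decompose_k_matmul(a, b, k_split=2))
```

Because `k_split` is the only value that changes between variants, a single function (or template) with a stable name can cover every split.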

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v
```

Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
2025-08-28 20:14:41 +00:00
f0a517e333 Use vectorized stores for all dtypes (#161649)
resurrecting #151818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649
Approved by: https://github.com/Skylion007
2025-08-28 20:06:29 +00:00
bacdd985a9 [PT2] Add fastResizeToZero to all static dispatch kernels (#161679)
Summary:
Add fastResizeToZero whenever we are reusing output tensors. Otherwise it keeps emitting the following warning:
```
Warning: An output with one or more elements was resized since it had shape [10], which does not match the required output shape [181]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
```

Test Plan:
Run local replayer.

```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=786096203
SNAPSHOT_ID=11

HARDWARE_TYPE=1 ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 3443 2>&1 | tee ~/logs/${MODEL_TYPE}/predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}

sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 1000 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_ads_mtml_offsite_cvr_oba_optout_dedicated_model_100 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} false 3443 false 2>&1 | tee ~/logs/${MODEL_TYPE}/replayer_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
```

Before: P1921177565

After: P1921178087

Rollback Plan:

Differential Revision: D81177596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161679
Approved by: https://github.com/henryoier
2025-08-28 19:58:40 +00:00
1621b5494c Removed redundant dtype conversion in scaled_dot_product_attention docstring example (#161613)
Applies the changes suggested in #161611.

Removed the line `attn_bias.to(query.dtype)` entirely.

Fixes #161611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161613
Approved by: https://github.com/mikaylagawarecki
2025-08-28 19:58:07 +00:00
69d91b94ba kill allow_complex_guards_as_runtime_asserts (#160198)
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79903317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
2025-08-28 19:36:19 +00:00
b76f6d117a [ROCm] fix numpy version detection and adjust fudge_factors for MI355 (#161429)
This PR fixes:

- Detection of NumPy >= 2.1 (instead of Python 3.13) when deciding to skip the following tests, since NumPy 2.1 can also be installed on older Python versions:
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:

```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```
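Gating on the library version rather than the interpreter version can be sketched as below. This is an assumption-laden illustration: the helper names are hypothetical and the actual check in the test suite may be written differently. The version string is passed in explicitly so the sketch runs without NumPy installed; in practice it would be `numpy.__version__`.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def numpy_at_least(version, minimum=(2, 1)):
    """Return True if `version` (e.g. numpy.__version__) is >= minimum."""
    return version_tuple(version) >= minimum


for candidate in ("1.26.4", "2.0.2", "2.1.0", "2.10.0rc1"):
    print(candidate, numpy_at_least(candidate))
```

Comparing integer tuples avoids the string-comparison pitfall where `"2.10"` sorts before `"2.9"`.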

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
2025-08-28 19:32:09 +00:00
130e50afff [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented the device_assert op and overrode has_side_effect to return True so that dead code elimination does not remove it.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
2025-08-28 18:57:34 +00:00
30ab87c884 [inductor] don't append None to choices (#161672)
Summary: don't append None as a choice to choices in autotune

Test Plan: See internal Diff

Differential Revision: D81188644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161672
Approved by: https://github.com/angelayi
2025-08-28 18:48:50 +00:00
049c08eda8 Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934)"
This reverts commit 8f31aa97a3e1e17bed29b6cedf9884f0c6b145e9.

Reverted https://github.com/pytorch/pytorch/pull/160934 on behalf of https://github.com/anijain2305 due to causes memory leak leading to OOMs ([comment](https://github.com/pytorch/pytorch/pull/160934#issuecomment-3234426359))
2025-08-28 17:56:36 +00:00
affd071858 [export] serialization support for triton_kernel_wrapper_functional (#161314)
Summary: att

Test Plan:
buck2 test mode/opt //caffe2/test:test_export -- test_triton_hop

Rollback Plan:

Differential Revision: D80827767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161314
Approved by: https://github.com/angelayi
2025-08-28 17:42:47 +00:00
dac062f23b Add aoti to mps benchmarks (#160741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160741
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-08-28 17:32:29 +00:00
2a70d98abf [CI] Migrate XPU build and test to python 3.10 (#161708)
Follow #161167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708
Approved by: https://github.com/malfet
2025-08-28 17:27:11 +00:00
eqy
55c289d5c1 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905 I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated, and this would cause `float8` tests to fail. Also, it appears that cuBLASLt accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so this removes that check as well.

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-28 17:04:25 +00:00
2042d2174a [MPS] Migrate round unary op to Metal (#161712)
And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like
```python
import torch
print(torch.arange(-3., 3., step=.5, device='mps').round())
print(torch.arange(-3., 3., step=.5, device='mps').cpu().round())
```

Before this change it printed
```
tensor([-3., -3., -2., -2., -1., -1.,  0.,  1.,  1.,  2.,  2.,  3.], device='mps:0')
tensor([-3., -2., -2., -2., -1., -0.,  0.,  0.,  1.,  2.,  2.,  2.])
```
After this change, the results match.
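The two rounding rules can be compared without a GPU. The sketch below uses only the standard library: Python's built-in `round` rounds halves to the nearest even integer (the `std::rint` behaviour under the default rounding mode), while the hypothetical helper `round_half_away` mimics `std::round`, which rounds halves away from zero.

```python
import math


def round_half_away(x):
    """Round halves away from zero, like C's std::round."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


# Same inputs as torch.arange(-3., 3., step=.5): -3.0, -2.5, ..., 2.5
values = [i * 0.5 for i in range(-6, 6)]
half_away = [round_half_away(v) for v in values]
half_even = [round(v) for v in values]  # built-in round is half-to-even

print(half_away)
print(half_even)
```

The first list reproduces the values printed for MPS before the change, and the second reproduces the CPU values (ignoring the sign of zero).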

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712
Approved by: https://github.com/dcci
2025-08-28 16:45:07 +00:00
4fd761fecc [DTensor] Wrap sharding prop error with contextual exception (#161574)
Mainly, this helps tell the user more info about the operator that
failed to run if it fails during sharding propagation.

Previously, only this exception would be raised:
```
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
```

Now you get both the above exception as well as

```
The above exception was the direct cause of the following exception:
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
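The "direct cause" wording in the output above is what Python prints for explicit exception chaining with `raise ... from ...`. A minimal sketch of that pattern follows; the function names and the shortened messages are hypothetical stand-ins, not the DTensor dispatch code.

```python
def propagate(op_name):
    """Stand-in for sharding propagation that fails with a low-level error."""
    raise RuntimeError("Attempted to flatten sharded dimension 1")


def dispatch(op_name):
    """Wrap the low-level failure with context about the operator."""
    try:
        return propagate(op_name)
    except Exception as exc:
        # `from exc` sets __cause__, so both tracebacks are shown.
        raise RuntimeError(f"Sharding propagation failed for {op_name}") from exc


try:
    dispatch("aten.view.default")
except RuntimeError as err:
    outer_message = str(err)
    inner_message = str(err.__cause__)
    print(outer_message)
    print("caused by:", inner_message)
```

The original exception stays reachable through `__cause__`, so no information is lost by adding the operator context.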

<stacktrace omitted>
<details><summary>detailed error</summary>

```
======================================================================
ERROR: test_linear (__main__.TestDTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper
    self._join_processes(fn)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 4 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch
    self.sharding_propagator.propagate(op_info)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate
    OutputSharding, self.propagate_op_sharding(op_info.schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
    return self.cache(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached
    op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy
    input_tgt_placements, output_placements = propagate_shape_and_sharding(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding
    in_dim = get_in_dim_to_shard(cmd)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard
    raise RuntimeError(
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test
    getattr(self, test_name)()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper
    fn()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper
    raise e
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/data/users/whc/pytorch/test.py", line 60, in test_linear
    print("results: ", distributed_linear(distributed_input))
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
  File "/data/users/whc/pytorch/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
    return fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch
    raise RuntimeError(
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-08-28 15:56:15 +00:00
a8270dd124 Revert "kill allow_complex_guards_as_runtime_asserts (#160198)"
This reverts commit 196232bb935cb346f143d5c39e9a73c44121a033.

Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/atalman due to dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTestsCUDA::test_compile_selective_checkpoint_triton_kernel_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17289619543/job/49074475338) [HUD commit link](196232bb93) ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3234013520))
2025-08-28 15:40:37 +00:00
63632fc7ee Add new_zeros dtype variant to the shim and as a stable op (#161597)
In case we want this before 2.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161597
Approved by: https://github.com/mikaylagawarecki
2025-08-28 13:57:24 +00:00
05d0f11dbd Revert "Add test coverage to tf32 in max autotune mm configs (#161545)"
This reverts commit e9d34b2438d65d6d16109e2416f3698de20f85c2.

Reverted https://github.com/pytorch/pytorch/pull/161545 on behalf of https://github.com/atalman due to inductor/test_max_autotune.py::TestMaxAutotuneRemoteCache::test_get_mm_configs_float32_precision_ieee [GH job link](https://github.com/pytorch/pytorch/actions/runs/17283985553/job/49058214260) [HUD commit link](e9d34b2438) ([comment](https://github.com/pytorch/pytorch/pull/161545#issuecomment-3233569771))
2025-08-28 13:46:47 +00:00
ef0483d74c Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767)"
This reverts commit b36a20d368733740a8507b3109d193c88930323a.

Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3233558168))
2025-08-28 13:44:41 +00:00
5432966253 Revert "Remove test since it ooms on CI (#161644)"
This reverts commit 443452ca2f5beef58019f4e7e7e31c0526aee0fc.

Reverted https://github.com/pytorch/pytorch/pull/161644 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/161644#issuecomment-3233550883))
2025-08-28 13:41:58 +00:00
e9975f501c Revert "Support Triton kernels in SAC region (#161541)"
This reverts commit 149c68071ca033d5e3427e63e05d9969bd4961e4.

Reverted https://github.com/pytorch/pytorch/pull/161541 on behalf of https://github.com/malfet due to Broke some tests in trunk workflow, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=trunk%20%2F%20linux-jammy-cuda12.8 ([comment](https://github.com/pytorch/pytorch/pull/161541#issuecomment-3233457206))
2025-08-28 13:14:53 +00:00
07f76517e7 [Inductor][WIndows] Fix Windows test case failure. (#161497)
Fixes windows test case failures:
- TritonCodeGenTests.test_inductor_sequence_nr
- TritonCodeGenTests.test_indirect_device_assert
- CompiledOptimizerTests.test_static_address_finalizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161497
Approved by: https://github.com/jansel
2025-08-28 12:40:42 +00:00
3519969e4f [Intel GPU] Enable tensor memory descriptor in triton template for XPU. (#161600)
As Intel Triton now supports tensor descriptor, this PR updates the pinned Intel Triton version and introduces support for Triton MM template with tensor descriptor on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161600
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-28 12:39:58 +00:00
5790b00975 [RELAND] Close some sources of fake tensor leakage (#161589)
Reland of https://github.com/pytorch/pytorch/pull/159923

Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.

2. The previous attribute mutation detection logic in non-strict mode didn't account for nested module structure. This fixes a silent-incorrectness issue when exporting esm and qwen in non-strict mode.

3. We modify yolov3 to fix the previous silent incorrect behaviour
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking

When upgrading the torchbench pin, opacus_cifar10 no longer seems to run in eager. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.

Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
2025-08-28 09:46:42 +00:00
2e77a08b95 [cuDNN][TF32] Account for TF32 in test_super_resolution_cuda (#161662)
cuDNN seems to be dispatching to TF32 kernels on B200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161662
Approved by: https://github.com/Skylion007
2025-08-28 08:42:34 +00:00
196232bb93 kill allow_complex_guards_as_runtime_asserts (#160198)
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79903317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
2025-08-28 07:59:29 +00:00
fa76256603 Revert "[dynamic shapes] use prims_common contiguity in create_example_tensors (#160933)"
This reverts commit 33c3794533844236a6e30ba377e0a6802b279fc8.

Reverted https://github.com/pytorch/pytorch/pull/160933 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160933#issuecomment-3232305708))
2025-08-28 07:39:26 +00:00
d2d4a3c539 Select Algorithm clear feedback savers (#161654)
Add `clear_feedback_savers` and tests for the feedback functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161654
Approved by: https://github.com/masnesral
2025-08-28 06:56:03 +00:00
95516ad7e6 [4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

`get_remote_tensor`: returns a symmetric tensor given a peer rank.

The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.

As a refactor, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level, as their code is common to all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
2025-08-28 06:47:35 +00:00
ff9533970a [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-28 06:39:12 +00:00
b291dc9684 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hooks the SymmetricMemory allocator into MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`, and so that our ops can have functional variants that create `out` tensors on symmetric memory.

To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-28 06:31:29 +00:00
0fd63fd88b Guard config copy for pickle errors (#161659)
Differential Revision: [D81168335](https://our.internmc.facebook.com/intern/diff/D81168335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161659
Approved by: https://github.com/zou3519
2025-08-28 06:27:48 +00:00
eec876deb6 [SymmMem] Isolate set_device tests to avoid hang (#161668)
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device will "contaminate" the next test function.

The solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which respawns processes every time.

Hang is gone now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
2025-08-28 05:43:49 +00:00
c83b43d7a8 [1/2]Add summary report for vllm build (#161565)
Demo Run
https://github.com/pytorch/pytorch/actions/runs/17259533323?pr=161565

<img width="1538" height="720" alt="image" src="https://github.com/user-attachments/assets/64f6d7b4-cac6-4c12-863c-b15514bb8810" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161565
Approved by: https://github.com/huydhn
2025-08-28 05:25:55 +00:00
d3d9eb4777 Error when TORCH_STABLE_ONLY is defined in TensorBase.h (#161658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161658
Approved by: https://github.com/albanD
2025-08-28 04:36:31 +00:00
a65db6dc4c [vllm hash update] update the pinned vllm hash (#161363)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161363
Approved by: https://github.com/pytorchbot
2025-08-28 04:14:19 +00:00
149c68071c Support Triton kernels in SAC region (#161541)
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, so they are always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519
ghstack dependencies: #160781
2025-08-28 03:54:46 +00:00
bae01479c3 [Inductor UT] Re-enable test_torchinductor_opinfo.py on XPU. (#161477)
The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet.

To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477
Approved by: https://github.com/jansel
2025-08-28 03:29:21 +00:00
cyy
8939d151d0 Use std::apply for CPU code (#152526)
The supported compilers are recent enough to enable std::apply in C++17.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152526
Approved by: https://github.com/ezyang
2025-08-28 02:47:54 +00:00
5edc3d814f Add option for TorchDispatchMode to ignore torch.compile internals (#161648)
If TorchDispatchMode.ignore_compile_internals() is True, then we turn
off the TorchDispatchMode during the compilation process, instead
turning it back on during runtime of the compiled artifact.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648
Approved by: https://github.com/bdhirsh
2025-08-28 02:41:33 +00:00
199c3633bf Fix Inductor Periodic (#161617)
Models are now passing accuracy. # of graph breaks is larger because
these were not actually tested in CI (if the model fails accuracy we
do not assert on # of graph breaks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161617
Approved by: https://github.com/anijain2305
2025-08-28 02:36:08 +00:00
e9d34b2438 Add test coverage to tf32 in max autotune mm configs (#161545)
Add a test to make sure that the configs are using the correct setting of tf32 to prevent regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161545
Approved by: https://github.com/coconutruben
2025-08-28 02:27:58 +00:00
be1612201d [export] Support AC HOP in pre-dispatch (#161479)
Adds the pre-dispatch handling for the AC HOP. This lets the HOP pre-dispatch export without actually pre-dispatch tracing into it. However, this is not sufficient to support AC in export:
- the HOP body will still be in torch IR, so it will fail export verifiers
- the exported module also can't be run in eager because the AC HOP relies on the partitioner to embed RNG state saving/restoring

So it must first be lowered by AOT Autograd into post-dispatch before being executed. It suffices for my purposes though.

If users had checkpoint API use in their exported model, the behavior goes from silently incorrect to a validation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161479
Approved by: https://github.com/ydwu4
ghstack dependencies: #161353
2025-08-28 01:46:25 +00:00
15670f9075 [dtensor] support local_map as a decorator (#161353)
And extract it out as a convenience function for dynamo to wrap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161353
Approved by: https://github.com/zpcore
2025-08-28 01:46:25 +00:00
0e35805030 Add ciflow/vllm to vLLM commit hash update PR(s) (#161678)
As it should be; otherwise, PR(s) like https://github.com/pytorch/pytorch/pull/161121 get merged without the signals they need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161678
Approved by: https://github.com/atalman
2025-08-28 01:35:04 +00:00
92c2daebb6 Add inductor provenance tracking artifacts to cache (#161440)
Summary:

- Add inductor provenance tracking artifacts to cache
- Update the tlparse version pin to `0.4.0`. The old tlparse version errors out on the new tlparse output. The lowest tlparse version that works is `0.3.42`.

tlparse error:
```
thread 'main' panicked at src/parsers.rs:671:71:
called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0)
stack backtrace:
   0:     0x55e4ff1c7f00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h6d42cc84fc840290
   1:     0x55e4ff1ee503 - core::fmt::write::h5af61a909e3ec64d
   2:     0x55e4ff1c4c33 - std::io::Write::write_fmt::h5a7b54aa6e4a315d
   3:     0x55e4ff1c7d52 - std::sys::backtrace::BacktraceLock::print::h555579e7396c26ac
   4:     0x55e4ff1c8caf - std::panicking::default_hook::{{closure}}::h9128866118196224
   5:     0x55e4ff1c8b1a - std::panicking::default_hook::h52e9e7314e0255f6
   6:     0x55e4ff1c9652 - std::panicking::rust_panic_with_hook::h541791bcc774ef34
   7:     0x55e4ff1c93fa - std::panicking::begin_panic_handler::{{closure}}::h6479a2f0137c7d19
   8:     0x55e4ff1c8419 - std::sys::backtrace::__rust_end_short_backtrace::ha04e7c0fc61ded91
   9:     0x55e4ff1c908d - rust_begin_unwind
  10:     0x55e4fef7a030 - core::panicking::panic_fmt::h5764ee7030b7a73d
  11:     0x55e4fef7a406 - core::result::unwrap_failed::h3ff7104a9ace307a
  12:     0x55e4fefb3c56 - <tlparse::parsers::ArtifactParser as tlparse::parsers::StructuredLogParser>::parse::h20bc51a17ffc494a
  13:     0x55e4fef9669a - tlparse::run_parser::h20c7729f151eec62
  14:     0x55e4fef99a1b - tlparse::parse_path::he4892147f47fbade
  15:     0x55e4fef7c760 - tlparse::main::hdc05613b32f4f53b
  16:     0x55e4fef89263 - std::sys::backtrace::__rust_begin_short_backtrace::h15f188f3edf42596
  17:     0x55e4fef8827d - std::rt::lang_start::{{closure}}::he2c21e32a442538e
  18:     0x55e4ff1be0f0 - std::rt::lang_start_internal::h15895544e2012228
  19:     0x55e4fef83975 - main
  20:     0x7f0b3662a610 - __libc_start_call_main
  21:     0x7f0b3662a6c0 - __libc_start_main_alias_2
  22:     0x55e4fef7a610 - <unknown>
  23:                0x0 - <unknown>
```

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  test_kernel_information_generation
python test/dynamo/test_structured_trace.py -k test_chromium_event
```

Differential Revision: D80976585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161440
Approved by: https://github.com/oulgen
2025-08-28 01:16:02 +00:00
768a1017c5 Allow parallel start NUMA binding (#161576)
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.

However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is misleading.

I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
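
The per-thread behavior can be checked with a small experiment. This is a sketch under stated assumptions: it needs Linux (where `os.sched_setaffinity` exists), `demo_thread_affinity` is a hypothetical helper name, and it is not the code from the linked gist.

```python
import os
import threading


def demo_thread_affinity():
    """Set affinity inside a worker thread and report what each thread sees."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # Not available on this platform (e.g. macOS, Windows).

    main_before = os.sched_getaffinity(0)
    target = {min(main_before)}  # Restrict the worker to one allowed CPU.
    seen = {}

    def worker():
        # pid 0 refers to the caller, which on Linux is the calling thread.
        os.sched_setaffinity(0, target)
        seen["worker"] = os.sched_getaffinity(0)

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    main_after = os.sched_getaffinity(0)
    return main_before, seen["worker"], main_after


result = demo_thread_affinity()
if result is not None:
    before, worker_mask, after = result
    print("main before:", sorted(before))
    print("worker     :", sorted(worker_mask))
    print("main after :", sorted(after))
```

If the setting were per process, the main thread's mask would shrink to the worker's single CPU. On Linux the main thread's mask is expected to stay unchanged.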

# This PR
Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
2025-08-28 01:15:58 +00:00
0c4a79b7e0 Replace some calls to new with make_{unique,shared} (#160581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160581
Approved by: https://github.com/malfet
2025-08-28 00:30:45 +00:00
9b02435e9f Improve Scheduler init duration (#161491)
Early exit merge_loops() if config.loop_ordering_after_fusion is false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161491
Approved by: https://github.com/jansel
2025-08-28 00:27:51 +00:00
fd60117051 [C10D] add _summarize_ranks util (#160284)
Prints ranges of ranks succinctly.

e.g.

For a strided list of ranks, summarizes down to start:stop:step
```
0:4096:512
```

Omits step if it's 1
```
0:8
```

Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second it is
7.

Currently, this does not support combinations of striding _and_ range (e.g.
it cannot generate a representation like "0:2, 4:6, ..., 12:14"). Is this
needed / useful? If so, it could be added.
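
The format described above can be sketched as follows. This is a simplified illustration, not the actual `_summarize_ranks` implementation. `summarize_strided` is a hypothetical name, and its handling of a single rank or of non-uniform strides is an assumption.

```python
def summarize_strided(ranks):
    """Summarize uniformly strided ranks as 'start:stop[:step]' (stop is exclusive)."""
    ranks = sorted(ranks)
    if not ranks:
        raise ValueError("expected at least one rank")
    if len(ranks) == 1:
        return str(ranks[0])  # Assumption: a lone rank is printed as-is.

    step = ranks[1] - ranks[0]
    if any(b - a != step for a, b in zip(ranks, ranks[1:])):
        raise ValueError("ranks are not uniformly strided")

    start = ranks[0]
    stop = ranks[-1] + step  # Exclusive endpoint, like Python's range().
    return f"{start}:{stop}" if step == 1 else f"{start}:{stop}:{step}"


strided = summarize_strided(range(0, 4096, 512))
contiguous = summarize_strided(range(8))
```

Because the endpoint is exclusive, the summary of `range(0, 4096, 512)` ends at 4096 even though the last rank in it is 3584.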

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
2025-08-28 00:17:53 +00:00
97a548b640 [PGO] skip allowlist logging for empty graphs (#161530)
Summary: reduces spurious logging

Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D81060182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161530
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-08-28 00:12:13 +00:00
c55bdb26e1 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit 378edb047f83dfb84c2d9c032bddebc5e0147b8f.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))
2025-08-27 23:45:12 +00:00
903181bb6f Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471)"
This reverts commit 4ed71d5412d58746d23f16689cab61da0e8149ef.

Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3230069186))
2025-08-27 23:18:36 +00:00
ba201082b6 [TorchScript] ProfilingExecutor - RemoveProfileNodesAndSpecializeTypes None handling (#161538)
ProfilingGraphExecutor works like this:
1. do some unrelated JIT optimizations
2. Add profiling nodes to collect JIT information like tensor dtypes and shapes
3. Do some more unrelated JIT optimizations
4. Remove the profiling nodes and extract the tensor info, and then use the JIT tensor info to do optimizations.

This PR is intended to fix a bug in Step 4, where the profiling nodes were removed. It was previously assumed that all the things that were profiled were either Tensors or Optional[Tensor]s - otherwise, step 2 would not have introduced a profiling node.

However, we saw a case where step 3 would replace Optional[Tensor] inputs with `None` inputs (e.g. if a conditional that returned a Tensor or a None could be statically known to only follow the `None` branch).

To fix this, we essentially just modify the RemoveProfileNodesAndSpecializeTypes assert so that it accepts Tensors, Optional[Tensor]s, or None (the new part).

Note that this issue is probably somewhat uncommon (maybe why we didn't see it for the first 4 years that this code existed). I expect that, typically, any time that step 3 would convert `Optional[Tensor] -> None`, step 1 would have already done that. So it's difficult to reproduce in an end-to-end TorchScript workload.

Differential Revision: [D81068172](https://our.internmc.facebook.com/intern/diff/D81068172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161538
Approved by: https://github.com/nmacchioni
2025-08-27 23:12:15 +00:00
8fc2467fe5 Revert "[3/N][SymmMem] Expose offset field from handle (#161532)"
This reverts commit 68d395d61e9d4601ab1e2bca56eb28253572c662.

Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/161471 internal failure ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3230016806))
2025-08-27 23:06:55 +00:00
30edac5da6 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods passed to flash_attention at the next layer of this stack.

Below are a bunch of score mods that Claude and I came up with to exercise the actual ops:
```python
import math

import torch

def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive, but it gives fairly broad coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 23:01:31 +00:00
12c0cf3fab switch prefer_deferred_runtime_asserts_over_guards in export (#160111)
Summary:
In preparation for checking shape guards in export, this PR effectively switches `prefer_deferred_runtime_asserts_over_guards` to `False`, matching Dynamo.

Actually that's a lie: we switch it to `allow_complex_guards_as_runtime_asserts`, which is `False` by default but can be set to `True` via an internal API. This makes the two flags synchronized, so we should be able to kill `allow_complex_guards_as_runtime_asserts` at this point.

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79734206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160111
Approved by: https://github.com/tugsbayasgalan
2025-08-27 22:51:10 +00:00
6b051d7de3 [BE] Refactor trymerge for readability (#161637)
Two changes:
- Extract getting the last_commit's sha into its own function
- Rename merge_changes to merge_changes_locally to better explain its functionality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161637
Approved by: https://github.com/seemethere, https://github.com/malfet
ghstack dependencies: #161558
2025-08-27 22:44:00 +00:00
ee0ec21191 Ensure that tensors are contiguous before using no-graph MPS impl (#161641)
Fixes #161640

Check if tensors are contiguous before using the no-graph implementation. Using the script from the issue above with this change, I get the expected results.

```
MPS contiguous result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
MPS non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
CPU non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061])
```
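
As an illustration of the property being checked (this is not the MPS code from this PR), a tensor is contiguous when its strides match the row-major strides implied by its shape. A minimal pure-Python sketch, where `row_major_strides` and `is_contiguous` are hypothetical helpers working on plain tuples:

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def is_contiguous(shape, strides):
    """True if every dimension of size > 1 has the row-major stride."""
    expected = row_major_strides(shape)
    return all(size <= 1 or s == e for size, s, e in zip(shape, strides, expected))

print(is_contiguous((3, 4), (4, 1)))  # dense layout
print(is_contiguous((4, 3), (1, 4)))  # transposed view of the layout above
```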
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-27 22:31:57 +00:00
7da02bf8af Skip const folding with symbolic expression (#161437)
Summary: When performing constant folding, we must skip over operators that have symbolic `fill_value`.
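
A toy sketch of the rule, assuming a hypothetical `FullNode` record rather than the real graph node type: folding is only attempted when the fill value is a concrete number, and a symbolic value (represented here as a string such as `"s0"`) leaves the node in place.

```python
from dataclasses import dataclass

@dataclass
class FullNode:
    """Toy stand-in for a `full`-style node: a 2-D shape plus a fill value."""
    shape: tuple
    fill_value: object  # int/float when concrete, str when symbolic (e.g. "s0")

def try_fold(node):
    """Return the folded constant as nested lists, or None if the node must be kept."""
    if not isinstance(node.fill_value, (int, float)):
        return None  # symbolic: the value is unknown until runtime, so skip folding
    rows, cols = node.shape
    return [[node.fill_value] * cols for _ in range(rows)]

print(try_fold(FullNode((2, 2), 7)))
print(try_fold(FullNode((2, 2), "s0")))
```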

Test Plan:
CI

Rollback Plan:

Reviewed By: kalpit-meta-1

Differential Revision: D80965936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161437
Approved by: https://github.com/StellarrZ
2025-08-27 22:09:58 +00:00
1041805c1e [dynamo, nested graph breaks] prevent excessive recompilations (#159786)
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.

Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.

Followup: we can skip guards on continuation functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138
2025-08-27 21:53:37 +00:00
6562646dab [dynamo, nested graph breaks] clean up comments and codegen (#160138)
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.

Also simplify some codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817
2025-08-27 21:53:37 +00:00
d0a242e547 [dynamo, nested graph breaks] support nested closures (#159817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678
2025-08-27 21:53:37 +00:00
3f8090809f [dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329
2025-08-27 21:53:37 +00:00
10d93325b1 [dynamo, nested graph breaks] support very simple nested graph breaks (#159329)
e.g. this graph breaks once now:
```python
import torch

torch._dynamo.config.nested_graph_breaks = True

def inner(x):
    x = x + 1
    torch._dynamo.graph_break()
    return x + 2

@torch.compile(backend="eager")
def outer(x):
    return inner(x)

print(outer(torch.ones(3)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329
Approved by: https://github.com/anijain2305
2025-08-27 21:53:37 +00:00
68fa882dad [dynamo] Correctly track mutation class source for MutableMappingVariable (#161568)
Fixes https://github.com/pytorch/pytorch/issues/161505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161568
Approved by: https://github.com/Lucaskabela, https://github.com/malfet
2025-08-27 21:47:17 +00:00
b9c6aa1e17 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161628)
This reverts commit ae1a706444d6c0a6019ffc936c8b36574335a5d5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161628
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626, #161627
2025-08-27 21:37:14 +00:00
b7b9fb9962 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#161627)
This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626
2025-08-27 21:37:14 +00:00
c03d8d4082 Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)" (#161626)
This reverts commit 908c5cc4c0f22d141776bde47c296b5186691855.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626
Approved by: https://github.com/atalman
ghstack dependencies: #161625
2025-08-27 21:37:14 +00:00
clr
40f46b09c7 async_compile: Fix the wait method to actually wait (#161561)
This method never triggered. It's used in 2 tests and they pass, so no serious
concern.

Note that I introduced and then fixed a latent bug: if we called
shutdown_compile_workers, jobs would crash with this change because ready_future
was already finished if we had called wait.

However, we only call wait in tests, so that bug is fine.

The other behaviour is that, if you called shutdown, I believe we could
potentially block on your first triton compile after that until the pool was
ready. This should now correctly switch to direct mode until the pool is ready on
later warmups.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161561
Approved by: https://github.com/masnesral
ghstack dependencies: #161452
2025-08-27 21:35:31 +00:00
clr
0d6597138c inductor: Log the specific triton kernel that fails (#161452)
Added an optional name argument to SubprocPool.submit.

We record this in a dictionary, and when raising exceptions, add the name.
We manage the lifecycle the same as the pending futures.
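
The idea can be sketched with the standard library. `NamedPool` below is a hypothetical stand-in built on `concurrent.futures`, not the `SubprocPool` implementation: it stores the optional name next to the pending future, drops it when the job finishes, and prefixes any exception with it. The job name used in the demo is made up.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class NamedPool:
    """Toy pool that remembers an optional name for each pending job."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._names = {}  # inner Future -> name, removed once the job finishes

    def submit(self, fn, *args, name=None):
        inner = self._pool.submit(fn, *args)
        outer = Future()
        if name is not None:
            self._names[inner] = name
        inner.add_done_callback(lambda f: self._finish(f, outer))
        return outer

    def _finish(self, inner, outer):
        name = self._names.pop(inner, None)  # same lifecycle as the pending future
        exc = inner.exception()
        if exc is None:
            outer.set_result(inner.result())
        elif name is None:
            outer.set_exception(exc)
        else:
            wrapped = RuntimeError(f"job {name!r} failed: {exc}")
            wrapped.__cause__ = exc
            outer.set_exception(wrapped)

    def shutdown(self):
        self._pool.shutdown(wait=True)

pool = NamedPool()
future = pool.submit(lambda: 1 // 0, name="triton_kernel_0")
try:
    future.result()
except RuntimeError as err:
    print(err)  # the message now names the failing job
pool.shutdown()
```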

Added a specific testcase to make sure this logs correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161452
Approved by: https://github.com/masnesral
2025-08-27 21:35:31 +00:00
06ddaf1e0a Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)" (#161625)
This reverts commit a818fa77e3a72271f144514ef349c5a666313205.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625
Approved by: https://github.com/atalman
2025-08-27 21:34:12 +00:00
26d0ff1cba [AOTI-FX] Enhance launch grid FloorDiv replacement using sympy.together. (#161582)
# Feature
2d launch grids with dynamic shapes can contain sympy expressions like `floor(x / 128 + y / 128)`. This breaks the dynamic shapes tracer which only supports `FloorDiv`, and not `floor`.  To handle this case, call `sympy.together` prior to pattern matching to convert this to `floor((x + y) / 128)`. Then, we can recognize the pattern and map it to `FloorDiv(x + y, 128)`.
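
The rewrite relies on `x / 128 + y / 128` being exactly `(x + y) / 128`. The sketch below only checks that identity numerically with exact rationals from the standard library; it is not the sympy-based pattern matching from this PR, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import floor

def grid_unsimplified(x, y, block=128):
    """floor(x / block + y / block), evaluated with exact rationals."""
    return floor(Fraction(x, block) + Fraction(y, block))

def grid_floordiv(x, y, block=128):
    """FloorDiv(x + y, block) written with Python's integer floor division."""
    return (x + y) // block

# Both forms agree, including the case where two remainders sum to a full block.
for x, y in [(0, 0), (127, 1), (100, 100), (1000, 37), (255, 255)]:
    assert grid_unsimplified(x, y) == grid_floordiv(x, y)
print(grid_floordiv(127, 1))
```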

# Test plan
Added a custom Triton test exposing this. The test calls a 2d autotuned kernel with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161582
Approved by: https://github.com/nandesuka
2025-08-27 21:31:28 +00:00
c36d18d7e8 [rfc] aot precompile with custom backend api (#161383)
This adds a new feature to torch.compile(fullgraph=True) that ahead-of-time compiles ("aot_compile") a function with the given example inputs.

On the user side it should look like:

```
def foo(x, y):
    return x + y

compiled_fn = torch.compile(fullgraph=True).aot_compile(((torch.randn(3, 4), torch.randn(3, 4)), {}))
```

This is different from the traditional `torch.compile` workflow, where the compiled object is a drop-in replacement for the original eager model:
```
tensor input -> torch.compile() -> tensor output (and populates the cache entry)
```
`aot_compile` will instead return a compiled function as result, and it's purely functional and doesn't populate the compile cache entry in dynamo:
```
tensor input -> aot_compile() -> compiled function
```
The aot compiled function will be savable and loadable on disk as well:
```
torch.compile(fullgraph=True).aot_compile(...).save_compiled_function('my/path')
compiled_fn = torch.compiler.load_compiled_function("my/path")
```

Right now we treat the compiler backend as a black box, and it needs to implement the following interface to make compile artifacts serializable:
```
class SerializableCallable:
    def save_compile_artifacts(): ....
    def load_compile_artifacts(): ....
```
We haven't implemented this for inductor yet, but this shouldn't be an issue since we gate this feature through `torch._dynamo.config.aot_compile` (which defaults to False); the inductor implementation is left as a follow-up to this PR.

Differential Revision: [D80914270](https://our.internmc.facebook.com/intern/diff/D80914270/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161383
Approved by: https://github.com/tugsbayasgalan
2025-08-27 21:26:25 +00:00
014b98dd09 Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)"
This reverts commit 77bc959fe122bfd131e339ca36cab445a1860806.

Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is failing internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))
2025-08-27 21:21:19 +00:00
38ed57d446 Revert "Updates to CuTe DSL template renderer (#161117)"
This reverts commit 1750cc80374a9dd22fc26701c0602ae11a62baf0.

Reverted https://github.com/pytorch/pytorch/pull/161117 on behalf of https://github.com/atalman due to will need to revert to unblock revert of https://github.com/pytorch/pytorch/pull/151314 ([comment](https://github.com/pytorch/pytorch/pull/161117#issuecomment-3229754295))
2025-08-27 21:17:25 +00:00
007935a802 [cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063
Approved by: https://github.com/Skylion007
ghstack dependencies: #160754
2025-08-27 21:15:01 +00:00
cbc53b7696 Update pybind11 submodule to 3.0.1 (#160754)
Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling.

Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754
Approved by: https://github.com/Skylion007
2025-08-27 21:15:01 +00:00
624bc36163 Ensure the comment id is always passed in to trymerge (#161558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161558
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-08-27 19:53:28 +00:00
06c7516994 [BE] Upgrade XPU support package to 2025.2 (#158733)
Including below changes,

- Add XPU support package 2025.2 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.1 build in CI to ensure no break issue until PyTorch 2.9 release
- Upgrade XPU support package from 2025.1 to 2025.2 in CD for both Linux and Windows
- Rename Linux CI job name & image name to n & n-1
- Update XPU runtime pypi packages dependencies of CD wheels
- Remove deprecated support package version docker image build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158733
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-27 19:33:38 +00:00
2efcf9d081 [dynamo] Fix graph break registry loading in fbcode (#161550)
Summary: Add `torch/_dynamo/graph_break_registry.json` as an internal dependency. Minor related fixes.

Test Plan:
Test on OSS.

Rollback Plan:

Differential Revision: D81078973

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161550
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305
2025-08-27 19:25:15 +00:00
443452ca2f Remove test since it ooms on CI (#161644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161644
Approved by: https://github.com/BoyuanFeng
2025-08-27 19:11:29 +00:00
47ecd2042f [ONNX] Fix index_put_ usage (#161263)
Summary:
It's hard to understand how it's working in most of our models, but in general it looks like `aten::copy_` is replaced incorrectly.
There are two schemas for `aten::copy_`:
1. `aten::copy_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!)`
2. `aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)`

According to the logic in the comments we don't need one of the parameters for `aten::index_put_`.

It seems the logic was inferred from the ordinary `aten::copy`, which can take a third parameter, the `non_blocking` flag.

Depending on the execution environment, the sliced copy can be replaced by either the first schema or the second schema, with the default parameter explicitly set to `False`.

If the first schema is selected, it leads to a crash (which is easy to catch in our prod env). If the second schema is selected, there is no crash, but the third parameter is treated as the `accumulate` parameter of the `index_put_` function, which doesn't make sense.

So, in any case, usage of the third parameter must be removed from the `aten::copy_` replacement.
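
The two failure modes can be mimicked in plain Python. Everything below is a toy model with hypothetical helper names, not the ONNX pass itself: `copy_inputs` stands for the inputs of the `aten::copy_` node, and the buggy replacement forwards a third input as `accumulate`. The source only reports `False` being forwarded; `True` is used here purely to make the mix-up visible.

```python
def index_put_(target, index, values, accumulate=False):
    """Toy in-place column write; accumulate=True adds instead of overwriting."""
    for row, value in zip(target, values):
        row[index] = row[index] + value if accumulate else value
    return target

def replace_copy_buggy(target, index, copy_inputs):
    """Forwards a third copy_ input as accumulate (the behaviour being removed)."""
    # copy_inputs mirrors the node inputs: [self, src] or [self, src, non_blocking]
    return index_put_(target, index, copy_inputs[1], copy_inputs[2])

def replace_copy_fixed(target, index, copy_inputs):
    """Ignores any third copy_ input, so accumulate keeps its default."""
    return index_put_(target, index, copy_inputs[1])

def ones():
    return [[1.0, 1.0], [1.0, 1.0]]

src = [5.0, 6.0]
try:
    replace_copy_buggy(ones(), 0, [None, src])  # two-input schema: index 2 is missing
except IndexError as err:
    print("out of range:", err)
print(replace_copy_buggy(ones(), 0, [None, src, True]))  # flag misread as accumulate
print(replace_copy_fixed(ones(), 0, [None, src, True]))  # plain overwrite
```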

For more details, check this post:
https://fb.workplace.com/groups/1405155842844877/permalink/25337687649165028/

Test Plan:

The test fails in the production environment only.
In the test env, the `non_blocking` flag is mapped as `False` to the `accumulate` flag, which doesn't cause the test to fail but makes no sense in terms of flag mapping.

The export now works without errors. Before the fix, it failed with an out-of-bounds vector index access, like this:
```
   1095     _C._jit_onnx_log("Torch IR graph at exception: ", graph)
File ~/.bento/kernels/bento_kernel_gaia_ml/1578/bento_kernel_gaia_ml_binary-inplace#link-tree/torch/onnx/utils.py:636, in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module)
    629 _C._jit_pass_lower_all_tuples(graph)
    630 # in _jit_pass_onnx, symbolic functions are called for each node for conversion.
    631 # However, there are nodes that cannot be converted without additional context.
    632 # For example, the number of outputs from split (and whether it is static or dynamic) is unknown
    633 # until the point where it is unpacked by listUnpack node.
    634 # This pass does a preprocess, and prepares the nodes such that enough context can be received
    635 # by the symbolic function.
--> 636 _C._jit_pass_onnx_remove_inplace_ops_for_onnx(graph, module)
    637 _C._jit_pass_onnx_preprocess(graph)
    639 # onnx does not support tuples, so try to remove them
RuntimeError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
```

The test script:
```
import torch as th
import tempfile

class CopyTest(th.nn.Module):
    def forward(
        self,
        input_th: th.Tensor
    ):
        to_fill = th.ones((3, 3))
        to_fill[:, 0] = input_th[:, 0]
        return to_fill

m = CopyTest()

test_tensor = th.zeros((3, 3))

with tempfile.NamedTemporaryFile() as f:
    th.onnx.export(
            m,
            (test_tensor,),
            f,
            export_params=True,
            opset_version=17,
            do_constant_folding=True,
            input_names=["input"],
            output_names=["features"],
            dynamo=False,
        )
```

The exported model test:
```
import torch
import onnx
import onnxruntime

model_name = '/home/ironsided/test_model.onnx'
onnx_model = onnx.load(model_name)
onnx.checker.check_model(onnx_model)

example_inputs = (torch.zeros(3, 3),)

onnx_inputs = [tensor.numpy(force=True) for tensor in example_inputs]
print(f"Input length: {len(onnx_inputs)}")
print(f"Sample input: {onnx_inputs}")

ort_session = onnxruntime.InferenceSession(
    model_name, providers=["CPUExecutionProvider"]
)

onnxruntime_input = {input_arg.name: input_value for input_arg, input_value in zip(ort_session.get_inputs(), onnx_inputs)}

# ONNX Runtime returns a list of outputs
onnxruntime_outputs = ort_session.run(None, onnxruntime_input)[0]

print(onnxruntime_outputs)
```

The produced result is correct:
```
Input length: 1
Sample input: [array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)]
[[0. 1. 1.]
 [0. 1. 1.]
 [0. 1. 1.]]
```

Rollback Plan:

Differential Revision: D80797028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161263
Approved by: https://github.com/justinchuby, https://github.com/jermenkoo
2025-08-27 18:53:13 +00:00
1750cc8037 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions available to template writers, specifically get_output and modification. The reasons become clearer in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack.

A bunch of score mods that Claude and I came up with, which exercise the actual ops:
``` Py

import math

import torch

def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```
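
To see what two of the masks above look like, here is a small pure-Python sketch. It restates the `dilated_attention` and `causal_with_global` predicates on plain integers (the function names and the `render` helper are hypothetical, and no torch is involved) and draws an 8 x 8 mask with `#` where the score is kept and `.` where it would become `-inf`.

```python
def is_power_of_two_gap(q_idx: int, kv_idx: int) -> bool:
    """Scalar form of the dilated_attention predicate: distance is 0 or a power of two."""
    distance = abs(q_idx - kv_idx)
    return distance == 0 or (distance & (distance - 1)) == 0

def causal_or_global(q_idx: int, kv_idx: int, num_global: int = 4) -> bool:
    """Scalar form of the causal_with_global predicate."""
    return q_idx >= kv_idx or kv_idx < num_global

def render(predicate, size=8):
    """One string per query row: '#' where attention is kept, '.' where it is masked."""
    return ["".join("#" if predicate(q, kv) else "." for kv in range(size))
            for q in range(size)]

print("\n".join(render(is_power_of_two_gap)))
print()
print("\n".join(render(causal_or_global)))
```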

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

These runs use bfloat16, and there are no major differences. The list of pointwise ops here isn't exhaustive, but it gives fairly broad coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 18:39:09 +00:00
ec585ceab4 [inductor] structured-log graph execution order + test (#160448)
Summary:

- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
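
A rough sketch of how such records could be replayed to recover execution order. The payload shape comes from the summary above; the surrounding envelope (`name`/`payload` keys), the in-memory list used as a sink, and both helper functions are assumptions for illustration, not the TLParse format.

```python
import json

def record_graph_execution(sink, graph_id):
    """Append one record whose payload matches the shape described above."""
    sink.append(json.dumps({
        "name": "inductor_graph_execution",
        "payload": {"graph": f"graph_{graph_id}"},
    }))

def execution_order(sink):
    """Recover the order in which graphs ran from the recorded lines."""
    records = (json.loads(line) for line in sink)
    return [r["payload"]["graph"] for r in records
            if r["name"] == "inductor_graph_execution"]

trace = []
for graph_id in (0, 1, 0, 2):  # graph 0 runs twice
    record_graph_execution(trace, graph_id)
print(execution_order(trace))
```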

Testing:
- Add inline test to verify structure and output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
2025-08-27 18:12:46 +00:00
16ce6a4aad [hop] move insert_deferred_runtime_asserts under subtracer (#161416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161416
Approved by: https://github.com/pianpwk
ghstack dependencies: #160548
2025-08-27 17:43:02 +00:00
3345a7ff8a [VLLM][FLASHINFER UPDATE] (#161537)
The vLLM x torch build fails because the flashinfer build fails; it turns out the vLLM team recently changed its pointer to flashinfer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161537
Approved by: https://github.com/huydhn
2025-08-27 17:41:26 +00:00
55e6ea105c Fix running the benchmark jobs twice (#161619)
I made a mistake in https://github.com/pytorch/pytorch/pull/160935 by removing this condition check. As a result, the benchmark job ran twice for scheduled jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/17266546494. This was missed during testing because `pull_request` and `workflow_dispatch` were working fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161619
Approved by: https://github.com/anijain2305
2025-08-27 17:18:10 +00:00
a3fa1b8c2a Set USE_NVSHMEM only if USE_DISTRIBUTED is set (#161451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161451
Approved by: https://github.com/eqy
2025-08-27 17:11:19 +00:00
620d52e882 Fix sort doc error (#161539)
Fixes #129298. Updated torch.sort documentation so that the 'stable' parameter is a Keyword Argument. This is how it's implemented in PyTorch.
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161539
Approved by: https://github.com/soulitzer
2025-08-27 17:01:53 +00:00
69c7b16e6f Revert "Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)"
This reverts commit a03cc53e6f6e2fe67316cb8c74c25f5b953f445b.

Reverted https://github.com/pytorch/pytorch/pull/161002 on behalf of https://github.com/guangyey due to This PR breaks CI TestCudaMallocAsync::test_allocator_settings ([comment](https://github.com/pytorch/pytorch/pull/161002#issuecomment-3228980897))
2025-08-27 16:52:22 +00:00
379ebdaf5e [OrderedDict] Implement OrderedDict.popitem(last=...) (#155153)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155153
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156, #155072, #155152
2025-08-27 15:46:40 +00:00
7c8f049d54 [OrderedDict] Implement OrderedDict.move_to_end(key, last=False) (#155152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155152
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156, #155072
2025-08-27 15:46:40 +00:00
e3718c4855 [dict] Implement dict.__ior__ and fix return type in dict.__or__ (#155072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155072
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156
2025-08-27 15:46:40 +00:00
2d44969bbd Wrap class definitions in set_fullgraph(False) in test_dict/test_ordered_dict (#160156)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160156
Approved by: https://github.com/zou3519
2025-08-27 15:46:40 +00:00
a2af6a9d6b Run WoArm64 CI every 4 hours (#161504)
Since WoArm64 isn’t part of CI yet, this PR schedules the workflow to increase visibility and insights. It will execute every 4 hours and still support manual runs via the `ciflow/win-arm64` tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161504
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-08-27 15:46:34 +00:00
28af843ee0 Revert "Fix index_add for int64 input + zerodim index (#161511)"
This reverts commit d51486616cb3fe54bc298669a88059be56c1fb22.

Reverted https://github.com/pytorch/pytorch/pull/161511 on behalf of https://github.com/clee2000 due to broke test_indexing.py::TestIndexingCPU::test_index_add_zerodim_index_floating_alpha_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/17257089116/job/48971728595) [HUD commit link](d51486616c) on dynamo? ([comment](https://github.com/pytorch/pytorch/pull/161511#issuecomment-3228705842))
2025-08-27 15:38:11 +00:00
378edb047f [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented the device_assert op and overrode has_side_effect to return True so that dead code elimination does not remove it.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
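
Why the side-effect flag matters can be shown with a toy dead-code-elimination pass. The `Node` record and `eliminate_dead_code` function are hypothetical and much simpler than Inductor's IR: a node survives if it is reachable from an output or if it is marked as having a side effect, which is what keeps an assert with no users alive.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    has_side_effect: bool = False

def eliminate_dead_code(nodes, outputs):
    """Keep nodes reachable from outputs, plus any node with side effects."""
    by_name = {n.name: n for n in nodes}
    live = set()
    stack = list(outputs) + [n.name for n in nodes if n.has_side_effect]
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(by_name[name].inputs)
    return [n.name for n in nodes if n.name in live]

graph = [
    Node("x"),
    Node("cond", ["x"]),
    Node("device_assert", ["cond"], has_side_effect=True),  # no users, still kept
    Node("unused", ["x"]),
    Node("out", ["x"]),
]
print(eliminate_dead_code(graph, ["out"]))
```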

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-27 14:49:20 +00:00
d2db6c86b0 [OpenReg] Add Develop Notes for Integrating New Backend into PyTorch (#158644)
To facilitate the integration of new backends, we plan to publish a new development note that details all the key components, hoping to speed up the development of other accelerators.

This PR is the beginning of this note and covers operator registration; we will gradually improve it and keep it in sync with OpenReg's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644
Approved by: https://github.com/albanD
2025-08-27 14:47:25 +00:00
a3c1cbdbc6 [dynamo][higher order ops] Refactor for out spec (#161354)
Preparing for the next PR to add more info in the output spec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161354
Approved by: https://github.com/zou3519
2025-08-27 14:41:18 +00:00
9632f4ea9f [CD] [aarch64] Add CUDA 13.0 sbsa nightly build (#161257)
https://github.com/pytorch/pytorch/issues/159779

CUDA SBSA build for CUDA 13.0
1. Supported archs: sm_80 to sm_120. Including support for Thor (sm_110), SPARK (sm_121), GB300 (sm_103).
"This release adds support of SM110 GPUs for arm64-sbsa on Linux." from 13.0 release notes https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
2. Use -compress-mode=size for binary size reduction: the 13.0 wheel is 2.18 GB compared with 3.28 GB for 12.9, a saving of 1.1 GB (~33.5% smaller).
3. Refactored the libs_to_copy list with common libs, and version_specific_libs.
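
The size figures in item 2 can be re-derived with a few lines of arithmetic, using only the two wheel sizes quoted above:

```python
old_gb, new_gb = 3.28, 2.18  # CUDA 12.9 and CUDA 13.0 wheel sizes quoted above
saved_gb = old_gb - new_gb
saved_pct = 100 * saved_gb / old_gb
print(f"{saved_gb:.2f} GB saved, {saved_pct:.1f}% smaller")
```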

TODO: add the other CUDA archs in the existing support matrix of x86 to SBSA build as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161257
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-27 14:38:07 +00:00
3d406429b0 [dynamo][vllm] Support typing.get_type_hints (#161362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161362
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel
2025-08-27 09:55:31 +00:00
9a12bab0d3 Add debug handle to inductor provenance tracking (#161110)
Summary:
Use debug handle on kernel names to distinguish different calls to the same kernel.

Previous kernel name: kernel_name

New kernel name: kernel_name:debug_handle

We add the debug handle to the tlparse artifacts: `inductor_provenance_tracking_node_mappings` and `inductor_provenance_tracking_kernel_stack_traces`.

We also add debug handles in the comments of the generated code so we can map to them in the provenance tracking highlighter tool: https://github.com/pytorch/tlparse/pull/134

Example output code is below. If a kernel doesn't have a debug handle, the `[Provenance debug handles]` comment line will not be written.

```
        # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.addmm, aten.gelu]
        # [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3
        stream0 = get_raw_stream(0)
        triton_poi_fused_addmm_gelu_2.run(buf4, primals_5, 300, stream=stream0)
```
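
A downstream tool could pull the handles back out of the generated code roughly as follows. This is a sketch under stated assumptions: `parse_debug_handles` is a hypothetical helper, the handle is assumed to be an integer, and support for several comma-separated `kernel:handle` entries on one line is a guess rather than something the example above shows.

```python
import re

HANDLE_LINE = re.compile(r"#\s*\[Provenance debug handles\]\s*(.+)")

def parse_debug_handles(generated_code: str):
    """Return (kernel_name, debug_handle) pairs found in provenance comments."""
    pairs = []
    for line in generated_code.splitlines():
        match = HANDLE_LINE.search(line)
        if not match:
            continue
        for token in match.group(1).split(","):
            # Split on the last colon so the kernel name itself stays intact.
            kernel, _, handle = token.strip().rpartition(":")
            pairs.append((kernel, int(handle)))
    return pairs

sample = """
        # [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3
        stream0 = get_raw_stream(0)
"""
print(parse_debug_handles(sample))
```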

The debug handles will also be used by downstream profilers such as zoomer.

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78994959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161110
Approved by: https://github.com/angelayi
2025-08-27 04:56:11 +00:00
d51486616c Fix index_add for int64 input + zerodim index (#161511)
Fixes #161446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161511
Approved by: https://github.com/malfet
2025-08-27 04:11:10 +00:00
07a4e9fea8 [benchmarks] Skip mobilenetv3_large_100 in CI for accuracy (#161570)
To keep the CI green - https://github.com/pytorch/pytorch/issues/161419

It's unclear if this is a real failure, and debugging it is non-trivial.
Skipping for now to keep the CI green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161570
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
2025-08-27 03:44:04 +00:00
be55d7ac9e Revert "[Dynamo] Allow inlining into AO quantization modules (#152934)" (#161567)
This reverts commit 20e2ca3e29ce9eb33eef17db077696222c175764.

Fixes https://github.com/pytorch/pytorch/issues/157434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161567
Approved by: https://github.com/Lucaskabela
2025-08-27 03:33:04 +00:00
8b78ba07b1 [dynamo, nested graph breaks] add nested graph break tests (#144516)
Note: nested graph break tests (and wrapped tests) are xfailed/skipped for now - we will iteratively enable the tests as more of the nested graph break implementation is complete.

Differential Revision: [D81084809](https://our.internmc.facebook.com/intern/diff/D81084809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516
Approved by: https://github.com/anijain2305
2025-08-27 03:00:56 +00:00
b36a20d368 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated.

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf.

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>
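
To make the sign convention concrete, here is a small sketch that computes a relative delta for three rows copied from the table. It assumes the delta is `(A - B) / A`, which is inferred from the statement that a negative value means this diff improves perf; the PR description does not spell out the exact formula, and the helper name `relative_delta` is illustrative.

```python
# (attn_type, shape, TFLOPS A = nightly, TFLOPS B = this diff), copied from the table.
rows = [
    ("causal", (2, 16, 32768, 16, 32768, 128), 432.02638235921006, 450.33161016596273),
    ("sliding_window", (2, 16, 1024, 16, 1024, 64), 142.48556906165314, 137.24259849208627),
    ("noop", (4, 16, 16384, 4, 16384, 128), 444.4596314237808, 455.99558915752266),
]


def relative_delta(tflops_a: float, tflops_b: float) -> float:
    """Assumed convention: (A - B) / A, so a negative value means B is faster."""
    return (tflops_a - tflops_b) / tflops_a


for attn_type, shape, a, b in rows:
    delta = relative_delta(a, b)
    if delta < 0:
        verdict = "improves"
    elif delta > 0:
        verdict = "regresses"
    else:
        verdict = "unchanged"
    print(f"{attn_type:>15} {shape}: {delta:+.2%} ({verdict})")
```

Under this convention the `causal` and `noop` rows come out negative (B is faster) and the `sliding_window` row comes out positive (B is slower).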

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-27 02:45:20 +00:00
de58505890 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit cddcaa19035d6414a351be7c7b16c47d5a0c3466.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))
2025-08-27 02:36:42 +00:00
6913529ff8 Move non inductor workflows to Python 3.9 -> 3.10 (#161182)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/seemethere
2025-08-27 02:32:24 +00:00
4b4cdcfe3a Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)
- Fix Conv exhaustive.
- Fix AMD config pruning.
- Expand exhaustive test suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387
Approved by: https://github.com/coconutruben
2025-08-27 01:54:50 +00:00
68d395d61e [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-27 00:49:06 +00:00
4ed71d5412 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hooking the SymmetricMemory allocator to MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`. This also lets our ops have functional variants that create `out` tensors on symmetric memory.

To end users, this PR supports a Python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-27 00:49:06 +00:00
8dd5aa9689 [1/N][SymmMem] Add offset to handle, cache on base address (#161470)
For the kernels that need peer pointers directly, the rendezvous handle should allow the user to get the offset of a tensor with respect to the base allocation address. Thus the need to add an `offset` field to the SymmMem handle.

But we don't want to cache all the handles just because they have different offsets, hence the search and cache logic below:

(i) At rendezvous, the search key is still `x.storage().data_ptr()`, as it is now, but the search has two parts: first a plain dictionary lookup, as today; if that fails, a search of `allocations_` to see whether the storage ptr falls in one of the segments. This is possible because all segments are recorded during alloc.
(ii) If this segment hasn't been rendezvoused, we rendezvous it, cache it in the `symm_mem_` map with its base address as key.
(iii) We still need to return a handle for the current tensor, with a corresponding offset. This handle will be a shallow copy of the base handle, with the offset adjusted.

Some impl details:
(i.1) If we find a matching allocation, we can immediately use the allocation base address to do a re-search in `symm_mem_`.

(iii.1) To make the handle copy shallow, we move the common information -- base ptrs, base signal pad, etc -- to a structure referenced by both handles. The structure is called `NVSHMEMPeerAllocInfo`. A copy of handle just adds one more `intrusive_ptr` to it. The handle copy constructor accepts an `offset` argument.
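
The search-and-cache steps above can be sketched in Python as follows. This is a simplified model, not the C++ implementation: `PeerAllocInfo`, `Handle`, and `RendezvousCache` are hypothetical stand-ins (the first for `NVSHMEMPeerAllocInfo`), pointers are plain integers, and the actual rendezvous with peers is reduced to constructing an object.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerAllocInfo:
    """Shared per-allocation state (stand-in for NVSHMEMPeerAllocInfo)."""
    base_ptr: int
    size: int


@dataclass(frozen=True)
class Handle:
    """Shallow handle: a reference to the shared info plus a tensor offset."""
    info: PeerAllocInfo
    offset: int = 0


class RendezvousCache:
    def __init__(self) -> None:
        self.allocations: list[tuple[int, int]] = []  # sorted (base_ptr, size)
        self.symm_mem: dict[int, Handle] = {}         # keyed by base address

    def record_alloc(self, base_ptr: int, size: int) -> None:
        bisect.insort(self.allocations, (base_ptr, size))

    def _find_segment(self, ptr: int) -> tuple[int, int]:
        # Rightmost recorded segment whose base address is <= ptr.
        idx = bisect.bisect_right(self.allocations, (ptr, float("inf"))) - 1
        if idx >= 0:
            base, size = self.allocations[idx]
            if base <= ptr < base + size:
                return base, size
        raise KeyError(f"{ptr:#x} is not inside a recorded allocation")

    def rendezvous(self, storage_ptr: int) -> Handle:
        # (i) First part: plain dictionary lookup.
        hit = self.symm_mem.get(storage_ptr)
        if hit is not None:
            return hit
        # (i) Second part and (i.1): find the segment, then re-search by its base.
        base, size = self._find_segment(storage_ptr)
        base_handle = self.symm_mem.get(base)
        if base_handle is None:
            # (ii) Segment not rendezvoused yet: do it once, cache under the base.
            base_handle = Handle(PeerAllocInfo(base, size))
            self.symm_mem[base] = base_handle
        # (iii) Shallow copy that shares the info and only adjusts the offset.
        return Handle(base_handle.info, offset=storage_ptr - base)


cache = RendezvousCache()
cache.record_alloc(0x1000, 0x400)
handle = cache.rendezvous(0x1100)
print(hex(handle.offset), len(cache.symm_mem))
```

Only one entry per allocation ends up in the map, no matter how many distinct offsets are requested, which is the point of caching on the base address.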

Test:
Existing tests should not fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161470
Approved by: https://github.com/ngimel
2025-08-27 00:49:06 +00:00
8ff9485815 [export] Update unflattening dynamo.disable (#161306)
Summary:
Doing inline disabling causes recompiles with the reason "Cache line
invalidated because L['___stack0'] got deallocated"

Test Plan:
CI

Rollback Plan:

Differential Revision: D80816956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161306
Approved by: https://github.com/pianpwk
2025-08-27 00:27:16 +00:00
b074cbaedd [dynamo] allow resume functions to have name in both freevars and varnames (#161544)
fixes https://github.com/pytorch/pytorch/issues/161542

Differential Revision: [D81073109](https://our.internmc.facebook.com/intern/diff/D81073109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161544
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-08-27 00:25:16 +00:00
80bf883d21 Replace manual cache in _python_dispatch.get_alias_info with functools.cache (#161286)
In addition to being more code, the manual cache was doing an extra dictionary lookup on each cache hit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161286
Approved by: https://github.com/wconstab
2025-08-27 00:17:51 +00:00
9de9d25f8d [Inductor-FX] Support custom triton kernels (#161474)
# Feature
Add support for custom Triton kernels to the FX backend. This turned out not to require any new features, except for a minor change to handle `tl.constexpr` arguments which are not part of the autotuning config.

# Caveat

This may not cover every possible case. For example, we might need more features for autotuning custom Triton code. This PR entirely skips the [custom codegen](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L1034-L1039) for user-defined grid functions, but there may be edge cases requiring this logic. However, this PR seems to do a reasonable job as many of the grids end up being written into Inductor/Triton metadata and don't require special codegen.

As a follow up, I'm planning to test this against all of AOTI's custom Triton kernel tests.

# Test plan
Added a CI test using a custom Triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161474
Approved by: https://github.com/angelayi
2025-08-27 00:15:19 +00:00
dbc903a94a [APS IR] Minfor fix - use GetAttrKey in get_keystr to match with flat args path in unflatten (#161453)
Summary: While passing path info to [_check_input_constraints_for_graph](https://www.internalfb.com/code/fbsource/[6b5b2dc35902a26ce265e3c0ae5189a3faba1d38]/fbcode/caffe2/torch/export/unflatten.py?lines=594), GetAttrKey is used to specify path str. To match with that get_keystr should also use GetAttrKey.

Test Plan:
Existing tests

```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```

```
Ran 413 tests in 204.533s

OK (skipped=1, expected failures=13)
```

Rollback Plan:

Differential Revision: D80984083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161453
Approved by: https://github.com/tugsbayasgalan
2025-08-27 00:05:20 +00:00
1b34e04485 Revert "Update pybind11 submodule to 3.0.1 (#160754)"
This reverts commit 660b0b8128181d11165176ea3f979fa899f24db1.

Reverted https://github.com/pytorch/pytorch/pull/160754 on behalf of https://github.com/atalman due to please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226078102))
2025-08-26 23:35:22 +00:00
1ce423274d Revert "[cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)"
This reverts commit 74c4c758afa8c28162f00a456c185552e1159fd3.

Reverted https://github.com/pytorch/pytorch/pull/161063 on behalf of https://github.com/atalman due to sorry broke vllm tests please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/161063#issuecomment-3226065212))
2025-08-26 23:31:23 +00:00
4e630f0629 Revert "[Inductor] Update Outer Reduction Heuristic (#159093)"
This reverts commit ca9fe0107e165a4a4147325ff6d34235ebde447f.

Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/PaulZhang12 due to Addressing internal implications then relanding ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3225942525))
2025-08-26 22:37:56 +00:00
cddcaa1903 [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-26 22:33:23 +00:00
1e4dfeeb06 Add early_stop kwarg to torch.utils.checkpoint (#160781)
We already have a context manager "set_checkpoint_early_stop". This PR adds a kwarg that toggles the same setting.

It is also useful to have a kwarg version of the setting in addition to the context manager, because it is annoying to apply a context manager when the AC is being applied via CheckpointWrapper.

Similar to the "debug" kwarg and the corresponding "set_checkpoint_debug_enabled" context manager, the context manager defaults to None and overrides the local setting when non-None.
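
The override rule described above (a global setting that defaults to None and wins over the local kwarg when non-None) can be sketched without PyTorch. The names `set_early_stop_override` and `resolve_early_stop` are hypothetical and chosen only for this illustration; the real context manager is `set_checkpoint_early_stop`.

```python
import contextlib

_global_early_stop = None  # None means "no global override is active"


@contextlib.contextmanager
def set_early_stop_override(enable):
    """Hypothetical stand-in for the global context manager."""
    global _global_early_stop
    previous = _global_early_stop
    _global_early_stop = enable
    try:
        yield
    finally:
        _global_early_stop = previous  # restore on exit, even on error


def resolve_early_stop(local_early_stop):
    # The global setting overrides the per-call kwarg only when it is non-None.
    if _global_early_stop is not None:
        return _global_early_stop
    return local_early_stop
```

Outside the context manager, `resolve_early_stop(False)` returns `False`; inside `with set_early_stop_override(True):` it returns `True`.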

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160781
Approved by: https://github.com/tianyu-l
2025-08-26 22:32:35 +00:00
4d078cfc4e [fx] Add is_fx_symbolic_tracing flag (#161385)
Fixes https://github.com/pytorch/pytorch/issues/135276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161385
Approved by: https://github.com/pianpwk
2025-08-26 22:26:27 +00:00
da838f65af [ONNX] Drop draft_export in exporter API (#161454)
If the ONNX exporter falls back to draft_export on a big model, the export takes a very long time and can flood the printout, which keeps users from seeing their stack trace with strict=False.

We could consider making another API for draft_export as a debugging tool, or combining it with report=True when "model is small"?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161454
Approved by: https://github.com/justinchuby
2025-08-26 22:13:43 +00:00
cde54fe4e9 fix-unpin-memory-tensor-param (#160992)
Fixes #160983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160992
Approved by: https://github.com/ngimel
2025-08-26 21:55:25 +00:00
e06d1d6610 [BE] Improve torch.inference_mode docs and error message (#161164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161164
Approved by: https://github.com/sfc-gh-sbekman, https://github.com/janeyx99
2025-08-26 20:58:56 +00:00
b2db293abc [ROCm] No-fence global reduce (#161180)
This change removes the need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics+return. This is crucial for perf on architectures with split caches (e.g. MI300), where fences are inherently costly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161180
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 20:43:59 +00:00
6686974ddd Revert "[dynamo, nested graph breaks] add nested graph break tests (#144516)"
This reverts commit 9a756c2d710a0680bac93ab0b42db519ec2dc6cf.

Reverted https://github.com/pytorch/pytorch/pull/144516 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144516#issuecomment-3225659358))
2025-08-26 20:40:17 +00:00
eqy
3d82256a86 [FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on sm110 or sm120 either (#161236)
See also #160693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161236
Approved by: https://github.com/Skylion007
2025-08-26 20:40:11 +00:00
a4fb65701b Revert "[dynamo, nested graph breaks] support very simple nested graph breaks (#159329)"
This reverts commit 8dab6d4c414bf997297804008c3da893e69cd51f.

Reverted https://github.com/pytorch/pytorch/pull/159329 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/159329#issuecomment-3225617445))
2025-08-26 20:24:10 +00:00
6afd766401 Revert "[dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)"
This reverts commit 02fa5bf6d80fa4baa6bb6dd2fa6a16d88852da91.

Reverted https://github.com/pytorch/pytorch/pull/159678 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159678#issuecomment-3225597425))
2025-08-26 20:16:36 +00:00
a7aa480e55 Revert "[dynamo, nested graph breaks] support nested closures (#159817)"
This reverts commit ef0ef6f93f7ef6d16d71a6997b72185504acd4b6.

Reverted https://github.com/pytorch/pytorch/pull/159817 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159817#issuecomment-3225586996))
2025-08-26 20:13:33 +00:00
9f6e1b8730 Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864)"
This reverts commit 3caddd4daa5b1a167663c07219e065e86247ad76.

Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))
2025-08-26 20:03:59 +00:00
caf98fde0d Revert "[dynamo, nested graph breaks] clean up comments and codegen (#160138)"
This reverts commit ac6316caaa74513cbcf3c7f9269bc23cd74749db.

Reverted https://github.com/pytorch/pytorch/pull/160138 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/160138#issuecomment-3225546707))
2025-08-26 20:01:26 +00:00
46576f5a16 Revert "[dynamo, nested graph breaks] prevent excessive recompilations (#159786)"
This reverts commit 67d31f6b281d3b15b205756fc7ebc450cdde1dab.

Reverted https://github.com/pytorch/pytorch/pull/159786 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159786#issuecomment-3225535752))
2025-08-26 19:54:22 +00:00
77bc959fe1 Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)
Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314
Approved by: https://github.com/eellison
2025-08-26 19:40:37 +00:00
262640fd22 [ROCm][CI] restore test_flex_attention tests (#161519)
Reverts #161450 and targets specific subtests to skip on MI200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161519
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 19:31:30 +00:00
74124d1b46 [reland] [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#161514)
Summary:
convert_frame.compile_frame used to take a callback transform function which will capture the frame object it has, but the frame information is not passed directly into compile_frame function.

This PR changes the signature of compile_frame so that frame information is directly passed in the function without taking a callback. This makes it easier to build fullgraph capture API on top of compile_frame.

Test Plan:
CI

Rollback Plan:

Differential Revision: D81041296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161514
Approved by: https://github.com/tugsbayasgalan
2025-08-26 19:16:05 +00:00
a03cc53e6f Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)
Summary: reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161002
Approved by: https://github.com/jingsh, https://github.com/izaitsevfb
2025-08-26 19:04:13 +00:00
00efeabc29 [hop] make materialize_as_graph disable pre-existing dispatch modes (#161220)
For materializing_as_subgraph, we just want to trace a graph. The handling of different modes should register their own logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161220
Approved by: https://github.com/Lucaskabela
2025-08-26 18:52:38 +00:00
d4703fb91c [dtensor] Add propagate_tensor_meta function that skips cache if _are_we_tracing (#161334)
Fixes an issue where the log softmax handler checked the tensor metadata cache without checking for tracing or symints.

Probably best to merge this after #160798, but not strictly blocking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161334
Approved by: https://github.com/xmfan
2025-08-26 18:46:58 +00:00
cd87f30295 DOC: Clarify documentation for torch.matmul and fix a typo (#161424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161424
Approved by: https://github.com/AlannaBurke
2025-08-26 18:30:57 +00:00
f0e0a6897e type misc init and tools for dynamo (#161293)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161293
Approved by: https://github.com/anijain2305
2025-08-26 17:38:49 +00:00
d2bd55d8de Typo correction in variable name inital_grad of Class TestFullyShardG… (#161501)
Typo correction in variable name inital_grad of Class TestFullyShardGradientScaler implementation.

Fixes #161480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161501
Approved by: https://github.com/soulitzer
2025-08-26 17:16:42 +00:00
6598f00c18 [dynamo] auto lift unbacked symbol in tensor's storage_offset (#161199)
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64, device="cuda"), torch.randn(3, 3, device="cuda"))
print(out)

```

Before the PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked_symint for the storage_offset of the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that is not bound to a proxy.

This PR tracks the symbols of storage_offset and associates them with a proxy using torch.ops.aten.storage_offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199
Approved by: https://github.com/zou3519
ghstack dependencies: #161198
2025-08-26 17:06:54 +00:00
ba6ce66698 [dynamo] lift backed symint output of item() (#161198)
Before the change in this PR, we have an error for the following code
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64), torch.randn(3, 3))
```

The error occurs while speculating fn: the attempt to lift the symbol of x0.storage_offset() finds that the symbol doesn't have a source associated with it.

What really happens is that, when the input tensor is a scalar tensor of int type and resides on CPU, we have a shortcut that creates a normal symint when .item() is called; see https://github.com/pytorch/pytorch/pull/126245.

However, previously we only tracked the unbacked symint outputs of an operation, because we believed that every backed symint must have a source associated with it and had already been lifted as an input at the top level. This invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (only inputs and symbols derived from inputs have a source, and the result of .item() doesn't).

In this PR, we start to also track the normal symint with the proxy that created it (i.e. in this case the proxy .item()).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198
Approved by: https://github.com/zou3519
2025-08-26 17:06:54 +00:00
ca9fe0107e [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Differential Revision: [D80835998](https://our.internmc.facebook.com/intern/diff/D80835998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-26 16:12:07 +00:00
f9df4ec2af SDPA skip logic for ROCm (#160522)
Skips some tests for flex and efficient attention if they are not supported by the hardware

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160522
Approved by: https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 15:51:07 +00:00
a72803f1e3 [ez][CI] GIve the linux check job a name that isn't linux-job (#161413)
Reason:
The default name is linux-job, which gets put in the linux category on HUD, but this isn't really a linux related job.  Renaming it like this will make it go into the "other" category on HUD

Other options:
Change the grouping code in test-infra
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161413
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-08-26 15:18:35 +00:00
10e67f5ec3 forward fix #161102 (#161465)
PR #161102 caused tf32 to be the default precision for flex attention.  This PR forward-fixes the broken logic and restores ROCm MI200 CI flex attention test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161465
Approved by: https://github.com/jeffdaily, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 15:11:54 +00:00
818ba434c7 Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767)"
This reverts commit fc69c2bc67672c3b2d0c62c1821895f09288f1c0.

Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to internal failure, sorry will revert ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3224341111))
2025-08-26 14:12:06 +00:00
ae8d319fd4 Update NVSHMEM to 3.3.24 and fix download link (#161321)
https://github.com/pytorch/pytorch/issues/159779

Update NVSHMEM 3.3.24 for [PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM](https://github.com/pytorch/pytorch/issues/160980)
Re-enabled sm_75 for NVSHMEM
Fixed the NVSHMEM download link for the 3.3.20 download problem described in [[CD] nvshem-3.3.9 wheels for aarch64 is not manylinux2_28 compliant](https://github.com/pytorch/pytorch/issues/160425)

Todo: ARM builds with NVSHMEM should also be re-enabled, since it is compatible with manylinux2_28

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161321
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-08-26 13:26:18 +00:00
e795450a35 Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)"
This reverts commit 447d34b5f80fb7350f79decd855cb599cab39083.

Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to reverting since can't land existing diff internally, will need to reland it ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3224029031))
2025-08-26 12:45:59 +00:00
8c506e6310 [easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445)
This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed.

Notably:
* the test has output_size kwarg specified
* the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur)

Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-08-26 12:32:24 +00:00
4a1aca11c2 Revert "[inductor] structured-log graph execution order + test (#160448)"
This reverts commit 995397d47a0e27394ee1010f158e181eb304100a.

Reverted https://github.com/pytorch/pytorch/pull/160448 on behalf of https://github.com/atalman due to internal failure please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/160448#issuecomment-3223939035))
2025-08-26 12:20:37 +00:00
e9d42b3880 [small][muon] Use addmm for Newton–Schulz orthogonalization (#161379)
A performance optimization. Using `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.
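
As a reminder of what the fused op computes: `torch.addmm(input, mat1, mat2, beta=..., alpha=...)` returns `beta * input + alpha * (mat1 @ mat2)`. The pure-Python sketch below only mirrors that arithmetic on nested lists so it runs without PyTorch. It is not fused, and the helper names `matmul`, `unfused`, and `addmm` are local to this illustration.

```python
def matmul(a, b):
    # Plain nested-list matrix product: rows of a times columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def unfused(inp, mat1, mat2, beta, alpha):
    # Three separate steps: multiply, scale, add (each would be its own op).
    prod = matmul(mat1, mat2)
    scaled = [[alpha * p for p in row] for row in prod]
    return [[beta * i + s for i, s in zip(ri, rs)] for ri, rs in zip(inp, scaled)]


def addmm(inp, mat1, mat2, beta=1.0, alpha=1.0):
    # Same result expressed as a single call: beta * inp + alpha * (mat1 @ mat2).
    prod = matmul(mat1, mat2)
    return [[beta * i + alpha * p for i, p in zip(ri, rp)] for ri, rp in zip(inp, prod)]
```

In the real op the benefit comes from doing the scale and add inside the matmul kernel, which this sketch does not reproduce.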

**Benchmark**
In a QWEN-like 0.5B model training we observed average `optimizer.step()` latency speedup: matmul ~44.5 ms -> addmm ~27.4 ms: a **1.62×** speedup.

matmul
<img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />

addmm
<img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />

**Testing**
End-to-end training:
We used a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curves show consistency between normal matmul and addmm.
<img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />

Unit test:

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = Muon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
        nesterov=nesterov,
        adjust_lr_fn="original",
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
        nesterov=nesterov,
        adjust_lr_fn="original",
        use_addmm=True,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

This shows a numeric difference, but that is expected at bf16 precision:
```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```

~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~
Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, so it is more numerically accurate than the original form. Based on this, we opt to make `addmm` the default and only option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379
Approved by: https://github.com/janeyx99
2025-08-26 09:17:28 +00:00
8cfc119491 [pytorch] Simplify codes using std::all_of() for _check_tensors_share_device_and_dtype() (#161411)
Summary: These two nested loops of checks could be simplified with `std::all_of()` to make it more compact.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80946082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161411
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-08-26 08:56:24 +00:00
e7e270a33a [pytorch] Merge two nested if statement checks into one (#161387)
Summary: This reduces the code indentation level by one.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80915357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161387
Approved by: https://github.com/janeyx99
2025-08-26 08:45:36 +00:00
6aef9f3a69 [Inductor][Tritonparse] Call jit_post_compile_hook within Inductor Triton Kernel compile path (#161443)
Summary: Since Inductor skips JIT compilation for Triton kernels, we need to manually invoke `knobs.runtime.jit_post_compile_hook` if one exists. Here, we do this to enable Tritonparse to extract launch metadata from Inductor launched kernels. We can control whether or not Inductor will run the hook with a new `TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK=1` config variable.

Reviewed By: davidberard98

Differential Revision: D80624932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161443
Approved by: https://github.com/FindHao
2025-08-26 06:24:42 +00:00
7376111d59 [BE] fix compute_global_tensor_shape test (#161441)
Fixes #161154

**Test**
`pytest  test/distributed/tensor/test_utils.py -s -k test_compute_global_tensor_shape_1D`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161441
Approved by: https://github.com/kwen2501
2025-08-26 03:22:29 +00:00
92ab184824 Revert "[Inductor] Prune configs that require more shared memory than the hardware limit (#161040)"
This reverts commit b2e06e0194c3fa8f7578a1b48751cc027394fb67.

Reverted https://github.com/pytorch/pytorch/pull/161040 on behalf of https://github.com/jeffdaily due to still failing on rocm, see https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=inductor%2Ftest_triton_heuristics.py%3A%3ATestTritonHeuristics%3A%3Atest_prune_configs_over_shared_memory_limit_do_pruning_True ([comment](https://github.com/pytorch/pytorch/pull/161040#issuecomment-3222430129))
2025-08-26 03:15:32 +00:00
8c442e4fd3 Fix LBFGS warning convert a tensor with requires_grad=True to a scalar (#160389)
Fixes #160197

## Test Result

```python
In [1]: import warnings
   ...: warnings.simplefilter('error')
   ...: import torch
   ...: print(torch.__version__)
   ...: a, b = torch.rand((2, 32, 32))
   ...: a.requires_grad_()
   ...: optimizer = torch.optim.LBFGS([a])
   ...: loss_fn = lambda x, y: (x-y).pow(2).mean()
   ...:
   ...: def closure():
   ...:     optimizer.zero_grad()
   ...:     loss = loss_fn(a, b)
   ...:     loss.backward()
   ...:     return loss
   ...:
   ...: for i in range(100):
   ...:     optimizer.step(closure)
   ...:     print(i, loss_fn(a, b))
   ...:
2.9.0a0+gitf33f3f8
0 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
1 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
2 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
3 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
4 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
5 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
6 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
7 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
8 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
9 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
10 tensor(5.8066e-11, grad_fn=<MeanBackward0>)

...

```

```bash
pytest test/test_optim.py -vv

...
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_NAdam_cuda_float32 PASSED [2.7192s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RAdam_cuda_float32 PASSED [2.5370s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RMSprop_cuda_float32 PASSED [2.0190s]                                                                                                                                         [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_Rprop_cuda_float32 PASSED [1.8554s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SGD_cuda_float32 PASSED [2.0433s]                                                                                                                                             [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SparseAdam_cuda_float32 PASSED [1.1788s]                                                                                                                                      [100%]

================== 1471 passed, 242 skipped in 2440.52s (0:40:40) ============
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160389
Approved by: https://github.com/janeyx99

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-08-26 03:07:47 +00:00
e34b6a0103 Add meta for add.Scalar (#161332)
Fixes https://github.com/pytorch/pytorch/issues/161076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161332
Approved by: https://github.com/Skylion007
2025-08-26 02:26:51 +00:00
f795e92802 space added between type and checking for typechecking (#161352)
space added between type and checking for "typechecking"

Fixes #161282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161352
Approved by: https://github.com/malfet
2025-08-26 02:07:33 +00:00
becd6cd744 Increase timeout value when pushing to ghcr.io (#161444)
Seeing this time out a lot in trunk now: https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047.  The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444
Approved by: https://github.com/atalman
2025-08-26 01:51:16 +00:00
ec21cafd85 [OpenReg] Refactor and Optimize the OpenReg for Preparation of Docs (#159640)
As the title stated.

**Changes:**

- Fixed a bug where abs_stub could not be triggered
- Refactor registration to prepare for documentation
- Add meta, fallback for openreg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159640
Approved by: https://github.com/albanD
2025-08-26 01:44:21 +00:00
908b0ccb1f Revert "Increase timeout value when pushing to ghcr.io (#161444)"
This reverts commit b9e9e92817fd7d1a778f074105603efb07e05004.

Reverted https://github.com/pytorch/pytorch/pull/161444 on behalf of https://github.com/huydhn due to Reland this to generate a different hash value for the benchmark Docker image ([comment](https://github.com/pytorch/pytorch/pull/161444#issuecomment-3222257119))
2025-08-26 01:41:59 +00:00
85adf80cf1 Disable inductor/test_flex_attention.py (#161450)
Currently inductor/test_flex_attention.py is causing rocm pytorch mi250 shard 1 to go over the timeout limit. This PR is for disabling that test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161450
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 01:28:51 +00:00
74c4c758af [cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063
Approved by: https://github.com/Skylion007
ghstack dependencies: #160754
2025-08-26 01:21:18 +00:00
660b0b8128 Update pybind11 submodule to 3.0.1 (#160754)
Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling.

Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754
Approved by: https://github.com/Skylion007
2025-08-26 01:21:18 +00:00
089ad1d88b [1/n][export] Refactor PT2 Archive weight saving and loading (#160394)
Summary:

We split the refactoring into two parts because of forward-compatibility concerns.
First, we land the deserialization (loading) part.
Then, we land the serialization (saving) part.

Save weights and constants as individual files in the PT2 archive. Each weight/constant is saved as raw bytes unless it is a custom object (TorchBind object) or a non-fake tensor subclass; for these two special cases we still save them using pickle.

The metadata of saved tensors along with the file name will be saved as `PayloadMeta`.
The mapping from FQN to `PayloadMeta` will be saved as `PayloadConfig` under `WEIGHTS_CONFIG_FORMAT` and `CONSTANTS_CONFIG_FORMAT`.

This changes the serialization on the Python side when calling `torch.export.save()`.

For deserialization in Python with `torch.export.load()`, we make it BC-safe by still allowing legacy-format weights/constants to be loaded.

For deserialization in C++ (`torch/nativert/ModelRunner.cpp`), we make this a BC-breaking change, as the OSS ModelRunner API is currently not being used.

The file structure

```
├── archive_format
├── archive_version
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── sample_inputs
│   │   └── model.pt
│   ├── constants
│   │   ├── tensor_0
│   │   ├── tensor_1
│   │   └── model_constants_config.json
│   └── weights
│       ├── weight_0
│       ├── weight_1
│       ├── weight_2
│       ├── weight_3
│       └── model_weights_config.json
└── models
    └── model.json
```
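As a rough illustration of the layout above, the sketch below writes each weight as its own raw-bytes file and records its metadata in a JSON config inside a zip archive. It uses only the standard library. The helper names `save_raw_payloads` and `load_raw_payloads`, the metadata fields, and the config layout are assumptions made for illustration and do not reflect the real `PayloadMeta`/`PayloadConfig` schema.

```python
import io
import json
import struct
import zipfile


def save_raw_payloads(weights, prefix="data/weights"):
    """Write each weight as its own raw-bytes file plus one JSON config.

    `weights` maps a fully qualified name (FQN) to a flat list of floats.
    The metadata fields below are hypothetical, not the real PayloadMeta.
    """
    buffer = io.BytesIO()
    config = {}
    with zipfile.ZipFile(buffer, "w") as archive:
        for index, (fqn, values) in enumerate(weights.items()):
            file_name = f"weight_{index}"
            # Little-endian float32 raw bytes; no pickle involved.
            raw = struct.pack(f"<{len(values)}f", *values)
            archive.writestr(f"{prefix}/{file_name}", raw)
            config[fqn] = {
                "path_name": file_name,
                "dtype": "float32",
                "numel": len(values),
            }
        # One config file maps every FQN to the metadata of its payload file.
        archive.writestr(
            f"{prefix}/model_weights_config.json", json.dumps(config)
        )
    return buffer.getvalue()


def load_raw_payloads(blob, prefix="data/weights"):
    """Read the JSON config first, then decode each raw-bytes file."""
    weights = {}
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        config = json.loads(
            archive.read(f"{prefix}/model_weights_config.json")
        )
        for fqn, meta in config.items():
            raw = archive.read(f"{prefix}/{meta['path_name']}")
            weights[fqn] = list(struct.unpack(f"<{meta['numel']}f", raw))
    return weights
```

Keeping the metadata in a separate config file means a reader can locate and decode a single payload without unpickling anything.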

Test Plan:
CI

Rollback Plan:

Differential Revision: D80035490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160394
Approved by: https://github.com/SherlockNoMad
2025-08-26 01:15:42 +00:00
67d31f6b28 [dynamo, nested graph breaks] prevent excessive recompilations (#159786)
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.

Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.

Followup: we can skip guards on continuation functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817, #160138
2025-08-26 00:58:38 +00:00
ac6316caaa [dynamo, nested graph breaks] clean up comments and codegen (#160138)
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.

Also simplify some codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817
2025-08-26 00:58:38 +00:00
ef0ef6f93f [dynamo, nested graph breaks] support nested closures (#159817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678
2025-08-26 00:58:28 +00:00
02fa5bf6d8 [dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329
2025-08-26 00:58:18 +00:00
8dab6d4c41 [dynamo, nested graph breaks] support very simple nested graph breaks (#159329)
e.g. this graph breaks once now:
```python
import torch

torch._dynamo.config.nested_graph_breaks = True

def inner(x):
    x = x + 1
    torch._dynamo.graph_break()
    return x + 2

@torch.compile(backend="eager")
def outer(x):
    return inner(x)

print(outer(torch.ones(3)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516
2025-08-26 00:58:07 +00:00
9a756c2d71 [dynamo, nested graph breaks] add nested graph break tests (#144516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281
2025-08-26 00:57:58 +00:00
504a6445a4 [dynamo, nested graph breaks] use CALL_FUNCTION_EX when calling resume function (#159281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159281
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971
2025-08-26 00:57:48 +00:00
2df9b437e3 [dynamo, nested graph breaks] implement new resume frame stack/locals/cell layout convention (#157971)
The comments/conventions are not exactly correct here, as the implementation in this PR is partial. They will be fixed in #160138.

No tests added, since there shouldn't be any overall semantic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157971
Approved by: https://github.com/anijain2305
2025-08-26 00:57:39 +00:00
4e19c1906a Get Inductor periodic CI green (#161297)
I'll file hi-pri issues for the things that need looking into.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161297
Approved by: https://github.com/angelayi
2025-08-26 00:49:49 +00:00
332fa5b388 [Inductor][Triton] Fix SCALING_ROWWISE misclassification for scalar scales (#160450)
Summary: In `tuned_scaled_mm()`, we unsqueeze any scalar scale from [] -> [1, 1]. Later, when we are determining how to set the `SCALING_ROWWISE` kernel attribute, we check whether the scale has 2 dimensions. However, since we previously unsqueezed any scalar scales, this will always evaluate to True.
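
The misclassification can be reproduced with plain shape tuples. This is only a sketch: the function names are hypothetical, and the stricter check shown is one possible way to tell the cases apart, not necessarily the check this PR uses.

```python
def unsqueeze_scalar(shape):
    """Mimic turning a scalar scale shape () into (1, 1); leave others alone."""
    return (1, 1) if len(shape) == 0 else tuple(shape)


def is_rowwise_buggy(scale_shape):
    # After unsqueezing, a scalar scale also has 2 dims, so a dimension
    # count cannot separate tensor-wise from row-wise scaling.
    return len(unsqueeze_scalar(scale_shape)) == 2


def is_rowwise_checked(scale_shape):
    # Hypothetical stricter check: 2-D and holding more than one element.
    shape = unsqueeze_scalar(scale_shape)
    numel = 1
    for dim in shape:
        numel *= dim
    return len(shape) == 2 and numel > 1


print(is_rowwise_buggy(()))          # scalar scale wrongly treated as row-wise
print(is_rowwise_checked(()))        # scalar scale treated as tensor-wise
print(is_rowwise_checked((128, 1)))  # one scale per row treated as row-wise
```

A real check would also have to decide how to treat a genuine 1x1 row-wise scale, which this sketch classifies as tensor-wise.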

Test Plan:
Run the following tests in test/inductor/test_fp8.py:
test_tensorwise_scaling_tma_template
test_rowwise_scaling_tma_template

Rollback Plan:

Differential Revision: D80108117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160450
Approved by: https://github.com/eellison
2025-08-26 00:24:55 +00:00
b9e9e92817 Increase timeout value when pushing to ghcr.io (#161444)
Seeing this time out a lot in trunk now https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047. The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444
Approved by: https://github.com/atalman
2025-08-25 23:52:59 +00:00
e6aa7287f8 [pytorch] Leverage unordered_map.try_emplace() to simplify code (#161388)
Summary: Because [`unordered_map.try_emplace()`](https://en.cppreference.com/w/cpp/container/unordered_map/try_emplace.html) does not invoke the value's constructor if the key already exists, this matches the previous behavior of checking for the key's existence first and only then instantiating the value.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80916349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161388
Approved by: https://github.com/janeyx99
2025-08-25 23:33:59 +00:00
94b9569c4a Forward fix periodic vision build (#161408)
Trying to forward fix https://github.com/pytorch/pytorch/issues/161358 by using the SM 80 architecture by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161408
Approved by: https://github.com/zou3519, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-25 23:28:22 +00:00
2cf7ac2fb7 Issue 160495 inductor complex float (#160736)
Avoiding the call to tensor.view(tensor.real.dtype) when tensor.ndim == 0 fixes the issue; a reshape is called instead. Fixes #160495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160736
Approved by: https://github.com/ngimel
2025-08-25 23:23:13 +00:00
447d34b5f8 [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)
convert_frame.compile_frame used to take a callback transform function that captured the frame object, so the frame information was not passed directly into the compile_frame function.

This PR changes the signature of compile_frame so that frame information is passed directly into the function instead of through a callback. This makes it easier to build a fullgraph capture API on top of compile_frame.
@exported-using-ghexport

Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900
Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305
2025-08-25 23:16:21 +00:00
b2e06e0194 [Inductor] Prune configs that require more shared memory than the hardware limit (#161040)
Summary:
This diff removes configs that require more shared memory than the hardware limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
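
A minimal sketch of this kind of pruning, assuming a simple estimate in which each pipeline stage holds one tile of A and one tile of B. The `MatmulConfig` class, the estimate formula, and the helper names are assumptions for illustration, not Inductor's actual implementation. The tile sizes in the first config are chosen only so that the estimate equals the `Required: 327680` figure in the error above; the source does not say which config produced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatmulConfig:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_shared_memory(config, dtype_bytes=2):
    """Rough estimate in bytes: one A tile and one B tile per pipeline stage."""
    tile_elems = (
        config.block_m * config.block_k + config.block_k * config.block_n
    )
    return tile_elems * dtype_bytes * config.num_stages


def prune_configs(configs, hardware_limit, dtype_bytes=2):
    """Keep only configs whose estimate fits within the hardware limit."""
    return [
        config
        for config in configs
        if estimated_shared_memory(config, dtype_bytes) <= hardware_limit
    ]


configs = [
    MatmulConfig(128, 128, 128, num_stages=5),  # estimate: 327680 bytes
    MatmulConfig(128, 128, 128, num_stages=3),  # estimate: 196608 bytes
    MatmulConfig(64, 64, 32, num_stages=4),     # estimate: 32768 bytes
]
kept = prune_configs(configs, hardware_limit=232448)
print(kept)
```

Filtering before autotuning means an over-budget config is never handed to the compiler, so it cannot trigger the out-of-resource error.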

Test Plan:
```
buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr
```

Rollback Plan:

Differential Revision: D80594562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040
Approved by: https://github.com/eellison
2025-08-25 23:09:09 +00:00
fc69c2bc67 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated.
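
The description does not spell out when int64 indexing is needed. A common criterion is whether the largest element offset of a tensor exceeds the int32 range, and the sketch below works under that assumption. The helper names are hypothetical, and shapes are assumed to be non-empty with non-negative strides.

```python
INT32_MAX = 2**31 - 1


def contiguous_strides(shape):
    """Row-major strides, in elements, for a contiguous tensor."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def max_linear_offset(shape, strides):
    """Largest element offset reachable in a non-empty strided tensor."""
    return sum((size - 1) * stride for size, stride in zip(shape, strides))


def needs_int64_indexing(shape, strides=None):
    """True when some element offset does not fit in a signed 32-bit integer."""
    if strides is None:
        strides = contiguous_strides(shape)
    return max_linear_offset(shape, strides) > INT32_MAX


# (B, H, M, D) shapes; the second one has 2**32 elements.
print(needs_int64_indexing((2, 16, 32768, 128)))
print(needs_int64_indexing((32, 32, 32768, 128)))
```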

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf.
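
One way to read that sign convention is as the relative difference (A - B) / A. The exact formula behind the comparison is not stated here, so the sketch below is an assumption. The two sample rows are copied from the full perf table.

```python
def relative_delta(tflops_a, tflops_b):
    """Return (A - B) / A as a percentage; negative means B is faster."""
    return (tflops_a - tflops_b) / tflops_a * 100.0


# (attn_type, shape, TFlops Version A, TFlops Version B)
rows = [
    ("noop", (2, 16, 1024, 16, 1024, 64), 258.38834144791923, 258.6353685004612),
    ("causal", (2, 16, 1024, 16, 1024, 64), 142.2192450677751, 140.12393320464972),
]

for attn_type, shape, tflops_a, tflops_b in rows:
    delta = relative_delta(tflops_a, tflops_b)
    verdict = "improves" if delta < 0 else "regresses"
    print(f"{attn_type} {shape}: {delta:+.2f}% ({verdict})")
```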

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-25 22:51:00 +00:00
adecb0c9e8 [Cutlass-EVT] Fix buffer size issues (#161335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161335
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #161398
2025-08-25 22:08:30 +00:00
d57c79e609 [Cutlass] Fix regression from f7ad69f (#161398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161398
Approved by: https://github.com/henrylhtsang
2025-08-25 22:08:30 +00:00
1a566c4909 Remove Python 3.9 nightly builds (#161427)
Please see https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161427
Approved by: https://github.com/huydhn
2025-08-25 22:05:40 +00:00
37a34022b5 [Pattern Matcher] improve error msg (#161423)
Updates pattern matcher error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161423
Approved by: https://github.com/mengluy0125, https://github.com/masnesral
2025-08-25 21:48:54 +00:00
763053dc53 Always run OIDC auth on B200 to be able to upload artifacts to S3 (#161436)
Reported by @drisspg: in its current form, the OIDC auth step wasn't run when the previous test step failed. We need it to always run so that artifacts can be uploaded to S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161436
Approved by: https://github.com/nWEIdia, https://github.com/drisspg
2025-08-25 21:05:20 +00:00
cf94cadbee [CUDAGraph] Add getter for cuda graph exec (#161294)
This is far simpler than #155164 since we never destroy the cudaGraphExec_t.

The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear.

This can work if your attention kernel can work for arbitrary shapes (this is not the case for all attention implementations! Many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SM's (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work.

Note that this requires knowing the exact location, within your kernel's parameter buffer, of the value to mutate. It won't work with arbitrary stream capture code whose kernels you don't know beforehand. So I expect this code path to be rarely used.

Testing:

```
pytest -s -k raw_graph_exec test/test_cuda.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy
2025-08-25 20:57:37 +00:00
995397d47a [inductor] structured-log graph execution order + test (#160448)
Summary:

- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
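
The reconstruction this enables can be sketched as follows. This is a minimal illustration, not TLParse's actual code: only the artifact name `inductor_graph_execution` and the `{"graph": "graph_<id>"}` payload come from the description above. The record envelope (`name` and `payload` keys) and the helper `execution_order` are assumptions made for the example.

```python
import json

# Hypothetical structured-log records, in the order they were emitted.
# Only the artifact name and the {"graph": ...} payload shape come from the
# PR description; the "name"/"payload" envelope is assumed for illustration.
records = [
    {"name": "inductor_graph_execution", "payload": json.dumps({"graph": "graph_0"})},
    {"name": "some_other_artifact", "payload": json.dumps({"detail": "ignored"})},
    {"name": "inductor_graph_execution", "payload": json.dumps({"graph": "graph_1"})},
    {"name": "inductor_graph_execution", "payload": json.dumps({"graph": "graph_0"})},
]


def execution_order(log_records):
    """Return graph names in the order their executions were logged."""
    order = []
    for record in log_records:
        if record["name"] != "inductor_graph_execution":
            continue  # skip unrelated artifacts
        order.append(json.loads(record["payload"])["graph"])
    return order


print(execution_order(records))
```

Because one record is emitted per execution, a graph that runs twice appears twice in the reconstructed order.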

Testing:
- Add inline test to verify structure and output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
2025-08-25 20:12:18 +00:00
ffa1ce7650 Fix the parity of original and exported module parameters (#160600)
## Problem
This PR fixes a parameter-name mismatch during `torch.export` in strict mode (see the "How to reproduce the issue?" section below).

When two attributes map to the same tensor, strict mode will:
1. Build a standard param/buffer table to standardize names. The bug happens [here](f861dc1826/torch/export/_trace.py (L356)): when two parameters have the same `id(param)`, the later name overwrites the earlier one.
2. [Update](f861dc1826/torch/export/_trace.py (L1481)) the exported signature with the standardized FQN, which is now the wrong name.
3. On `exported_program.module()`, call [_unlift_exported_program_lifted_states](f861dc1826/torch/export/exported_program.py (L1297)) to recover attributes from the exported signature, where the parameter name is defined and standardized.

As a result, `named_parameters()` of the resulting module reports the overwriting name instead of the original one.

## How to reproduce the issue?
Reproduction shared by @taotaohuang001.

torch version: 2.8.0
```python
import torch
from torch import nn

# ---- Toy model with embedding weight sharing (aliasing) ----
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layers = nn.ModuleDict()
        tbl = nn.Embedding(100, 8)
        self.embedding_layers["ActorId"] = tbl
        # Alias: reuse the SAME module instance for another feature
        self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"]
        self.proj = nn.Linear(16, 1)

    def forward(self, feats: dict[str, torch.Tensor]):
        e1 = self.embedding_layers["ActorId"](feats["ActorId"])
        e2 = self.embedding_layers["RootActorId"](feats["RootActorId"])
        return self.proj(torch.cat([e1, e2], dim=-1))

torch.manual_seed(0)

m = Toy().eval()

# Show pre-export parameter names (canonicalized; shared weight appears once)
print("PRE-EXPORT named_parameters:")
print([name for name, _ in m.named_parameters()])

# Sanity: the two feature names point to the same weight object
w1 = m.embedding_layers["ActorId"].weight
w2 = m.embedding_layers["RootActorId"].weight
print("PRE-EXPORT alias -> same object:", w1 is w2, "| same storage:", w1.data_ptr() == w2.data_ptr())

# Example inputs (dict structure will be captured by export)
ex_in = {
    "ActorId":     torch.randint(0, 100, (4,)),
    "RootActorId": torch.randint(0, 100, (4,)),
}

# ---- Export (in memory) and materialize the runnable module ----
ep = torch.export.export(m, (ex_in,), strict=True)
gm = ep.module()  # GraphModule with new (canonical) parameter names

print("\nPOST-EXPORT named_parameters (GraphModule):")
post_names = [name for name, _ in gm.named_parameters()]
print(post_names)

# Prove alias persists after export: run fwd/bwd and check a single grad tensor exists
out = gm(ex_in).sum()
out.backward()

# Find the embedding weight in the exported module by shape (100, 8)
emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])]
print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>")

# Show that only one grad exists for the shared table
for name, p in gm.named_parameters():
    if p.grad is not None and p.shape == torch.Size([100, 8]):
        print("Grad present on shared embedding weight:", name, "| grad shape:", tuple(p.grad.shape))
        break

```

The parameter names differ before and after export:
```
PRE-EXPORT named_parameters:
['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias']
PRE-EXPORT alias -> same object: True | same storage: True

POST-EXPORT named_parameters (GraphModule):
['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias']

Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight
Grad present on shared embedding weight: embedding_layers.RootActorId.weight | grad shape: (100, 8)

```
## Solution
The fix ensures that a later-named alias does not overwrite an entry in `param_buffer_table` when one of the original model's named parameters already maps to that parameter.
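
The first-name-wins rule can be sketched with plain dictionaries. This is an illustration only: `Param`, `build_table`, and the `keep_first` switch are hypothetical stand-ins, and the real table-building code in `torch/export/_trace.py` may be structured differently.

```python
class Param:
    """Stand-in for a parameter tensor; only object identity matters here."""


def build_table(named_params, keep_first):
    """Map id(param) -> fully qualified name for every (name, param) pair."""
    table = {}
    for name, param in named_params:
        if keep_first:
            # Fixed behavior: an alias seen later never replaces the first name.
            table.setdefault(id(param), name)
        else:
            # Old behavior: the last alias silently overwrites earlier names.
            table[id(param)] = name
    return table


shared = Param()
pairs = [
    ("embedding_layers.ActorId.weight", shared),
    ("embedding_layers.RootActorId.weight", shared),  # alias of the same object
]

print(build_table(pairs, keep_first=False)[id(shared)])
print(build_table(pairs, keep_first=True)[id(shared)])
```

With `keep_first=False` the alias name wins, which mirrors the post-export output in the reproduction; with `keep_first=True` the original name is preserved.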
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160600
Approved by: https://github.com/angelayi
2025-08-25 19:40:06 +00:00
3e210f90c2 Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)"
This reverts commit 1113e7de30da95973c1eac7921601f9a0e94f2db.

Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to executorch failure ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3221372096))
2025-08-25 18:56:18 +00:00
660b5656a4 Inline is_read_only_alias_match in _correct_storage_aliasing (#161285)
Drives down the overhead of return_and_correct_storage_aliasing slightly. Hopefully you'll agree it doesn't compromise readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161285
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240, #161284
2025-08-25 18:35:21 +00:00
0e0bb4f1fd Remove unnecessary len() call in _correct_storage_aliasing.is_read_only_alias_match (#161284)
Containers are truthy iff they're non-empty.
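
A short illustration of the idiom, with a made-up helper name `has_items`. It holds for built-in containers such as `list`, `tuple`, `dict`, and `set`; objects that define their own `__bool__` (multi-element tensors, for example) can behave differently, so this is not a blanket rule.

```python
def has_items(container):
    """Equivalent to len(container) > 0 for built-in containers."""
    return bool(container)


# Truthiness tracks emptiness, not the truthiness of the elements inside.
for value in ([], [0], (), (None,), {}, {"k": 1}, set(), {0}):
    assert has_items(value) == (len(value) > 0)
```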
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161284
Approved by: https://github.com/Skylion007, https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240
2025-08-25 18:35:21 +00:00
b048f0e189 Improve efficiency of _python_dispatch.return_and_correct_aliasing (#161240)
get_write_alias() call count reduction explained briefly in code comment.

We don't need to check write_aliases against None in the final outs_to_return calculation because we just did that check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161240
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235
2025-08-25 18:35:21 +00:00
c35538d3c5 Minor cleanup of DeviceMesh.__eq__ (#161235)
`self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161235
Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj
ghstack dependencies: #161231, #161234
2025-08-25 18:35:21 +00:00
cfafd98c53 Use comparison key in OpSchema to avoid duplicate work between __hash__ and __eq__ (#161234)
The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from #160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).
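
The general technique can be sketched as below. This is not the actual `OpSchema` code: the `Schema` class, its fields, and the `_comparison_key` attribute name are assumptions chosen for illustration. The idea is to pay the normalization cost once in `__post_init__` so that `__hash__` and `__eq__` both reduce to cheap operations on a precomputed tuple, which matches the cost shift described above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: we supply __eq__ and __hash__ ourselves
class Schema:
    """Toy stand-in for an op schema used as a dict key."""

    op: str
    args: tuple
    # Computed once per instance; not an __init__ argument.
    _comparison_key: tuple = field(init=False, repr=False)

    def __post_init__(self):
        # Do the potentially expensive normalization a single time.
        self._comparison_key = (self.op, tuple(self.args))

    def __hash__(self):
        return hash(self._comparison_key)

    def __eq__(self, other):
        if not isinstance(other, Schema):
            return NotImplemented
        return self._comparison_key == other._comparison_key


cache = {Schema("aten.add", (1, 2)): "cached"}
print(cache[Schema("aten.add", (1, 2))])
```

Deriving both methods from the same key also keeps them consistent with each other, so equal objects always hash equally.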

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161234
Approved by: https://github.com/wconstab
ghstack dependencies: #161231
2025-08-25 18:35:21 +00:00
5d6434b132 Fix OpSchema equality check (#161231)
`__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and
it looks like attention was paid to hash, so I made comparison follow
suit).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161231
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/zpcore
2025-08-25 18:35:21 +00:00
2f0de0ff93 [Inductor] Update Intel Triton for PyTorch 2.9. (#161050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161050
Approved by: https://github.com/anmyachev, https://github.com/EikanWang, https://github.com/jansel
2025-08-25 17:18:19 +00:00
c081481bbe [aoti-fx] Output OpOverload fallbacks (#161195)
Updates the inductor-wrapper-fxir code to use the kernel.op_overload when generating extern kernel calls. This way we can keep the IR consistent with using ATen ops.

TODO: we're also inserting torch.empty_strided calls -- need to turn this into aten too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161195
Approved by: https://github.com/blaine-rister
2025-08-25 17:03:05 +00:00
df571ae7ad Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)"
This reverts commit 3ea6cc8c2d443d6104159d50e8328c144f6caa39.

Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/jeffdaily due to breaks ROCm, AttributeError: 'torch._C._CudaDeviceProperties' object has no attribute 'shared_memory_per_block_optin' ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3220989480))
2025-08-25 16:50:03 +00:00
9e1c954134 [dynamo] Pass requires_grad to nn.Parameter construction (#161364)
Fixes https://github.com/pytorch/pytorch/issues/161191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161364
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi
2025-08-25 16:49:28 +00:00
83283ce7f5 docstring_linter: Fix #151692 and other issues (#156596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156596
Approved by: https://github.com/eellison
2025-08-25 16:04:14 +00:00
ab8d60f4c8 [ROCm] Unroll loads in global_reduce (#161181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161181
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-25 15:45:49 +00:00
af3265d20f [BE][CI] fix pkg=<pin> to pkg==<pin> in pip requirement specs (#160811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160811
Approved by: https://github.com/seemethere
2025-08-25 15:31:21 +00:00
f391afe9bf [cuDNN][convolution] remove redundant conv3d 64bit test (#161177)
turns out it's the same as
```
    @onlyCUDA
    @largeTensorTest("40GB")
    @largeTensorTest("24GB", "cpu")
    @tf32_on_and_off(0.005)
    def test_conv3d_64bit_indexing(self, device):
        x = torch.rand(1, 32, 512, 512, 256)
        m = torch.nn.Conv3d(32, 1, kernel_size=1, padding=0, stride=1, bias=False)
        yref = m(x)
        y = m.to(device=device)(x.to(device=device))
        self.assertEqual(yref, y)
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161177
Approved by: https://github.com/Skylion007
2025-08-25 15:01:05 +00:00
1113e7de30 [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)
`convert_frame.compile_frame` used to take a transform callback that captured the frame object, so the frame information was never passed directly into `compile_frame`.

This PR changes the signature of `compile_frame` so that frame information is passed in directly, without a callback. This makes it easier to build a fullgraph capture API on top of `compile_frame`.
@exported-using-ghexport

Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900
Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305
2025-08-25 14:53:54 +00:00
40c0e700a4 Revert "[AMD] Fix AMD User Defined Kernel Autotune (#160671)"
This reverts commit 431846a6323c6f1d02da49e311ac694324f386f4.

Reverted https://github.com/pytorch/pytorch/pull/160671 on behalf of https://github.com/atalman due to new test is failing: inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_rocm_triton_autotuning_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17172795679/job/48725235301) [HUD commit link](431846a632) ([comment](https://github.com/pytorch/pytorch/pull/160671#issuecomment-3220442141))
2025-08-25 14:07:48 +00:00
510825e5fe Optimize dynamo typing (#147499)
Optimize dynamo methods type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147499
Approved by: https://github.com/anijain2305
2025-08-25 13:20:45 +00:00
ab7787fb82 Revert "[inductor] Windows inductor use intel-openmp. (#160258)"
This reverts commit 41673110cd7c5960824cc74a6fcaeda1a8bc7a23.

Reverted https://github.com/pytorch/pytorch/pull/160258 on behalf of https://github.com/malfet due to Reverting to fix https://github.com/pytorch/pytorch/issues/160898 and https://github.com/pytorch/pytorch/issues/160962 ([comment](https://github.com/pytorch/pytorch/pull/160258#issuecomment-3220158145))
2025-08-25 12:57:47 +00:00
1eccfb157a Revert "[BE] Remove intel-openmp dependency in setup.py (#160976)"
This reverts commit e4839470470168648dee5997f57347bb8541ea2b.

Reverted https://github.com/pytorch/pytorch/pull/160976 on behalf of https://github.com/malfet due to This PR is doing something strange ([comment](https://github.com/pytorch/pytorch/pull/160976#issuecomment-3220120462))
2025-08-25 12:46:12 +00:00
4651aaac47 Fix typo: 'complext' (#160335)
minor fix for a typo: `complext` to `complex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160335
Approved by: https://github.com/Skylion007
2025-08-25 10:37:59 +00:00
037c43d3b2 [tgif] fix getattr_recursive with ModuleList (#161204)
Summary: This change updates `getattr_recursive`  to handle qualnames with ModuleList that contain digit indices, for example, `op_instances.1.value_model.feature_weights`
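
A minimal sketch of the lookup described above, using plain Python objects and lists in place of modules and `ModuleList`. The function body is an assumption about the general approach, not the code from this change.

```python
from types import SimpleNamespace


def getattr_recursive(obj, qualname):
    """Resolve a dotted name, treating all-digit parts as sequence indices."""
    for part in qualname.split("."):
        if part.isdigit() and isinstance(obj, (list, tuple)):
            obj = obj[int(part)]  # e.g. "1" selects the second list element
        else:
            obj = getattr(obj, part)
    return obj


# Plain-object stand-in for a module tree containing a list of submodules.
root = SimpleNamespace(
    op_instances=[
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w0")),
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w1")),
    ]
)
print(getattr_recursive(root, "op_instances.1.value_model.feature_weights"))
```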

Test Plan:
TBA

Rollback Plan:

Reviewed By: jiayisuse

Differential Revision: D80503985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161204
Approved by: https://github.com/jiayisuse
2025-08-25 10:08:47 +00:00
eb5549a431 xpu: fix cpp_extension compatibility with oneapi dpc++ 2025.2 compiler (#161012)
The Intel oneAPI DPC++ compiler has changed (fixed) how it parses the `-fsycl-host-compiler-options` option with respect to arguments containing escaped quotes. This commit adds branches that depend on the compiler version.

Fixes: https://github.com/intel/torch-xpu-ops/issues/1938

CC: @chuanqi129 @EikanWang @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161012
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-25 09:29:53 +00:00
56ebed627a [OpenReg] Add OSX/Windows Support for OpenReg (#159441)
As the title stated.

**Changes:**

- Abstract platform-specific APIs
- Add OSX/Windows support
- Set default symbol visibility to "hidden"

Co-authored-by: @can-gaa-hou

Original PR:https://github.com/pytorch/pytorch/pull/159029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159441
Approved by: https://github.com/albanD

Co-authored-by: jiahaochen666 <jiahaochen535@gmail.com>
2025-08-25 08:03:27 +00:00
80df27a612 port distributed pipeline test files for Intel GPU (#159033)
This PR ports all distributed pipeline test files to Intel GPU.
Intel GPU support is enabled with the following methods, keeping the original code style as far as possible:

1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get backend
5. enable XPU for some test paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/kwen2501
2025-08-25 05:24:27 +00:00
e3d68dfae2 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.
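
The difference between the two semantics can be shown by analogy with the standard library's `random.Random`, used here as a stand-in for a torch generator. The classes `PrivateCopySampler` and `SharedSampler` are hypothetical and do not reflect DTensor's implementation; they only model "keep a private copy of the state" versus "evolve the publicly visible generator".

```python
import random


class PrivateCopySampler:
    """Old-style behavior: snapshot the generator state on first use."""

    def __init__(self, generator):
        self._public = generator
        self._private = None

    def sample(self):
        if self._private is None:
            self._private = random.Random()
            self._private.setstate(self._public.getstate())
        return self._private.random()


class SharedSampler:
    """New-style behavior: draw from the publicly visible generator."""

    def __init__(self, generator):
        self._public = generator

    def sample(self):
        return self._public.random()


expected = random.Random(123).random()  # first draw after seeding with 123

gen_old = random.Random(0)
old = PrivateCopySampler(gen_old)
old.sample()        # the private snapshot is taken here
gen_old.seed(123)   # reseeding later does not reach the private copy
old_matches = old.sample() == expected

gen_new = random.Random(0)
new = SharedSampler(gen_new)
new.sample()
gen_new.seed(123)   # reseeding at any point takes effect
new_matches = new.sample() == expected

print(old_matches, new_matches)
```

In the shared model, seeding is the caller's job, which corresponds to the stated downside that users must seed their generator consistently on all ranks.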

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-25 04:21:19 +00:00
726dce3c94 [nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks (#161238)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161238
Approved by: https://github.com/kwen2501, https://github.com/syed-ahmed
2025-08-25 03:09:32 +00:00
74280d0913 [muon] Introduce Muon optimizer to PyTorch (#160213)
A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.

This implementation maintains a minimalist API and is consistent with other optimizer conventions. The PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.

**Usage**

```python
    model = MyModelForCausalLM
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params=muon_params,
        lr=lr,
        weight_decay=wd,
    )
    adamw = AdamW(
        params=adamw_params,
        lr=lr,
        weight_decay=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
```

~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~

```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```

As discussed with the team and in the comments, we prefer a simpler and cleaner interface, so we removed the callback interface and standardized on the original Newton-Schulz (NS) algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.

By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.
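
For readers unfamiliar with the iteration, here is a dependency-free sketch of the quintic Newton-Schulz step as described in Keller Jordan's post, written with plain Python lists so it runs anywhere. It is not the code added by this PR: the helper names, the `eps` value, and the list-of-rows matrix representation are assumptions for illustration.

```python
import math

# Quintic coefficients from Keller Jordan's Muon post.
COEFFICIENTS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix (push singular values toward 1)."""
    a, b, c = COEFFICIENTS
    # Normalize by the Frobenius norm so all singular values are at most 1.
    norm = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / (norm + eps) for v in row] for row in grad]
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # A = X X^T
        gram2 = matmul(gram, gram)       # A A
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        update = matmul(poly, x)         # (bA + cA^2) X
        x = [[a * v + u for v, u in zip(r1, r2)] for r1, r2 in zip(x, update)]
    return x


result = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
for row in result:
    print([round(v, 3) for v in row])
```

With these coefficients the singular values do not converge exactly to 1; they land in a band around 1, which the blog post argues is sufficient in practice.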

**Testing**

~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~

As discussed, in order not to complicate the codebase, we prefer not to include a reference implementation in PyTorch. We also updated the interface so we don't need to test the FQN-based filtering. Muon is covered by the existing `test_optim.py` unit test.

2. End-to-end test: we added a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch and compared the resulting loss curve against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />

**Numerics**
We compare our implementation against an existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.

As expected, the numerical difference comes mainly from `adjust_lr`: a maximum relative difference of ~5% in the example unit test setup below.

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

As explained above, including `adjust_lr` is preferable. This is validated by e2e training runs of a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />

**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />

**Note**
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to orthogonalize 2D matrices, preserving this assumption keeps the implementation clean and sound.

**Next Steps**

1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.

****

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
2025-08-24 08:03:04 +00:00
1de4540449 Use -compress-mode=size for CUDA 13 build for binary size reduction (#161316)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13 added the support for --compress-mode flag for nvcc across all drivers of CUDA 13.X toolkits, enabling the possibility to use --compress-mode=size for significant size reduction (~71% less for CUDA Math APIs for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why this is added for CUDA 13 only, quoting @ptrblck: any usage of --compress-mode=size/balance will drop support for older CUDA drivers and bump the minimum driver requirement to CUDA 12.4. https://github.com/pytorch/pytorch/pull/157791#issuecomment-3058027353

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.

Related - https://github.com/pytorch/pytorch/pull/157791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-08-24 03:28:29 +00:00
3e5b021f21 [ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)
This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, sparse matrix operations on CPU are available only through MKL. Because MKL is not available on `aarch64`, these ops are unavailable on machines with ARM-based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses the issue by using Eigen as a backend for the ops above.

This is a refactored version of my previous PR #101814. The main difference is that this one does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
2025-08-23 19:03:55 +00:00
4acdbb8311 [MPS] Fix index_copy for strided indices (#161333)
By passing strides to strided variant of the tensor

Fixes https://github.com/pytorch/pytorch/issues/160993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161333
Approved by: https://github.com/huydhn, https://github.com/wdvr
ghstack dependencies: #161206, #161267
2025-08-23 14:38:57 +00:00
f912c93344 Revert "Move non inductor workflows to Python 3.9 -> 3.10 (#161182)"
This reverts commit e20f6d798606f3245686e950c43635bbe526232d.

Reverted https://github.com/pytorch/pytorch/pull/161182 on behalf of https://github.com/zou3519 due to broke dynamo_wrapped tests, those are a bit finicky to fix (there is probably more than one failure!) ([comment](https://github.com/pytorch/pytorch/pull/161182#issuecomment-3216953097))
2025-08-23 13:00:42 +00:00
33346b5814 Support NUMA Binding for Callable Entrypoints, Take 2 (#161183)
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)

Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that works for *both* `str` and `Callable`.

This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths
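
The set-spawn-restore sequence described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `temporary_cpu_affinity`, `spawn_with_affinity`, and `demo` are hypothetical names, and the sketch assumes a Linux host where `os.sched_getaffinity` and `os.sched_setaffinity` are available.

```python
import contextlib
import os
import subprocess
import sys


@contextlib.contextmanager
def temporary_cpu_affinity(cpus):
    """Bind the calling process to `cpus`, then restore the previous affinity."""
    original = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)


def spawn_with_affinity(cpus, argv):
    """Spawn `argv` while the parent is bound to `cpus`.

    A child process inherits the CPU affinity of its parent at creation
    time, so it keeps the binding after the parent restores its own.
    """
    with temporary_cpu_affinity(cpus):
        return subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)


def demo():
    """Bind a child to the first allowed CPU and report what the child sees."""
    allowed = sorted(os.sched_getaffinity(0))
    child = spawn_with_affinity(
        {allowed[0]},
        [sys.executable, "-c", "import os; print(sorted(os.sched_getaffinity(0)))"],
    )
    child_view, _ = child.communicate()
    parent_restored = sorted(os.sched_getaffinity(0)) == allowed
    return child_view.strip(), str([allowed[0]]), parent_restored
```

Because the binding is applied in the parent rather than by wrapping a command line, the same helper works whether the child is started from a command string or from a Python callable.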

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
2025-08-23 07:23:22 +00:00
431846a632 [AMD] Fix AMD User Defined Kernel Autotune (#160671)
Summary: AMD-specific kwargs need to be removed from the guard; otherwise a `KeyError` will be raised when executing the kernel.

Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.

Rollback Plan:

Differential Revision: D80285441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
2025-08-23 07:23:09 +00:00
cd31be28ec Reland D80238201: [Torch.Export] Add flat arg paths in error message (#160919)
Summary:
[The diff was reverted due to CLA error, in the process of retrieving account]
Previous error message
```
RuntimeError: Expected input at *args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
New error message
```
RuntimeError: Expected input at *args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
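
To illustrate how a flat argument path such as `[0].supervision_input.weight` can be derived from nested inputs, here is a small standalone sketch. It is not the `torch.export` implementation: `TensorStub`, `flat_arg_paths`, and `shape_mismatch_message` are hypothetical helpers, and the path syntax is only modelled on the message shown above.

```python
from dataclasses import dataclass


@dataclass
class TensorStub:
    """Stand-in for a tensor; only the shape matters for this sketch."""
    shape: tuple


def flat_arg_paths(obj, prefix=""):
    """Yield (path, leaf) pairs for nested dicts, lists, and tuples."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flat_arg_paths(value, path)
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            path = f"{prefix}.[{index}]" if prefix else f"[{index}]"
            yield from flat_arg_paths(value, path)
    else:
        yield prefix, obj


def shape_mismatch_message(args, path, dim, expected):
    """Return an error string naming the offending input, or None if it matches."""
    leaves = dict(flat_arg_paths(args))
    actual = leaves[path].shape[dim]
    if actual == expected:
        return None
    return (
        f"Expected input at *args.{path}.shape[{dim}] "
        f"to be equal to {expected}, but got {actual}."
    )


example_args = [{"supervision_input": {"weight": TensorStub((7680, 128))}}]
message = shape_mismatch_message(example_args, "[0].supervision_input.weight", 0, 4096)
```

With a path attached to every leaf, the message can name the exact input instead of falling back to `<unknown location>`.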

Test Plan:
```
buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test
```
https://www.internalfb.com/intern/testinfra/testrun/4785074906254375

```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```

```
Ran 413 tests in 208.414s

OK (skipped=1, expected failures=13)
```

Rollback Plan:

Differential Revision: D80487367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160919
Approved by: https://github.com/angelayi
2025-08-23 07:20:58 +00:00
710514a2a5 Revert "Enable output padding when only outermost dim is dynamic (#159404)"
This reverts commit f15ada5c6fad97a7dcbfa4673f067b6942dda640.

Reverted https://github.com/pytorch/pytorch/pull/159404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159404#issuecomment-3216517032))
2025-08-23 07:17:30 +00:00
22df59efc0 [inductor] add MSVC language pack check. (#161298)
Check MSVC's language pack: https://github.com/pytorch/pytorch/issues/157673#issuecomment-3051682766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161298
Approved by: https://github.com/angelayi
2025-08-23 07:06:48 +00:00
3a4140bf8e [FlexAttention] fixing learnable bias assertion error in inductor (#161170)
Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (#157677)

We traced the root cause to the registration of subgraph buffers, which caused inconsistencies in the naming and ultimately incorrect retrieval later on. This problem only arose if the model was compiled as a whole (i.e., using `@torch.compile`), since only then would there be naming conflicts.

In this PR, we register the buffers with the base graph to solve this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161170
Approved by: https://github.com/drisspg
2025-08-23 06:24:22 +00:00
6443ea337d enable more tests (#161192)
Enable more vLLM tests against PyTorch main, and add a schedule to run them every 12 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161192
Approved by: https://github.com/huydhn
2025-08-23 06:01:22 +00:00
36ac916929 [ONNX] Fix lower opset version support in dynamo=True (#161056)
After we switched to constructing the registry with the specified opset version in dynamo=True, support for opset<18 was broken because there would be no torchlib ops registered for these opsets. I updated the registry creation logic to always use opset 18 if the requested opset is lower, and use the version converter (as designed) to target those opsets.
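
The opset selection rule described above can be sketched in a few lines. This is an illustration of the decision only, not the exporter's code: `plan_export` and `MIN_REGISTRY_OPSET` are hypothetical names, and the value 18 comes from the description.

```python
# Assumption taken from the description: no torchlib ops are registered
# for opsets below 18, so the registry is never built for a lower version.
MIN_REGISTRY_OPSET = 18


def plan_export(requested_opset: int) -> tuple[int, bool]:
    """Return (opset used to build the registry, whether down-conversion is needed)."""
    registry_opset = max(requested_opset, MIN_REGISTRY_OPSET)
    needs_version_conversion = requested_opset < registry_opset
    return registry_opset, needs_version_conversion
```

For a requested opset of 13, the registry would be built at 18 and the version converter would then target 13; for a requested opset of 20, no conversion step is needed.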

This requires onnxscript>=0.4 (https://github.com/pytorch/pytorch/pull/161312)

Fixes https://github.com/onnx/onnx/issues/7235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161056
Approved by: https://github.com/titaiwangms
2025-08-23 05:04:36 +00:00
7131bfab89 [vllm hash update] update the pinned vllm hash (#161227)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161227
Approved by: https://github.com/pytorchbot
2025-08-23 04:25:16 +00:00
ac8d9418ae [audio hash update] update the pinned audio hash (#161331)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161331
Approved by: https://github.com/pytorchbot
2025-08-23 04:21:03 +00:00
38a492d40d [ONNX] Remove unused _onnx_supported_ops (#161322)
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161322
Approved by: https://github.com/titaiwangms
2025-08-23 02:42:25 +00:00
394728bab2 [MPS] Update avg_pool3d kernel to use opmath_t (#161071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161071
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #161011
2025-08-23 02:36:22 +00:00
121afd6a8f [MPS] Update avg_pool2d to use Metal kernel when ceil_mode=True (#161011)
Fixes #160743

The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (0ee6e58643/avg_pool_mps/perf_2d.py) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match.

I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when  `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested.
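
For context on why `ceil_mode=True` is a distinct case, the sketch below computes the pooled output size along one dimension. It follows the output-size rule PyTorch uses for pooling as I understand it (dilation omitted); `pool_output_size` is a hypothetical helper and has nothing to do with the Metal kernel itself.

```python
def pool_output_size(input_size, kernel_size, stride, padding=0, ceil_mode=False):
    """Number of pooling windows along one dimension."""
    span = input_size + 2 * padding - kernel_size
    if not ceil_mode:
        return span // stride + 1
    out = -(-span // stride) + 1  # ceiling division without floats
    # The last window must start inside the input or the left padding;
    # otherwise it would cover only right padding and is dropped.
    if (out - 1) * stride >= input_size + padding:
        out -= 1
    return out
```

With `input_size=5`, `kernel_size=2`, `stride=2`, floor mode yields 2 windows while ceil mode yields 3, and the extra window extends past the end of the input. Handling that partial window is what distinguishes the `ceil_mode=True` cases.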
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011
Approved by: https://github.com/malfet
2025-08-23 02:36:22 +00:00
d228a776e9 [Inductor-FX] Support Tensorbox outputs (#161245)
# Problem
The FX converter previously supported graph outputs which were `StorageBox`, but not `TensorBox`. The latter seems to show up in certain cases when the output is a slice/view of the input.

# Fix
This PR generalizes the code to handle `MutableBox` instead of `StorageBox` specifically.
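
The idea of handling the common base class rather than one concrete box type can be sketched with stand-in classes. The classes below are simplified placeholders, not the real `torch._inductor.ir` types, and the assumption that both box types share `MutableBox` as a base is inferred from the wording of the fix.

```python
class MutableBox:
    """Placeholder for a box that wraps another IR node in `data`."""

    def __init__(self, data):
        self.data = data


class TensorBox(MutableBox):
    pass


class StorageBox(MutableBox):
    pass


def unwrap_output(node):
    """Peel boxes of any kind until the underlying node is reached."""
    while isinstance(node, MutableBox):
        node = node.data
    return node
```

Checking against `MutableBox` means an output wrapped as `TensorBox(StorageBox(...))` and one wrapped only as `StorageBox(...)` both resolve to the same underlying node.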

# Test
Added a CI test exposing the issue. The test case was found by intentionally breaking `TensorBox(ReinterpretView` support in https://github.com/pytorch/pytorch/pull/161258.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161245
Approved by: https://github.com/angelayi
2025-08-23 02:04:13 +00:00
cee72119b2 [Test] Adding a testcase for constant_pad_nd (#161259)
Fixes #161066

This PR adds a simple testcase for constant_pad_nd on MPS as mentioned in https://github.com/pytorch/pytorch/pull/161149#issuecomment-3211701274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161259
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-23 01:00:50 +00:00
47d267364c Revert "[SymmMem] Support rendezvous on slice of a tensor (#160825)"
This reverts commit 9d9cc9897ac44a1a8df38211b03d8342a8af48c3.

Reverted https://github.com/pytorch/pytorch/pull/160825 on behalf of https://github.com/kwen2501 due to Change of course; use storage_ptr as key ([comment](https://github.com/pytorch/pytorch/pull/160825#issuecomment-3215951048))
2025-08-22 23:41:55 +00:00
0d9da384ef Bump onnxscript to 0.4.0 in CI (#161312)
Use onnxscript apis for torch 2.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161312
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-08-22 23:23:08 +00:00
f521e82a4e Update pyrefly config for better codenav (#161200)
This fixes behavior in codenav by switching from `replace_imports_with_any` to `ignore-missing-imports`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161200
Approved by: https://github.com/aorenste, https://github.com/albanD
2025-08-22 23:05:07 +00:00
bcfe1b2d71 Add initial bc-linter configuration (#161319)
Preparation for https://github.com/pytorch/test-infra/pull/7016

Currently merging this PR is a noop change for PyTorch repo (bc-linter is not looking at the config yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161319
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-08-22 22:54:25 +00:00
419a2dbf5f [ONNX] Remove enable_fake_mode and exporter_legacy (#161222)
Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
2025-08-22 22:15:27 +00:00
3373b074f5 [Profiler] Add GC Events to Python Stack Tracer (#161209)
Summary:
Adds Python Garbage Collection to Kineto Traces and Profiler FunctionEvents. Create custom cpp callback in profiler_python.cpp. Then define a python function with cpp and register that callback for all python garbage collection. We don't worry about thread safety in this case because we are only doing init/teardown for main thread while holding GIL.

Currently we are hiding this behind experimental config because python tracing tends to be unstable especially when adding any new feature. If this is found to not add too much overhead we can set this to on by default. NOTE: To enable this you need both with_stack=True and the experimental config on!
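
As a pure-Python analogue of registering a callback for garbage collection, the sketch below uses the standard `gc.callbacks` hook to record one event per collection. It is an assumption that the profiler relies on the same hook; `GCEventRecorder` is a hypothetical class and not part of the profiler API.

```python
import gc
import time


class GCEventRecorder:
    """Record one event per garbage collection while the context is active."""

    def __init__(self):
        self.events = []
        self._start = None

    def _callback(self, phase, info):
        # CPython calls each entry of gc.callbacks with phase "start" before
        # a collection and "stop" after it; `info` describes the collection.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.events.append(
                {
                    "generation": info["generation"],
                    "collected": info["collected"],
                    "duration_s": time.perf_counter() - self._start,
                }
            )
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self._callback)
        return self

    def __exit__(self, *exc_info):
        gc.callbacks.remove(self._callback)
        return False
```

Wrapping a region in `with GCEventRecorder() as rec:` and calling `gc.collect()` inside it leaves at least one entry in `rec.events`, which is the kind of record a trace could display as a GC event.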

Test Plan:
Ran trace with GC induced and saw it on trace

Also added a test

Rollback Plan:

Differential Revision: D80491146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
2025-08-22 22:11:25 +00:00
c8bb0e4720 [MPS] Fix index_copy for scalars (#161267)
By `squeezing the input` when copying into scalar tensor from a 1d one
And enable `test_index_copy_scalars_mps`

Fixes https://github.com/pytorch/pytorch/issues/160737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161267
Approved by: https://github.com/manuelcandales, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161206
2025-08-22 21:45:34 +00:00
4c36c8a994 [dynamo] Support method calls on complex ConstantVariables (#161122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161122
Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
2025-08-22 21:40:03 +00:00
9d882fd9ff [benchmark] Add torchscript jit.trace to benchmark option (#161223)
To compare NativeRT and TorchScript, we add `torchscript-jit-trace` as a benchmark option. With this option, we can trace a model and run inference on the traced module using the TorchScript interpreter.

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
2025-08-22 21:38:28 +00:00
2835cc5e91 [cuDNN] head dim > 128 works on H100 again in cuDNN SDPA? (#161210)
reference: https://github.com/pytorch/torchtitan/pull/1610

9.10 only for now, we would want to hold off on upgrading to either cuDNN frontend 1.14+/cuDNN 9.11+ due to some head-dim > 128 handling issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161210
Approved by: https://github.com/Skylion007
2025-08-22 21:21:53 +00:00
3f1a97a99c Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 44549c7146bd6c4166f97e856037babe1b7f4f49.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/pianpwk due to this PR & internal diff landed out of sync, just reverted internal with D80720654, will revert this & reland as codev ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3215610135))
2025-08-22 20:48:46 +00:00
981ac533c6 Revert "Close some sources of fake tensor leakages (#159923)"
This reverts commit 5afa4187dfe1e99278f8e372ec09102d5b937572.

Reverted https://github.com/pytorch/pytorch/pull/159923 on behalf of https://github.com/zou3519 due to broke aoti test in inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/159923#issuecomment-3215580688))
2025-08-22 20:42:50 +00:00
3ea6cc8c2d Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)
Conv exhaustive autotuning currently throws an error, and I think it's worth adding tests for the other ops too in order to prevent regressions in exhaustive mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387
Approved by: https://github.com/coconutruben
2025-08-22 20:06:09 +00:00
2c0650a00a Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)"
This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e.

Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200))
2025-08-22 19:10:35 +00:00
eba1ad09e4 Revert "[SymmMem] Support rendezvous on view of a tensor (#160925)"
This reverts commit 9d7cecdd6c44c5421d341bcc359be4097ea9a2f5.

Reverted https://github.com/pytorch/pytorch/pull/160925 on behalf of https://github.com/kwen2501 due to Change of course: use storage ptr as symm mem keys as in the old days and force no_split in MemPool ([comment](https://github.com/pytorch/pytorch/pull/160925#issuecomment-3215315717))
2025-08-22 18:59:25 +00:00
a43480d19c [CD] Enable triton xpu Windows build for Python 3.14 (#161255)
Follow #159869
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161255
Approved by: https://github.com/atalman
2025-08-22 18:39:31 +00:00
17b0263e86 [inductor] fix march=native pass to Windows CC. (#161264)
fix march=native pass to Windows CC.

<img width="593" height="218" alt="image" src="https://github.com/user-attachments/assets/1caedffa-d9be-43d9-9ce2-590c055980cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161264
Approved by: https://github.com/angelayi
2025-08-22 18:38:51 +00:00
97200c9711 [inductor] Add get page_size support for Windows. (#161273)
`resource` can't work on Windows, as it is a Unix specific package as seen in https://docs.python.org/2/library/resource.html

Use Windows system API to get page_size.
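
A cross-platform page-size lookup along these lines might look like the sketch below. It is not the code from this PR: `get_page_size` and `_SystemInfo` are hypothetical names, and the choice of the Win32 `GetSystemInfo` call is an assumption about which system API is meant.

```python
import ctypes
import sys


class _SystemInfo(ctypes.Structure):
    """Layout of the Win32 SYSTEM_INFO struct; only dwPageSize is read here."""

    _fields_ = [
        ("wProcessorArchitecture", ctypes.c_uint16),
        ("wReserved", ctypes.c_uint16),
        ("dwPageSize", ctypes.c_uint32),
        ("lpMinimumApplicationAddress", ctypes.c_void_p),
        ("lpMaximumApplicationAddress", ctypes.c_void_p),
        ("dwActiveProcessorMask", ctypes.c_size_t),
        ("dwNumberOfProcessors", ctypes.c_uint32),
        ("dwProcessorType", ctypes.c_uint32),
        ("dwAllocationGranularity", ctypes.c_uint32),
        ("wProcessorLevel", ctypes.c_uint16),
        ("wProcessorRevision", ctypes.c_uint16),
    ]


def get_page_size() -> int:
    """Return the memory page size in bytes."""
    if sys.platform == "win32":
        info = _SystemInfo()
        ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
        return int(info.dwPageSize)
    # `resource` is Unix-only, so import it only on the non-Windows path.
    import resource

    return resource.getpagesize()
```

Deferring the `resource` import to the non-Windows branch keeps the module importable on Windows, which is the failure the description refers to.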

Tested locally:
<img width="467" height="433" alt="image" src="https://github.com/user-attachments/assets/47a39060-3aea-46c3-bd8e-35a39413c51f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161273
Approved by: https://github.com/angelayi
2025-08-22 18:36:14 +00:00
1d458e2947 Revert "[Inductor] Update Outer Reduction Heuristic (#159093)"
This reverts commit f085f299584b06a2a7d8855eda2a411313e782ad.

Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/seemethere due to this fails internal tests, see D80630416 for more info ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3215263317))
2025-08-22 18:35:36 +00:00
266784ec6a remove old while_loop_schema_gen test (#161202)
Fixes https://github.com/pytorch/pytorch/issues/141202.

This test is flaky for mysterious reasons and we have created a new way of creating schemas for hops. So delete the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161202
Approved by: https://github.com/zou3519
2025-08-22 18:22:29 +00:00
25df65afd8 [ROCm] revamp HIPCachingAllocatorMasqueradingAsCUDA (#161221)
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes usability of hipify replacing c10::cuda::CUDACachingAllocator::get(), where callers expect a CUDAAllocator to be returned but were instead getting a very thin Allocator shim.

This also fixes using cudagraph trees with torch compile. The hip:0 device was not being replaced by the cuda:0 device in all methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-22 18:13:12 +00:00
e20f6d7986 Move non inductor workflows to Python 3.9 -> 3.10 (#161182)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-08-22 16:48:43 +00:00
c2390087c3 [MPS] Fix index_select for scalar_types (#161206)
By copy-n-pasting logic from `index_select_out_cpu` (and `_cuda`), where essentially the resizing is done inside the op,  which also fixes faulty logic for scalars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161206
Approved by: https://github.com/manuelcandales
2025-08-22 16:45:35 +00:00
f09458c2e1 Enable test/test_numpy_interop.py config in mypy (#158556)
## Test Result

```bash
lintrunner --take MYPY test/test_numpy_interop.py

Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
2025-08-22 16:18:58 +00:00
7fcdd8d6af Use ROCm MI325 runners for trunk.yml (#161184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161184
Approved by: https://github.com/jeffdaily
2025-08-22 16:18:55 +00:00
c7a77470c5 Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482)"
This reverts commit d1faf2ef0476eb60b42c057baee9af0f48ae849a.

Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))
2025-08-22 15:04:28 +00:00
ce467df5d1 rm platform args xplat/langtech/mobile/BUCK (#161018)
Differential Revision: D80460691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161018
Approved by: https://github.com/drisspg
2025-08-22 14:47:36 +00:00
db44de4c0d [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. Buffers during reordering can change deallocation point, if candidate and group to swap both are users of the f_input_buf and group contains last_use_snode.

- Addressing this tracking the last_use_snode for each buffer and recomputing current memory respecting the change in size_free (group_node after reordering is not the last user of the buffer and its size_free -= buffer_size, while candidate becomes the last user and candidate.size_free += buffer_size).

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit number of collectives to reorder.
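
The difference between the two peak-memory estimates from item 1 can be shown with a toy model. This is a simplified sketch rather than the Inductor implementation: `NodeMemory`, `peak_collapsed`, and `peak_alloc_then_free` are hypothetical names, and each node is reduced to the bytes it allocates and the bytes it frees.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    size_alloc: int  # bytes allocated for the node's outputs
    size_free: int   # bytes freed after the node runs (last-use buffers)


def peak_collapsed(nodes):
    """Treat each node as a single step of size_alloc - size_free."""
    current = peak = 0
    for node in nodes:
        current += node.size_alloc - node.size_free
        peak = max(peak, current)
    return peak


def peak_alloc_then_free(nodes):
    """Measure the peak after the allocation and before the free."""
    current = peak = 0
    for node in nodes:
        current += node.size_alloc
        peak = max(peak, current)
        current -= node.size_free
    return peak


schedule = [NodeMemory(10, 0), NodeMemory(5, 10), NodeMemory(2, 5)]
```

For this schedule the collapsed estimate reports a peak of 10, while the alloc-then-free estimate reports 15, because the second node's output is live at the same time as the buffer it is about to release.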

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory is not regressing a lot, but reserved memory is significantly regressed.

Investigation and fix of "reserved" memory will be in following PRs.

BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32Gb reserved: 36Gb
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-22 14:19:57 +00:00
639b8cc51d Revert "cd: Add no-cache for test binaries (#149218)"
This reverts commit 523bffd38856dc9fca36bddded64f74822a6e1a2.

Reverted https://github.com/pytorch/pytorch/pull/149218 on behalf of https://github.com/atalman due to Lets not use no-cache flags on test binaries ([comment](https://github.com/pytorch/pytorch/pull/149218#issuecomment-3214338844))
2025-08-22 13:14:23 +00:00
49ff884b1e Add CUDA 13.0 x86 builds (#160956)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-08-22 11:31:09 +00:00
a68f63e331 Add Windows CUDA 13 build and magma script (#161073)
Add magma build 13.0 for Windows
Add cuda_install.bat 13.0 for Windows build
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161073
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-08-22 11:24:25 +00:00
774b4befa1 [BE] [dynamo] Simplify two methods in ConstDictVariable (#159361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159361
Approved by: https://github.com/anijain2305
2025-08-22 11:11:30 +00:00
2beffb3311 Refactoring TensorImpl by using constexpr and std::is_same_v (#161043)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161043
Approved by: https://github.com/Skylion007
2025-08-22 10:49:49 +00:00
9b4adc4db7 [fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568)
Adds support for FlightRecorder in ProcessGroupXCCL.

See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
2025-08-22 09:03:35 +00:00
9e491f753e [dynamo] Remove extra if statement in builder _wrap (#161215)
Removes a redundant if statement. Does not impact logic so no test changes needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161215
Approved by: https://github.com/StrongerXi
2025-08-22 08:56:06 +00:00
373e25c2eb Disable background threads for XPU host allocator (#161242)
# Motivation
https://github.com/pytorch/pytorch/pull/160505 enables background threads for the XPU host allocator. However, this hangs on Windows during program exit. Disable it until we narrow down the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161242
Approved by: https://github.com/EikanWang
2025-08-22 08:40:13 +00:00
595987d28d [bucketing] allow convert_element_type after fsdp reduce_scatter (#161159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161159
Approved by: https://github.com/eellison
2025-08-22 06:41:50 +00:00
c4670e40c9 [inductor] remove Windows unsupported build options. (#161197)
Changes:
1. Math-related build options are not supported by MSVC, so skip them on Windows.
2. Move all math-related build options into the `_get_ffast_math_flags` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161197
Approved by: https://github.com/jansel
2025-08-22 06:23:43 +00:00
9b3ebd25ac [inductor] Enable max compatible to msvc for oneAPI headers. (#161196)
Enable maximum MSVC compatibility for oneAPI headers.

The key context is `The /permissive- option is compatible with almost all of the header files from the latest Windows Kits` from https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161196
Approved by: https://github.com/jansel
2025-08-22 06:23:26 +00:00
f8bd85827d Optimize zero_grad description (#161239)
Optimize [zero_grad doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) format and description.

## Test Result

### Before

<img width="996" height="534" alt="image" src="https://github.com/user-attachments/assets/e1db973c-57e8-4525-90e7-0500cde2263d" />

### After

<img width="890" height="496" alt="image" src="https://github.com/user-attachments/assets/5579c4fb-a857-4030-9303-34770083d1a5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161239
Approved by: https://github.com/janeyx99
2025-08-22 06:18:25 +00:00
bc7eaa0d8a [BE] Remove the default TORCH_CUDA_ARCH_LIST in CI Docker image (#161137)
It doesn't make sense to default this to Maxwell, which is too old.  All other places in CI/CD need to overwrite this value.  IMO, it makes more sense to not set this at all and let CI/CD jobs set it for their own use cases instead.  This is partly responsible for the build failure in https://github.com/pytorch/pytorch/issues/160988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161137
Approved by: https://github.com/msaroufim
2025-08-22 06:03:11 +00:00
0dea191ff7 [VLLM TEST]setup test workflow (#160583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160583
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-08-22 05:38:39 +00:00
8aad3a60ce [dynamo] propagate tensor metadata on Tensor.__setitem__(tensor) (#161036)
Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: 5ee464db5c/torch/_dynamo/variables/misc.py (L671)

Stared at this with @anijain2305 yesterday, `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor.

FIXES https://github.com/pytorch/pytorch/issues/160901

It should also be the root cause behind the issue in https://github.com/pytorch/torchtitan/pull/1604 @bdhirsh  @ruisizhang123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805
2025-08-22 04:43:22 +00:00
c7fb031706 [audio hash update] update the pinned audio hash (#161226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161226
Approved by: https://github.com/pytorchbot
2025-08-22 04:22:08 +00:00
c60dea5261 [export] Allow tempfile._TemporaryFileWrapper in package_pt2 (#161203)
Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`

However, it is not recognized as an allowed file format and a warning will be thrown.

Test Plan:
CI

Rollback Plan:

Differential Revision: D80740916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
2025-08-22 04:10:35 +00:00
bf8431ba06 [inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE (#159233)
Fixes #158076

Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
            &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
            &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
            &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
            static_cast<int64_t>(m_end + ((-1LL)*m_start)),
            static_cast<int64_t>(Nr),
            static_cast<int64_t>(k_end + ((-1LL)*k_start)),
            static_cast<int64_t>(196LL),
            static_cast<int64_t>(80LL),
            static_cast<int64_t>(Nc_blocks*Nr)
        );
```

However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
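
To make the double-offset pattern concrete, here is a minimal pure-Python sketch. It is a toy model only: `storage`, `w_data_ptr`, and both index helpers are hypothetical names, and it assumes the bug arises because the view's data pointer already includes the storage offset while the generated index expression adds the same offset again.

```python
# Toy model: a flat "storage" and a view W that starts `storage_offset` elements in.
storage = list(range(20))
storage_offset = 6            # where the view W begins inside storage
w_data_ptr = storage_offset   # stand-in for W's data pointer, which already includes the offset


def element_index_correct(i: int) -> int:
    # Index relative to the view's own data pointer: offset applied once.
    return w_data_ptr + i


def element_index_double_offset(i: int) -> int:
    # Bug pattern: the index expression also bakes in the offset,
    # so the offset is applied twice.
    return w_data_ptr + (storage_offset + i)


first_ok = element_index_correct(0)
first_bad = element_index_double_offset(0)
# With the doubled offset, later elements of a 10-element view fall outside storage.
in_bounds = [element_index_double_offset(i) < len(storage) for i in range(10)]
```

In this model the first element lands `2 * storage_offset` away from the start of storage, which mirrors the `2 * 200704000LL` distance described above.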

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset.

When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.

The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.

I ran CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
2025-08-22 03:47:28 +00:00
2fdd4f918c Log exception_stack_trace to dynamo_compile (#161096)
Note: Adding a unit test for this is tricky, as having errors in the specific unit test would cause test_utils.py to crash altogether.

Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n  File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n    x = 1/0\n        ~^~\nZeroDivisionError: division by zero\n']

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
2025-08-22 03:29:15 +00:00
31a41daff4 [ROCm][Windows] Include native_transformers srcs to fix link errors. (#160373)
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests  -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp  /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```

The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.

I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
2025-08-22 01:43:25 +00:00
cc791d5857 Quick fix to headers in stable/tensor_inl.h (#161168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161168
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2025-08-22 01:27:44 +00:00
be2e6b3158 [export] Remove unused Model, tensor_paths, constant_paths (#161185)
Summary:
Removed `Model`, it's not being used anywhere so it's safe.

Removed `tensor_paths` and `constant_paths` fields in `ExportedProgram`
- BC: when the current deserializer loads a previously serialized EP (that comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields
- FC: when the old deserializer loads a newly serialized EP (that doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`
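
The BC/FC reasoning above can be sketched with plain dataclasses. This is an illustration only: `OldSchema`, `NewSchema`, and `load_ignoring_unknown` are hypothetical stand-ins, not the real `ExportedProgram` schema or `_dict_to_dataclass()`. The sketch assumes the removed fields had defaults in the old schema and that unknown keys are dropped on load.

```python
from dataclasses import dataclass, field, fields


@dataclass
class OldSchema:  # hypothetical: schema that still carries the removed fields
    name: str
    tensor_paths: dict = field(default_factory=dict)
    constant_paths: dict = field(default_factory=dict)


@dataclass
class NewSchema:  # hypothetical: schema after the fields were removed
    name: str


def load_ignoring_unknown(cls, payload):
    # Keep only keys the target schema knows about; missing keys use defaults.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})


old_payload = {"name": "m", "tensor_paths": {}, "constant_paths": {}}
new_payload = {"name": "m"}

bc = load_ignoring_unknown(NewSchema, old_payload)  # new reader, old file
fc = load_ignoring_unknown(OldSchema, new_payload)  # old reader, new file
```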

Differential Revision: D80725094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad
2025-08-22 01:07:01 +00:00
a85711d565 Avoid making node a successor/predecessor of itself (#161205)
This fixes an assertion we were hitting in memory planning about the graph not being acyclic. The repro is very long, so it is hard to turn into a local test, but this change fixes the repro I am looking at.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
2025-08-22 00:30:29 +00:00
ff4f5dd8ed [nativert] oss layout planner tests (#160942)
Summary: att - changed one of the tests to get rid of torcharrow dep.

Test Plan:
```
buck2 test //caffe2/test/cpp/nativert:layout_planner_tests
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80108549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160942
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-08-22 00:26:25 +00:00
46429be723 [DCP][HF] Add option to parallelize reads in HF Storage Reader (#160205)
Parallelize reading of data behind thread_count argument to HFStorageReader
Test plan: ensure existing tests pass and run a job successfully with these changes

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160205
Approved by: https://github.com/meetv18
2025-08-21 23:58:02 +00:00
f5bf5147ad Bump uv from 0.8.4 to 0.8.6 in /.ci/lumen_cli (#161212)
Bumps [uv](https://github.com/astral-sh/uv) from 0.8.4 to 0.8.6.
- [Release notes](https://github.com/astral-sh/uv/releases)
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/uv/compare/0.8.4...0.8.6)

---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.8.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-21 15:54:34 -07:00
fc0683b1e7 Revert "[ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)"
This reverts commit ce048de608180fa88335e5821070472539968b54.

Reverted https://github.com/pytorch/pytorch/pull/155357 on behalf of https://github.com/seemethere due to This is causing buck builds to fail since we didn't add the definition of AT_USE_EIGEN_SPARSE in the buckbuild.bzl file, will follow-up and re-land this. ([comment](https://github.com/pytorch/pytorch/pull/155357#issuecomment-3212270510))
2025-08-21 22:38:40 +00:00
cb57953215 [BE] Enable test_index_put_accumulate_duplicate_indices on MPS (#161201)
By changing dtype to float if device is MPS

Note: for some reason test runs much longer on MPS than on CPU
```
% python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps
test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok

----------------------------------------------------------------------
Ran 1 test in 9.139s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161201
Approved by: https://github.com/dcci
2025-08-21 22:05:42 +00:00
f085f29958 [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Differential Revision: [D80630416](https://our.internmc.facebook.com/intern/diff/D80630416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-21 22:02:49 +00:00
d1faf2ef04 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.
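
The difference between the two semantics can be modelled with Python's standard `random` module. This is an analogy only, not DTensor code: `PrivateCopySampler` and `shared_sample` are hypothetical names, and `random.Random` stands in for a torch generator.

```python
import random


class PrivateCopySampler:
    """Toy model of the old behaviour: snapshot the generator on first use."""

    def __init__(self):
        self._copy = None

    def sample(self, generator):
        if self._copy is None:
            # The copy is taken once and then evolves independently.
            self._copy = random.Random()
            self._copy.setstate(generator.getstate())
        return self._copy.random()


def shared_sample(generator):
    # Toy model of the new behaviour: draw from, and advance, the visible generator.
    return generator.random()


gen = random.Random(0)
old = PrivateCopySampler()
old.sample(gen)          # first use takes the snapshot
gen.seed(123)            # re-seeding later...
stale = old.sample(gen)  # ...has no effect on the private copy

gen.seed(123)
fresh = shared_sample(gen)  # re-seeding takes effect immediately
```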

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-21 22:02:16 +00:00
cc2b65a91a [VLLM]setup test cli logics (#160361)
Set up the vLLM test logic:
1. Install wheels generated from the previous build stage.
2. Generate and install the vLLM test package list at run time, based on the torch wheels in the instance.
3. Run tests based on the pre-defined test plan.

Note that the test-plan format is temporary, intended for some basic vLLM testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160361
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-08-21 21:59:41 +00:00
67fc16c744 Add profiler analysis flag to combine multiple profiles into one (#161145)
Combine multiple profiles into one:
```
python profile_analysis.py --combine <file1> <file2> ... <out>
```
This only works well if they have different pids, like from different programs in a distributed run.

<img width="1521" height="465" alt="combining_multiple_profiles" src="https://github.com/user-attachments/assets/aba7112b-e9a9-4075-b82b-a4e4408384da" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161145
Approved by: https://github.com/xmfan
2025-08-21 21:36:58 +00:00
fb241d0a44 [dcp][hf] Fix multi-rank consolidation for no files to process case (#160660)
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, some ranks have no work to do if the world size has more ranks than there are safetensors files to consolidate. My earlier testing did not cover this case, and there was an extra barrier call that caused issues for the ranks with no work. Those ranks should wait at the end, as the ranks with work do.
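
A minimal sketch of the intended synchronization pattern, using threads in place of ranks. Everything here is hypothetical (the function name, the round-robin file assignment, and `threading.Barrier` standing in for a distributed barrier); the point is only that every rank, with or without work, reaches the same single barrier at the end.

```python
import threading


def consolidate_on_every_rank(world_size, files):
    barrier = threading.Barrier(world_size)
    processed = [[] for _ in range(world_size)]

    def run(rank):
        # Assumed round-robin assignment; ranks beyond len(files) get nothing.
        my_files = files[rank::world_size]
        for name in my_files:
            processed[rank].append(name)
        # No extra barrier for idle ranks: all ranks wait exactly once, here.
        barrier.wait(timeout=5)

    threads = [threading.Thread(target=run, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


result = consolidate_on_every_rank(4, ["a.safetensors", "b.safetensors"])
```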

Test Plan:
tested this case on a job e2e
added a unit test

Rollback Plan:

Differential Revision: D80273616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
2025-08-21 21:18:03 +00:00
d2b8c0d431 forward fix of #152198 (#161166)
Instances of torch._inductor.virtualized.OpsValue do not have a shape attribute, which breaks the fp8 test on ROCm. Add the OpsValue class to the todo list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161166
Approved by: https://github.com/jeffdaily
2025-08-21 21:09:48 +00:00
e25ee0290e Fix constant_pad_nd_mps bug when pad is empty (#161149)
Fixes #161066

There is a size check here, which causes the error.
8ce81bcee1/aten/src/ATen/native/mps/operations/Pad.mm (L39-L40)

If the argument `pad` is empty, it will return the cloned tensor on CPU.

8ce81bcee1/aten/src/ATen/native/PadNd.cpp (L43-L64)

Therefore, this PR fixes the empty padding argument error by checking the size first and returning a cloned tensor immediately if the padding size is 0.
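
The early-return shape of the fix can be sketched in plain Python on a 1-D list. `constant_pad_1d` is a hypothetical helper for illustration, not the MPS kernel.

```python
def constant_pad_1d(values, pad, fill=0.0):
    # Check the padding size first: an empty pad just returns a copy.
    if len(pad) == 0:
        return list(values)
    if len(pad) != 2:
        raise ValueError("expected (left, right) padding for a 1-D input")
    left, right = pad
    return [fill] * left + list(values) + [fill] * right


data = [1.0, 2.0]
unpadded = constant_pad_1d(data, [])
padded = constant_pad_1d(data, [1, 2])
```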

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161149
Approved by: https://github.com/malfet
2025-08-21 20:45:26 +00:00
5805c4210b [invoke_subgraph][inductor] Thread graphsafe rng input states for hops (#160713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160713
Approved by: https://github.com/eellison
2025-08-21 20:41:29 +00:00
db38c44ad6 [inductor] add libraries_dirs for level_zero (#161146)
Changes:
1. Append to `include_dirs` instead of setting it.
2. Append `libraries_dirs` for level_zero.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161146
Approved by: https://github.com/angelayi
2025-08-21 19:55:12 +00:00
1e3fe78a10 [inductor] disable min/max macro on Windows. (#161133)
Disable min/max macro on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161133
Approved by: https://github.com/angelayi
2025-08-21 19:52:56 +00:00
a445b41e4f [pytorch] Simplify PyTorch foreach_* API restrictions check (#161039)
Summary: C++ polymorphism and component reuse help us reduce the amount of boilerplate code here.

Test Plan:
CI & tests

Rollback Plan:

Differential Revision: D80594353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161039
Approved by: https://github.com/janeyx99
2025-08-21 19:50:02 +00:00
801851086d [pytorch] Invoke vector.reserve() consistently for non-inplace foreach operations (#161128)
Summary:
The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons:

1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically.

2. Memory efficiency: It ensures that the vector allocates exactly the amount of memory needed, no more and no less, which is efficient when we know the final size in advance.

3. Reduced overhead: Each reallocation typically involves:
- Allocating a new, larger block of memory
- Copying all existing elements to the new location
- Destroying the old elements
- Deallocating the old memory block

4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs.
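
The cost model behind these points can be illustrated with a small simulation. It is a toy model written in Python rather than C++: `count_reallocations` is a hypothetical helper, and it assumes capacity doubles on growth (the actual growth factor of `std::vector` is implementation-defined).

```python
def count_reallocations(n, reserved=0):
    """Simulate n push_back calls; return (reallocations, elements copied)."""
    capacity = reserved
    size = 0
    reallocations = 0
    copied = 0
    for _ in range(n):
        if size == capacity:
            # Grow geometrically and move every existing element.
            capacity = max(1, capacity * 2)
            reallocations += 1
            copied += size
        size += 1
    return reallocations, copied


without_reserve = count_reallocations(1000)
with_reserve = count_reallocations(1000, reserved=1000)
```

In this model the up-front reservation counts as the one allocation made before the loop, so no growth steps or element copies happen while filling the vector.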

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80674453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161128
Approved by: https://github.com/Skylion007
2025-08-21 19:43:11 +00:00
958f9ca88e [nativert] oss static kernel tests (#161087)
Summary: att - should be no-op

Test Plan:
buck2 test //caffe2/test/cpp/nativert:static_kernel_ops_tests
Tests finished: Pass 24. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D80216488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161087
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-08-21 19:42:21 +00:00
9668210302 Allow bypasses for Precompile when guards, etc. cannot be serialized (#160902)
This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. This allows us to safely bypass if there are models with unserializable or incompatible parts. When we encounter something incompatible, we'll raise a bypass and ignore that particular code in DynamoCodeEntry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160902
Approved by: https://github.com/zhxchen17
2025-08-21 18:20:42 +00:00
3f5a8e2003 Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)
Fixes https://github.com/pytorch/pytorch/issues/160988.  The root cause can be found in the same issue.  This fix ensures that when old-wheel reuse is on and the `torchaudio` wheel is not available, the inductor test job can still rebuild the wheel it needs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-08-21 17:38:32 +00:00
3dacaf0e1e [aoti-fx] Add meta["val"] metadata (#161019)
Summary: Added a `_set_node_metadata_hook` which automatically adds node.meta["val"] to every new node that gets created under this context.

Test Plan:
` buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_advanced_ops`
https://www.internalfb.com/buck2/866439a2-2ba6-42d1-8e43-508d60456e2e

`buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_basic_ops`
https://www.internalfb.com/intern/testinfra/testrun/11540474149662857

Rollback Plan:

Differential Revision: D80579336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161019
Approved by: https://github.com/blaine-rister
2025-08-21 16:45:41 +00:00
a6401cb5aa Revert "flip the list-as-tuple behavior for short lists (#160794)"
This reverts commit febfc3ec03004116dfd6d504e6853ff02a1dd6e0.

Reverted https://github.com/pytorch/pytorch/pull/160794 on behalf of https://github.com/seemethere due to This is failing internal tests, see D80671241 ([comment](https://github.com/pytorch/pytorch/pull/160794#issuecomment-3211314867))
2025-08-21 16:33:30 +00:00
7006fd0c88 Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)"
This reverts commit 517d38d3406abbba35d0694bff259a698cad3ec9.

Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/IvanKobzarev due to Segment tree starts failing on trunk even though ciflows/trunk passed on the PR ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3211286092))
2025-08-21 16:22:44 +00:00
517d38d340 [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. Buffers can change their deallocation point during reordering if the candidate and the group to swap are both users of the f_input_buf and the group contains last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to respect the change in size_free (after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user, so candidate.size_free += buffer_size).

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit the number of collectives to reorder.
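
The alloc / run / dealloc phases from item 1 can be sketched as follows. This is an illustration under simplified assumptions, not the inductor implementation: each node is reduced to a hypothetical `(size_alloc, size_free)` pair, and both function names are made up for the example.

```python
def peak_collapsed(nodes):
    # Collapses each node into one net value: size_alloc - size_free.
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc - size_free
        peak = max(peak, current)
    return peak


def peak_alloc_then_free(nodes):
    # Respects the phases: outputs are allocated first, the peak is sampled,
    # and last-use buffers are freed only after the kernel has run.
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc
        peak = max(peak, current)
        current -= size_free
    return peak


nodes = [(10, 0), (8, 10), (4, 8)]
collapsed = peak_collapsed(nodes)
phased = peak_alloc_then_free(nodes)
```

On this input the collapsed estimate reports 10 while the phased estimate reports 18, because the second node's output must coexist with the buffer it is about to free.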

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory is not regressing a lot, but reserved memory is significantly regressed.

Investigation and fix of "reserved" memory will be in following PRs.

BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32Gb reserved: 36Gb
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 15:45:06 +00:00
3caddd4daa [ROCm] SDPA fix mem fault when dropout is enabled (#154864)
Fixes an issue that caused a device-side memory access fault due to incorrect tensor lifetime management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-21 14:23:13 +00:00
18271148d3 [dist] expose unsafe_get_ptr for dist.ProcessGroupNCCL.NCCLConfig (#161136)
Expose the pointer so that we can create the `ncclConfig_t` object from PyTorch and use it elsewhere. This is useful for controlling the NCCL communicator parameters for multiple NCCL communicators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136
Approved by: https://github.com/kwen2501
2025-08-21 10:47:03 +00:00
a941d7ffe5 [Quant][CPU] Avoid NaN in fp8 output of qlinear and qconv (#160957)
**Summary**
When the output dtype is fp8, oneDNN does not ensure that intermediate results are within the range [-448, 448] before converting to fp8. As a result, we may get NaN in the output, which is a disaster for inference. This PR fixes the issue by clamping the intermediate results with oneDNN's post-op clip.
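
The clamping step can be sketched in plain Python. This only illustrates the arithmetic, not the oneDNN post-op; `FP8_MAX` and `clamp_for_fp8` are hypothetical names, with the bound taken from the [-448, 448] range stated above.

```python
FP8_MAX = 448.0  # bound taken from the range quoted above


def clamp_for_fp8(values):
    # Clip each intermediate result into [-FP8_MAX, FP8_MAX] before conversion.
    return [min(max(v, -FP8_MAX), FP8_MAX) for v in values]


clamped = clamp_for_fp8([-1000.0, 3.5, 500.0])
```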

**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
2025-08-21 08:36:21 +00:00
acb00d3ccf Revert "Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)"
This reverts commit cfdaaaaa26d7f34427ba941569eca46f02f79f3e.

Reverted https://github.com/pytorch/pytorch/pull/161084 on behalf of https://github.com/huydhn due to My mistake in not checking for nvidia-smi availability ([comment](https://github.com/pytorch/pytorch/pull/161084#issuecomment-3209498435))
2025-08-21 08:17:04 +00:00
bd5857a1d6 Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)"
This reverts commit 9d18bf01b1661d227f6af41ac07a1e9ef20a9e1a.

Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of failures showing up after this lands ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3209487237))
2025-08-21 08:13:33 +00:00
23b033452f [Inductor][CPP] Fix layout for local buf in outer loop fusion (#160857)
Fixes #159154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160857
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-21 06:00:04 +00:00
2f50ae7d20 [nativert] make runtime const folding aware of run_const_graph (#160760)
Summary: it's possible that we have foldable nodes that use things that will be folded by run_const_graph

Test Plan:
CI

Rollback Plan:

Differential Revision: D80355542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160760
Approved by: https://github.com/SherlockNoMad
2025-08-21 05:22:03 +00:00
9d18bf01b1 [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. Buffers can change their deallocation point during reordering if the candidate and the group being swapped are both users of f_input_buf and the group contains last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to reflect the change in size_free (after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while candidate becomes the last user, so candidate.size_free += buffer_size).

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit the number of collectives to reorder.
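
The alloc/free phase accounting described in item 1 can be sketched in plain Python as follows. This is a minimal model, not the Inductor implementation: `NodeMem`, `peak_collapsed` and `peak_alloc_free` are hypothetical names, and each node is reduced to just the two byte counts mentioned in the docstring above.

```python
from collections import namedtuple

# size_alloc: bytes allocated for outputs before the kernel runs
# size_free:  bytes of last-use buffers released after the kernel runs
NodeMem = namedtuple("NodeMem", ["size_alloc", "size_free"])


def peak_collapsed(nodes):
    """Collapse each node to (size_alloc - size_free), then track the maximum."""
    cur = peak = 0
    for n in nodes:
        cur += n.size_alloc - n.size_free
        peak = max(peak, cur)
    return peak


def peak_alloc_free(nodes):
    """Sample memory right after the alloc phase, before last-use buffers are freed."""
    cur = peak = 0
    for n in nodes:
        cur += n.size_alloc
        peak = max(peak, cur)
        cur -= n.size_free
    return peak


nodes = [NodeMem(100, 0), NodeMem(50, 100), NodeMem(20, 50)]
collapsed = peak_collapsed(nodes)    # misses the moment when both buffers are live
phased = peak_alloc_free(nodes)      # sees the 100 + 50 bytes live after the second alloc
```

In this toy schedule the collapsed estimate under-reports the peak, which is the gap the alternative estimator is meant to close.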

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory does not regress much, but reserved memory regresses significantly.

Investigation and a fix for "reserved" memory will follow in later PRs.

BASELINE (bucketing AG and RS): active: 32GiB reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32GiB reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35GiB reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 05:19:38 +00:00
67b98da1b2 [nativert] oss static kernel test utils (#161086)
Summary: att - should be a no-op

Test Plan:
ci

Rollback Plan:

Differential Revision: D80214768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161086
Approved by: https://github.com/georgiaphillips
2025-08-21 04:49:06 +00:00
b0420d2438 [vllm hash update] update the pinned vllm hash (#161121)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161121
Approved by: https://github.com/pytorchbot
2025-08-21 04:21:09 +00:00
6096d277c5 [audio hash update] update the pinned audio hash (#161021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161021
Approved by: https://github.com/pytorchbot
2025-08-21 04:20:56 +00:00
cfdaaaaa26 Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)
Fixes https://github.com/pytorch/pytorch/issues/160988.  The root cause can be found in the same issue.  This fix ensures that when reuse of the old wheel is enabled and the `torchaudio` wheel is missing, the inductor test job can still rebuild the wheel it needs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-08-21 03:47:15 +00:00
117f11adb4 [FlexAttention][TF32] Handle uninitialized torch.backends.cuda.matmul.fp32_precision (#161102)
For https://github.com/pytorch/pytorch/issues/161022
The warning says the old API will be deprecated in 2.9+ anyway, leaving it up to the author of #125888 to decide on initialization behavior then

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161102
Approved by: https://github.com/ngimel, https://github.com/drisspg, https://github.com/BoyuanFeng
2025-08-21 03:36:52 +00:00
a154c2093c remove redundant installation (#160634)
Fixes #160302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160634
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2025-08-21 03:31:12 +00:00
39862acb2e [CPU][Inductor] improve performance of A16W4 GEMM template (#159127)
**Summary**
This PR improves the performance of the A16W4 GEMM template by removing the boundary check on prefetch in the kernel code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159127
Approved by: https://github.com/CaoE
2025-08-21 03:16:26 +00:00
9a41570199 [rfc] add hint_override kwarg to mark_dynamic (#161007)
The motivation for this change can be seen through the following example:

```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```

Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    rnumel = r0_numel
```

With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 1024000
    rnumel = r0_numel
```

This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:

```
f(s0) -> f(s2)
f(s1) -> f(s2)
```

could generate different kernels. With the new approach, an explicit override pins the chosen configuration:

```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```

ensuring consistent kernel generation regardless of input order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
2025-08-21 02:22:52 +00:00
f9875166a9 Revert "[FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136)"
This reverts commit 3d126e17e0c2630031e7a359d6a6fd1dbe52c4f7.

Reverted https://github.com/pytorch/pytorch/pull/160136 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))
2025-08-21 01:34:19 +00:00
6b5be1f4a0 Revert "[FSDP][Replicate] replicate tests for param registration and input device movements (#160147)"
This reverts commit a3a82e3da85a53afc4bbf3d75bd3d3dcc2e06645.

Reverted https://github.com/pytorch/pytorch/pull/160147 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))
2025-08-21 01:34:19 +00:00
0924304e72 [AOTI] Add a new config cpp.use_constexpr_for_int_array (#160927)
Summary: Default True so same as before, but make it configurable

Differential Revision: D80185094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160927
Approved by: https://github.com/henryoier
2025-08-21 01:16:27 +00:00
d875d3ca1e don't try to set lazy module loading env var (#161103)
This is not needed on drivers >=525, and in DriverAPI::get() we are initializing the context anyway, so setting environment variable after that is beside the point
As a result of calling DriverAPI::get on systems that don't have gpus available (e.g. due to CUDA_VISIBLE_DEVICES="") people were getting confusing errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161103
Approved by: https://github.com/eqy, https://github.com/malfet
2025-08-21 01:06:51 +00:00
a825557ed5 Workaround ATen SFINAE under libc++ (#161101)
The existing logic here that works around SFINAE issues on Microsoft platforms also applies to libc++ platforms. It appears that nvcc reports ambiguity in overload resolution for `pow_`. This seems like an nvcc limitation.

```
fbcode/caffe2/aten/src/ATen/native/cuda/Pow.cuh(42): error: more than one instance of overloaded function "pow" matches the argument list:
            function template "std::__2::enable_if<<expression>, std::__2::__promote<_A1, _A2, void>>::type::type pow(_A1, _A2) noexcept" (declared at line 848 of fbcode/third-party-buck/platform010-libcxx/build/libcxx/include/c++/v1/math.h)
            function template "std::__2::enable_if<<expression>, std::__2::__promote<_Tp, _Up, void>>::type pow(_Tp, _Up) noexcept" (declared at line 11308 of fbcode/third-party-buck/platform010/build/cuda/12.4/bin/..//include/crt/math_functions.h)
            argument types are: (double, float)
    return ::pow(base, exp);
           ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161101
Approved by: https://github.com/malfet
2025-08-21 00:55:58 +00:00
3e3e83418d [BE] Move indexing tests to test_indexing (#160994)
Which enables them on MPS device
- xfail all `test_index_reduce` on MPS, as op is not implemented
- xfail all `test_index_copy` on MPS due to the silent correctness problems, see https://github.com/pytorch/pytorch/issues/160993
- Fixed hard crash in `index_fill` and replaced `skipIfMPS` with `expectedFailureMPS`
- Created issue for the lack of deterministic algorithms for MPS backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160994
Approved by: https://github.com/manuelcandales
ghstack dependencies: #160850, #160889, #160926
2025-08-21 00:42:55 +00:00
667245dc60 TritonKernel.inductor_meta_common() -> self.inductor_meta_common() (#160895)
Summary: use `self.inductor_meta_common()` to call the static method, since custom subclasses may override the method as an instance method
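
The difference between the two call styles can be seen in a small standalone sketch. `Base`, `Custom` and `meta_common` are hypothetical stand-ins for illustration only, not Inductor's `TritonKernel` or its real method bodies.

```python
class Base:
    @staticmethod
    def meta_common():
        return {"source": "base"}

    def meta_via_class(self):
        # Hard-codes the base class, so a subclass override is never reached.
        return Base.meta_common()

    def meta_via_self(self):
        # Normal attribute lookup: picks up static or instance overrides alike.
        return self.meta_common()


class Custom(Base):
    def __init__(self, tag):
        self.tag = tag

    def meta_common(self):  # overridden as an instance method
        return {"source": "custom", "tag": self.tag}


kernel = Custom("k0")
via_class = kernel.meta_via_class()
via_self = kernel.meta_via_self()
```

Calling through `self` works for both the static base method and the instance-method override, which is the property the change relies on.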

Test Plan:
```
caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks
```

Rollback Plan:

Differential Revision: D80375351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2025-08-21 00:22:51 +00:00
54c2b66592 Replace _device_t with torch.types.Device in torch/cpu/__init__.py (#161031)
Fixes #152952

Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did basic smoke test by running tests that `import torch.cpu` including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`.

Based this PR off of #152935 which is referenced in the main issue.

(also, this is my first contribution but I followed the contributing guide closely)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031
Approved by: https://github.com/janeyx99
2025-08-21 00:22:43 +00:00
be87f22dfb [inductor] Enable updated __cplusplus macro (#161064)
Some Intel oneAPI headers depend on the `__cplusplus` macro.
This PR enables the updated `__cplusplus` macro for MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161064
Approved by: https://github.com/angelayi
2025-08-21 00:17:08 +00:00
2a7a7ad711 [inductor] add level zero for xpu (#161061)
Add level zero for Inductor xpu on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161061
Approved by: https://github.com/angelayi
2025-08-21 00:14:15 +00:00
7e6ce41555 [dcp_poc] add async checkpointing tests (#161034)
Summary: add tests for async checkpointer for the experimental checkpointer

Test Plan:
tests

Rollback Plan:

Differential Revision: D80590461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161034
Approved by: https://github.com/pradeepfn
2025-08-21 00:08:53 +00:00
4ed3184dee Conditionally enable ACL for bmm_out_or_baddbmm_ (#161065)
Summary: Similar to #ifdef checks added in addmm_impl_cpu_ to conditionally enable ACL, we add the same checks in bmm_out_or_baddbmm_. This essentially disables ACL for bmm_out_or_baddbmm_ and enables ArmPL, which seems to be performing better.

Test Plan: AR SL

Rollback Plan:

Reviewed By: Nicoshev

Differential Revision: D80494623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161065
Approved by: https://github.com/q10
2025-08-20 23:32:25 +00:00
44549c7146 [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-20 22:52:56 +00:00
febfc3ec03 flip the list-as-tuple behavior for short lists (#160794)
Per title. Previously we started throwing noisy warnings, but given how popular this pattern was in our test suite, we decided to keep it as a warning for one release rather than make a silent behavior change.
Now `treatSequenceAsTuple` returns `true` only when the sequence is indeed a tuple, so there is no need for a special function anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160794
Approved by: https://github.com/albanD
2025-08-20 22:40:42 +00:00
5afa4187df Close some sources of fake tensor leakages (#159923)
Differential Revision: D79694055

Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and error using the FQN of the lifted constant
2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict
3. We modify yolov3 to fix the previous silent incorrect behaviour

When upgrading the torchbench pin, opacus_cifar10 no longer seems to run in eager. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159923
Approved by: https://github.com/avikchaudhuri
2025-08-20 22:24:23 +00:00
30384abcb1 Decrease number of bytes used by uninitialized tokens_ in KernelFunction (#160764)
Use std::unique_ptr to decrease bytes from 24 to 8

Since std::unique_ptr is not copyable, this required defining the copy constructor and copy assignment operator, which made me realize we shouldn't be copying `tokens_` in those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160764
Approved by: https://github.com/albanD
2025-08-20 21:33:27 +00:00
16e811e0b5 [CI] remove tb-nightly (#160996)
Removing tb-nightly because having both tb-nightly and tensorboard installed causes import problems: pip reports 2.18.0 (the pinned tensorboard), but importing in a Python shell reports 2.13.XXX. This mismatch causes failures when running tests in a NumPy 2.x environment, e.g.

```
/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/redis/connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
  warnings.warn(msg)
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
E
======================================================================
ERROR: test_event_handler (__main__.TestMonitorTensorboard.test_event_handler)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_monitor.py", line 116, in setUp
    from tensorboard.backend.event_processing import (
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 25, in <module>
    from tensorboard.backend.event_processing import (
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 25, in <module>
    from tensorboard.backend.event_processing import event_file_loader
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/event_file_loader.py", line 21, in <module>
    from tensorboard import dataclass_compat
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/dataclass_compat.py", line 33, in <module>
    from tensorboard.plugins.hparams import metadata as hparams_metadata
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/plugins/hparams/metadata.py", line 32, in <module>
    NULL_TENSOR = tensor_util.make_tensor_proto(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/util/tensor_util.py", line 405, in make_tensor_proto
    numpy_dtype = dtypes.as_dtype(nparray.dtype)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py", line 677, in as_dtype
    if type_value.type == np.string_ or type_value.type == np.unicode_:
                          ^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/numpy/__init__.py", line 400, in __getattr__
    raise AttributeError(
AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

----------------------------------------------------------------------
Ran 1 test in 0.355s

FAILED (errors=1)

```
After removing tb-nightly and ensuring that tensorboard 2.18.0 is the only tensorboard in the env:

```
root@rocm-framework-47:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
.
----------------------------------------------------------------------
Ran 1 test in 0.409s

OK

```

```
>>> import tensorboard
>>> print(tensorboard.__version__)
2.13.0a20230426
```
```
:/# pip show tensorboard
```
Name: tensorboard
Version: 2.18.0
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/venv/lib/python3.12/site-packages
Requires: absl-py, grpcio, markdown, numpy, packaging, protobuf, setuptools, six, tensorboard-data-server, werkzeug
Required-by:

```
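
A mismatch like the one shown above can be checked programmatically. The helper below is a hypothetical sketch, not part of the PR: it compares the version recorded in the installed distribution metadata (what `pip show` reports) with the `__version__` attribute of the module that actually gets imported.

```python
import importlib
from importlib import metadata


def installed_vs_imported(dist_name, module_name):
    """Return (metadata version, imported module __version__), None where unavailable."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        installed = None
    try:
        module = importlib.import_module(module_name)
        imported = getattr(module, "__version__", None)
    except ImportError:
        imported = None
    return installed, imported


def versions_disagree(dist_name, module_name):
    """True only when both versions are known and differ."""
    installed, imported = installed_vs_imported(dist_name, module_name)
    return installed is not None and imported is not None and installed != imported
```

In the broken environment described above, `versions_disagree("tensorboard", "tensorboard")` would be expected to return `True`, since the metadata says 2.18.0 while the imported module reports a 2.13 nightly.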

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160996
Approved by: https://github.com/huydhn
2025-08-20 21:25:58 +00:00
19c70c2f3d [pytorch] Faster and safer lambda expression capture in has_integral_tensor() (#161042)
Summary: `includeBool` is already a small value type (i.e., `bool`, 1 byte) that's passed by value to the function, so capturing it by reference (4 or 8 bytes depending on the system) is unnecessary and could lead to dangling-reference issues if the lambda outlives the original variable. Capturing by value is more efficient for small types and safer.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80595698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161042
Approved by: https://github.com/Skylion007
2025-08-20 20:59:41 +00:00
8047cde0f3 Try to fix Inductor CI periodic tests (#160932)
- hf_Reformer: this one starts failing due to increased graph breaks due to transformers pin bump (#159291). We can likely just bump the expected graph break count.
- dla102: this one starts timing out on 8/13 Wed between commit 6e8865f and ee1b041. But based on the PT2 dashboard, this model actually doesn't have compile time or runtime regression. Will try to bump up the timeout and see if it can work.
- hf_BigBird: this one has its accuracy status improved since today. Will update hf_BigBird accuracy status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
2025-08-20 20:36:46 +00:00
24e7f3c21c [ROCm] fix large tensor sort on MI350 (#161054)
Currently, `std::min` resolving to `::min` does not work as expected on ROCm when input values are >= 2147483648.

Replace `std::min` with a ternary expression.
Alternatively, `std::min` could be given an explicit type: `std::min<int64_t>`.

fixes on ROCm:
test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16
error:
RuntimeError: Cannot sort dimension of length 8192

Similar PR to fix large tensors on ROCm https://github.com/pytorch/pytorch/pull/130994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161054
Approved by: https://github.com/jeffdaily
2025-08-20 19:58:01 +00:00
e1a64b75ff [CD] Delete full builds (#161075)
As they are no longer needed for Colab, see https://github.com/googlecolab/colabtools/issues/5508#issuecomment-3200871941 and https://colab.research.google.com/drive/1YJ5Y0xsApXSewM1cQwWQ_AS3A77vytgq

Fixes https://github.com/pytorch/pytorch/issues/160972
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161075
Approved by: https://github.com/atalman
2025-08-20 19:40:15 +00:00
b708966201 Fix bucketing introducing cycles (#160967)
We were just looking at direct arguments, but not transitive dependencies.
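
The difference between checking direct arguments and checking transitive dependencies can be sketched on a toy dependency map. The function names and the graph representation here are assumptions made for illustration; the actual bucketing pass operates on FX graph nodes.

```python
def transitive_deps(node, deps):
    """All nodes reachable from `node` by following dependency edges."""
    seen = set()
    stack = list(deps.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(deps.get(cur, ()))
    return seen


def direct_check_ok(a, b, deps):
    """Naive check: only looks at immediate arguments."""
    return a not in deps.get(b, ()) and b not in deps.get(a, ())


def can_bucket(a, b, deps):
    """Safe check: merging a and b is refused if either depends on the other indirectly."""
    return a not in transitive_deps(b, deps) and b not in transitive_deps(a, deps)


# c2 depends on mid, which depends on c1: merging c1 and c2 would put
# mid both after and before the merged node, i.e. a cycle.
deps = {"c1": [], "mid": ["c1"], "c2": ["mid"], "c3": []}
naive = direct_check_ok("c1", "c2", deps)
safe = can_bucket("c1", "c2", deps)
```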

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160967
Approved by: https://github.com/IvanKobzarev
2025-08-20 19:38:46 +00:00
dbef606631 Add support for tracing vmap in pre-dispatch export (#154650)
Summary: The ONNX team and a recent transformer upgrade ran into this error, and we also ran into it during our export benchmarking. This diff makes it possible to trace through the vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR and in the future, we should desugar them to post-grad ops.

The implementation strategy is:
1. We add Python wrappers around the vmap APIs so that we can attach a custom torch function handler that is only on during non-strict export. The reason is that we don't want to add this to the default torch_function handler because it would break BC.
2. Some dynamo changes to make sure it picks up the new Python wrapper APIs. The reason is that when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We could avoid this by special-casing in dynamo for export to proxy different API calls, but I feel that is too much chaos because you need to be able to proxy 2 different variants of the same vmap API.

Test Plan: CI

Differential Revision: D75623875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-08-20 19:31:07 +00:00
c5cb255625 [inductor][mm] fix tma issue (#161025)
# why

- head is broken

# what

- the template for experimental API is broken
- the test assumes not experimental API

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_persistent_tma_strided_a_transposed_True_b_transposed_False_dynamic_True -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161025
Approved by: https://github.com/PaulZhang12
2025-08-20 18:52:38 +00:00
957b170d8e Fix SVD forward-mode AD multiplication priority (#161027)
Multiplication order priority for the SVD JVP appears to have been the opposite of the optimal one.
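
Why association order matters for a chain of matrix products can be shown with a simple multiply-add count. The shapes below are arbitrary and are not the actual factors of the SVD JVP; `matmul_flops` and `chain_costs` are hypothetical helpers for this sketch.

```python
def matmul_flops(m, k, n):
    """Multiply-add count for an (m x k) @ (k x n) product."""
    return m * k * n


def chain_costs(a_shape, b_shape, c_shape):
    """Cost of (A @ B) @ C versus A @ (B @ C) for compatible 2-D shapes."""
    (m, k), (k2, n), (n2, p) = a_shape, b_shape, c_shape
    assert k == k2 and n == n2, "inner dimensions must agree"
    left_first = matmul_flops(m, k, n) + matmul_flops(m, n, p)   # (A @ B) @ C
    right_first = matmul_flops(k, n, p) + matmul_flops(m, k, p)  # A @ (B @ C)
    return left_first, right_first


# Tall @ wide @ tall: forming the large 1000 x 1000 intermediate first is costly.
left, right = chain_costs((1000, 100), (100, 1000), (1000, 100))
```

Both orders give the same mathematical result, so choosing the cheaper grouping changes only the running time.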

Results from a crude CPU benchmark on my laptop for random matrices of various ratios:

```
  Performance Results Table

  | Test Case                        | Matrix Size | Aspect Ratio | Before JVP (ms) | After JVP (ms) | Change (ms) | % Change | Status              |
  |----------------------------------|-------------|--------------|-----------------|----------------|-------------|----------|---------------------|
  | Tall matrix (10:1 ratio)         | 1000×100    | 10:1 tall    | 3.13            | 3.24           | +0.11       | -3.5%    |  Regression        |
  | Tall matrix (10:1 ratio, larger) | 2000×200    | 10:1 tall    | 15.72           | 14.66          | -1.06       | +6.7%    |  Improvement       |
  | Tall matrix (10:1 ratio, large)  | 5000×500    | 10:1 tall    | 105.97          | 101.84         | -4.13       | +3.9%    |  Improvement       |
  | Wide matrix (1:10 ratio)         | 100×1000    | 1:10 wide    | 5.90            | 4.64           | -1.26       | +21.4%   |  Major Improvement |
  | Wide matrix (1:10 ratio, larger) | 200×2000    | 1:10 wide    | 18.29           | 17.78          | -0.51       | +2.8%    |  Improvement       |
  | Wide matrix (1:10 ratio, large)  | 500×5000    | 1:10 wide    | 137.40          | 128.70         | -8.70       | +6.3%    |  Improvement       |
  | Square matrix (baseline)         | 1000×1000   | 1:1 square   | 116.16          | 106.09         | -10.07      | +8.7%    |  Improvement       |
  | Square matrix (larger baseline)  | 2000×2000   | 1:1 square   | 714.30          | 673.23         | -41.07      | +5.7%    |  Improvement       |

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161027
Approved by: https://github.com/soulitzer
2025-08-20 18:47:11 +00:00
c02e26bf31 Fix filename showing up as ints in dynamo_compile stack_trace column. (#160916)
Test plan:
$ python -m test_utils

Note:
Another way is adding the actual file_name to from_traceback, but since it's referenced in multiple places and may have associated tests this seems safer. Lmk if changes are needed @c00w

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160916
Approved by: https://github.com/c00w, https://github.com/masnesral
2025-08-20 18:38:38 +00:00
eqy
c74e5f6061 [CUDA] Bump tolerances for test_baddmm (#159915)
Only one mismatch out of the entire result tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159915
Approved by: https://github.com/nWEIdia, https://github.com/drisspg
2025-08-20 18:05:51 +00:00
1471b20cb3 add static dispatch kernel registration to open source (#160439)
Summary: The static dispatch registry should be moved to open source. The rest can be maintained internally for now, since delegates will all go through ET hop.

Test Plan: spot checked existing tests and didn't see any missing registrations

Differential Revision: D80099377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160439
Approved by: https://github.com/SherlockNoMad, https://github.com/zhxchen17
2025-08-20 17:58:00 +00:00
b2632e7982 Fix error message for fsdp_pre_all_gather (#160817)
See: 20e40492b0/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L97-L104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160817
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
2025-08-20 17:43:57 +00:00
5255e65c01 [dynamo] Refactor convert_frame to remove usage of nonlocal tracer output return. [4/n] (#160899)
Today convert_frame is implemented like the following:
```
def _compile():
    tracer_output = None
    def transform():
        nonlocal tracer_output
        ...
    def _compile_inner():
        transform(...)

    compile_inner(...)
```

The code uses an unconventional nonlocal variable as the return value. This is not ideal for 2 reasons:
1. Reasoning about the code, especially together with the error handling code, becomes harder.
2. More importantly, this makes it harder to extract common code into a shared library, because everything must depend on a central global state.

In this diff we remove the usage of nonlocal return and just use the conventional function return to output the compilation data.
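
The shape of the refactor can be sketched with a toy before/after pair. These functions are simplified, hypothetical stand-ins and do not reproduce the real `convert_frame` signatures or data types.

```python
def compile_with_nonlocal(code):
    """Before: the inner function smuggles its result out through a nonlocal."""
    tracer_output = None

    def transform(instructions):
        nonlocal tracer_output
        tracer_output = {"graph": f"graph({code})"}
        instructions.append("CALL_GRAPH")

    instructions = []
    transform(instructions)
    return instructions, tracer_output


def compile_with_return(code):
    """After: the inner function returns everything it produces."""

    def transform(instructions):
        return instructions + ["CALL_GRAPH"], {"graph": f"graph({code})"}

    return transform([])


before = compile_with_nonlocal("f")
after = compile_with_return("f")
```

With plain return values, `transform` no longer depends on state in an enclosing scope, so it could be moved into a shared module without dragging that scope along.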

Differential Revision: [D80461258](https://our.internmc.facebook.com/intern/diff/D80461258/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160899
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815, #160855
2025-08-20 17:37:26 +00:00
9e050b6339 [dynamo] Refactor convert_frame._compile_inner to return compiled bytecode + output graph. [3/n] (#160855)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

This PR adds a new helper function compile_frame(), which takes bytecode and a transform function and returns the compiled bytecode + output graph as a DynamoOutput type.

Differential Revision: [D80430802](https://our.internmc.facebook.com/intern/diff/D80430802/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160855
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815
2025-08-20 17:37:26 +00:00
b3e215b864 Trigger h100 on test_max_autotune, mm, grouped_mm changes (#160678)
Following @henrylhtsang's PR here: https://github.com/pytorch/pytorch/pull/160656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160678
Approved by: https://github.com/henrylhtsang, https://github.com/ngimel
2025-08-20 16:56:30 +00:00
e483947047 [BE] Remove intel-openmp dependency in setup.py (#160976)
Fixes #160962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160976
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-08-20 16:33:16 +00:00
8e17709055 FlexDecode not guarding on GQA groups correctly (#160904)
Addressing #151359

Updates flex_decode dispatch to use flex attention rather than flex decode if number of groups is not a power of 2
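
A rough sketch of the guard described above. The helper names and the way the group count is derived from head counts are assumptions for illustration; the real dispatch checks other conditions as well:

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0


def choose_kernel(num_q_heads: int, num_kv_heads: int) -> str:
    # Hypothetical dispatch helper: GQA groups = query heads per KV head.
    gqa_groups = num_q_heads // num_kv_heads
    return "flex_decode" if is_power_of_two(gqa_groups) else "flex_attention"
```

For example, 32 query heads over 8 KV heads gives 4 groups and keeps the decode path, while 24 over 4 gives 6 groups and falls back to flex attention.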

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160904
Approved by: https://github.com/drisspg
2025-08-20 16:32:16 +00:00
e631557518 Fix meta function for aten.complex (#160894)
Closes https://github.com/pytorch/pytorch/issues/160882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160894
Approved by: https://github.com/mlazos
2025-08-20 16:30:04 +00:00
7f201baf41 Allow exposing more functions during initial template expansion (#159554)
Also adds a `_register_hook` utility, and documents & type annotates `PartialRender`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159554
Approved by: https://github.com/laithsakka, https://github.com/kundaMwiza
2025-08-20 16:08:55 +00:00
ce048de608 [ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)
This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, the operations for sparse matrices on CPU are available through MKL only. The non-existence of MKL on `aarch64` causes the unavailability of these ops on any machines with ARM based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.

This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
2025-08-20 15:44:54 +00:00
90ea9ccefe Revert "[rfc] add hint_override kwarg to mark_dynamic (#161007)"
This reverts commit 0533ff2ccba7e77622ac3c6758f1032bdc10feff.

Reverted https://github.com/pytorch/pytorch/pull/161007 on behalf of https://github.com/jeffdaily due to failing on both cuda and rocm ([comment](https://github.com/pytorch/pytorch/pull/161007#issuecomment-3206893756))
2025-08-20 15:31:33 +00:00
6ea4be1e2e Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 2f0cba934de7094a66c6ce68f5e937254f23142a.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/seemethere due to This is blocking internal sync due to merge conflicts ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3206833193))
2025-08-20 15:16:45 +00:00
a818fa77e3 Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)
Summary: Reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999
Approved by: https://github.com/izaitsevfb, https://github.com/jingsh
2025-08-20 15:04:36 +00:00
5ee464db5c [inductor] Fix descriptor broadcasting for singleton dimensions (#160310)
This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail.

As an example, consider the following store instruction:

```
name=buf1
index=x2 + 192*y0 + 64*y1
value=TritonCSEVariable('tmp7')
params = BlockParameters(
    shape=[3, 4, 1, 1, 64],
    block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK],
    strides=[64, 192, 0, 0, 1],
    offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset]
)
broadcasting_dims=[False, False, True, True, False]
broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK]
```
Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim3 is incorrectly
marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`.

```
pre_broadcast_shape = [
    sympy.S.One if is_broadcasting else dim
    for dim, is_broadcasting in zip(
        self.broadcast_shape, self.broadcasting_dims
    )
]
```

The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]`

Triton throws the following error: `reshape() cannot change total number of elements in tensor`
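
The misalignment can be reproduced without Inductor. The sketch below substitutes concrete integers for the symbolic block sizes (an assumption, with YBLOCK-derived sizes 2 and 4 and XBLOCK 64), and the "aligned" variant is only one possible way to keep the two lists in step, not necessarily what this PR does:

```python
import math

block_shape = [2, 4, 1, 1, 64]   # integer stand-ins for the symbolic sizes
strides = [64, 192, 0, 0, 1]

# Singleton dimensions are dropped from the shape, but every zero stride is flagged.
broadcast_shape = [d for d in block_shape if d != 1]
broadcasting_dims = [s == 0 for s in strides]

# zip() stops at the shorter list, so the third flag (True) lands on XBLOCK.
buggy = [1 if is_b else dim for dim, is_b in zip(broadcast_shape, broadcasting_dims)]

# Compute flags only for the dimensions that survive in broadcast_shape.
aligned_flags = [s == 0 for d, s in zip(block_shape, strides) if d != 1]
fixed = [1 if is_b else dim for dim, is_b in zip(broadcast_shape, aligned_flags)]
```

Here `buggy` has a different element count from `broadcast_shape`, which is the condition that makes the reshape fail, while `fixed` preserves it.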

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310
Approved by: https://github.com/blaine-rister
2025-08-20 09:48:58 +00:00
0533ff2ccb [rfc] add hint_override kwarg to mark_dynamic (#161007)
The motivation for this change can be seen through the following example:

```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```

Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:

```
def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr):
    xnumel = 512
    r0_numel = 32
```

With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    r0_numel = 128
```

This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:

```
f(s0) -> f(s2)
f(s1) -> f(s2)
```

could generate different kernels. With the new approach, an explicit override pins the chosen configuration:

```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```

ensuring consistent kernel generation regardless of input order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
2025-08-20 07:51:09 +00:00
a9fabeb012 [BE] Fix old TMA API in persistent matmul template (#161030)
Summary: Fixes a bug introduced by https://github.com/pytorch/pytorch/pull/159407

Test Plan:
NA

Rollback Plan:

Differential Revision: D80588320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161030
Approved by: https://github.com/adamomainz, https://github.com/NikhilAPatel, https://github.com/nmacchioni, https://github.com/aakhundov
2025-08-20 05:53:57 +00:00
0f801a510f Using std::vector or c10::SmallVector instead of CArray (#160959)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160959
Approved by: https://github.com/Skylion007
2025-08-20 05:32:29 +00:00
576a0e64ed [nativert] ensure that moveable outputs are set in other executionframe ctor (#161005)
Summary:
We use this constructor in HigherOrderKernel. Problems arise in the loop condition, where an output from the previous iteration can be an input to the next: the Output(N) of a kernel may be the Input(M) of a kernel in the next iteration. If the output value is reset (via fastresizetozero) or overwritten by a previous kernel before it is used, this causes serious problems.

We need to enforce that outputs are moved, not copied, to ensure this doesn't happen.

Test Plan:
buck2 test //caffe2/test:test_export --local-only -- test_while_loop_tensor_constant_idx_cpp_runtime_nonstrict

Rollback Plan:

Differential Revision: D80565374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161005
Approved by: https://github.com/SherlockNoMad
2025-08-20 05:05:32 +00:00
a3fe1ced40 [Optimus][decompose_mm] Fix BooleanAtom corner case (#160987)
Summary:
We observed a case where BooleanAtom does not support the regular sum op for a boolean expression, so we fix it by using bool().

Rollback Plan:

Differential Revision: D80550876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160987
Approved by: https://github.com/Yuzhen11, https://github.com/mlazos
2025-08-20 04:36:12 +00:00
7e4bfa74ea [vllm hash update] update the pinned vllm hash (#161020)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161020
Approved by: https://github.com/pytorchbot
2025-08-20 04:15:50 +00:00
d8fcb2a4ac [dcp_poc] Fix parameter order in distributed checkpoint API to use path-first for consistency (#160986)
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.

Test Plan:
sandcastle tests

Rollback Plan:

Differential Revision: D80549014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
2025-08-20 04:09:18 +00:00
2b62ef7420 Add kernel information JSON generation for AOTI packages (#160540)
Summary:
Build on D80031559. Generate kernel_information.json in AOTI compiled artifacts by combining stack traces and node mappings from provenance tracking.

This implementation delivers exactly what Zoomer team requested:

**1. Core Function**: `create_kernel_information_json()` in debug.py combines 3 data sources:
- `_inductor_kernel_stack_trace` → `stack_traces` field
- `_inductor_triton_kernel_to_post_grad_node_info` → `post_grad_nodes` field
- `_inductor_post_to_pre_grad_nodes["postToPre"]` → `pre_grad_nodes` field

**2. AOTI Integration**: codecache.py writes `kernel_information.json` to pt2 packages when both AOTI packaging and provenance tracking are enabled.

**3. Test Coverage**: TestKernelInformationAOTI class validates:
- JSON file creation in AOTI packages using zipfile
- Exact format compliance
- Proper disabling without provenance tracking

**Output Format** (exact specification):
```json
{
  "triton_kernel_name_1": {
    "stack_traces": [str, str, ...],
    "post_grad_nodes": [str, str, ...],
    "pre_grad_nodes": [str, str, ...]
  }
}
```
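
As a rough illustration of how three such mappings could be merged into this format, the sketch below uses a hypothetical helper, `build_kernel_information`. The shapes of its three inputs are assumptions and may not match the real provenance-tracking structures:

```python
import json


def build_kernel_information(stack_traces, kernel_to_post_grad, post_to_pre):
    """Hypothetical helper; argument shapes are assumed, not taken from debug.py."""
    info = {}
    for kernel, post_nodes in kernel_to_post_grad.items():
        pre_nodes = []
        for node in post_nodes:
            for pre in post_to_pre.get(node, []):
                if pre not in pre_nodes:  # keep order, drop duplicates
                    pre_nodes.append(pre)
        info[kernel] = {
            "stack_traces": list(stack_traces.get(kernel, [])),
            "post_grad_nodes": list(post_nodes),
            "pre_grad_nodes": pre_nodes,
        }
    return info


example = build_kernel_information(
    {"triton_poi_fused_add_0": ["File model.py, line 3, in forward"]},
    {"triton_poi_fused_add_0": ["add_1"]},
    {"add_1": ["add"]},
)
print(json.dumps(example, indent=2))
```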

Test Plan:
```
buck test fbcode//caffe2/test/inductor:provenance_tracing -- TestKernelInformationAOTI
```

Manual validation:
```python
import torch
model = torch.nn.Linear(10, 1)
with torch._inductor.config.patch("aot_inductor.package", True):
    with torch._inductor.config.patch("trace.basic_provenance_tracking", True):
        # AOTI compilation should generate kernel_information.json
        compiled = torch.export.export(model, (torch.randn(1, 10),))
```
---

Rollback Plan:

Differential Revision: D80139160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160540
Approved by: https://github.com/yushangdi
2025-08-20 02:33:45 +00:00
54cc63b467 [BE][Dynamo] Type coverage for symbolic_convert (#160922)
As part of better engineering, we add type coverage to `dynamo/symbolic_convert.py`, which is the main work engine of dynamo for emulating python bytecode.

Running
```
mypy torch/_dynamo/symbolic_convert.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  764 | 4286 | 17.83% | 43 | 241 | 17.84% |
| This PR | 4322 | 4322 | 100.00% | 241 | 241 | 100.00% |
| Delta    | +3558 | +36 | +82.17% | +198 | 0 | +82.16% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160922
Approved by: https://github.com/StrongerXi
2025-08-20 01:24:31 +00:00
599f639ddb [dynamo] Refactor transform() so that instruction translator can be used as a tracing function. [2/n] (#160815)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

This PR follows the last one, which separates out the part that runs the instruction translator on a given frame and returns a DynamoTracerOutput.

The end result is a free function that runs the instruction translator independently. A follow-up diff will wrap the low-level function.

Differential Revision: [D80388694](https://our.internmc.facebook.com/intern/diff/D80388694/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160815
Approved by: https://github.com/anijain2305
ghstack dependencies: #160814
2025-08-20 01:16:35 +00:00
72e4786d16 [dynamo][dist] trace DeviceMesh's get_local_rank and get_rank as constants (#160805)
Used in https://github.com/pytorch/torchtitan/pull/1555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160805
Approved by: https://github.com/StrongerXi, https://github.com/mlazos
2025-08-20 01:12:24 +00:00
371909cfd1 [Inductor][CPP] Add float16 support for CppMicroGemmAMX (#147368)
Add float16 support for CppMicroGemmAMX for float16 gemm template. Float16 CppMicroGemmAMX needs a higher version of compiler, e.g., GCC 13.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147368
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-20 01:04:05 +00:00
78a8e6a671 Add new_empty (with dtype argument only) to torch::stable (#159508)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159508
Approved by: https://github.com/janeyx99
ghstack dependencies: #160557
2025-08-20 00:50:42 +00:00
543896fcf3 test_matmul_cuda: Refine MX test skipping (#161009)
Replace return unittest.skip with raise unittest.SkipTest to ensure that the test suite correctly reports skipped tests.
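
The difference is visible with the standard library alone. In this sketch the test class and skip message are made up for illustration:

```python
import unittest
import warnings


class Demo(unittest.TestCase):
    def test_return_skip(self):
        # unittest.skip() only builds a decorator; returning it skips nothing.
        return unittest.skip("MX not supported on this device")

    def test_raise_skip(self):
        # Raising SkipTest is what the runner records as a skip.
        raise unittest.SkipTest("MX not supported on this device")


def run_one(name):
    result = unittest.TestResult()
    with warnings.catch_warnings():
        # Newer Python versions warn when a test returns a non-None value.
        warnings.simplefilter("ignore")
        Demo(name).run(result)
    return result
```

`run_one("test_return_skip").skipped` is empty, whereas `run_one("test_raise_skip").skipped` contains one entry.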

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161009
Approved by: https://github.com/jeffdaily
2025-08-20 00:47:45 +00:00
a3a82e3da8 [FSDP][Replicate] replicate tests for param registration and input device movements (#160147)
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. To this end, I have added three test cases, one to test input device movement and the other two to test parameter registration during the forward and backward pass of a model.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160147
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135, #160136
2025-08-20 00:47:00 +00:00
9d7cecdd6c [SymmMem] Support rendezvous on view of a tensor (#160925)
`tensor.view` shares the same `data_ptr()` as the original tensor, so `data_ptr()` cannot serve as the key to rendezvous' map (we want a 1:1 match between handle and tensor, and thus need a unique key).

@ezyang suggests using the raw `TensorImpl*` of a tensor, for which `tensor.view` would have a different value than the original tensor.

But the raw `TensorImpl*` can be stumbled on again when a previous tensor gets deallocated and a new one allocated. For that reason, we'd also need to use a `weak_intrusive_ptr` to distinguish the two tensors, i.e. for the deallocated tensor, `weak_intrusive_ptr::expired()` would return true.
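
The same idea can be sketched in Python as an analogue only: `id()` stands in for the raw `TensorImpl*` and `weakref.ref` for the weak pointer. The `Tensorish` and `HandleRegistry` classes are hypothetical and are not part of SymmMem:

```python
import weakref


class Tensorish:
    """Stand-in for a tensor; weak references need a class instance."""


class HandleRegistry:
    def __init__(self):
        self._entries = {}  # id(obj) -> (weak reference, handle)

    def register(self, obj, handle):
        self._entries[id(obj)] = (weakref.ref(obj), handle)

    def lookup(self, obj):
        entry = self._entries.get(id(obj))
        if entry is None:
            return None
        ref, handle = entry
        # A dead or mismatched reference means the identity was recycled.
        if ref() is not obj:
            del self._entries[id(obj)]
            return None
        return handle
```

Even if a new object happens to reuse the identity of a deallocated one, the weak reference no longer resolves to it, so the stale handle is not returned.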

Added `test_rendezvous_view` and `test_rendezvous_same`.

Note: the view support has been added to NVSHMEM backend and NCCL backend. For CUDA backend, I have yet to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160925
Approved by: https://github.com/ngimel
ghstack dependencies: #160825
2025-08-19 23:49:25 +00:00
0d19541284 fabric detection - fix build on an old toolkit (#160984)
Fixes #160960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160984
Approved by: https://github.com/eqy
2025-08-19 23:43:36 +00:00
eqy
e836323a23 [FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on sm100 (#160693)
See also: https://docs.nvidia.com/cuda/cublas/#id93

Only tensor-wide scales and 1D scales with tiled layout are supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160693
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-08-19 23:22:51 +00:00
512fc768e9 Add tlparse artifact for joint graph passes (for inference & non-freezing only) (#160589)
Summary:
Joint graph passes run several FX passes which can modify the graph before it hits Inductor.

There are three usages of joint graph passes:
- **for inference & not freezing** (we add structured logging only for this)
- for inference & freezing
- for fw/bw split

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D80130321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160589
Approved by: https://github.com/yushangdi
2025-08-19 23:18:40 +00:00
a7b5955ea8 [ContextParallel] add Document Masking test (#160700)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #160700

**Summary**
add test case to CP + FlexAttention for Document Masking

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention_document_mask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160700
Approved by: https://github.com/fegin
2025-08-19 23:03:18 +00:00
e83825f91c Revert "handling special case for pow(3) for GPU (#157537)"
This reverts commit 05e8fac4f374c4dbf0cd0e85e925e9112cf234a2.

Reverted https://github.com/pytorch/pytorch/pull/157537 on behalf of https://github.com/malfet due to This is really really bad from performance point of view, wonder if any benchmarks will detect that ([comment](https://github.com/pytorch/pytorch/pull/157537#issuecomment-3202661810))
2025-08-19 22:57:45 +00:00
33c3794533 [dynamic shapes] use prims_common contiguity in create_example_tensors (#160933)
Summary: forward fix T234739699

Test Plan:
T234739699

Rollback Plan:

Differential Revision: D80503451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160933
Approved by: https://github.com/henrylhtsang
2025-08-19 22:43:13 +00:00
8f766d6839 Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557)
TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes.

This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only in abstract pass around obfuscated dtypes appearing as int32_ts, now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.

Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t).

I then changed the test to test the scalar_type API.

This code change required some refactoring because of circular dependencies.

## BC Breaking note
This commit is (narrowly) BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2025-08-19 22:13:47 +00:00
05e8fac4f3 handling special case for pow(3) for GPU (#157537)
follows #152373

Special case for pow(3):
Similar to the [CPU kernel](d27d36136c/aten/src/ATen/native/cpu/PowKernel.cpp (L64)), added corresponding GPU code for numerical stability.

issue #150951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157537
Approved by: https://github.com/soulitzer
2025-08-19 21:57:08 +00:00
f90ccad165 [export] Relax FC requirement of serde.deserialize by allowing unknown fields. (#160918)
Summary:
Previously we passed all serialized data to the dataclass ctors.
Now we loop over the existing fields of the dataclass and fetch only the fields needed to run the ctor.

This should help with the case where we deserialize a buffer that contains a new field.
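
A minimal sketch of that approach with plain dataclasses. The `GraphSignature` type and the `from_dict_lenient` helper are hypothetical and are not the serde schema or API:

```python
from dataclasses import dataclass, fields


@dataclass
class GraphSignature:  # hypothetical schema type
    name: str
    version: int


def from_dict_lenient(cls, data):
    # Iterate over the fields the dataclass declares and ignore everything else.
    kwargs = {f.name: data[f.name] for f in fields(cls) if f.name in data}
    return cls(**kwargs)


payload = {"name": "g", "version": 2, "added_in_newer_release": True}
sig = from_dict_lenient(GraphSignature, payload)
```

Calling `GraphSignature(**payload)` directly would raise a `TypeError` because of the unknown key, which is the forward-compatibility failure the change avoids.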

Test Plan:
CI

Rollback Plan:

Differential Revision: D80487716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160918
Approved by: https://github.com/angelayi
2025-08-19 21:54:46 +00:00
35e4d97e04 [dynamo] Support builtin complex with constant args (#160799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160799
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
2025-08-19 20:38:54 +00:00
66166cf1e7 preserve node meta to fix inductor generated kernel name for pattern matched graphs (#160542)
Summary:
When using the inductor pattern matcher to replace graphs, the graph generated by the replacement function can be missing `original_aten` metadata for the replaced nodes. This in turn results in inductor failing to generate a sensible kernel name, e.g. `tri_poi_fused_0`, which is missing the aten op name.

This diff attempts to fix that by allowing tracing the graph in replacement function with `preserve_node_meta`. Included this as an option to turn on in `pattern_matcher.fwd_only` function.

Can confirm that with the fix, MTIA's pattern matcher replaced original graph with a node that has original_aten meta, and inductor generated kernel name has op name.

Test Plan:
added kernel_name check to afg_inductor_test silu test

Rollback Plan:

Differential Revision: D80183670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160542
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2025-08-19 20:32:17 +00:00
eba20d2d74 Revert "[WIP] Merge Test (#160998)"
This reverts commit ef761c43538abae5bccc0c4b6ebaf42ff676db7a.

Reverted https://github.com/pytorch/pytorch/pull/160998 on behalf of https://github.com/ZainRizvi due to Undoing test merge ([comment](https://github.com/pytorch/pytorch/pull/160998#issuecomment-3202125839))
2025-08-19 20:30:39 +00:00
ef761c4353 [WIP] Merge Test (#160998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160998
Approved by: https://github.com/ZainRizvi
2025-08-19 20:26:07 +00:00
1ea918caf9 [C10D] Make MultiProcContinuousTest less spammy (#160821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160821
Approved by: https://github.com/fduwjj
ghstack dependencies: #160892
2025-08-19 20:17:19 +00:00
779fc29c04 [C10D] Fix spelling of MultiProcContinuousTest (#160892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892
Approved by: https://github.com/fduwjj
2025-08-19 20:17:19 +00:00
ed8bcccf31 [BE][Ez]: Update ruff to 0.12.9 (#160896)
Updates ruff. Fixes false positives and other miscellaneous ruff linting and formatting fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160896
Approved by: https://github.com/zou3519
2025-08-19 19:56:24 +00:00
9d9cc9897a [SymmMem] Support rendezvous on slice of a tensor (#160825)
When we search for a NVSHMEM allocation backing a tensor, don't limit it to an exact match between `tensor.data_ptr()` and `allocation.base_ptr`. Instead, test whether the former is within an allocation range, i.e. [base_ptr, base_ptr + size).
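
The range test can be sketched as follows. The `Allocation` record and `find_allocation` helper are hypothetical names, and addresses are modelled as plain integers:

```python
from typing import NamedTuple, Optional, Sequence


class Allocation(NamedTuple):
    base_ptr: int
    size: int


def find_allocation(ptr: int, allocations: Sequence[Allocation]) -> Optional[Allocation]:
    for alloc in allocations:
        # Half-open interval: base_ptr is inside, base_ptr + size is not.
        if alloc.base_ptr <= ptr < alloc.base_ptr + alloc.size:
            return alloc
    return None


allocations = [Allocation(0x1000, 256), Allocation(0x2000, 512)]
```

A slice that starts 16 bytes into the first buffer (`0x1010`) still resolves to that allocation, while `0x1100`, one past its end, does not.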

This PR also squashed in original base PR #160795:
Since (i) `handle = rendezvous(tensor)`, and (ii) we pass `handle->buffer_ptrs` to kernels, `handle` should carry the `data_ptr()` of tensor instead of the base address of a memory allocation (previous case).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160825
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-08-19 19:08:45 +00:00
65d21dae18 [inductor] dont reuse buffers if it affects peak (#145883) (#159530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530
Approved by: https://github.com/eellison
2025-08-19 19:02:56 +00:00
62db8ec391 windows python 3.14 nightly builds (#159869)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159869
Approved by: https://github.com/malfet, https://github.com/williamwen42
2025-08-19 18:36:16 +00:00
5dad5b4f57 [AIDIR] Revise the insight content (#160649)
Summary:
Make it more descriptive and understandable to users.

Rollback Plan:

Differential Revision: D80218659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160649
Approved by: https://github.com/jingsh
2025-08-19 18:04:49 +00:00
fab5dac734 Tweak dependabot to run inductor jobs (#160935)
After https://github.com/pytorch/pytorch/pull/160635, I can see dependabot creating the PR to bump the `transformers` version at https://github.com/pytorch/pytorch/pull/160807. This is a good start, but there are several tweaks we need:

1. Run inductor tests on the PR including one round of perf benchmark, which is always needed.  So, we need `ciflow/inductor` label and a `pull_request` trigger for the benchmark
2. Per @anijain2305 feedback, we don't need to update patch version.  So, I add a rule to ignore it.  Again, we would need to test this out after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160935
Approved by: https://github.com/anijain2305
2025-08-19 17:56:07 +00:00
a44a0d3671 [MPS] Fix index_add for complex + int64 (#160926)
By re-using deterministic algorithm from
bbc7c03e93/aten/src/ATen/native/cuda/Indexing.cu (L1106-L1113)

Fixes https://github.com/pytorch/pytorch/issues/160845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160926
Approved by: https://github.com/manuelcandales
ghstack dependencies: #160850, #160889
2025-08-19 17:43:06 +00:00
2f0cba934d [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-19 17:32:47 +00:00
0a5ab612dd Port amax to stable ABI (#160214)
To enable porting torchaudio to the stable ABI, we need the `amax` operation to be accessible. This PR ports the op and provides tests that it behaves correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160214
Approved by: https://github.com/mikaylagawarecki
2025-08-19 17:24:53 +00:00
1fbe230b0d forward fix #160747 (#160981)
broke rocm inductor tests

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160981
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-19 17:16:41 +00:00
eddaaa6c2a Revert "Recheck Autotune cache on Precompile serialization to prune compilation results (#158656)"
This reverts commit 664005662ad8c9aa1942015397048aa9ca14fd6d.

Reverted https://github.com/pytorch/pytorch/pull/158656 on behalf of https://github.com/seemethere due to failing internal tests, see D80486843 ([comment](https://github.com/pytorch/pytorch/pull/158656#issuecomment-3201491561))
2025-08-19 16:53:20 +00:00
fecc5f6001 [codemod] Fix unused-local-typedef issue in caffe2/aten/src/ATen/native/cuda/CUDALoops.cuh +2 (#160944)
Summary:
LLVM has a warning `-Wunused-local-typedef` which we are enabling to remove unused code. This has the side effect of making it easier to do refactors such as removing unnecessary includes.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan:
Sandcastle

Rollback Plan:

Differential Revision: D80511128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160944
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-19 16:49:29 +00:00
f305019377 [inductor] propagate shapes in CSEVariable (#152198)
Fixes #149905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198
Approved by: https://github.com/eellison
2025-08-19 16:46:38 +00:00
50cfe76231 Update checkpoint warning to target PyTorch 2.9 (#160725)
Follow-up to #160534. Fixes the docstrings and the warning in checkpoint_sequential, which presumably should have the same deprecation notice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160725
Approved by: https://github.com/soulitzer
2025-08-19 15:08:50 +00:00
9225c61994 Move save guard error throwing to separate phase (#160662)
This diff moves the portion of guard saving that can throw out of GuardBuilder and into `serialize_guards`. This lets me add a try/catch around it for caching precompile later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160662
Approved by: https://github.com/zhxchen17
2025-08-19 14:46:43 +00:00
e3ebf364e6 Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)"
This reverts commit 5d9653d90ee003173dd03f93e09fed236500ef06.

Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/malfet due to It broke inductor tests by improving them ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3200834103))
2025-08-19 13:46:53 +00:00
284b719005 Remove the uncessary empty file (#160728)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728
Approved by: https://github.com/Skylion007
2025-08-19 10:54:08 +00:00
daeb3a6094 Using std::make_unique<T>() instead of unique<T>(new T()) (#160723)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160723
Approved by: https://github.com/Skylion007
2025-08-19 10:25:47 +00:00
cyy
5d9653d90e Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)
Because numpy 1.22.4 had reached EOL 3 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836
Approved by: https://github.com/malfet
2025-08-19 09:15:06 +00:00
df60736410 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but someone working with an alternative version of Triton may not match it. This moves the version check from Triton 3.4 to any variant that supports the TMA APIs.

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-19 07:32:55 +00:00
8f31aa97a3 [dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934)
Fixes #157399
cherry pick of d6a5c03

@mlazos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160934
Approved by: https://github.com/mlazos
2025-08-19 06:01:26 +00:00
29afde2020 [CD] Build libtorch without nvshmem (#160910)
It was done once for cuSparseLT in f01d7105b1 , now it's nvShmem's time

Fixes https://github.com/pytorch/pytorch/issues/160762
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160910
Approved by: https://github.com/Skylion007
2025-08-19 05:58:25 +00:00
8dbe7f99bd [BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)
allow_tf32 is deprecated. Also, this will make it easier to support tf32x3 (i.e. #160359).
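
As a rough sketch of the migration, the old boolean can be mapped onto the new string argument. `input_precision_from_allow_tf32` is a hypothetical helper, not part of Triton or Inductor, and the mapping of `True` to `"tf32"` and `False` to `"ieee"` is an assumption about how the deprecated flag corresponds to `input_precision`.

```python
def input_precision_from_allow_tf32(allow_tf32: bool) -> str:
    """Translate the deprecated boolean flag into an `input_precision` string."""
    # Assumption: "tf32" and "ieee" are the replacements for True and False;
    # "tf32x3" has no boolean equivalent and must be requested explicitly.
    return "tf32" if allow_tf32 else "ieee"


print(input_precision_from_allow_tf32(True))
print(input_precision_from_allow_tf32(False))
```

A string argument leaves room for a third mode such as tf32x3, which a boolean cannot express.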

dashboard results on h100 show no change: [inference](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f), [training](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160711
Approved by: https://github.com/PaulZhang12, https://github.com/njriasan
2025-08-19 05:27:10 +00:00
1d46aa736f [audio hash update] update the pinned audio hash (#160930)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160930
Approved by: https://github.com/pytorchbot
2025-08-19 04:22:55 +00:00
2cf69fe0e1 [vllm hash update] update the pinned vllm hash (#160929)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160929
Approved by: https://github.com/pytorchbot
2025-08-19 04:22:45 +00:00
923bc46122 fix mul.Scalar with strided tensor (#160560)
Summary: The out variant has to be strided like self. Since memory format isn't provided, this should be equivalent.

Test Plan:
Previously, when static dispatch was enabled, this test had numeric issues:
```
buck2 test //caffe2/test:test_export -- test__scaled_dot_product_flash_attention_cpp_runtime_nonstrict --print-passing-details
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80191085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160560
Approved by: https://github.com/SherlockNoMad
2025-08-19 04:15:12 +00:00
58f9a3dd63 [ez] Only use default numa bindings if nproc == cuda device count (#160848)
# Context
Another fix to enable broad rollout of #149334.

The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality.

# This PR
As titled.
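
The condition in the title can be sketched as a small predicate. `should_apply_default_numa_bindings` and its parameter names are hypothetical; the actual check in the launcher may be structured differently.

```python
def should_apply_default_numa_bindings(nproc_per_node: int, cuda_device_count: int) -> bool:
    """Apply default NUMA bindings only when each process maps to one GPU."""
    # With one process per visible device, local rank n is assumed to use
    # cuda:n, so binding rank n near that device is expected to be safe.
    # Any other ratio (e.g. several GPUs per process) breaks that assumption.
    return nproc_per_node == cuda_device_count


print(should_apply_default_numa_bindings(8, 8))  # one process per GPU
print(should_apply_default_numa_bindings(4, 8))  # two GPUs per process
```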

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848
Approved by: https://github.com/kiukchung
2025-08-19 02:50:01 +00:00
a391fa1c42 Make Inductor benchmarker more compatible with Triton do_bench (#160921)
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair for all backends. E.g., it includes torch.compile Dynamo invocation overhead and hence doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

I also opened a PR to use this timing measurement function on the TritonBench side: https://github.com/meta-pytorch/tritonbench/pull/333. But regardless of whether that PR can land, I think we should enhance Inductor benchmark_gpu to match do_bench features, to make it easier for people to migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160921
Approved by: https://github.com/BoyuanFeng
2025-08-19 02:40:21 +00:00
209143ddeb [while_loop][inductor] fix aliased inputs by cloning (#160668)
[fx_graph_cse](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/compile_utils.py#L46) is executed in min_cut partitioner which accidentally creates the aliasing for empty buffers and we could see the following graph node for joint graph with cmd: "pytest test/functorch/test_control_flow.py -k test_scan_multiple_layers_gradient_layers_2_device_cpu"
```python
while_loop = torch.ops.higher_order.while_loop(while_loop_cond_graph_0_0, while_loop_body_graph_0_0, (full_default_4, empty_strided_default, full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default, rev, rev_1, rev_2, rev_3), (primals_4, primals_5, primals_6, primals_7));
```

Notice the operand sequence **"full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default"**, which indicates that the gradients of different layers now share the same buffer, which creates silent incorrectness.

Fixes https://github.com/pytorch/pytorch/pull/158168.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160668
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160374
2025-08-19 02:33:59 +00:00
b1380f434d [CD] Disable USE_MPI in XPU CI/CD wheel build (#159135)
The XPU wheel build needs to source MPI for the distributed XCCL backend build, but it also enables USE_MPI by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159135
Approved by: https://github.com/malfet
2025-08-19 02:32:03 +00:00
e6e45e6ae8 [FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload (#160481)
Fixes https://github.com/pytorch/pytorch/issues/160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but CPU-GPU sync is hard coded to `reduce_scatter_stream`
The hard-coded stream could fail unit tests on HSDP+CPU offload; this PR adds a unit test for that case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160481
Approved by: https://github.com/weifengpy
2025-08-19 02:20:14 +00:00
3d126e17e0 [FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136)
**Summary:** In its current state, FSDP collectives uses cuda synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have created a test that uses CommDebugMode to verify that the reduce_scatter has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. I have also added a test command.

**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160136
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135
2025-08-19 02:13:30 +00:00
8d15af2320 [PT2]: Allow None for wrapped_fbgemm_linear_fp16_weight (#160802)
Summary: Currently the implementation of [fbgemm_linear_fp16_weight](https://www.internalfb.com/code/fbsource/[ffe8ba561cb6af33fde5b32c27411d6d3f4f2c70]/fbcode/caffe2/aten/src/ATen/native/QuantizedLinear.cpp?lines=477) does not allow None for `bias`, but it's actually a valid case and internally `fbgemm_linear_fp16_weight_fp32_activation` accepts a None bias as well. For BC reasons, we can't directly change the function signature, so we wrap an empty tensor when bias is None to work around it in Sigmoid.

Test Plan:
P1906210273
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=778442870
SNAPSHOT_ID=6
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=10000 &>  ~/logs/${MODEL_TYPE}/load_net_predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}_${MODULE}
```

Rollback Plan:

Reviewed By: henryoier, hl475

Differential Revision: D80382652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160802
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-08-19 01:46:53 +00:00
e9209e0854 [dynamo] Refactor tracer logic in convert_frame so that it doesn't leak to outer layer. [1/n] (#160814)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

One incremental step we can take is to refactor out InstructionTranslator as a functional piece providing bytecode tracing.

To separate out this part, we notice currently the tracer object is being passed around in the entire convert frame compile function. This is not very ideal because we want to build a boundary between the tracing and downstream compiler stack. Ideally, we should extract all the relevant information out of the tracer object and return a new data structure that is free of internal states of InstructionTranslator.

Luckily, not much data from the tracer is used after tracing is finished. The major piece is OutputGraph; other than that, we only need to record two boolean flags for error handling purposes.

The new type we're adding is called DynamoTracerOutput, which contains all the information needed by torch.compile internals after symbolic convert is finished. To simplify the current PR, we leave out the part which reduces OutputGraph to a minimal set, since this can be done in a separate PR.

Differential Revision: [D80388693](https://our.internmc.facebook.com/intern/diff/D80388693/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160814
Approved by: https://github.com/tugsbayasgalan
2025-08-19 01:46:24 +00:00
4cb31015f2 [dynamic shapes] prims_common non_overlapping_and_dense (#160462)
Differential Revision: D80120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160462
Approved by: https://github.com/laithsakka
2025-08-19 01:35:28 +00:00
5e98d9f9ba Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 56218d85e2da09d9ede3809718ec989c2151632c.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think this is failing test_draft_export in trunk 56218d85e2 ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3198874677))
2025-08-19 01:16:17 +00:00
5cf6567c1f [Inductor] add cuda compile cmd to autotuning logging (#160906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160906
Approved by: https://github.com/henrylhtsang
2025-08-19 01:14:46 +00:00
41b3e80a55 Fix duplicated kernel name in kernel stack trace tracking (#160905)
Summary: as title. When we have two kernels with the same name, the stack traces should be appended, not overwritten.
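
The append-versus-overwrite behavior can be illustrated with a minimal sketch. `collect_kernel_stack_traces` and the sample kernel names are hypothetical and are not taken from the provenance tracking code.

```python
from collections import defaultdict


def collect_kernel_stack_traces(records):
    """Group stack traces by kernel name, keeping every trace."""
    traces = defaultdict(list)
    for kernel_name, stack_trace in records:
        # Appending keeps earlier traces; plain assignment
        # (traces[kernel_name] = stack_trace) would drop them.
        traces[kernel_name].append(stack_trace)
    return dict(traces)


records = [
    ("triton_poi_fused_add_0", "model.py:10 in forward"),
    ("triton_poi_fused_add_0", "model.py:25 in forward"),
    ("triton_red_fused_sum_1", "model.py:31 in forward"),
]
print(collect_kernel_stack_traces(records))
```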

Test Plan:
```
 buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D80472731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160905
Approved by: https://github.com/angelayi
2025-08-19 01:14:34 +00:00
b6852778ff Add Magma build for CUDA 13.0 (#160770)
Add magma build for CUDA 13.0 after almalinux docker is available

https://github.com/pytorch/pytorch/issues/159779
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160770
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-08-19 01:10:00 +00:00
1853f71b4f [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-19 00:54:51 +00:00
bbc7c03e93 Fix UndefinedGrad::apply (#160572)
The function incorrectly reserved space in the input parameter instead of the output parameter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160572
Approved by: https://github.com/soulitzer
2025-08-19 00:15:51 +00:00
dc200066cf [ONNX] Use onnxruntime 1.22 in CI (#160924)
Use onnxruntime 1.22 in CI to enable testing of newer opsets and IR versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160924
Approved by: https://github.com/titaiwangms
2025-08-19 00:05:26 +00:00
56218d85e2 [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-18 22:38:16 +00:00
0254646654 harden fabric checks for symmetric memory (#160790)
Now we check only that fabric allocation succeeded, but sometimes we fail during export or import afterwards, with no recourse. Check the full cycle before attempting to allocate memory with the fabric.
TODO: move it to c10/cuda so that it can be used from CUDACachingAllocator too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160790
Approved by: https://github.com/Skylion007
2025-08-18 22:35:50 +00:00
b439675ae2 [nativert] oss pass graph pass registration (#160859)
Summary: att

Test Plan:
CI

Rollback Plan:

Differential Revision: D80368343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160859
Approved by: https://github.com/georgiaphillips
2025-08-18 22:23:38 +00:00
82c7a1eb4b Revert "[ONNX] Default to dynamo export (#159646)"
This reverts commit 11b6ceb7b4f81ba02f88652136a93d685c399191.

Reverted https://github.com/pytorch/pytorch/pull/159646 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159646#issuecomment-3198507767))
2025-08-18 21:41:32 +00:00
16ada80c61 [BE][CUDA][Distributed] Add require_exact_world_size() and a few distributed unit test fixes (#160803)
1. Add require_exact_world_size()
2. Decorate the test `test_new_subgroups_with_group_param` with this require_exact_world_size(4) as the test would fail with world_size of 8 when testing with 8xB200 runner.
3. Modify `test_new_subgroups_world_size_not_divisible_by_group_size` so that it will not fail due to 4 vs. 8 mismatch. Doing so makes the test pass with both 4-GPU runner and 8-GPU runner.
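
A decorator of this kind could look roughly like the sketch below. It is not the implementation added in this PR: the `get_world_size` parameter is a hypothetical addition that keeps the example self-contained, whereas the real helper would read the world size from the test environment.

```python
import functools
import unittest


def require_exact_world_size(expected, get_world_size):
    """Skip the decorated test unless the world size equals `expected`."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            actual = get_world_size()
            if actual != expected:
                # Skipping (rather than failing) lets the same test file run
                # on both 4-GPU and 8-GPU runners.
                raise unittest.SkipTest(
                    f"requires world size {expected}, got {actual}"
                )
            return test_fn(*args, **kwargs)

        return wrapper

    return decorator


@require_exact_world_size(4, get_world_size=lambda: 4)
def runs_on_four_gpus():
    return "ran"


@require_exact_world_size(4, get_world_size=lambda: 8)
def skipped_on_eight_gpus():
    return "ran"
```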

Separating these changes out from B200 distributed runner PR #159323

Fixes https://github.com/pytorch/pytorch/issues/159987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160803
Approved by: https://github.com/fduwjj
2025-08-18 21:15:33 +00:00
c27d6df1ea For sdists, replace symlink with copy for docs requirements (#157811)
Before this change, there was the requirements file `.ci/docker/requirements-docs.txt` which was symlinked as `../.ci/docker/requirements-docs.txt` from `docs/requirements.txt` since #151796.
In this situation, [because `.ci` is excluded from the source tarball](3173616532/.github/workflows/create_release.yml (L67)), we end up with a broken symlink, that additionally is [invalid in a Python source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/#unpacking-without-the-data-filter).

The broken symlink can be confirmed in [the rc sources](https://github.com/pytorch/pytorch/actions/runs/15892205745).

~After this change, there is still a single source of truth, which now is `docs/requirements.txt`, symlinked as `../docs/requirements.txt` from `.ci/docker/requirements-docs.txt`, which would also be invalid in a Python source distribution, but is not included in the tarball (see above). Additionally, the docs requirements that were missing from the previous tarball, are now actually included, allowing users to build the documentation again.~

@malfet clarified offline that there is a problem with the docs workflows because they use a cache with a key that includes the hash of the requirements document in the `.ci` folder, which now no longer changes when the requirements change. Hence, a different solution is needed~, though for now the problem remains~.

The solution in this PR is simply to copy the actual document to replace the symlink just prior to creating the source distribution. This way, a single document needs to be maintained, git checkouts remain as they are, and the source distributions contain the before-missing document.

A better solution may be implemented at a later stage with a better build system.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157811
Approved by: https://github.com/atalman
2025-08-18 21:10:44 +00:00
d910cb3b2d [cpp][inductor] Fix crash on bmm when input is used twice. (#160087)
Fixes #156412

For torch.bmm using CPP generated template code, when the input is used as both the first and second weights, the generated code will simplify so it only passes one input instead of 2. However, if the weights are being repacked and saved for more efficient data-loading patterns, then we need to save both inputs instead of just one. This PR fixes this issue.

## Test code:
```python
import torch

@torch.compile(mode="max-autotune")
def my_function(x, y):
    return torch.bmm(x, x)

# Test
x = torch.randn(2, 3, 3)
y = torch.randn(2, 3, 3)
result = my_function(x, y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160087
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-08-18 20:34:14 +00:00
a1a555ed7b [dynamo] Fix graph break on calling functions decorated with special context manager (#160703)
As title. This is a follow-up of the previous patch, with the goal of
supporting a new pattern that showed up in ComfyUI:
644b23ac0b/comfy/ops.py (L44)

Effectively, the semantics of calling a function decorated with a
context manager is:

```python
@ctx_manager(args)
def f(x):
    ...

f(x)
# ----->
with ctx_manager(args):
    f.__wrapped__(x)
```

Yes, a fresh context manager instance per invocation, see CPython source code:
https://github.com/python/cpython/blob/3.12/Lib/contextlib.py#L119-L122
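
The "fresh instance per invocation" behavior can be observed with plain `contextlib`, without Dynamo. The names `events`, `ctx_manager`, and `f` below are illustrative only.

```python
from contextlib import contextmanager

events = []


@contextmanager
def ctx_manager(tag):
    events.append(f"enter:{tag}")
    try:
        yield
    finally:
        events.append(f"exit:{tag}")


# Using the context manager as a decorator wraps `f` so that every call
# runs inside a newly created context manager instance.
@ctx_manager("a")
def f(x):
    events.append(f"call:{x}")
    return x + 1


first = f(1)
second = f(2)
print(events)
```

A generator-based context manager is single-use, so the second call only works because the decorator recreates it on each invocation. The undecorated function remains reachable as `f.__wrapped__`.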

So Dynamo already
1. knows how to handle the `with ctx_manager(args)` syntax, and has
   special handling for a few torch native context managers, like
   `sdpa_kernel` in this patch.
2. can trace through a good chunk (at least the ones that matter in this
   case) of contextlib.

This patch just lets Dynamo trace a bit more into contextlib, and then
keep the torch-native special cases by moving their handling a bit down
the stack, so that no additional logic is introduced -- it's only
refactored.

This also allows us to get rid of some `_sdpa_kernel_variadic` special
handling, since now we will trace through its code, and it boils down to
`sdpa_kernel` anyways.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160703
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
ghstack dependencies: #160684
2025-08-18 20:33:45 +00:00
72b559b2c8 [dynamo] Fix crash and silent incorrectness issues in attention.sdpa_kernel calls with kwargs (#160684)
This patch fixes 2 issues, illustrated by the test cases added:
1. a crash when using `sdpa_kernel(backends=..., set_priority=...)`, due to an
   internal assert that was not updated after #147768.
2. forgetting to convert the `set_priority` VariableTracker back to a
   python constant so that its value is properly used by `sdpa_kernel`,
   also from #147768.

I ran into (1) because ComfyUI had a recent update that actually uses
this pattern
644b23ac0b/comfy/ops.py (L44),
and then noticed (2), and fixed it conveniently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160684
Approved by: https://github.com/mlazos
2025-08-18 20:33:45 +00:00
cyy
1f19003694 Use py3.10 for ONNX CI jobs (#160852)
Use Python 3.10 for ONNX jobs because Python 3.9 is near EOL and further ONNX versions drop 3.9 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160852
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-08-18 19:37:47 +00:00
4e90441133 Add signpost to provenance tracking error (#160755)
Summary: As title, add signpost to better track error when computing provenance tracking related debugging information

Test Plan:
CI

Rollback Plan:

Differential Revision: D80292285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160755
Approved by: https://github.com/angelayi
2025-08-18 19:17:47 +00:00
bfcae7e1c1 [ROCm] Fix Sliding Window Attention in AOTriton integration code (#159773)
AOTriton implements Sliding Window Attention (SWA) as a more generalized version of causal masks and also needs an atomic counter for dynamic workload allocation.

Fixes #158308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159773
Approved by: https://github.com/jeffdaily
2025-08-18 18:45:58 +00:00
01bba62e21 Remove unused test code (#160823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160823
Approved by: https://github.com/Skylion007
2025-08-18 18:37:52 +00:00
6ac9035a84 [aoti-fx] Dynamic shapes support (#160766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160766
Approved by: https://github.com/jansel
ghstack dependencies: #160765
2025-08-18 18:14:08 +00:00
bab79824cb [aoti-fx] Initial AOTInductor FX (#160765)
Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module.

How to use:
```python
exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module()
compiled_gm = torch._inductor.aot_compile(
    exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1}
)
assert torch.allclose(model(*inp), compiled_gm(*inp))
```

The motivation behind this is that backends like ExecuTorch/MTIA would like to use inductor's optimization technologies, but might have their own graph lowering pipelines so they might not want to use AOTI (which generates an so).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765
Approved by: https://github.com/jansel
2025-08-18 18:14:08 +00:00
162bf78df6 [dynamo] Support itertools.filterfalse (#160596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160596
Approved by: https://github.com/guilhermeleobas
2025-08-18 18:07:57 +00:00
450517f346 [Dynamo][Hierarchical Compile] Flatten tuple inputs for regions (#158812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158812
Approved by: https://github.com/anijain2305
ghstack dependencies: #158810, #158811
2025-08-18 18:03:11 +00:00
664005662a Recheck Autotune cache on Precompile serialization to prune compilation results (#158656)
This PR rechecks the autotune cache on Precompile.serialize(), allowing us to ahead of time save autotune results for statically compiled triton kernels, so that warm start does not need to check the autotune cache.

It has a few extra changes to make this work:

### Storing source code in TritonBundler
- We now store the source_code for statically compiled triton kernels instead of the hash of the source code in TritonBundler, so that we can easily access their source code when rechecking the autotune cache on PrecompileContext.serialize. To make sure that this is not a huge space concern, I ran the entire hugging face benchmark on training. The total space of `/tmp/torchinductor_jjwu/fxgraph` before my change was 1185004 KB (1.18 GB). After my change, this increased to 1207312 KB (1.2 GB), for an increased storage cost of ~1.8%, which seems safe.

- We now return early from recheck_autotune_cache if the number of triton kernels being compiled is 1, since there's no reason to check the cache at all in those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158656
Approved by: https://github.com/zhxchen17
2025-08-18 17:55:10 +00:00
c0a1ae4404 Add is_cpu method to stable tensor type (#160212)
Porting torchaudio to use the stable api requires the `is_cuda` and `dtype` functions. It would be more convenient if these were methods of the stable tensor class rather than utilities one needed to call from the C api. This PR adds them as methods, mirroring how `is_cuda` and `get_device` are already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160212
Approved by: https://github.com/janeyx99
2025-08-18 17:42:43 +00:00
b0071c65e2 [MPS] Fix error check for torch.var on scalar (#160889)
Fixes https://github.com/pytorch/pytorch/issues/160738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160889
Approved by: https://github.com/Skylion007
ghstack dependencies: #160850
2025-08-18 17:36:42 +00:00
c6333f7dae Fixes for collections.NamedTuple (#159367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159367
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864, #159865
2025-08-18 17:32:59 +00:00
87d6831b2e Add CUDA installation script for CUDA 13 (#160201)
Add the almalinux docker for building magma-cuda 13.0
https://github.com/pytorch/pytorch/issues/159779

Also fixed the NVSHMEM download link

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160201
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-08-18 17:26:25 +00:00
4014672b30 Replace guard_serialization_mode with save_guards, remove load cases (#160531)
This PR replaces "guard_serialization_mode" with `save_guards`. All cases where we care about whether or not we're *loading* guards can be inferred automatically from the existing inputs.

The only case that's special here is whether or not to check guards. We don't want to check guards on guard load in CheckFnManager, because these guards have already been checked on save. Therefore, we put the setting in OutputGraphGuardsState, so that when we save, we bypass the guards check.

Because of this change, it is *technically* possible to do a load and a save in the *same* CheckFunctionManager.__init__() by passing all the necessary parts, and also passing `save_guards=True`. This should just work out of the box, but so far no callsites need it, so not super important.

Next up, we'll work on removing save_guards from GuardBuilder, and putting it into its own phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160531
Approved by: https://github.com/zhxchen17
2025-08-18 17:04:17 +00:00
e389a08dcd AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360)
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.

  This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.

- Validation (**ROCm 7.0** + **gfx950** required):

  `111 relevant tests passing.`

  > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

  Co-author: @jagadish-amd —  Thank you for the efforts leading validation on gfx950 with ROCm 7.0.

-----------------------------------

This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.

### Support for new scalar types and scaling methods:
* [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934)

* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364)
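
The multiple-of-32 constraint for block-wise scaling mentioned above can be sketched as a standalone check. `check_blockwise_dims` is a hypothetical helper, and applying the constraint to all three GEMM dimensions is an assumption; the checks in `CUDABlas.cpp` and `Blas.cpp` may cover a different set of dimensions.

```python
def check_blockwise_dims(m: int, n: int, k: int, block: int = 32) -> None:
    """Raise ValueError if any GEMM dimension is not a multiple of `block`."""
    # Assumption: M, N and K are all constrained; `block` defaults to the
    # multiple of 32 stated in the PR description.
    bad = {name: value for name, value in (("M", m), ("N", n), ("K", k)) if value % block != 0}
    if bad:
        raise ValueError(f"dimensions must be multiples of {block}, got {bad}")


check_blockwise_dims(64, 128, 32)  # passes silently
try:
    check_blockwise_dims(64, 100, 32)
except ValueError as exc:
    print(exc)
```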

### Updates to scalar type mappings:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.

* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.

### Enhancements to testing(@jagadish-amd):
* [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356)

These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
2025-08-18 16:43:09 +00:00
f2be3dc8da [dynamo][guards] Optimize module getattr access for inline flag (#160864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160864
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160863
2025-08-18 16:38:46 +00:00
b8ff0fd21b [dynamo][guards] Remove long lines from TORCH_LOGS=guards (#160863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160863
Approved by: https://github.com/Lucaskabela
2025-08-18 16:38:46 +00:00
6b994c47ca [MPS][BE] Fix unused vars in GridSampler (#160850)
This fixes following warnings during the compilation of GridSampler.metal
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:22:23: warning: unused parameter 'input_sizes' [-Wunused-parameter]
    constant int32_t* input_sizes,
                      ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:24:23: warning: unused parameter 'grid_sizes' [-Wunused-parameter]
    constant int32_t* grid_sizes,
                      ^
2 warnings generated.
```

Introduced by https://github.com/pytorch/pytorch/pull/160541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160850
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-18 16:24:45 +00:00
3c8c509a9c [export] Fix custom ops in subgraphs (#160004)
Fixes https://github.com/pytorch/pytorch/issues/159995

Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernel list is within its own subgraph, the indices referencing the operators are messed up because each subgraph starts counting from 0.

So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
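
The indexing problem can be illustrated with a minimal sketch. The `Graph` class and both helper functions are hypothetical stand-ins, not Inductor's actual data structures; the sketch only shows that per-subgraph lists restart at index 0, while a single shared list gives every kernel a unique index.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Graph:
    """Hypothetical graph holding extern kernel names and nested subgraphs."""
    kernels: List[str]
    subgraphs: List["Graph"] = field(default_factory=list)


def index_per_graph(graph: Graph) -> List[Tuple[int, str]]:
    """Old behavior (sketch): every graph counts from 0, so indices collide."""
    out = list(enumerate(graph.kernels))
    for sub in graph.subgraphs:
        out.extend(index_per_graph(sub))
    return out


def index_globally(graph: Graph, registry: Optional[List[str]] = None) -> List[Tuple[int, str]]:
    """New behavior (sketch): one shared list, so each kernel gets a unique index."""
    registry = [] if registry is None else registry
    out = []
    for name in graph.kernels:
        out.append((len(registry), name))
        registry.append(name)
    for sub in graph.subgraphs:
        out.extend(index_globally(sub, registry))
    return out


# Kernel names are made up for illustration.
top = Graph(["custom::foo"], [Graph(["custom::bar"]), Graph(["custom::baz"])])
per_graph = index_per_graph(top)   # all three kernels end up with index 0
global_view = index_globally(top)  # indices 0, 1, 2
```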

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
2025-08-18 15:42:19 +00:00
1091165826 [export] Update move_to_device_pass for to.device (#160528)
Differential Revision: D80135455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160528
Approved by: https://github.com/yushangdi
2025-08-18 15:41:48 +00:00
d91a03f96a [ROCm] Add HIPConfig.h to .gitignore like CUDAConfig.h. (#159805)
This file is generated into the source directory by CMake just like `cuda/CUDAConfig.h`, so it seems appropriate to add it to `.gitignore` in the same place: 83ba3f1101/aten/src/ATen/CMakeLists.txt (L39-L47)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159805
Approved by: https://github.com/jeffdaily
2025-08-18 15:34:01 +00:00
0298ebc97a [ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list (#160001)
The GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds the model to an exception list for small tensors, which already includes two models. For this model, the multiplier factor used in `torch/_dynamo/utils.py` is increased to 10.0 instead of the default 3.

In the PR comment below, I include a short analysis of the numerics.
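
A minimal sketch of how such a multiplier could enter an accuracy check. This is an illustrative assumption, not the code in `torch/_dynamo/utils.py`: `passes_accuracy_check`, its arguments, the slack term, and the exact comparison are all hypothetical. It assumes the compiled result's error is allowed to be at most `multiplier` times the eager result's error, both measured against a higher-precision reference.

```python
def passes_accuracy_check(res_error: float, ref_error: float,
                          multiplier: float = 3.0, tol: float = 1e-4) -> bool:
    """Hypothetical check: accept if compiled error <= multiplier * eager error + slack."""
    return res_error <= multiplier * ref_error + tol


# A numerically small failure: the compiled error is 5x the eager error.
ref_error = 2e-3
res_error = 1e-2
default_ok = passes_accuracy_check(res_error, ref_error)        # default multiplier 3.0
relaxed_ok = passes_accuracy_check(res_error, ref_error, 10.0)  # relaxed multiplier 10.0
```

Under these made-up numbers the default multiplier rejects the result and the relaxed one accepts it, which is the effect the exception list is meant to have.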

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
2025-08-18 15:33:30 +00:00
179511694c Update slow tests (#160870)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160870
Approved by: https://github.com/pytorchbot
2025-08-18 11:53:41 +00:00
e7c3b77b22 [xla hash update] update the pinned xla hash (#160871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160871
Approved by: https://github.com/pytorchbot
2025-08-18 11:50:47 +00:00
95e456fcc5 [inductor] pack linear for FP32 dynamic mode (#157542)
Summary:
Currently, Linear in FP32 dynamic mode (batch_size has free symbols) does not support weight prepacking, since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. About 18 models show some degree of performance improvement; among them, BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, and twins_pcpvt_base improve by more than 20%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-18 10:18:46 +00:00
de744ca4b1 [Inductor] modify convert_to_reinterpret_view (#158914)
**Summary:**
Fixes https://github.com/pytorch/pytorch/issues/159121. Modify the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: relax the condition `isinstance(x_unwrap_view, (ReinterpretView, Buffer))` to `isinstance(x_unwrap_view, (ReinterpretView, Buffer, MutableBox))`. Prefer channels last format according to how the format of `x_unwrap_view_fx_node` is set from eager.

**Example:**
```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        n, c, h, w = x.shape
        return self.relu(x).permute(0, 2, 3, 1).reshape(
            n, h * w, c
        )

model = M().eval()
x = torch.randn(2, 32, 4, 4).to(memory_format=torch.channels_last)

compiled_model = torch.compile(model)

with torch.no_grad():
    compiled_model(x)
```

**Generated code:**
- Before
```
cpp_fused_permute_relu_view_0 = async_compile.cpp_pybinding(['const float*', 'float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const float* in_ptr0,
                       float* out_ptr0,
                       float* out_ptr1)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(16L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(16L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(32L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(16L)))
                        {
                            alignas(std::max(std::size_t(16), alignof(float))) float tmp0[16*16];
                            transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(in_ptr0 + static_cast<int64_t>(x1 + 32L*x2 + 512L*x0), static_cast<int64_t>(32L), tmp0, static_cast<int64_t>(16));
                            for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++)
                            {
                                auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16L*x1_inner), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
                                tmp2.store(out_ptr0 + static_cast<int64_t>(x2 + 16L*x1 + 16L*x1_inner + 512L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(16L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            alignas(std::max(std::size_t(16), alignof(float))) float tmp0[16*16];
                            transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(out_ptr0 + static_cast<int64_t>(x1 + 16L*x2 + 512L*x0), static_cast<int64_t>(16L), tmp0, static_cast<int64_t>(16));
                            for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++)
                            {
                                auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16L*x1_inner), static_cast<int64_t>(16));
                                tmp1.store(out_ptr1 + static_cast<int64_t>(x2 + 32L*x1 + 32L*x1_inner + 512L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32))
    buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 16, 4, 1), torch.float32)
    buf1 = empty_strided_cpu((2, 16, 32), (512, 32, 1), torch.float32)
    cpp_fused_permute_relu_view_0(arg0_1, buf0, buf1)
    del arg0_1
    return (buf1, )
```

- After
```
cpp_fused_relu_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1024L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1024L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::clamp_min(tmp0, decltype(tmp0)(0));
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0));
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32))
    buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 1, 128, 32), torch.float32)
    cpp_fused_relu_0(arg0_1, buf0)
    del arg0_1
    return (reinterpret_tensor(buf0, (2, 16, 32), (512, 32, 1), 0), )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158914
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-18 07:41:20 +00:00
b82aa3df20 Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)"
This reverts commit e444cd24d48b3a46f067974f2cc157f5ed27709f.

Reverted https://github.com/pytorch/pytorch/pull/159197 on behalf of https://github.com/laithsakka due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/159197#issuecomment-3195436668))
2025-08-18 07:22:13 +00:00
d8d589bd3a Add build support for RISCV (#160172)
In requirements.txt, do not install lintrunner on riscv64

Fixes #160170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160172
Approved by: https://github.com/malfet
2025-08-18 05:29:34 +00:00
3c6efd1380 Add cutedsl template support to compile (#160108)
## Summary
Still figuring out what actually writing a template should look like, but this lands a lot of the base infra.

Test code:

```Python
#!/usr/bin/env python3
"""
Fixed CuteDSL template test with proper def_kernel usage.
"""

import torch
import torch._inductor.config as config
from torch._inductor.lowering import lowerings
from torch._inductor.ir import TensorBox
from torch._inductor.select_algorithm import autotune_select_algorithm
from torch._inductor.codegen.cutedsl import CuteDSLTemplate

def create_fixed_cutedsl_template():
    """Create a properly structured CuteDSL template."""

    def cutedsl_grid(M, N, meta):
        return (1,)

    # Part 1: Imports and kernel definition
    template_part1 = r"""
import torch
import cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

@cute.kernel
def {{kernel_name}}_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # Get thread and block indices
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    thread_idx = bidx * bdim + tidx
    m, n = gA.shape

    if thread_idx < m * n:
        mi = thread_idx // n
        ni = thread_idx % n

        if mi < m and ni < n:
            a_val = gA[mi, ni]
            b_val = gB[mi, ni]
            result = a_val + b_val
            gC[mi, ni] = result
"""

    # Part 2: JIT wrapper function
    template_part2 = r"""
@cute.jit
def {{kernel_name}}_jit(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    m, n = mA.shape
    total_threads = m * n
    threads_per_block = 256
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block

    kernel = {{kernel_name}}_kernel(mA, mB, mC)
    kernel.launch(
        grid=[num_blocks, 1, 1],
        block=[threads_per_block, 1, 1]
    )
"""

    # Part 3: Main kernel function
    template_part3 = r"""
{{def_kernel("input_a", "input_b", "output_c")}}
    cute_a = from_dlpack(input_a, assumed_align=16)
    cute_b = from_dlpack(input_b, assumed_align=16)
    cute_c = from_dlpack(output_c, assumed_align=16)

    # Launch kernel
    {{kernel_name}}_jit(cute_a, cute_b, cute_c)

    return output_c
"""

    # Combine all parts
    template = CuteDSLTemplate(
        name="fixed_add",
        grid=cutedsl_grid,
        source=template_part1 + template_part2 + template_part3
    )

    return template

def fixed_cutedsl_lowering(a: TensorBox, b: TensorBox) -> TensorBox:
    """Fixed CuteDSL lowering."""
    print(f"[FIXED] CuteDSL lowering: {a.get_size()} + {b.get_size()}")

    template = create_fixed_cutedsl_template()
    choices = []

    error = template.maybe_append_choice(
        choices,
        input_nodes=[a.data, b.data],
        layout=a.get_layout()
    )

    if error or not choices:
        print(f"[FIXED] Falling back: {error}")
        default_lowering = lowerings[torch.ops.aten.add.Tensor]
        return default_lowering(a, b)

    print(f"[FIXED] Using CuteDSL with {len(choices)} choices")

    result = autotune_select_algorithm(
        "fixed_cutedsl_add",
        choices,
        [a, b],
        a.get_layout(),
    )

    return result

def test_fixed_cutedsl():
    """Test the fixed CuteDSL template."""
    print("=" * 50)
    print("Fixed CuteDSL Template Test")
    print("=" * 50)

    original = lowerings.get(torch.ops.aten.add.Tensor, None)

    try:
        lowerings[torch.ops.aten.add.Tensor] = fixed_cutedsl_lowering

        def test_add(x, y):
            return x + y

        device = "cuda" if torch.cuda.is_available() else "cpu"
        x = torch.randn(128, 4, device=device, dtype=torch.float32)
        y = torch.randn(128, 4, device=device, dtype=torch.float32)

        print(f"[FIXED] Testing with {x.shape} tensors on {device}")

        compiled_fn = torch.compile(test_add, backend="inductor")
        result = compiled_fn(x, y)

        # Verify correctness
        expected = x + y
        if torch.allclose(result, expected, atol=1e-5):
            print(" [FIXED] Results match!")
            return True
        else:
            print(" [FIXED] Results don't match!")
            return False

    except Exception as e:
        print(f" [FIXED] Failed: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        if original:
            lowerings[torch.ops.aten.add.Tensor] = original
        else:
            lowerings.pop(torch.ops.aten.add.Tensor, None)

if __name__ == "__main__":
    success = test_fixed_cutedsl()
    print("🎉 Fixed test completed!" if success else "💥 Fixed test failed!")

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160108
Approved by: https://github.com/mlazos
2025-08-18 04:37:15 +00:00
d18007a1d0 [vllm hash update] update the pinned vllm hash (#160847)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160847
Approved by: https://github.com/pytorchbot
2025-08-18 04:36:28 +00:00
138413907a [nativert] oss subgraph rewriter (#160780)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D80367765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160780
Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips
2025-08-18 04:25:05 +00:00
3ced4f1e6c Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)"
This reverts commit 7a68d02292fd7a430b55c5bce3268a33c7ec5055.

Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/clee2000 due to broke some inductor jobs? Maybe just update the expected values? Not sure what the policy is for something like this [GH job link](https://github.com/pytorch/pytorch/actions/runs/17024529273/job/48262123844) [HUD commit link](7a68d02292) ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3194953213))
2025-08-18 03:09:31 +00:00
075a2e6967 [PGO] add extra read/write keys (#160715)
Differential Revision: D80321215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160715
Approved by: https://github.com/bobrenjc93
2025-08-18 01:41:08 +00:00
cyy
7a68d02292 Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)
Because numpy 1.22.4 reached EOL 3 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836
Approved by: https://github.com/malfet
2025-08-17 18:39:06 +00:00
63e1b58a13 [easy] [Precompile] Refactor guards, improve typing (#160530)
Purely a refactor: improve typing and get rid of some type errors. Mark certain fields as non-null, since in general they are not empty.

The goal of this stack of PRs is to move the save/load logic of guard serialization into separate, flat phases, instead of being embedded in guard creation. This way, we can put a try/catch around it and fail safely if certain guards are not serializable.
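
The separation into phases can be sketched as follows. This is only an illustration under stated assumptions: `build_guards` and `serialize_guards` are hypothetical names, and `pickle` stands in for whatever serialization the stack actually uses. The point is that serialization happens in one flat step that can be wrapped in a single try/except.

```python
import pickle
from typing import Any, Dict, List, Optional, Tuple


def build_guards(sources: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Phase 1 (sketch): create guards without touching serialization."""
    return [(name, value) for name, value in sources.items()]


def serialize_guards(guards: List[Tuple[str, Any]]) -> Optional[bytes]:
    """Phase 2 (sketch): serialize in one step; fail safely if a guard cannot be saved."""
    try:
        return pickle.dumps(guards)
    except (pickle.PicklingError, TypeError, AttributeError):
        # The caller can fall back (for example, skip caching) instead of crashing.
        return None


ok_blob = serialize_guards(build_guards({"x.shape": (2, 3)}))
bad_blob = serialize_guards(build_guards({"fn": lambda: 0}))  # lambdas are not picklable
```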

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160530
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-08-17 17:54:55 +00:00
cyy
960c03daf6 Remove unused CONDA_CMAKE option (#160832)
Remove CONDA_CMAKE from `.ci/docker/build.sh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160832
Approved by: https://github.com/malfet
2025-08-17 17:08:42 +00:00
04c7be903d Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)"
This reverts commit 8f434545c2e48c858d8b0d06db8f9642d6a87ad0.

Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))
2025-08-17 14:22:48 +00:00
691d17a5c6 Update TensorPipe submodule (#160808)
To a commit containing https://github.com/pytorch/tensorpipe/pull/464, which fixes compilation with CUDA-13

Fixes https://github.com/pytorch/pytorch/issues/160104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160808
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007, https://github.com/malfet
2025-08-17 14:11:41 +00:00
c699668009 [inductor] TLParse tensor metadata logging + test (#160132)
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
2025-08-17 04:27:49 +00:00
0b56f3aed8 [vllm hash update] update the pinned vllm hash (#160831)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160831
Approved by: https://github.com/pytorchbot
2025-08-17 04:25:26 +00:00
8f434545c2 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but someone working with an alternative version of Triton may not match it. This moves the version check from requiring Triton 3.4 to accepting any variant that has support for the TMA APIs.
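
The difference between a strict version check and a capability check can be sketched in plain Python. Everything here is hypothetical: the `SimpleNamespace` objects stand in for different Triton builds, and `tma_api` is a placeholder attribute name, not a real Triton symbol or the check Inductor performs.

```python
from types import SimpleNamespace


def supports_tma_by_version(triton_like) -> bool:
    """Strict check (sketch): only a 3.4.x release qualifies."""
    return triton_like.__version__.startswith("3.4")


def supports_tma_by_feature(triton_like, api_name: str = "tma_api") -> bool:
    """Capability check (sketch): any build that exposes the API qualifies."""
    return hasattr(triton_like.language, api_name)


# Fake builds for illustration only.
release = SimpleNamespace(__version__="3.4.0", language=SimpleNamespace(tma_api=object()))
fork = SimpleNamespace(__version__="3.3.1+custom", language=SimpleNamespace(tma_api=object()))
old = SimpleNamespace(__version__="3.2.0", language=SimpleNamespace())
```

With these stand-ins, the fork is rejected by the version check but accepted by the capability check, while the old build is rejected by both.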

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-17 00:35:12 +00:00
26297c27e2 Revert "[inductor] TLParse tensor metadata logging + test (#160132)"
This reverts commit 2603e40be5fa4a66301e6654e34a82a67f2e4913.

Reverted https://github.com/pytorch/pytorch/pull/160132 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/17010600949/job/48226137423) [HUD commit link](2603e40be5).  landrace with another PR that changed some had_cuda related things ([comment](https://github.com/pytorch/pytorch/pull/160132#issuecomment-3193969792))
2025-08-16 23:47:03 +00:00
74871d4d46 [collections.abc] Ensure that binop calls works with UserDefinedObjects (#159865)
Changes:
(1) Replace UserDefinedSetVariable by UserDefinedObjectVariable in all binop calls

Test plan:
(1) The three tests from CPython `test_collections.py` ensure that Dynamo can trace through a dunder method (e.g. `__add__`, `__ixor__`, etc.) defined in a user-defined class
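
To make the test plan concrete, here is the kind of user-defined class involved, written as plain Python without `torch.compile`. It is an illustration modeled on the `collections.abc` documentation, not one of the CPython tests; `ListBasedSet` is an assumed name. The binary operators are inherited Python-level dunder methods from the `Set` mixin, which is what a tracer has to step through.

```python
from collections.abc import Set


class ListBasedSet(Set):
    """User-defined set; the Set mixin supplies __and__, __or__, __sub__, __xor__."""

    def __init__(self, iterable=()):
        self.elements = []
        for value in iterable:
            if value not in self.elements:
                self.elements.append(value)

    def __iter__(self):
        return iter(self.elements)

    def __contains__(self, value):
        return value in self.elements

    def __len__(self):
        return len(self.elements)


a = ListBasedSet("abcd")
b = ListBasedSet("cdef")
both = a & b    # dispatches to Set.__and__
either = a | b  # dispatches to Set.__or__
```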

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159865
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864
2025-08-16 20:44:40 +00:00
f019da2979 Implement list(UserDefinedObject) via force_unpack_var_sequence (#159864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159864
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902
2025-08-16 20:44:40 +00:00
f1bc843a5d Wrap class definitions in set_fullgraph(False) in test_collections (#159902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159902
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483
2025-08-16 20:42:15 +00:00
2603e40be5 [inductor] TLParse tensor metadata logging + test (#160132)
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
ghstack dependencies: #160260
2025-08-16 16:37:18 +00:00
8fe4b3f848 [BE][CI] move MYPYSTRICT linter from lintrunner-noclang to lintrunner-mypy (#160806)
Like `MYPY`, linter `MYPYSTRICT` will need `--all-files` too.

See also:

- https://github.com/pytorch/pytorch/pull/160652#issuecomment-3193390813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160806
Approved by: https://github.com/seemethere
2025-08-16 16:15:22 +00:00
cff6def7f4 [MTIA] add correct name for CFF in tlparse (#160599)
Differential Revision: D80201622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160599
Approved by: https://github.com/bdhirsh
2025-08-16 14:58:03 +00:00
e444cd24d4 Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)
This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:

In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

When people call is_contiguous, we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
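
The relationship between the two entry points can be sketched in plain Python. `SymBoolSketch` and `DataDependentError` are hypothetical stand-ins, not the real `c10::SymBool` or PyTorch's error types; the sketch assumes a symbolic condition is either decidable (True or False) or undecidable (None) at trace time.

```python
from typing import Optional


class DataDependentError(RuntimeError):
    """Stand-in for the DDE raised when a symbolic condition cannot be decided."""


class SymBoolSketch:
    """Hypothetical symbolic bool: True, False, or None when undecidable."""

    def __init__(self, value: Optional[bool]):
        self.value = value

    def guard_bool(self) -> bool:
        # Strict: an undecidable condition is an error.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity")
        return self.value

    def guard_or_false(self) -> bool:
        # Lenient: an undecidable condition is treated as False.
        return False if self.value is None else self.value


def is_contiguous(sym: SymBoolSketch) -> bool:
    return sym.guard_bool()


def is_contiguous_or_false(sym: SymBoolSketch) -> bool:
    return sym.guard_or_false()


unknown = SymBoolSketch(None)
safe_answer = is_contiguous_or_false(unknown)  # falls back to False, no error
```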

One issue that was not handled well was this path:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format).

This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

So I had to define it for pyinterpreter, and then override it for nested tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
2025-08-16 09:15:58 +00:00
a84541c73f Update transformers version automatically with Dependabot (#160635)
My proposal here is to use GitHub Dependabot to make sure that the `transformers` version used in CI is always up-to-date. To achieve this, this PR does 2 things:

1. Pin the `transformers` version across all CI jobs in only one place, `.ci/docker/ci_commit_pins/huggingface.txt`. This file is now a regular pip requirements file instead of a pinned commit text. There isn't any need to pin `transformers` to a specific commit, and the file already refers to a stable version `v4.54.0`
2. Create `.github/dependabot.yml` to configure the bot to update `transformers` automatically when there is a new version. Those labels will ensure that the right reviewers from torch.compile and Dev Infra are notified. I'm not sure how to test this out in a PR, but it feels ok to land and test this in main. If this works, we should see a PR to update `v4.54.0` to the current latest `v4.55.0`

### Reference
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160635
Approved by: https://github.com/ZainRizvi
2025-08-16 05:53:39 +00:00
114813ca77 Fix mypy errors: PyTreeSpec inheritance (#160652)
Fixes #160650.

I added a type ignore comment to the `LeafSpec` class inheritance in `torch/utils/_cxx_pytree.py` to handle `PyTreeSpec` being marked as final in optree's type stubs.
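
The pattern looks roughly like the following sketch. The class names are hypothetical stand-ins for `PyTreeSpec` and `LeafSpec`, and the `[misc]` error code is an assumption about which mypy diagnostic fires; the actual comment in `torch/utils/_cxx_pytree.py` may differ.

```python
from typing import final


@final
class PyTreeSpecSketch:
    """Stand-in for a class that its type stubs mark as final."""

    num_leaves = 1


# Type checkers reject subclassing a @final class; the ignore comment silences
# that one diagnostic. At runtime, @final does not block inheritance.
class LeafSpecSketch(PyTreeSpecSketch):  # type: ignore[misc]
    pass


leaf = LeafSpecSketch()
```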

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160652
Approved by: https://github.com/Skylion007
2025-08-16 05:14:11 +00:00
11b6ceb7b4 [ONNX] Default to dynamo export (#159646)
Set dynamo=True and enable fallback.

1. Implemented the compatible behavior where a BytesIO object is accepted as `f`
2. Updated tests to explicitly set dynamo=False
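
Accepting either a path or an in-memory buffer for `f` can be sketched as below. `save_model_bytes` is a hypothetical helper, not the ONNX exporter's code; it only shows the dispatch on the type of `f`.

```python
import io
import os
from typing import BinaryIO, Union


def save_model_bytes(data: bytes, f: Union[str, os.PathLike, BinaryIO]) -> None:
    """Hypothetical saver: write to a path, or to an already-open binary buffer."""
    if isinstance(f, (str, os.PathLike)):
        with open(f, "wb") as fh:
            fh.write(data)
    else:
        # File-like objects such as io.BytesIO are written to directly.
        f.write(data)


buffer = io.BytesIO()
save_model_bytes(b"onnx-bytes", buffer)
```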

#151693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646
Approved by: https://github.com/titaiwangms
2025-08-16 04:48:58 +00:00
fb7e60ba7a [Dynamo][Hierarchical Compile] Flatten tuple outputs in graph dedupe pass (#158811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158811
Approved by: https://github.com/anijain2305
ghstack dependencies: #158810
2025-08-16 04:45:31 +00:00
f89186e910 [audio hash update] update the pinned audio hash (#160797)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160797
Approved by: https://github.com/pytorchbot
2025-08-16 04:26:59 +00:00
10eb83734f [vllm hash update] update the pinned vllm hash (#160699)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160699
Approved by: https://github.com/pytorchbot
2025-08-16 04:26:55 +00:00
75ea93484c [vllm test] add vllm.yml and additional package (#160698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160698
Approved by: https://github.com/huydhn
ghstack dependencies: #160116
2025-08-16 04:24:20 +00:00
45c2c7a5fc Fix the wrong dataclasses_json mointoring dep MacOS test (#160796)
Typo. This should be `dataclasses_json`: https://github.com/pytorch/pytorch/actions/runs/17000197828/job/48200676725#step:10:23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160796
Approved by: https://github.com/yangw-dev
2025-08-16 04:00:31 +00:00
b74c7cd335 Add kernel stack traces tlparse dump (#160608) (#160779)
Summary:

as title

This is requested by the zoomer team so they can add stack trace information to profiler result.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  stack_traces
```

Rollback Plan:

Differential Revision: D80050233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160779
Approved by: https://github.com/angelayi
2025-08-16 03:12:38 +00:00
b7ca502f29 [ROCm][Windows] Add hipcc compatibility flags to cpp_extension.py. (#159790)
This is a similar change to https://github.com/pytorch/pytorch/pull/153986, this time adding flags to the hipcc command under `cpp_extension.py`.

The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these:
```
In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21:
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes]
   18 | struct TORCH_API _sparse_softmax_int {
      |        ^~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API'
  100 | #define TORCH_API C10_IMPORT
      |                   ^~~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT'
   53 | #define C10_IMPORT __declspec(dllimport)
      |                               ^~~~~~~~~
```

The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

See also this downstream issue where these changes were tested: https://github.com/ROCm/TheRock/issues/910.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159790
Approved by: https://github.com/jeffdaily
2025-08-16 02:20:49 +00:00
7bd4cfaef4 [BE] Update nvshem dependency to 3.3.20 (#160458)
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
2025-08-16 02:00:57 +00:00
c015e53d37 Revert "[BE] Update nvshem dependency to 3.3.20 (#160458)"
This reverts commit e0488d9f00865fb56c931580c80e099771c6285e.

Reverted https://github.com/pytorch/pytorch/pull/160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](https://github.com/pytorch/pytorch/pull/160458#issuecomment-3193133706))
2025-08-16 01:47:42 +00:00
65dc4df74d unify broadcast_shapes functions and avoid duplicates (#160251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160251
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
ghstack dependencies: #160250
2025-08-16 00:54:32 +00:00
c03809e8a5 guard_or_false cat ops (#160250)
keep existing unbacked semantics unchanged, just use guard_or_false instead of guard_size_obl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160250
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
2025-08-16 00:54:31 +00:00
e0488d9f00 [BE] Update nvshem dependency to 3.3.20 (#160458)
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
2025-08-16 00:50:13 +00:00
f782c790df migrate more simple gso checks (#160253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160253
Approved by: https://github.com/bobrenjc93
2025-08-16 00:15:24 +00:00
16ce2c15fa Add python 3.14 support to linux aarch64 builds (#160788)
Related to https://github.com/pytorch/pytorch/issues/156856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160788
Approved by: https://github.com/malfet
2025-08-16 00:03:21 +00:00
0d28d12b11 Fix typo packing libnvshmem into libtorch (#160778)
Fix typo after https://github.com/pytorch/pytorch/pull/160465
Fixes: https://github.com/pytorch/pytorch/issues/160762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160778
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/Skylion007
2025-08-15 23:43:02 +00:00
838f22c57d Do not incorrectly chain each of the strings as iterables (#160709)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160709
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-08-15 23:22:24 +00:00
eqy
387fe847ab [cuDNN][SDPA] Introduce TORCH_CUDNN_SDPA_AVOID_RECOMPILE=1 (#155958)
Opt-in for now, but basically uses the variable-sequence length/ragged path for the common case of BSHD layout to avoid recompiling for different sequence lengths.
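
A minimal sketch of opting in, assuming the flag is read from the process environment. Setting it before `torch` is imported is an assumption made here for safety; the commit message does not say when the variable is read.

```python
import os

# Opt in to the variable-sequence-length (ragged) cuDNN SDPA path.
# Assumption: the flag is read from the environment, so it is set
# before torch would be imported in a real program.
os.environ["TORCH_CUDNN_SDPA_AVOID_RECOMPILE"] = "1"

flag = os.environ.get("TORCH_CUDNN_SDPA_AVOID_RECOMPILE", "0")
print(f"TORCH_CUDNN_SDPA_AVOID_RECOMPILE={flag}")
```

The same effect can be had from the shell by exporting the variable before launching the process.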

Built on top of #149282

Tested using a primitive fuzzer, seems at least as stable as default path (with recompilation) on B200 (50000+ cases tested without any failures)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155958
Approved by: https://github.com/drisspg
2025-08-15 21:59:18 +00:00
40311e2ec1 [AOTInductor] ABI-Compatibility for RecordFunction. (#159842)
Summary:
Previously, our implementation of RecordFunction injected ATen into
codegen, which breaks the ABI contract for AOTInductor.

C10::IValue is added to call the full record function. Support for
more profiling info will come in later PRs.

Test Plan:
Included in commit.

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
2025-08-15 21:45:47 +00:00
8ca8b6053c [inductor][while_loop][be] improve the readability of output handling (#160374)
The logic doesn't change; this just makes the code easier to read and modify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160374
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-08-15 20:13:12 +00:00
ff86509a06 [map] filter none gradients and add autograd inductor tests (#160548)
Will filter the none outputs in autograd backward for other hops as follow ups

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160548
Approved by: https://github.com/zou3519
2025-08-15 20:13:12 +00:00
fa75ba9303 Change IR node's stack traces to return a set of stack traces only (#160701)
Summary: TORCH_LOGS="+inductor" can produce excessive stack trace output when a single line of code corresponds to many post-grad nodes, e.g. `self.multihead_attn(x, x, x)`. In that case the same stack trace appears many times in the IR node, spamming the output log. So we change it to return a set of stack traces.
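
The deduplication idea can be sketched in plain Python. This is not the Inductor code: the helper name and the sample traces are hypothetical, and an order-preserving `dict.fromkeys` is used here instead of a literal `set` so the output stays deterministic.

```python
def unique_stack_traces(traces):
    """Return each distinct stack trace once, in first-seen order."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(traces))


# Hypothetical traces: two post-grad nodes come from the same source line.
traces = [
    'File "model.py", line 10, in forward\n    self.multihead_attn(x, x, x)',
    'File "model.py", line 10, in forward\n    self.multihead_attn(x, x, x)',
    'File "model.py", line 11, in forward\n    return out + x',
]

unique = unique_stack_traces(traces)
print(len(traces), "traces collapsed to", len(unique))
```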

Test Plan:
CI

Rollback Plan:

Differential Revision: D80310549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160701
Approved by: https://github.com/angelayi
2025-08-15 19:31:59 +00:00
b78968b4d1 Support next(iterator, default) (#159483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159483
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368
2025-08-15 19:08:21 +00:00
e5621b4d8b Fixes for collections.Counter (#159368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159368
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366
2025-08-15 19:08:21 +00:00
2542e71f3f Change mutation type of MutableMappingVariable to AttributeMutationNew (#159366)
Also add MutableMappingVariable to `call_or_` / `call_ior`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159366
Approved by: https://github.com/zou3519
ghstack dependencies: #159365
2025-08-15 19:08:21 +00:00
0242d40fa5 Enable trace through the collections module (#159365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159365
Approved by: https://github.com/zou3519
2025-08-15 19:08:21 +00:00
17de899709 Add py3.14 to macos arm64 (#160593)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160593
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-08-15 18:52:10 +00:00
25d0d8b0a3 [inductor] Fix propagating torch.utils._sympy.functions.Identity in IndexPropagation (#155504)
Fixes https://github.com/pytorch/pytorch/issues/160535

The index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr containing Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap Identity.

Test Plan:
buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing

Rollback Plan:

Differential Revision: D76308640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504
Approved by: https://github.com/eellison
2025-08-15 18:38:23 +00:00
c6d697ff52 port 2 distributed pipeline test files for Intel GPU (#159140)
This is another PR porting distributed pipeline tests to Intel GPU; the other PR is https://github.com/pytorch/pytorch/pull/159033.
In this PR, we port two test files to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

1. instantiate_device_type_tests()
2. skip the case on XPU due to an accuracy gap introduced by oneDNN non-determinism

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159140
Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/H-Huang
2025-08-15 18:29:50 +00:00
30d2f98daa Revert "[cutlass backend] re-add pip cutlass path (#160180)"
This reverts commit d556586448f3caab85673c7da0978fe31c7748f7.

Reverted https://github.com/pytorch/pytorch/pull/160180 on behalf of https://github.com/atalman due to broke macos nightly ([comment](https://github.com/pytorch/pytorch/pull/160180#issuecomment-3192311552))
2025-08-15 18:00:41 +00:00
8780d28c65 raise exception in case of errors in memory reordering (#160455)
This PR introduces two checks in the memory reordering pass to catch graph issues before performing the reordering. In situations not covered by these checks, the reordering pass may still fail, in which case an exception is thrown.

This addresses issue -- https://github.com/pytorch/pytorch/issues/159568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160455
Approved by: https://github.com/eellison
2025-08-15 17:31:55 +00:00
da8f48d88f [associative_scan] support gen_schema for associative_scan (#158883)
In-place mutation may create inter-loop dependencies that break the parallelism we have for associative_scan, so we ban input mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158883
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965, #158863, #158864
2025-08-15 17:28:44 +00:00
cb9e2092a8 [scan] support gen_schema for scan (#158864)
We don't want to allow scan's combine_fn to mutate its inputs. The semantics of such a mutation can be confusing. For example:
```python
def combine_fn(init, x):
    ...
```
If combine_fn mutates init, only the first iteration mutates init; the remaining iterations mutate the previous carry, which is an intermediate result. These semantics are odd because the only observable mutation is to init, which can be done outside of combine_fn.

If combine_fn mutates x, where x is a slice of the scanned inputs (i.e. xs), the pattern is more meaningful, but we have not seen a use case yet.
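
The first case can be illustrated with a pure-Python stand-in for scan. This is a sketch under stated assumptions, not the `torch` higher-order op: the `scan` helper and `mutating_combine_fn` below are hypothetical, and Python lists stand in for tensors.

```python
def scan(combine_fn, init, xs):
    """Pure-Python reference: thread a carry through xs, collecting outputs."""
    carry, ys = init, []
    for x in xs:
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, ys


def mutating_combine_fn(carry, x):
    carry.append(x)          # in-place mutation of whatever the current carry is
    new_carry = list(carry)  # a fresh object becomes the next carry
    return new_carry, sum(new_carry)


init = []
final, ys = scan(mutating_combine_fn, init, [1, 2, 3])

# Only the first iteration touched `init`; later iterations mutated
# intermediate carries that the caller never sees.
print(init, final, ys)
```

After the call, `init` holds only the first element while `final` holds all three, which is the asymmetry the message describes.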
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158864
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965, #158863
2025-08-15 17:28:44 +00:00
f6bf1573fc [while_loop] support gen_schema for while_loop (#158863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158863
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965
2025-08-15 17:28:34 +00:00
82a18423be [BE] create an empty shape_env for check_input_alias_and_mutation_return_outputs (#158965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158965
Approved by: https://github.com/zou3519
ghstack dependencies: #154193
2025-08-15 17:28:20 +00:00
3fe3c23d4e [cond] support gen_schema for cond (#154193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154193
Approved by: https://github.com/zou3519
2025-08-15 17:28:13 +00:00
052c441cf4 Add logging for when inbuilt_inline_nn_modules will help with ID_MATCH guard triggered recompiles (#160592)
We add logging for when an ID_MATCH guard is added at a place where inbuilt_inline_nn_modules would inline it. The aim is to tag recompiles that could be avoided by setting the inbuilt_inline_nn_modules flag.
This will help us log and track the flag's adoption and potentially quantify the savings in the number of recompiles.

Differential Revision: D80075975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160592
Approved by: https://github.com/anijain2305
2025-08-15 17:09:39 +00:00
b26d2a9464 [ez] Make NUMA signpost parameters JSON serializable (#160710)
# Context
Broader context in #160163.

In order for the _utils_internal version of signpost_event to log properly, its parameters argument needs to be JSON serializable.

# This PR
Convert `NumaOptions` to serializable form before inputting to `signpost_event`.
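
A minimal sketch of that kind of conversion, assuming the options object is a dataclass with an enum field. The class, its field names, and the `to_json_safe` helper are hypothetical stand-ins, not the real `NumaOptions` definition or the code in this PR.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class AffinityMode(Enum):  # hypothetical enum
    NODE = "node"
    SOCKET = "socket"


@dataclass(frozen=True)
class NumaOptionsSketch:  # hypothetical stand-in for NumaOptions
    affinity_mode: AffinityMode
    fallback_on_failure: bool = False


def to_json_safe(options):
    """Flatten a dataclass into a dict that json.dumps accepts."""
    raw = asdict(options)
    # Plain Enum members are not JSON serializable; use their values instead.
    return {k: (v.value if isinstance(v, Enum) else v) for k, v in raw.items()}


params = to_json_safe(NumaOptionsSketch(AffinityMode.NODE))
encoded = json.dumps(params)
print(encoded)
```

Passing `asdict(options)` straight to `json.dumps` would raise `TypeError` here because of the enum member, which is why the explicit conversion step is needed.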

# Test Plan
## Automated
Added tests `$ pytest test/test_numa_binding.py`.

## Manual
See [D80317206](https://www.internalfb.com/diff/D80317206).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160710
Approved by: https://github.com/kiukchung
2025-08-15 16:52:43 +00:00
6382302990 [MPS] Add grid_sampler_3d for MPS (#160541)
This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation.

NOTE: "nearest" interpolation is not yet supported

Fixes #159882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541
Approved by: https://github.com/malfet
2025-08-15 16:19:25 +00:00
80dd05e31e Disable flaky cpp test RecordDebugHandles.Basic (#160577)
Test is flaky and sometimes hangs in CI

Here's an example of the failure:
https://github.com/pytorch/pytorch/actions/runs/16946153494/job/48027937663
```

2025-08-13T20:54:00.1223688Z ==================================== RERUNS ====================================
2025-08-13T20:54:00.1224156Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1224682Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1225568Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1226430Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1226988Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1227450Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1227792Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1228145Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1228492Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1228822Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1229204Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1229501Z
2025-08-13T20:54:00.1229666Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1230033Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1230355Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1230727Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1231154Z   what():  Invalid argument
2025-08-13T20:54:00.1231416Z unknown file:0: C++ failure
2025-08-13T20:54:00.1231788Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1232262Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1232745Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1233199Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1233557Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1233915Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1234247Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1234590Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1235020Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1235304Z
2025-08-13T20:54:00.1235431Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1235793Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1236126Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1236481Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1236906Z   what():  Invalid argument
2025-08-13T20:54:00.1237287Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1237800Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1238686Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1239551Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1240048Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1240495Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1240848Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1241199Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1241542Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1241871Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1242249Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1242503Z
2025-08-13T20:54:00.1242641Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1242993Z [==========] 1 test from 1 test suite ran. (19 ms total)
2025-08-13T20:54:00.1243329Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1243697Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1244113Z   what():  Invalid argument
2025-08-13T20:54:00.1244392Z unknown file:0: C++ failure
2025-08-13T20:54:00.1244759Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1245235Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1283768Z ============== 1 failed, 568 passed, 2 rerun in 115.57s (0:01:55) ==============
```

Here's an example of the hang:
https://github.com/pytorch/pytorch/actions/runs/16942186826/job/48015238944
Logs aren't super helpful other than stating that it took a long time.  Usually this file takes <2min to run
```
2025-08-13T18:43:24.6586481Z [gw0] [ 97%] PASSED [1.4119s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/8
2025-08-13T18:43:24.6587278Z [gw1] [ 97%] PASSED [1.4866s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/9 Command took >30min, returning 124
2025-08-13T18:43:24.6587288Z
2025-08-13T18:43:24.6587632Z FINISHED PRINTING LOG FILE of cpp/test_jit 1/1 (test/test-reports/cpp.test_jit_1.1_c259e5a152845991_.log)
2025-08-13T18:43:24.6587639Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160577
Approved by: https://github.com/huydhn
2025-08-15 15:59:21 +00:00
9df07ecfbe Revert "[inductor] dont reuse buffers if it affects peak (#145883) (#159530)"
This reverts commit 3be70dc30e893b552fc0f23ca06cd8f7949b6d08.

Reverted https://github.com/pytorch/pytorch/pull/159530 on behalf of https://github.com/clee2000 due to newly added test fail internally D80316528, probably just a targets change, but also imo the tests should probably go into a testcase class from common or inductor utils.  While I'm pretty sure CI can run the globally defined ones, theres some CI related functionality that on the testcase class that CI benefits from ([comment](https://github.com/pytorch/pytorch/pull/159530#issuecomment-3191947506))
2025-08-15 15:49:04 +00:00
846963fa9b Revert "[Inductor] addmm + activation function fusion (#158137)"
This reverts commit b9d7de3a094598c3dc0dd52e57bce30eb684c9d8.

Reverted https://github.com/pytorch/pytorch/pull/158137 on behalf of https://github.com/malfet due to Broke inductor torchbench, see 663da17b62/1 ([comment](https://github.com/pytorch/pytorch/pull/158137#issuecomment-3191841298))
2025-08-15 15:34:09 +00:00
663da17b62 Update torch-xpu-ops commit pin (#160062)
Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](77cc792cd2), includes:

- Ensures that the XPU cache is cleared before creating tensors during the test
- Add unused variable warning
- Fix test_linalg and test_torch issue with bf32_on_and_off updates
- Fix deterministic indexing with broadcast
- Fix dist.gather with noncontiguous tensor
- Improve accuracy of index put deterministic kernel
- Add generate file rely avoid build before generate
- optimize embedding bag

Fixes #160661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062
Approved by: https://github.com/EikanWang
2025-08-15 15:27:24 +00:00
e299926f72 [ONNX] Fix doc typo for symbolic_multi_out (#160702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160702
Approved by: https://github.com/justinchuby
2025-08-15 14:34:42 +00:00
bbd11c4f23 Uninstall torchao on MPS benchmark (#160724)
Fixes https://github.com/pytorch/pytorch/issues/160689

The current torchao 0.12.0 doesn't work with transformers 4.54.0 and ends up with this error:

```
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/models/albert/modeling_albert.py", line 37, in <module>
    from ...modeling_utils import PreTrainedModel
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/modeling_utils.py", line 51, in <module>
    from torchao.quantization import Int4WeightOnlyConfig
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/__init__.py", line 41, in <module>
    from torchao.quantization import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/__init__.py", line 6, in <module>
    from .autoquant import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/autoquant.py", line 11, in <module>
    from torchao.dtypes import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/__init__.py", line 1, in <module>
    from . import affine_quantized_tensor_ops
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/affine_quantized_tensor_ops.py", line 38, in <module>
    from torchao.dtypes.uintx.dyn_int8_act_int4_wei_cpu_layout import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/__init__.py", line 7, in <module>
    from .dyn_int8_act_int4_wei_cpu_layout import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/dyn_int8_act_int4_wei_cpu_layout.py", line 320, in <module>
    from ...prototype.inductor.fx_passes import register_da8w4_concat_linear_cpu_pass
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/__init__.py", line 2, in <module>
    from .int8_sdpa_fusion import _int8_sdpa_init
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/int8_sdpa_fusion.py", line 22, in <module>
    from ..int8_sdpa_lowering import register_int8_sdpa  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/int8_sdpa_lowering.py", line 6, in <module>
    from torch._inductor.kernel.flex_attention import construct_strides, maybe_realize
ModuleNotFoundError: No module named 'torch._inductor.kernel.flex_attention'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160724
Approved by: https://github.com/malfet
2025-08-15 13:55:39 +00:00
eaa5d9d3d3 Introduce OpInfo test for testing export on fake device (#160694)
Summary: Prepare for the upcoming diffs for exporting on fake cuda device.

Test Plan:
test

Rollback Plan:

Differential Revision: D80304225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160694
Approved by: https://github.com/dolpm
2025-08-15 07:26:28 +00:00
a7c75ae976 [dde] use sym_or when checking normalized shape in layer_norm (#160683)
Use `sym_eq` to check equality on tuple of ints/symints

### DDE
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, u1) (unhinted: Eq(u0, u1)).  (Size-like symbols: u1, u0)

Caused by: return torch.nn.functional.layer_norm(  # test/inductor/test_unbacked_symints.py:527 in fn (_refs/__init__.py:3292 in native_layer_norm)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160683
Approved by: https://github.com/bobrenjc93
2025-08-15 06:56:00 +00:00
f7ad69f59c [dynamic shapes] handle Max(*,1) for inductor layout contiguity (#160578)
Differential Revision: D80214882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160578
Approved by: https://github.com/ZixinYang, https://github.com/bobrenjc93
2025-08-15 06:10:18 +00:00
4cae9cf2df Update triton xpu commit to support python 3.14 (#160183)
Follow PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-15 05:41:17 +00:00
7710800865 [3/3][ghstack][vllm ci build setup]vllm build workflow (#160116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160116
Approved by: https://github.com/huydhn
2025-08-15 05:35:46 +00:00
aa99e0958f Separate provenance tracking to different levels (#160383)
Summary: As titled. Various parties have asked for provenance tracking to be turned on by default. In this PR, we prepare to turn on, by default, the part of provenance tracking that doesn't add much overhead.

- Change `provenance_tracking` config to `provenance_tracking_level`
- turn on the following provenance tracking by default when `basic_provenance_tracking`=True
    - `set_kernel_post_grad_provenance_tracing` for kernels; this adds a mapping between triton kernels and post_grad nodes
    - `dump_inductor_provenance_info` if we're dumping tlparse log
    - `get_graph_provenance_json` and dump `create_mapping_pre_post_grad_nodes`. This creates a mapping between pre_grad and post_grad nodes. Since we're not turning on provenance tracking in GraphTransformObserver by default, the mapping here may be incomplete/limited.
    - add stack trace from post grad nodes to inductor IR nodes
    - add exception swallowing for all functions above

Test Plan:
CI

Rollback Plan:

Differential Revision: D80031559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160383
Approved by: https://github.com/angelayi
2025-08-15 04:59:35 +00:00
3fc7a95176 [audio hash update] update the pinned audio hash (#160485)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160485
Approved by: https://github.com/pytorchbot
2025-08-15 04:27:49 +00:00
858fb80b9b [PT2]: Add Static Dispatch Kernel for wrapped_fbgemm_linear_fp16_weight (#160451)
Summary: Add static dispatch kernel for wrapped_fbgemm_linear_fp16_weight. This optimization should improve perf for all Ads DSNN models using Sigmoid.

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=37
OTHER_MODEL_ENTITY_ID=892669089
OTHER_SNAPSHOT_ID=36

MODULES=(mix prepare_float_features object user)
SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only)

for i in "${!MODULES[@]}"; do
MODULE=${MODULES[i]}
SUFFIX=${SUFFIXES[i]}
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true
```

Before: P1900475429
I0810 19:29:22.782902 2717337 load_net_predictor_lib.cpp:1807] Average latency A: 0.0843 ms
I0810 19:29:22.782905 2717337 load_net_predictor_lib.cpp:1807] Average latency B: 0.0989 ms

After: P1900825771
I0811 15:42:34.866408 2311279 load_net_predictor_lib.cpp:1807] Average latency A: 0.0854 ms
I0811 15:42:34.866411 2311279 load_net_predictor_lib.cpp:1807] Average latency B: 0.092 ms

Still has some regression but the gap is smaller...

Rollback Plan:

Reviewed By: henryoier, muchulee8

Differential Revision: D80042054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160451
Approved by: https://github.com/henryoier
2025-08-15 04:06:17 +00:00
55061c9602 [PT2]: Add Static Dispatch Kernel for scale_gradient (#160454)
Summary: Add Static Dispatch Kernel for scale_gradient

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=37
OTHER_MODEL_ENTITY_ID=892669089
OTHER_SNAPSHOT_ID=36

MODULES=(mix prepare_float_features object user)
SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only)

for i in "${!MODULES[@]}"; do
MODULE=${MODULES[i]}
SUFFIX=${SUFFIXES[i]}
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true
```

Rollback Plan:

Reviewed By: henryoier

Differential Revision: D80062244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160454
Approved by: https://github.com/henryoier
2025-08-15 03:42:39 +00:00
214d04833a [PT2]: Add Static Dispatch Kernel for fmod.Scalar (#160654)
Summary: Add static dispatch for torch.ops.aten.fmod.Scalar. Found this missing in user/object nets for DSNN models.

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=36
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkEnableProfiling=true --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=1000
```

Object tower: P1904347784
User tower: P1904348406

Rollback Plan:

Differential Revision: D80238495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160654
Approved by: https://github.com/henryoier
2025-08-15 03:11:48 +00:00
9c5601ecc3 [NVIDIA] Refactor Family Blackwell Support codegen (#156176)
With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1.
This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced.
Thor 10.1 --> 11.0
Spark 12.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156176
Approved by: https://github.com/ezyang
2025-08-15 02:51:26 +00:00
5b9ad951f8 [BE][Docker] Do not install cuda:11.8 (#160695)
As CUDA-11.8 binaries are no longer produced by CD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160695
Approved by: https://github.com/huydhn
2025-08-15 02:23:04 +00:00
4d5f92aa39 typing tvm.py (#160369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160369
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367, #160368
2025-08-15 02:09:31 +00:00
39ca0ce0c8 Type backend torchxla (#160368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160368
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367
2025-08-15 02:09:31 +00:00
d52bb67ac3 typing registry.py (#160367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160367
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366
2025-08-15 02:09:31 +00:00
05b9b63fb6 typing inductor and placeholder backends (#160366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160366
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365
2025-08-15 02:09:31 +00:00
453cfa5153 typing distributed.py (#160365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160365
Approved by: https://github.com/StrongerXi
ghstack dependencies: #160362, #160363, #160364
2025-08-15 02:09:31 +00:00
9faca5f260 typing debugging.py (#160364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160364
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363
2025-08-15 02:09:31 +00:00
6fe6dd9fdc Type cudagraphs.py (#160363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160363
Approved by: https://github.com/StrongerXi
ghstack dependencies: #160362
2025-08-15 02:09:31 +00:00
f82c7eed84 Typing for common.py (#160362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160362
Approved by: https://github.com/Skylion007
2025-08-15 02:09:31 +00:00
25ccc4716e [Inductor] [Triton] Apply feedback to Enable padded stride support (#160614)
Summary:
Issue I noticed while fixing tests for TMA store. This triton.language.make_tensor_descriptor call hardcodes the shape information as the stride, which is not necessarily correct.

In particular, it's legal to have a stride bigger than the shape (e.g. padded to a size). A good example of this would be to allocate a tensor so that it is always a multiple of 16 and just pad the result so TMA is legal.
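
As a rough illustration of why the stride cannot be derived from the shape, here is a pure-Python sketch (not Inductor or Triton code). The helpers `padded_row_stride` and `element_offset` are hypothetical, and the multiple of 16 is counted in elements purely for the example.

```python
def padded_row_stride(num_cols: int, multiple: int = 16) -> int:
    """Round the row length up to the next multiple (hypothetical helper)."""
    return -(-num_cols // multiple) * multiple


def element_offset(row: int, col: int, strides: tuple) -> int:
    """Linear offset of element (row, col) in a strided 2D buffer."""
    return row * strides[0] + col * strides[1]


shape = (4, 10)                              # logical tensor shape
strides = (padded_row_stride(shape[1]), 1)   # row stride is larger than the row length

# Deriving the strides from the shape alone assumes a contiguous layout.
contiguous_strides = (shape[1], 1)

# The same logical element lands at different offsets under the two layouts.
padded_offset = element_offset(1, 0, strides)
naive_offset = element_offset(1, 0, contiguous_strides)
```

A descriptor built from `contiguous_strides` would read the padding of row 0 as if it were the start of row 1, which is the kind of mismatch described above.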

This is a redo of https://github.com/pytorch/pytorch/pull/160493 because I accidentally broke this by trying to land internally first instead of merging through GitHub directly.

Test Plan:
Tested with `buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.nvcc_arch=h100 caffe2/test/inductor:max_autotune 2>&1 | tee ~/test_logs.log` and confirmed all max autotune tests passed.

Rollback Plan:

Differential Revision: D80224578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160614
Approved by: https://github.com/eellison
2025-08-15 02:06:14 +00:00
d387a48c38 [generator] Raise StopIteration(value) with value from the return stmt (#157152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157152
Approved by: https://github.com/zou3519
ghstack dependencies: #157148
2025-08-15 01:42:40 +00:00
831e85104a [contextlib] Fixes for CPython contextlib tests (#157148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157148
Approved by: https://github.com/zou3519
2025-08-15 01:42:40 +00:00
211c98859a [inductor][triton] Update triton_builtin handling after triton # 7239 (#160658)
https://github.com/triton-lang/triton/pull/7239 will search for a _semantic kwarg in the signature of the function before passing in this kwarg. To fix this in Inductor:

1. explicitly take a _semantic kwarg
2. remove the functools.wraps around the wrapper function, which was causing inspect.signature to return the signature of the wrapped function (instead of the signature of the wrapper, which does contain the _semantic arg)
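
The signature issue in point 2 can be reproduced with the standard library alone. This is a minimal sketch, not the Inductor wrapper: `builtin`, `with_wraps` and `without_wraps` are made-up names, and only the `_semantic` kwarg comes from the description above.

```python
import functools
import inspect


def builtin(x):
    return x


def with_wraps(fn):
    @functools.wraps(fn)  # sets wrapper.__wrapped__ = fn
    def wrapper(*args, _semantic=None, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def without_wraps(fn):
    def wrapper(*args, _semantic=None, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


# inspect.signature follows __wrapped__, so it reports the wrapped function's
# parameters and the wrapper's own _semantic kwarg is hidden.
wrapped_params = list(inspect.signature(with_wraps(builtin)).parameters)

# Without functools.wraps the wrapper's real signature is visible.
plain_params = list(inspect.signature(without_wraps(builtin)).parameters)
```

Any caller that inspects the signature to decide whether to pass `_semantic` would therefore skip it for the `functools.wraps` variant.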

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160658
Approved by: https://github.com/PaulZhang12, https://github.com/njriasan
2025-08-15 00:39:24 +00:00
dae7710bf2 [cuda][cupy] Improve cupy device placement when device is provided with explicit index (#158529)
resubmit https://github.com/pytorch/pytorch/pull/158320 , fixing a potential bug when device index is not specified explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158529
Approved by: https://github.com/ezyang
2025-08-15 00:27:42 +00:00
dc194a3096 Test multiprocessing spawn timing fix (#160672)
Submitting PR to fix #160511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160672
Approved by: https://github.com/mikaylagawarecki
2025-08-15 00:11:55 +00:00
4051b42c29 [ROCm] hipify needs specific header mappings (#160675)
Fixes #160579.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160675
Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-15 00:09:04 +00:00
eb0eaa67e1 [BE][ci] Increase frequency of cutlass backend ci (#160656)
* increase frequency from every 24 hours to every 12 hours
* automatically enable it if cutlass backend files are touched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160656
Approved by: https://github.com/eellison
2025-08-14 23:44:55 +00:00
98373e5ad2 [doc] AOTI debugging guide (#160430)
Folded from https://discuss.pytorch.org/t/a-beginners-guide-to-debugging-aot-inductor-cuda-illegal-memory-access/222188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160430
Approved by: https://github.com/angelayi
2025-08-14 23:42:17 +00:00
371eacb2ae [Dynamo][Hierarchical Compile] Refactor for tuple flattening (#158810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158810
Approved by: https://github.com/StrongerXi
2025-08-14 22:45:44 +00:00
3650989e6e Revert "[cutlass] fix dictionary iteration error (#160552)"
This reverts commit 29d20d49f0b7f4e362e1cefdcdc4b5659969312c.

Reverted https://github.com/pytorch/pytorch/pull/160552 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160552#issuecomment-3189940880))
2025-08-14 21:41:28 +00:00
3be70dc30e [inductor] dont reuse buffers if it affects peak (#145883) (#159530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530
Approved by: https://github.com/eellison
2025-08-14 21:14:36 +00:00
47a1db823d [triton_heuristics] Optimize the triton launcher in pt2 (#160000)
Summary:

(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)

We observed a ~10us PT2-Triton launch overhead regression after the Triton pin update.

Before Triton pin-update:
 {F1980557238}

After Triton pin-update:
 {F1980557240}

The root cause is that https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
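
To sketch the idea of moving the constexpr injection from call time to launcher-generation time, here is a toy version in plain Python. It is not the Inductor launcher: `make_launcher`, `fake_kernel` and the argument names are assumptions made up for the example.

```python
def make_launcher(arg_names, constexprs):
    """Generate a launcher whose source has the constexpr values baked in.

    Hypothetical helper: `arg_names` is the kernel's full parameter order and
    `constexprs` maps constexpr parameter names to their tuned values.
    """
    runtime_params = [n for n in arg_names if n not in constexprs]
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher(kernel, {', '.join(runtime_params)}):\n"
        f"    return kernel({', '.join(call_args)})\n"
    )
    scope = {}
    exec(src, scope)  # one-time cost, paid when the launcher is built
    return scope["launcher"]


def fake_kernel(in_ptr, out_ptr, xnumel, XBLOCK):
    # Stand-in for a compiled kernel; it just echoes what it received.
    return (in_ptr, out_ptr, xnumel, XBLOCK)


launcher = make_launcher(["in_ptr", "out_ptr", "xnumel", "XBLOCK"], {"XBLOCK": 128})

# The caller passes only runtime arguments; no per-call merging or reordering.
result = launcher(fake_kernel, "in", "out", 1024)
```

The string generation runs once per tuned configuration, so each kernel call only pays for a plain function call.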

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), so there is no need to pass in constexprs to the generated launcher code.

The new launcher code needs to work in three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.893x
```

```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```

After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.985x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel, https://github.com/mlazos

Co-authored-by: Xu Zhao <xzhao9@meta.com>
2025-08-14 21:04:08 +00:00
eac2d9d695 Revert "appending the pythonpath (#160219)"
This reverts commit 1d80d697a269234b47ec7ede192faf3bb9b159e3.

Reverted https://github.com/pytorch/pytorch/pull/160219 on behalf of https://github.com/clee2000 due to broke inductor? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16970222746/job/48108262003) [HUD commit link](1d80d697a2) ([comment](https://github.com/pytorch/pytorch/pull/160219#issuecomment-3189850381))
2025-08-14 20:58:14 +00:00
3fe19a7a0a [Test Fix] Delete dynamo skipfile for OpenMP test_one_thread (#160562)
Fixes #120648

During issue scrubbing I could not repro these failing tests, so reenabling them to close out the issue

### Test
Original repro command:
```
 PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_openmp.py -v -k test_one_thread
```

Now results in
```
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0 -- /home/lucaskabela/.conda/envs/pytorch-3.12/bin/python3.12
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/lucaskabela/pytorch
configfile: pytest.ini
plugins: hypothesis-6.138.0
collected 2 items / 1 deselected / 1 selected
Running 1 items in this shard

test/test_openmp.py::TestOpenMP_ParallelFor::test_one_thread PASSED [3.6874s]                                                       [100%]

===================================================== 1 passed, 1 deselected in 6.07s =====================================================
```

And:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_openmp.py TestOpenMP_ParallelFor.test_one_thread
```
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_sort_and_select.py TestSortAndSelectCPU.test_sort_overflow_cpu_int16
```

Both result in:
```
.
----------------------------------------------------------------------
Ran 1 test in 0.003s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160562
Approved by: https://github.com/zou3519
2025-08-14 20:55:59 +00:00
4a90dc0c1f Update checkpoint warning to target PyTorch 2.9 (#160643)
Fixes #160534

Updates the warning in torch.utils.checkpoint to state that starting in PyTorch 2.9, calling checkpoint without explicitly passing use_reentrant will raise an exception. Follows the guidance from the issue discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160643
Approved by: https://github.com/soulitzer
2025-08-14 20:53:17 +00:00
1fc683cf17 [Inductor] Allow indexing a flexible layout for extract_input_node_reduction_ranges (#160645)
Differential Revision: D79831747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160645
Approved by: https://github.com/eellison
2025-08-14 20:43:35 +00:00
b9d7de3a09 [Inductor] addmm + activation function fusion (#158137)
PR implements a pass in post_grad to fuse activation(add + mm)

This was previously done in a similar way in #106912 but was reverted for performance reasons. It was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets Inductor handle the fusion.

However, since then the cuBLAS team has made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.

Perf dashboard
<img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" />

ReLU works with both training and inference, but GELU only works in inference mode due to a fundamental limitation: GELU's derivative depends on the input, whereas ReLU's does not. I don't think this is fixable with the current addmm_activation API.
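
The ReLU/GELU difference can be checked numerically. This is a standalone sketch in plain Python, not code from this PR: it uses the tanh approximation with the constants that appear in the graphs in this description, and a finite-difference derivative as an assumption for simplicity.

```python
import math


def relu(x):
    return max(x, 0.0)


def relu_grad_from_output(out):
    # ReLU's derivative is recoverable from the output alone: 1 where out > 0.
    return 1.0 if out > 0 else 0.0


def gelu_tanh(x):
    inner = 0.7978845608028654 * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def numeric_grad(f, x, h=1e-6):
    # Central finite difference.
    return (f(x + h) - f(x - h)) / (2 * h)


# GELU is not monotonic for negative inputs, so two different inputs can share
# one output. Find the second input by bisection on [-0.5, 0].
x1 = -2.0
target = gelu_tanh(x1)
lo, hi = -0.5, 0.0
for _ in range(100):
    mid = (lo + hi) / 2
    if gelu_tanh(mid) < target:
        lo = mid
    else:
        hi = mid
x2 = (lo + hi) / 2

grad1 = numeric_grad(gelu_tanh, x1)  # slope at the first input
grad2 = numeric_grad(gelu_tanh, x2)  # slope at the second input, same output
```

Since `x1` and `x2` give the same output but slopes of opposite sign, a backward pass that only sees the fused output cannot reconstruct GELU's gradient, whereas ReLU's mask can be computed from the output.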

Graph module before and after this pass

Relu(addmm)
```
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (relu, primals_2, le, permute_1)
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (_addmm_activation_default, primals_2, le, permute_1)
```
Gelu (addmm)
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
    %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
    %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
    %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
    %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
    %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
    %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
    return (mul_5,)
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
    return (_addmm_activation_default,)
```

Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile ran with dynamic = False and max_autotune

H100
```
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0107 ms
Average Time per Iteration (torch compile):	 0.0296 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0262 ms
Average Time per Iteration (torch compile):	 0.0327 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.1763 ms
Average Time per Iteration (torch compile):	 0.2457 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 1.5280 ms
Average Time per Iteration (torch compile):	 1.9437 ms
```

A100
```
############################################################
Testing with dtype: float16
############################################################

============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.0313 ms
Average Time per Iteration (torch compile):	 0.0643 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.1149 ms
Average Time per Iteration (torch compile):	 0.1255 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.6297 ms
Average Time per Iteration (torch compile):	 0.7547 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas):	 4.3821 ms
Average Time per Iteration (torch compile):	 5.0740 ms
```

Script
```py
import torch
torch.manual_seed(0)

warmup, numrun = 10, 100

sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]

device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1]
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")

    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")

        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        C = torch.randn(M, device=device, dtype=dtype)

        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False,
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            }
        )

        for _ in range(warmup): func1()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        for _ in range(warmup): func2_compiled()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158137
Approved by: https://github.com/eellison
2025-08-14 20:41:38 +00:00
1028c5e2d5 [Dynamo] Add CPython default dict tests (#155263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155263
Approved by: https://github.com/zou3519
2025-08-14 20:22:22 +00:00
19b4283884 Typo correction in variable name uninitalized_val in resize() function (#160636)
Fixes #160633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160636
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2025-08-14 20:11:43 +00:00
8d6d324631 [Dynamo][Hierarchical-Compile] Don't allow node duplicates to be added (#160605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160605
Approved by: https://github.com/StrongerXi
2025-08-14 20:02:10 +00:00
fdfd69bb05 Set PYTHONHOME for inductor subprocesses using torch (#160008)
This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`.  If they're part of an application that bundles the Python runtime, then they should use the bundled runtime to keep their view of the world consistent.

There are more `sys.executable` subprocesses in torch/ but it seems like they're fine.

The previous PR was https://github.com/pytorch/pytorch/pull/159382, but it was reverted because it caused macOS jobs on GitHub to time out.  What was happening was that inductor subprocesses were scheduling C++ compilation tasks that failed to find the Python.h header.  This was because they were running in venvs and were now trying to find the CPython headers inside the venv, where the headers do not exist.  This PR gates the new behavior to internal builds only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160008
Approved by: https://github.com/aorenste
2025-08-14 19:57:14 +00:00
0d3461bac0 DOC: update CrossEntropyLoss with note and example of incorrect target specification (#155649)
Fixes #134771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155649
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-08-14 18:34:57 +00:00
65053c03a3 [FR] Don't check incomplete ranks for printing (#160195)
When just printing the ranks (`-j` option) we should skip the check for "incomplete ranks" since that doesn't affect the print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160195
Approved by: https://github.com/fduwjj
ghstack dependencies: #160097
2025-08-14 18:19:45 +00:00
96f9fbe21a Fix flight recorder for P2P ops (#160097)
Fixes errors in debugging a trace as mentioned in https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160097
Approved by: https://github.com/fduwjj
2025-08-14 18:19:45 +00:00
1c25871191 Allow torch.hub.load with unauthorized GITHUB_TOKEN (#159896)
Allow torch.hub.load with unauthorized GITHUB_TOKEN

`torch.hub.load` fails if a `GITHUB_TOKEN` with insufficient permissions is set, as can be seen in the following example. Make sure that the model has not been cached before, for example with `rm -r ~/.cache/torch`. If the model has been downloaded already, it will not be downloaded again and the authorization error will not occur.

```console
export GITHUB_TOKEN=""
python
>>> import torch
>>> torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 567, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 231, in _get_cache_or_reload
    _validate_not_a_forked_repo(repo_owner, repo_name, ref)
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 191, in _validate_not_a_forked_repo
    response = json.loads(_read_url(Request(url, headers=headers)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 174, in _read_url
    with urlopen(url) as r:
         ^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
```

The cause of the error is that the function `_validate_not_a_forked_repo` in `hub.py` always uses `GITHUB_TOKEN` for authorization, even when downloading does not require authorization.

0ba09a6d34/torch/hub.py (L194)

This fix simply retries the download without the token in case of a failure.
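
The retry pattern can be sketched as follows. This is not the code in `hub.py`: `fetch_with_token_fallback`, the injected `fetch` callable, the header values and the URL are all assumptions for illustration, so the example runs without network access.

```python
from urllib.error import HTTPError


def fetch_with_token_fallback(fetch, url, token):
    """Try an authenticated request first, then retry anonymously.

    Hypothetical helper: `fetch(url, headers)` stands in for whatever performs
    the HTTP request and raises HTTPError on a failure status.
    """
    headers = {"Accept": "application/json"}
    if token:
        try:
            return fetch(url, {**headers, "Authorization": f"token {token}"})
        except HTTPError:
            pass  # the token was rejected; fall through to the anonymous retry
    return fetch(url, headers)


def fake_fetch(url, headers):
    # Simulates a server that rejects any token but serves anonymous requests.
    if "Authorization" in headers:
        raise HTTPError(url, 401, "Unauthorized", None, None)
    return "ok"


result = fetch_with_token_fallback(fake_fetch, "https://example.invalid/api", "bad-token")
```

If the anonymous retry also fails, the second `HTTPError` propagates to the caller unchanged.
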
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159896
Approved by: https://github.com/albanD
2025-08-14 18:15:49 +00:00
6c05ea6475 [DTensor] add op support: aten.squeeze_.dim (#159532)
**Summary**
This PR enables in-place op `aten.squeeze_.dim` on DTensor with a change to
DTensor dispatch logic: when processing in-place operator, we should assign
`output_sharding.output_spec` back to the first argument. This is because
the in-place op_call on `arg._local_tensor` could also shift the tensor meta.

**Test**
`pytest test/distributed/tensor/test_view_ops.py -s -k  test_squeeze_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159532
Approved by: https://github.com/zpcore
2025-08-14 18:01:19 +00:00
5665dc9ab7 [PP] Allow larger world_size schedule tests (#160559)
Update schedule tests to use `world_size=4`, changes needed:
- Move some tests that require world_size=2 to new class
- Move helper methods from class level to function level
- Update some initialization so the asserts pass, since the gradients were very small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160559
Approved by: https://github.com/wconstab
ghstack dependencies: #159591, #160558
2025-08-14 17:41:58 +00:00
2ff7c1c774 [PP] Rename _load_actions and validate (#160558)
Rename method and add validation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160558
Approved by: https://github.com/wconstab
ghstack dependencies: #159591
2025-08-14 17:41:58 +00:00
3028fa6ce9 Wrap class definitions in set_fullgraph(False) in test_list/tuple (#160277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160277
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278, #160330, #160331
2025-08-14 17:29:45 +00:00
077cb38974 Add dtype checks in meta dispatch for various ordering ops (#159556)
This adds data type checks for the unsupported bool and complex types for argmax/argmin, topk, sort, minimum, and maximum, as listed here:

0a99b026d6/torch/testing/_internal/common_methods_invocations.py (L21076)

Currently the ops fail during the CPU or CUDA computation rather than at the meta dispatch stage, as happens, for example, with max: 0a99b026d6/aten/src/ATen/native/TensorCompare.cpp (L285) . This change catches the error early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159556
Approved by: https://github.com/janeyx99
2025-08-14 17:06:27 +00:00
cd8d8c18f5 [pytorch][dynamo_compile] Log graph_node_shape to dynamo_compile (#160556)
This PR adds dynamo graph node shape logging to dynamo compile. It also adds unit tests to check that the correct graph node shape is logged.

Test Plan:
$ python -m test_utils
Ran 12 tests in 36.447s
OK

Note: Will merge after D80185628 lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160556
Approved by: https://github.com/masnesral, https://github.com/jingsh
2025-08-14 16:42:35 +00:00
63654ba4c5 [BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)
Follow up to #159580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824
Approved by: https://github.com/williamwen42
2025-08-14 16:06:50 +00:00
7e27347fd3 [SymmMem] Check return of nvshmem_malloc (#160603)
`nvshmem_malloc` returns a null pointer when allocation fails. We should check here.
Otherwise, the nullptr can go down the road and into the device kernel, causing CUDA illegal memory access.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160603
Approved by: https://github.com/fduwjj, https://github.com/ngimel
2025-08-14 15:57:55 +00:00
1d80d697a2 appending the pythonpath (#160219)
Fixes #160193

`PYTHONPATH=/torchbench` to `PYTHONPATH=/torchbench:$PYTHONPATH` in [pytorch/.ci/pytorch/test.sh](b5fd7223b1/.ci/pytorch/test.sh (L1715))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160219
Approved by: https://github.com/malfet
2025-08-14 15:55:31 +00:00
b6b74aed60 [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm to avoid this limitation.
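
The index arithmetic behind splitting one grid axis into two can be sketched in plain Python. This is not the HIP kernel from this PR: `split_grid`, `linear_block_index` and the tiny limit of 4 are assumptions chosen so the mapping is easy to check by hand.

```python
def split_grid(num_blocks: int, max_dim: int) -> tuple:
    """Spread `num_blocks` over two grid axes so neither exceeds `max_dim`."""
    grid_x = max(1, min(num_blocks, max_dim))
    grid_y = -(-num_blocks // grid_x)  # ceiling division
    if grid_y > max_dim:
        raise ValueError("too many blocks for a 2D grid")
    return grid_x, grid_y


def linear_block_index(block_x: int, block_y: int, grid_x: int) -> int:
    """Recover the original 1D block index from the 2D coordinates."""
    return block_y * grid_x + block_x


num_blocks, max_dim = 10, 4
grid_x, grid_y = split_grid(num_blocks, max_dim)

# The 2D grid may overshoot, so blocks past num_blocks must exit early.
covered = []
for block_y in range(grid_y):
    for block_x in range(grid_x):
        idx = linear_block_index(block_x, block_y, grid_x)
        if idx < num_blocks:
            covered.append(idx)
```

Every original block index is visited exactly once, and the bounds check discards the extra blocks in the last row of the grid.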

Confirmed the new approach can handle large inputs. Correctness needs validation.

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jeffdaily
2025-08-14 15:09:16 +00:00
4a773e1e86 Warn when there is side effect in strict mode (#160060)
Differential Revision: [D79784354](https://our.internmc.facebook.com/intern/diff/D79784354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160060
Approved by: https://github.com/zhxchen17, https://github.com/StrongerXi
2025-08-14 14:59:44 +00:00
198b5fd2d4 [PP] Add DualPipeV schedule (#159591)
Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" />

This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs.

Tests:
```sh
python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules
python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules
```

Also tested in TorchTitan and is running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591
Approved by: https://github.com/wconstab
2025-08-14 14:58:35 +00:00
20bdabbb3c [Dynamo] Fix MTIA dynamo backend by avoiding has_triton() at import time (#160604)
# Summary
MTIA's torch.compile tests were broken by D80037015. (For details, see internal task T234563969.) The root cause was that `has_triton` can change state after we call `torch.mtia.init()`, but it was used in a way that fixes Inductor's behavior at import time. (Note that `has_triton` is cached, and there's no opportunity to call `torch.mtia.init()` prior to `import torch`.)

To fix this, we use `try: import triton` as opposed to `has_triton()` at the module level.
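
The failure mode can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the Inductor code: `has_backend` is a hypothetical cached check standing in for `has_triton`, and `_device_initialized` stands in for state that a late call such as `torch.mtia.init()` would change.

```python
import functools

_device_initialized = False  # stand-in for state changed by a late init call


@functools.cache
def has_backend() -> bool:
    """Hypothetical cached capability check (plays the role of has_triton)."""
    return _device_initialized


# Problem: the check runs at import time and its result is cached.
frozen = has_backend()

# Late initialization happens after import...
_device_initialized = True

# ...but the cached answer never changes.
still_frozen = has_backend()

# Alternative at module level: only ask whether the package is importable,
# which does not depend on device initialization.
try:
    import triton  # noqa: F401

    TRITON_IMPORTABLE = True
except ImportError:
    TRITON_IMPORTABLE = False
```

The `try: import` form answers a narrower question (is the package present) and leaves device-dependent decisions to call time.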

# Test Plan

See the internal diff. As a follow-up, we will add appropriate unit tests and/or CI hints so this type of issue can be caught at PR/diff time.

Differential Revision: D80228000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160604
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-08-14 14:54:49 +00:00
d556586448 [cutlass backend] re-add pip cutlass path (#160180)
Revert #156651 to allow using the cutlass PIP package which is easier for users than the Git checkout or similar method.

Also fix a bug where the PIP cutlass path wouldn't be available to subprocesses spawned during benchmarking for algorithm selection. Looks like the "spawn" method does not inherit the (potentially) already set up `config.cuda.cutlass_dir` so in the subprocess the include paths will still be set to `"../third_party/cutlass/"` leading to compilation failure due to missing headers.

Ensure `try_import_cutlass` is called at that point, which due to caching is a no-op in most cases, so doesn't hurt.
Change the logic to return `None` when cutlass isn't available, returning more useful values for include paths, namely an empty list. This is in line with other inductor code, which disables the CUTLASS backend when `try_import_cutlass` returns False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160180
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
2025-08-14 14:48:31 +00:00
781e9a7724 Fix meta for constant_pad_nd (#159878)
Fixes https://github.com/pytorch/pytorch/issues/144187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159878
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-14 14:47:47 +00:00
e4de93f6a3 Add sm50 and sm60 back to windows builds (#160586)
Addresses the issue reported in  https://github.com/pytorch/pytorch/issues/160575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160586
Approved by: https://github.com/malfet
2025-08-14 12:46:35 +00:00
a5652407e4 [CI] Fix triton xpu build on Windows (#160442)
Pin the ninja version to 1.11

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160442
Approved by: https://github.com/atalman
2025-08-14 12:43:49 +00:00
6f0f4e0c3e reduce threshold to suggest changes to expected results (#160463)
Since we increased the threshold to 10%, I would like suggestions to update the expected results to show up at ±2% instead of the current 3.3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160463
Approved by: https://github.com/jamesjwu
2025-08-14 09:11:27 +00:00
db763b1717 [Intel GPU] Support SDPA backend selection and priority setting on XPU (#159464)
Currently, SDPA on XPU uses its own `priority_order` instead of the one from the global context. Hence it does not support `with sdpa_kernel(order, set_priority=True)`.

This PR enables this feature. To make the default `priority_order` from the global context work for XPU, I also move the MATH backend to the lowest priority; otherwise `cudnn attention` and `overrideable attention` will never be selected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464
Approved by: https://github.com/guangyey, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
2025-08-14 08:55:31 +00:00
089c4a1ba0 Fix wrong log file name in the docs of torch.distributed.elastic.multiprocessing.start_processes() (#160396)
Fixes #160395

In https://docs.pytorch.org/docs/stable/elastic/multiprocessing.html#starting-multiple-workers and also in the code comment of the function[1], it was specified that:

```
    For each process, the ``log_dir`` will contain:

    #. ``{local_rank}/error.json``: if the process failed, a file with the error info
    #. ``{local_rank}/stdout.json``: if ``redirect & STDOUT == STDOUT``
    #. ``{local_rank}/stderr.json``: if ``redirect & STDERR == STDERR``
```

While in code[2], the files are `stdout.log` and `stderr.log`, instead of the `.json` ones listed in the doc.
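
The corrected layout can be sketched as follows. This is an illustrative helper, not part of `torch.distributed.elastic`: `Std` here is a local stand-in flag type and `possible_log_files` is a hypothetical name.

```python
import enum
from pathlib import Path


class Std(enum.IntFlag):
    """Local stand-in for a redirect bitmask (assumed shape, for illustration)."""

    NONE = 0
    OUT = 1
    ERR = 2
    ALL = OUT | ERR


def possible_log_files(log_dir: str, local_rank: int, redirect: Std) -> list[Path]:
    """List the files that may appear under log_dir for one process."""
    rank_dir = Path(log_dir) / str(local_rank)
    files = [rank_dir / "error.json"]  # only written if the process failed
    if redirect & Std.OUT == Std.OUT:
        files.append(rank_dir / "stdout.log")
    if redirect & Std.ERR == Std.ERR:
        files.append(rank_dir / "stderr.log")
    return files


names = [p.name for p in possible_log_files("/tmp/logs", 0, Std.ALL)]
```

In Python, `&` binds tighter than `==`, so `redirect & Std.OUT == Std.OUT` reads as `(redirect & Std.OUT) == Std.OUT`, matching the condition quoted in the docs.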

[1]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/__init__.py#L144-L145
[2]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L354-L357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160396
Approved by: https://github.com/fduwjj
2025-08-14 08:24:07 +00:00
97c8c98f8d measure dispatch overhead (#160504)
Reopen https://github.com/pytorch/pytorch/pull/159699 to merge to main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160504
Approved by: https://github.com/wconstab
2025-08-14 06:13:53 +00:00
39aa3d1471 Remove the dead code in setup.py (#160515)
The following line has no effect.

34ec5ed275/setup.py (L1205)

This code was originally introduced in this PR: dd7cec680c,
and clang11 and later now support `-fstack-clash-protection`. Can we remove this line?

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160515
Approved by: https://github.com/isuruf, https://github.com/albanD
2025-08-14 06:02:11 +00:00
639778b3ee [2/3 step][ vllm ci build setup] Add vllm build logic and dockerfile (#160089)
# set up vllm build logic
- dockerfile: please note that the dockerfile introduced here is only temporary; once we migrate this file to vllm, we will fetch it directly from there
- VllmBuildRunner:
   - implement logic to prepare and run vllm build with dockerfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160089
Approved by: https://github.com/huydhn
ghstack dependencies: #160043
2025-08-14 05:51:45 +00:00
00d7d6f123 [1/3][ghstack] [vllm ci build setup ]setup lumen_cli (#160043)
# Description
set up torch_cli using argparse

## Details:
- add vllm placeholder in the cli
- add unittest for cli command

See Readme.md for how to run the cli.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160043
Approved by: https://github.com/huydhn
2025-08-14 05:51:45 +00:00
c6d78d4dbd [ROCm] enable miopen channels last 3d for conv and batchnorm (#160529)
miopen batchnorm for channels last is guarded by env var PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM similar to existing PYTORCH_MIOPEN_SUGGEST_NHWC for conv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160529
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-14 05:30:19 +00:00
2898d3f965 [Lowering] Add assertion msg to sym_size and sym_stride (#160591)
Summary: Add assertion msg to sym_size and sym_stride lowering function.

Test Plan:
Will test in mast job.

Rollback Plan:

Differential Revision: D80187693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160591
Approved by: https://github.com/angelayi
2025-08-14 04:55:32 +00:00
34358f335d [vllm hash update] update the pinned vllm hash (#160594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160594
Approved by: https://github.com/pytorchbot
2025-08-14 04:21:28 +00:00
fe3f5fe4ea Optimize min, max gradient behavior description (#160312)
Fixes #160273

## Test Result
<img width="897" height="593" alt="image" src="https://github.com/user-attachments/assets/6ebcdb2c-8a2c-4f0d-8195-656089e88325" />
<img width="985" height="653" alt="image" src="https://github.com/user-attachments/assets/606a7264-e223-4d2b-8c3f-f153ce43b208" />
<img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/0ae2f56f-820f-4194-b15c-a02a078c0487" />
<img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/79c38a17-45ac-4808-829f-d538178de36b" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160312
Approved by: https://github.com/ngimel
2025-08-14 04:18:49 +00:00
45ba7ecda8 Flex Attention heuristics: a Blackwell config (#160192)
Fixes #160074 and more.

This is the working config for B200 and RTX 5080.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160192
Approved by: https://github.com/drisspg
2025-08-14 03:47:02 +00:00
194fcfcfbd Add support for param mutation under inference mode (#159661)
Summary:
In HF model rwkv, we have parameter mutation under inference mode which should be safe. This PR does multiple things to make sure it works:
1. We execute global autograd mutation while tracing so that we can actually trace through parameter inplace mutation
2. Add support for parameter mutation under inference mode in AOTAutograd
3. Add support for parameter mutation under inference mode in export.

Test Plan:
test

Rollback Plan:

Differential Revision: D79460136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159661
Approved by: https://github.com/ydwu4
2025-08-14 03:34:04 +00:00
29d20d49f0 [cutlass] fix dictionary iteration error (#160552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160552
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-08-14 03:23:46 +00:00
3faee0a631 Update nullcontext to return input args (#158776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158776
Approved by: https://github.com/zou3519
2025-08-14 03:02:44 +00:00
8cfaf51d4e Generalize support of background thread in pinned allocator (#160505)
# Motivation
https://github.com/pytorch/pytorch/pull/135524 only introduced background thread support for CUDA. This PR intends to support it for other backends, such as XPU, as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160505
Approved by: https://github.com/albanD
2025-08-14 02:22:39 +00:00
af3cabc55d Wrap class definitions in set_fullgraph(False) in test_sort (#160331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160331
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278, #160330
2025-08-14 02:12:20 +00:00
74bbe7b4a3 Wrap class definitions in set_fullgraph(False) in test_math/cmath (#160330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160330
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278
2025-08-14 02:12:20 +00:00
7bfc424a61 Wrap class definitions in set_fullgraph(False) in test_iter (#160278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160278
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276
2025-08-14 02:12:20 +00:00
5ace061254 finfo eps doc fix (#160502)
Existing documentation for torch.finfo().eps is as below:
| eps             | float  | The smallest representable number such that ``1.0 + eps != 1.0``.          |

Proposed documentation for torch.finfo().eps is as below:
| eps             | float  | The difference between 1.0 and the next smallest representable float larger than 1.0.	|

Fixes #160397
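
The difference between the two wordings can be checked with ordinary Python floats, which are double precision. This is a sketch using only the standard library, so it uses `sys.float_info.epsilon` rather than `torch.finfo` itself.

```python
import math
import sys

eps = sys.float_info.epsilon  # machine epsilon for double-precision floats

# Proposed wording: eps is the gap between 1.0 and the next float above it.
gap = math.nextafter(1.0, math.inf) - 1.0

# Existing wording is inaccurate: a value smaller than eps can still change
# 1.0, because the exact sum rounds up to the next representable float.
smaller = 0.75 * eps
changes_one = (1.0 + smaller) != 1.0
```

Here `gap` equals `eps`, while `changes_one` is true even though `smaller < eps`, so `eps` is not the smallest number with `1.0 + eps != 1.0`.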

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502
Approved by: https://github.com/ngimel
2025-08-14 01:49:35 +00:00
15e49f6164 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion. TL;DR: install 'Better Jinja' in VS Code and you get syntax highlighting for the templates.

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-14 01:07:53 +00:00
dd21c8a578 refresh expected results (#160537)
Regression introduced by https://github.com/pytorch/pytorch/pull/160314.
Not too worried about it, since it did not affect other inductor benchmarks; could not repro locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160537
Approved by: https://github.com/eellison
2025-08-14 00:56:14 +00:00
a06ec54d40 [MPS] Add API to query GPU core count (#160414)
Use good old IOKit to get the `gpu-core-count` property from the device implementing the `AGXAccelerator` service.
Expose it as `torch.backends.mps.get_core_count()` and make it accessible to the inductor via `MpsInterface`.

Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType|head -n10`
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:

    Apple M1 Pro:

      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
      Vendor: Apple (0x106b)
      Metal Support: Metal 3
```

This would significantly improve occupancy for torch.compile generated kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
2025-08-14 00:05:17 +00:00
50a8c11875 Add getCurrentDeviceIndex to torch::stable::accelerator (#160453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160453
Approved by: https://github.com/janeyx99
ghstack dependencies: #159679
2025-08-13 23:42:24 +00:00
e4e4dbd2f8 Add beginnings of torch::stable::accelerator (#159679)
Adds
- `torch::stable::accelerator::DeviceGuard`: `std::unique_ptr` to `DeviceGuardOpaque` mostly copied from the below (but made generic)

   50eac811a6/torch/csrc/inductor/aoti_runtime/utils_cuda.h (L30-L46)
    - constructor `DeviceGuard(DeviceIndex)` (**this matches aoti but differs from the actual c10 DeviceGuard constructor that takes in device**)
    - `set_index(DeviceIndex)`
- `torch::stable::accelerator::Stream`: `std::shared_ptr` to `StreamOpaque`
     - constructor `Stream(StreamHandle stream)` (similar to torch::stable::Tensor)
     - `id() -> StreamId`

- `getCurrentStream(DeviceIndex device_index) -> stable::accelerator::Stream`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159679
Approved by: https://github.com/guangyey, https://github.com/janeyx99
2025-08-13 23:42:24 +00:00
d670304001 [ATen][CUDA] Use new CCCL API in v2.8 (#160554)
Silences deprecation warnings like:
```
In file included from tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:1:
/tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c: At global scope:
/tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:243:219: warning: 'template<class ValueType, class OffsetT> class at_cuda_detail::cub::CountingInputIterator' is deprecated: Use thrust::counting_iterator instead [-Wdeprecated-declarations]
  243 | static void __device_stub__ZN2at6native43_GLOBAL__N__3cee4041_10_Nonzero_cu_cba1aaa011flag_kernelILi512ELi16EhEEvPKT1_PlPKllli( const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE *__par0,  int64_t *__par1,  const int64_t *__par2,  int64_t __par3,  int64_t __par4,  int __par5) {  __cudaLaunchPrologue(6); __cudaSetupArgSimple(__par0, 0UL); __cudaSetupArgSimple(__par1, 8UL); __cudaSetupArgSimple(__par2, 16UL); __cudaSetupArgSimple(__par3, 24UL); __cudaSetupArgSimple(__par4, 32UL); __cudaSetupArgSimple(__par5, 40UL); __cudaLaunch(((char *)((void ( *)(const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE *, int64_t *, const int64_t *, int64_t, int64_t, int))at::native::_NV_ANON_NAMESPACE::flag_kernel<(int)512, (int)16, unsigned char> ))); }namespace at{
      |                                                                                                                                                                                                                           ^~~~~~~~~~~~~~~~~~~~~
/usr/local/cuda-12.9/include/cub/iterator/counting_input_iterator.cuh:93:63: note: declared here
   93 | class CCCL_DEPRECATED_BECAUSE("Use thrust::counting_iterator instead") CountingInputIterator
      |                                                               ^~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160554
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-08-13 23:15:53 +00:00
c5efc5c8a6 Fix unit test test_equivalent_template_code (#160432)
Summary:
Fix unit test test_equivalent_template_code

https://github.com/pytorch/pytorch/pull/159920 treats  ReinterpretView as a not-realized node when searching FX origin nodes for fused triton kernel. In test_equivalent_template_code, there is a transpose node (which is a ReinterpretView) before matmul. It was not in FX graph segment before PR 159920. FX origin nodes are used to define the name of triton kernel. That is the reason test_equivalent_template_code failed with PR 159920 since it uses hard-coded triton kernel name to check the result. The fix is to update the triton kernel name in the unit test.

Test Plan:
buck2 run mode/opt caffe2/test/inductor:benchmark_fusion -- caffe2.test.inductor.test_benchmark_fusion.BenchmarkMultiTemplateFusionCudaTest

Rollback Plan:

Differential Revision: D80101711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160432
Approved by: https://github.com/clee2000
2025-08-13 23:14:51 +00:00
6da11d9aaf [C10D] Add check_rng_sync util (#160283)
Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected.

Notes:
- used allgather instead of gather since it's simpler to do this SPMD rather than add conditional behavior, though I could be convinced we only want to log on rank0.

Usage:
`check_rng_sync(generator, group)`

Prints something like this:

(cuda):
```
[rank0]:E0808 ] Generator desync detected:
[rank0]:E0808 ] Ranks    (Seed, Offset) values
[rank0]:E0808 ] -------  -----------------------
[rank0]:E0808 ] 0        (456, 0)
[rank0]:E0808 ] 1        (123, 4)
[rank0]:E0808 ] 2-3      (123, 0)
```

(cpu):
```
[rank2]:E0810 ] Generator desync detected:
[rank2]:E0810 ] Ranks      Generator State Hash values
[rank2]:E0810 ] -------  -----------------------------
[rank2]:E0810 ] 0                  7633364531954955665
[rank2]:E0810 ] 1                  8807615394212033278
[rank2]:E0810 ] 2-3               -6150027303226666531
```
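
The summary tables above group ranks that share a state and collapse consecutive ranks into ranges. A minimal sketch of that grouping step is shown below; it assumes the per-rank states have already been gathered into a list, and `format_ranks` and `summarize_desync` are hypothetical names, not the utility's actual internals.

```python
from collections import defaultdict


def format_ranks(ranks: list[int]) -> str:
    """Collapse rank ids into ranges, e.g. [0, 2, 3] -> "0, 2-3"."""
    ranks = sorted(ranks)
    parts = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank == prev + 1:
            prev = rank
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = rank
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)


def summarize_desync(states: list) -> list[tuple[str, object]]:
    """Return (ranks, state) rows when ranks disagree, or [] when in sync.

    states[rank] is the hashable state gathered from that rank.
    """
    groups = defaultdict(list)
    for rank, state in enumerate(states):
        groups[state].append(rank)
    if len(groups) <= 1:
        return []
    ordered = sorted(groups.items(), key=lambda item: item[1][0])
    return [(format_ranks(ranks), state) for state, ranks in ordered]


rows = summarize_desync([(456, 0), (123, 4), (123, 0), (123, 0)])
```

With the `(seed, offset)` pairs from the CUDA example, `rows` lists ranks `0`, `1`, and `2-3` in that order, mirroring the printed table.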

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283
Approved by: https://github.com/ezyang
2025-08-13 23:05:29 +00:00
182efe31db [inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462
Approved by: https://github.com/eellison
2025-08-13 22:54:18 +00:00
1ea688f9a2 [dynamo] fix EXTENDED_ARG starts_line dropping bug (#160478)
Fixes https://github.com/pytorch/pytorch/issues/160471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160478
Approved by: https://github.com/Lucaskabela, https://github.com/billmguo
2025-08-13 22:27:40 +00:00
53e3949495 [MTIA-T][CFF] Pass backend parameter into GPU vertical pass file and pattern matcher (#160404)
Summary:
As titled
Please see https://fb.workplace.com/groups/1075192433118967/posts/1735215827116621/?comment_id=1735220747116129&reply_comment_id=1735242997113904

Basically, for MTIA, we want mtia_afg to show up in the counters and backend, instead of Inductor. MTIA is not using inductor yet. Using env var TORCHINDUCTOR_PATTERN_MATCH_BACKEND to pass in the actual backend.

The env var default value is "inductor", so nothing should break for GPU.
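
Reading the variable with its default might look like the sketch below. Only the variable name and its default come from the summary above; the helper `pattern_match_backend` is a hypothetical name and not necessarily how the PR wires it in.

```python
import os


def pattern_match_backend() -> str:
    """Return the backend name used for counters, defaulting to "inductor"."""
    return os.environ.get("TORCHINDUCTOR_PATTERN_MATCH_BACKEND", "inductor")
```

Because the lookup falls back to `"inductor"` when the variable is unset, existing GPU runs see no change.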

Test Plan:
Default is always "inductor", so existing test should not break.

CI tests

Rollback Plan:

Differential Revision: D80069072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160404
Approved by: https://github.com/BoyuanFeng
2025-08-13 22:24:27 +00:00
33d9401866 Revert "[BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)"
This reverts commit 3ef2e1ef769582a82c6ddf150e9d11bf4bf1c44f.

Reverted https://github.com/pytorch/pytorch/pull/159824 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_trace_rules.py::TraceRuleTests::test_almost_impossible_missing_name [GH job link](https://github.com/pytorch/pytorch/actions/runs/16948305999/job/48035192324) [HUD commit link](3ef2e1ef76) ([comment](https://github.com/pytorch/pytorch/pull/159824#issuecomment-3186003531))
2025-08-13 22:17:29 +00:00
d1950d4bb5 Change IR node's stack trace to be computed lazily (#160487)
Summary: When an IR node is an inherited class, post_init is called once for each super().__init__() call. To avoid duplicated calls, we make stack trace computation happen lazily.
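
The general pattern can be sketched with `functools.cached_property`. This is an illustration, not the Inductor IR code: `Node`, `Child`, and `_post_init` are hypothetical, and the sketch captures the trace at first access, whereas the commit message does not say how the PR preserves the construction-time trace.

```python
import functools
import traceback


class Node:
    hook_calls = 0
    trace_computations = 0

    def __init__(self) -> None:
        self._post_init()

    def _post_init(self) -> None:
        # Cheap bookkeeping only; the expensive stack capture is not done here.
        Node.hook_calls += 1

    @functools.cached_property
    def stack_trace(self) -> list[str]:
        # Computed on first access, then cached on the instance.
        Node.trace_computations += 1
        return traceback.format_stack()


class Child(Node):
    def __init__(self) -> None:
        super().__init__()  # runs the hook once
        self._post_init()  # and again, mimicking one hook per __init__ in the chain


node = Child()
computed_before = Node.trace_computations
_ = node.stack_trace
_ = node.stack_trace
computed_after = Node.trace_computations
```

The hook runs twice for `Child`, but the trace is computed at most once, and not at all for nodes whose trace is never read.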

Test Plan:
CI

Rollback Plan:

Differential Revision: D80137870

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160487
Approved by: https://github.com/angelayi
2025-08-13 21:41:25 +00:00
1196bb1c2e Add utility to get computed kernel in torch.library (#158393)
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key
- Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered).
- `SafeKernelFunction` can be called via `callBoxed`, the validity of the token will be checked before this happens
- `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel`
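
The token-based validity check described in the list above can be modeled in a few lines. The classes below are hypothetical Python stand-ins for the C++ types, written only to show the invalidation flow; they are not the `torch.library` API.

```python
class KernelToken:
    """Shared flag that the owner flips when the kernel goes away."""

    def __init__(self) -> None:
        self.valid = True


class SafeKernel:
    """Handle that checks its token before every call."""

    def __init__(self, fn, token: KernelToken) -> None:
        self._fn = fn
        self._token = token

    def call_boxed(self, *args):
        if not self._token.valid:
            raise RuntimeError("kernel was deregistered")
        return self._fn(*args)


class Registration:
    """Owns the kernel; invalidates outstanding handles on deregistration."""

    def __init__(self, fn) -> None:
        self._fn = fn
        self._tokens: list[KernelToken] = []

    def get_safe_kernel(self) -> SafeKernel:
        token = KernelToken()
        self._tokens.append(token)
        return SafeKernel(self._fn, token)

    def deregister(self) -> None:
        for token in self._tokens:
            token.valid = False
        self._tokens.clear()


registration = Registration(lambda x: x + 1)
kernel = registration.get_safe_kernel()
result = kernel.call_boxed(1)
registration.deregister()
```

After `deregister()`, any further `kernel.call_boxed(...)` raises instead of calling into a kernel that no longer exists.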

Related to https://github.com/pytorch/pytorch/issues/155330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393
Approved by: https://github.com/albanD
2025-08-13 21:00:59 +00:00
e9eb2096a5 [cutlass backend] Allow bmm use cases when batch stride is 0 (#160356)
Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/)

The original change was meant to reduce the number of parameters we pass into the kernel, and was motivated by aesthetic reasons only.

But seeing the need to use different batch stride, we should just pass in the batch stride. That would be a good long term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160356
Approved by: https://github.com/mlazos
2025-08-13 20:52:24 +00:00
3ef2e1ef76 [BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)
Follow up to #159580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824
Approved by: https://github.com/williamwen42
2025-08-13 20:17:01 +00:00
4cde0acc0e Make triton build ROCm library version-agnostic (#158408)
Fixes maintenance of the triton packaging script when library versions change from one ROCm version to the next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158408
Approved by: https://github.com/jeffdaily

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
2025-08-13 19:49:23 +00:00
70ccdec44b [ROCm] Improve reduction sum performance (#160466)
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

**Reproducer:**
```
import time
import torch

shapes = [
    (5079670, 128)
]

dims = [
    (1)
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    for _ in range(10):
        w = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    print(w.size())

    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time)/50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X)**
Avg time for shape (5079670, 128): 1008.59 us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
2025-08-13 18:46:58 +00:00
db0b7f1cc9 [BE][CI] Adjust error_inputs for cat and complex (#160378)
MPS backend does not support double, so errors should be different
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160378
Approved by: https://github.com/dcci
2025-08-13 18:35:06 +00:00
1c26c53851 Fix the Doc of pivot in torch.lu (#159617)
Fixes #159616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159617
Approved by: https://github.com/lezcano, https://github.com/jansel
2025-08-13 18:30:54 +00:00
adcca7d9a1 Do not rpath CUDA stubs folder in JIT generated code (#160179)
`_transform_cuda_paths` intentionally includes the CUDA stubs folder.

However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with
> CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library"

This results in e.g. non-descriptive errors like
```
cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67  cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096
terminate called after throwing an instance of 'cutlass::cuda_exception'
  what():  std::exception
```
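
The intent, keeping the stubs folder for linking while leaving it out of the rpath, can be sketched as a simple filter. The helper name `rpath_dirs` and the example paths are hypothetical; the actual change to the JIT build code is not shown in this commit message.

```python
from pathlib import PurePosixPath


def rpath_dirs(library_dirs: list[str]) -> list[str]:
    """Return the directories suitable for rpath, dropping any 'stubs' folder."""
    return [d for d in library_dirs if PurePosixPath(d).name != "stubs"]


link_dirs = ["/usr/local/cuda/lib64", "/usr/local/cuda/lib64/stubs"]
runtime_dirs = rpath_dirs(link_dirs)
```

The linker still sees every entry in `link_dirs`, while only `runtime_dirs` would be embedded as rpath, so the stub driver library is never picked up at runtime.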

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160179
Approved by: https://github.com/jansel
2025-08-13 18:29:24 +00:00
01584d2a7d [ROCm] remove extra transposes in NHWC convolutions on MIOpen (#160435)
remove aten::contiguous for NHWC convolutions on ROCm

Tests:
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16

Before:
<img width="1255" height="228" alt="image"
src="https://github.com/user-attachments/assets/b125ccab-00c2-4d3a-a341-4583e51d8d57" />

After:
<img width="874" height="153" alt="image"
src="https://github.com/user-attachments/assets/ec200754-3622-488e-8762-bff1c2d22818" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160435
Approved by: https://github.com/jeffdaily
2025-08-13 17:58:22 +00:00
87e6c4079d Fix the Doc issue on the description of edge_order in torch.gradient() (#159130)
Fixes #159129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159130
Approved by: https://github.com/soulitzer
2025-08-13 16:48:47 +00:00
7d87e358ac Fix MPS conv3d autocast bias dtype mismatch (#160423)
## Summary
- register conv3d with MPS autocast to ensure bias dtypes match under AMP
- add regression test chaining two Conv3d layers on MPS autocast

Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160423
Approved by: https://github.com/dcci
2025-08-13 16:23:21 +00:00
6ee175195a [DCP][OSS] Rank local checkpointing in DCP without collectives (#147758)
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collective. The trade-off for now is that dedupe and re-sharding are not supported; support for these will be introduced soon.

Differential Revision: D70112642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
2025-08-13 16:20:28 +00:00
db32b60662 [ci] Add riscv opt-int build (#143979)
Hi, @malfet
Based on the previous discussion:

[RISCV CI support · Issue #141550 · pytorch/pytorch](https://github.com/pytorch/pytorch/issues/141550)

I have cross-compiled PyTorch for the RISC-V architecture on x86_64 Ubuntu 24.04 and created a new PR for it. Could you please help review it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143979
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-13 16:12:02 +00:00
56c828bef9 Followup of #160002, gracefully fail if Triton functions don't contain attributes (#160436)
Summary: Fixes internal test failures of D80037015

Test Plan:
CI

Rollback Plan:

Differential Revision: D80094187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160436
Approved by: https://github.com/clee2000
2025-08-13 16:04:56 +00:00
a2fd106d67 guard cuMulticastUnbind call (#160499)
Fixes builds for old compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160499
Approved by: https://github.com/Skylion007
2025-08-13 15:45:51 +00:00
c656334120 Revert "Factor out the strings to templates for better editor integration (#160357)"
This reverts commit cbffde774557752cf20447d42d99ec6102673c31.

Reverted https://github.com/pytorch/pytorch/pull/160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file  No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](https://github.com/pytorch/pytorch/pull/160357#issuecomment-3184435581))
2025-08-13 15:40:50 +00:00
31c9ac4319 [c10d] Fix test test_nccl_user_buffer_registration (#160497)
Fixed `test_nccl_user_buffer_registration`, which broke due to https://github.com/pytorch/pytorch/pull/160145; somehow CI didn't capture it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160497
Approved by: https://github.com/ngimel
2025-08-13 15:29:41 +00:00
deea71a90e [ez][CI] Set timeout for linux-jammy-py3_13-clang12-test from 600min -> default val of 240 (#160500)
10 hours is very long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160500
Approved by: https://github.com/huydhn
2025-08-13 15:14:24 +00:00
114a6c4043 Add placeholder for the User Guide (#159379)
- Add pytorch_overview.md
- Add pytorch_main_components.md
- Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials
- Move notes under user guide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159379
Approved by: https://github.com/albanD

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-13 14:56:04 +00:00
ee1b0412b9 [1/N]Port 3 distributed/_tools test cases to Intel GPU (#159543)
For [#114850](https://github.com/pytorch/pytorch/issues/114850), we will port distributed tests to Intel GPU.

We can enable Intel GPU with the following methods while trying our best to keep the original code style:

1. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
2. enable XPU for some test paths
3. skip some test cases which Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159543
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-13 12:49:01 +00:00
42e51cd4b3 Support ddp zero hook XCCL path (#159240)
The XCCL backend does not have the https://github.com/pytorch/pytorch/issues/62300 issue, so add an XCCL path here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159240
Approved by: https://github.com/guangyey, https://github.com/Skylion007, https://github.com/EikanWang
2025-08-13 12:37:33 +00:00
96bd33b2de Fix get_free_symbol_uses for several nodes (#160314)
get_free_symbol_uses is used to determine which unbacked symbols a given node uses.
Not having get_free_symbol_uses defined correctly leads to:

- elimination of some nodes because no users are detected (see the added unit test);
- an incorrect topological sort.

Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case:
when the layout is NonOwningLayout, we need to access the base layout of the actual view op and
detect the symbols used in it, because codegen for the ComputedBuffer uses those symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
2025-08-13 12:28:29 +00:00
ecde76c764 [Hierarchical Compile] Sort all regions identically (#158814)
Previously, we topologically sorted each region individually. This works well unless some nodes have no arguments, in which case their order may differ between regions. To rectify this, we sort the first region as the reference region and use its sort order to sort the remaining regions.
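
A minimal sketch of the reference-region idea, assuming each region is a list whose i-th entry corresponds to the i-th entry of every other region. The names `reference_order`, `sort_regions_identically`, and the `deps` mapping are hypothetical and only illustrate the approach; this is not Dynamo's actual implementation:

```python
from graphlib import TopologicalSorter


def reference_order(region, deps):
    """Return the positions of `region`'s nodes in a topological order."""
    position = {node: i for i, node in enumerate(region)}
    # Only keep dependencies that live inside the region itself.
    graph = {n: [d for d in deps.get(n, ()) if d in position] for n in region}
    return [position[n] for n in TopologicalSorter(graph).static_order()]


def sort_regions_identically(regions, deps):
    """Sort the first region, then apply the same permutation to the rest."""
    order = reference_order(regions[0], deps)
    return [[region[i] for i in order] for region in regions]


# Two structurally identical regions; "c" depends on "a" and "b".
regions = [["b0", "a0", "c0"], ["b1", "a1", "c1"]]
deps = {"c0": ["a0", "b0"], "c1": ["a1", "b1"]}
sorted_regions = sort_regions_identically(regions, deps)
```

Because the permutation is computed once, nodes without arguments (here `a` and `b`) end up in the same relative order in every region.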

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158814
Approved by: https://github.com/williamwen42
2025-08-13 11:55:23 +00:00
34ec5ed275 [Dynamo][Hierarchical Compile] Allow parameters to be propagated to submodules (#157979)
Fixes issue with HF Gen AI models where we mark a param as static and a get_attr node gets put in the region.

The effect of this is lifting get_attr nodes to be inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157979
Approved by: https://github.com/williamwen42
2025-08-13 09:12:10 +00:00
641ee74781 Revert "Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)"
This reverts commit f990490a23815ea6ee27e487c70ba2cf513ba43d.

Reverted https://github.com/pytorch/pytorch/pull/150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150282#issuecomment-3182844949))
2025-08-13 09:01:52 +00:00
6e8865fbc1 port 3 distributed test to Intel GPU and unified some common functions (#158533)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We can enable Intel GPU with the following methods while trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- Unify some common code under torch/testing/_internal for multiple backends, for example:
  - requires_nccl_version
  - _dynamo_dist_per_rank_init
  - DynamoDistributedSingleProcTestCase
  - DistTestCases
  - FSDPTestMultiThread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-13 08:13:23 +00:00
9a06e6d031 [claude-code] Add top-level module doc for torch/distributed/tensor/_op_schema.py (#157804)
Not sure how good the description is, seeking insight from maintainers.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157804
Approved by: https://github.com/wanchaol
2025-08-13 07:27:11 +00:00
6ea8376f84 Enable XPU for test_autograd_function.py (#160309)
# Description
Fixes #114850, we will port dynamo tests to Intel GPU
We can enable Intel GPU with the following methods while trying our best to keep the original code style:

# Changes
1. Get device type from get_devtype() method.
2. Replace the requires_cuda_and_triton with requires_gpu.
3. Add HAS_XPU_AND_TRITON into the scope.

# Notify

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160309
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-08-13 06:38:34 +00:00
8eee08d227 Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK (#160411)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160411
Approved by: https://github.com/ezyang
2025-08-13 06:31:10 +00:00
e497620260 Add compile_id: Optional[CompileID] to torch._logging._internal.trace_structured_artifact (#160440)
Context:
When writing a custom `torch.compile` backend, I quite frequently (ab)use `trace_structured_artifact` because I'm too lazy to customize tlparse (ref: 6d8b13c867).

I recently noticed that some of the artifacts I want to store are generated where the CompileID cannot be correlated, and the `tlparse` HTML says
> Sometimes, logs are made without a compile id. This makes it difficult to correlate related logs. This stack trie shows all places where log entries occurred without compile context; to fix, look an appropriate place in the stack where compile id should have been specified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160440
Approved by: https://github.com/ezyang
2025-08-13 06:28:23 +00:00
199e9abb6a [fx] fix split_module with symint (#160093)
Fixes https://github.com/pytorch/pytorch/issues/155220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160093
Approved by: https://github.com/ezyang
2025-08-13 05:50:15 +00:00
685f15dbea [vllm hash update] update the pinned vllm hash (#160484)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160484
Approved by: https://github.com/pytorchbot
2025-08-13 04:54:03 +00:00
85db508af5 Wrap class definitions in set_fullgraph(False) in test_int/bool/float/complex (#160276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160276
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217
2025-08-13 04:53:03 +00:00
27156ec804 Wrap class definitions in set_fullgraph(False) in test_operator (#160217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160217
Approved by: https://github.com/zou3519
ghstack dependencies: #160216
2025-08-13 04:53:03 +00:00
6746bc59df Wrap class definitions in set_fullgraph(False) in test_set (#160216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160216
Approved by: https://github.com/zou3519
2025-08-13 04:53:03 +00:00
3008d985a8 [CD] Do not build pytorch with nvshem on ARM (#160465)
As nvshmem binary from 3.3.9 is not compatible with manylinux2_28, and 3.3.20 is not available for download yet
Also, package nvshmem binary into full wheel

Fixes https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160465
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-08-13 04:10:43 +00:00
652a6f5954 Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)"
This reverts commit 5a9c4cfce42b9eb87da0de40c5633f083115c307.

Reverted https://github.com/pytorch/pytorch/pull/160403 on behalf of https://github.com/malfet due to It indeed consistently broken inductor, see 118bc97b14/1 ([comment](https://github.com/pytorch/pytorch/pull/160403#issuecomment-3182101130))
2025-08-13 04:05:46 +00:00
118bc97b14 Write full tensors out at once in HF consolidation script (#159394)
Not all storage systems support writing at random offsets. This PR changes the consolidation script to write each tensor into a buffer and then write out the buffer, going sequentially through every tensor in the output file. This also helps when the sharded files were not sharded purely along the row-wise dimension: small writes are expensive, and we previously issued one write per chunk, where a chunk was the largest run of contiguous bytes in the final tensor, which can be only a few bytes for col-wise sharding. Now the full tensor is assembled before the write, reducing the number of small writes.
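
A minimal sketch of the buffered approach, assuming one byte per element and column-wise shards. `assemble_tensor` and `write_tensors` are hypothetical helpers for illustration, not the consolidation script's API:

```python
import io


def assemble_tensor(rows, cols, shards):
    """Build one full row-major tensor in memory from column-wise shards.

    Each shard is (col_offset, width, data), where `data` holds `rows * width`
    bytes in row-major order (one byte per element, for simplicity).
    """
    buf = bytearray(rows * cols)
    for col_offset, width, data in shards:
        for r in range(rows):
            start = r * cols + col_offset
            buf[start:start + width] = data[r * width:(r + 1) * width]
    return bytes(buf)


def write_tensors(out, tensors):
    """Append each fully assembled tensor with a single sequential write."""
    for tensor in tensors:
        out.write(tensor)


out = io.BytesIO()
# Two shards, each holding two of the four columns of a 2x4 tensor.
full = assemble_tensor(2, 4, [(0, 2, b"abef"), (2, 2, b"cdgh")])
write_tensors(out, [full])
```

The scattered copies happen in memory, so the output file only ever sees one append per tensor, regardless of how the shards were cut.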

Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
2025-08-13 03:51:16 +00:00
305fa22393 [GHF] Remove app { name databaseId} query (#160494)
Remove it from the `PRCheckSuites` fragment, as it causes a security exception when used with the new GITHUB_TOKEN, which looks as follows
```
RuntimeError: GraphQL query
fragment PRReviews on PullRequestReviewConnection {
  nodes {
    author {
      login
    }
    bodyText
    createdAt
    authorAssociation
    editor {
      login
    }
    databaseId
    url
    state
  }
  pageInfo {
    startCursor
    hasPreviousPage
  }
}

fragment PRCheckSuites on CheckSuiteConnection {
  edges {
    node {
      app {
        name
        databaseId
      }
      workflowRun {
        workflow {
          name
          databaseId
        }
        databaseId
        url
      }
      checkRuns(first: 50) {
        nodes {
          name
          conclusion
          detailsUrl
          databaseId
          title
          summary
        }
        pageInfo {
          endCursor
          hasNextPage
        }
      }
      conclusion
    }
    cursor
  }
  pageInfo {
    hasNextPage
  }
}

fragment CommitAuthors on PullRequestCommitConnection {
  nodes {
    commit {
      authors(first: 2) {
        nodes {
          user {
            login
          }
          email
          name
        }
      }
      oid
    }
  }
  pageInfo {
    endCursor
    hasNextPage
  }
}

query ($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      closed
      isCrossRepository
      author {
        login
      }
      title
      body
      headRefName
      headRepository {
        nameWithOwner
      }
      baseRefName
      baseRefOid
      baseRepository {
        nameWithOwner
        isPrivate
        defaultBranchRef {
          name
        }
      }
      mergeCommit {
        oid
      }
      commits_with_authors: commits(first: 100) {
        ...CommitAuthors
        totalCount
      }
      commits(last: 1) {
        nodes {
          commit {
            checkSuites(first: 10) {
              ...PRCheckSuites
            }
            status {
              contexts {
                context
                state
                targetUrl
              }
            }
            oid
          }
        }
      }
      changedFiles
      files(first: 100) {
        nodes {
          path
        }
        pageInfo {
          endCursor
          hasNextPage
        }
      }
      reviews(last: 100) {
        ...PRReviews
      }
      comments(last: 5) {
        nodes {
          bodyText
          createdAt
          author {
            login
          }
          authorAssociation
          editor {
            login
          }
          databaseId
          url
        }
        pageInfo {
          startCursor
          hasPreviousPage
        }
      }
      labels(first: 100) {
        edges {
          node {
            name
          }
        }
      }
    }
  }
}
, args {'name': 'pytorch', 'owner': 'pytorch', 'number': 159820} failed: [{'type': 'FORBIDDEN', 'path': ['repository', 'pullRequest', 'commits', 'nodes', 0, 'commit', 'checkSuites', 'edges', 4, 'node', 'app'], 'extensions': {'saml_failure': False}, 'locations': [{'line': 26, 'column': 7}], 'message': 'Resource not accessible by integration'}]
```
But the same query works fine if executed using one's Personal Access Token

Updated the mocks file by running
```
sed -i -e s/a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/ gql_mocks.json
sed -i -e s/157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f/28349cb4c891bbf85255fab2c33c770baf77c3e02b29ca9a0e4c6c97bed041db/ gql_mocks.json
sed '/"app": {/,+3d' gql_mocks-orig.json >gql_mocks.json
sed '/"app": null/d' gql_mocks-orig.json >gql_mocks.json
```

Re-enable the offending jobs

Fixes https://github.com/pytorch/pytorch/issues/159894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160494
Approved by: https://github.com/huydhn
ghstack dependencies: #160490, #160492
2025-08-13 03:46:39 +00:00
1151b40cbf [BE] Filter unused mocks (#160492)
Somebody checked in twice the necessary number of mocks into the archive

Filter them out by running the following script
```python
import json
with open("gql_mocks-orig.json") as f:
    mocks = json.load(f)

keys = list(mocks.keys())
good_shas = {'a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876',
             '157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f',
             '4715ed05b382e572135c049664939f22f9b1249bc0c499ae278d655ad8cb598b',
             'a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5',
             'e5130469b5373479776bfbccade8039ce4741b97873bb3bec4e279fed08602be',
             '5dc32efeb8306f03744f6804ef4b500882f2759f7ac17fdc9f123669bfe4805a',
             '0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98',
             '8b50878b010492fe64005cc4b4ed34ac5f6695ce093f06b0d8d5403b7787c2c0',
             '2877b3b1e8630ca4ae797b9d85d5673d25ca8488c01141e11ff55f4a1359fca7'}
for k in keys:
    if any(sha in k for sha in good_shas):
        continue
    del mocks[k]

with open("gql_mocks.json","w") as f:
    json.dump(mocks, f, indent=2)
    f.write("\n")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160492
Approved by: https://github.com/huydhn
ghstack dependencies: #160490
2025-08-13 03:46:39 +00:00
d0f9785af3 [CI] Prevent accidental gql_mocks updates by test_trymerge (#160490)
As they can no longer be fetched from GitHub, see https://github.com/pytorch/pytorch/issues/160489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160490
Approved by: https://github.com/huydhn
2025-08-13 03:46:32 +00:00
ba47821f52 [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X (#160444)
* thread_work_size of 16 is giving better perf with many workloads for MI300X

cherry-pick of fb81400d34

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160444
Approved by: https://github.com/jeffdaily
2025-08-13 03:41:25 +00:00
2c5e10a5fc Add new function consolidate_safetensors_files_on_every_rank for HF consolidation (#159393)
Currently we are only using rank-0 for HF consolidation. But we should be able to use every rank to consolidate the sharded files, which will speed up the consolidation by Nx (where N is the number of ranks). Adding a new method consolidate_safetensors_files_on_every_rank to do this.
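
The message does not say how work is divided between ranks. As a hedged illustration only, one simple scheme is a round-robin split of the output files; `files_for_rank` is a hypothetical helper, not the signature of `consolidate_safetensors_files_on_every_rank`:

```python
def files_for_rank(output_files, rank, world_size):
    """Round-robin split: rank r takes files r, r + N, r + 2N, ... (N = world_size)."""
    # Sort first so every rank derives the same assignment independently.
    return sorted(output_files)[rank::world_size]


files = ["b.safetensors", "a.safetensors", "c.safetensors"]
assignment = [files_for_rank(files, r, 2) for r in range(2)]
```

With N ranks each handling roughly 1/N of the files, the ideal speedup is the Nx mentioned above, assuming the files are of similar size.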

Differential Revision: [D79000720](https://our.internmc.facebook.com/intern/diff/D79000720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159393
Approved by: https://github.com/saumishr
ghstack dependencies: #159392
2025-08-13 03:31:36 +00:00
355462e127 Add stable Tensor get_device_index, use more stable DeviceIndex (#160143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160143
Approved by: https://github.com/mikaylagawarecki
2025-08-13 03:27:10 +00:00
41673110cd [inductor] Windows inductor use intel-openmp. (#160258)
After some debugging, I found that PyTorch's torch_cpu.dll uses intel-openmp rather than MSVC OpenMP.
So, switch Windows inductor to intel-openmp.

It fixed: c8205cb354/test/inductor/test_aot_inductor.py (L2405-L2408)
<img width="896" height="230" alt="image" src="https://github.com/user-attachments/assets/273b00f8-7dc1-43c9-9b7f-752e16355a80" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160258
Approved by: https://github.com/ezyang
2025-08-13 02:36:19 +00:00
6be6d06295 Avoid potential deadlocks in host allocator (#159352)
# Motivation
This PR fixes a potential deadlock in the host allocator.
When calling `event->record(stream)`, the `record_stream` implementation may acquire the Python GIL.
In places such as 842cc77ab9/aten/src/ATen/cuda/CachingHostAllocator.cpp (L145-L151), and 842cc77ab9/aten/src/ATen/xpu/CachingHostAllocator.cpp (L22-L28) `record_stream` is invoked while holding the allocator lock.

To prevent deadlocks, we must ensure the locking order is:
**GIL → Allocator Lock**.
Reversing the order (**Allocator Lock → GIL**) can cause a deadlock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159352
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-08-13 02:30:17 +00:00
f15ada5c6f Enable output padding when only outermost dim is dynamic (#159404)
Summary: When the shape of the output tensor has a dynamic outermost dim, the stride can still be padded to conform to the configured alignment if required.
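
A small worked sketch of why this is possible: in a row-major layout, the outermost size never enters any stride. `padded_strides` is a hypothetical, simplified helper (it rounds every non-innermost stride up to a multiple of `align`) and is not Inductor's actual padding heuristic:

```python
def padded_strides(shape, align):
    """Row-major strides with every non-innermost stride rounded up to `align`.

    `shape[0]` is never read, so it may be a symbolic (dynamic) size.
    """
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        unpadded = strides[i + 1] * shape[i + 1]
        strides[i] = -(-unpadded // align) * align  # ceil to a multiple of align
    return strides


# The outermost dim is a placeholder string standing in for a dynamic size.
strides_2d = padded_strides(("s0", 30), align=8)
strides_3d = padded_strides(("s0", 5, 30), align=8)
```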

Test Plan:
CI

Rollback Plan:

Differential Revision: D79146886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159404
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-13 01:28:22 +00:00
69a0a9aa7f [Inductor][Triton] Pass GPUTarget param to updated make_ir function (#160422)
Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument.

Test Plan:
`buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk  fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel`

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D80069909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160422
Approved by: https://github.com/davidberard98, https://github.com/mlazos
2025-08-13 01:27:57 +00:00
32099961d5 [EZ] Delete CircleCI case (#160479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160479
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #160477
2025-08-13 01:19:09 +00:00
8d1cf52922 [EZ][BE] Remove unused conda-env-macOS-ARM64 (#160477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160477
Approved by: https://github.com/atalman
2025-08-12 23:41:25 +00:00
b1f43548ca [c10d] Error out the case when registering symmetric memory without eager init (#160145)
Instead of implicitly creating the NCCL comm inside mem pool registration for symmetric memory, we decided to raise an error, so that we only support the eager-init case, where the NCCL comm is already initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145
Approved by: https://github.com/kwen2501
2025-08-12 23:25:04 +00:00
0d71ca2c46 [EZ] Replace pytorch-labs with meta-pytorch (#160459)
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files

## Files Modified
This PR updates files that contained the target text.

Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
2025-08-12 22:44:25 +00:00
5737372862 [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882)
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase available runners even with same number of CI nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 22:42:40 +00:00
2e4e5ab4be [MPS] Add mps keys to indices and values ops (#160223)
enable indices and values on sparse mps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223
Approved by: https://github.com/malfet
2025-08-12 22:08:44 +00:00
16d15445f8 Fullgraph graph capture with dynamo. (#159749)
Summary:
Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0

We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API.

This PR adds a prototype for the API described in the doc.

Test Plan:
test_misc -- -k test_aot_capture

Rollback Plan:

Differential Revision: D79534608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159749
Approved by: https://github.com/tugsbayasgalan
2025-08-12 22:06:18 +00:00
101276f81b [BE] Save attributes for CppCompileError for pickleing (#160294)
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)

Context:
When testing the cutlass backend with subprocess autotuning, I would sometimes see a C++ compilation error (expected) followed by
```
Traceback (most recent call last):
  File "/torch/_inductor/autotune_process.py", line 175, in get
    result = TuningProcess.recv(self.read_pipe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/autotune_process.py", line 99, in recv
    return pickle.load(read_pipe)
           ^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking Claude, it seems

> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:

Adding these seems to help.
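
A self-contained sketch of the failure and the `__reduce__` fix, using stand-in classes rather than the real `CppCompileError` (the class names and message format here are assumptions for illustration):

```python
import pickle


class BrokenCompileError(Exception):
    def __init__(self, cmd, output):
        # Only the formatted message reaches Exception.args, so unpickling
        # calls BrokenCompileError(message) and fails with a TypeError.
        super().__init__(f"C++ compile error\nCommand: {' '.join(cmd)}\nOutput: {output}")


class PicklableCompileError(Exception):
    def __init__(self, cmd, output):
        self.cmd = cmd
        self.output = output
        super().__init__(f"C++ compile error\nCommand: {' '.join(cmd)}\nOutput: {output}")

    def __reduce__(self):
        # Rebuild the exception from the original constructor arguments.
        return (self.__class__, (self.cmd, self.output))


def round_trip(exc):
    """Simulate sending an exception across a process boundary."""
    return pickle.loads(pickle.dumps(exc))


restored = round_trip(PicklableCompileError(["g++", "a.cpp"], "boom"))

try:
    round_trip(BrokenCompileError(["g++", "a.cpp"], "boom"))
    broken_failed = False
except TypeError:
    broken_failed = True
```

The default `Exception` pickling replays `self.args`, which is whatever was passed to `super().__init__()`, so any exception whose constructor signature differs from that needs its own `__reduce__`.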

fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294
Approved by: https://github.com/masnesral
2025-08-12 22:03:36 +00:00
cbffde7745 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion. TL;DR: install 'Better Jinja' in VS Code and you get syntax highlighting.

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-12 21:59:54 +00:00
78a2fe1d42 [TorchScript] thread-safe ErrorReport::CallStack (#160386)
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.
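
A Python analogue of this design, offered only as a sketch: the real change is in C++ RAII guards, and `_Stack`, `current_stack`, and `CallStackGuard.close` below are hypothetical names. Each guard remembers the stack it pushed onto and pops from that same stack under a lock, even when closed from another thread:

```python
import threading

_tls = threading.local()


class _Stack:
    """A per-thread call stack that is safe to touch from other threads."""

    def __init__(self):
        self.lock = threading.Lock()
        self.frames = []


def current_stack():
    if not hasattr(_tls, "stack"):
        _tls.stack = _Stack()
    return _tls.stack


class CallStackGuard:
    def __init__(self, frame):
        # Remember the stack of the *constructing* thread.
        self._stack = current_stack()
        with self._stack.lock:
            self._stack.frames.append(frame)

    def close(self):
        # Pops the top of the remembered stack, which may not be this guard's
        # own frame (hence the possibility of mangled callstacks).
        with self._stack.lock:
            if self._stack.frames:
                self._stack.frames.pop()


guard = CallStackGuard("fn")
# Destroy the guard from a different thread than the one that created it.
worker = threading.Thread(target=guard.close)
worker.start()
worker.join()
```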

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386
Approved by: https://github.com/eellison
2025-08-12 21:59:04 +00:00
f8f0414a59 fix cpp builder to avoid missing-source compile error (#160354)
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
    sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.
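
To make the claim concrete, the quoted condition can be evaluated as a pure function of its three inputs (`strips_paths` is a hypothetical name used only for this illustration):

```python
def strips_paths(is_fbcode, aot_mode, use_relative_path):
    """The condition quoted above, as a pure function of its three inputs."""
    return is_fbcode and (not aot_mode or use_relative_path)


# fbcode, non-AOT mode, relative paths NOT requested: basenames are still used.
stripped_unexpectedly = strips_paths(is_fbcode=True, aot_mode=False, use_relative_path=False)
# fbcode, AOT mode, relative paths not requested: paths are kept.
stripped_in_aot = strips_paths(is_fbcode=True, aot_mode=True, use_relative_path=False)
```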

Fixes internal issue:
```

FAILED (errors=1)

CppCompileError: C++ compile error

Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so

Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```

Reviewed By: clee2000

Differential Revision: D80025417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
2025-08-12 21:36:22 +00:00
4d419a7461 Add pad and narrow to torch/csrc/stable/ops.h (#159328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159328
Approved by: https://github.com/janeyx99
ghstack dependencies: #159507
2025-08-12 21:29:49 +00:00
655137b678 Update torch::stable::Tensor() default constructor (#159507)
Allows things like

```cpp
Tensor cu_seqlens_q;
if (...) {
   cu_seqlens_q = ...
}
...
```

Also adds `torch::stable::Tensor.defined()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159507
Approved by: https://github.com/janeyx99
2025-08-12 21:29:49 +00:00
f27232a213 [ROCm] Limit number of values per thread for reductions on three dimensions (#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can reach the hundreds of thousands for certain tensors, which is bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to a reasonable limit, i.e. less than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.
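
A back-of-the-envelope sketch of the arithmetic, with made-up sizes; the helper names are hypothetical and this is not the kernel's actual launch logic:

```python
import math


def values_per_thread(num_values, num_threads):
    """Values each thread must reduce if the work is split evenly."""
    return math.ceil(num_values / num_threads)


def min_threads_for_cap(num_values, cap=2048):
    """Smallest thread count that keeps every thread at or below `cap` values."""
    return math.ceil(num_values / cap)


before = values_per_thread(1_000_000, 4)        # few threads, huge per-thread work
threads = min_threads_for_cap(1_000_000)        # more parallelism
after = values_per_thread(1_000_000, threads)   # bounded per-thread work
```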

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
2025-08-12 21:15:56 +00:00
c24ca7f4bf [FSDP][Collectives] skipping allgather when world size is 1 (#160135)
**Summary:** In its current state, FSDP collectives use cuda synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all gather comm has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160135
Approved by: https://github.com/weifengpy
2025-08-12 21:13:29 +00:00
b4596895b9 [DTensor] Registers sharding rule for rms_norm (#159692)
Reduces collective calls in the forward pass from 2 to 1

In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692
Approved by: https://github.com/tianyu-l
2025-08-12 21:05:24 +00:00
5a9c4cfce4 [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-12 21:02:44 +00:00
a354fa91e2 added class or module info for functions blocked by weight-only load (#159935)
Fixes #152985
In #152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals.
Because the error message didn't make the critical failure reason clear, they couldn't tell that only some functions were missing from the safe_globals registration.
This fix makes that point clearer.

Here's the new error message; the blocked-function information follows the warning message after a line break to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:

Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```
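
For readers unfamiliar with allowlist-based unpickling, the sketch below shows the general technique with the standard library only. It is not the `torch.load` implementation: `AllowlistUnpickler` and `restricted_loads` are hypothetical names, and the error text is merely modeled on the idea of naming where the blocked global comes from.

```python
import collections
import io
import pickle


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals listed in `allowed`."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self.allowed = set(allowed)

    def find_class(self, module, name):
        if (module, name) not in self.allowed:
            # Name both the global and the module it belongs to, so the user
            # can see exactly which registration is missing.
            raise pickle.UnpicklingError(
                f"Blocked global {name!r} which belongs to module {module!r}; "
                "register it explicitly if you trust it."
            )
        return super().find_class(module, name)


def restricted_loads(data, allowed=()):
    """Load `data`, permitting only the (module, name) pairs in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


blob = pickle.dumps(collections.OrderedDict(a=1))
try:
    restricted_loads(blob)
except pickle.UnpicklingError as err:
    print(err)
print(restricted_loads(blob, allowed=[("collections", "OrderedDict")]))
```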

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935
Approved by: https://github.com/mikaylagawarecki
2025-08-12 20:52:25 +00:00
f95b58c284 Remove usage of fsspec in HF consolidation script (#159392)
Moving towards supporting only local storage to take advantage of HF APIs such as safe_open. This was already done in the Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usage from the consolidation script and relies on local storage only.

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392
Approved by: https://github.com/sibuachu
2025-08-12 20:41:06 +00:00
8e6a313858 Add ownership token when needed on GradientEdge (#160098)
We can avoid the token by introducing PyObject preservation for THPFunction. But I think it will be too much complexity given that this kind of issue is very rare.
Happy to be talked into doing it though if someone really wants to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160098
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-08-12 20:14:18 +00:00
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.
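
As a rough sketch of per-rank wrapping for a `str` entrypoint (not the actual helper: `wrap_with_numactl` and its signature are hypothetical, and the real code may use different `numactl` flags):

```python
from typing import List, Optional


def wrap_with_numactl(command_args: List[str], numa_node: Optional[int]) -> List[str]:
    """Prefix a single rank's command with numactl if a NUMA node is given."""
    if numa_node is None:
        return list(command_args)
    # Only the CPUs are bound. Memory placement is left to Linux's default
    # local-allocation policy, so no --preferred flag is added.
    return ["numactl", f"--cpunodebind={numa_node}", *command_args]


print(wrap_with_numactl(["/usr/local/bin/python", "train.py"], 1))
print(wrap_with_numactl(["/usr/local/bin/python", "train.py"], None))
```

Computing the wrapper for one rank at a time keeps the function trivially reusable, which is the code-sharing benefit the refactor above describes.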

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran a similar test with `str` entrypoints, as before, just to be sure they still work.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
89654db1ab [inductor] fix triton bucketize mask propagation (#159961)
See 6b414f56a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961
Approved by: https://github.com/eellison
2025-08-12 19:59:32 +00:00
2d0cdee394 move thread-local capture mode guard to include work.isStarted (#160398)
Per title, should fix capture errors that happen because nccl watchdog races with capture start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398
Approved by: https://github.com/aorenste
2025-08-12 19:25:04 +00:00
eqy
9903ca4f70 [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-08-12 18:07:41 +00:00
f341077ce4 Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281)"
This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b.

Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](a7abf57aab).  Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))
2025-08-12 17:57:57 +00:00
3cec82a7e9 Ensure outer aliasing on DTensor matches inner aliasing (#158954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954
Approved by: https://github.com/albanD, https://github.com/wconstab
2025-08-12 17:47:48 +00:00
ee9f8ba11d [ROCm] Use opportunistic fastatomics based on heuristics (#159430)
* Opportunistic fast atomics works better with small sizes, since there is more chance of lanes doing atomics on the same address

Co-author: @amd-hhashemi

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

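# Warm up, then drain the queue so warm-up work is not counted in the timing.
for _ in range(20):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()

start_time = time.time()
for _ in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time) / 100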
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-08-12 17:13:54 +00:00
1f4057c11a [inductor] remove no_x_dim (#159810)
no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue.

However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-08-12 17:10:31 +00:00
94b91a8763 [redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352)
Summary:
Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast.
ref: D79456310 (got reverted because of linter)

Testing:
Refer to Differential Revision: D79917440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352
Approved by: https://github.com/masnesral
2025-08-12 16:49:08 +00:00
a7abf57aab [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation.

Confirmed the new approach can handle large inputs. Correctness needs validation.
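
The axis split can be illustrated with a small host-side calculation. This is a sketch under stated assumptions, not the kernel change itself: `split_grid_x` and `linear_block_index` are hypothetical names, and the limit of 1000 is a toy value rather than the real ROCm limit.

```python
import math


def split_grid_x(num_blocks: int, max_x: int):
    """Split a 1-D block count into (x, y) with x <= max_x and x * y >= num_blocks."""
    y = max(1, math.ceil(num_blocks / max_x))
    x = max(1, math.ceil(num_blocks / y))
    return x, y


def linear_block_index(block_x: int, block_y: int, grid_x: int) -> int:
    """Recover the original 1-D block index from the 2-D block coordinates."""
    return block_y * grid_x + block_x


x, y = split_grid_x(2500, max_x=1000)  # toy limit, for illustration only
print(x, y)
# The grid may be slightly larger than needed, so blocks whose linear index
# is >= num_blocks have to exit early.
print(linear_block_index(x - 1, y - 1, x))
```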

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-08-12 16:42:55 +00:00
f7b2f3314c Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000)"
This reverts commit d0e2240f680ea2a553f7ee8188f52482e130bfd0.

Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))
2025-08-12 16:33:02 +00:00
9d37c960a4 [ROCm][CI] use new benchmark image for dynamo (#160421)
Follow-up to #160047 that separated the rocm image into default CI and benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 16:07:19 +00:00
b219ca2a00 Revert "Update triton xpu commit to support python 3.14 (#160183)"
This reverts commit 7fbc22855c17741ae016992803b2e147a13aa22d.

Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](7fbc22855c).  Maybe because the docker build changed?  Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))
2025-08-12 15:29:19 +00:00
b7db86600a Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416)
Fixes Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding as per this suggestion of Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416
Approved by: https://github.com/malfet
2025-08-12 15:15:12 +00:00
9708fcf92d Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120)
This PR fixes a bug where user-defined triton kernels hidden behind `triton_op` do not register source code changes. If a user changes *only* a triton kernel's source code, the change goes unnoticed: because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet.

This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time.

This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
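
A minimal sketch of the AST-based idea, assuming kernels are recognized by the `name[grid](...)` launch syntax; the PR's actual detection logic may differ, and `find_launched_kernels` and the example source are hypothetical:

```python
import ast
import textwrap


def find_launched_kernels(source: str) -> set:
    """Collect names used in `name[grid](...)` launch expressions."""
    tree = ast.parse(textwrap.dedent(source))
    names = set()
    for node in ast.walk(tree):
        # A launch is a call whose callee is a subscript: kernel[grid](args)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Subscript)
            and isinstance(node.func.value, ast.Name)
        ):
            names.add(node.func.value.id)
    return names


example = """
def my_op(x, y):
    out = empty_like(x)
    add_kernel[(4,)](x, y, out, 1024)
    return helper(out)  # kernels launched inside helper() are NOT found
"""
print(find_launched_kernels(example))
```

The `helper(out)` call shows the limitation described above: a purely syntactic scan of one function body cannot see kernels launched by the functions it calls.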

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
2025-08-12 14:11:06 +00:00
a288b15ea9 [CI] Reduce XPU Windows build time (#159763)
Reduce the time cost from 2.5 hours to about 1.5 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:04:29 +00:00
7fbc22855c Update triton xpu commit to support python 3.14 (#160183)
Follow PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:02:36 +00:00
f33ce40bc0 [bucketing] Bucket only adjacent collectives to prevent reordering (#159983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983
Approved by: https://github.com/wconstab, https://github.com/eellison
2025-08-12 11:57:00 +00:00
4d5b3f2d5a [dynamo][guards] Install dict watchers for recursive dict tag optimization (#159796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796
Approved by: https://github.com/jansel
2025-08-12 09:49:11 +00:00
f990490a23 Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)
Fixes #91545

## Changes

- Add `label_smoothing` param and docs
- Add test case for `label_smoothing`
- Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss`
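
The description does not spell out the smoothing formula, so the following is only a sketch under an assumption: that binary targets are pulled toward 0.5, the conventional two-class form of label smoothing. The function names are hypothetical and this is plain Python, not the `nn.BCELoss` code.

```python
import math


def smooth_targets(target: float, label_smoothing: float) -> float:
    """Assumed convention: move a binary target toward 0.5 by `label_smoothing`."""
    return target * (1.0 - label_smoothing) + 0.5 * label_smoothing


def bce(prob: float, target: float, label_smoothing: float = 0.0) -> float:
    """Binary cross-entropy for one predicted probability in (0, 1)."""
    t = smooth_targets(target, label_smoothing)
    return -(t * math.log(prob) + (1.0 - t) * math.log(1.0 - prob))


print(bce(0.9, 1.0))                       # no smoothing
print(bce(0.9, 1.0, label_smoothing=0.1))  # smoothed target penalizes over-confidence
```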

##  Test Result

```bash
pytest -s test/test_nn.py -k test_bce
```

![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05)

![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964)

![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282
Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki
2025-08-12 09:37:03 +00:00
b9003ed3d8 Dynamo Deep Dive Documentation Fix (#158860)
changed SourceBuilder to VariableBuilder

Fixes #158447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860
Approved by: https://github.com/mlazos
2025-08-12 08:53:33 +00:00
fea7e9dd37 extract shape in _view_has_unbacked_input (#160255)
Summary: We were still getting a DDE on reshape. Looking deeper, I found an issue in _view_has_unbacked_input: when the input is [[,,]], it needs to be normalized to [..].
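
The normalization in question can be sketched as follows. `normalize_shape_args` is a hypothetical name used for illustration, not the helper added by this change.

```python
from typing import Tuple


def normalize_shape_args(*shape) -> Tuple[int, ...]:
    """Flatten view/reshape arguments to a tuple of sizes.

    Accepts both `f(2, 3)` and `f([2, 3])`; in the second form the sizes
    arrive wrapped in an extra sequence and must be unwrapped first.
    """
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        return tuple(shape[0])
    return tuple(shape)


print(normalize_shape_args(2, 3))
print(normalize_shape_args([2, 3]))
```

Without the unwrapping step, a check that iterates over the arguments would see one list instead of the individual sizes.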

Test Plan:
existing tests.

Rollback Plan:

Differential Revision: D79951119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
2025-08-12 08:38:19 +00:00
9a0f7a3bb0 [retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348)
refer: https://github.com/pytorch/pytorch/pull/159655

Earlier pr failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed.
Updated test_dynamo_timed + re-ran locally to test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348
Approved by: https://github.com/masnesral
2025-08-12 06:24:54 +00:00
01bcf9a40d Bump transformers pin (#159291)
Trying to update hf pin.

Benchmarking run to figure out issues

<img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" />

Retrying - https://github.com/pytorch/pytorch/pull/156118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-12 05:14:17 +00:00
8d3d1c8443 [dynamo] fixes to propagate tag safeness (#159807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807
Approved by: https://github.com/jansel
2025-08-12 04:50:13 +00:00
0f3b10b8ee [audio hash update] update the pinned audio hash (#160384)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384
Approved by: https://github.com/pytorchbot
2025-08-12 04:38:04 +00:00
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Run the same diff on two days and both show speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
edaa151d0d [CI] Move CUDA tests to trunk workflow (#160379)
The trunk workflow runs before a PR is merged anyway, but about 3X less frequently than the pull workflow, according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b)
<img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" />

I.e. this will probably result in a somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and that Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379
Approved by: https://github.com/izaitsevfb
2025-08-12 04:23:50 +00:00
10bc36fe84 Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341)
Short-term fix for https://github.com/pytorch/pytorch/issues/160333

The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.

The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
2025-08-12 04:09:37 +00:00
32e5e2f596 [vllm hash update] update the pinned vllm hash (#160259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259
Approved by: https://github.com/pytorchbot
2025-08-12 04:04:53 +00:00
bfc873d02e [ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083)
This reverts the changes from b367e5f6a6. This will also close https://github.com/pytorch/pytorch/pull/158922.

Since 30387ab2e4, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib.

There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 853, in run
    rocm_dir_path = Path(os.environ["ROCM_DIR"])
                         ~~~~~~~~~~^^^^^^^^^^^^
  File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
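
For reference, a defensive version of that lookup might look like the sketch below. `find_rocm_dir` is a hypothetical helper, not code from `setup.py`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_rocm_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the ROCm directory from ROCM_DIR or ROCM_PATH, or None if neither is set."""
    for var in ("ROCM_DIR", "ROCM_PATH"):
        value = env.get(var)  # .get() avoids the KeyError shown in the traceback
        if value:
            return Path(value)
    return None


print(find_rocm_dir({}))
print(find_rocm_dir({"ROCM_PATH": "/opt/rocm"}))
```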

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
2025-08-12 02:45:49 +00:00
eed9dbf70f [ROCm] Add torch/_rocm_init.py to .gitignore. (#159806)
Follow-up to https://github.com/pytorch/pytorch/pull/155285.

Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like:

```python
def initialize():
    import rocm_sdk
    rocm_sdk.initialize_process(
        preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'],
        check_version='7.0.0rc20250804')
```

We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves here into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520).

This file is then loaded if present here: a7f3bdf550/torch/__init__.py (L145-L157)

Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806
Approved by: https://github.com/jeffdaily
2025-08-12 02:24:21 +00:00
be53f609aa fix retaining multimem in symmetric memory (#160343)
fixes OOM in #160289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343
Approved by: https://github.com/eqy
2025-08-12 02:03:20 +00:00
95210cc409 [BE] Isolate pre-push hook dependencies in dedicated virtual environment (#160048)
This adds two changes:
- Isolates pre-push hook dependencies in a dedicated venv, so they no longer affect your system environment
- Lets you manually run the pre-push lintrunner (including with lintrunner -a) by invoking `python scripts/lintrunner.py [-a]` (it's ugly, but better than nothing...for now)

This is a follow up to:
- https://github.com/pytorch/pytorch/pull/158389

## Problem
The current pre-push hook setup installs lintrunner and related dependencies globally, which makes developers nervous about system pollution and can cause version conflicts with existing installations.

Also, if the pre-push lintrunner found errors, you had to hope your normal lintrunner could fix them (which wasn't always the case, e.g. if those errors only manifested in certain python versions)

##  Key Changes:
  - Isolated Environment: Creates .git/hooks/linter/.venv/ with Python 3.9 (the python used in CI) and an isolated lintrunner installation
  - User-Friendly CLI: New python scripts/lintrunner.py wrapper allows developers to run lintrunner (including -a auto-fix) from any environment
  - Simplified Architecture: Eliminates pre-commit dependency entirely - uses direct git hooks

  File Changes:
  - scripts/setup_hooks.py: Rewritten to create isolated uv-managed virtual environment
  - scripts/lintrunner.py: New wrapper script with shared hash management logic
  - scripts/run_lintrunner.py: Removed (functionality merged into lintrunner.py)
  - .pre-commit-config.yaml: Removed (no longer needed)

##  Usage:
```
  # Setup (run once)
  python scripts/setup_hooks.py

  # Manual linting (works from any environment)
  python scripts/lintrunner.py        # Check mode
  python scripts/lintrunner.py -a     # Auto-fix mode

  # Git hooks work automatically
  git push  # Runs lintrunner in isolated environment

  # Need to skip the pre-push hook?
  git push --no-verify
```

##  Benefits:
  -  Zero global dependency installation
  -  Per-repository isolation prevents version conflicts
  -  Full lintrunner functionality is now accessible

##  Implementation Notes:
  - Virtual env is kept in a dedicated dir in .git, to keep per-repo mechanics
  - lintrunner.py does not need to be invoked from a specific venv.  It'll invoke the right venv itself.
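
A sketch of how a wrapper can locate the isolated installation regardless of the caller's environment. This assumes the `lintrunner` executable sits in the venv's standard script directory; `lintrunner_command` is a hypothetical name and the real `scripts/lintrunner.py` may work differently.

```python
import sys
from pathlib import Path


def lintrunner_command(repo_root: Path, extra_args=()):
    """Build the command line for the lintrunner installed in the hook venv."""
    venv = Path(repo_root) / ".git" / "hooks" / "linter" / ".venv"
    if sys.platform == "win32":
        exe = venv / "Scripts" / "lintrunner.exe"
    else:
        exe = venv / "bin" / "lintrunner"
    # Calling the executable by absolute path means the caller's own
    # environment (active venv, PATH) does not matter.
    return [str(exe), *extra_args]


print(lintrunner_command(Path("."), ["-a"]))
```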

A minor bug: It tends to garble the lintrunner output a bit, like the screenshot below shows, but I haven't found a workaround so far and it remains understandable to users:
<img width="241" height="154" alt="image" src="https://github.com/user-attachments/assets/9496f925-8524-4434-8486-dc579442d688" />

## What's next?
Features that could be added:
- Check for lintrunner updates, auto-update if needed
- Depending on dev response, this could be enabled by default for all pytorch/pytorch environments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160048
Approved by: https://github.com/seemethere
2025-08-12 01:58:46 +00:00
7a974a88f2 [ROCm] Fix resource_strings.h (#159996)
This PR fixes the errors like below:

```
[rank7]: RuntimeError: /tmp/comgr-c3c81b/input/CompileSourceejOPx6:34:8: error: unknown type name 'uint64_t'; did you mean
'__hip_internal::uint64_t'? [rank7]: 34 | if(((uint64_t) t0.data) % (4 * sizeof(half)) != 0) flag_vec4 = false;
```

The following datatypes need to be defined in `torch/csrc/jit/codegen/fuser/cuda/resource_strings.h` for ROCm versions >= 7.0.

```
typedef unsigned char uint8_t;
typedef signed char int8_t;
typedef short int  int16_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159996
Approved by: https://github.com/pruthvistony, https://github.com/Skylion007, https://github.com/jeffdaily
2025-08-12 01:58:02 +00:00
f3f159ff8c [BE][cutlass backend] Reduce severity of log message for no cutlass config found (#160148)
This is not really a problem. Sometimes we cannot find a cutlass config due to shape, e.g. when k is odd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160148
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-08-12 01:41:58 +00:00
b90feeac86 [BE][cutlass backend] Fix subproc addmm tests (#160295)
Differential Revision: [D79977421](https://our.internmc.facebook.com/intern/diff/D79977421/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160295
Approved by: https://github.com/jingsh
2025-08-12 01:41:06 +00:00
0d40ff3b49 [inductor] fix test_different_file_paths_local_pgo on Windows. (#160382)
fix test_different_file_paths_local_pgo on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160382
Approved by: https://github.com/angelayi
2025-08-12 01:35:39 +00:00
cae2b5e3d2 [ROCm][Windows] Enable USE_ROCM, disable USE_RCCL on Windows. (#159079)
This allows setting `USE_ROCM` on Windows. A few other patches are still required to build (see https://github.com/ROCm/TheRock/issues/589), but we have instructions using open source code and rocm python packages available at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#build-pytorch-with-rocm-support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159079
Approved by: https://github.com/jeffdaily
2025-08-12 01:28:20 +00:00
ee89cc7a0a [ROCm][Windows] Fix LoadHIP handling of environment variable paths on Windows. (#159080)
See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`.

This fixes configure errors like:
```
CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set):
  Syntax error in cmake code at

    D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2

  when parsing string

    D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules

  Invalid character escape '\p'.

CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile):
  Failed to configure test project build system.
```

(note the mixed usage of `\` and `/` in that string)
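
As a rough illustration of the normalization involved (not the CMake change made in this PR), the Python sketch below rewrites a mixed-separator Windows path so that it uses only `/`, which is similar in spirit to what the path-conversion commands on the linked CMake page do. The helper name `to_forward_slashes` is hypothetical, and it handles a single path rather than a `;`-separated list.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Return `path` with every separator written as '/'.

    Hypothetical helper for illustration only; it takes one path,
    not a ';'-separated search list.
    """
    # PureWindowsPath parses both '\\' and '/' as separators on any host OS.
    return PureWindowsPath(path).as_posix()


# The first entry of the string from the error message above.
mixed = r"D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/"
normalized = to_forward_slashes(mixed)
print(normalized)
```

With only forward slashes left, nothing in the string can be read as an escape sequence such as `\p` when it is later embedded in generated CMake code.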

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080
Approved by: https://github.com/jeffdaily
2025-08-12 00:18:19 +00:00
e63c2b21c1 [PP] Initialize P2P communicators on first step (#160210)
We were hitting hangs in multi-node settings; initializing the NCCL communicators needed for batch P2P ops ahead of time fixes this.

This change adds extra communication, since it sends a dummy tensor to the next and previous stage ranks. However, this cost is paid only on the first step, so it is negligible.
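
The idea can be sketched in plain Python as follows. This is a minimal sketch, not the code from this PR: the class `FirstStepWarmup` and the `exchange` callback are hypothetical, and in a real pipeline the callback would presumably send and receive a dummy tensor with the given peer rank (for example through `torch.distributed` batched P2P ops).

```python
from typing import Callable, List


class FirstStepWarmup:
    """Run a one-time dummy exchange with neighbouring stages (hypothetical)."""

    def __init__(self, stage_index: int, num_stages: int,
                 exchange: Callable[[int], None]) -> None:
        self.stage_index = stage_index
        self.num_stages = num_stages
        self.exchange = exchange  # assumed to talk to one peer stage
        self._initialized = False

    def neighbours(self) -> List[int]:
        """Previous and next stage indices, where they exist."""
        peers = []
        if self.stage_index > 0:
            peers.append(self.stage_index - 1)
        if self.stage_index < self.num_stages - 1:
            peers.append(self.stage_index + 1)
        return peers

    def step(self, run_step: Callable[[], str]) -> str:
        # The warm-up cost is paid only once, on the first call.
        if not self._initialized:
            for peer in self.neighbours():
                self.exchange(peer)
            self._initialized = True
        return run_step()


# Usage with a fake exchange that just records which peers were contacted.
contacted: List[int] = []
stage = FirstStepWarmup(stage_index=1, num_stages=4, exchange=contacted.append)
first = stage.step(lambda: "step 0 done")
second = stage.step(lambda: "step 1 done")
print(contacted)
```

Guarding the warm-up with a flag keeps the steady-state step path unchanged, which matches the claim that the extra communication is paid only on the first step.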

Debug history: https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160210
Approved by: https://github.com/wconstab
2025-08-11 23:46:58 +00:00
3626ba711b [FlexAttention] Swap from and to & for new triton (#160227)
Fixes #158463
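
The commit message does not spell out why newer Triton needs `&` here, and how Triton lowers `and` is not shown. As background only, the sketch below shows the difference in plain Python: `&` dispatches to `__and__` and can be elementwise, while `and` short-circuits and returns one of its operands. The `Mask` class is a hypothetical stand-in for a boolean tensor.

```python
class Mask:
    """Tiny stand-in for a boolean tensor (hypothetical, illustration only)."""

    def __init__(self, bits):
        self.bits = list(bits)

    def __and__(self, other):
        # Elementwise AND, the behaviour one usually wants for masks.
        return Mask(a and b for a, b in zip(self.bits, other.bits))


in_bounds = Mask([True, True, False])
allowed = Mask([True, False, True])

elementwise = in_bounds & allowed      # calls Mask.__and__
short_circuit = in_bounds and allowed  # returns the right operand unchanged

print(elementwise.bits)
print(short_circuit is allowed)
```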

On B200 I am getting a bunch of error spew:
```Shell
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
Triton compilation failed: triton_tem_fused_zeros_1
def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0):
    PRESCALE_QK : tl.constexpr = False
```
```Shell
74 = arith.subi %170, %166 : i32
          %175 = arith.muli %174, %c128_i32 : i32
          %176 = arith.subi %175, %c64_i32 : i32
          %177 = arith.extui %173 : i1 to i32
          %178 = arith.muli %176, %177 : i32
          %179 = arith.subi %c1_i32, %177 : i32
          %180 = arith.muli %179, %c64_i32 : i32
          %181 = arith.addi %178, %180 : i32
          %182 = arith.muli %181, %c64_i32 : i32
          %183 = tt.splat %182 : i32 -> tensor<64x64xi32>
          %184 = tt.addptr %arg19, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %185 = tt.addptr %arg20, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %186 = tt.splat %181 : i32 -> tensor<64xi32>
          %187 = arith.addi %arg21, %186 : tensor<64xi32>
          scf.yield %163, %184, %185, %187 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
        }
        %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32>
        %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1>
        %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32>
        %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32>
        %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %122 = arith.select %115, %cst_4, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1>
        %123 = tt.broadcast %122 : tensor<1x64xi1> -> tensor<64x64xi1>
        %124 = arith.select %123, %121, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %125 = arith.mulf %124, %cst_2 : tensor<64x64xf32>
        %126 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
        %127 = arith.subf %125, %126 : tensor<64x64xf32>
        %128 = math.exp2 %127 : tensor<64x64xf32>
        %129 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %130 = tt.dot %51, %129, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %131 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
        %132 = tt.broadcast %131 : tensor<64x1xf32> -> tensor<64x64xf32>
        %133 = arith.subf %130, %132 : tensor<64x64xf32>
        %134 = arith.mulf %128, %133 : tensor<64x64xf32>
        %135 = arith.mulf %134, %cst_3 : tensor<64x64xf32>
        %136 = arith.select %116, %135, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %137 = arith.select %115, %122, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1>
        %138 = tt.broadcast %137 : tensor<1x64xi1> -> tensor<64x64xi1>
        %139 = arith.select %138, %136, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %140 = arith.truncf %139 : tensor<64x64xf32> to tensor<64x64xf16>
        %141 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
        %142 = tt.dot %140, %141, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        scf.yield %142 : tensor<64x64xf32>
      } else {
        scf.yield %cst_9 : tensor<64x64xf32>
      }
      %84 = tt.addptr %arg13, %22 : !tt.ptr<i32>, i32
      %85 = tt.load %84 : !tt.ptr<i32>
      %86 = arith.muli %85, %c128_i32 : i32
      %87 = tt.addptr %arg12, %21 : !tt.ptr<i32>, i32
      %88 = tt.load %87 : !tt.ptr<i32>
      %89 = tt.splat %86 : i32 -> tensor<64xi32>
      %90 = arith.addi %89, %14 : tensor<64xi32>
      %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
      %92 = arith.muli %91, %cst_11 : tensor<1x64xi32>
      %93 = tt.addptr %71, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
      %94 = tt.broadcast %93 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %95 = tt.addptr %94, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %96 = tt.addptr %76, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
      %97 = tt.broadcast %96 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %98 = tt.addptr %97, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %99 = arith.muli %88, %c2_i32 : i32
      %100 = arith.minsi %99, %c4_i32 : i32
      %101 = arith.cmpi sge, %100, %c1_i32 : i32
      %102 = scf.if %101 -> (tensor<64x64xf32>) {
        %112 = arith.subi %100, %c1_i32 : i32
        %113:4 = scf.for %arg17 = %c0_i32 to %112 step %c1_i32 iter_args(%arg18 = %83, %arg19 = %95, %arg20 = %98, %arg21 = %90) -> (tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
          %137 = tt.expand_dims %arg21 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %138 = arith.cmpi slt, %137, %cst_7 : tensor<1x64xi32>
          %139 = tt.broadcast %138 : tensor<1x64xi1> -> tensor<64x64xi1>
          %140 = tt.load %arg19, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %141 = tt.dot %46, %140, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %142 = arith.mulf %141, %cst_13 : tensor<64x64xf32>
          %143 = arith.mulf %142, %cst_3 : tensor<64x64xf32>
          %144 = arith.mulf %143, %cst_2 : tensor<64x64xf32>
          %145 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
          %146 = arith.subf %144, %145 : tensor<64x64xf32>
          %147 = math.exp2 %146 : tensor<64x64xf32>
          %148 = tt.load %arg20, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %149 = tt.dot %51, %148, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %150 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
          %151 = tt.broadcast %150 : tensor<64x1xf32> -> tensor<64x64xf32>
          %152 = arith.subf %149, %151 : tensor<64x64xf32>
          %153 = arith.mulf %147, %152 : tensor<64x64xf32>
          %154 = arith.mulf %153, %cst_3 : tensor<64x64xf32>
          %155 = arith.truncf %154 : tensor<64x64xf32> to tensor<64x64xf16>
          %156 = tt.trans %140 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %157 = tt.dot %155, %156, %arg18, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %158 = arith.divsi %arg17, %c2_i32 : i32
          %159 = tt.addptr %84, %158 : !tt.ptr<i32>, i32
          %160 = tt.load %159 evictionPolicy = evict_last : !tt.ptr<i32>
          %161 = arith.addi %158, %c1_i32 : i32
          %162 = arith.cmpi slt, %161, %88 : i32
          %163 = tt.addptr %159, %c1_i32 : !tt.ptr<i32>, i32
          %164 = tt.load %163, %162 evictionPolicy = evict_last : !tt.ptr<i32>
          %165 = arith.addi %arg17, %c1_i32 : i32
          %166 = arith.remsi %165, %c2_i32 : i32
          %167 = arith.cmpi eq, %166, %c0_i32 : i32
          %168 = arith.subi %164, %160 : i32
          %169 = arith.muli %168, %c128_i32 : i32
          %170 = arith.subi %169, %c64_i32 : i32
          %171 = arith.extui %167 : i1 to i32
          %172 = arith.muli %170, %171 : i32
          %173 = arith.subi %c1_i32, %171 : i32
          %174 = arith.muli %173, %c64_i32 : i32
          %175 = arith.addi %172, %174 : i32
          %176 = arith.muli %175, %c64_i32 : i32
          %177 = tt.splat %176 : i32 -> tensor<64x64xi32>
          %178 = tt.addptr %arg19, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %179 = tt.addptr %arg20, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %180 = tt.splat %175 : i32 -> tensor<64xi32>
          %181 = arith.addi %arg21, %180 : tensor<64xi32>
          scf.yield %157, %178, %179, %181 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
        }
        %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32>
        %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1>
        %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32>
        %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32>
        %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %122 = arith.mulf %121, %cst_2 : tensor<64x64xf32>
        %123 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
        %124 = arith.subf %122, %123 : tensor<64x64xf32>
        %125 = math.exp2 %124 : tensor<64x64xf32>
        %126 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %127 = tt.dot %51, %126, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %128 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
        %129 = tt.broadcast %128 : tensor<64x1xf32> -> tensor<64x64xf32>
        %130 = arith.subf %127, %129 : tensor<64x64xf32>
        %131 = arith.mulf %125, %130 : tensor<64x64xf32>
        %132 = arith.mulf %131, %cst_3 : tensor<64x64xf32>
        %133 = arith.select %116, %132, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %134 = arith.truncf %133 : tensor<64x64xf32> to tensor<64x64xf16>
        %135 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
        %136 = tt.dot %134, %135, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        scf.yield %136 : tensor<64x64xf32>
      } else {
        scf.yield %83 : tensor<64x64xf32>
      }
      %103 = tt.splat %33 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %104 = tt.addptr %103, %37 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %105 = tt.broadcast %104 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %106 = tt.addptr %105, %42 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %107 = arith.mulf %102, %cst_13 : tensor<64x64xf32>
      %108 = arith.cmpi slt, %40, %cst_11 : tensor<1x64xi32>
      %109 = tt.broadcast %108 : tensor<1x64xi1> -> tensor<64x64xi1>
      %110 = arith.andi %45, %109 : tensor<64x64xi1>
      %111 = arith.truncf %107 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %106, %111, %110 : tensor<64x64x!tt.ptr<f16>>
    } else {
      %16 = arith.divsi %0, %c2_i32 : i32
      %17 = arith.muli %0, %c64_i32 : i32
      %18 = tt.splat %17 : i32 -> tensor<64xi32>
      %19 = arith.addi %18, %14 : tensor<64xi32>
      %20 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
      %21 = arith.muli %20, %cst_14 : tensor<64x1xi32>
      %22 = tt.splat %11 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %23 = tt.addptr %22, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %24 = tt.expand_dims %14 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
      %25 = tt.broadcast %23 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %26 = tt.broadcast %24 : tensor<1x64xi32> -> tensor<64x64xi32>
      %27 = tt.addptr %25, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %28 = arith.cmpi slt, %20, %cst_10 : tensor<64x1xi32>
      %29 = tt.broadcast %28 : tensor<64x1xi1> -> tensor<64x64xi1>
      %30 = tt.load %27, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>>
      %31 = tt.splat %12 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %32 = tt.addptr %31, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %33 = tt.broadcast %32 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %34 = tt.addptr %33, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %35 = tt.load %34, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>>
      %36:2 = scf.for %arg17 = %c0_i32 to %c4_i32 step %c1_i32 iter_args(%arg18 = %cst_9, %arg19 = %cst_9) -> (tensor<64x64xf32>, tensor<64x64xf32>)  : i32 {
        %55 = arith.muli %2, %c4_i32 : i32
        %56 = arith.addi %55, %arg17 : i32
        %57 = arith.muli %56, %c2048_i32 : i32
        %58 = arith.muli %1, %c32768_i32 : i32
        %59 = arith.addi %57, %58 : i32
        %60 = arith.extsi %59 : i32 to i64
        %61 = arith.muli %1, %c16_i32 : i32
        %62 = arith.addi %61, %56 : i32
        %63 = arith.muli %62, %c32_i32 : i32
        %64 = arith.extsi %63 : i32 to i64
        %65 = tt.addptr %arg0, %60 : !tt.ptr<f16>, i64
        %66 = tt.addptr %arg5, %60 : !tt.ptr<f16>, i64
        %67 = tt.addptr %arg3, %64 : !tt.ptr<f32>, i64
        %68 = tt.addptr %arg4, %64 : !tt.ptr<f32>, i64
        %69 = arith.remsi %56, %c16_i32 : i32
        %70 = arith.muli %3, %c16_i32 : i32
        %71 = arith.addi %70, %69 : i32
        %72 = arith.muli %71, %c2_i32 : i32
        %73 = arith.addi %72, %16 : i32
        %74 = tt.addptr %arg11, %73 : !tt.ptr<i32>, i32
        %75 = tt.load %74 : !tt.ptr<i32>
        %76 = arith.muli %75, %c128_i32 : i32
        %77 = tt.addptr %arg10, %73 : !tt.ptr<i32>, i32
        %78 = tt.load %77 : !tt.ptr<i32>
        %79 = tt.splat %76 : i32 -> tensor<64xi32>
        %80 = arith.addi %79, %14 : tensor<64xi32>
        %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %82 = arith.muli %81, %cst_11 : tensor<1x64xi32>
        %83 = tt.splat %65 : !tt.ptr<f16> -> tensor<1x64x!tt.ptr<f16>>
        %84 = tt.addptr %83, %82 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
        %85 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %86 = tt.broadcast %84 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %87 = tt.broadcast %85 : tensor<64x1xi32> -> tensor<64x64xi32>
        %88 = tt.addptr %86, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %89 = tt.expand_dims %80 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %90 = arith.muli %89, %cst_14 : tensor<64x1xi32>
        %91 = tt.splat %66 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
        %92 = tt.addptr %91, %90 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
        %93 = tt.broadcast %92 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %94 = tt.addptr %93, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %95 = arith.muli %78, %c2_i32 : i32
        %96 = arith.minsi %95, %c1_i32 : i32
        %97 = arith.cmpi sge, %96, %c1_i32 : i32
        %98:2 = scf.if %97 -> (tensor<64x64xf32>, tensor<64x64xf32>) {
          %120 = arith.subi %96, %c1_i32 : i32
          %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %arg18, %arg22 = %arg19, %arg23 = %88, %arg24 = %94, %arg25 = %80) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
            %167 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
            %168 = arith.cmpi slt, %167, %cst_1 : tensor<1x64xi32>
            %169 = tt.broadcast %168 : tensor<1x64xi1> -> tensor<64x64xi1>
            %170 = tt.load %arg23, %169, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %171 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32>
            %172 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %173 = tt.addptr %172, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %174 = tt.load %173, %171 : tensor<64x!tt.ptr<f32>>
            %175 = arith.cmpf oeq, %174, %cst_16 : tensor<64xf32>
            %176 = arith.select %175, %cst_15, %174 : tensor<64xi1>, tensor<64xf32>
            %177 = tt.dot %30, %170, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %178 = arith.mulf %177, %cst_13 : tensor<64x64xf32>
            %179 = arith.mulf %178, %cst_3 : tensor<64x64xf32>
            %180 = arith.mulf %179, %cst_2 : tensor<64x64xf32>
            %181 = tt.expand_dims %176 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %182 = tt.broadcast %181 : tensor<1x64xf32> -> tensor<64x64xf32>
            %183 = arith.subf %180, %182 : tensor<64x64xf32>
            %184 = math.exp2 %183 : tensor<64x64xf32>
            %185 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
            %186 = arith.cmpi slt, %185, %cst_12 : tensor<64x1xi32>
            %187 = tt.broadcast %186 : tensor<64x1xi1> -> tensor<64x64xi1>
            %188 = tt.load %arg24, %187, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %189 = arith.truncf %184 : tensor<64x64xf32> to tensor<64x64xf16>
            %190 = tt.dot %189, %188, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %191 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %192 = tt.addptr %191, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %193 = tt.load %192, %171 : tensor<64x!tt.ptr<f32>>
            %194 = tt.trans %188 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %195 = tt.dot %35, %194, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %196 = tt.expand_dims %193 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %197 = tt.broadcast %196 : tensor<1x64xf32> -> tensor<64x64xf32>
            %198 = arith.subf %195, %197 : tensor<64x64xf32>
            %199 = arith.mulf %184, %198 : tensor<64x64xf32>
            %200 = arith.mulf %199, %cst_3 : tensor<64x64xf32>
            %201 = arith.truncf %200 : tensor<64x64xf32> to tensor<64x64xf16>
            %202 = tt.trans %170 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %203 = tt.dot %201, %202, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %204 = arith.divsi %arg20, %c2_i32 : i32
            %205 = tt.addptr %74, %204 : !tt.ptr<i32>, i32
            %206 = tt.load %205 evictionPolicy = evict_last : !tt.ptr<i32>
            %207 = arith.addi %204, %c1_i32 : i32
            %208 = arith.cmpi slt, %207, %78 : i32
            %209 = tt.addptr %205, %c1_i32 : !tt.ptr<i32>, i32
            %210 = tt.load %209, %208 evictionPolicy = evict_last : !tt.ptr<i32>
            %211 = arith.addi %arg20, %c1_i32 : i32
            %212 = arith.remsi %211, %c2_i32 : i32
            %213 = arith.cmpi eq, %212, %c0_i32 : i32
            %214 = arith.subi %210, %206 : i32
            %215 = arith.muli %214, %c128_i32 : i32
            %216 = arith.subi %215, %c64_i32 : i32
            %217 = arith.extui %213 : i1 to i32
            %218 = arith.muli %216, %217 : i32
            %219 = arith.subi %c1_i32, %217 : i32
            %220 = arith.muli %219, %c64_i32 : i32
            %221 = arith.addi %218, %220 : i32
            %222 = arith.muli %221, %c64_i32 : i32
            %223 = tt.splat %222 : i32 -> tensor<64x64xi32>
            %224 = tt.addptr %arg23, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %225 = tt.addptr %arg24, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %226 = tt.splat %221 : i32 -> tensor<64xi32>
            %227 = arith.addi %arg25, %226 : tensor<64xi32>
            scf.yield %203, %190, %224, %225, %227 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
          }
          %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32>
          %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1>
          %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32>
          %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>>
          %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32>
          %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32>
          %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32>
          %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32>
          %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %136 = arith.select %28, %cst, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1>
          %137 = tt.broadcast %136 : tensor<64x1xi1> -> tensor<64x64xi1>
          %138 = arith.select %137, %135, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %139 = arith.mulf %138, %cst_2 : tensor<64x64xf32>
          %140 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %141 = tt.broadcast %140 : tensor<1x64xf32> -> tensor<64x64xf32>
          %142 = arith.subf %139, %141 : tensor<64x64xf32>
          %143 = math.exp2 %142 : tensor<64x64xf32>
          %144 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
          %145 = arith.cmpi slt, %144, %cst_12 : tensor<64x1xi32>
          %146 = tt.broadcast %145 : tensor<64x1xi1> -> tensor<64x64xi1>
          %147 = tt.load %121#3, %146, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %148 = arith.truncf %143 : tensor<64x64xf32> to tensor<64x64xf16>
          %149 = tt.dot %148, %147, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %150 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %151 = tt.addptr %150, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %152 = tt.load %151, %126 : tensor<64x!tt.ptr<f32>>
          %153 = tt.trans %147 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %154 = tt.dot %35, %153, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %155 = tt.expand_dims %152 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %156 = tt.broadcast %155 : tensor<1x64xf32> -> tensor<64x64xf32>
          %157 = arith.subf %154, %156 : tensor<64x64xf32>
          %158 = arith.mulf %143, %157 : tensor<64x64xf32>
          %159 = arith.mulf %158, %cst_3 : tensor<64x64xf32>
          %160 = arith.select %29, %159, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %161 = arith.select %28, %136, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1>
          %162 = tt.broadcast %161 : tensor<64x1xi1> -> tensor<64x64xi1>
          %163 = arith.select %162, %160, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %164 = arith.truncf %163 : tensor<64x64xf32> to tensor<64x64xf16>
          %165 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %166 = tt.dot %164, %165, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          scf.yield %166, %149 : tensor<64x64xf32>, tensor<64x64xf32>
        } else {
          scf.yield %arg18, %arg19 : tensor<64x64xf32>, tensor<64x64xf32>
        }
        %99 = tt.addptr %arg15, %73 : !tt.ptr<i32>, i32
        %100 = tt.load %99 : !tt.ptr<i32>
        %101 = arith.muli %100, %c128_i32 : i32
        %102 = tt.addptr %arg14, %73 : !tt.ptr<i32>, i32
        %103 = tt.load %102 : !tt.ptr<i32>
        %104 = tt.splat %101 : i32 -> tensor<64xi32>
        %105 = arith.addi %104, %14 : tensor<64xi32>
        %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %107 = arith.muli %106, %cst_11 : tensor<1x64xi32>
        %108 = tt.addptr %83, %107 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
        %109 = tt.broadcast %108 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %110 = tt.addptr %109, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %111 = tt.expand_dims %105 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %112 = arith.muli %111, %cst_14 : tensor<64x1xi32>
        %113 = tt.addptr %91, %112 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
        %114 = tt.broadcast %113 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %115 = tt.addptr %114, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %116 = arith.muli %103, %c2_i32 : i32
        %117 = arith.minsi %116, %c1_i32 : i32
        %118 = arith.cmpi sge, %117, %c1_i32 : i32
        %119:2 = scf.if %118 -> (tensor<64x64xf32>, tensor<64x64xf32>) {
          %120 = arith.subi %117, %c1_i32 : i32
          %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %98#0, %arg22 = %98#1, %arg23 = %110, %arg24 = %115, %arg25 = %105) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
            %161 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
            %162 = arith.cmpi slt, %161, %cst_1 : tensor<1x64xi32>
            %163 = tt.broadcast %162 : tensor<1x64xi1> -> tensor<64x64xi1>
            %164 = tt.load %arg23, %163, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %165 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32>
            %166 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %167 = tt.addptr %166, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %168 = tt.load %167, %165 : tensor<64x!tt.ptr<f32>>
            %169 = arith.cmpf oeq, %168, %cst_16 : tensor<64xf32>
            %170 = arith.select %169, %cst_15, %168 : tensor<64xi1>, tensor<64xf32>
            %171 = tt.dot %30, %164, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %172 = arith.mulf %171, %cst_13 : tensor<64x64xf32>
            %173 = arith.mulf %172, %cst_3 : tensor<64x64xf32>
            %174 = arith.mulf %173, %cst_2 : tensor<64x64xf32>
            %175 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %176 = tt.broadcast %175 : tensor<1x64xf32> -> tensor<64x64xf32>
            %177 = arith.subf %174, %176 : tensor<64x64xf32>
            %178 = math.exp2 %177 : tensor<64x64xf32>
            %179 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
            %180 = arith.cmpi slt, %179, %cst_12 : tensor<64x1xi32>
            %181 = tt.broadcast %180 : tensor<64x1xi1> -> tensor<64x64xi1>
            %182 = tt.load %arg24, %181, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %183 = arith.truncf %178 : tensor<64x64xf32> to tensor<64x64xf16>
            %184 = tt.dot %183, %182, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %185 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %186 = tt.addptr %185, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %187 = tt.load %186, %165 : tensor<64x!tt.ptr<f32>>
            %188 = tt.trans %182 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %189 = tt.dot %35, %188, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %190 = tt.expand_dims %187 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %191 = tt.broadcast %190 : tensor<1x64xf32> -> tensor<64x64xf32>
            %192 = arith.subf %189, %191 : tensor<64x64xf32>
            %193 = arith.mulf %178, %192 : tensor<64x64xf32>
            %194 = arith.mulf %193, %cst_3 : tensor<64x64xf32>
            %195 = arith.truncf %194 : tensor<64x64xf32> to tensor<64x64xf16>
            %196 = tt.trans %164 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %197 = tt.dot %195, %196, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %198 = arith.divsi %arg20, %c2_i32 : i32
            %199 = tt.addptr %99, %198 : !tt.ptr<i32>, i32
            %200 = tt.load %199 evictionPolicy = evict_last : !tt.ptr<i32>
            %201 = arith.addi %198, %c1_i32 : i32
            %202 = arith.cmpi slt, %201, %103 : i32
            %203 = tt.addptr %199, %c1_i32 : !tt.ptr<i32>, i32
            %204 = tt.load %203, %202 evictionPolicy = evict_last : !tt.ptr<i32>
            %205 = arith.addi %arg20, %c1_i32 : i32
            %206 = arith.remsi %205, %c2_i32 : i32
            %207 = arith.cmpi eq, %206, %c0_i32 : i32
            %208 = arith.subi %204, %200 : i32
            %209 = arith.muli %208, %c128_i32 : i32
            %210 = arith.subi %209, %c64_i32 : i32
            %211 = arith.extui %207 : i1 to i32
            %212 = arith.muli %210, %211 : i32
            %213 = arith.subi %c1_i32, %211 : i32
            %214 = arith.muli %213, %c64_i32 : i32
            %215 = arith.addi %212, %214 : i32
            %216 = arith.muli %215, %c64_i32 : i32
            %217 = tt.splat %216 : i32 -> tensor<64x64xi32>
            %218 = tt.addptr %arg23, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %219 = tt.addptr %arg24, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %220 = tt.splat %215 : i32 -> tensor<64xi32>
            %221 = arith.addi %arg25, %220 : tensor<64xi32>
            scf.yield %197, %184, %218, %219, %221 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
          }
          %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32>
          %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1>
          %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32>
          %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>>
          %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32>
          %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32>
          %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32>
          %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32>
          %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %136 = arith.mulf %135, %cst_2 : tensor<64x64xf32>
          %137 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %138 = tt.broadcast %137 : tensor<1x64xf32> -> tensor<64x64xf32>
          %139 = arith.subf %136, %138 : tensor<64x64xf32>
          %140 = math.exp2 %139 : tensor<64x64xf32>
          %141 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
          %142 = arith.cmpi slt, %141, %cst_12 : tensor<64x1xi32>
          %143 = tt.broadcast %142 : tensor<64x1xi1> -> tensor<64x64xi1>
          %144 = tt.load %121#3, %143, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %145 = arith.truncf %140 : tensor<64x64xf32> to tensor<64x64xf16>
          %146 = tt.dot %145, %144, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %147 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %148 = tt.addptr %147, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %149 = tt.load %148, %126 : tensor<64x!tt.ptr<f32>>
          %150 = tt.trans %144 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %151 = tt.dot %35, %150, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %152 = tt.expand_dims %149 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %153 = tt.broadcast %152 : tensor<1x64xf32> -> tensor<64x64xf32>
          %154 = arith.subf %151, %153 : tensor<64x64xf32>
          %155 = arith.mulf %140, %154 : tensor<64x64xf32>
          %156 = arith.mulf %155, %cst_3 : tensor<64x64xf32>
          %157 = arith.select %29, %156, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %158 = arith.truncf %157 : tensor<64x64xf32> to tensor<64x64xf16>
          %159 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %160 = tt.dot %158, %159, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          scf.yield %160, %146 : tensor<64x64xf32>, tensor<64x64xf32>
        } else {
          scf.yield %98#0, %98#1 : tensor<64x64xf32>, tensor<64x64xf32>
        }
        scf.yield %119#0, %119#1 : tensor<64x64xf32>, tensor<64x64xf32>
      }
      %37 = tt.splat %13 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %38 = tt.addptr %37, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %39 = tt.broadcast %38 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %40 = tt.addptr %39, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %41 = arith.cmpi slt, %24, %cst_11 : tensor<1x64xi32>
      %42 = tt.broadcast %41 : tensor<1x64xi1> -> tensor<64x64xi1>
      %43 = arith.andi %29, %42 : tensor<64x64xi1>
      %44 = arith.truncf %36#1 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %40, %44, %43 : tensor<64x64x!tt.ptr<f16>>
      %45 = arith.mulf %36#0, %cst_13 : tensor<64x64xf32>
      %46 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x64xi32>
      %47 = arith.addi %26, %46 : tensor<64x64xi32>
      %48 = tt.splat %4 : i32 -> tensor<64x64xi32>
      %49 = arith.addi %47, %48 : tensor<64x64xi32>
      %50 = tt.splat %8 : i32 -> tensor<64x64xi32>
      %51 = arith.addi %49, %50 : tensor<64x64xi32>
      %52 = tt.splat %arg16 : !tt.ptr<f16> -> tensor<64x64x!tt.ptr<f16>>
      %53 = tt.addptr %52, %51 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %54 = arith.truncf %45 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %53, %54, %29 : tensor<64x64x!tt.ptr<f16>>
    }
    tt.return
  }
}

{-#
  external_resources: {
    mlir_reproducer: {
      pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=90}, sccp, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
      disable_threading: false,
      verify_each: true
    }
  }
#-}
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
Triton compilation failed: triton_tem_fused_zeros_1
def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0):
    PRESCALE_QK : tl.constexpr = False
    ROWS_GUARANTEED_SAFE : tl.constexpr = False
    BLOCKS_ARE_CONTIGUOUS : tl.constexpr = False
    WRITE_DQ : tl.constexpr = True
    OUTPUT_LOGSUMEXP : tl.constexpr = True
    FLOAT32_PRECISION : tl.constexpr = 'tf32'
    IS_DIVISIBLE : tl.constexpr = False
    SM_SCALE : tl.constexpr = 0.125
    GQA_SHARED_HEADS : tl.constexpr = 4
    HAS_FULL_BLOCKS : tl.constexpr = True
    QK_HEAD_DIM : tl.constexpr = 64
    QK_HEAD_DIM_ROUNDED : tl.constexpr = 64
    V_HEAD_DIM : tl.constexpr = 64
    V_HEAD_DIM_ROUNDED : tl.constexpr = 64
    SAFE_HEAD_DIM : tl.constexpr = True
    BLOCK_M1 : tl.constexpr = 64
    BLOCK_N1 : tl.constexpr = 64
    BLOCK_M2 : tl.constexpr = 64
    BLOCK_N2 : tl.constexpr = 64
    SPARSE_Q_BLOCK_SIZE : tl.constexpr = 128
    SPARSE_KV_BLOCK_SIZE : tl.constexpr = 128
    Q = arg_Q
    K = arg_K
    V = arg_V
    LSE = arg_LSE
    DELTA = arg_DELTA
    DO = arg_DO
    DQ = arg_DQ
    DV = arg_DV
    KV_NUM_BLKS = arg_KV_NUM_BLKS
    KV_IDX = arg_KV_IDX
    Q_NUM_BLKS = arg_Q_NUM_BLKS
    Q_IDX = arg_Q_IDX
    FULL_KV_NUM_BLKS = arg_FULL_KV_NUM_BLKS
    FULL_KV_IDX = arg_FULL_KV_IDX
    FULL_Q_NUM_BLKS = arg_FULL_Q_NUM_BLKS
    FULL_Q_IDX = arg_FULL_Q_IDX

    # Sub notation for this kernel:
    #
    # Q: Query, K: Key, V: Value
    # LSE: logsumexp (logsumexp is always stored in fp32 regardless of the input dtype)
    # DELTA: Precomputed sum(OUT*DO, axis=-1)
    # DO: Derivative of Output, DQ: Derivative of Query, DV: Derivative of Value
    # DK: Derivative of Key, is the written to via the store_output call due to some limitations with
    # inductor codegen
    # M: Number of queries, N: Number of keys/values
    # QK_HEAD_DIM: The dimension of the query and key embeddings
    # V_HEAD_DIM: The dimension of the value embeddings
    # z: Batch size, h: Number of heads, m: Number of queries or keys/values, d: Head dim
    # GQA_SHARED_HEADS: number of query heads sharing one kv head in GQA setups.
    # (Modifiable) Performance tuning options
    # BLOCK_M1: when calculating DK & DV, iterate over BLOCK_M1 across the seqlen dim of Q in each thread block.
    # BLOCK_N1: when calculating DK & DV, the thread block size across the seqlen dim of K/V.
    # BLOCK_M2: when calculating DQ, the thread block size across the seqlen dim of Q.
    # BLOCK_N2: when calculating DQ, iterate over BLOCK_N2 across the seqlen dim of K/V in each thread block.
    #
    # The following FULL_* and PARTIAL_* is defined in the block sparse mask grid, rather than the thread block grid.
    # KV_NUM_BLKS: The number of KV blocks (that may or may not require masking) for each query.
    # KV_IDX: The indices of KV blocks (that may or may not require masking) for each query.
    # Q_NUM_BLKS: The number of Q blocks (that may or may not require masking) for each query.
    # Q_IDX: The indices of Q blocks (that may or may not require masking) for each query.
    # FULL_KV_NUM_BLKS: The number of fully unmasked KV blocks (so we don't need masking) for each query.
    # FULL_KV_IDX: The indices of fully unmasked KV blocks (so we don't need masking) for each query.
    # FULL_Q_NUM_BLKS: The number of fully unmasked Q blocks (so we don't need masking) for each query.
    # FULL_Q_IDX: The indices of fully unmasked Q blocks (so we don't need masking) for each query.

    # The below are kernel options that can be applied for certain score_mods,
    # or involve a numerics vs. perf tradeoff
    # PRESCALE_QK: Whether to pre-scale QK by 1/sqrt(d) and change of base. Has
    # about 20% more numerical error, but slightly faster.

    # Define strides of inputs
    stride_qz, stride_qh, stride_qm, stride_qd = 32768, 2048, 64, 1
    stride_kz, stride_kh, stride_kn, stride_kd = 65536, 16384, 64, 1
    stride_vz, stride_vh, stride_vn, stride_vd = 65536, 16384, 64, 1
    stride_doz, stride_doh, stride_dom, stride_dod = 32768, 2048, 64, 1

    stride_dqz, stride_dqh, stride_dqm, stride_dqd = 32768, 2048, 64, 1
    stride_dvz, stride_dvh, stride_dvm, stride_dvd = 65536, 16384, 64, 1

    ZQ = 2
    HQ = 16
    HKV = 4
    Q_LEN = 32
    ZKV = 2
    KV_LEN = 256

    MATMUL_PRECISION = Q.dtype.element_ty

    pid = tl.program_id(0)
    NUM_KV_BLOCKS = tl.cdiv(KV_LEN, BLOCK_N1)
    NUM_Q_BLOCKS = tl.cdiv(Q_LEN, BLOCK_M2)

    off_zq = tl.program_id(1) # q batch idx
    off_hkv = tl.program_id(2) # kv head idx
    off_zkv = off_zq % ZKV # kv batch idx

    SPARSE_Z = 2
    SPARSE_HQ = 16

    sparse_idx_z = off_zq % SPARSE_Z

    k_adj = (stride_kh * off_hkv + stride_kz * off_zkv).to(tl.int64)
    v_adj = (stride_vh * off_hkv + stride_vz * off_zkv).to(tl.int64)
    # first compute broadcasted dv of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
    # then reduce to dv of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
    dv_adj = (stride_dvh * off_hkv + stride_dvz * off_zq).to(tl.int64)

    # offset K, V, DV pointers for batch/kv-head
    K += k_adj
    V += v_adj
    DV += dv_adj

    RCP_LN2 = 1.44269504
    offs_k = tl.arange(0, QK_HEAD_DIM_ROUNDED)
    offs_v = tl.arange(0, V_HEAD_DIM_ROUNDED)

    if pid >= NUM_KV_BLOCKS:
        off_pid = pid - NUM_KV_BLOCKS
        # THIS BLOCK DOES DQ
        SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M2)
        SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N2)
        off_hq2 = off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS
        start_m2_block = off_pid % NUM_Q_BLOCKS
        off_pid_mask = start_m2_block // SPARSE_Q_MULTIPLE
        stride_kv_num_blks_h = 1
        stride_kv_idx_h = 2
        stride_kv_idx_m = 2

        sparse_idx_hq2 = off_hq2 % SPARSE_HQ
        sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq2

        sparse_kv_num_blks_offset = sparse_hz_offset * stride_kv_num_blks_h + off_pid_mask
        sparse_kv_idx_offset = sparse_hz_offset * stride_kv_idx_h + off_pid_mask * stride_kv_idx_m  # noqa: B950

        # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads.
        q_adj2 = (stride_qh * off_hq2 + stride_qz * off_zq).to(tl.int64)
        do_adj2 = (stride_doh * off_hq2 + stride_doz * off_zq).to(tl.int64)
        dq_adj2 = (stride_dqh * off_hq2 + stride_dqz * off_zq).to(tl.int64)
        off_chz2 = ((off_zq * HQ + off_hq2) * Q_LEN).to(tl.int64)

        Q2 = Q + q_adj2
        DO2 = DO + do_adj2
        # TODO: This does not work if DQ is not the same layout as Q (for example,
        # if Q is broadcasted)
        DQ2 = DQ + dq_adj2
        LSE2 = LSE + off_chz2
        DELTA2 = DELTA + off_chz2

        # dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM], dtype=tl.float32)
        dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM_ROUNDED], dtype=tl.float32)

        start_m2 = start_m2_block * BLOCK_M2
        offs_m2 = start_m2 + tl.arange(0, BLOCK_M2)

        # load Q and do: they stay in SRAM throughout the inner loop.
        q = load_checked_2d(Q2, offs_m2, offs_k, stride_qm, stride_qd, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, QK_HEAD_DIM)
        do = load_checked_2d(DO2, offs_m2, offs_v, stride_dom, stride_dod, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, V_HEAD_DIM)

        if PRESCALE_QK:
            q = (q * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION)

        if IS_DIVISIBLE:
            Di = tl.load(DELTA2 + offs_m2)
            lse = tl.load(LSE2 + offs_m2)
        else:
            Di = tl.load(DELTA2 + offs_m2, mask=offs_m2 < Q_LEN)
            lse = tl.load(LSE2 + offs_m2, mask=offs_m2 < Q_LEN)
        lse = tl.where(lse == -float("inf"), 0.0, lse)
        lse = lse[:, None]

        # ~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # KV_IDX and KV_NUM_BLKS are always contiguous.
        kv_indices = KV_IDX + sparse_kv_idx_offset
        kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading
        sparse_kv_num_blocks = tl.load(KV_NUM_BLKS + sparse_kv_num_blks_offset)

        offs_n2 = kv_start + tl.arange(0, BLOCK_N2)
        dq = bwd_dq_inner(
            arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
            K, V,
            dq, q, do, Di, lse,
            off_zq, off_hq2, offs_m2, offs_n2,
            stride_kn, stride_kd, stride_vn, stride_vd,
            kv_indices, sparse_kv_num_blocks,
            MATMUL_PRECISION,
            IS_FULL_BLOCKS=False,
        )

        if HAS_FULL_BLOCKS:
            # ~~~~~~~~~~~ partial unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # FULL_KV_IDX and FULL_KV_NUM_BLKS are always contiguous.
            kv_indices = FULL_KV_IDX + sparse_kv_idx_offset
            kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading
            sparse_kv_num_blocks = tl.load(FULL_KV_NUM_BLKS + sparse_kv_num_blks_offset)

            offs_n2 = kv_start + tl.arange(0, BLOCK_N2)
            dq = bwd_dq_inner(
                arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                K, V,
                dq, q, do, Di, lse,
                off_zq, off_hq2, offs_m2, offs_n2,
                stride_kn, stride_kd, stride_vn, stride_vd,
                kv_indices, sparse_kv_num_blocks,
                MATMUL_PRECISION,
                IS_FULL_BLOCKS=True,
            )

        # Write back dQ.
        dq_ptrs = DQ2 + offs_m2[:, None] * stride_dqm + offs_k[None, :] * stride_dqd
        dq *= SM_SCALE
        if IS_DIVISIBLE and SAFE_HEAD_DIM:
            tl.store(dq_ptrs, dq)
        else:
            tl.store(dq_ptrs, dq, mask=(offs_m2[:, None] < Q_LEN) & (offs_k[None, :] < QK_HEAD_DIM))
    else:
        # THIS BLOCK DOES DK & DV
        SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M1)
        SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N1)

        pid_mask = pid // SPARSE_KV_MULTIPLE

        stride_q_num_blks_h = 2
        stride_q_idx_h = 2
        stride_q_idx_n = 1

        dv = tl.zeros([BLOCK_N1, V_HEAD_DIM_ROUNDED], dtype=tl.float32)
        dk = tl.zeros([BLOCK_N1, QK_HEAD_DIM_ROUNDED], dtype=tl.float32)

        start_n1 = pid * BLOCK_N1
        offs_n1 = start_n1 + tl.arange(0, BLOCK_N1)

        # load K and V: they stay in SRAM throughout the inner loop.
        k = load_checked_2d(K, offs_n1, offs_k, stride_kn, stride_kd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, QK_HEAD_DIM)
        v = load_checked_2d(V, offs_n1, offs_v, stride_vn, stride_vd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, V_HEAD_DIM)

        if PRESCALE_QK:
            k = (k * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION)

        for off_g in range(0, GQA_SHARED_HEADS):
            off_hq1 = off_hkv * GQA_SHARED_HEADS + off_g

            # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads.
            q_adj1 = (stride_qh * off_hq1 + stride_qz * off_zq).to(tl.int64)
            do_adj1 = (stride_doh * off_hq1 + stride_doz * off_zq).to(tl.int64)
            dq_adj1 = (stride_dqh * off_hq1 + stride_dqz * off_zq).to(tl.int64)
            off_chz1 = ((off_zq * HQ + off_hq1) * Q_LEN).to(tl.int64)

            Q1 = Q + q_adj1
            DO1 = DO + do_adj1
            # TODO: This does not work if DQ is not the same layout as Q (for example,
            # if Q is broadcasted)
            LSE1 = LSE + off_chz1
            DELTA1 = DELTA + off_chz1

            sparse_idx_hq1 = off_hq1 % SPARSE_HQ
            sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq1

            sparse_q_num_blks_offset = sparse_hz_offset * stride_q_num_blks_h + pid_mask
            sparse_q_idx_offset = sparse_hz_offset * stride_q_idx_h + pid_mask * stride_q_idx_n  # noqa: B950

            # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # Q_IDX and Q_NUM_BLKS are always contiguous.
            q_indices = Q_IDX + sparse_q_idx_offset
            q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading
            sparse_q_num_blocks = tl.load(Q_NUM_BLKS + sparse_q_num_blks_offset)

            offs_m1 = q_start + tl.arange(0, BLOCK_M1)
            dk, dv = bwd_dkdv_inner(
                arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                Q1, DO1, DELTA1, LSE1,
                dk, dv, k, v,
                off_zq, off_hq1, offs_n1, offs_m1,
                stride_qm, stride_qd, stride_dom, stride_dod,
                q_indices, sparse_q_num_blocks,
                MATMUL_PRECISION,
                IS_FULL_BLOCKS=False,
            )

            if HAS_FULL_BLOCKS:
                # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                # FULL_Q_IDX and FULL_Q_NUM_BLKS are always contiguous.
                q_indices = FULL_Q_IDX + sparse_q_idx_offset
                q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading
                sparse_q_num_blocks = tl.load(FULL_Q_NUM_BLKS + sparse_q_num_blks_offset)

                offs_m1 = q_start + tl.arange(0, BLOCK_M1)
                dk, dv = bwd_dkdv_inner(
                    arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                    Q1, DO1, DELTA1, LSE1,
                    dk, dv, k, v,
                    off_zq, off_hq1, offs_n1, offs_m1,
                    stride_qm, stride_qd, stride_dom, stride_dod,
                    q_indices, sparse_q_num_blocks,
                    MATMUL_PRECISION,
                    IS_FULL_BLOCKS=True,
                )

        # Write back dV and dK.
        dv_ptrs = DV + offs_n1[:, None] * stride_dvm + offs_v[None, :] * stride_dvd

        index_n = offs_n1[:, None]
        index_k = offs_k[None, :]
        index_v = offs_v[None, :]

        if IS_DIVISIBLE and SAFE_HEAD_DIM:
            tl.store(dv_ptrs, dv)
        else:
            tl.store(dv_ptrs, dv, mask=(index_n < KV_LEN) & (index_v < V_HEAD_DIM))

        dk *= SM_SCALE

        if SAFE_HEAD_DIM:
            mask = index_n < KV_LEN
        else:
            mask = (index_n < KV_LEN) & (index_k < QK_HEAD_DIM)

        # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
        # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
        xindex = index_k + 64*index_n + 16384*off_hkv + 65536*off_zq
        tl.store(out_ptr0 + (tl.broadcast_to(xindex, dk.shape)), dk, mask)

metadata: {'signature': {'arg_Q': '*fp16', 'arg_K': '*fp16', 'arg_V': '*fp16', 'arg_LSE': '*fp32', 'arg_DELTA': '*fp32', 'arg_DO': '*fp16', 'arg_DQ': '*fp16', 'arg_DV': '*fp16', 'arg_KV_NUM_BLKS': '*i32', 'arg_KV_IDX': '*i32', 'arg_Q_NUM_BLKS': '*i32', 'arg_Q_IDX': '*i32', 'arg_FULL_KV_NUM_BLKS': '*i32', 'arg_FULL_KV_IDX': '*i32', 'arg_FULL_Q_NUM_BLKS': '*i32', 'arg_FULL_Q_IDX': '*i32', 'out_ptr0': '*fp16'}, 'device': 0, 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (9,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (14,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 4, 'num_stages': 3, 'debug': True, 'cc': 100}
Traceback (most recent call last):
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile
    next_module = compile_ir(module, metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda>
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir
    pm.run(mod)
RuntimeError: PassManager::run failed
frames [('total', 3), ('ok', 3)]
inline_call []
stats [('calls_captured', 8), ('unique_graphs', 3)]
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('ok', 1)]
inductor [('triton_bundler_save_kernel', 8), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1), ('fxgraph_cache_bypass', 1)]
graph_break []
F

==================================================== FAILURES =====================================================
_____________________________ TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 ______________________________
Traceback (most recent call last):
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 446, in instantiated_test
    raise rte
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1349, in dep_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1215, in dep_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 1430, in test_GQA
    self.run_test(*inputs)
  File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 566, in run_test
    compiled_out.backward(backward_grad)
  File "/home/drisspg/meta/pytorch/torch/_tensor.py", line 625, in backward
    torch.autograd.backward(
  File "/home/drisspg/meta/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/drisspg/meta/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/autograd/function.py", line 315, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2303, in backward
    return impl_fn()
           ^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2289, in impl_fn
    out = CompiledFunction._backward_impl(ctx, all_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2394, in _backward_impl
    CompiledFunction.compiled_bw = aot_config.bw_compiler(
                                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/schemas.py", line 1256, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_dynamo/backends/common.py", line 76, in _wrapped_bw_compiler
    disable(
  File "/home/drisspg/meta/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_utils_internal.py", line 92, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 2428, in bw_compiler
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 773, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_dynamo/repro/after_aot.py", line 124, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 952, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1652, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile
    compiled_module = graph.compile_to_module()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2318, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2328, in _compile_to_module
    mod = self._compile_to_module_lines(wrapper_code)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2396, in _compile_to_module_lines
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/codecache.py", line 3466, in load_by_key_path
    mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/tmp0yiz3c94/az/caza2gzmsagyuusmf2ka3oat3na4xv6zudssk244xmlzsbv2knze.py", line 117, in <module>
  File "/home/drisspg/meta/pytorch/torch/_inductor/async_compile.py", line 489, in triton
    kernel.precompile(
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 437, in precompile
    self._precompile_worker()
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 459, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile
    next_module = compile_ir(module, metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda>
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir
    pm.run(mod)
RuntimeError: PassManager::run failed

To execute this test, run the following from the base repo dir:
    python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
============================================= short test summary info =============================================
FAILED [5.1441s] test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_GQA_score_mod1_cuda_float16 - RuntimeError: PassManager::run failed
================================== 1 failed, 1 passed, 1404 deselected in 18.10s ==================================
~/meta/pytorch flex-warning !1 ❯
```
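
For readers skimming the log, the way the failing kernel splits its first grid axis between the dK/dV pass and the dQ pass can be reproduced in plain Python. This is a minimal sketch that only reuses the constants printed in the kernel above (`KV_LEN = 256`, `Q_LEN = 32`, `BLOCK_N1 = BLOCK_M2 = 64`, `GQA_SHARED_HEADS = 4`). The helper names `cdiv` and `dispatch` are illustrative assumptions and are not part of Inductor or Triton.

```python
# Constants copied from the kernel printed in the log above.
KV_LEN, Q_LEN = 256, 32
BLOCK_N1, BLOCK_M2 = 64, 64
GQA_SHARED_HEADS = 4


def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive ints (the role tl.cdiv plays in the kernel)."""
    return (a + b - 1) // b


NUM_KV_BLOCKS = cdiv(KV_LEN, BLOCK_N1)
NUM_Q_BLOCKS = cdiv(Q_LEN, BLOCK_M2)


def dispatch(pid: int, off_hkv: int) -> dict:
    """Hypothetical helper: report which branch a program id takes and its offsets."""
    if pid >= NUM_KV_BLOCKS:
        # Same arithmetic as the "THIS BLOCK DOES DQ" branch.
        off_pid = pid - NUM_KV_BLOCKS
        return {
            "branch": "dq",
            "off_hq2": off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS,
            "start_m2": (off_pid % NUM_Q_BLOCKS) * BLOCK_M2,
        }
    # Same arithmetic as the "THIS BLOCK DOES DK & DV" branch.
    return {"branch": "dkdv", "start_n1": pid * BLOCK_N1}
```

With these sizes, program ids 0 through 3 take the dK/dV branch and ids from 4 upward take the dQ branch, one query head per id because `NUM_Q_BLOCKS` is 1.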

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160227
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2025-08-11 23:30:20 +00:00
99bc2f94c1 Update export/schema.py (#160220)
Summary:
A Model can have multiple ExportedPrograms:
- for different methods, which can have different weights.
- for different delegates, which can also have different weights.

For this reason, weights are now stored per ExportedProgram.

Also, we clean up Model and Program. IIUC, Model and Program are not used anywhere, so it's OK to make a BC-breaking change.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79917395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160220
Approved by: https://github.com/angelayi, https://github.com/dolpm, https://github.com/jingsh
2025-08-11 23:14:08 +00:00
fc25c68f20 [hop][exc] make UncapturedHigherOrderOpError print user code and avoid re-raise (#159296)
After the change, the user code stack is attached to the error stack trace, and the output is collapsed into a single traceback (without the "scroll up" message). For example:
```python
    class Test(torch.nn.Module):
        def forward(self, c, x):
            def cond_fn(c, x):
                return c > 0 and x.size(0) < 20

            def body_fn(c, x):
                return c - 1, x.sin()

            return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
```

Now gives the following error message:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1705, in test_while_loop_size_mismatch_tensor_expansion
    self._run_test(
    ~~~~~~~~~~~~~~^
        model=WhileLoopModels.SizeMismatchTensorExpansion(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dynamic=dynamic,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1417, in _run_test
    result = model(*inputs_with_counters)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1053, in forward
    return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 176, in while_loop
    return torch.compile(
           ~~~~~~~~~~~~~~
        _while_loop_op_wrapper, backend=backend, fullgraph=True
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    )(flat_cond_fn, flat_body_fn, tuple(flat_inputs), tuple())
    ~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1595, in __call__
    result = self._torchdynamo_orig_backend(
        frame, cache_entry, self.hooks, frame_state, skip=1
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1353, in __call__
    result = self._inner_convert(
        frame, cache_entry, hooks, frame_state, skip=skip + 1
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 682, in __call__
    result = _compile(
        frame.f_code,
    ...<16 lines>...
        convert_frame_box=self._box,
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1172, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_utils_internal.py", line 98, in wrapper_function
    return function(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 858, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 897, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1461, in transform_code_object
    transformations(instructions, code_options)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 300, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 818, in transform
    tracer.run()
    ~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3528, in run
    super().run()
    ~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 91, in graph_break_as_hard_error
    raise exc.with_traceback(sys.exc_info()[2]) from None
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 77, in graph_break_as_hard_error
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1287, in call_function
    ) = speculate_subgraph(
        ~~~~~~~~~~~~~~~~~~^
        tx,
        ^^^
    ...<33 lines>...
        supports_aliasing=self.supports_aliasing,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 877, in speculate_subgraph
    raise ex
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 718, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
    return super().call_function(tx, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
    return tracer.inline_call_()
           ~~~~~~~~~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
    self.run()
    ~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
    return super().call_function(tx, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
    return tracer.inline_call_()
           ~~~~~~~~~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
    self.run()
    ~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 830, in inner
    unimplemented_v2(
    ~~~~~~~~~~~~~~~~^
        gb_type="Data-dependent branching",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        ],
        ^^
    )
    ^
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 580, in unimplemented_v2
    raise Unsupported(msg)
torch._dynamo.exc.UncapturedHigherOrderOpError: while_loop doesn't work unless it is captured completely with torch.compile. Got Data-dependent branching
  Explanation: Detected data-dependent branching (e.g. `if my_tensor.sum() > 0:`). Dynamo does not support tracing dynamic control flow.
  Hint: This graph break is fundamental - it is unlikely that Dynamo will ever be able to trace through your code. Consider finding a workaround.
  Hint: Use `torch.cond` to express dynamic control flow.

  Developer debug context: attempted to jump with TensorVariable()

 For more details about this graph break, please visit: https://pytorch-labs.github.io/compile-graph-break-site/gb/gb0170.html

from user code:
   File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 167, in _while_loop_op_wrapper
    return while_loop_op(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 137, in flat_cond_fn
    return cond_fn(*carried, *additional)
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1047, in cond_fn
    return c > 0 and x.size(0) < 20

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

To execute this test, run the following from the base repo dir:
    python test/inductor/test_control_flow.py WhileLoopTests.test_while_loop_size_mismatch_tensor_expansion_device_cpu_dynamic_False

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
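The `graph_break_as_hard_error` frame in the trace above re-raises with `from None`. The following is a minimal standalone sketch of that pattern, not the Dynamo implementation: `GraphBreak`, `HardError`, `as_hard_error`, and `trace` are hypothetical names used only for illustration. It shows how converting one exception into another with `from None` yields a single traceback instead of a chained one.

```python
import sys


class GraphBreak(Exception):
    """Hypothetical stand-in for a low-level tracing failure."""


class HardError(Exception):
    """Hypothetical stand-in for the user-facing error type."""


def as_hard_error(fn):
    """Convert GraphBreak into HardError without exception chaining."""

    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except GraphBreak as e:
            exc = HardError(f"op must be captured completely. Got {e}")
            # Reuse the original traceback so the failing frames stay visible,
            # and use `from None` so Python does not print the
            # "During handling of the above exception..." chain.
            raise exc.with_traceback(sys.exc_info()[2]) from None

    return wrapper


@as_hard_error
def trace():
    raise GraphBreak("Data-dependent branching")


def run_and_capture():
    """Run the wrapped function and return the converted exception."""
    try:
        trace()
    except HardError as err:
        return err
```

With `from None`, the converted exception has no `__cause__` and its implicit context is suppressed when the traceback is printed.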

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159296
Approved by: https://github.com/zou3519
2025-08-11 22:48:10 +00:00
5a40c57844 [MTIA] Implement isAvailable() for MTIA hooks (#160304)
Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods.

Test Plan:
`torch._C._get_accelerator()`

Rollback Plan:

Differential Revision: D79981115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304
Approved by: https://github.com/nautsimon
2025-08-11 21:45:11 +00:00
7d2ec704e4 Fix MPS autocast for ConvTranspose3d (#160345)
## Summary
- ensure ConvTranspose3d uses fp32 under MPS autocast
- add MPS autocast test for ConvTranspose3d

Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c

Fixes https://github.com/pytorch/pytorch/issues/160332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345
Approved by: https://github.com/dcci
2025-08-11 21:01:52 +00:00
fc80f6859e Fix collective schedule logging and runtime tests (#160260)
Summary:

- Fix collective schedule logging so that it only logs when collectives are present
- Fix runtime estimate test to check that each op has a numeric value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160260
Approved by: https://github.com/Skylion007
2025-08-11 20:58:52 +00:00
cf0a0dcb0a Make user defined Triton kernels serializable for fx_graph_runnable (#160002)
Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002
Approved by: https://github.com/eellison
2025-08-11 20:54:33 +00:00
b149c7204c Revert "port distributed pipeline test files for Intel GPU (#159033)"
This reverts commit 76a0609b6bddb2bc40f1eb4ade12885023653d59.

Reverted https://github.com/pytorch/pytorch/pull/159033 on behalf of https://github.com/clee2000 due to broke test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/16890370216/job/47849586456) [HUD commit link](76a0609b6b) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/159033#issuecomment-3176833314))
2025-08-11 20:44:45 +00:00
09381f5dac Revert "[Graph Partition] Pass all OSS unit tests (#154667)"
This reverts commit ca7315c17162ea21b1ca5ba23f4bf6168766c7b9.

Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](ca7315c171) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))
2025-08-11 20:34:27 +00:00
9eedd2a20b [PGO] no counterfactual suggestions for dynamic allowlist (#160231)
Being more conservative with whitelist suggestions as we roll out suggestions; now we only suggest sources that were dynamic in previous runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160231
Approved by: https://github.com/bobrenjc93
2025-08-11 20:13:25 +00:00
c3dc8dc412 159965 is merged, no need to patch it in (#160275)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160275
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-08-11 19:55:04 +00:00
76a0609b6b port distributed pipeline test files for Intel GPU (#159033)
In this PR we will port all distributed pipeline test files.
We enable Intel GPU with the following methods while trying to keep the original code style:

1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get backend
5. enable XPU for some test paths
6. add TEST_MULTIACCELERATOR in common_utils for all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Daisy Deng <daisy.deng@intel.com>
2025-08-11 19:43:15 +00:00
c8205cb354 [autograd] match 0-dim gradients device type regardless of subclassness (#160165)
Not sure if there are some subclasses where outer.dim() == 0 but you wouldn't want to move it?

FIXES https://github.com/pytorch/pytorch/issues/160084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160165
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-08-11 17:57:32 +00:00
d25c4f954d [MPS] Type-promote tensor-iterator common dtype (#160334)
Otherwise, `torch.add(FloatTensor, IntTensor)` and `torch.add(FloatTensor, IntTensor, alpha=2)` were dispatched to different kernels

Fixes https://github.com/pytorch/pytorch/issues/160208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-08-11 17:53:56 +00:00
d0e2240f68 [triton_heuristics] Optimize the triton launcher in pt2 (#160000)
Summary:

(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)

We observed ~10us PT2-Triton launch overhead regression after pin update.

Before Triton pin-update:
 {F1980557238}

After Triton pin-update:
 {F1980557240}

The root cause is that https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
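To illustrate the difference between the two approaches, here is a minimal sketch in plain Python. It is not the Inductor code: `make_reordering_launcher`, `make_baked_launcher`, and `fake_kernel` are hypothetical names, and a real launcher takes more arguments (grid, stream, and so on). The first helper merges constexprs into the argument list on every call; the second generates the launcher source once with the constexpr values written into the call expression.

```python
def make_reordering_launcher(kernel, arg_names, constexprs):
    """Per-call approach: merge and reorder arguments on every launch."""
    runtime_names = [n for n in arg_names if n not in constexprs]

    def launcher(*runtime_args):
        merged = dict(zip(runtime_names, runtime_args))
        merged.update(constexprs)
        # This dict build and reorder runs on the hot path of each call.
        return kernel(*(merged[n] for n in arg_names))

    return launcher


def make_baked_launcher(kernel, arg_names, constexprs):
    """Codegen approach: write constexpr values into the launcher source once."""
    runtime_names = [n for n in arg_names if n not in constexprs]
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher({', '.join(runtime_names)}):\n"
        f"    return kernel({', '.join(call_args)})\n"
    )
    scope = {"kernel": kernel}
    exec(src, scope)  # compile the generated launcher a single time
    return scope["launcher"]


def fake_kernel(in_ptr, out_ptr, xnumel, XBLOCK):
    """Hypothetical kernel that just echoes its arguments."""
    return (in_ptr, out_ptr, xnumel, XBLOCK)


ARG_NAMES = ["in_ptr", "out_ptr", "xnumel", "XBLOCK"]
CONSTEXPRS = {"XBLOCK": 128}  # only known after autotuning in the real flow

slow = make_reordering_launcher(fake_kernel, ARG_NAMES, CONSTEXPRS)
fast = make_baked_launcher(fake_kernel, ARG_NAMES, CONSTEXPRS)
```

Both launchers pass the same arguments to the kernel; the generated one simply does no per-call bookkeeping.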

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), so there is no need to pass in constexprs to the generated launcher code.

The new launcher code needs to work in three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.893x
```

```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```

After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.985x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel

Co-authored-by: Xu Zhao <xzhao9@meta.com>
2025-08-11 17:22:40 +00:00
9ccd0f5e31 Fix unbacked symint and memory leak in inductor memory planning (#159839)
Summary:

In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known until they are computed at run time, so **allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints**.

So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it is available only when its live range begins.

Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its live range.

If a block's earliest available time is later than the start of a pool's live range, we cannot allocate it from the pool.
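The rule can be sketched with two small dataclasses. This is an illustrative model only, not the Inductor memory planner: the `Allocation` and `Pool` classes, their fields, and `can_allocate` are hypothetical, and live ranges are reduced to integer time steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Allocation:
    """Hypothetical allocation with a live range of [start, end]."""

    name: str
    start: int
    end: int
    has_unbacked_symint: bool = False

    @property
    def earliest_available(self) -> Optional[int]:
        # A size that depends on an unbacked symint is only known once the
        # allocation's live range begins; otherwise there is no restriction.
        return self.start if self.has_unbacked_symint else None


@dataclass
class Pool:
    """Hypothetical pool that must be allocated at `start`."""

    start: int
    end: int

    def can_allocate(self, alloc: Allocation) -> bool:
        earliest = alloc.earliest_available
        # The pool is created at `start`, so every block placed in it must
        # have a known size by then.
        return earliest is None or earliest <= self.start


pool = Pool(start=0, end=10)
static_buf = Allocation("buf0", start=3, end=5)
dynamic_buf = Allocation("buf1", start=4, end=6, has_unbacked_symint=True)
```

Here `static_buf` fits in the pool, while `dynamic_buf` does not, because its size is unknown at step 0. A pool whose live range starts at step 4 or later could hold it.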

We also fix a memory leak that's caused by allocating a tensor without wrapping it in RAIIAtenTensorHandle.

In the Python wrapper for JIT Inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper; it just returns the string to allocate. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns the string `RAIIAtenTensorHandle`.

```
AtenTensorHandle handle_name;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....);
```

This is bug-prone. **If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well**; otherwise you get memory leaks.
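One way to keep the two together is to emit them from a single helper. The sketch below is a hypothetical code generator, not the cpp_wrapper implementation: `emit_alloc_from_pool` and its parameters are made up for illustration, the argument list is passed through as an opaque string, and placing the output handle as the final `&handle` argument is an assumption.

```python
def emit_alloc_from_pool(lines, handle_name, tensor_name, alloc_args):
    """Append the raw allocation and its RAII wrapper as one unit.

    `alloc_args` is an opaque, caller-provided string holding the leading
    arguments of the call; it is not interpreted here.
    """
    lines.append(f"AtenTensorHandle {handle_name};")
    lines.append(
        "AOTI_TORCH_ERROR_CODE_CHECK("
        f"aoti_torch__alloc_from_pool({alloc_args}, &{handle_name}));"
    )
    # Emitting the owning wrapper right away means the raw handle can never
    # be written without something responsible for releasing it.
    lines.append(f"RAIIAtenTensorHandle {tensor_name}({handle_name});")
    return tensor_name


generated = []
name = emit_alloc_from_pool(generated, "tmp_handle_0", "buf0", "pool_args")
```

Because the helper always appends all three lines, a caller cannot emit the allocation without its owner.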

We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In the Python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool.

Test Plan:
```
 python test/inductor/test_memory_planning.py
```

Rollback Plan:

Differential Revision: D79603119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839
Approved by: https://github.com/jansel
2025-08-11 17:16:15 +00:00
ca7315c171 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

The same diff was run on two different days, and both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-11 16:25:12 +00:00
68a4b4b2e3 [codemod] Fix unreachable-break issue in caffe2/c10/cuda/CUDAFunctions.cpp +2 (#160257)
Summary:
LLVM has a warning `-Wunreachable-code-break` which identifies `break` statements that cannot be reached. These compromise readability, are misleading, and may identify bugs. This diff removes such statements.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan:
Sandcastle

Rollback Plan:

Differential Revision: D79835614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160257
Approved by: https://github.com/Skylion007
2025-08-11 16:09:24 +00:00
80cca83079 [inductor] Skip some AOTI UTs on Windows. (#160287)
Skip some AOTI UTs on Windows, because support there is not fully ready.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160287
Approved by: https://github.com/ezyang
2025-08-11 13:50:43 +00:00
515cb70367 [inductor] normalize_path_separator for test_different_file_paths_local_pgo (#160286)
`normalize_path_separator` for test_different_file_paths_local_pgo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160286
Approved by: https://github.com/ezyang
2025-08-11 13:50:18 +00:00
cyy
c184cb3852 [submodule] Bump fbgemm to latest (#158210)
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove option `USE_FAKELOWP` which is not used.
3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
2025-08-11 13:48:02 +00:00
2259dbed4e Update slow tests (#158222)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158222
Approved by: https://github.com/pytorchbot
2025-08-11 12:00:13 +00:00
05029ad1c3 [xla hash update] update the pinned xla hash (#160306)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160306
Approved by: https://github.com/pytorchbot
2025-08-11 11:28:49 +00:00
cyy
cf4964be68 Remove unnecessary CMake checks for glog (#158185)
With the update to CMake 3.27, some old scripts can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158185
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-08-11 10:14:47 +00:00
ecea81117b Fix clang builds by adding headers (#160252)
Clang compiler from llvm-14 fails to build full torch from source with the message
```
no template named 'unordered_map' in namespace 'std'
  std::unordered_map<std::string, HandlerFunc> handlers_{};
 ~~~~~^
```
A similar issue is reported at https://github.com/intel/llvm/issues/5264
The fix is to add the correct headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160252
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-08-11 09:03:14 +00:00
1c2cba17ea [FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119)
To better help users debug with FR, we want to add stack_id and print a map between stack_id and stack_trace (optional)

Screenshot:

<img width="1029" height="529" alt="image" src="https://github.com/user-attachments/assets/8404a1d3-cc33-4f5f-971b-29609ec316c1" />

<img width="1620" height="358" alt="image" src="https://github.com/user-attachments/assets/3dd29c8c-ff68-41a2-acfd-e770036cfeb1" />
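As a rough sketch of the idea, assuming each Flight Recorder entry carries a list of frame strings: the function below assigns one `stack_id` per distinct stack trace and returns the id-to-trace map for optional printing. The entry layout, the `frames` key, and `intern_stack_traces` are hypothetical and do not reflect the actual FR schema.

```python
def intern_stack_traces(entries):
    """Tag each entry with a stack_id and build a stack_id -> trace map."""
    stack_id_of = {}
    annotated = []
    for entry in entries:
        frames = tuple(entry["frames"])
        # Identical stack traces share one id; new traces get the next id.
        stack_id = stack_id_of.setdefault(frames, len(stack_id_of))
        annotated.append({**entry, "stack_id": stack_id})
    id_to_trace = {sid: list(frames) for frames, sid in stack_id_of.items()}
    return annotated, id_to_trace


entries = [
    {"op": "all_reduce", "frames": ["train.py:10", "model.py:42"]},
    {"op": "all_gather", "frames": ["train.py:12", "model.py:77"]},
    {"op": "all_reduce", "frames": ["train.py:10", "model.py:42"]},
]
annotated, id_to_trace = intern_stack_traces(entries)
```

Printing only the short `stack_id` per entry keeps the table compact, and the map can be shown separately when the full traces are needed.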

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2025-08-11 07:27:10 +00:00
ff0d56d035 [Inductor] [Triton] Enable Configuration warmup/rep iterations when benchmarking in inductor (#159982)
Summary:
When benchmarking on B200 Max Autotune, I discovered that the estimations from the autotune logs consistently produced a better ATEN result by > 20% on an example shape. Here is an example of the output:

```
Autotune Choices Stats:
{"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3081120103597641, "best_triton_pos": 1, "best_triton_time": 0.6589759886264801, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"}
AUTOTUNE mm(3840x1152, 1152x49136)
strides: [1, 3840], [49152, 1]
dtypes: torch.bfloat16, torch.bfloat16
  mm 0.3081 ms 100.0%
  triton_mm_16 0.6590 ms 46.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_17 0.6830 ms 45.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_13 0.7015 ms 43.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_9 0.8487 ms 36.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_11 0.8695 ms 35.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_10 0.8797 ms 35.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_18 0.9089 ms 33.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_14 0.9718 ms 31.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_15 1.0169 ms 30.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 2.8574 seconds and 0.1032 seconds precompiling for 20 choices
Removed 3483 outliers from 28645 samples
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.00s/it]
          (M, N, K)    pt2_matmul_maxautotune-latency    pt2_matmul_maxautotune-speedup    pt2_matmul_maxautotune-tflops
-------------------  --------------------------------  --------------------------------  -------------------------------
(3840, 49136, 1152)                 0.359392 (±8.27%)                                                            1209.61
            average                                                                                              1209.61
```

Based on my reading about B200 power usage, I believe this is due to the need for power-aware benchmarking, as a kernel may perform better in short bursts. This adds environment variables to expand autotuning iterations so we can get more consistent results between the estimate and the actual runtime. I did not update the default yet, even for B200, because I'm not sure how this is used in practice.

This is the new output:

```
Autotune Choices Stats:
{"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3848319947719574, "best_triton_pos": 1, "best_triton_time": 0.6287680268287659, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"}
AUTOTUNE mm(3840x1152, 1152x49136)
strides: [1, 3840], [49152, 1]
dtypes: torch.bfloat16, torch.bfloat16
  mm 0.3848 ms 100.0%
  triton_mm_16 0.6288 ms 61.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_13 0.6299 ms 61.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_17 0.6728 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_9 0.7189 ms 53.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_18 0.8566 ms 44.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_11 0.8693 ms 44.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_14 0.9298 ms 41.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_10 0.9524 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_15 1.0216 ms 37.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 3.9245 seconds and 0.0965 seconds precompiling for 20 choices
Removed 3537 outliers from 29530 samples
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.70s/it]
          (M, N, K)    pt2_matmul_maxautotune-latency    pt2_matmul_maxautotune-speedup    pt2_matmul_maxautotune-tflops
-------------------  --------------------------------  --------------------------------  -------------------------------
(3840, 49136, 1152)                 0.359328 (±9.71%)                                                            1209.82
            average                                                                                              1209.82
```

Test Plan:
`TORCH_AUTOTUNE_REP=1000 CUDA_VISIBLE_DEVICES=2 ENABLE_MMA_V5_ATT_PIPELINE=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run mode/opt  //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op gemm --iter $NUM_ITERS --input-loader /home/njriasan/parsed_shapes.json --only pt2_matmul_maxautotune`
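
The effect of raising the repetition count can be sketched with a small CPU-only timing harness. This is an illustration under stated assumptions, not Inductor's benchmarking code: `autotune_rep` and `median_ms` are hypothetical helpers, and reading `TORCH_AUTOTUNE_REP` as a plain positive iteration count is an assumption made for the sketch.

```python
import os
import statistics
import time


def autotune_rep(default: int = 100) -> int:
    """Return the repetition count from TORCH_AUTOTUNE_REP, or `default`.

    Assumption: the variable holds a positive integer iteration count.
    """
    raw = os.environ.get("TORCH_AUTOTUNE_REP", "").strip()
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


def median_ms(fn, rep: int) -> float:
    """Call `fn` `rep` times and return the median latency in milliseconds."""
    samples = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

With more repetitions the median is taken over a longer window, so a candidate that only looks fast in a short burst is less likely to win. A real GPU benchmark would time with device events rather than `time.perf_counter`.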

Rollback Plan:

Reviewed By: NikhilAPatel

Differential Revision: D79737929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159982
Approved by: https://github.com/NikhilAPatel
2025-08-11 05:27:51 +00:00
334b38ccc4 Fix typo in README.md (#160160)
The "Get the PyTorch Source" section is now located before the "Install Dependencies/Common" section, so "... using the “Get the PyTorch Source“ section below" should be "... using the “Get the PyTorch Source“ section above".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160160
Approved by: https://github.com/BoyuanFeng
2025-08-11 05:09:59 +00:00
dc0d18e023 [CUDA] Remove the unnecessary CUDA_GUARD (#160249)
`CUDA_GUARD` is unnecessary in `initDeviceStreamState`, because
the `initSingleStream` has already done it.

29712314dd/c10/cuda/CUDAStream.cpp (L202-L203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160249
Approved by: https://github.com/Skylion007
2025-08-11 05:08:05 +00:00
cyy
8ae4d2652f Tidy torch/csrc/jit/passes/onnx code (#160262)
Apply clang-tidy fixes to torch/csrc/jit/passes/onnx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160262
Approved by: https://github.com/justinchuby
2025-08-11 04:50:38 +00:00
8088cfa592 Add type assert for tensor_meta, based on real bug in autoparallel. (#157927)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157927
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/wconstab
2025-08-11 04:22:02 +00:00
d8cb3db533 Add unsigned support to IValue (#160102)
- Moved repeated logic of saving int64/uint64 into a polymorphic container into `THPUtils_unpackInteger`
- Added `TestPythonDispatch.test_dispatch_uint64` regression test

Fixes https://github.com/pytorch/pytorch/issues/159168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160102
Approved by: https://github.com/ezyang
2025-08-11 03:57:18 +00:00
e7152ff8a6 [inductor] fix some windows inductor UTs (#160292)
This PR is the UT part of https://github.com/pytorch/pytorch/pull/160161. Per @malfet's comment (https://github.com/pytorch/pytorch/pull/160161#pullrequestreview-3103812178), this PR does not land the turn-on change; it only lands the UT part.

changes:
1. Fixed `test_invalid_artifact_flag_error_msg`.
2. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
3. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160292
Approved by: https://github.com/malfet
2025-08-11 02:55:37 +00:00
842cc77ab9 [MPS] Extend addmm to integral types (#160270)
By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they could be int64, int32 or float.
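
For reference, `addmm` computes `beta * input + alpha * (mat1 @ mat2)`. The pure-Python sketch below (a stand-in on nested lists, not the Metal kernel) shows that with integer `alpha` and `beta` the whole computation stays in integer arithmetic:

```python
def addmm(inp, mat1, mat2, *, beta=1, alpha=1):
    """Reference addmm on nested lists: beta * inp + alpha * (mat1 @ mat2)."""
    rows, inner, cols = len(mat1), len(mat2), len(mat2[0])
    return [
        [
            beta * inp[i][j]
            + alpha * sum(mat1[i][k] * mat2[k][j] for k in range(inner))
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

For example, `addmm([[1, 1], [1, 1]], [[1, 2], [3, 4]], [[5, 6], [7, 8]], beta=2, alpha=3)` scales the matrix product by 3 and adds twice the input, with every element remaining an `int`.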

Unified all MM flavors instantiations thru `INSTANTIATE_MM_OPS` and tested that `addmm` metal kernel works as expected for floating types as well by testing it via
```
 PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```

Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
2025-08-11 00:54:17 +00:00
b602ea9cab Revert "[inductor] turn on windows inductor UTs (#160161)"
This reverts commit 4416433c7c625127b7f975c92f8ec98ea4c67fd3.

Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/xuhancn due to auto merged with two related issue ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172982125))
2025-08-11 00:04:25 +00:00
4416433c7c [inductor] turn on windows inductor UTs (#160161)
With this PR, we can turn on the inductor UTs on Windows CPU.

changes:
1. Turn on inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the run would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
2025-08-10 23:18:35 +00:00
05c19d1ace [Inductor] Add back the revert part (#160054)
Add back the reverted code (https://github.com/pytorch/pytorch/pull/159809), as we've figured out the actual root cause of the internal test failures. More details in the internal diff.
Rollback Plan:

Differential Revision: D79776691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160054
Approved by: https://github.com/blaine-rister
2025-08-10 19:20:30 +00:00
d6786741a7 [inductor] slow test some Windows UTs. (#160267)
After Windows inductor UTs were enabled in https://github.com/pytorch/pytorch/pull/160161/,
CI on the main branch started timing out. Let's move some UTs to the slow tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160267
Approved by: https://github.com/ezyang
2025-08-10 18:35:42 +00:00
7ae0629d64 Revert "[inductor] turn on windows inductor UTs (#160161)"
This reverts commit f0980fc0bbd656d6c02d23ad97e945353b314f35.

Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/clee2000 due to broke some inductor tests on windows inductor\test_codecache.py::TestStandaloneCompile::test_different_process [GH job link](https://github.com/pytorch/pytorch/actions/runs/16853706010/job/47748778757) [HUD commit link](f0980fc0bb).  note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172784292))
2025-08-10 17:33:19 +00:00
0e3e377bd5 [inductor] fix CompiledArtifact.load path on Windows. (#160268)
fix CompiledArtifact.load path on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160268
Approved by: https://github.com/ezyang
2025-08-10 14:22:52 +00:00
a84b60c0c4 [MPS] Sparse coalesce more dtypes to match cpu (#160254)
More dtypes to match the cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254
Approved by: https://github.com/malfet
2025-08-10 12:25:18 +00:00
3ac86e728d Add Alban and Piotr to list of maintainers (#160187)
Add Alban and Piotr to list of maintainers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160187
Approved by: https://github.com/albanD
2025-08-10 12:00:16 +00:00
c9671dc865 Delete Python reference implementation from torchdim, as it is untested (#160115)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160115
Approved by: https://github.com/albanD
2025-08-10 11:21:33 +00:00
af10f1f86c Fix requires_cuda to requires_cuda_and_triton (#160222)
Fixes #159399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222
Approved by: https://github.com/janeyx99
2025-08-10 07:05:52 +00:00
5dddcd5b07 Correctly copy self.module_stack in ModuleStackTracer (#159956)
There is a bigger cluster of issues which this does not completely fix, but I think this is a matter of good hygiene, especially because we immediately mutate the dict after assigning it.
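
The hazard being avoided can be shown with plain dicts. This is a minimal sketch, not the `ModuleStackTracer` code; `record_aliased` and `record_copied` are hypothetical names used only for illustration.

```python
def record_aliased(module_stack):
    """Store the caller's dict itself: later mutations leak into the record."""
    return module_stack


def record_copied(module_stack):
    """Store a shallow copy: the record is frozen at the time of the call."""
    return dict(module_stack)


stack = {"layer1": "Linear"}
aliased = record_aliased(stack)
copied = record_copied(stack)

# The tracer keeps mutating its stack right after recording it.
stack["layer2"] = "ReLU"
```

After the mutation, `aliased` also contains `layer2`, while `copied` still reflects the stack as it was when recorded.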

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159956
Approved by: https://github.com/pianpwk
2025-08-10 03:33:59 +00:00
d3d359dbaf Revert "Fix get_free_symbol_uses for several nodes. (#160134)"
This reverts commit db78943a1ca13a32a3d6045eb15e2b719ee13a2f.

Reverted https://github.com/pytorch/pytorch/pull/160134 on behalf of https://github.com/malfet due to No, those are not pre-existing, see df55ec7d4b/1 ([comment](https://github.com/pytorch/pytorch/pull/160134#issuecomment-3172314322))
2025-08-10 02:37:40 +00:00
df55ec7d4b [OpInfo][BE] Better inputs for addmm (#160234)
Right now alpha and beta are both less than zero, which makes them useless for all addmm samples for integral types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234
Approved by: https://github.com/Skylion007
ghstack dependencies: #160228
2025-08-10 01:26:48 +00:00
f0980fc0bb [inductor] turn on windows inductor UTs (#160161)
With this PR, we can turn on the inductor UTs on Windows CPU.

changes:
1. Turn on inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the run would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
2025-08-09 21:06:00 +00:00
db78943a1c Fix get_free_symbol_uses for several nodes. (#160134)
`get_free_symbol_uses` is used to determine which unbacked symbols a given node uses.
Without a correct definition of `get_free_symbol_uses`, we get:
1. Elimination of some nodes because no users are detected (see the added unit test).
2. Incorrect topological sort.

Fix `get_free_symbol_uses` for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case:
when the layout is NonOwningLayout, we need to access the base layout of the actual view op and
detect the symbols in it, because codegen for the ComputedBuffer uses those symbols.
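
Why an under-reported symbol use leads to node elimination can be illustrated with a toy dead-code pass. This is a simplified model under stated assumptions, not Inductor's scheduler: nodes are hypothetical `(name, defines, uses)` tuples and `eliminate_dead` is a made-up helper.

```python
def eliminate_dead(nodes, outputs):
    """Keep a node if it is an output or defines a symbol a kept node uses.

    nodes: list of (name, defines, uses) in execution order, where
    `defines` and `uses` are sets of symbol names.
    """
    needed = set()
    kept = []
    for name, defines, uses in reversed(nodes):
        if name in outputs or defines & needed:
            kept.append(name)
            needed |= uses
    return list(reversed(kept))


# Correct reporting: the consumer declares that it uses u0.
correct = [("make_u0", {"u0"}, set()), ("consumer", {"buf1"}, {"u0"})]
# Buggy reporting: the consumer forgets to report u0.
buggy = [("make_u0", {"u0"}, set()), ("consumer", {"buf1"}, set())]
```

With `correct`, both nodes survive. With `buggy`, `make_u0` has no detected users and is dropped even though the generated code for `consumer` still refers to `u0`.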

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160134
Approved by: https://github.com/bobrenjc93
2025-08-09 18:15:46 +00:00
29712314dd [fx][pass] Support converting a float32 tensor to a scalar in FX trace. (#158216)
Fixes https://github.com/pytorch/pytorch/issues/158083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158216
Approved by: https://github.com/laithsakka
2025-08-09 15:13:13 +00:00
cyy
01f66d08d9 Remove outdated CMAKE_CUDA_COMPILER_VERSION branch (#160075)
Remove the condition `if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0)` in cmake/Codegen.cmake, because we now default to CUDA >= 12.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160075
Approved by: https://github.com/Skylion007
2025-08-09 14:23:17 +00:00
2f4c222617 Revert "Make user defined Triton kernels serializable for fx_graph_runnable (#160002)"
This reverts commit 4183d4ff3dcc1d87400326a9a7998c3f9e966f60.

Reverted https://github.com/pytorch/pytorch/pull/160002 on behalf of https://github.com/albanD due to Breaks inductor tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/160002#issuecomment-3170855866))
2025-08-09 14:01:58 +00:00
8047421fbb [Linter] Expanding the scope of detecting device-bias code. (#159949)
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
    if HAS_GPU:
        run_tests()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-09 09:41:16 +00:00
4183d4ff3d Make user defined Triton kernels serializable for fx_graph_runnable (#160002)
Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002
Approved by: https://github.com/eellison
2025-08-09 09:26:05 +00:00
fb887c3bb5 Add Sherlock and Zhengxu as codeowner for schema.py (#160233)
Test Plan:
CI

Rollback Plan:

Differential Revision: D79933462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160233
Approved by: https://github.com/zhxchen17
2025-08-09 04:44:12 +00:00
bcf23ecc47 [vllm hash update] update the pinned vllm hash (#160235)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160235
Approved by: https://github.com/pytorchbot
2025-08-09 04:17:32 +00:00
303c614f3d [dynamo] Be consistent with UserMethodVariable source (#160155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160155
Approved by: https://github.com/StrongerXi
2025-08-09 04:16:14 +00:00
0d88593dd8 [audio hash update] update the pinned audio hash (#160153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160153
Approved by: https://github.com/pytorchbot
2025-08-09 04:01:31 +00:00
5ed4f91779 [dynamo] support itertools.permutations (#159694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159694
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #159693
2025-08-09 03:01:58 +00:00
e07c52b2c0 [dynamo] Improve support for itertools.product (#159693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159693
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
2025-08-09 03:01:58 +00:00
cyy
10e3514c96 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-08-09 02:21:22 +00:00
11a3565f18 [Torch Native] Add test for packaging weight (#158750)
Add a test that requires weights to be packaged for torch native

For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.

Once weight deduping is added, we should be able to set this config to False.

```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter_weights
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
2025-08-09 01:04:21 +00:00
e96c7c4bb0 [dcp][hf] Improve HF consolidation algorithm (#158648)
Previously, a series of if-else cases keyed on the sharding strategy decided how to save the tensor, with different logic for each strategy. This can be consolidated into one function that handles all cases by finding the maximum number of contiguous bytes that can be written.
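
One way to find the longest contiguous run for a shard of a row-major tensor is sketched below. It is an assumption-laden illustration, not the algorithm from this PR: `max_contiguous_elems` is a hypothetical helper that counts elements (multiply by the element size to get bytes).

```python
def max_contiguous_elems(full_shape, shard_shape):
    """Length of the longest contiguous run a shard occupies in the full tensor.

    Walk dimensions from innermost to outermost. A dimension extends the run
    only while every dimension inside it is covered in full by the shard.
    """
    run = 1
    for full, shard in zip(reversed(full_shape), reversed(shard_shape)):
        run *= shard
        if shard != full:
            break
    return run
```

A row-wise shard of shape `(2, 6)` in a `(4, 6)` tensor can be written as one run of 12 elements, while a column-wise shard of shape `(4, 3)` only yields runs of 3 elements per row. The same function covers both cases without branching on the sharding strategy.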

Differential Revision: [D78489438](https://our.internmc.facebook.com/intern/diff/D78489438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158648
Approved by: https://github.com/saumishr
2025-08-09 00:11:22 +00:00
9b803cdbe2 [BE] Remove more optim entries from docs coverage ignore list (#160194)
This PR privatizes ReduceLROnPlateau.is_better -> ReduceLROnPlateau._is_better because that API was never meant to be public. A GitHub search also shows that the API is not commonly used: https://github.com/search?q=.is_better%28&type=code&p=2

If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`.
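
If you only need the comparison itself, it is small enough to reproduce. The standalone function below reflects my understanding of the scheduler's `mode` / `threshold_mode` semantics and is an assumption, not a copy of the PyTorch source; `is_better` here is a hypothetical free function.

```python
def is_better(a, best, *, mode="min", threshold_mode="rel", threshold=1e-4):
    """Return True if metric `a` improves on `best` by more than the threshold."""
    if mode not in ("min", "max") or threshold_mode not in ("rel", "abs"):
        raise ValueError("unknown mode or threshold_mode")
    if mode == "min":
        # Lower is better: require a drop below the thresholded best.
        bound = best * (1.0 - threshold) if threshold_mode == "rel" else best - threshold
        return a < bound
    # Higher is better: require a rise above the thresholded best.
    bound = best * (1.0 + threshold) if threshold_mode == "rel" else best + threshold
    return a > bound
```

Keeping a local copy like this avoids depending on a private method whose name or signature may change again.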

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-08-09 00:09:45 +00:00
8c41cb800a [MPS][BE] Combine all pre-MacOS14 xfail lists (#160228)
It does not matter whether a test started to fail after 13.1 or 13.3; what matters is
that it still fails on the latest MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160228
Approved by: https://github.com/dcci
2025-08-09 00:00:46 +00:00
731ee31f7b [TorchScript, PT2] Add torch._check compatibility support (#159988)
Summary:
Add support for torch._check() in TorchScript jit.script frontend.

* It will be special cased to behave like torch._assert, turned into an if + raise exception.
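
The "if + raise" lowering can be pictured with a pure-Python stand-in. This is a sketch of the shape of the transformation, not the TorchScript frontend code; `check` is a hypothetical function and its default message is an assumption.

```python
def check(cond, message=None):
    """Stand-in for the lowered form: a failed condition raises an exception."""
    if not cond:
        # `message` is a zero-argument callable so the string is only built
        # when the check actually fails.
        text = message() if callable(message) else "Expected cond to be True"
        raise RuntimeError(text)


def first_row(rows):
    check(len(rows) > 0, lambda: "rows must not be empty")
    return rows[0]
```

Calling `first_row([])` raises `RuntimeError("rows must not be empty")`, while `first_row([[1, 2]])` returns the first row unchanged.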

Test Plan:
Unit tests

Rollback Plan:

Differential Revision: D79744604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159988
Approved by: https://github.com/davidberard98
2025-08-08 23:14:13 +00:00
566c6d52ef [ONNX] Fix the export of the model having none as output (#160200)
Fixes #160150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-08-08 23:09:34 +00:00
4e2ddb5db6 [Inductor][CUTLASS] Copy cutlass_mock_imports directory (#159724)
Pip wheels of PyTorch nightly and 2.8 release candidates do not contain `cutlass_mock_imports`.

This is the path to the source code:
```
root@8120d02fd9c5:$ tree ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/
./torch/_inductor/codegen/cuda/cutlass_lib_extensions/
├── cutlass_mock_imports
│   ├── cuda
│   │   ├── __init__.py
│   │   ├── cuda.py
│   │   └── cudart.py
│   ├── pydot
│   │   └── __init__.py
│   └── scipy
│       ├── __init__.py
│       └── special.py
├── evt_extensions.py
└── gemm_operation_extensions.py

5 directories, 8 files
```

And this is what the installed wheel has:
```
root@8120d02fd9c5:$ tree /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/
/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/
├── __init__.py
├── evt_extensions.py
└── gemm_operation_extensions.py

1 directory, 3 files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159724
Approved by: https://github.com/henrylhtsang
2025-08-08 22:56:05 +00:00
9e07673deb Fix test_fsdp_ep.py due to _MeshEnv API change (#158695)
#132339 changed the parent/child mesh related APIs in _MeshEnv. UT TestFSDPWithEP.test_e2e still uses the old APIs and fails:
```
File "/home/kanya/pytorch/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 77, in test_e2e
    mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, ("dp",))
AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh'

To execute this test, run the following from the base repo dir:
    python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0. Did you mean: 'create_sub_mesh'?
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158695
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2025-08-08 22:36:47 +00:00
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
334ecbd4ff Add torchao to install_inductor_benchmark_deps cleanup stage (#160191)
It looks like `torchao` was missed from the cleanup during torchbench setup.

Fixes #160188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160191
Approved by: https://github.com/huydhn
2025-08-08 22:18:41 +00:00
206c1eef65 Revert "[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655)"
This reverts commit 2ee22e435131369a7e4f8cc4732579acc29a941b.

Reverted https://github.com/pytorch/pytorch/pull/159655 on behalf of https://github.com/clee2000 due to broke dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed [GH job link](https://github.com/pytorch/pytorch/actions/runs/16839294394/job/47711078667) [HUD commit link](2ee22e4351).  Probably a landrace since it did run on the PR ([comment](https://github.com/pytorch/pytorch/pull/159655#issuecomment-3169400889))
2025-08-08 22:04:22 +00:00
28ccc9e724 [MPS] Extend index_put to complex types (#160159)
And delete confusing supported types check.
Move all pseudo atomic (but eventually consistent) ops to `c10/metal/atomic.h` header

Fixes https://github.com/pytorch/pytorch/issues/160034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007
2025-08-08 21:54:30 +00:00
2247aa6d1d Documents tuning NVLink performance on H100/H200 (#159792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792
Approved by: https://github.com/ngimel
2025-08-08 20:28:24 +00:00
1febab2a89 Do not treat ReinterpretView as a realized node (#159920)
Summary:
Do not treat ReinterpretView as a realized node

Function [gather_origins](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L888) calls is_realized_node to decide whether an FX node should be included in the origins of an IR node. ReinterpretView is considered a realized node, so it is not included in the origins, which leads to an incomplete graph. For example:

```
@torchdynamo.optimize("inductor")
def fn(input_data, weight):
    normalized_input = input_data * weight.unsqueeze(0)
    return normalized_input
input_data = torch.randn(4272, 192, requires_grad=True).to(device)
weight = torch.randn(192, requires_grad=True).to(device)
fn(input_data, weight)
```

The original FX graph returned by [get_kernel_metadata](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L723) is the following:
%primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2]
%primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1]
%mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {})
return %mul
The unsqueeze op is missing.

With this DIFF, the new FX graph is the following:
%primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2]
%primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1]
%unsqueeze : Tensor "f32[1, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.unsqueeze.default](args = (%primals_1, 0), kwargs = {})
%mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {})
return %mul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159920
Approved by: https://github.com/mlazos
2025-08-08 20:13:35 +00:00
2ee22e4351 [pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655)
This change logs the stack trace of the code being compiled by Dynamo, improving visibility into what is compiled. It adds a stack_trace field to compilation metrics. This helps with debugging and analysis of Dynamo compilation behavior.
 Ref [D79287964](https://www.internalfb.com/diff/D79287964)

Test Plan:
$ python -m test_utils
Internal: ref [D79372519](https://www.internalfb.com/diff/D79372519)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159655
Approved by: https://github.com/c00w
2025-08-08 19:53:47 +00:00
c86040a8e6 [torch.export] Fix test_export_api_with_dynamic_shapes (#160164)
Summary: Update test KJT's dynamic_shapes to match the newly exported fields.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:test_export -- --exact 'caffe2/test:test_export - test_export_api_with_dynamic_shapes_cpp_runtime_nonstrict (caffe2.test.export.test_nativert.NativeRTTestExport)'
File changed: fbcode//caffe2/test/export/test_export.py
Buck UI:
https://www.internalfb.com/buck2/8247eaf8-eaf9-4876-95cb-7b4263d15ef2
Test UI:
https://www.internalfb.com/intern/testinfra/testrun/2533275093345198
Network: Up: 100KiB  Down: 0B  (reSessionID-72a2579f-df3f-4262-9aa3-de0db9687
Executing actions. Remaining 0/2
Command: test.
Time elapsed: 2:20.5s
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: malaybag

Differential Revision: D79862872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160164
Approved by: https://github.com/angelayi, https://github.com/ezyang
2025-08-08 19:45:30 +00:00
72009ec6be [replicate][be] improved readability and cleaned up remaining DDP code (#160133)
**Summary**
As much of the ReplicateState functionality is copied from FSDPState, I fixed the remaining comments that incorrectly referred to FSDP instead of Replicate. In addition, instead of labeling modules FSDPModule or FSDPLinear, I have changed it so that it now uses Replicate____. Finally, I have removed some leftover code from the DDP implementation. I have included test cases to verify correctness.

**Test Case**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160133
Approved by: https://github.com/mori360
ghstack dependencies: #160128
2025-08-08 19:42:23 +00:00
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK Gemms and SDPA works.

- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default
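
The validity check described in the last two bullets can be pictured as follows. This is a hypothetical sketch of the fallback rule only: `resolve_backend`, its parameters, and the backend names are illustrative and are not the actual ATen getters.

```python
def resolve_backend(requested, *, built_with_ck, arch_supports_ck, default="default"):
    """Return the backend to use, falling back when CK cannot be honored.

    CK is allowed only if the build includes it AND the running
    architecture supports it; otherwise the non-CK default is returned.
    """
    if requested == "ck" and not (built_with_ck and arch_supports_ck):
        return default
    return requested
```

Putting the check in the getter means a backend selected through an environment variable is validated at the point of use instead of failing later inside a kernel call.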

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
da1f608ca3 Add UT for torch.accelerator memory-related API (#155200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200
Approved by: https://github.com/albanD
ghstack dependencies: #138222, #152932
2025-08-08 17:41:22 +00:00
84f7e88aef Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be placed under torch.accelerator:
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats
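
As a rough illustration of how device-agnostic code might consume these functions, the sketch below reads four of the counters from any object that exposes them. The helper `memory_snapshot` and the stand-in `fake_backend` are hypothetical; the sketch assumes each function can be called without arguments. On a build that contains this change, one would presumably pass `torch.accelerator` itself.

```python
from types import SimpleNamespace


def memory_snapshot(backend) -> dict:
    """Read four memory counters from any namespace exposing the unified API."""
    return {
        "allocated": backend.memory_allocated(),
        "reserved": backend.memory_reserved(),
        "max_allocated": backend.max_memory_allocated(),
        "max_reserved": backend.max_memory_reserved(),
    }


# Hypothetical stand-in so the sketch runs without an accelerator;
# the byte counts are made up for illustration.
fake_backend = SimpleNamespace(
    memory_allocated=lambda: 1024,
    memory_reserved=lambda: 2048,
    max_memory_allocated=lambda: 4096,
    max_memory_reserved=lambda: 8192,
)

snapshot = memory_snapshot(fake_backend)
print(snapshot)
```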

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-08 17:41:22 +00:00
d7114f05b1 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace, which leads to [a lot of if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
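
The call path above can be modeled in a few lines of Python. This is only a sketch of the pattern, assuming a registry keyed by device type; the real `DeviceAllocator` is a C++ class, and `FakeCUDAAllocator`, `register_device_allocator`, and `get_device_allocator` are hypothetical names introduced here for illustration.

```python
from abc import ABC, abstractmethod


class DeviceAllocator(ABC):
    """Illustrative base class that every backend allocator inherits from."""

    @abstractmethod
    def empty_cache(self) -> None:
        ...


class FakeCUDAAllocator(DeviceAllocator):
    """Hypothetical backend allocator that tracks a count of cached blocks."""

    def __init__(self) -> None:
        self.cached_blocks = 3

    def empty_cache(self) -> None:
        self.cached_blocks = 0


# Registry mapping a device type to its allocator (hypothetical).
_ALLOCATORS: dict[str, DeviceAllocator] = {}


def register_device_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _ALLOCATORS[device_type]


def empty_cache(device_type: str) -> None:
    """Unified entry point: look up the allocator, then dispatch to it."""
    get_device_allocator(device_type).empty_cache()


cuda_allocator = FakeCUDAAllocator()
register_device_allocator("cuda", cuda_allocator)
empty_cache("cuda")
print(cuda_allocator.cached_blocks)
```

Because the entry point only depends on the base-class interface, adding another backend means registering one more subclass rather than adding another if-else branch at every call site.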

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-08 17:41:10 +00:00
c5ec5458a5 Don't build nccl when distributed is disabled (#160086)
Distributed doesn't build on recent compilers, so I have to disable it, but the build still fails because nccl is built even when distributed is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160086
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-08-08 17:19:16 +00:00
86eb65f7f0 [MPS] Move max_pool2d to Metal for stride != 1 (#157876)
This PR updates `max_pool2d` to use a Metal kernel instead of the old MPS graph impl. However, when the `stride` argument is 1 in all dimensions, the old implementation gives significantly better performance, so we fall back to it in that case. Below is a performance comparison of `max_pool2d` before and after this PR, obtained from this script: 2f02f2bf7a/max_pool_mps/perf.py
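
The dispatch rule described above can be sketched in Python as follows. The helper names `_pair` and `uses_metal_kernel` are hypothetical, and the actual check lives in the native MPS code. The sketch assumes the check applies to the effective stride, which in `max_pool2d` defaults to `kernel_size` when `stride` is not given.

```python
def _pair(value):
    """Expand an int to a 2-tuple; pass tuples through (hypothetical helper)."""
    return (value, value) if isinstance(value, int) else tuple(value)


def uses_metal_kernel(kernel_size, stride=None) -> bool:
    """Return True when the Metal kernel path would be taken.

    Assumption: the old MPS graph path is kept only when the effective
    stride is 1 in every dimension.
    """
    effective_stride = _pair(kernel_size if stride is None else stride)
    return any(s != 1 for s in effective_stride)


print(uses_metal_kernel(5))           # stride defaults to kernel_size
print(uses_metal_kernel(5, 1))        # stride 1 in all dimensions
print(uses_metal_kernel(4, (1, 4)))   # stride 1 in only one dimension
```

Under this reading, a mixed stride such as `(1, 4)` still takes the new path, since only a stride of 1 in all dimensions triggers the fallback.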

<details><summary>Click to expand</summary>

case | before PR | after PR | speedup |   | case info
-- | -- | -- | -- | -- | --
0 | 0.014264 | 0.004473 | 3.188911245 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': True}
1 | 0.010752 | 0.00421 | 2.55391924 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': False}
2 | 0.020777 | 0.006123 | 3.393271272 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': True}
3 | 0.011065 | 0.005759 | 1.921340511 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': False}
4 | 0.01452 | 0.007829 | 1.854642994 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': True}
5 | 0.009258 | 0.007075 | 1.308551237 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': False}
6 | 0.188137 | 0.168688 | 1.115295694 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
7 | 0.161362 | 0.154746 | 1.042753932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
8 | 0.182883 | 0.16945 | 1.079274122 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
9 | 0.156875 | 0.163346 | 0.9603847049 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
10 | 0.193433 | 0.167396 | 1.155541351 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
11 | 0.158967 | 0.151246 | 1.051049284 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
12 | 0.931071 | 0.932883 | 0.9980576342 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
13 | 0.324496 | 0.3252 | 0.9978351784 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
14 | 0.944071 | 0.936246 | 1.008357846 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
15 | 0.322171 | 0.314854 | 1.023239343 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
16 | 0.894158 | 0.886408 | 1.008743152 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
17 | 0.309338 | 0.304146 | 1.017070749 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
18 | 0.606 | 0.260546 | 2.325884873 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
19 | 0.30445 | 0.231054 | 1.317657344 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
20 | 0.474708 | 0.261925 | 1.812381407 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
21 | 0.23175 | 0.231883 | 0.9994264349 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
22 | 0.434475 | 0.266246 | 1.631855502 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
23 | 0.236942 | 0.231792 | 1.022218196 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
24 | 0.202396 | 0.174888 | 1.157289237 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
25 | 0.160679 | 0.158246 | 1.015374796 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
26 | 0.200354 | 0.184133 | 1.088093932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
27 | 0.160779 | 0.160679 | 1.000622359 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
28 | 0.199175 | 0.178625 | 1.115045486 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
29 | 0.159458 | 0.160883 | 0.9911426316 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
30 | 0.199021 | 0.165329 | 1.203787599 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
31 | 0.156337 | 0.158213 | 0.9881425673 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
32 | 0.180146 | 0.174483 | 1.032455884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
33 | 0.156988 | 0.158167 | 0.9925458534 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
34 | 0.182133 | 0.176521 | 1.031792251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
35 | 0.169042 | 0.156483 | 1.080257919 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
36 | 1.767821 | 1.766254 | 1.000887188 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
37 | 1.059346 | 1.058775 | 1.000539302 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
38 | 1.85755 | 1.859429 | 0.9989894747 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
39 | 1.100417 | 1.097683 | 1.002490701 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
40 | 1.843167 | 1.847558 | 0.9976233493 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
41 | 1.090142 | 1.093163 | 0.9972364597 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
42 | 0.480867 | 0.251733 | 1.910226311 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
43 | 0.319246 | 0.236479 | 1.349997251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
44 | 0.49315 | 0.256408 | 1.923301925 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
45 | 0.316746 | 0.227854 | 1.390127011 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
46 | 0.4912 | 0.257762 | 1.905633879 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
47 | 0.324771 | 0.229371 | 1.41592006 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
48 | 0.152904 | 0.095079 | 1.608178462 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
49 | 0.102963 | 0.089217 | 1.154073775 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
50 | 0.155158 | 0.095429 | 1.625899884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
51 | 0.104338 | 0.089979 | 1.15958168 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
52 | 0.153121 | 0.096429 | 1.587914424 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
53 | 0.103642 | 0.090254 | 1.148336916 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
54 | 0.191071 | 0.165125 | 1.157129447 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
55 | 0.153971 | 0.149021 | 1.033216795 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
56 | 0.193192 | 0.166892 | 1.157586942 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
57 | 0.156617 | 0.15215 | 1.029359185 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
58 | 0.178033 | 0.167308 | 1.06410333 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
59 | 0.157425 | 0.164404 | 0.9575496947 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
60 | 1.757638 | 1.750896 | 1.0038506 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
61 | 1.048471 | 1.047967 | 1.000480931 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
62 | 1.790708 | 1.789767 | 1.000525767 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
63 | 1.054575 | 1.054796 | 0.9997904808 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
64 | 1.785837 | 1.784192 | 1.000921986 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
65 | 1.054713 | 1.054492 | 1.00020958 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
66 | 0.478267 | 0.261017 | 1.832321266 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
67 | 0.32005 | 0.226654 | 1.412064204 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
68 | 0.484008 | 0.254721 | 1.900149575 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
69 | 0.321 | 0.218842 | 1.466811672 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
70 | 0.482087 | 0.248771 | 1.937874591 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
71 | 0.316558 | 0.230533 | 1.373156988 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
72 | 0.137842 | 0.085088 | 1.619993419 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
73 | 0.100671 | 0.0769 | 1.309115735 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
74 | 0.148321 | 0.086967 | 1.705485989 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
75 | 0.101392 | 0.075454 | 1.343759112 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
76 | 0.150208 | 0.083742 | 1.793699697 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
77 | 0.099587 | 0.075825 | 1.313379492 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
78 | 0.622546 | 0.602729 | 1.03287879 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
79 | 0.531696 | 0.5067 | 1.049330965 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
80 | 0.626646 | 0.617038 | 1.015571164 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
81 | 0.530354 | 0.525367 | 1.009492412 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
82 | 0.633933 | 0.577775 | 1.097197006 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
83 | 0.533067 | 0.526954 | 1.011600633 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
84 | 3.372867 | 3.386412 | 0.9960001914 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
85 | 1.155975 | 1.156604 | 0.9994561665 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
86 | 3.401921 | 3.39755 | 1.001286515 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
87 | 1.202829 | 1.192538 | 1.008629494 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
88 | 3.23675 | 3.220238 | 1.005127571 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
89 | 1.077067 | 1.085613 | 0.9921279498 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
90 | 1.572925 | 0.925625 | 1.699311276 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
91 | 0.791204 | 0.793454 | 0.9971642969 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
92 | 1.572742 | 0.922729 | 1.704446268 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
93 | 0.784292 | 0.788871 | 0.9941955022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
94 | 1.526546 | 0.925708 | 1.649057802 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
95 | 0.769321 | 0.787675 | 0.9766985114 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
96 | 0.736033 | 0.612808 | 1.201082558 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
97 | 0.574625 | 0.530925 | 1.082309177 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
98 | 0.722021 | 0.614488 | 1.174996094 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
99 | 0.563171 | 0.533721 | 1.055178642 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
100 | 0.735725 | 0.613992 | 1.198264798 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
101 | 0.583487 | 0.532513 | 1.095723485 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
102 | 0.656383 | 0.575313 | 1.140914598 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
103 | 0.559796 | 0.509079 | 1.099625009 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
104 | 0.662046 | 0.572362 | 1.156691045 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
105 | 0.552633 | 0.508671 | 1.086425214 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
106 | 0.634108 | 0.574629 | 1.103508525 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
107 | 0.534013 | 0.510996 | 1.045043405 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
108 | 7.056642 | 7.066717 | 0.9985743026 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
109 | 4.144275 | 4.142658 | 1.000390329 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
110 | 7.172683 | 7.189867 | 0.9976099697 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
111 | 4.162538 | 4.158875 | 1.000880767 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
112 | 7.194233 | 7.181837 | 1.001726021 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
113 | 4.294083 | 4.196062 | 1.023360236 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
114 | 1.875692 | 0.891071 | 2.104986022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
115 | 1.097479 | 0.781175 | 1.404907991 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
116 | 1.8883 | 0.89015 | 2.121327866 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
117 | 1.101329 | 0.778542 | 1.414604479 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
118 | 1.872833 | 0.893654 | 2.095702587 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
119 | 1.096712 | 0.784579 | 1.397835017 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
120 | 0.513029 | 0.374417 | 1.370207549 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
121 | 0.349546 | 0.305763 | 1.143192603 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
122 | 0.518929 | 0.377487 | 1.374693698 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
123 | 0.364662 | 0.3145 | 1.159497615 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
124 | 0.521275 | 0.375242 | 1.389170189 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
125 | 0.367488 | 0.308354 | 1.191773092 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
126 | 0.652342 | 0.569308 | 1.145850752 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
127 | 0.555696 | 0.506892 | 1.096280865 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
128 | 0.654333 | 0.570367 | 1.147213987 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
129 | 0.548925 | 0.505825 | 1.085207335 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
130 | 0.655908 | 0.571904 | 1.146884792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
131 | 0.560808 | 0.508238 | 1.103435792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
132 | 6.949462 | 6.949112 | 1.000050366 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
133 | 4.072913 | 4.065013 | 1.001943413 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
134 | 7.200896 | 7.197792 | 1.000431243 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
135 | 4.291367 | 4.218538 | 1.017264038 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
136 | 7.1823 | 7.306933 | 0.9829431856 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
137 | 4.151175 | 4.149592 | 1.000381483 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
138 | 1.781279 | 0.884288 | 2.014365229 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
139 | 1.050804 | 0.774362 | 1.356993241 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
140 | 1.860758 | 0.884637 | 2.103414169 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
141 | 1.099908 | 0.775887 | 1.417613647 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
142 | 1.857387 | 0.885738 | 2.096993693 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
143 | 1.105279 | 0.77365 | 1.428655077 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
144 | 0.489408 | 0.269583 | 1.815426047 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
145 | 0.322525 | 0.236979 | 1.360985573 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
146 | 0.515475 | 0.265813 | 1.93923924 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
147 | 0.315525 | 0.228146 | 1.382995976 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
148 | 0.503438 | 0.277204 | 1.816128194 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
149 | 0.335421 | 0.228275 | 1.469372467 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
150 | 5.72495 | 4.909554 | 1.166083518 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': True}
151 | 4.45215 | 4.251333 | 1.047236243 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': False}
152 | 29.953021 | 29.879879 | 1.002447868 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
153 | 9.854683 | 9.839517 | 1.001541336 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
154 | 6.178033 | 5.697375 | 1.084364817 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': True}
155 | 6.280317 | 5.712525 | 1.099394226 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': False}
156 | 10.256062 | 11.336527 | 0.9046917103 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': True}
157 | 9.469546 | 11.33705 | 0.8352742556 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': False}
158 | 0.119087 | 0.0797 | 1.494190715 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
159 | 0.098713 | 0.047173 | 2.092574142 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
160 | 0.960812 | 0.675762 | 1.421820108 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': True}
161 | 0.536546 | 0.485958 | 1.104099531 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': False}
162 | 2.555225 | 1.791567 | 1.426251432 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
163 | 1.419087 | 1.305137 | 1.087308842 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
164 | 5.182008 | 3.48085 | 1.488719135 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': True}
165 | 2.831779 | 2.498537 | 1.133374851 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': False}
166 | 8.546038 | 5.7783 | 1.478988284 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': True}
167 | 4.731004 | 4.161975 | 1.136720908 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': False}
168 | 0.084754 | 0.07435 | 1.139932751 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
169 | 0.057933 | 0.043096 | 1.344277891 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
170 | 2.568592 | 1.802117 | 1.425319222 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
171 | 1.433054 | 1.307342 | 1.096158465 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
172 | 10.3213 | 7.111604 | 1.451332217 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': True}
173 | 5.680525 | 5.168129 | 1.099145358 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': False}
174 | 1.02255 | 1.01375 | 1.008680641 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': False}
175 | 3.074233 | 3.094383 | 0.993488201 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': True}
176 | 1.016812 | 1.030575 | 0.9866453194 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
177 | 3.053658 | 3.089504 | 0.9883974903 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
178 | 1.025863 | 1.032088 | 0.9939685376 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': False}
179 | 3.798942 | 3.799213 | 0.9999286694 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': True}
180 | 4.492979 | 4.493421 | 0.999901634 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': False}
181 | 51.543363 | 51.266204 | 1.005406271 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': True}
182 | 1.018008 | 1.001587 | 1.016394981 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': False}
183 | 3.035404 | 3.003113 | 1.010752509 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': True}
184 | 0.610421 | 0.56 | 1.0900375 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': False}
185 | 1.138983 | 0.757296 | 1.504012962 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': True}
186 | 0.641558 | 0.557808 | 1.150141267 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': False}
187 | 1.181475 | 0.754725 | 1.565437742 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': True}
188 | 1.03045 | 1.026904 | 1.003453098 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': False}
189 | 3.041421 | 3.0263 | 1.00499653 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': True}
190 | 0.609929 | 0.572304 | 1.065743032 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': False}
191 | 1.146875 | 0.756446 | 1.516135983 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': True}
192 | 0.645187 | 0.561708 | 1.148616363 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': False}
193 | 1.181721 | 0.758054 | 1.558887625 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': True}
194 | 0.927654 | 0.925946 | 1.0018446 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': False}
195 | 2.749983 | 2.740354 | 1.00351378 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': True}

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157876
Approved by: https://github.com/malfet
2025-08-08 16:40:10 +00:00
a4f69a5da0 [dynamo][guards] Remove guards on stdlib modules (#159913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159913
Approved by: https://github.com/StrongerXi
2025-08-08 16:26:04 +00:00
231c72240d CMake build: preserve PYTHONPATH (#160144)
Fixes #160092

I'm very new to CMake, so let me know if there's a fancier way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160144
Approved by: https://github.com/malfet

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-08-08 16:03:49 +00:00
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
8a37f0c903 improve gather and scatter_add strategy (#160140)
As title.

This PR made a small fix on top of https://github.com/meta-pytorch/autoparallel/pull/81.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160140
Approved by: https://github.com/fmassa
2025-08-08 15:06:24 +00:00
b5fd7223b1 Improve pin_memory error message on CPU-only systems (#159994)
## Summary
- clarify pin_memory error message when no accelerator backend is available

## Testing
- `python repro_pin_memory.py` (fails: Need to provide pin_memory allocator to use pin memory)
- `lintrunner -a`

------
https://chatgpt.com/codex/tasks/task_e_6893ba92c93483238a9bdfdd6c52812b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159994
Approved by: https://github.com/albanD
2025-08-08 14:36:45 +00:00
9fa8ce26cf Working setup with runnable PyTorch on Codex. (#159968)
Sample transcript: https://chatgpt.com/s/cd_68938effc1a88191ae78bc82a8cefe94

This makes use of https://github.com/pytorch/pytorch/pull/159965 to bypass doing an actual build and use nightly.

Things to improve:
- Once USE_NIGHTLY is in main, we can remove the patching
- We should just keep using the latest nightly, instead of a hard-coded one

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159968
Approved by: https://github.com/wdvr
2025-08-08 14:34:15 +00:00
62bac07981 [inductor][triton] support profile_scratch launcher arg (#159772)
This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static cuda launcher and the AOTI kernel callers to pass in these arguments when calling the Triton kernel.

Tests: https://github.com/pytorch/pytorch/pull/159158. I also verified these tests locally with Triton 3.2, 3.3, and 3.4.

Fixes:
* static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`)
* AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`)

Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772
Approved by: https://github.com/NikhilAPatel, https://github.com/eellison
2025-08-08 14:27:38 +00:00
7f4cb4a3e0 [MPS] coalesce for sparse tensors (#159729)
MPS coalesce function for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-08 13:49:55 +00:00
556e2a73f4 [Test][Easy] Use float16 dtype in test_sort_large (#159939)
The test fails with:
>RuntimeError: var_mean only support floating point and complex dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159939
Approved by: https://github.com/eqy
2025-08-08 09:56:44 +00:00
178515d0ff [BE][PYFMT] remove black: finish black -> ruff format migration (#144557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144557
Approved by: https://github.com/ezyang
2025-08-08 07:46:10 +00:00
3a56237440 [SymmMem] Send tensors with unerased type information to NVSHMEM Triton kernels (#159788)
This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers).

The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful if you want to do, for instance, some local math on the data.

-----

**TODO:**
This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`.

From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer:

```
Pointer-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Remote buffer:  0x2430300c00 (dst) ← Rank 1's memory
    Remote signal:  0x2430301600 (sig) ← Rank 1's signal

  Rank 1 (waiting):
    Local signal:   0x430301600 (waits here)

Tensor-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Local buffer:   0x430300c00  (dst) ← this is wrong
    Local signal:   0x430300e00  (sig) ← this is wrong

  Rank 1 (waiting):
    Local signal:   0x430300e00 (waits here)

```

Next steps: we need a mechanism to resolve a local tensor to its remote PE address, equivalent to the `handle.buffer_ptrs[peer]` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159788
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756
2025-08-08 05:20:42 +00:00
e0d8a315c5 [SymmMem] Add helpful docstrings for all NVSHMEM APIs (#159756)
Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159756
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755
2025-08-08 05:20:42 +00:00
bfff2e3592 [SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch (#159755)
This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64).

It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce
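
The name assembly described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `NVSHMEM_TYPENAMES`, `SUPPORTED_OPS`, and `reduce_extern_name` are hypothetical names, the typename table (in particular `half` for float16) is an assumption rather than the mapping used in the PR, and only torch-style dtype names such as `torch.float32` or `float32` are handled.

```python
# Hypothetical sketch: normalize a dtype name and build the NVSHMEM
# reduction function name. The table below is an assumption, not the PR's code.
NVSHMEM_TYPENAMES = {
    "int8": "int8",
    "int16": "int16",
    "int32": "int32",
    "int64": "int64",
    "uint8": "uint8",
    "uint16": "uint16",
    "uint32": "uint32",
    "uint64": "uint64",
    "float16": "half",       # assumed NVSHMEM typename for float16
    "bfloat16": "bfloat16",
    "float32": "float",
    "float64": "double",     # assumed NVSHMEM typename for float64
}
SUPPORTED_OPS = ("sum", "max", "min", "prod")


def reduce_extern_name(dtype, operation):
    """Return a name like 'nvshmem_float_sum_reduce' for a dtype and op."""
    # str(torch.float32) is "torch.float32"; keep only the part after the dot.
    key = str(dtype).rsplit(".", maxsplit=1)[-1]
    if key not in NVSHMEM_TYPENAMES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if operation not in SUPPORTED_OPS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return f"nvshmem_{NVSHMEM_TYPENAMES[key]}_{operation}_reduce"
```

Resolving the name in one place keeps the kernel-facing call a single function, at the cost of a lookup table that has to track the typenames the device library actually exports.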

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159755
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734
2025-08-08 05:20:36 +00:00
1c881440f4 [SymmMem] Initialize NVSHMEM module only for kernels that have nvshmem in their name (#159734)
Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change calls `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names of all of our NVSHMEM kernels to align with this.
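
A minimal sketch of the name-based filter, assuming a hook that receives the kernel name and its module. The function `make_post_compile_hook` and the `(kernel_name, module)` signature are hypothetical; the real Triton hook receives different arguments.

```python
def make_post_compile_hook(init_fn):
    """Return a hook that calls init_fn(module) only for NVSHMEM kernels."""

    def hook(kernel_name, module):
        # Skip kernels that do not use NVSHMEM, so they pay no init cost.
        if "nvshmem" not in kernel_name:
            return False
        init_fn(module)
        return True

    return hook
```

Because the filter is purely name-based, a kernel that uses NVSHMEM but lacks "nvshmem" in its name would be skipped, which is why the kernel names had to be aligned.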

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159734
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701
2025-08-08 05:20:29 +00:00
7c4f7b9340 [SymmMem] Add Triton 3.4 support to NVSHMEM Triton and fix CI tests (make device library discoverable + fix peer calculation bug) (#159701)
This PR introduces support for Triton 3.4 and resolves several CI and test-related issues.

**Triton 3.4 Compatibility**
- The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook.
- The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes.

**Fix CI Errors**
- The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped.
- Added a decorator to run NVSHMEM tests only on H100s (compatible hardware)

**Peer Rank Calculation Fix**
- The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank.
Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup.
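
The difference between the two formulas can be checked with plain Python. The helper names `old_peer` and `new_peer` are only for illustration.

```python
def old_peer(rank, world_size):
    # Previous formula: mirrors the rank across the whole world.
    return (world_size - 1) - rank


def new_peer(rank):
    # New formula: always pairs rank 0 with rank 1 (meaningful for ranks 0 and 1).
    return 1 - rank


for world_size in (2, 8):
    pairs = [(rank, old_peer(rank, world_size), new_peer(rank)) for rank in (0, 1)]
    print(f"world_size={world_size}: (rank, old, new) = {pairs}")
```

With two ranks both formulas agree; with eight ranks the old formula sends rank 0 to rank 7, while the new one keeps the 0↔1 pairing the tests assume.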

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159701
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215
2025-08-08 05:20:22 +00:00
1783d6e966 [SymmMem] Fix flaky wait_until test (#159215)
When playing around with it, I noticed some flakiness in this test across sessions.

After debugging, it turned out that the heavy sync primitives I was calling from inside Triton kernels (like `nvshmem_quiet()` or `nvshmem_fence()`) were causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock.

The fix was realizing `wait_until` already provides all the sync you need. Just do:
- PE A: `nvshmem_wait_until(&ivar, ...)`
- PE B: `nvshmem_put(&ivar_on_PE_A, ...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159215
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136
2025-08-08 05:20:16 +00:00
ea7fe0ecf6 [SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity (#159136)
Quick refactor for consistency and clarity.

1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms.

2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals.

To clean this up:
- All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit.
- Typed APIs consistently use nelems for element counts.
- Docstrings were added or updated to clarify expected units.
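
The unit distinction above amounts to the following conversion. `ITEMSIZE` and `nbytes_for` are hypothetical names used only for this sketch; they are not part of the wrappers.

```python
# Element sizes in bytes for a few common dtypes.
ITEMSIZE = {
    "int8": 1,
    "int16": 2,
    "int32": 4,
    "int64": 8,
    "float16": 2,
    "bfloat16": 2,
    "float32": 4,
    "float64": 8,
}


def nbytes_for(nelems, dtype_name):
    """Convert an element count into the byte count a byte-based API expects."""
    return nelems * ITEMSIZE[dtype_name]


# Four int64 elements occupy 32 bytes; passing 4 where nbytes is expected
# would copy only half of the first element's worth of data.
print(nbytes_for(4, "int64"))
```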

Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping.

This should make the API more intuitive and reduce friction for developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159136
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718
2025-08-08 05:20:09 +00:00
b0b229b197 [SymmMem] Use _get_default_group() instead of group.WORLD for group_name access (#158718)
Both approaches functionally return the default process group created by `init_process_group()` but `_get_default_group()` is a dedicated function with [better error handling and type safety](4869f71170/torch/distributed/distributed_c10d.py (L1300-L1310)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158718
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
ghstack dependencies: #158515
2025-08-08 05:20:02 +00:00
b5c937259b [SymmMem] Add NVSHMEM Reduction support (sum, min, max) into Triton (#158515)
Implements sum_reduce, min_reduce, and max_reduce collective operations for NVSHMEM Triton kernels. Enables parallel reduction computations across PE teams for int64 data types.

Tests: `python test/distributed/test_nvshmem_triton.py`

<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 1] Starting min/max reduction test with world_size=2
============================================================
============================================================
[Rank 0] Starting min/max reduction test with world_size=2
============================================================
[Rank 0] Source data for min/max: [10, 20]
[Rank 1] Source data for min/max: [15, 5]
[Rank 1] All values across PEs:
[Rank 0] All values across PEs:
  - Position 0: [10, 15]
  - Position 0: [10, 15]
  - Position 1: [20, 5]
  - Position 1: [20, 5]
[Rank 1] Expected min: [10, 5]
[Rank 0] Expected min: [10, 5]
[Rank 1] Expected max: [15, 20]
[Rank 0] Expected max: [15, 20]
[Rank 0] Executing MIN reduction...
[Rank 1] Executing MIN reduction...
[Rank 0] Executing MAX reduction...
[Rank 1] Executing MAX reduction...
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results:
[Rank 0] Results:
[Rank 1] MIN reduction result: [10, 5]
[Rank 1] MAX reduction result: [15, 20]
[Rank 0] MIN reduction result: [10, 5]
[Rank 0] MAX reduction result: [15, 20]
[Rank 1] ============================================================
[Rank 1] Min/Max reduction test PASSED ✓
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Min/Max reduction test PASSED ✓
[Rank 0] ============================================================
......
============================================================
============================================================
[Rank 0] Starting sum reduction test with world_size=2
[Rank 1] Starting sum reduction test with world_size=2
============================================================
============================================================
[Rank 0] Configuration:
[Rank 1] Configuration:
  - nreduce: 3 (number of separate reductions)
  - nreduce: 3 (number of separate reductions)
  - dtype: torch.int64
  - dtype: torch.int64
[Rank 1] Source data: [2, 4, 6]
[Rank 1] Contribution explanation:
[Rank 0] Source data: [1, 2, 3]
[Rank 0] Contribution explanation:
  - Element 0: 2 = (rank=1+1) * (index=0+1)
  - Element 0: 1 = (rank=0+1) * (index=0+1)
  - Element 1: 4 = (rank=1+1) * (index=1+1)
  - Element 1: 2 = (rank=0+1) * (index=1+1)
  - Element 2: 6 = (rank=1+1) * (index=2+1)
  - Element 2: 3 = (rank=0+1) * (index=2+1)
[Rank 1] Initial destination: [-1, -1, -1]
[Rank 0] Initial destination: [-1, -1, -1]
[Rank 0] Expected results after reduction: [3, 6, 9]
[Rank 1] Expected results after reduction: [3, 6, 9]
[Rank 0] Executing sum reduction...
[Rank 1] Executing sum reduction...
[Rank 1] Sum reduction completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Sum reduction completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Results after reduction:
[Rank 0] Destination buffer: [3, 6, 9]
[Rank 1] Results after reduction:
[Rank 0] Verification:
  - Reduction 0: PE0: 1 + PE1: 2 = 3
    Result: 3, Match: ✓
  - Reduction 1: PE0: 2 + PE1: 4 = 6
    Result: 6, Match: ✓
[Rank 1] Destination buffer: [3, 6, 9]
  - Reduction 2: PE0: 3 + PE1: 6 = 9
[Rank 1] Verification:
  - Reduction 0: PE0: 1 + PE1: 2 = 3
    Result: 9, Match: ✓
    Result: 3, Match: ✓
  - Reduction 1: PE0: 2 + PE1: 4 = 6
    Result: 6, Match: ✓
  - Reduction 2: PE0: 3 + PE1: 6 = 9
    Result: 9, Match: ✓
[Rank 0] ============================================================
[Rank 0] Sum reduction test PASSED ✓
[Rank 0] All 3 reductions computed correctly across 2 PEs
[Rank 0] ============================================================
[Rank 1] ============================================================
[Rank 1] Sum reduction test PASSED ✓
[Rank 1] All 3 reductions computed correctly across 2 PEs
[Rank 1] ============================================================
```

</details>
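
The expected values in the log above can be reproduced without any GPU as an element-wise reduction across ranks. The helper `expected_reduction` is hypothetical and only mirrors the arithmetic printed by the test.

```python
def expected_reduction(per_rank_sources, op):
    """Reduce position-by-position across ranks, e.g. op=sum, min, or max."""
    return [op(column) for column in zip(*per_rank_sources)]


# Sum test: each rank contributes (rank + 1) * (index + 1) for 3 elements.
sum_sources = [[(rank + 1) * (index + 1) for index in range(3)] for rank in range(2)]

# Min/max test: the source data printed by rank 0 and rank 1.
minmax_sources = [[10, 20], [15, 5]]

print(expected_reduction(sum_sources, sum))
print(expected_reduction(minmax_sources, min))
print(expected_reduction(minmax_sources, max))
```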

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158515
Approved by: https://github.com/mandroid6, https://github.com/ngimel
2025-08-08 05:19:55 +00:00
24257f5bfa [vllm hash update] update the pinned vllm hash (#159822)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159822
Approved by: https://github.com/pytorchbot
2025-08-08 04:13:48 +00:00
017259f9c6 [benchmarks] Add nativert benchmark (#159922)
Add NativeRT as an option in the PT2 OSS benchmark

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert

python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert

python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922
Approved by: https://github.com/angelayi
2025-08-08 03:38:32 +00:00
2ea40fba84 [Linter] Improve device-bias linter by adding detection for with torch.device("cuda"). (#159926)
For example, detect the following situation:
```
>>>Lint for test/dynamo/test_modes.py:
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode `with torch.device('cuda')`,
    suggest to use torch.device(GPU_TYPE)

        687  |            flex_attention as flex_attention_eager,
        688  |        )
        689  |
    >>> 690  |        with torch.device("cuda"):
        691  |            flex_attention = torch.compile(flex_attention_eager, dynamic=False)
        692  |
        693  |            with self.assertRaisesRegex(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159926
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #159759
2025-08-08 03:20:42 +00:00
beb4d7816d [BE]: ruff PLC0207 - use maxsplit kwarg (#160107)
Automatically replaces `split` with `rsplit` when relevant and performs the split only up to the first (or last) value. This lets the split function return early and improves efficiency.
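
As an illustration of the pattern the rule targets (the string `path` is just an example value):

```python
path = "torch.distributed.tensor"

# Only the first field is needed: stop after one split instead of splitting everything.
first_before = path.split(".")[0]
first_after = path.split(".", maxsplit=1)[0]

# Only the last field is needed: split once from the right.
last_before = path.split(".")[-1]
last_after = path.rsplit(".", maxsplit=1)[-1]

print(first_after, last_after)
```

Both forms produce the same value; the `maxsplit=1` variants avoid building a list of every field when only one is used.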

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
2025-08-08 03:14:59 +00:00
3fcd79e023 Fix infinite loop when iterating over an empty zip (#159673)
Dynamo would enter an infinite recursion when
`ZipVariable.next_variable(tx)` was called and there was no iterable to
iterate over.
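
For reference, this is the eager Python behavior that the traced code needs to match: `zip()` with no iterables is immediately exhausted. The helper `first_or_none` is hypothetical and only used to show the `StopIteration` path.

```python
def first_or_none(iterator):
    """Return the next item, or None if the iterator is already exhausted."""
    try:
        return next(iterator)
    except StopIteration:
        return None


# zip() with no arguments yields nothing, so next() raises StopIteration at once.
print(list(zip()))
print(first_or_none(zip()))
print(first_or_none(zip([1], [2])))
```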

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159673
Approved by: https://github.com/williamwen42
2025-08-08 02:50:21 +00:00
05c417715f integrate kernacle into inductor (#160121)
This adds integration into inductor in two parts

1) It kicks off the best config lookup at lowering time within mm.py
2) It awaits the future at scheduling time in select_algorithm.py

Notably this does not do the following

1) Support for enumerating between mm, addmm and bmm
2) Support for enumerating between exhaustive/max
3) Enumerating different hardware SKUs eg. H100, A100, etc.

those will come in the next diffs

Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121
Approved by: https://github.com/izaitsevfb
2025-08-08 02:14:44 +00:00
ba4ccf5d67 turn on execution frame cleanup by default (#160110)
Summary: Turning execution frame cleanup back on since D78621408 is done

Test Plan:
See D78621408

Rollback Plan:

Differential Revision: D79730674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160110
Approved by: https://github.com/jingsh
2025-08-08 02:13:48 +00:00
d68c323692 Log max_autotune exceptions (#159687) (#159688)
Summary:

Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.

Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```

The exception tracebacks:
```
# inner exception
traceback:
  File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
    launchers.append(result.make_launcher())
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
    self.kernel.load_kernel(device)
  File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
    (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(

# wrapped exception
traceback:
  File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
    choice.precompile()
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
    self.bmreq.precompile()
  File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
    getattr(mod, self.kernel_name).precompile()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
    self._make_launchers()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
    raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```

With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.

The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```
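
A rough sketch of how such an entry could be derived from the exception message shown earlier. The function `exception_entry` and the regular expression are hypothetical and are not taken from the PR; they only show that the fields in the example are recoverable from the message text.

```python
import re

# Matches e.g. "OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448"
_OOM_PATTERN = re.compile(
    r"(?P<exception>\w+Error): out of resource: \S+ "
    r"Required: (?P<required>\d+) Hardware limit:\s*(?P<limit>\d+)"
)


def exception_entry(choice_type, choice, message):
    """Build one item of the "exceptions" list from a precompile error message."""
    entry = {
        "choice_type": choice_type,
        "choice": choice,
        "exception_message": message,
    }
    match = _OOM_PATTERN.search(message)
    if match:
        # Numbers are kept as strings, as in the example metadata above.
        entry["exception"] = match["exception"]
        entry["required_memory"] = match["required"]
        entry["hardware_limit"] = match["limit"]
    return entry
```

Messages that do not match the out-of-resource pattern still produce an entry, just without the memory fields.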

Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt

Rollback Plan:

Differential Revision: D79420953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688
Approved by: https://github.com/stashuk-olek
2025-08-08 01:30:08 +00:00
03b254e49f Extend torch function support to ALL arguments, not just scalar type (but not insides of list) (#145089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145089
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-08-07 23:43:53 +00:00
195b5c2e27 Revert "dynamo: Remove passing or deleted dynamo_expected_failures (#159691)"
This reverts commit 36f46d082a4954921cb8493223f000f2aab79ed7.

Reverted https://github.com/pytorch/pytorch/pull/159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/159691#issuecomment-3166067241))
2025-08-07 22:55:51 +00:00
f077c2402e [replicate][be] improved readability of test case description (#160128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160128
Approved by: https://github.com/mori360
2025-08-07 22:51:58 +00:00
d46768db04 [MTIA] Allow users who know what they are doing to ignore all device mismatches in tracing and take a preferred device. (#159931)
Summary:
Device mismatches in tracing can most often be ignored. These are only logical mismatches, not physical ones.

Take any intermediate computation, and that computation will not actually materialize in a compiled binary execution. So a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on CPU device during the execution, as they are temporary allocations.

If a user knows his tensors at graph input are all on the correct device, then he can ignore all tracing errors.

Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing.

Users can set
```
  torch._functorch.config.fake_tensor_prefer_device_type = 'mtia'
```
to forcefully override any mismatch and prefer the non-CPU device. This unblocks vLLM graph mode for MTIA.

Test Plan:
Added two unit tests.

Rollback Plan:

Differential Revision: D79698438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931
Approved by: https://github.com/jansel
2025-08-07 22:37:15 +00:00
clr
36f46d082a dynamo: Remove passing or deleted dynamo_expected_failures (#159691)
partially generated with
```
for TESTCASE in $(ls | cut -f1 -d'.' | grep -v CPython | uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else   sl rm "$TESTCASE"* ; fi; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159691
Approved by: https://github.com/xmfan
2025-08-07 21:41:50 +00:00
8147370733 Fix qembeddingbag_byte_prepack_meta to use sym_sizes (#159985)
Summary: In qembeddingbag_byte_prepack_meta, weight.sizes() would return a concrete int. We should use .sym_size() to return a SymInt instead.

Test Plan:
CI

Rollback Plan:

Reviewed By: kqfu, henryoier

Differential Revision: D79744512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159985
Approved by: https://github.com/jerryzh168, https://github.com/henryoier
2025-08-07 21:22:29 +00:00
e619c6bb90 [export] Apply move_to_device_pass to all submodules (#159992)
Previously we only applied this move_to_device_pass to the top-level graph. However, if we have HOO, this pass is not applied to the HOO submodules. This PR modifies the pass to run on all submodules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992
Approved by: https://github.com/yiming0416
2025-08-07 18:51:15 +00:00
3cf7b4024e [DTensor] Support user-supplied Generator for random ops (#159933)
If the user provides a generator kwarg to a random op (e.g.
nn.init.uniform_(..., generator=my_generator)), we can still advance
that generator's state in a SPMD-global way so that each local-tensor
gets appropriate values and the generator advances to the same state as
if it had operated on the full tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159933
Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol
2025-08-07 18:47:22 +00:00
21392c0e06 [inductor] disable flex decoding on Windows. (#160072)
Discussed with @jianan-gu and @Valentine233 , disable flex decoding on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160072
Approved by: https://github.com/angelayi
2025-08-07 18:07:36 +00:00
ee1fb43450 Fix docker image creation (#158634)
Since switching from wheel 0.34.2 to wheel 0.45.1
python symlinks are no longer correctly created.

Migrate to packaging package for symlink creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158634
Approved by: https://github.com/malfet
2025-08-07 17:41:47 +00:00
0bd3af4fb8 Further fix failing tests in test/inductor/test_analysis.py (#160070)
This is a follow up on #159800 as other tests are still failing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160070
Approved by: https://github.com/aorenste
2025-08-07 17:32:58 +00:00
8399cf88ce Use only safetensors APIs in HFStorageReader (#159681)
Get rid of the logic that manually reads the metadata from the header of the safetensors file, and use the functions available through safe_open() to get the metadata instead. This is much cleaner and lets us rely on safetensors-provided APIs rather than our own custom methods to get metadata.

Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159681
Approved by: https://github.com/saumishr
ghstack dependencies: #159405, #159406
2025-08-07 17:23:03 +00:00
0b187b3114 DCP HF reader: use safe_open instead of reading the bytes (#159406)
Reading the bytes and converting to tensors is much slower than using safe_open. For an 8B model across 8 ranks, loading took ~30s before this change and ~4s after.

Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159406
Approved by: https://github.com/saumishr
ghstack dependencies: #159405
2025-08-07 17:23:03 +00:00
69cc606fda HF component update to not use fsspec components (#159405)
Update HF components to not inherit from fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s to load an 8B model), which is a significant performance win over reading bytes and converting them to tensors, which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata by bytes and loading it.

Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
2025-08-07 17:22:54 +00:00
57f738b635 [inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983
Approved by: https://github.com/eellison
ghstack dependencies: #158758
2025-08-07 17:07:26 +00:00
e167c7d0f3 [inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)
Fixes #155121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang, https://github.com/eellison
2025-08-07 17:07:26 +00:00
b1a602762e [Profiler] Update README (#159816)
Summary: Updated README with code structure and explanation of core features within profiler

Test Plan:
N/A

Rollback Plan:

Differential Revision: D79604189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159816
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2025-08-07 16:44:41 +00:00
e1cf0d496e [inductor] unification for inductor debug. (#159998)
Unify the inductor debug build, following @desertfire's suggestion: https://github.com/pytorch/pytorch/pull/159938#pullrequestreview-3093803196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159998
Approved by: https://github.com/angelayi
2025-08-07 16:38:00 +00:00
06824f3c72 [inductor] fix test_dynamo_timed on Windows. (#159981)
Fixed `test_dynamo_timed`:
<img width="1030" height="389" alt="image" src="https://github.com/user-attachments/assets/02d84dd8-6a65-4f91-8d4c-48ba0a81fac1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159981
Approved by: https://github.com/angelayi
2025-08-07 16:37:52 +00:00
f3a4d742ec Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
74da2604c9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 15f1173e5d72d6d45faba4cecd135e0160f06c6f.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
c4e64467b5 Revert "Add UT for torch.accelerator memory-related API (#155200)"
This reverts commit 4604f0482c2b4a3001b62e5bc5085149a9bb053c.

Reverted https://github.com/pytorch/pytorch/pull/155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
90b78ee50f Move xla jobs to unstable workflow (#159272)
Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily.

If you want to run these xla tests, add the ciflow/unstable label to your PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159272
Approved by: https://github.com/atalman, https://github.com/malfet
2025-08-07 16:22:52 +00:00
e248719ac0 [DTensor] support _StridedShard in view op (#159656)
**Summary**
Some thoughts on view-op and `_StridedShard` interaction:
1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned)
compared to `Shard`. It only changes how shards permute across the devices.
2. `view()` op on DTensor strictly forbids shard redistribution which means if
`view()` may cause shard permutation across devices, it should be rejected.
This is enforced in today's sharding prop for `view()`.
3. Since DTensor `view()` won't introduce any redistribution, it's certain that
`placements` won't change except the inner `dim` attribute of `Shard`
or `_StridedShard`.

Therefore, to support `_StridedShard` in `view()` op, the only change required
is to keep `_StridedShard` as `_StridedShard` in the output spec.
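A minimal sketch of point 3, using hypothetical stand-in classes rather than the real `torch.distributed.tensor` placement types: when a view only remaps dimensions, the output placement keeps its concrete type and only `dim` changes. The `remap_placement` helper and its `dim_map` argument are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shard:
    """Stand-in for a placement that shards a tensor along `dim`."""
    dim: int


@dataclass(frozen=True)
class StridedShard(Shard):
    """Stand-in for `_StridedShard`: same partitioning, different shard order."""
    split_factor: int = 1


def remap_placement(placement, dim_map):
    """Return the output placement for a view that maps input dim -> output dim.

    `dataclasses.replace` preserves the concrete class, so a StridedShard
    stays a StridedShard and only its `dim` is rewritten.
    """
    if placement.dim not in dim_map:
        raise ValueError(f"view would move shards of dim {placement.dim}")
    return replace(placement, dim=dim_map[placement.dim])


# Illustrative case: the view moves the sharded dim from index 2 to index 1.
out = remap_placement(StridedShard(dim=2, split_factor=4), {2: 1})
print(out)
```

The point of the sketch is only that the placement type is carried through unchanged; the real sharding propagation also has to decide whether the view is legal at all.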

**Test**
`pytest test/distributed/tensor/test_view_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656
Approved by: https://github.com/wconstab
2025-08-07 15:59:25 +00:00
f60454cce8 S390X: update test dependencies (#158636)
numba currently doesn't build from source due to
https://github.com/numba/numba/pull/10073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158636
Approved by: https://github.com/malfet
2025-08-07 15:58:30 +00:00
8ab5868a21 Actually run the einops tests in CI (#159776)
The test filter was wrong, it should not start with "test/".

Test Plan:
- wait for CI
- Tested locally with `python test/run_test.py --einops --verbose`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159776
Approved by: https://github.com/atalman, https://github.com/StrongerXi
2025-08-07 15:23:06 +00:00
d20c4c20e6 [CI] Update xpu ci use rolling driver for new features (#158340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158340
Approved by: https://github.com/seemethere

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-08-07 15:18:51 +00:00
83875cdb55 [nativert] Expose ModelRunner to public through pimpl type ModelRunnerHandle. (#159989)
Summary:
Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`.

It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header there would pollute the namespace a lot, which is not what we want in general. Therefore we just create a Handle type which holds a pointer, decoupling the actual type from the header definition.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79751098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989
Approved by: https://github.com/dolpm
2025-08-07 14:23:21 +00:00
a53d14d5f8 Revert "unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)"
This reverts commit 3a2c3c8ed365eb4e4cf4620c25d70b2f70483762.

Reverted https://github.com/pytorch/pytorch/pull/157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/157786#issuecomment-3164126250))
2025-08-07 13:09:33 +00:00
8cb91e20bc Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908)
This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-08-07 11:24:44 +00:00
b0df7715e8 Remove benchmark dependencies from regular ROCm CI images (#160047)
Instead, use a new `pytorch-linux-jammy-rocm-n-py3-benchmarks` image for Docker benchmark job.  This addresses 2 issues:

* The current ROCm failures in trunk w.r.t librosa version https://github.com/pytorch/pytorch/actions/runs/16789466749/job/47549950994 that TorchBench pulls in.
* Reduce the size of the regular ROCm CI images by removing TorchBench models, which is needed only for benchmarking jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160047
Approved by: https://github.com/malfet, https://github.com/izaitsevfb
2025-08-07 09:26:58 +00:00
422bd6808b dataclass pytree fix (#159916)
Differential Revision: D79687243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159916
Approved by: https://github.com/XuehaiPan, https://github.com/angelayi
2025-08-07 08:22:41 +00:00
24f43d0da7 [inductor] [cpu] fix the dtype hardcoded to int64 in store_reduction (#157904)
## Fixes https://github.com/pytorch/pytorch/issues/157683

## mini repro
* Just copy the code from the issue to reproduce it.
```python
import torch

device = "cpu"

# Input tensors
v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)

def my_model(v2_0, v3_0):
    v6_0 = -v3_0
    v4_0 = v2_0 * v3_0
    v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
    v0_0 = v2_0.to(torch.int32)
    v5_0 = v0_0.amax(dim=0)

    return v6_0, v4_0, v1_0, v0_0, v5_0

v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)

compiled_model = torch.compile(my_model, backend="inductor")

v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0)

print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
print("v1_0", v1_0.shape)
print("v0_0", v0_0.shape)
print("v5_0", v5_0.shape)

```
error_stack
```
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
   41 | convert(const Vectorized<src_t>& src) {
      | ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note:   template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
   37 |                     auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
## summary
**The C++ kernel generated by the Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail.**
The original code that caused the problem.
```
def my_model(v2_0, v3_0):
    v6_0 = -v3_0
    v4_0 = v2_0 * v3_0
    v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
    v0_0 = v2_0.to(torch.int32)
    # The following line triggers the problem.
    v5_0 = v0_0.amax(dim=0)
```

## proof procedure
The c++ kernel generated by inductor:
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void kernel(const int32_t* in_ptr0,
                       int32_t* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L))
        {
            {
                int32_t tmp_acc0_arr[16];
                for (int i = 0; i < 16; i++)
                {
                    tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min();
                }
                int32_t tmp_acc0 = std::numeric_limits<int32_t>::min();
                at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min());
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
                        {
                            auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
                        {
                            for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)];
                                tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0);
                            }
                        }
                    }
                }
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
                {
                    // Invalid data type conversion: this is what causes the g++ compilation to fail.
                    auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
                    int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
                    {
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)];
                    }
                }
            }
        }
    }
}
```
The compiler complains:
```text
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
   41 | convert(const Vectorized<src_t>& src) {
      | ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note:   template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
   37 |                     auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
So the following line is the problem:
```c++
    // this line means that tmp_acc0_vec should be Vectorized<int64_t>, and it will convert it to Vectorized<int32_t>.
    auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
The issue is that `tmp_acc0_vec` is already of type `Vectorized<int32_t>`, but the template parameters declare it to be `Vectorized<int64_t>` and request a conversion to `Vectorized<int32_t>`. These conflict: no conversion should be emitted at all, because `tmp_acc0_vec` is already `Vectorized<int32_t>`. The following lines hardcode the output variable type to int64, which causes the unnecessary and incorrect type conversion.
d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)
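
The idea behind the fix can be sketched in Python as follows. This is an illustrative stand-in, not the code in `torch/_inductor/codegen/cpp.py`: the `emit_store_value` helper and its parameters are hypothetical, and the `at::vec::convert<...>` string simply mirrors the form seen in the generated kernel above.

```python
def emit_store_value(var: str, acc_dtype: str, out_dtype: str, src_vecs: int = 1) -> str:
    """Return the C++ expression to store for a vectorized reduction result."""
    if acc_dtype == out_dtype:
        # The accumulator already has the output type: no conversion needed.
        return var
    # Otherwise convert from the accumulator's actual type, not a hardcoded one.
    return f"at::vec::convert<{out_dtype},1,{acc_dtype},{src_vecs}>({var})"


print(emit_store_value("tmp_acc0_vec", "int32_t", "int32_t"))
print(emit_store_value("tmp_acc0_vec", "int64_t", "int32_t", src_vecs=2))
```

With matching types the accumulator is stored directly, which is the case the repro hits.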

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904
Approved by: https://github.com/jgong5
2025-08-07 08:03:05 +00:00
aa75e917bd [Export Schema] Remove deviceAllocationMap field (#159653)
Summary:
This field is not used today, and it's not useful either.

The device allocation is configured at model loading time, specified by user.
It shouldn't be part of the model definition.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79385513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653
Approved by: https://github.com/zhxchen17
2025-08-07 07:31:42 +00:00
3f1636ebef [audio hash update] update the pinned audio hash (#160046)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160046
Approved by: https://github.com/pytorchbot
2025-08-07 04:16:35 +00:00
c859ba7114 Make onnx export SDPA match aten behavior (#159973)
This PR makes onnx sdpa export match the behavior of aten sdpa when boolean mask is used.
@justinchuby

```python
import onnxruntime as ort
import torch

class ScaledDotProductAttention(torch.nn.Module):
    def forward(self, query, key, value, attn_mask):
        return torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)

model = ScaledDotProductAttention()
attn_mask = torch.ones(2, 4, 8, 8).bool()  # boolean mask for attention
attn_mask[0, 0, 0, :] = False  # masking an entire row (padding token)
query = key = value = torch.randn(2, 4, 8, 16)
output = model(query, key, value, attn_mask)

torch.onnx.export(
    model,
    (query, key, value, attn_mask),
    "scaled_dot_product_attention.onnx",
    input_names=["query", "key", "value", "attn_mask"],
    output_names=["output"],
    dynamo=False,  # or True
)
ort_session = ort.InferenceSession("scaled_dot_product_attention.onnx")

np_inputs = {"query": query.numpy(), "key": key.numpy(), "value": value.numpy(), "attn_mask": attn_mask.numpy()}
onnx_outputs = ort_session.run(None, np_inputs)[0]

torch.testing.assert_close(output, torch.tensor(onnx_outputs), equal_nan=True)
```
This script fails the assertion because the ORT model outputs NaNs.
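
One plausible way such NaNs arise, stated here as an assumption rather than a diagnosis of the exporter: a boolean mask row that is entirely `False` becomes a row of `-inf` scores, and a plain softmax over that row computes `-inf - (-inf)`, which is NaN. The pure-Python sketch below shows the effect and a guarded variant that returns zeros for such a row; both function names are illustrative.

```python
import math


def naive_softmax(row):
    """Max-subtracted softmax that still breaks on an all -inf row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # -inf - (-inf) is NaN
    total = sum(exps)
    return [e / total for e in exps]


def guarded_softmax(row):
    """Same as above, but a fully masked row yields zeros instead of NaN."""
    if max(row) == -math.inf:
        return [0.0] * len(row)
    return naive_softmax(row)


masked_row = [-math.inf] * 4  # every position masked out
print(naive_softmax(masked_row))
print(guarded_softmax(masked_row))
```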
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159973
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
2025-08-07 04:06:07 +00:00
d4c1a08c89 Relax unclaimed successes in dtype op tests when running under TEST_WITH_DYNAMO/TEST_WITH_INDUCTOR (#159976)
This PR changes the behavior for compile wrapped op tests:
- supported_but_unclaimed_forward
- supported_but_unclaimed_backward

These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile.

~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976
Approved by: https://github.com/ezyang
2025-08-07 02:38:45 +00:00
81d72fb1f7 Move smoke binary builds to 3.12 (#159993)
And limit them just to stable CUDA version (as there weren't any recent instances when only one of those jobs failed to build)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159993
Approved by: https://github.com/ngimel
ghstack dependencies: #159986, #159990
2025-08-07 01:59:30 +00:00
d0226719a9 [BE][EZ] Delete remains of split-build logic (#159990)
Hopefully last piece of https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159990
Approved by: https://github.com/atalman
ghstack dependencies: #159986
2025-08-07 01:59:30 +00:00
38d65c6465 Add a USE_NIGHTLY option to setup.py (#159965)
If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X.

Coded with claude code.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965
Approved by: https://github.com/malfet
2025-08-07 01:44:20 +00:00
2ba2f598f3 [Dynamo] Add torch.xpu.stream to trace rules (#159844)
# Motivation
Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844
Approved by: https://github.com/jansel
2025-08-07 01:35:50 +00:00
1bb5e6c076 update expected results (#159867)
refresh due to https://github.com/pytorch/pytorch/pull/159696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867
Approved by: https://github.com/masnesral
2025-08-07 01:18:36 +00:00
8b0be7b65a [Profiler] Fix unexpected C return events (#159574)
The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there.

I recently found a new cause for this problem.

1df5d00145/Python/ceval.c (L5807-L5836)

In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur.

Here is the code to reproduce this problem.

```
import threading
import time
import torch

from threading import Event, Lock

lock = Lock()
lock.acquire()

event1 = Event()
event2 = Event()
event3 = Event()

def run():
    event1.set()
    event2.wait()
    lock.acquire()
    event3.set()

threading.Thread(target=run).start()

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
    event1.wait()
    event2.set()
    time.sleep(1)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
    lock.release()
    event3.wait()
```

<img width="1766" height="1250" alt="image" src="https://github.com/user-attachments/assets/6794eeca-7364-429e-91eb-62cdad116bd3" />

To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when a PyTrace_C_RETURN/PyTrace_C_EXCEPTION event occurs, we can determine whether to record it based on these two fields.
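
The pairing idea can be illustrated with a simplified, pure-Python sketch. It does not reproduce the profiler's `active_frames_` / `remaining_start_frames_` bookkeeping; instead it keeps a single per-thread counter of C calls seen since tracing started and drops any C return that has no matching call. The `CReturnFilter` class and its method names are hypothetical.

```python
from collections import defaultdict


class CReturnFilter:
    """Drop C return events whose matching C call was never observed."""

    def __init__(self):
        # thread id -> number of C calls seen but not yet returned
        self._open_c_calls = defaultdict(int)

    def on_event(self, thread_id, event):
        """Return True if the event should be recorded."""
        if event == "c_call":
            self._open_c_calls[thread_id] += 1
            return True
        if event in ("c_return", "c_exception"):
            if self._open_c_calls[thread_id] == 0:
                return False  # the call started before tracing: unexpected return
            self._open_c_calls[thread_id] -= 1
            return True
        return True


f = CReturnFilter()
# Thread 1 was already blocked inside a C function when tracing began,
# so its first event is a return without a call.
events = [(1, "c_return"), (2, "c_call"), (2, "c_return"), (1, "c_call"), (1, "c_return")]
decisions = [f.on_event(t, e) for t, e in events]
print(decisions)
```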

In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed).

Do you think the fix is necessary?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574
Approved by: https://github.com/sraikund16
2025-08-07 01:17:55 +00:00
5cedc5a0ff [BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144552
Approved by: https://github.com/ezyang
2025-08-07 00:09:56 +00:00
fd606a3a91 [dynamo] update pytorch-labs -> meta-pytorch in graph break URLs (#159975)
Related PR: https://github.com/meta-pytorch/compile-graph-break-site/pull/30

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159975
Approved by: https://github.com/Lucaskabela
2025-08-06 23:57:31 +00:00
3daef4d128 [dynamo] Trace nn.Module __delattr__ (#159969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159969
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/StrongerXi
2025-08-06 23:43:19 +00:00
cb4b29b754 Revert "[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874)"
This reverts commit 9fd5b5f73589cf08dca60910368cc0f05c7906c8.

Reverted https://github.com/pytorch/pytorch/pull/159874 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/159874#issuecomment-3161896978))
2025-08-06 23:21:29 +00:00
a6bc296207 [FlexAttention] Update the guard semantics for divisibility (#159884)
We don't add guards unless we know (and another guard has ensured this) that this is a safe optimization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159884
Approved by: https://github.com/Chillee
2025-08-06 23:12:44 +00:00
64dc30c213 [HOP, map] Rework of map autograd to the new interface (#153343)
This PR reworks the current autograd implementation of map to the new interface.

@pytorchbot label "topic: not user facing"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343
Approved by: https://github.com/ydwu4
2025-08-06 23:02:42 +00:00
93da9952a7 gloo: fix building system gloo with CUDA/HIP (#146637)
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.

Fixes: #146239

Reported-by: Adam J Stewart <ajstewart426@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
2025-08-06 22:56:31 +00:00
3a2c3c8ed3 unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)
These tests now pass on AArch64 in our downstream CI.

`test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157786
Approved by: https://github.com/jerryzh168, https://github.com/malfet
2025-08-06 22:41:07 +00:00
9fd5b5f735 [pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874)
Summary: Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast.

Test Plan:
See: D79456310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159874
Approved by: https://github.com/c00w
2025-08-06 22:33:04 +00:00
2507ae63f2 Partitioner: Fix to align partition node order with original graph (#157892)
Fixes #157891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892
Approved by: https://github.com/ezyang
2025-08-06 22:12:47 +00:00
40c4d61f9a [Dynamo][Better Engineering] Typing torch/_dynamo/guards.py (#159315)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/guards.py`

Running
```
mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2030 | 3945 | 51.46% | 70 | 138 | 50.72% |
| This PR | 4055 | 4055 | 100.00% | 138 | 138 | 100.00% |
| Delta    | +2025 | +90 | +48.54% | +68 | 0 | +49.28% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159315
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-08-06 21:52:14 +00:00
a5725965ea Remove unnecessary "# noqa: set_linter" comments (#159467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467
Approved by: https://github.com/eellison
2025-08-06 21:31:52 +00:00
289f62ce8a [inductor][ez] fixup scaled_mm (#159948)
Summary:

This reverts the part of #159383 for scaled_mm where now, like before,
we pass through the normal input_nodes (not the triton_input_nodes)
to select_algorithm

- #159383 refactored how kwargs are retrieved
- it introduced this notion of KernelInputs that wrap input_nodes
- scaled_mm uses unsqueezed input nodes for triton to retrieve params
- the issue: it uses a squeezed (regular) bias for select_algorithm
  instead

This fixes that by passing the original input nodes rather
than the triton input nodes.

Test Plan:

```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
```

This set of tests was failing, and is passing now

Side note: these tests were failing I believe because the unsqueezed
bias made the ATEN choice no longer eligible, and there is some minor
numerical discrepancy between ATEN and Triton for this. I'm not sure
the test should be written like that, as we're implicitly relying on
ATEN being the choice here.

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159948
Approved by: https://github.com/izaitsevfb, https://github.com/eellison
2025-08-06 21:25:48 +00:00
512b4730e3 [EZ] Remove useless cross_compile_arm64 (#159986)
As we haven't had any Intel Mac runners in CI for the last 2+ years
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159986
Approved by: https://github.com/atalman
2025-08-06 21:01:05 +00:00
d2368aa6f3 [CPUBLAS] add macros for brgemm APIs for versioning (#158629)
**Summary**
Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. This is useful when callers need to work with old versions of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158629
Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang
2025-08-06 20:54:05 +00:00
0afaeb7c4e Improve extract_test_fn (#158637)
The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn
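
A sketch of the parsing problem, assuming test ids are dotted strings whose last two components are the class and the function: splitting from the right handles both flat and nested module paths. The `split_test_id` helper is hypothetical and is not the PR's `extract_test_fn`.

```python
def split_test_id(test_id: str) -> tuple[str, str, str]:
    """Split 'pkg.module.TestClass.test_fn' into (module, class, function)."""
    module, cls, fn = test_id.rsplit(".", 2)
    return module, cls, fn


print(split_test_id("test_module.TestClass.test_fn"))
print(split_test_id("inductor.test_torchinductor.TestClass.test_fn"))
```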
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637
Approved by: https://github.com/jbschlosser
2025-08-06 20:45:21 +00:00
50580b5053 Add minimal nn.functional.log_softmax support for NestedTensor (#159662)
This only works for the jagged layout and for the non-batch and non-jagged dimensions.

I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662
Approved by: https://github.com/jbschlosser
2025-08-06 20:34:02 +00:00
b8ef60b6bc Enable XNNPACK aarch64 builds (#159762)
Summary:
This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device.

Thanks to andrewjcg for proposing this fix.

Rollback Plan:

Reviewed By: andrewjcg

Differential Revision: D79497613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159762
Approved by: https://github.com/frankseide, https://github.com/malfet

Co-authored-by: Frank Seide <seide@meta.com>
2025-08-06 20:20:32 +00:00
0de2a45a48 [BE] Merge 3 CUDA build jobs into one (#159890)
Before this change there were 3 build+test jobs:
 - sm89 build+tests
 - sm75 build+distributed_test
 - sm75 build+pr_time_benchmark test
This change merges all 3 builds into one (for 2 architectures) and skips testing sm86, as it never found any new regressions that were not found at the same time on sm89
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159890
Approved by: https://github.com/clee2000, https://github.com/seemethere
2025-08-06 20:09:55 +00:00
12a54e4ac1 [Inductor UT][Fix XPU CI] Fix case failures introduced by community. (#159759)
Fixes #159631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159759
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-06 20:02:20 +00:00
d10e9e4781 [MPS] Remove all pre-MacOS14 logic (#159912)
Delete older enums, checks for MacOS-13.3+ for int64 support, etc

Fixes https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159912
Approved by: https://github.com/manuelcandales
2025-08-06 19:48:12 +00:00
c71950907d [inductor] add _get_inductor_debug_symbol_cflags for debug symbol control. (#159938)
We need inductor debug symbol support for debugging crashes. When debug symbol generation is turned on:
On Windows, it should create a [module_name].pdb file, which helps when debugging with WinDbg.
On Linux, it should create debug sections in the binary file.

I also added a UT for it.
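
For illustration only, the platform split could look like the sketch below. The function name `debug_symbol_flags` is hypothetical and is not the PR's `_get_inductor_debug_symbol_cflags`; the flags are the conventional ones for each toolchain (`/Zi` and `/DEBUG` for MSVC, `-g` for GCC/Clang), which is an assumption about what the PR emits.

```python
import sys


def debug_symbol_flags(platform: str = sys.platform) -> tuple[list[str], list[str]]:
    """Return (compile_flags, link_flags) that enable debug symbols."""
    if platform == "win32":
        # MSVC: /Zi emits debug info, /DEBUG makes the linker write a .pdb file.
        return ["/Zi"], ["/DEBUG"]
    # GCC/Clang: -g embeds debug sections in the binary.
    return ["-g"], []


print(debug_symbol_flags("win32"))
print(debug_symbol_flags("linux"))
```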

It works well on Windows inductor debug.
<img width="1648" height="833" alt="image" src="https://github.com/user-attachments/assets/5282a7de-cef3-4a38-9cd4-a0e63482c8b6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159938
Approved by: https://github.com/jansel, https://github.com/angelayi
2025-08-06 19:31:45 +00:00
6fa3592dc6 Dataloader benchmark script (#159432)
This script adds a simple dataloading benchmark tracking throughput and memory.

The output looks like this
```
System Information:
  PyTorch version: 2.9.0a0+gitf87d117
  PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py
  Torchvision version: 0.24.0a0+f52c4f1
  Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py
  CUDA available: True
  CUDA device: NVIDIA PG509-210
  CPU count: 192
  Physical CPU cores: 96
  Total system memory: 1510.11 GB

Loading dataset from imagenet/val (1 copies)
Dataset size: 50000

--- Benchmarking DataLoader with worker_method=multiprocessing ---
Memory before DataLoader creation: 500.59 MB

Detailed memory information:
  USS (Unique Set Size): 499.00 MB
  PSS (Proportional Set Size): 500.74 MB
  RSS (Resident Set Size): 497.39 MB
Memory after DataLoader creation: 1127.61 MB
Memory increase: 627.02 MB
Starting training loop with 1 epochs (max 100 batches per epoch)
Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB
Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB
Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB
Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB
Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB
Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB
Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB
Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB
Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB
Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB

Results:
  Worker method: multiprocessing
  DataLoader init time: 0.1553 seconds
  Average batch time: 0.3408 seconds
  Samples per second: 375.53
  Peak memory usage: 12738.76 MB
  Memory increase: 12238.17 MB
```

> TODO: Right now this script works on both CPU-only and GPU machines. It might be worth upgrading it to test against a canonical DistributedDataParallel setup on, say, a 1x8 node. Or maybe we can keep that as a separate script inside `benchmarks`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159432
Approved by: https://github.com/ramanishsingh
2025-08-06 19:05:19 +00:00
ba37f589d4 Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696)"
This reverts commit ee62177c196d716fc3a2d641370bed8a673a45d3.

Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/159696#issuecomment-3161196192))
2025-08-06 18:41:05 +00:00
44dd3684d2 [AOTI] Fix memory leak from all_reduce (#159818)
Summary: This PR solves two issues:

1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI because it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the corresponding wait operation later still waits on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of its output. This leaves an unwaited tensor, which leads to a memory leak.

2. AOTI now generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, whose return tensor doesn't get used. It will be released when the program exits, so it's not a memory leak, but it unnecessarily holds that tensor, which causes a high memory watermark. This PR generates a tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818
Approved by: https://github.com/henryhu6, https://github.com/yushangdi
2025-08-06 18:11:14 +00:00
c669b0ab87 Fix execution frame cleanup logic (#158717)
Summary: This fixes a bug in the execution frame cleanup logic. Previously, whenever we hit the time interval to clear out the frames, we removed any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic.
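
The intended rule can be sketched in Python roughly as below. This is only an illustration under assumptions: `FrameEntry`, `cleanup_frames` and `min_frames` are hypothetical names rather than the executor's real C++ types, and keeping unused frames up to the minimum is an assumed reading of "configured minimum number".

```python
from dataclasses import dataclass


@dataclass
class FrameEntry:
    """Hypothetical stand-in for a cached execution frame."""
    name: str
    used: bool = False  # set to True whenever the frame is handed out


def cleanup_frames(frames: list[FrameEntry], min_frames: int) -> list[FrameEntry]:
    """Run once per cleanup interval and return the frames that stay cached.

    Frames used during the last interval are always kept. Unused frames are
    kept only as far as needed to reach min_frames. The `used` flags are then
    reset so that the next interval starts fresh.
    """
    used = [f for f in frames if f.used]
    unused = [f for f in frames if not f.used]
    spare = max(0, min_frames - len(used))
    kept = used + unused[:spare]
    for f in kept:
        f.used = False
    return kept
```

With this rule a frame survives a cleanup pass because it was recently used, not merely because of its position in the cache.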

Test Plan:
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Rollback Plan:

Differential Revision: D78621408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158717
Approved by: https://github.com/dolpm
2025-08-06 18:04:24 +00:00
d7a855d67d [async-TP] Make scaled-mm + reduce-scatter preserve alignment of scales (#159957)
After https://github.com/pytorch/pytorch/pull/157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter.
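
A small arithmetic sketch of why slicing can break 16-byte alignment even when the storage itself is aligned. It assumes float32 scales (4 bytes each) with one scale per row and a made-up base address; the helper names are hypothetical and this is not the async-TP code.

```python
def chunk_start_addresses(base_addr, num_rows, num_chunks, itemsize=4):
    """Start address of each chunk when a 1-D scale tensor is sliced evenly."""
    rows_per_chunk = num_rows // num_chunks
    return [base_addr + i * rows_per_chunk * itemsize for i in range(num_chunks)]


def is_aligned(addr, alignment=16):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0


base = 256 * 1000  # hypothetical storage address, 256-byte aligned

# "Odd" shape: 6 rows in 2 chunks -> the second chunk starts 12 bytes in.
odd = chunk_start_addresses(base, num_rows=6, num_chunks=2)
# 8 rows in 2 chunks -> the second chunk starts 16 bytes in.
even = chunk_start_addresses(base, num_rows=8, num_chunks=2)

odd_aligned = [is_aligned(a) for a in odd]
even_aligned = [is_aligned(a) for a in even]
```

Here only the first chunk of the odd shape keeps its alignment, which matches the kind of shape the downstream tests were exercising.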

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159957
Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre
2025-08-06 17:42:26 +00:00
4c01991b38 [DCP][Prototype] Checkpoint replication via PGTransport (#157963) (#159801)
Summary:

### PR Context

Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport. In this implementation we assume world_sizes are equal, which allows us to create perfect bi-directional pairs for the purpose of choosing replica "partners".

Test Plan:
CI

Rollback Plan:

Differential Revision: D79590797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
2025-08-06 16:52:03 +00:00
a4b07fe8f6 [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-08-06 15:59:27 +00:00
d87161c3c8 [Easy] Fix wrong propagation of fallback_ops_dict in gen_aoti_c_shim (#159904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159904
Approved by: https://github.com/janeyx99
2025-08-06 15:09:18 +00:00
79eca4677b [precompile] Skip serializing unnecessary objects for guards. (#158926)
Summary:
The following types of objects don't need to be serialized for precompile:
1. PyCapsule, because we don't guard on C binding objects in meaningful ways.
2. Code objects, because we only do ID matching on these, and ID matches will always be dropped for precompile.
3. Nested function objects, since we also ban CLOSURE_MATCH.

Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects

Rollback Plan:

Differential Revision: D78816888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
2025-08-06 15:00:28 +00:00
2855688a1d Revert "Replace C array with std::array in formatSockAddr (#159812)"
This reverts commit e7feedf6a9bb346ad205796aa4084c8dcfb18072.

Reverted https://github.com/pytorch/pytorch/pull/159812 on behalf of https://github.com/malfet due to Looks like it broke distributed tests, see 2231c3ca3a/1 ([comment](https://github.com/pytorch/pytorch/pull/159812#issuecomment-3160513656))
2025-08-06 14:55:48 +00:00
2231c3ca3a [CI][CD] Fix install_nvshem function (#159907)
When building the CD Docker image, all CUDA dependencies must be installed into the `/usr/local/cuda/` folder.

Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to:  ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at:  ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB:  '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB:  '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR:  '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-08-06 14:44:37 +00:00
c03a734ba1 [OpenReg] Disable automatic inclusion of data files (#159845)
# Background

After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which is not used at runtime.

# Motivation

This PR aims to remove the stub.c file and any other file that is unused when running torch_openreg.

**Changes:**

- Set the **include_package_data** keyword to false in the setup function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
2025-08-06 10:35:13 +00:00
98316e5896 [WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325)
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
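
For reference, the formula above can be written as a tiny pure-Python function. This is only a readable sketch of the math on nested lists, not the Triton kernel or the fallback path; `weight_int8pack_mm_reference` is a hypothetical name.

```python
def weight_int8pack_mm_reference(x, w, scale):
    """Compute out[b][n] = sum_k(x[b][k] * w[n][k]) * scale[n].

    x:     B rows of K floats
    w:     N rows of K integers in the int8 range
    scale: N floats, one per output column
    """
    return [
        [sum(xk * wk for xk, wk in zip(x_row, w_row)) * s
         for w_row, s in zip(w, scale)]
        for x_row in x
    ]


x = [[1.0, 2.0], [0.5, -1.0]]    # [B=2, K=2]
w = [[1, 0], [2, -3], [4, 4]]    # [N=3, K=2]
scale = [0.5, 1.0, 0.25]         # [N=3]
out = weight_int8pack_mm_reference(x, w, scale)  # [B=2, N=3]
```

A fused kernel computes the same values without materializing the intermediate product that the .mul().sum() pattern creates.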

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769

```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:

Differential Revision: D79042656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
2025-08-06 10:28:08 +00:00
23cf241039 [aoti][mps] Initialize mps kernels first (#159753)
In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance:
```
if ...
    auto mps_lib_0_func = ...
    mps_lib_0_func->run()

// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run()  // error, mps_lib_0_func not initialized
```

So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
...
if ...
    get_mps_lib_0()->run()

get_mps_lib_0()->run()  // success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
2025-08-06 07:54:29 +00:00
e7feedf6a9 Replace C array with std::array in formatSockAddr (#159812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159812
Approved by: https://github.com/Skylion007
2025-08-06 07:44:29 +00:00
dad2a05bec [DTensor] Set up DTensorContinuousTestBase (#159885)
Also migrate `test_common_rules.py` since it was a short file

`python test/distributed/tensor/test_common_rules.py`

Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
2025-08-06 07:40:31 +00:00
0495cab545 Wire in pt2_triton_builds (#159897)
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).

Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054

Waiting for the scuba table to generate, but manual logging shows it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.

Rollback Plan:

Reviewed By: masnesral

Differential Revision: D79308333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
2025-08-06 07:39:51 +00:00
abfe403981 [AIDIR] Internal util function to insert MLHub debugging insight for dynamic shape (#159391)
Summary:
This feature is Meta internal only.
Add a util function to send dynamic shape-related suggestions to MLHubDebugInsightService, which will then be surfaced to users in the MLHub.

The rollout will be controlled by JK.

Test Plan:

MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21

 {F1980593060}

* If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility'
* The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the ds. The page is still in development, so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.

Rollback Plan:

Reviewed By: pianpwk

Differential Revision: D78526522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
2025-08-06 07:39:39 +00:00
1690c0c3a0 [Reland] Migrate ScalarType to headeronly (#159911)
The non ghstack version of #159416, to make sure we don't get reverted again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911
Approved by: https://github.com/mikaylagawarecki
2025-08-06 07:36:37 +00:00
e9d27aa8fd [CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854)
The CUB library is part of CCCL in CUDA Toolkit 13. If CUDA is found, CUB is found as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854
Approved by: https://github.com/eqy
2025-08-06 06:03:58 +00:00
2457e62c90 Revert "Set PYTHONHOME for inductor subprocesses using torch (#159382)"
This reverts commit fe8984a9f43bde10d1956abe7cb40710ed7ceed2.

Reverted https://github.com/pytorch/pytorch/pull/159382 on behalf of https://github.com/malfet due to Broke MacOS testing see d0fccbc99c/1 ([comment](https://github.com/pytorch/pytorch/pull/159382#issuecomment-3157455367))
2025-08-06 05:30:20 +00:00
d0fccbc99c [CI] Delete sm86 tests from pull (#159903)
And delete sm89+cuda12.4 builds from periodic (as sm86+legacy driver should be enough)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159903
Approved by: https://github.com/huydhn
2025-08-06 05:16:55 +00:00
3461988a4b [audio hash update] update the pinned audio hash (#159823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159823
Approved by: https://github.com/pytorchbot
2025-08-06 05:02:35 +00:00
9764981116 Pass fw/bw compilers to aot_export_joint_with_descriptors (#159814)
Allow overriding nop compilers with real ones when using this flow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159814
Approved by: https://github.com/fmassa
2025-08-06 04:50:56 +00:00
704594eb23 [Dynamo] make HOPs hashable (#159910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159910
Approved by: https://github.com/yf225
2025-08-06 04:02:17 +00:00
eqy
bfc27cf468 [Distributed] Fix @parametrize on unordered iterable in distributed test (#159793)
seems to fix https://github.com/pytorch/pytorch/issues/145807

sets aren't ordered so `@parametrize` can cause two processes to spawn with different settings
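
A minimal sketch of the general fix pattern, assuming the parametrized values are strings. `stable_params` and the backend names are hypothetical and not taken from the test. The iteration order of a set of strings can differ between processes because string hashing is randomized per process unless PYTHONHASHSEED is fixed, so sorting first gives every process the same order:

```python
def stable_params(values):
    """Return parametrization values in an order that is the same in every process."""
    return sorted(values)


backends = {"nccl", "gloo", "ucc"}  # unordered; iteration order is not guaranteed
params = stable_params(backends)    # deterministic list, safe to hand to a decorator
```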

originally debugged thanks to @k-artem, see https://github.com/pytorch/pytorch/issues/145807#issuecomment-2971009451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159793
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2025-08-06 03:51:42 +00:00
311f74089a remove print (#159917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159917
Approved by: https://github.com/laithsakka
2025-08-06 03:48:23 +00:00
14c7358c64 Enable fr_trace to read local traces from multiple hosts. (#159490)
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
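
One way to match such names regardless of hostname is sketched below. This is not the fr_trace implementation; `TRACE_NAME` and `find_trace_files` are hypothetical, and only the `<hostname>.pci3_rank_<rank>` naming comes from the description above.

```python
import re

# Anything, then ".pci3_rank_", then a decimal rank at the end of the name.
TRACE_NAME = re.compile(r"^(?P<host>.+)\.pci3_rank_(?P<rank>\d+)$")


def find_trace_files(filenames):
    """Map (hostname, rank) -> filename for names like '<hostname>.pci3_rank_<rank>'."""
    found = {}
    for name in filenames:
        match = TRACE_NAME.match(name)
        if match:
            found[(match.group("host"), int(match.group("rank")))] = name
    return found
```

Keying on the (hostname, rank) pair keeps traces from different hosts apart even when they share a rank number.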

Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```

Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```

After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
    dump = pickle.load(infile)
           ^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).

Rollback Plan:

Differential Revision: D79224727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
2025-08-06 03:15:34 +00:00
8ce81bcee1 [Torch Package] Make get names of OrderedImporters support fallback to importers (#155743)
Summary:
OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change updates the OrderedImporters get_name API so that it tries the get_name API of every single importer.
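
The fallback behavior can be sketched with simplified stand-in classes. `SimpleImporter` and `OrderedFallbackImporter` are hypothetical and much smaller than the real torch.package classes; the sketch only shows the "try each importer in order" idea.

```python
class SimpleImporter:
    """Hypothetical importer that knows the names of a fixed set of objects."""

    def __init__(self, known):
        self._known = known  # maps id(obj) -> (module_name, object_name)

    def get_name(self, obj):
        try:
            return self._known[id(obj)]
        except KeyError:
            raise LookupError(f"cannot find a name for {obj!r}") from None


class OrderedFallbackImporter:
    """Ask each wrapped importer in order and return the first answer."""

    def __init__(self, *importers):
        self._importers = list(importers)

    def get_name(self, obj):
        last_error = None
        for importer in self._importers:
            try:
                return importer.get_name(obj)
            except LookupError as err:
                last_error = err  # remember the failure and try the next one
        raise LookupError(f"no importer could name {obj!r}") from last_error
```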

Differential Revision: D76463252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
2025-08-06 02:26:10 +00:00
4604f0482c Add UT for torch.accelerator memory-related API (#155200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200
Approved by: https://github.com/albanD
ghstack dependencies: #138222, #152932
2025-08-06 02:22:18 +00:00
15f1173e5d Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be put under torch.accelerator:
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-06 02:22:18 +00:00
e16c48ae97 [BE] Fix type hint in AOTIRunnerUtil (#159577)
Not sure why it was labelled as list in the first place. In test_aot_inductor.py, I scanned a few use cases and they are tuple as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159577
Approved by: https://github.com/Skylion007
2025-08-06 01:20:45 +00:00
f7a66da5f9 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace, which has [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) paths. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
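
The call path can be pictured with a small Python analogy. Everything here is hypothetical (`FakeCachingAllocator`, the registry and the free functions); the real classes are C++ and may be organized differently. The sketch only shows a base class plus a per-device lookup that a device-agnostic entry point dispatches through.

```python
from abc import ABC, abstractmethod


class DeviceAllocator(ABC):
    """Base interface shared by all device allocators in this sketch."""

    @abstractmethod
    def empty_cache(self) -> None: ...

    @abstractmethod
    def memory_allocated(self) -> int: ...


class FakeCachingAllocator(DeviceAllocator):
    """Hypothetical allocator that tracks in-use and cached bytes."""

    def __init__(self, allocated: int = 0, cached: int = 0):
        self.allocated = allocated
        self.cached = cached

    def empty_cache(self) -> None:
        self.cached = 0  # release cached blocks, keep live allocations

    def memory_allocated(self) -> int:
        return self.allocated


_ALLOCATORS: dict[str, DeviceAllocator] = {}


def register_device_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _ALLOCATORS[device_type]


def empty_cache(device_type: str) -> None:
    """Device-agnostic entry point: look up the allocator, then dispatch."""
    get_device_allocator(device_type).empty_cache()
```

Callers only need the device type, so the per-backend if-else branches move into the registry lookup.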

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-06 00:40:29 +00:00
3eb3da9b4b [dynamo][guards] Skip ID_MATCH guard on self.__class__.__closure__ (#159888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159888
Approved by: https://github.com/williamwen42
2025-08-06 00:36:43 +00:00
3ddfd46bd2 Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-08-06 00:29:56 +00:00
6a82da392e [export] Fix generated schema for C++20/23 (#159871)
Summary: Fixing the issue from https://github.com/pytorch/pytorch/issues/159838

Test Plan:
buck run caffe2/:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Rollback Plan:

Differential Revision: D79647167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159871
Approved by: https://github.com/malfet
2025-08-06 00:23:05 +00:00
22bedc429f Extract some HOP utils to be importable (#159705)
Useful helper function for stage 1 export -> manual partitioner -> stage 2 compile users

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159705
Approved by: https://github.com/zou3519
ghstack dependencies: #159134
2025-08-05 23:59:47 +00:00
49abc0e3f8 [Take 2] Setup TorchBench in Docker (#159300)
Fix and reland https://github.com/pytorch/pytorch/pull/158613. I keep `checkout_install_torchbench` in the `.ci/pytorch/macos-test.sh` script because it's still used there, and there is no Docker.

### Testing

MacOS perf nightly run https://github.com/pytorch/pytorch/actions/runs/16580798470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159300
Approved by: https://github.com/ZainRizvi
2025-08-05 23:47:42 +00:00
1052604acd fix logging setup issue for Windows.. (#159887)
When we set up the logging config following the guide at https://docs.pytorch.org/docs/stable/logging.html,
for example:
    TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```cmd
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
On Linux, the listing shows quotes by default; on Windows it does not.
In addition, Windows adds quotes automatically when processing the env var.

On Linux, we get the value: "+schedule,+inductor,+output_code"
On Windows, we get the value: '"+schedule,+inductor,+output_code"'

So we need to remove the outer quotes on Windows.
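
A minimal sketch of that kind of normalization, not necessarily the code in this PR: `normalize_env_value` is a hypothetical helper, and it strips exactly one pair of surrounding double quotes, only on Windows.

```python
import sys


def normalize_env_value(value: str, platform: str = sys.platform) -> str:
    """Drop one pair of surrounding double quotes from a value read on Windows."""
    if platform == "win32" and len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1]
    return value


raw = '"+schedule,+inductor,+output_code"'
cleaned = normalize_env_value(raw, platform="win32")
```

Values read on other platforms are returned unchanged, so an already clean setting is not touched.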

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
2025-08-05 23:44:38 +00:00
fe8984a9f4 Set PYTHONHOME for inductor subprocesses using torch (#159382)
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`.  There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.

Test Plan: Local inference runs.

Reviewed By: aorenste

Differential Revision: D79124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
2025-08-05 23:32:48 +00:00
74a754aae9 Add meta kernel for sdpa_math_for_mps (#159695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695
Approved by: https://github.com/malfet
ghstack dependencies: #159456
2025-08-05 22:27:06 +00:00
b1ec088113 [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-05 22:27:06 +00:00
fb35a9ea4a [export] Improve error messages (#159881)
Originally, if loading the PT2 errored, we would try to load it using the old loader to handle BC issues. However, this hides the error message when an up-to-date PT2 fails to load for some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
2025-08-05 22:26:48 +00:00
8034b2a732 [inductor] Add TLParse artifact for logging runtime of collective and compute ops (#159730)
Summary:

- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage with testing compute and collective ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
2025-08-05 22:06:32 +00:00
64cc6f06b1 [Inductor] Revert minimal changes to avoid internal test failures (#159809)
The diff/PR https://github.com/pytorch/pytorch/pull/159211 caused a number of test failures for graph compiler (T232684410), and I haven't been able to figure out a forward fix so far. With this diff/PR, I'm proposing to revert the minimal set of changes needed to resolve the test failures.

I'll continue debugging, and will re-land the reverted changes once we find a forward fix.

Differential Revision: [D79221721](https://our.internmc.facebook.com/intern/diff/D79221721/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159809
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-05 22:05:26 +00:00
410812763b Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)"
This reverts commit bbc0df1094b5a4dcd2cce83f8402127b07913231.

Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))
2025-08-05 22:00:24 +00:00
bdb07a2bc5 [Cutlass] Allow offsets to be passed as arguments to kernel (#159761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159761
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #159760
2025-08-05 21:59:07 +00:00
8085edc8f9 [autograd] torch._C._set_view_replay_enabled state leaking into other tests (#159840)
This was causing view_fns to pop up in tests that ran after `TestAutograd.test_view_replay_enabled`, where it isn't used as a context manager. It is unclear to me why we would want `_force_original_view_tracking` to mutate global state on __init__ rather than on __enter__; changing that could be an alternative fix.
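
The alternative mentioned above would look roughly like the sketch below. This is a generic illustration with hypothetical names (`_view_replay_enabled`, `force_view_tracking`), not the autograd implementation: the global flag is changed only in __enter__ and restored in __exit__, so merely constructing the object cannot leak state into later tests.

```python
_view_replay_enabled = False  # hypothetical stand-in for the global flag


def view_replay_enabled() -> bool:
    return _view_replay_enabled


class force_view_tracking:
    """Hypothetical context manager that touches global state only while entered."""

    def __init__(self, mode: bool):
        self.mode = mode  # remember the request; nothing is mutated yet
        self.prev = None

    def __enter__(self):
        global _view_replay_enabled
        self.prev = _view_replay_enabled
        _view_replay_enabled = self.mode
        return self

    def __exit__(self, exc_type, exc, tb):
        global _view_replay_enabled
        _view_replay_enabled = self.prev  # always restore the previous value
        return False  # do not swallow exceptions
```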

FIXES https://github.com/pytorch/pytorch/issues/156306 https://github.com/pytorch/pytorch/issues/156289 https://github.com/pytorch/pytorch/issues/156265 https://github.com/pytorch/pytorch/issues/156209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159840
Approved by: https://github.com/albanD
2025-08-05 21:57:49 +00:00
882d50c5bf [C10] Add Scalar::isUnsigned() method (#159877)
Returns true if the Scalar holds an unsigned integral value, with the implications of the `Tag::HAS_u` semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159877
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-05 21:43:21 +00:00
b52a4d0821 [ez][CI] Remove some unused docker images (#159171)
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh

The only one I left is the vllm one because I'm pretty sure it's going to be used in the future

I assume everything not mentioned is old and we forgot to remove it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
2025-08-05 21:31:53 +00:00
a45a840926 [CI] Disable check-labels and check_mergeability (#159900)
See https://github.com/pytorch/pytorch/issues/159825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159900
Approved by: https://github.com/clee2000
2025-08-05 21:16:12 +00:00
9b953bb3fb [BE] Update TensorPipe pin (#159834)
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x and newer by including `cstddef` in a few headers
- Avoid using deprecated enums for MacOS builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
2025-08-05 20:45:09 +00:00
eb25a95a6e Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569)
With FSDP, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:

```
    When an operation mutates a buffer in-place, the scheduler creates a new buffer name
    to track the "before" and "after" states, even though they share the same memory.
    The mutated buffer represents a rename with zero allocation and deallocation cost.
    During dependency tracking, we transfer dependencies from the mutated name back to
    the original buffer, ensuring the original memory is only freed when all aliases
    are done.
    This handles cases where a buffer has multiple non-overlapping aliases - rather than
    trying to assign free costs to individual aliases, we forward all alias dependencies
    to the original buffer.
    Consider:
        buf0 = op0()
        buf1 = mutation_op_(buf0)
        del buf0
        ...
        op(buf1)
        del buf1
    The only memory events are the creation prior to op0, and the deletion following buf1.
```
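
The idea in that comment can be sketched as a small standalone function. This is an illustration under assumptions, not the scheduler code: `memory_events` and its argument layout are hypothetical, and steps are plain integers.

```python
def memory_events(allocs, mutation_of, last_use):
    """Return (step, delta_bytes) events, sorted by step.

    allocs:      name -> (alloc_step, size_bytes), real allocations only
    mutation_of: mutated name -> name it was derived from (a zero-cost rename)
    last_use:    name -> step of the last use of that name
    """
    def original(name):
        # Follow renames back to the buffer that owns the memory.
        while name in mutation_of:
            name = mutation_of[name]
        return name

    # The original is freed only after the last use of any of its aliases.
    free_step = {}
    for name, step in last_use.items():
        owner = original(name)
        free_step[owner] = max(free_step.get(owner, step), step)

    events = []
    for name, (alloc_step, size) in allocs.items():
        events.append((alloc_step, size))
        events.append((free_step.get(name, alloc_step), -size))
    return sorted(events)


events = memory_events(
    allocs={"buf0": (0, 1024)},
    mutation_of={"buf1": "buf0", "buf2": "buf0"},  # two aliases of buf0
    last_use={"buf0": 1, "buf1": 3, "buf2": 5},
)
```

In this example the 1024 bytes are counted once and released only at step 5, after the last alias dies.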

As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.

This PR also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
2025-08-05 19:58:11 +00:00
eqy
9884d0351e [CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522)
Otherwise we see `CUDA error: too many resources requested for launch`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522
Approved by: https://github.com/janeyx99
2025-08-05 19:26:25 +00:00
d7c83972d5 tools: Add mode to find python automatically (#159820)
Add support for automatically finding Python interpreters in manylinux
environments to our wheel building script. Scaffolding for sequential builds

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159820
Approved by: https://github.com/malfet
2025-08-05 19:19:22 +00:00
e06b110f73 [Testing] Add MPS to NATIVE_DEVICES (#153835)
This would allow me to eventually enable more opinfo tests against the MPS device. It was supposed to be a very simple change, but actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (needs to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`

<sarcasm> I love how well documented this variable is </sarcasm>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
2025-08-05 18:57:35 +00:00
0ba09a6d34 fix link for tutorial of inductor on windows (#159853)
fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853
Approved by: https://github.com/sekyondaMeta

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>
2025-08-05 18:37:47 +00:00
aeb5321b63 Allow controlling PG backend and options via init_device_mesh (#159371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol
2025-08-05 12:44:14 +00:00
625108ede2 [inductor] consolidate common GEMM triton param retrieval (#159383)
# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

# What

- pull out the common logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-08-05 11:42:25 +00:00
09e5a93fcb Improve graph output alias with subclass error message (#159619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159619
Approved by: https://github.com/albanD
2025-08-05 06:47:31 +00:00
908c5cc4c0 Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
2025-08-05 04:08:42 +00:00
c1145852a5 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312
2025-08-05 04:08:42 +00:00
ae1a706444 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
2025-08-05 04:08:04 +00:00
56d19a5ced Fix AllocatorConfig potential SIO issue (#159629)
# Motivation
As @ScottTodd identified in this [comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3141524874), using STL containers like `std::string` and `std::unordered_set` at static init time can cause static initialization order issues. This PR is based on and modified from his original PR: https://github.com/pytorch/pytorch/pull/159607. I’m stacking this PR here to help facilitate the landing and validation process.

Co-authored-by: @ScottTodd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159629
Approved by: https://github.com/ScottTodd, https://github.com/albanD
2025-08-05 04:07:51 +00:00
b6c53383fe [Dynamo][Better Engineering] Type annotation for torch/_dynamo/output_graph.py (#159602)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/output_graph.py`

Running
```
mypy torch/_dynamo/output_graph.py --linecount-report /tmp/coverage_log
```

| | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159602
Approved by: https://github.com/Skylion007
2025-08-05 03:50:54 +00:00
4fd5fabee9 skip XPU for dataloader CPU only unit test (#159811)
Fixes [#159802](https://github.com/pytorch/pytorch/issues/159802)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159811
Approved by: https://github.com/izaitsevfb
2025-08-05 03:44:01 +00:00
bbc0df1094 [Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but someone working with an alternative version of Triton may not match it. This moves the version check from Triton 3.4 to any variant that supports the TMA APIs.

Test Plan:
Relying on CI. Should be an NFC.

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D79378792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
2025-08-05 03:29:13 +00:00
33ec6e3e9a Remove pin on libuv from instructions (#159504)
This package doesn't exist at conda-forge and causes some confusion for users.
see https://anaconda.org/conda-forge/libuv/files?version=1.39.0

libuv is quite stable, so the newer versions should be fine. We build with them anyway at conda-forge.

see: https://github.com/conda-forge/libuv-feedstock/issues/80

Hopefully this can help future users.

Fixes https://github.com/conda-forge/libuv-feedstock/issues/80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159504
Approved by: https://github.com/seemethere
2025-08-05 03:18:42 +00:00
efc4b460b3 Add cascade sum support for Inductor CPP backend (#156296)
Fixes #154703

Add cascade summation support for Inductor CPP backend to improve precision for large size summation.

Currently, Inductor CPP performs the sum reduction directly. As shown in #154703, when the sum is large and the degree of parallelism is small, direct reduction causes an intolerable precision loss:
```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = tmp_acc0_vec + tmp0;
                    }
                }
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```

After adding cascade sum support:

```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            at::vec::Vectorized<float> masked_tmp_acc0_vec = at::vec::Vectorized<float>(0);
            CascadeSumHelper<float, 65536> scalar_cascade_helper0(static_cast<int64_t>(3000000000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> cascade_helper0(static_cast<int64_t>(187500000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> masked_cascade_helper0(static_cast<int64_t>(0L));
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = cascade_sum_combine(tmp0, &cascade_helper0);
                    }
                }
            }
            tmp_acc0 = cascade_sum_final(&scalar_cascade_helper0);
            tmp_acc0_vec = cascade_sum_final(&cascade_helper0);
            masked_tmp_acc0_vec = cascade_sum_final(&masked_cascade_helper0);
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec + masked_tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```
This will inevitably reduce performance when cascade sum is turned on.
For the case shown in #154703: performance reduced by ~3%.
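
Why cascading helps can be seen in a small, self-contained experiment. The sketch below is not the `CascadeSumHelper` used in the generated kernel; it is a one-level blocked variant written for illustration, with float32 rounding emulated through `struct` and a chunk size of 1024 chosen arbitrarily.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Single running accumulator, rounded to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def cascade_sum_f32(values, chunk=1024):
    """Sum fixed-size chunks separately, then sum the partial results.

    Each accumulator stays closer in magnitude to the values it absorbs,
    so less of every addend is rounded away.
    """
    total = 0.0
    for start in range(0, len(values), chunk):
        total = to_f32(total + naive_sum_f32(values[start:start + chunk]))
    return total


values = [to_f32(0.1)] * 200_000
exact = values[0] * len(values)  # exact in double precision for this input
naive_err = abs(naive_sum_f32(values) - exact)
cascade_err = abs(cascade_sum_f32(values) - exact)
print(f"naive error:   {naive_err:.6f}")
print(f"cascade error: {cascade_err:.6f}")
```

The extra bookkeeping per chunk is the kind of overhead that would account for a small slowdown like the one reported above.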

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-05 02:54:32 +00:00
1ca8388442 [BE][MPS] Remove unused size12 variable (#159832)
Fixes following compilation warning
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable]
  auto size12 = input_sizes[1] * input_sizes[2];
       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832
Approved by: https://github.com/dcci
2025-08-05 02:32:06 +00:00
b69497351d [nativert] force resize to zero. (#159683)
Summary:
This was quite a miserable bug. There are a few kernels that don't explicitly resize outputs to zero, which led to some weird UB.

Rollback Plan:

Differential Revision: D79476454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159683
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-08-05 02:25:31 +00:00
482f069c41 [C10D] fix slow init due to repeated dns resolution failure (#159596)
It can be very slow to repeatedly hit DNS resolution failure, but
it's very helpful to have DNS names in logs by default. So we try to use DNS,
but if we hit a transient failure we just disable it for the remainder of the
job, logging IP addresses instead.
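
The fallback policy can be sketched as follows. This is a minimal Python illustration, not the C10D code: the `HostnameResolver` class and its injectable `lookup` parameter are hypothetical, and `socket.gethostbyaddr` merely stands in for whatever reverse lookup the real implementation performs.

```python
import socket


class HostnameResolver:
    """Prefer DNS names in logs, but stop asking DNS after the first failure."""

    def __init__(self, lookup=socket.gethostbyaddr):
        self._lookup = lookup      # injectable so the policy can be tested offline
        self._dns_enabled = True

    def describe(self, ip: str) -> str:
        if self._dns_enabled:
            try:
                # gethostbyaddr returns (hostname, aliases, addresses)
                return self._lookup(ip)[0]
            except OSError:
                # Resolution failed: disable DNS for the remainder of the job
                # so later calls do not pay the timeout again.
                self._dns_enabled = False
        return ip
```

After one failed lookup every later call returns the raw IP address immediately, so the slow path is paid at most once per process.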

Fixes #159007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
2025-08-05 02:15:26 +00:00
85d931f29e Use uppercase OR when checking for system XNNPACK (#159527)
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.

---

For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:

```
CMake Error at cmake/Dependencies.cmake:566 (if):
  if given arguments:

    "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"

  Unknown arguments specified
Call Stack (most recent call first):
  CMakeLists.txt:868 (include)
```

Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
2025-08-05 02:10:53 +00:00
8a2f53c523 Recursively sync fbgemm submodules before build (#159477)
ROCm inductor benchmark builds are failing at the fbgemm build stage: https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z   389 |         x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```

It looks like asmjit fails to build. This seems to be because the fbgemm submodules are not updated after checking out a new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2025-08-05 02:00:54 +00:00
b59b61a099 Add avg_pool3d backward pass for MPS (#159089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089
Approved by: https://github.com/malfet
2025-08-05 01:55:38 +00:00
57ab39f7e4 Update torch-xpu-ops commit pin (#159621)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507), which includes:

- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
2025-08-05 01:46:15 +00:00
182975e01a [Dynamo] Enable torch function dispatch on HOPs (#159708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159708
Approved by: https://github.com/zou3519, https://github.com/XilunWu
ghstack dependencies: #159707
2025-08-05 01:43:22 +00:00
9f8cfe7476 [Dynamo] Fix arg ordering in tf modes (#159707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159707
Approved by: https://github.com/zou3519
2025-08-05 01:43:21 +00:00
e273ff028a Fix failing test (#159800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159800
Approved by: https://github.com/aorenste
2025-08-05 00:28:51 +00:00
5e0fc2c9a9 [AOTI] don't allow int32 indices if {non-inf, > int32_max} upper bound is provided (#159433)
**Motivation / Context**: (what I _think_ is happening here)

In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.

But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.

**Solution space**

Export allows users to specify which dimensions are dynamic, and to provide **ranges of valid sizes**.

One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.

However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.

To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`.

If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.
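
The decision rule described above can be written down as a small predicate. This is a sketch of the policy as the message states it, not the Inductor source: `can_use_int32_index` and its parameters are hypothetical names, and the real check operates on symbolic expressions rather than plain numbers.

```python
import math

INT32_MAX = 2**31 - 1


def can_use_int32_index(hint: int, upper_bound: float, aot_mode: bool) -> bool:
    """Decide whether an index may be stored in int32.

    hint        -- example value seen at compile time
    upper_bound -- upper end of the user-specified dynamic range (inf if unset)
    aot_mode    -- True for AOTI, where int32 guards are not re-checked later
    """
    if aot_mode and not math.isinf(upper_bound):
        # An explicit, finite range was given: trust it and ignore the hint.
        return upper_bound <= INT32_MAX
    # JIT (guards trigger recompilation) or no explicit upper bound:
    # keep relying on the example value to avoid a perf regression.
    return hint <= INT32_MAX


print(can_use_int32_index(1000, 2**40, aot_mode=True))      # False
print(can_use_int32_index(1000, math.inf, aot_mode=True))   # True
print(can_use_int32_index(1000, 2**40, aot_mode=False))     # True
```

The middle case is the deliberate compromise: with an unbounded range the hint still wins, which keeps existing AOTI users on the int32 path.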

Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-08-05 00:17:09 +00:00
bc4b04e058 DeviceCopy should have the same layout as input (#159615)
Summary: Fix https://github.com/pytorch/pytorch/issues/159612

- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```

Rollback Plan:

Differential Revision: D79411407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
2025-08-04 23:56:58 +00:00
6b414f56a4 Revert "[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)" (#159798)
This reverts commit 305a03727672de42870f956ddf4ad9fa424443e1.

Reason: causes device-side assertion failures when running with this repro (a minimized version of a failure seen in a real model)

```
import torch
def ri(inp, repeats, output_size):
    return torch.repeat_interleave(inp, repeats, output_size=output_size)
inp = torch.arange(0, 4, device="cuda").reshape(-1, 1)
x = torch.tensor([1, 2, 3, 4], device="cuda")
ri_c = torch.compile(ri)
print(ri(inp, x, 10))
print(ri_c(inp, x, 10))
```

which leads to errors like

```
/tmp/torchinductor_dberard/3h/c3hlb22fpptebupstsuhl6kexa6z3upgbnyxln7c24gfcr5747iu.py:30: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp5 < 4` failed.
```

Differential Revision: [D79591561](https://our.internmc.facebook.com/intern/diff/D79591561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159798
Approved by: https://github.com/danzimm
2025-08-04 23:39:20 +00:00
fb8f32ef52 Revert "[mps] Turn on inductor dynamic shapes tests (#159456)"
This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf.

Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))
2025-08-04 23:11:05 +00:00
7ba996bbaa [Cutlass] Fix wrapper code generation breakage (#159760)
Fixes issues introduced by https://github.com/pytorch/pytorch/pull/159355

The issue got past OSS CI because the H100 tag wasn't added. I'm not sure how to prevent these kinds of issues in the future; perhaps we should run H100 on Inductor PRs?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159760
Approved by: https://github.com/angelayi
2025-08-04 23:03:03 +00:00
ddbdcdc710 [cutlass backend][test] Expand FP8 tests to FP16 (#159538)
Differential Revision: [D79317343](https://our.internmc.facebook.com/intern/diff/D79317343/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159538
Approved by: https://github.com/mlazos
2025-08-04 23:01:55 +00:00
19f1f9960d [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-04 22:44:31 +00:00
fd6655a0f5 Feature: Implement support for cudnn_batch_norm_out kernel to replace the autogen approach. (#123020)
Fixes #115611

The autogen kernel may cause a redundant copy, so we implement the kernel directly to improve efficiency.

Test Case:

```c++
#include <torch/torch.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

int main() {
    auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA));
    auto weight = torch::randn({3}, torch::device(torch::kCUDA));
    auto bias = torch::randn({3}, torch::device(torch::kCUDA));
    auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA));
    auto running_var = torch::ones({3}, torch::device(torch::kCUDA));

    bool training = true;
    double exponential_average_factor = 0.1;
    double epsilon = 1e-5;

    auto output = torch::empty_like(input);
    auto save_mean = torch::empty({3}, torch::device(torch::kCUDA));
    auto save_var = torch::empty({3}, torch::device(torch::kCUDA));
    auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder

    at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve);
    auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon);

    bool is_close_output = torch::allclose(output, std::get<0>(outputs));
    bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs));
    bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs));
    bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs));

    std::cout << "Is output close: " << is_close_output << std::endl;
    std::cout << "Is save_mean close: " << is_close_save_mean << std::endl;
    std::cout << "Is save_var close: " << is_close_save_var << std::endl;
    std::cout << "Is reserve close: " << is_close_reserve << std::endl;

    return 0;
}
```

Please CC @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123020
Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD
2025-08-04 22:40:33 +00:00
a7f3bdf550 [Dynamo][Better Engineering] Type coverage for torch/_dynamo/utils.py (#159580)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/utils.py`

Running
```
mypy torch/_dynamo/utils.py --linecount-report /tmp/coverage_log
```

| | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159580
Approved by: https://github.com/williamwen42
2025-08-04 21:51:53 +00:00
510e8b4ae0 [inductor] use writable temp file on windows (#159738)
Use `WritableTempFile` on Windows, reference to: https://github.com/pytorch/pytorch/pull/159342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-08-04 21:51:02 +00:00
83ba3f1101 Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)"
This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4.

Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
1fad16aacb Revert "[inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)"
This reverts commit 444e2381d07a14cb501c00d11f9e63a3f1d2c86e.

Reverted https://github.com/pytorch/pytorch/pull/158983 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
444e2381d0 [inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983
Approved by: https://github.com/eellison
ghstack dependencies: #158758
2025-08-04 21:42:05 +00:00
6085bf7565 [inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)
Fixes #155121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang, https://github.com/eellison
2025-08-04 21:22:11 +00:00
8201dbf4bc check driver to be >=12.4 to use fabric handles (#159697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159697
Approved by: https://github.com/malfet
2025-08-04 21:05:39 +00:00
26d045bb60 Linux py 3.14 wheel builds (#157559)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157559
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-04 20:55:19 +00:00
356ac3103a Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)"
This reverts commit 310f901a71e53688866b14bb2f2b4c8eef9979b3.

Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))
2025-08-04 20:37:39 +00:00
d4109a0f99 [MPS] Add max_unpool1d/2d/3d (#159789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789
Approved by: https://github.com/malfet
2025-08-04 20:00:59 +00:00
7ea789ccfb Revert #156868: Bring back symint check for sharding propagation cache (#159671)
Fixes #159601

Unfortunately #156868 introduced a couple of regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` check will instead be replaced with the symint preprocessing step during sharding prop post init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
2025-08-04 19:58:48 +00:00
7e8197e34d Revert "Migrate ScalarType to headeronly (#159416)"
This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449.

Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))
2025-08-04 19:55:09 +00:00
50eac811a6 [typing] Constrain OrderedSet generic to be Hashable (#159684)
Ran across this typing bug while creating an OrderedSet from a type I didn't realize wasn't hashable, which failed at runtime. With this constraint, typing would've failed pre-runtime.
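
The idea of the constraint can be shown with a toy container. This is a simplified stand-in, not the `OrderedSet` from the PyTorch codebase; only the `bound=Hashable` on the type variable reflects what the message describes.

```python
from typing import Dict, Generic, Hashable, Iterable, Iterator, TypeVar

# Bounding T by Hashable lets a static checker reject unhashable element types.
T = TypeVar("T", bound=Hashable)


class OrderedSet(Generic[T]):
    """Insertion-ordered set backed by a dict (toy version)."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        self._items: Dict[T, None] = dict.fromkeys(items)

    def add(self, item: T) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


names = OrderedSet(["b", "a", "b", "c"])
print(list(names))  # ['b', 'a', 'c']

# Without the bound, a line like the following passes type checking and only
# fails at runtime with "TypeError: unhashable type: 'list'". With the bound,
# a checker such as mypy is expected to flag it before the code ever runs.
# bad = OrderedSet([[1, 2], [3]])
```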

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159684
Approved by: https://github.com/Skylion007
2025-08-04 18:08:01 +00:00
4e0f179d0b Update the signature and test of torch.hamming_window() (#152682)
Fixes #146590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152682
Approved by: https://github.com/albanD
2025-08-04 17:50:42 +00:00
36e59d9b12 [c10d][nvshmem] fix missing override compilation error for nvshmem symmetric code (#159557)
Summary:
Fix error when compiling nvshmem code section `NVSHMEMSymmetricMemory.cu` with BUCK

```
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu:154:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  154 | virtual at::Tensor get_buffer(int
      |                    ^
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
   56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
```

Test Plan:
Build test + CI

Rollback Plan:

Differential Revision: D78813586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159557
Approved by: https://github.com/kwen2501
2025-08-04 17:46:30 +00:00
fc340d0ca3 [export] Allow comparing device w/o index with device w/ index (#159665)
In the case where we have expected device "cuda" and given device "cuda:0" I think we should succeed?
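
One way to express such a relaxed comparison is sketched below on plain device strings. This is an assumption-laden illustration rather than the export code: `parse_device` and `devices_compatible` are hypothetical helpers, and treating a missing index as matching any index is one possible policy among several.

```python
from typing import Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'cuda:0' into ('cuda', 0) and 'cuda' into ('cuda', None)."""
    kind, sep, index = spec.partition(":")
    return kind, int(index) if sep else None


def devices_compatible(expected: str, given: str) -> bool:
    """Device types must match; indices are compared only if both are present."""
    exp_kind, exp_index = parse_device(expected)
    giv_kind, giv_index = parse_device(given)
    if exp_kind != giv_kind:
        return False
    return exp_index is None or giv_index is None or exp_index == giv_index


print(devices_compatible("cuda", "cuda:0"))    # True
print(devices_compatible("cuda:0", "cuda:1"))  # False
```
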
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665
Approved by: https://github.com/yushangdi
2025-08-04 17:00:07 +00:00
53e47af0f7 [dynamo][guards] Read the attr name from GetAttrGuardAccessor (#159754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159754
Approved by: https://github.com/jansel
ghstack dependencies: #159752
2025-08-04 16:51:27 +00:00
66ad881fc7 [dynamo][guards][refactor] Simplify type extraction from GuardManager (#159752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159752
Approved by: https://github.com/jansel
2025-08-04 16:51:27 +00:00
1d3eef27ac [ROCm CI] Migrate to MI325 Capacity (#159649)
Migrate mi300s to gfx942.

Related to https://github.com/pytorch/pytorch/pull/159059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159649
Approved by: https://github.com/huydhn
2025-08-04 16:48:12 +00:00
dd95900cec [AOTI] normalize_path_separator file path for Windows. (#159726)
Apply `normalize_path_separator` to the file path on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159726
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:57:19 +00:00
1cdd665526 fix test_verbose_logs_dynamic_shapes with MSVC (#159573)
The `typeid` operator produces different output under different compilers. There is a good example in [cppreference](https://www.en.cppreference.com/w/cpp/language/typeid.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159573
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:56:53 +00:00
7cb2dcd2dd [c10d][nvshmem] modify is_nvshmem_available runtime check to work with static-linked library (#159558) (#159561)
Summary:

Currently this function relies on the assumption that we link `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When `libnvshmem.a` (which combines the two) is linked statically, this check fails. Add a section that checks at runtime whether a symbol from the host API exists, to detect the case where nvshmem is linked statically.

Test Plan:
CI + sample run

Rollback Plan:

Differential Revision: D79177525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
2025-08-04 15:40:29 +00:00
e5a81aa7ba Fix conversion of values in libtorch agnostic tests (#155115)
Due to the different byte order,
when copying data, it has to be put into the last bytes to ensure that an int32_t converted to int64_t keeps the same value. The same has to be done when it is converted back.
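The byte-order point can be illustrated with a short sketch. It assumes a non-negative 32-bit value copied into a zeroed 8-byte buffer; the helper `copy_int32_into_int64` is a hypothetical name and is not the code from the test extension.

```python
def copy_int32_into_int64(value: int, byteorder: str, at_end: bool) -> int:
    """Copy the 4 bytes of a non-negative int32 into a zeroed 8-byte buffer
    and read the buffer back as an int64 in the same byte order."""
    raw = value.to_bytes(4, byteorder, signed=True)
    pad = bytes(4)
    buf = pad + raw if at_end else raw + pad
    return int.from_bytes(buf, byteorder, signed=True)


# Big-endian (as on s390x): the value survives only in the last 4 bytes.
big_last = copy_int32_into_int64(7, "big", at_end=True)
big_first = copy_int32_into_int64(7, "big", at_end=False)
# Little-endian: the value survives in the first 4 bytes.
little_first = copy_int32_into_int64(7, "little", at_end=False)
```

Here `big_last` and `little_first` are both 7, while `big_first` is `7 << 32`.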

This change fixes test
TestLibtorchAgnosticCPU::test_my_ones_like_cpu
from
cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
2025-08-04 13:40:22 +00:00
3e2aa4b0e3 Update pin to include Python 3.14 support (#159725)
Update the Triton pin to the top of the rel/3.4 branch: https://github.com/triton-lang/triton/tree/rel/3.4. This is the same as the release/3.4.x branch but also includes Python 3.14 support.

This should unblock enablement of Python 3.14 support in this PR: https://github.com/pytorch/pytorch/pull/157559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159725
Approved by: https://github.com/davidberard98
2025-08-04 13:30:12 +00:00
6646461764 S390X: fix detection of magic number placeholder in inductor (#157784)
This change fixes multiple tests in
test/inductor/test_aot_inductor_arrayref.py
such as
test_cond_with_parameters_cpu_with_stack_allocation,
test_issue_140766_cpu_with_stack_allocation,
test_model_modified_weights_cpu_with_stack_allocation,
test_nested_tensor_from_jagged_cpu_with_stack_allocation.

Enable tests in test/inductor/test_aot_inductor_arrayref.py

This change is split off from https://github.com/pytorch/pytorch/pull/150116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
2025-08-04 12:42:31 +00:00
f74da2a136 [xla hash update] update the pinned xla hash (#159758)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159758
Approved by: https://github.com/pytorchbot
2025-08-04 11:21:45 +00:00
eqy
d35b27dde5 [CUDA] Add some more missing @serialTest decorators (#159672)
Seems to fix #159663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159672
Approved by: https://github.com/Skylion007
2025-08-04 07:44:35 +00:00
a9dc1566d4 [MTIA Aten Backend] Migrate arange.start_out (#159540)
Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540
Approved by: https://github.com/malfet, https://github.com/nautsimon
2025-08-04 07:38:05 +00:00
33a1996714 Fix perf downgrad by reverting template use in use_mkldnn_matmul (#159024)
This PR fixes the performance downgrade by reverting the template use in `use_mkldnn_matmul` introduced in #157520. Fixes https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024
Approved by: https://github.com/mingfeima
2025-08-04 05:49:46 +00:00
ee62177c19 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159534
2025-08-04 05:12:44 +00:00
64cbaa876c [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
2025-08-04 05:12:44 +00:00
4516c59f5f [dynamo][source] Add special source for __code__ and __closure__ (#159722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159722
Approved by: https://github.com/jansel
2025-08-04 05:02:05 +00:00
8bc843a9ec [vllm hash update] update the pinned vllm hash (#159610)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159610
Approved by: https://github.com/pytorchbot
2025-08-04 04:06:09 +00:00
e39a62c70d Fix warnings in triton_helpers.py (#159719)
```
  /home/jansel/pytorch/torch/_inductor/runtime/triton_helpers.py:152: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '|' instead
    equal |= a_isnan and b_isnan
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159719
Approved by: https://github.com/Skylion007
2025-08-04 03:21:09 +00:00
978e3a9142 refresh expected results (#159727)
Just a regular update due to recent <10% changes; CI is stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727
Approved by: https://github.com/anijain2305
2025-08-03 22:47:50 +00:00
e2a5c42e7e [BE][MPS] Build metal kernels of MacOS-14+ (#159733)
Which makes the `#if __METAL_VERSION__ >= 310` guards for `bfloat` support unnecessary.
Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic.

Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731, #159732
2025-08-03 20:53:58 +00:00
5116c49b52 [BE] Remove macos-13 guard from bench_mps_ops (#159732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732
Approved by: https://github.com/dcci
ghstack dependencies: #159731
2025-08-03 20:53:58 +00:00
fecdebe385 [CI][MPS] Fix compile benchmark correctness (#159731)
By passing `fullgraph=True` attribute and increasing cache size limit to 2**16

Otherwise, compiler might decide not to fall back to eager to avoid recompilations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731
Approved by: https://github.com/dcci
2025-08-03 20:53:50 +00:00
e136a9175b [BE] Fix dev warning in Dependencies.cmake (#159702)
Namely
```
CMake Warning (dev) in cmake/Dependencies.cmake:
  A logical block opening on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if)

  closes on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif)

  with mis-matching arguments.
```

Introduced by https://github.com/pytorch/pytorch/pull/143846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-03 18:45:07 +00:00
9a680e14b7 [bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723)
The previous implementation created `n_gpu * n_tensors` intermediate tensors, which added a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`.

This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
2025-08-03 09:16:55 +00:00
805a102beb Revert "[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)"
This reverts commit 1616777cd2a3170ff76afa3e7860b0969420c445.

Reverted https://github.com/pytorch/pytorch/pull/159534 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
6e8d705a22 Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696)"
This reverts commit be71000ff5292293d1976f313218e2df4d5046d3.

Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
9c18901bfd [MTIA Aten Backend] Migrate all.out (#159539)
Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539
Approved by: https://github.com/malfet
ghstack dependencies: #159098
2025-08-03 02:08:35 +00:00
a29ed5e1ac Add torch compile force disable caches alias (#158072)
A bunch of people keep thinking the current alias only disables the inductor cache because it has the name inductor in it. Let's globalize the name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-08-02 23:23:17 +00:00
d2792f51b2 [bucketing] Use max of input/output size for bucketing (#159717)
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it GPU-dependent and less intuitive. This PR proposes to use the max of the input and output sizes instead, so that one can use the same bucket_size value for both passes.
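A small arithmetic sketch of why the max makes the two passes symmetric. The function `bucketing_size` and the byte counts are hypothetical and only illustrate the heuristic described above, not the actual bucketing pass.

```python
def bucketing_size(input_bytes: int, output_bytes: int) -> int:
    """Size used for bucketing: the larger of the input and output sizes."""
    return max(input_bytes, output_bytes)


n_gpu = 8
shard_bytes = 1024  # per-rank shard size, chosen arbitrarily

# all_gather: small input (one shard), large output (n_gpu shards).
all_gather_size = bucketing_size(shard_bytes, shard_bytes * n_gpu)
# reduce_scatter: large input (n_gpu shards), small output (one shard).
reduce_scatter_size = bucketing_size(shard_bytes * n_gpu, shard_bytes)
```

Both sizes come out equal, so a single `bucket_size` threshold treats the two collectives the same way regardless of `n_gpu`.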

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
2025-08-02 22:42:22 +00:00
be71000ff5 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159186, #159534
2025-08-02 21:40:38 +00:00
3f86076775 gc before warming up benchmarking (#159670)
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
2025-08-02 19:37:24 +00:00
1616777cd2 [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
ghstack dependencies: #159186
2025-08-02 18:04:35 +00:00
38895c0ac2 Update RuntimeError message in is_nonzero(input) method from bool to Boolean (#159712)
RuntimeError message updated in is_nonzero(input) method from bool to Boolean.

**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)

**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)

**Existing Error message in documentation:**

for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous

**Proposed Error message in documentation:**

for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Fixes #159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
2025-08-02 17:23:45 +00:00
310f901a71 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-08-02 16:38:54 +00:00
e11b1cd97e [ROCm] fix nightly wheel due to rocBLAS environment variable (#159570)
Fixes #159070

The TunableOp failure is due to missing rocBLAS files in our manywheels packaging. This bug has been present since June 7-8 time frame. It was caused by a typo in the rocBLAS environment variable that stores the list of files. It was introduced in this PR: https://github.com/pytorch/pytorch/pull/155388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159570
Approved by: https://github.com/malfet
2025-08-02 06:54:43 +00:00
b599d91738 Log autotune choices and benchmark result to scuba/chrome trace (#159496)
Summary:
Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Differential Revision: D79235037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496
Approved by: https://github.com/masnesral
2025-08-02 05:34:17 +00:00
fd6a6658c3 Enable _int_mm on Intel GPU (#157769)
# Motivation

This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.

# Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext

Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-08-02 05:16:01 +00:00
04973496a8 [audio hash update] update the pinned audio hash (#159611)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159611
Approved by: https://github.com/pytorchbot
2025-08-02 05:15:47 +00:00
1548b011ea Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Differential Revision: [D79472604](https://our.internmc.facebook.com/intern/diff/D79472604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-08-02 03:54:41 +00:00
e57a92734d [export] Fix nn_module_stack of assert_tensor_metadata nodes (#159625)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159625
Approved by: https://github.com/yushangdi
2025-08-02 02:52:42 +00:00
79ff3b320b Back out "[ez] get rid of unused var" (#159677)
Summary: Turns out I added this to reduce how often we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. Oops.

Test Plan:
backout

Rollback Plan:

Differential Revision: D79474114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
2025-08-02 01:50:16 +00:00
426f249f20 Fix launch grid calculation (#159497)
Summary:

The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages.

In the FXIR backend we negate this behaviour and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We do not directly change the original computation, as that leads to a performance degradation.
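The trick and its language dependence can be sketched as follows. The function names are hypothetical, and `trunc_div` only emulates C-style truncating division for comparison; none of this is the FXIR backend code.

```python
def ceil_div_via_floor(a: int, b: int) -> int:
    """Python trick: floor division of the negated numerator, negated again."""
    return -(-a // b)


def trunc_div(a: int, b: int) -> int:
    """Emulates integer division that truncates toward zero (as in C)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def ceil_div(a: int, b: int) -> int:
    """Explicit ceiling division for positive b, independent of rounding mode."""
    q, r = divmod(a, b)
    return q + (1 if r else 0)


python_result = ceil_div_via_floor(7, 2)   # relies on floor semantics
truncating_result = -trunc_div(-7, 2)      # same expression, truncating semantics
explicit_result = ceil_div(7, 2)
```

With floor semantics the trick yields 4, with truncating semantics the same expression yields 3, and the explicit form yields 4 in both cases.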

Test Plan:
CI

Rollback Plan:

Differential Revision: D79275534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
2025-08-02 01:12:58 +00:00
d33a484763 Use boxed_nop_preserve_node_meta for aot_export_joint_with_descriptors (#159545)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159545
Approved by: https://github.com/xmfan, https://github.com/wconstab
ghstack dependencies: #159336, #159337
2025-08-02 00:33:41 +00:00
a81ffbc5f5 improve shape checks for grouped_mm (#159666)
Check that contraction dimension matches between tensors if it's known, and do device-side checks for correct offsets
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666
Approved by: https://github.com/danielvegamyhre, https://github.com/eqy
2025-08-02 00:12:25 +00:00
465fe4d9f7 Enable sample nightly PT2 benchmark on B200 (#158011)
Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200

### Testing

https://github.com/pytorch/pytorch/actions/runs/16615101382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-01 23:47:44 +00:00
9477af1063 fix compilation on cuda < 12.3 (#159657)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159657
Approved by: https://github.com/kwen2501
2025-08-01 23:40:55 +00:00
dcc36e38bb [Graph Breaks] Remove unsupported Additional Info field (#159658)
A race condition when landing PR #158800 caused us to add this field even though it is deprecated, so remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159658
Approved by: https://github.com/williamwen42
2025-08-01 23:25:50 +00:00
efd78584a8 [EZ] Add linux-aarch64.yml workflow to the viable/strict blocking set (#159668)
Since it's required to be run on every PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159668
Approved by: https://github.com/malfet
2025-08-01 23:19:08 +00:00
135762ea20 Unpin helion (#159579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159579
Approved by: https://github.com/jansel
2025-08-01 23:08:06 +00:00
e2ee9cfaa2 [NativeRT] Turn on enableStaticCPUKernels by default (#159422)
Summary: As title.

Test Plan:
Needs manual testing on production models.

Rollback Plan:

Differential Revision: D78747742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159422
Approved by: https://github.com/dolpm
2025-08-01 22:27:07 +00:00
06d28de17a Update CK Kernel generation and update ck submodule (#157964)
Changes required to reduce the number of CK kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 being merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964
Approved by: https://github.com/842974287
2025-08-01 22:24:27 +00:00
df9720b8b5 [MTIA Aten Backend] Migrate all foreach ops (#159098)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate all foreach operators to in-tree, including:
  - _foreach_abs
  - _foreach_abs_
  - _foreach_add.List
  - _foreach_add_.List
  - _foreach_add_.Scalar
  - _foreach_add_.Tensor
  - _foreach_addcmul.Scalar
  - _foreach_addcmul_.Scalar
  - _foreach_copy
  - _foreach_copy_
  - _foreach_mul.List
  - _foreach_mul_.List
  - _foreach_mul_.Scalar
  - _foreach_mul.Tensor
  - _foreach_mul_.Tensor
  - _foreach_norm.Scalar
  - _foreach_sqrt_

Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098
Approved by: https://github.com/malfet
2025-08-01 22:10:12 +00:00
85e74d5ace [inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.

- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
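The steps listed above can be sketched in plain Python. `FakeNode` and its `is_collective` flag are hypothetical stand-ins for scheduler nodes and the `_CollectiveKernel` check; the real change goes through PyTorch's structured trace logging, which is not reproduced here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a scheduler node."""
    python_kernel_name: str
    is_collective: bool


def collective_schedule(nodes: List[FakeNode]) -> List[str]:
    """Keep only collective nodes and return their kernel names in order."""
    return [n.python_kernel_name for n in nodes if n.is_collective]


nodes = [
    FakeNode("torch.ops.aten.mm.default", False),
    FakeNode("torch.ops._c10d_functional.all_reduce.default", True),
    FakeNode("torch.ops._c10d_functional.wait_tensor.default", True),
]
schedule = collective_schedule(nodes)
payload = json.dumps(schedule)  # what a collective_schedule.json dump could hold
```

Because the list preserves scheduler order, comparing the payloads of two ranks is enough to spot a mismatch in collective ordering.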

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
2025-08-01 21:51:42 +00:00
0450f05658 Output tensor meta data for FX graph node (#159311)
The FX graph segment in CompiledFxGraph does not include tensor metadata such as tensor shape, stride, data type, and device. The AI system co-design team requested that this information be included in the FX graph segment so that they can use it to project performance on different hardware.
This DIFF is to modify the Graph::Node::format_node to include tensor meta data.
Before this DIFF, the triton kernel FX graph segment looks like the following:
```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
```
After this DIFF:
```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
```
If format_node can not be changed, I can copy the code to caffe2/torch/_inductor/utils.py.
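As an illustration of how a downstream tool might consume the annotation shown above, here is a minimal parsing sketch. The regular expression and the `TensorMeta` type are assumptions derived only from the example strings (such as `f32[4, 4][1, 4]cuda:0`), not from a documented format.

```python
import re
from typing import NamedTuple, Tuple


class TensorMeta(NamedTuple):
    dtype: str
    shape: Tuple[int, ...]
    stride: Tuple[int, ...]
    device: str


# Assumed layout: <dtype>[<shape>][<stride>]<device>
_META_RE = re.compile(
    r"^(?P<dtype>\w+)\[(?P<shape>[^\]]*)\]\[(?P<stride>[^\]]*)\](?P<device>.+)$"
)


def _ints(text: str) -> Tuple[int, ...]:
    """Parse '4, 4' into (4, 4); an empty string gives an empty tuple."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_tensor_meta(text: str) -> TensorMeta:
    match = _META_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized tensor metadata: {text!r}")
    return TensorMeta(
        dtype=match["dtype"],
        shape=_ints(match["shape"]),
        stride=_ints(match["stride"]),
        device=match["device"],
    )


meta = parse_tensor_meta("f32[4, 4][1, 4]cuda:0")
```

For the permuted tensor in the example, this gives shape `(4, 4)` with stride `(1, 4)` on `cuda:0`.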

Differential Revision: D77973076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159311
Approved by: https://github.com/angelayi
2025-08-01 21:40:29 +00:00
595a65f5c2 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/script_object.py (#159343)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159343
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-08-01 21:30:41 +00:00
8c6c2e40eb Edit a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend (#159542)
As suggested in the pull request #158903 by @H-huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159542
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
2025-08-01 21:20:25 +00:00
32840d19f9 [cutlass backend] skip stream k if shape is dynamic (#159442)
Differential Revision: [D79229210](https://our.internmc.facebook.com/intern/diff/D79229210/)

The motivation is that the workspace size is hard to determine and varies with shape. I observed that the workspace can sometimes increase even when the shape gets smaller, so it is hard to give an upper bound.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159442
Approved by: https://github.com/ColinPeppler
2025-08-01 20:42:24 +00:00
2040f00112 [BE][Easy] respect os.environ in subprocess calls in tools/nightly.py (#159572)
Respect parent shell's envvars, such as `UV_INDEX_STRATEGY`, `http{,s}_proxy`, etc.
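A minimal sketch of the pattern, assuming the goal is to layer a few overrides on top of the inherited environment. `run_with_parent_env` and the `GR_*` variable names are hypothetical and are not taken from `tools/nightly.py`.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def run_with_parent_env(
    cmd: List[str], extra_env: Optional[Dict[str, str]] = None
) -> subprocess.CompletedProcess:
    """Run cmd with the parent's environment plus optional overrides."""
    env = {**os.environ, **(extra_env or {})}
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


# Simulate a variable set in the parent shell, then check the child sees both.
os.environ["GR_PARENT_VAR"] = "parent"
child_code = "import os; print(os.environ['GR_PARENT_VAR'], os.environ['GR_EXTRA'])"
result = run_with_parent_env([sys.executable, "-c", child_code], {"GR_EXTRA": "extra"})
```

Passing `env={"GR_EXTRA": "extra"}` alone would instead replace the whole environment and drop variables such as proxy settings.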

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159572
Approved by: https://github.com/Skylion007
2025-08-01 20:40:31 +00:00
c137f9da0b [Dynamo][Better Engineering] Add type coverage to dynamo/compiled_autograd.py (#159518)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo.

This PR adds strict typing support to `torch/_dynamo/compiled_autograd.py`

Running
```
mypy torch/_dynamo/compiled_autograd.py --linecount-report /tmp/coverage_log
```

|  | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 425 | 1553 | 27.37% | 17 | 62 | 27.42% |
| This PR | 1623 | 1623 | 100.00% | 62 | 62 | 100.00% |
| Delta | +1198 | +70 | +72.63% | +45 | 0 | +72.58% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159518
Approved by: https://github.com/xmfan
2025-08-01 20:24:58 +00:00
5e8b95605f [PP] Support OVERLAP_F_B computation type (#158978)
Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591)

The IR looks like:

```
[0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9]
[1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9]
[2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9]
[3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9]
```

In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does.
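The "two separate operations" behaviour can be sketched on the textual IR shown above. The helper `expand_action` and its regular expression are hypothetical and operate on strings only; the actual schedule code works on action objects.

```python
import re
from typing import List

# Matches entries such as "(0F7;7B3)OVERLAP_F_B" from the IR listing.
_OVERLAP_RE = re.compile(r"^\((?P<first>[^;()]+);(?P<second>[^;()]+)\)OVERLAP_F_B$")


def expand_action(action: str) -> List[str]:
    """Split an OVERLAP_F_B entry into its two sub-actions; pass others through."""
    match = _OVERLAP_RE.match(action.strip())
    if match is None:
        return [action.strip()]
    return [match["first"], match["second"]]


row = ["7F3", "(0F7;7B3)OVERLAP_F_B", "(7F4;0B0)OVERLAP_F_B", "7B6"]
flat = [sub for action in row for sub in expand_action(action)]
```

The flattened row runs each forward and backward in sequence, which matches the "no actual overlap" behaviour described for this PR.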

814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978
Approved by: https://github.com/wconstab
2025-08-01 20:22:30 +00:00
8ea86a6e31 Actually test STD_TORCH_CHECK, add testfile to CMake (#159603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159603
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-08-01 19:53:41 +00:00
acad808545 Revert "[inductor] consolidate common GEMM triton param retrieval (#159383)"
This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443.

Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))
2025-08-01 19:49:21 +00:00
c687446374 Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit 2c46922ce4b33c39b1c48c302604805510a3f889.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to breaking internal test ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3145541845))
2025-08-01 19:19:51 +00:00
dd22ba09b4 [C10D] Document barrier interaction with device_id (#159389)
Addresses #159262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389
Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj
2025-08-01 18:12:21 +00:00
c0e0126399 Remove unused input parameter in ExpandableSegment (#159356)
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `ExpandableSegment` is defined in cpp file, so it should be safe to make this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
2025-08-01 17:47:51 +00:00
e4b123b5e4 Revert direct updates (#159654)
reverts:
```

commit 5711a8f06948eeee56ed5f53f171fa519f78491c (tag: trunk/5711a8f06948eeee56ed5f53f171fa519f78491c, origin/main, main)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:32:52 2025 -0700

    Update test_utils.py

commit b4b71d011ed07a41c2086ff0dec2988a63662877 (tag: trunk/b4b71d011ed07a41c2086ff0dec2988a63662877)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:27:54 2025 -0700

    Update utils.py

commit 52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d (tag: trunk/52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:26:05 2025 -0700
```

(commits pushed directly to main by mistake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159654
Approved by: https://github.com/atalman
2025-08-01 16:54:51 +00:00
5711a8f069 Update test_utils.py 2025-08-01 09:32:52 -07:00
b4b71d011e Update utils.py 2025-08-01 09:27:54 -07:00
52376b9b6f Update convert_frame.py 2025-08-01 09:26:05 -07:00
1371a98b0e Migrate ScalarType to headeronly (#159416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416
Approved by: https://github.com/albanD
ghstack dependencies: #159415, #159411
2025-08-01 16:07:01 +00:00
2a286cbdf4 Allow register_buffer with Tensor-like object (#159455)
As torch allows extending the tensor with `__torch_function__`, it would be desirable to allow registering it as a buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159455
Approved by: https://github.com/mikaylagawarecki
2025-08-01 15:31:38 +00:00
7c37b8e1e0 [ROCm][Windows] Switch __builtin_clz ifdef from WIN32 to MSC_VER. (#159273)
PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler so it should be checking for MSC_VER, not just WIN32. The change here is similar to https://github.com/pytorch/pytorch/pull/146606.

This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140):
```
[7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj
FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe  /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. -IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl 
-imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp
clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument]
clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument]
In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36:
B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded
   14 | inline unsigned int __builtin_clz(unsigned int x) {
      |        ~~~~~~~~~~~~ ^
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept'
   60 |   return 31 - __builtin_clz((unsigned int)__A);
      |               ^
1 error generated.
[7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj
```

> [!NOTE]
> I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct.

There are a few other variations on fixes to this floating around, such as:
* a97a957af0/lz4.c (L34-L43) (checking with `__has_builtin`)
* c98c55ec7e/lj92.c (L31-L46) (the same code as here, but with `_MSC_VER`)
* 2760e5a2bb/def.h (L23-L25) (using `__lzcnt` instead of a custom implementation)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273
Approved by: https://github.com/Skylion007, https://github.com/m-gallus
2025-08-01 15:21:26 +00:00
ee2649219c Fix max_width computation in _tensor_str._Formatter (#126859)
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.

Now, the code first checks whether it should be in sci_mode, then computes `max_width`.

Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])

print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```

In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([   10.0000,     0.1000,     0.0100]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```

One can see that with `sci_mode=False`, the values of A are prefixed with unneeded padding and do not have the same `max_width` as B (A keeps the `max_width` from `sci_mode = None`).

Also, with `sci_mode = True`, the `max_width` for B is 7 even though each value takes 10 characters. (This is harmless in practice because the code that uses `max_width` does not rely much on it, but it is still misleading.)

After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000,  0.1000,  0.0100]) Formatter max_width: 7
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```

This also aligns A with B for `sci_mode=False`.
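
The ordering fix can be sketched in plain Python. This is a simplified illustration, not the `_Formatter` code: `max_width` here is a hypothetical standalone helper, the thresholds for the automatic sci-mode decision are assumptions, and it works on Python floats rather than tensors.

```python
def max_width(values, sci_mode=None, precision=4):
    """Width of the widest formatted value, deciding sci_mode before measuring."""
    nonzero = [abs(v) for v in values if v != 0]
    if sci_mode is None:
        # Assumed heuristic: use scientific notation for wide value ranges.
        sci_mode = bool(nonzero) and (
            max(nonzero) / min(nonzero) > 1000.0
            or max(nonzero) > 1.0e8
            or min(nonzero) < 1.0e-4
        )
    # The notation is fixed first, so the width always matches what is printed.
    fmt = "{:.{p}e}" if sci_mode else "{:.{p}f}"
    return max((len(fmt.format(v, p=precision)) for v in nonzero), default=0)


A = [10.0, 0.1, 0.01]
B = [10.0, 0.1, 0.1]
print(max_width(A, sci_mode=False), max_width(B, sci_mode=False))  # 7 7
print(max_width(A, sci_mode=True), max_width(B, sci_mode=True))    # 10 10
```

With an explicit `sci_mode`, the sketch gives the same widths as the "after" output above: 7 for fixed notation and 10 for scientific notation.
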
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
2025-08-01 15:05:41 +00:00
b0b3e6e48b [PP] Refactor test_schedule_multiproc (#158780)
This refactors the pipelining schedule tests, since many of them repeat the same code:
1. Create pipelined model and reference model
2. Run reference model and pipelined model
3. Compare gradients

This PR moves those parts into helper methods and removes ~300 LOC. It also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780
Approved by: https://github.com/wconstab
2025-08-01 15:02:18 +00:00
3967dbedf4 [ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.

This PR makes the following changes:

- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)

What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`

**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
2025-08-01 06:49:01 +00:00
4396b15aa7 remove co_lnotab in favor of co_linetable (#159227)
Fixes #158833
DeprecationWarning: remove co_lnotab in favor of co_linetable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159227
Approved by: https://github.com/ezyang
2025-08-01 06:34:38 +00:00
bb6766053b fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-08-01 05:42:40 +00:00
a4fc051c9a Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend. (#159549)
Fixes #159548

* Throw an error message when the input tensors for the distributed `gather` are noncontiguous. This behaviour is consistent with the distributed `all_gather`.
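
The kind of check involved can be illustrated without a process group. The sketch below is not the NCCL backend code: `is_contiguous` and `check_gather_input` are hypothetical helpers that decide row-major contiguity from a shape and its strides (in elements) and raise when the layout is not dense. The error text is also made up for the example.

```python
def is_contiguous(shape, strides):
    """True if strides describe a dense row-major layout for shape."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        # Dimensions of size 1 can carry any stride without affecting layout.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def check_gather_input(shape, strides):
    """Raise instead of silently gathering from a strided view."""
    if not is_contiguous(shape, strides):
        raise ValueError("input tensor must be contiguous")


check_gather_input((2, 3), (3, 1))      # dense layout: accepted
try:
    check_gather_input((3, 2), (1, 3))  # transposed view: rejected
except ValueError as err:
    print(err)
```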

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159549
Approved by: https://github.com/d4l3k
2025-08-01 03:26:06 +00:00
5cc6a0abc1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit dfacf11f66d6512396382bdf5088f0ba9de00406.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
90f13f3b2a Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
cb9b74872b Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)"
This reverts commit d3ce45012ed42cd1e13d5048b046b781f0feabe0.

Reverted https://github.com/pytorch/pytorch/pull/156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
c964204829 [CI] Disable executorch jobs (#159595)
The current executorch pin needs to be updated

The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966

The failure is that the pin uses a version of the nightly that has been removed from the nightly index
```
#62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 2.9.0.dev20250731+cpu)
#62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601
```

The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392
```
2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is'
2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is
2025-07-31T01:56:57.0249969Z + exit 1
2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1
```
I believe the cause is https://github.com/pytorch/executorch/pull/11653, where the nightly pytorch is installed from our index, but then requirements-examples installs timm from pypi, which reinstalls pytorch, except it's the release build for cuda from pypi? That in turn causes triton to be installed.

I don't know what the intended behavior is, so I'm disabling the executorch docker build, the executorch build, and the nightly hash update. Apparently the test was already disabled because it was failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595
Approved by: https://github.com/malfet
2025-08-01 02:18:03 +00:00
2ac45c2752 Fix autocast context manager when there is exception (#159565)
Summary: When an exception occurs inside a context manager, we need to either return False or properly propagate the exception via `__exit__(exc_type, exc_val)`. Previously, while tracing, we didn't actually run the exit node, so we ended up swallowing the exception in a very weird way, as outlined in https://github.com/pytorch/pytorch/issues/153202. This PR fixes it.
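
For reference, the plain-Python contract at stake looks like the sketch below. It does not involve autocast or tracing; `Propagating`, `Swallowing`, and `run` are hypothetical names used only to show how the return value of `__exit__` decides whether an exception reaches the caller.

```python
class Propagating:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # A falsy return value lets the exception continue to the caller.
        return False


class Swallowing:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # A truthy return value suppresses the exception.
        return True


def run(manager):
    try:
        with manager:
            raise RuntimeError("boom")
    except RuntimeError:
        return "propagated"
    return "swallowed"


print(run(Propagating()), run(Swallowing()))
```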

Test Plan:
new test case

Rollback Plan:

Differential Revision: D79348382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159565
Approved by: https://github.com/zou3519, https://github.com/yushangdi
2025-08-01 02:12:24 +00:00
83e2ea8135 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
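
The overflow can be reproduced numerically with `ctypes`. This is only a model of the arithmetic, not the kernel: `offset_int32` and `offset_int64` are hypothetical helpers, the sizes are made up, and the 32-bit result assumes two's-complement wraparound (signed overflow is formally undefined behavior in C++, but wraparound is the typical outcome).

```python
import ctypes


def offset_int32(mb_start, n, nb_start):
    """Offset as it would come out of 32-bit signed arithmetic."""
    return ctypes.c_int32(mb_start * n + nb_start).value


def offset_int64(mb_start, n, nb_start):
    """Offset computed with 64-bit signed arithmetic."""
    return ctypes.c_int64(mb_start * n + nb_start).value


mb_start, n, nb_start = 50_000, 50_000, 0
print(offset_int32(mb_start, n, nb_start))  # negative: wrapped past 2**31 - 1
print(offset_int64(mb_start, n, nb_start))  # 2500000000
```

A negative offset added to `C_data` points outside the output buffer, which is consistent with the reported segmentation fault.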

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-08-01 01:55:48 +00:00
d994027a41 [Doc fix] fix spelling of enough (#159587)
Fixes the typo `enought` (should be `enough`) in 3 places in these files:
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
2025-08-01 01:50:57 +00:00
cb4f41e125 Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566)"
This reverts commit 8e07c9870d07c5a318ab21bb16b3fa27576851e6.

Reverted https://github.com/pytorch/pytorch/pull/157566 on behalf of https://github.com/yangw-dev due to failed an odd internal test, please reach out to metamate to fix it, D79112610 ([comment](https://github.com/pytorch/pytorch/pull/157566#issuecomment-3141840110))
2025-08-01 01:27:45 +00:00
690fc9cf88 [merge_rules] add some expected failure and skips (#159581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159581
Approved by: https://github.com/anijain2305
2025-08-01 01:18:40 +00:00
eb853e222b [cutlass upgrade] Ignore unused-but-set-variable for AsyncMM.cu (#159578)
Fixes inductor-perf-nightly-h100. This was caused by cutlass upgrade https://github.com/pytorch/pytorch/pull/158854. I missed it in https://github.com/pytorch/pytorch/pull/159276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159578
Approved by: https://github.com/Skylion007
2025-08-01 00:10:59 +00:00
06395276e4 Remove dynamo_timed from the CachingAutotuner.coordinate_descent_tuning() hot path. (#159588)
Summary: When coordinate_descent_tuning==True, CachingAutotuner.coordinate_descent_tuning() is called for every call of CachingAutotuner.run() (at least for Triton templates), but immediately returns the launcher. Move the dynamo_timed call after the check for triton template so we don't incur the context manager overhead on every call.
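
The shape of the change can be sketched with a toy timer. Nothing below is Inductor code: `timed`, `tune_before`, and `tune_after` are hypothetical stand-ins that only show why moving the early-return check ahead of the context manager removes the per-call overhead.

```python
import contextlib

timer_entries = []  # records every time the timing context is entered


@contextlib.contextmanager
def timed(name):
    """Hypothetical stand-in for a timing context manager."""
    timer_entries.append(name)
    yield


def tune_before(launcher, is_template):
    # Old shape: the context manager is entered even on the early return.
    with timed("coordinate_descent_tuning"):
        if is_template:
            return launcher
        return f"tuned({launcher})"


def tune_after(launcher, is_template):
    # New shape: the cheap check runs first, so the hot path skips the timer.
    if is_template:
        return launcher
    with timed("coordinate_descent_tuning"):
        return f"tuned({launcher})"


tune_before("kernel", is_template=True)
tune_after("kernel", is_template=True)
print(len(timer_entries))  # only tune_before entered the timer
```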

Fixes https://github.com/pytorch/pytorch/issues/159525

Test Plan: Used the repro in https://github.com/pytorch/pytorch/issues/159525 to make sure the overhead goes away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159588
Approved by: https://github.com/eellison
2025-07-31 23:33:10 +00:00
8becf646ef [dynamo] Make filter handle None as filter function (#159500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159500
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
ghstack dependencies: #158774, #159102
2025-07-31 23:28:57 +00:00
fa68216ca1 [itertools] Implement itertools.cycle with a polyfill (#159102)
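
The commit carries no description, so the following is only a sketch of what a pure-Python polyfill for `itertools.cycle` can look like, following the "roughly equivalent" recipe in the Python documentation. `cycle_polyfill` is a hypothetical name and is not the polyfill added by this PR.

```python
from itertools import islice


def cycle_polyfill(iterable):
    """Yield items from iterable, then repeat the saved items indefinitely."""
    saved = []
    for element in iterable:
        yield element
        saved.append(element)
    # An empty input never enters this loop, so the generator simply ends.
    while saved:
        for element in saved:
            yield element


print(list(islice(cycle_polyfill("AB"), 5)))
```
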
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159102
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
ghstack dependencies: #158774
2025-07-31 23:28:57 +00:00
25ef3d315d [aoti][mps] Dynamic reductions (#159355)
Dynamic kernel:
```cpp
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    constant long& r0_numel,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_1_cnt = 0; r0_1_cnt < static_cast<int>(metal::floor(static_cast<float>(0.99902343750000000 + 0.00097656250000000000*r0_numel))); ++r0_1_cnt) {
        int r0_1 = 1024 * r0_1_cnt + r0_index;
        if (r0_1 >= r0_numel) break;
        auto tmp0 = in_ptr0[x0 + 5*r0_1];
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, metal::min(static_cast<decltype(1024+r0_numel)>(1024), static_cast<decltype(1024+r0_numel)>(r0_numel)));
    if (r0_index == 0) out_ptr0[x0] = static_cast<float>(tmp1);
}

void AOTInductorModel::run_impl(...) {
    ...
    auto arg0_1_size = arg0_1.sizes();
    int64_t s77 = arg0_1_size[0];
    inputs.clear();
    [[maybe_unused]] auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    static constexpr int64_t int_array_0[] = {5LL, };
    static constexpr int64_t int_array_1[] = {1LL, };
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel");
    auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get());
    mps_lib_0_func->runCommandBlock([&] {
        mps_lib_0_func->startEncoding();
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1);
        aoti_torch_mps_set_arg_int(mps_lib_0_func_handle, 2, s77);
        mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))}, {static_cast<uint64_t>(1), static_cast<uint64_t>(std::min(static_cast<int64_t>(1024LL), static_cast<int64_t>(s77)))});

    });
    arg0_1.reset();
    output_handles[0] = buf0.release();
} // AOTInductorModel::run_impl
```

Static kernel:
```cpp
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
    int x0 = xindex;
    auto tmp0 = in_ptr0[x0];
    auto tmp1 = in_ptr0[5 + x0];
    auto tmp3 = in_ptr0[10 + x0];
    auto tmp5 = in_ptr0[15 + x0];
    auto tmp2 = tmp0 + tmp1;
    auto tmp4 = tmp2 + tmp3;
    auto tmp6 = tmp4 + tmp5;
    out_ptr0[x0] = static_cast<float>(tmp6);
}

void AOTInductorModel::run_impl(...) {
    ...
    static constexpr int64_t int_array_0[] = {5LL, };
    static constexpr int64_t int_array_1[] = {1LL, };
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_float32, cached_torch_device_type_mps, this->device_idx_, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    auto mps_lib_0_func = mps_lib_0.getKernelFunction("generated_kernel");
    auto mps_lib_0_func_handle = AOTIMetalKernelFunctionHandle(mps_lib_0_func.get());
    mps_lib_0_func->runCommandBlock([&] {
        mps_lib_0_func->startEncoding();
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(mps_lib_0_func_handle, 1, arg0_1);
        mps_lib_0_func->dispatch({static_cast<uint64_t>(5LL)});

    });
    arg0_1.reset();
    output_handles[0] = buf0.release();
} // AOTInductorModel::run_impl
```
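
The index arithmetic of the dynamic kernel can be checked with a small CPU-side model. This is a sketch under assumptions, not generated code: `simulate_dynamic_sum` is a hypothetical helper, and it relies on the observation that `floor(1023/1024 + r0_numel/1024)` equals `ceil(r0_numel / 1024)` for non-negative integer sizes.

```python
def simulate_dynamic_sum(data, x_numel, r_numel, max_threads=1024):
    """Model the strided per-thread accumulation followed by a group sum."""
    threads = min(max_threads, r_numel)                     # dispatch size
    iterations = (r_numel + max_threads - 1) // max_threads  # ceil division
    out = [0.0] * x_numel
    for x0 in range(x_numel):
        partials = []
        for r_index in range(threads):
            acc = 0.0
            for cnt in range(iterations):
                r = max_threads * cnt + r_index
                if r >= r_numel:
                    break
                acc += data[x0 + x_numel * r]
            partials.append(acc)
        out[x0] = sum(partials)  # stands in for the threadgroup reduction
    return out


x_numel, r_numel = 5, 2500
data = [float(i % 7) for i in range(x_numel * r_numel)]
result = simulate_dynamic_sum(data, x_numel, r_numel)
expected = [sum(data[x0 + x_numel * r] for r in range(r_numel)) for x0 in range(x_numel)]
print(result == expected)
```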

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159355
Approved by: https://github.com/malfet
2025-07-31 23:15:02 +00:00
7e00f2ec9d [AOTI] add zero size consts asm handler (#159225)
Add `get_zero_consts_asm_code` to compile zero-size consts into an object file.
This function handles the zero-size consts case, which is needed because the C++ standard does not allow zero-size arrays:
https://stackoverflow.com/questions/9722632/what-happens-if-i-define-a-0-size-array-in-c-c
1. On Windows, MSVC reports error C2466:
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2466?view=msvc-170
So we use an assembler to handle this case instead.
2. Why not use Win32 asm for all paths on Windows? Because ml64 only supports alignment up to `16`, which does not
meet PyTorch's required alignment of `64`. Reference: https://learn.microsoft.com/en-us/cpp/assembler/masm/ml-and-ml64-command-line-reference?view=msvc-170
```
Packs structures on the specified byte boundary. The alignment can be 1, 2, 4, 8, or 16.
```
3. This function handles the zero-size case on both Windows and Linux, because:
    A. On Linux, we added `-pedantic` to disallow zero-size arrays in the C++ compiler. 8e07c9870d/torch/_inductor/cpp_builder.py (L580)
    B. On Windows, MSVC does not support zero-size arrays by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159225
Approved by: https://github.com/desertfire
2025-07-31 22:46:33 +00:00
490cb3f1a4 Revert "[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)"
This reverts commit bb62e1f769ef51e2ec149d7256c135d09425aaa0.

Reverted https://github.com/pytorch/pytorch/pull/159190 on behalf of https://github.com/clee2000 due to broke [GH job link](https://github.com/pytorch/pytorch/actions/runs/16658705097/job/47150840171) [HUD commit link](bb62e1f769) on mac ([comment](https://github.com/pytorch/pytorch/pull/159190#issuecomment-3141513921))
2025-07-31 22:22:13 +00:00
b95cf5c91d Move complex to headeronly (#159411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159411
Approved by: https://github.com/albanD
ghstack dependencies: #159415
2025-07-31 22:05:43 +00:00
5e2ef2a465 Move Float8 variations to headeronly (#159415)
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).

Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- We reexpose headeronly APIs through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.

Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
2025-07-31 22:05:43 +00:00
9f753f8c0d [DTensor] Improve sort strategy (#159189)
- Sort strategy now supports sharding on non sorted dim.
~~- Fix histc xfail.~~
  - ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy,  the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189
Approved by: https://github.com/XilunWu
2025-07-31 21:52:42 +00:00
db437690d1 Add myself as a reviewer for when someone touches headeronly or stable (#159583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159583
Approved by: https://github.com/mikaylagawarecki
2025-07-31 21:30:05 +00:00
669009bcd1 [inductor] respect layout tags for ops with registered lowerings (#159134)
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.

~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.

FIXES https://github.com/pytorch/pytorch/issues/159097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
2025-07-31 21:29:40 +00:00
e4e2701429 Add the RunLLM widget to the website (#152055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152055
Approved by: https://github.com/albanD
2025-07-31 20:53:53 +00:00
64cc649275 [itertools] Fix accumulate (#158774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158774
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
2025-07-31 20:32:02 +00:00
b1fb552974 Revert "Fix ep deepcopy when there is python builitin name (#159478)"
This reverts commit de7376537f2a11783169fee2b3bc276d266898bf.

Reverted https://github.com/pytorch/pytorch/pull/159478 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159478#issuecomment-3141228423))
2025-07-31 20:20:53 +00:00
bb62e1f769 [inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.

- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
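
The steps above can be sketched as follows. This is an illustration only: `SchedulerNode` is a hypothetical stand-in for the real scheduler node types, the `is_collective` flag replaces the actual `_CollectiveKernel` check, and the kernel names are made-up strings.

```python
import json
from dataclasses import dataclass


@dataclass
class SchedulerNode:
    """Hypothetical stand-in for a scheduler node."""
    python_kernel_name: str
    is_collective: bool


def collective_schedule(nodes):
    """Keep collective nodes only, in scheduling order, by kernel name."""
    return [node.python_kernel_name for node in nodes if node.is_collective]


nodes = [
    SchedulerNode("all_reduce", True),
    SchedulerNode("mm", False),
    SchedulerNode("wait_tensor", True),
]
payload = json.dumps(collective_schedule(nodes))
print(payload)
```

Because the list preserves scheduling order, comparing the payloads from different ranks is enough to spot a mismatch in collective ordering.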

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
2025-07-31 19:58:07 +00:00
327e2ca580 [ez] get rid of unused var (#159571)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D79320299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159571
Approved by: https://github.com/houseroad, https://github.com/georgiaphillips
2025-07-31 19:11:57 +00:00
1ebcba4e1b Fix typo in link to torch memory_viz tool (#159214)
Fixes a small typo in the torch_cuda_memory docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159214
Approved by: https://github.com/yewentao256, https://github.com/HDCharles, https://github.com/Skylion007
2025-07-31 18:50:54 +00:00
5f7eae697d Deprecate DataLoader pin_memory_device param (#158323)
Builds on top of https://github.com/pytorch/pytorch/pull/146821

- Moves enabling pin_memory back inside `_BaseDataLoaderIter`
  - This is required for `StatefulDataLoader`, which uses `_BaseDataLoaderIter` directly rather than the `DataLoader` class init
- Add a simple test for CPU only env where setting `pin_memory=True` is a no-op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158323
Approved by: https://github.com/ramanishsingh

Co-authored-by: zeshengzong <zesheng.zong@outlook.com>
2025-07-31 18:42:07 +00:00
c1722db0f7 [NativeRT] Make VariadicOpConverter and FuseListUnpackConverter for cpu nodes only (#159519)
Summary:
VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels.

Currently, the graph passes are run if static_dispatch is enabled.

As we plan to enable static_dispatch by default, this diff adds a check so that the graph passes only apply to nodes whose inputs and outputs are all on CPU.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79295640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519
Approved by: https://github.com/dolpm, https://github.com/henryoier
2025-07-31 18:17:21 +00:00
8a233d6000 Revert "[ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)"
This reverts commit 07fad04181321d18963b71e9566d44f86a25c9f7.

Reverted https://github.com/pytorch/pytorch/pull/158692 on behalf of https://github.com/yangw-dev due to a failed internal test, apf.metrics.tests.generate_graph_def_test.GenerateGraphDefTest: test_aps_generate_inference_graph_def_with_justknobs1) AssertionError: Expected 'check' to be called once. Called 3 times. Please fix the internal test and reland it ([comment](https://github.com/pytorch/pytorch/pull/158692#issuecomment-3140873894))
2025-07-31 18:00:30 +00:00
bf3ebd7ad4 Fix grouped MM load along K when TMA loads are not used (#159485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159485
Approved by: https://github.com/ngimel
2025-07-31 17:58:02 +00:00
c07bb277a0 Revert "fix strategy hashing arg mismatch (#159506)"
This reverts commit 3a556762002ec0027b2120a7e6675182c0e50dbd.

Reverted https://github.com/pytorch/pytorch/pull/159506 on behalf of https://github.com/yangw-dev due to failing the internal test test_get_bwd_hook (torch.equal(output * 2, input_tensor.grad)) ([comment](https://github.com/pytorch/pytorch/pull/159506#issuecomment-3140858905))
2025-07-31 17:54:29 +00:00
f89c28cc6b [inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462
Approved by: https://github.com/eellison
2025-07-31 17:00:32 +00:00
8fedcfa59a [export] _ccode for PythonMod (#158851)
Summary: Adds ccode impl to PythonMod

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76463347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158851
Approved by: https://github.com/kalpit-meta-1
2025-07-31 16:46:51 +00:00
6662a76f59 [cutlass backend] Fix EVT tests post buf name change (#159541)
Differential Revision: [D79317791](https://our.internmc.facebook.com/intern/diff/D79317791/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159541
Approved by: https://github.com/mlazos
2025-07-31 16:39:49 +00:00
eqy
05aade1b6d [CUDA] Add serialTest decorator to largeTensorTest in test_cuda.py (#159271)
Hopefully helps with disabled tests due to OOM such as https://github.com/pytorch/pytorch/issues/159069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159271
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-07-31 16:27:16 +00:00
f946b25865 [MPS] Speedup argmax/argmin (#159524)
By using efficient `threadgroup_arg[max|min]` primitives.
- Fixed a bug in `simd_argmax` where the result of `simd_ballot` was prematurely cast to `ushort`, and adjusted the unit test
- Fixed NaN handling in compiled argmax, but this can't be reliably tested because the MPS (eager) implementation of argmax is buggy

Now, according to `bench_mps_ops.py`, `max(x, dim=0)` is reliably faster than the eager implementation:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159524
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #158990
2025-07-31 16:18:32 +00:00
d2e02585b8 [AOTI] Explicitly delete wait_tensor returned tensor (#159502)
Summary: In the Python wrapper codegen, the tensor returned from wait_tensor is not assigned or used anywhere, because wait_tensor always returns its input; see more discussion in https://github.com/pytorch/pytorch/issues/126773. Similarly, the cpp wrapper codegen should immediately delete the tensor handle returned from aoti_torch_cpu__c10d_functional_wait_tensor; otherwise it may extend the tensor's lifetime and even cause OOM in some cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159502
Approved by: https://github.com/yushangdi, https://github.com/jingsh
ghstack dependencies: #159476, #159487
2025-07-31 15:33:36 +00:00
3dd7ebf418 [BE] Fix buf name mismatch in test_c10d_functional_native.py (#159487)
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile, given that Inductor can change its buffer naming implementation freely. This PR therefore uses regex matching to find buffer names at run time, which will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently, name matching is based on empty_strided_ calls. We can expand it later if needed.
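
A minimal sketch of the idea, assuming the generated code contains assignments of the form `<name> = empty_strided_<device>(...)`. The `generated` string and the `BUF_RE` pattern are made up for illustration and are not taken from the test file.

```python
import re

# Hypothetical excerpt of generated wrapper code; real output will differ.
generated = """
buf0 = empty_strided_cuda((4, 4), (4, 1), torch.float32)
buf7 = empty_strided_cpu((8,), (1,), torch.float32)
"""

# Capture the name on the left-hand side of any `empty_strided_*` call,
# so the test does not depend on specific buffer numbering.
BUF_RE = re.compile(r"^\s*(\w+) = empty_strided_\w+\(", re.MULTILINE)

buf_names = BUF_RE.findall(generated)
```

A test can then format its expected code strings with the discovered names instead of hard-coding them.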

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
2025-07-31 15:33:36 +00:00
8273ee0646 [BE] Fix global config leak in test_c10d_functional_native.py (#159476)
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context.
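
The difference between a global assignment and a patch context can be sketched as below. The summary does not say which patch helper the PR uses, so `unittest.mock.patch.object` and the `config` stand-in object are assumptions made for illustration.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for a module-level config object such as torch._inductor.config.
config = SimpleNamespace(cpp_wrapper=False)


def run_with_cpp_wrapper(enabled):
    # The override only lives inside the `with` block and is restored on exit,
    # even if the body raises.
    with patch.object(config, "cpp_wrapper", enabled):
        return config.cpp_wrapper


seen_inside = run_with_cpp_wrapper(True)
seen_after = config.cpp_wrapper  # back to the original value
```

Because the original value is restored on exit, tests that run later in the same file see the default setting.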
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
2025-07-31 15:33:36 +00:00
c57382a493 Move BFloat16.h to headeronly (#159412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412
Approved by: https://github.com/desertfire
2025-07-31 15:29:17 +00:00
e7cc42df58 [inductor] consolidate common GEMM triton param retrieval (#159383)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out common logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-07-31 13:05:04 +00:00
cyy
72c69e731f set MSVC debug information only on debug builds (#159533)
Fixes: https://github.com/pytorch/pytorch/issues/159515
Reduces the binary size increase in release builds by removing debug information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159533
Approved by: https://github.com/atalman
2025-07-31 12:57:33 +00:00
78b9dea754 [inductor] Fix set_linter's handling of f-strings for Python 3.12 and up (fix #159056) (#159252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159252
Approved by: https://github.com/Skylion007
2025-07-31 12:56:09 +00:00
838924436e update the baseline for nightly max_autotune tests (#154973)
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:

In the latest nightly test, two models require baseline updates:

- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.

```
vision_maskrcnn                     IMPROVED:           graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
    vision_maskrcnn
```

```
detectron2_fcos_r_50_fpn            XFAIL
detectron2_fcos_r_50_fpn            FAIL:               graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
    detectron2_fcos_r_50_fpn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
2025-07-31 11:38:55 +00:00
2ffb510942 [Break XPU][Indutor UT] Fix failures introduced by community. (#159463)
Fixes #159000, Fixes #159335, Fixes #159334, Fixes #159332, Fixes #159331, Fixes #159330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159463
Approved by: https://github.com/jansel
2025-07-31 08:37:41 +00:00
20b5f694f8 [Dynamo] Make frozen dataclasses hashable (#159529)
Fixes https://github.com/pytorch/pytorch/issues/159424
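
For context, this is the plain-Python behaviour that a compiled function would be expected to match. It is only an eager illustration with a hypothetical `Shape` class, and it does not reproduce the linked issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    # frozen=True together with the default eq=True makes instances hashable.
    rows: int
    cols: int


a, b = Shape(2, 3), Shape(2, 3)

# Equal frozen instances hash equally, so either can be used as a dict key.
cache = {a: "compiled"}
hit = cache[b]
```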

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159529
Approved by: https://github.com/oulgen
ghstack dependencies: #159513
2025-07-31 07:03:01 +00:00
447e300d55 [Dynamo] Frozen dataclass attr access test (#159513)
Verifies https://github.com/pytorch/pytorch/issues/159424, but perhaps the issue is not fixed yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159513
Approved by: https://github.com/oulgen
2025-07-31 07:03:01 +00:00
5b2ad9279c [draft export] logging (#159004)
Summary: adds logging for draft export

Test Plan:
loggercli stage actualize-stage TorchDraftExportUsageLoggerConfig

Rollback Plan:

Differential Revision: D78308105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159004
Approved by: https://github.com/angelayi
2025-07-31 05:52:13 +00:00
78d7f0cdec disable execution frame cleanup (#159531)
Summary: Want to disable execution frame cleanup until fix in D78621408 is merged

Test Plan:
CI

Rollback Plan:

Differential Revision: D79306602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159531
Approved by: https://github.com/SherlockNoMad
2025-07-31 05:02:36 +00:00
d5c719ec3c [inductor] fix open temp file failed on Windows. (#159342)
Fixes a failure to open a temp file on Windows. Error message:
<img width="1181" height="239" alt="image" src="https://github.com/user-attachments/assets/e4a6f438-cb06-44c6-959b-0a6a49d2f44f" />

There are two options to fix this issue: https://stackoverflow.com/questions/66744497/python-tempfile-namedtemporaryfile-cant-use-generated-tempfile
1. `tempfile.NamedTemporaryFile` must be created with `delete=False` on Windows
2. Use `WritableTempFile` to handle this case on Windows.
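
A minimal sketch of option 1, not the code from this PR. The helper name `write_then_reopen` is hypothetical. The point is that the file is closed before being reopened by name, and then removed manually because `delete=False` disables automatic cleanup.

```python
import os
import tempfile


def write_then_reopen(text):
    # delete=False keeps the file on disk after close(), so it can be
    # reopened by name; on Windows, reopening a NamedTemporaryFile that is
    # still open generally fails.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    try:
        tmp.write(text)
        tmp.close()
        with open(tmp.name) as f:
            return f.read()
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # manual cleanup, since delete=False


content = write_then_reopen("hello")
```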

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159342
Approved by: https://github.com/jansel
2025-07-31 04:58:02 +00:00
c44efc3755 [Refactor] Fix Compile Warning: possibly dangling reference to a temporary (#159517)
```bash
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:25: warning: possibly dangling reference to a temporary [-Wdangling-reference]
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       |                         ^~~
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:1: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const c10::IValue&, c10::IValue>()’
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       | ^
```

This PR fixes this warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159517
Approved by: https://github.com/xmfan
2025-07-31 04:49:43 +00:00
6b9473469f [Graph Partition] add log for graph partition reasons and #partitions (#159425)
Previously, we logged `skipping cudagraphs due to [xxx reasons]` when there were cudagraph-unsafe ops. With graph partition, we split off these ops and cudagraph the remaining parts, but the log message is skipped as well.

In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
2025-07-31 04:21:06 +00:00
7a4167a164 support fabric handles with symmetric memory (#159319)
enable fabric handles for symmetric memory

Enables handle exchange via CU_MEM_HANDLE_TYPE_FABRIC on the systems that support it. This is needed to enable symmetric memory on NVLS72 systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159319
Approved by: https://github.com/malfet, https://github.com/kwen2501
2025-07-31 04:16:20 +00:00
8e67a6ae89 [vllm hash update] update the pinned vllm hash (#159320)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159320
Approved by: https://github.com/pytorchbot
2025-07-31 04:08:14 +00:00
c68ad1bd6a [dynamo][guards] Always record user.stack for informative tlparse guards (#159526)
Before
<img width="1146" height="280" alt="image" src="https://github.com/user-attachments/assets/4ddb11b2-dec8-4010-a28d-63b3cd4a7929" />

After
<img width="1248" height="248" alt="image" src="https://github.com/user-attachments/assets/8aafc5be-92cd-4468-bb8f-ad966de8c717" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159526
Approved by: https://github.com/Lucaskabela
2025-07-31 03:18:33 +00:00
3e5e094615 Revert "Fix large_tensor_test skipping cpu (#158617)"
This reverts commit debc0591b888f211bfe846bdc7cfa0626a5f6f6a.

Reverted https://github.com/pytorch/pytorch/pull/158617 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16631113381/job/47062415099) [HUD commit link](debc0591b8) ([comment](https://github.com/pytorch/pytorch/pull/158617#issuecomment-3138387762))
2025-07-31 02:57:22 +00:00
clr
c65efc8ea1 torch.compile: Record a pt2_compile_event for combo kernels (#159306)
This is off by default, but some jobs have it on. Having this show up in
perfetto and be globally queryable would be useful to see how expensive this
is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159306
Approved by: https://github.com/masnesral
2025-07-31 02:51:38 +00:00
a9049413e2 [dynamo] Turn on recursive dict tag optimization (#159186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159186
Approved by: https://github.com/jansel
2025-07-31 02:36:37 +00:00
d7a5ec9355 Fix the Doc of padding in avg_poolnd (#159142)
Fixes #159141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159142
Approved by: https://github.com/mikaylagawarecki
2025-07-31 02:02:48 +00:00
2c46922ce4 Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR, passes after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-31 01:36:50 +00:00
668d414ae7 [CPU] Fix bias dtype issue for FP8 qlinear (#159125)
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`

With bf16 autocast, the bias is converted into BFloat16, but fp8_qlinear_onednn_ref does not support a bf16 bias.
This PR converts the bias into bf16 in fp8_qlinear_onednn_ref.

Adds this case to the unit tests. To reproduce:
`python test/test_quantization.py -k test_qlinear_fp8`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
2025-07-31 01:26:45 +00:00
4541509237 [Triton] [Inductor] Fix an incorrect descriptor (#159407)
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.

Test Plan:
Found by inspection.

Rollback Plan:

Reviewed By: NoamPaz

Differential Revision: D79178538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
2025-07-31 00:34:19 +00:00
6c7f88c2c9 Check addmm dtypes (#159509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159509
Approved by: https://github.com/eqy
2025-07-31 00:15:46 +00:00
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950.

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e4m3.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G  | M | N | K | Runtime (ms) | GB/s | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`
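
As a rough sanity check on the TFLOPS column, the figures are consistent with counting `2 * M * N * K` floating-point operations per GEMM across `G` groups. That convention is an assumption, since the description does not state the formula, and `grouped_gemm_tflops` is a hypothetical helper. The runtime column is rounded, so the result only approximately matches the table.

```python
def grouped_gemm_tflops(g, m, n, k, runtime_ms):
    # Assumed convention: one multiply and one add per (m, n, k) triple,
    # repeated for each of the g groups.
    flops = 2 * g * m * n * k
    seconds = runtime_ms * 1e-3
    return flops / seconds / 1e12


# Row G=128, M=64, N=2048, K=5120 with the rounded 0.51 ms runtime;
# the result lands close to the 338.34 TFLOPS listed in the table.
estimate = grouped_gemm_tflops(128, 64, 2048, 5120, 0.51)
```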

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
25c3a7e317 [CUDA][CUDA Graphs] Move cuda graphs test to subprocess to avoid polluting mempool tests (#159305)
Otherwise the mempool test will fail, because the previous graph capture failed without fully cleaning up its state in the caching allocator. See also #159301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159305
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng, https://github.com/naromero77amd
2025-07-30 23:31:38 +00:00
de7376537f Fix ep deepcopy when there is python builitin name (#159478)
Summary: title

Test Plan:
CI

Rollback Plan:

Differential Revision: D79261007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159478
Approved by: https://github.com/pianpwk
2025-07-30 23:14:31 +00:00
fd2c64e286 Fix duplicated sources in inductor provenance tracking (#159484)
Summary:

The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.

This also means that if there are two nested passes like:

```
with GraphTransformObserver(gm, "outer"):
      with GraphTransformObserver(gm, "inner"):
              .....
```

We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```

Rollback Plan:

Differential Revision: D79203058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
2025-07-30 23:03:11 +00:00
2b1ae29960 [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397) (#159491)
Summary:
X-link: https://github.com/pytorch/executorch/pull/12986

As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Rollback Plan:

Reviewed By: JacobSzwejbka, yangw-dev

Differential Revision: D79199389

Pulled By: Lucaskabela

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159491
Approved by: https://github.com/anijain2305, https://github.com/yangw-dev
2025-07-30 22:57:50 +00:00
1293405c8d [MPS] Add simd_[arg][max|min] (#158990)
And add eager tests for those.
Re-implement `threadgroup_[max|min]` using those functions, as they are significantly faster than before (though much slower than eager, due to the arg part), which can be verified by running the following script
```python
import itertools
import timeit
import torch
from torch.utils.benchmark import Compare, Measurement, Timer

def bench_unary_op(func, x, label) -> Measurement:
    sync_cmd = "torch.mps.synchronize()" if "mps" in str(x.device) else ""
    t = Timer(
        stmt=f"f(x);{sync_cmd}",
        globals={"f": func, "x": x},
        language="python",
        timer=timeit.default_timer,
        sub_label=f"{func.__name__} ({str(x.dtype)})",
        description=label,
        env=torch.__version__,
    )
    return t.blocked_autorange()

def bench_reduction(
    reduction_func, device: str = "mps", dtype: torch.dtype = torch.float32
) -> list[Measurement]:
    rc = []

    # Bench 2D with reduction over dim=0
    def f(t):
        return reduction_func(t, dim=0)[0]

    f.__name__ = reduction_func.__name__
    f_c = torch.compile(f, dynamic=False, fullgraph=True)

    for size in (512, 1024, 2048, 4096):
        x = torch.testing.make_tensor(size, size, device=device, dtype=dtype)
        rc_c, rc_e = f(x), f_c(x)
        rc_c, rc_e = (rc_c[0], rc_e[0]) if isinstance(rc_c, tuple) else (rc_c, rc_e)
        rc.append(bench_unary_op(f, x, f"eager-{size}x{size}"))
        rc.append(bench_unary_op(f_c, x, f"compile-{size}x{size}"))
    return rc

def main() -> None:
    #dtypes = [torch.float16, torch.float32, torch.bfloat16, torch.int32, torch.int64]
    dtypes = [torch.float32, torch.int32, torch.int64]

    # Profile reduction ops
    rc = []
    for op, dtype in itertools.product([torch.max], dtypes):
        rc.extend(bench_reduction(op, dtype=dtype))
    Compare(rc).print()

if __name__ == "__main__":
    torch._dynamo.config.cache_size_limit = 2**16
    main()
```

Produces the following table before
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      297.3      |       531.6       |       394.1       |        2550.5       |       773.0       |        4904.7       |       3647.2      |        9682.0
      max (torch.int32)    |      297.8      |       359.2       |       387.7       |        1179.4       |       768.2       |        2175.0       |       3677.1      |        4495.9
      max (torch.int64)    |      278.7      |       541.4       |       410.2       |        2873.3       |       858.9       |        5620.4       |       6107.2      |       11176.1

Times are in microseconds (us).
```
And after
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      307.9      |       265.3       |       401.0       |        340.8        |       766.5       |        661.9        |       3463.5      |        2829.5
      max (torch.int32)    |      293.5      |       263.1       |       405.0       |        338.8        |       761.4       |        672.5        |       3050.0      |        2688.6
      max (torch.int64)    |      308.2      |       255.7       |       417.4       |        341.4        |       877.0       |        695.0        |       5812.2      |        5762.2

```

`argmax`/`argmin` are much trickier due to the NaN-handling logic that needs to be added there.

Also fixes `torch.max/min` compilation for half-precision types and adds regression tests for it.

This PR also introduces a bunch of helper functions, such as `simd_broadcast` that works for int64 and the `c10::metal::pair` template, which are used by `simd_argmax` to return both value and index

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158990
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-07-30 21:57:25 +00:00
3a65ff84b6 [dynamo, easy] add comment on skipping sys.monitoring frames (#159493)
Add a comment explaining why this code is needed (followup to https://github.com/pytorch/pytorch/pull/159369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159493
Approved by: https://github.com/azahed98, https://github.com/Lucaskabela, https://github.com/zou3519, https://github.com/jingsh
ghstack dependencies: #159369
2025-07-30 21:54:38 +00:00
acf13a9b75 Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend (#158903)
Fixes #158902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158903
Approved by: https://github.com/H-Huang
2025-07-30 21:44:29 +00:00
3a55676200 fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-07-30 21:37:13 +00:00
af39144a93 Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480)
Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead.

Fixes https://github.com/pytorch/pytorch/issues/159440

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480
Approved by: https://github.com/xmfan, https://github.com/oulgen
2025-07-30 21:29:38 +00:00
25343b343e [ATen][CUDA][cuFFT] Guard against deprecated error codes (#159466)
This PR adds a guard based on CUDA version, per latest cuFFT [documentation](https://docs.nvidia.com/cuda/cufft/index.html#return-value-cufftresult):
>The following error codes are deprecated and will be removed in a future release: `CUFFT_INCOMPLETE_PARAMETER_LIST`, `CUFFT_PARSE_ERROR`, `CUFFT_LICENSE_ERROR`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159466
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/Skylion007
2025-07-30 21:10:32 +00:00
07fad04181 [ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.

This PR makes the following changes:

- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)
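
As a rough illustration of what a mask over a Q shard and the global KV means, the sketch below computes a causal mask for one rank in plain Python. It is not the `create_cp_block_mask` implementation: the contiguous, equal-sized sharding of Q across ranks and the helper name `cp_causal_mask` are assumptions made for this example.

```python
def cp_causal_mask(rank: int, world_size: int, seq_len: int) -> list[list[bool]]:
    """Causal mask for the Q rows owned by `rank` against all KV positions.

    Assumes Q is split into contiguous, equal-sized shards (no load balancing).
    """
    if seq_len % world_size != 0:
        raise ValueError("seq_len must be divisible by world_size")
    shard_len = seq_len // world_size
    q_offset = rank * shard_len  # global index of this rank's first Q row
    return [
        # A Q row may attend to a KV position only if it is not in the future.
        [q_offset + q_local >= kv_idx for kv_idx in range(seq_len)]
        for q_local in range(shard_len)
    ]


for row in cp_causal_mask(rank=1, world_size=2, seq_len=4):
    print(row)
```

Each rank's mask has `seq_len // world_size` rows but `seq_len` columns, because only Q is sharded in this sketch while KV is treated as fully gathered.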

What's not in this PR:
- QKV load balancing
- Tests on masks other than `causal_mask`.
- Support for small attention (i.e. QKV size smaller than 128), because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`

**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
2025-07-30 21:01:53 +00:00
7ac70ac4cd Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit a3a51282dbabe0220c2c3947a89f7d2ecc514d33.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to failed internal build Failed to load config ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3137796767))
2025-07-30 20:59:19 +00:00
e221a1c853 [Code Motion]Restructure flex attention kernel into flex subdirectory (#159437)
Mostly code motion, updating relative paths, moving some imports that had to be lazy before to top level scope now that we are free from the curse.

This will make it easier to add newer templates and provide some organization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159437
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/Skylion007
2025-07-30 20:12:35 +00:00
4defea1e2c [c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
Summary:
We found that group_name is not set correctly inside group_split, because we are setting group_name to `deviceTypeToBackend_`, which is set after `setBackend`. The same applies to group_desc. I added more unit tests for this.

We need to call setGroupName correctly; otherwise the DeviceMesh use case breaks when split_group is used in DeviceMesh.

Also, ncclx needs to be aware that its Option is a subclass of BackendOption.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
2025-07-30 19:55:55 +00:00
53d68b95de [ROCm CI] Migrate to MI325 Capacity. (#159059)
This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture gfx942 and our testing plans don't change within an architecture, so we pool them under the same label `linux.rocm.gpu.gfx942.<#gpus>` with this PR as well to reduce overhead and confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159059
Approved by: https://github.com/jithunnair-amd, https://github.com/atalman

Co-authored-by: deedongala <deekshitha.dongala@amd.com>
2025-07-30 19:47:59 +00:00
f74842d57f [PP] Fix zero bubble schedules for eval() (#159475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159475
Approved by: https://github.com/tianyu-l, https://github.com/Skylion007
2025-07-30 19:46:10 +00:00
644fee2610 Fix TestAutogradFallback flaky tests under Dynamo: migrate to lib._destroy() (#159443)
Under dynamo, the libraries couldn't be properly cleared unless we manually called `gc.collect()`, but that's slow. It also works if we just use the `_destroy()` method to tear them down.

FIXES
#159398
#159349
#159254
#159237
#159153
#159114
#159040
#158910
#158841
#158763
#158735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159443
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 19:30:55 +00:00
7821fbc560 [BE] Clarify comment to not revert when command has been edited (#159495)
This is mostly a nit. I was a bit confused when I saw
<img width="1032" height="183" alt="image" src="https://github.com/user-attachments/assets/7a18f167-78c1-4c33-ba6f-3588914c642e" />
in https://github.com/pytorch/pytorch/pull/159172

So I decided I should clean up this message a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159495
Approved by: https://github.com/yangw-dev, https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-30 19:23:33 +00:00
73ee323380 [ONNX] RMS Norm (#159377)
- Implement rms norm using onnx RMSNormalization-23
- Use the correct eps for float32
  eaadd1282c/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L1844-L1866)
  <img width="743" height="107" alt="image" src="https://github.com/user-attachments/assets/a6fd45aa-01d9-4667-924d-3012232cfcde" />

- Created facility to run tests with the reference runtime by extending ONNXProgram and assert_onnx_program.

Fix https://github.com/pytorch/pytorch/issues/159257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159377
Approved by: https://github.com/titaiwangms
2025-07-30 18:55:47 +00:00
176c6446f8 Update CODEOWNERS for ONNX (#159390)
Update CODEOWNERS for ONNX to reflect current maintainers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159390
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-07-30 18:54:25 +00:00
debc0591b8 Fix large_tensor_test skipping cpu (#158617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158617
Approved by: https://github.com/BoyuanFeng
2025-07-30 18:48:07 +00:00
0df78f0c11 Remove /d2implyavx512upperregs- flag (#159431)
And reopen https://github.com/pytorch/pytorch/issues/145702

This flag is not documented anywhere and slows down sccache-accelerated builds. Per https://developercommunity.visualstudio.com/t/Invalid-code-gen-when-using-AVX2-and-SSE/10527298#T-N10562579 it does not work around a compiler bug, but rather disables some optimizations of AVX512 instructions that are invoked in the AVX2 codepath.

Fixes https://github.com/pytorch/pytorch/issues/159082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159431
Approved by: https://github.com/clee2000
2025-07-30 18:47:03 +00:00
d0e8a0ec4c Add CPython test for heapq (#159370)
Not used directly but used internally by `collections.Counter`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159370
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 18:43:06 +00:00
22492848b6 [BE]: Update CUTLASS submodule to 4.1.0 (#158854)
Update the CUTLASS submodule to the latest version with new supported architectures and new features we can use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158854
Approved by: https://github.com/henrylhtsang
2025-07-30 17:44:38 +00:00
5c14315b05 fixed typo error (#159451)
Fixes #159375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159451
Approved by: https://github.com/albanD
2025-07-30 17:41:30 +00:00
1b99c1859c [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from  the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-30 17:29:43 +00:00
435edbcb5d [Graph Partition] add graph partition doc (#159450)
This pr adds doc for graph partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450
Approved by: https://github.com/eellison
2025-07-30 17:01:10 +00:00
6c6e11c206 Revert "Fix max_width computation in _tensor_str._Formatter (#126859)"
This reverts commit 1465757959dd7e63715b7621650896eca977aefa.

Reverted https://github.com/pytorch/pytorch/pull/126859 on behalf of https://github.com/yangw-dev due to broke trunk with test  distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single - RuntimeError: Expected to find buf7 = empty but did not find it ([comment](https://github.com/pytorch/pytorch/pull/126859#issuecomment-3137137030))
2025-07-30 16:56:32 +00:00
a775c8e73e [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a C call event loss bug in Python 3.12.0-3.12.4. That bug was fixed by 257c413cd1 in 3.12.5.

So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.

There are ways to fix the problem correctly: for example, we could add a new monitoring callback to compensate for the missing call events of methods implemented as C functions, or we could override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
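
For users who want to know whether their interpreter falls in the affected range, a minimal check could look like the sketch below. The helper name `loses_c_call_events` is hypothetical, and the range boundaries are taken directly from the commit title rather than from any PyTorch API.

```python
import sys

# Assumption for this sketch: the affected range is exactly 3.12.0 through
# 3.12.4, as stated in the commit title.
FIRST_AFFECTED = (3, 12, 0)
LAST_AFFECTED = (3, 12, 4)


def loses_c_call_events(version: tuple) -> bool:
    """Return True if `version` (major, minor, micro) is in the affected range."""
    return FIRST_AFFECTED <= tuple(version) <= LAST_AFFECTED


current = tuple(sys.version_info[:3])
if loses_c_call_events(current):
    print("C call events may be missing from profiles; consider Python 3.12.5+")
```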

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
2025-07-30 16:35:51 +00:00
24d07b3a67 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
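
To see why a plain product is enough for this shape: for `a` of shape (M, 1) and `b` of shape (1, N), the inner dimension has length 1, so `out[i][j] = a[i][0] * b[0][j]` with no reduction. The sketch below checks that identity on nested Python lists. It illustrates the math only and does not reproduce the code in `torch._inductor.decomposition.mm`; both helper names are made up for the example.

```python
def matmul(a, b):
    """Reference matrix multiply on nested lists."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def outer_product(a, b):
    """(M x 1) times (1 x N) written as an elementwise product."""
    return [[a_row[0] * b_val for b_val in b[0]] for a_row in a]


a = [[1.0], [2.0], [3.0]]  # shape (3, 1)
b = [[10.0, 20.0]]         # shape (1, 2)
assert matmul(a, b) == outer_product(a, b)
```

Because the product form never iterates over the rows one by one in Python, it has no need to turn a symbolic row count into a concrete integer.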

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-30 16:34:15 +00:00
90fd06be71 Various bugfixes for running NanoGPT training (#159166)
Fix various small bugs with running nanogpt on torchbenchmark in OSS under python 3.10. After these changes, the following now succeeds:

```
tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --training --backend inductor  --caching-precompile --warm-start-latency
```

Cold start: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp12LuZ5/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm start (we are investigating the recompile):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpT5YTB2/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159166
Approved by: https://github.com/zhxchen17
2025-07-30 16:30:22 +00:00
002f18807e [DCP] Improve error handling for process based async checkpointing (#159374)
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
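
The two rules above can be sketched as a small serving loop. This is a simplified model using `queue.Queue`, not the DCP implementation: the `serve` function, the `TERMINATE` sentinel value, and the `init_pg` and `save` callables are all hypothetical names chosen for the example.

```python
import logging
import queue

logger = logging.getLogger(__name__)

TERMINATE = "TERMINATE"  # hypothetical sentinel sent by the main process


def serve(requests: queue.Queue, responses: queue.Queue, init_pg, save) -> None:
    """Background checkpoint loop: exit on init failure or TERMINATE only."""
    try:
        init_pg()
    except Exception as exc:
        # Process-group init failed: report the error and stop the process.
        responses.put(exc)
        return

    while True:
        request = requests.get()
        if request == TERMINATE:
            return
        try:
            responses.put(save(request))
        except Exception as exc:
            # A failed save is logged and returned, but the loop keeps serving.
            logger.error("checkpoint save failed: %s", exc)
            responses.put(exc)
```

Returning the exception object through the response queue lets the main process decide how to react, while later checkpoint requests are still served.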

Test Plan:
CI

Rollback Plan:

Differential Revision: D79177410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
2025-07-30 16:25:28 +00:00
259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
ee343ce60c [RPC][TensorPipe] Fix import torch if compiled without TensorPipe (#159461)
This is a follow up on the PR #154382, as the issue still persists:
```
  File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
    from . import api, backend_registry, functions
  File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
    from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
  File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
    from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
2025-07-30 16:04:02 +00:00
ea5369113a unflatten closure (#159418)
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.
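
A minimal sketch of the stack property and one way to restore it is shown below. It assumes that "dropping entries" means truncating the history at the first FQN that is not an extension of its predecessor, which is one reading of the summary above; the actual `unflatten` code may differ. The function names are hypothetical.

```python
def is_prefix(parent: str, child: str) -> bool:
    """True if module path `parent` is a prefix of `child` ('' is the root)."""
    return parent == "" or child == parent or child.startswith(parent + ".")


def restore_stack_property(fqns: list) -> list:
    """Keep the longest leading run of FQNs where each is a prefix of the next."""
    kept = []
    for fqn in fqns:
        if kept and not is_prefix(kept[-1], fqn):
            break  # stack property broken: drop this entry and everything after
        kept.append(fqn)
    return kept


print(restore_stack_property(["", "a", "a.b", "c", "c.d"]))
```

Note that the prefix test works on whole path components, so `a` is a prefix of `a.b` but not of `ab`.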

Test Plan:
added test, updated another

Rollback Plan:

Differential Revision: D79204669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
2025-07-30 15:42:18 +00:00
b268f22ab2 Move Float4 to headeronly (#159414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414
Approved by: https://github.com/desertfire
2025-07-30 15:34:01 +00:00
52a52d1b78 [dynamo][guards] Skip no tensor aliasing guard on inbuilt nn module buffers (#159453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159453
Approved by: https://github.com/jansel
2025-07-30 15:31:07 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
1465757959 Fix max_width computation in _tensor_str._Formatter (#126859)
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.

Now, the code first checks whether it should be in sci_mode, then computes `max_width`.

Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])

print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```

In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([   10.0000,     0.1000,     0.0100]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```

One can see that with `sci_mode=False`, the values of A are padded with unneeded leading spaces and A does not have the same `max_width` as B (it keeps the `max_width` from `sci_mode = None`).

Also, with `sci_mode = True`, the `max_width` for B is 7 even though each value takes 10 characters. (This is fine in practice, as the code that uses `max_width` does not rely much on it, but it is still misleading.)

After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000,  0.1000,  0.0100]) Formatter max_width: 7
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```

This also aligns A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
2025-07-30 14:01:00 +00:00
17b9c618dd [a2av] not returning out tensor from ops (#159435)
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```

This PR selects option (1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
2025-07-30 08:30:25 +00:00
d3ce45012e Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
2025-07-30 06:37:15 +00:00
1fc010a9d8 Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-30 06:37:15 +00:00
dfacf11f66 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-30 06:37:06 +00:00
c8cf811995 Enable AcceleratorAllocatorConfig key check (#157908)
# Motivation
Add a mechanism that raises an error if a key in the allocator config is unrecognized.
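
The idea of such a key check can be sketched in a few lines of Python. This is an illustration only: the real check is implemented inside the allocator config, and `check_keys` and the contents of `KNOWN_KEYS` are assumptions made for the example.

```python
# Illustrative key set only; the real set of recognized keys lives in the
# allocator config implementation and is not listed in this commit message.
KNOWN_KEYS = frozenset({"max_split_size_mb", "garbage_collection_threshold"})


def check_keys(config: dict, known_keys=KNOWN_KEYS) -> None:
    """Raise ValueError naming every key that is not recognized."""
    unknown = sorted(set(config) - set(known_keys))
    if unknown:
        raise ValueError(
            f"Unrecognized allocator config key(s): {', '.join(unknown)}"
        )


check_keys({"max_split_size_mb": "128"})  # passes silently
```

Failing loudly on an unknown key turns a silent typo in the config string into an immediate, explainable error.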

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
2025-07-30 06:36:56 +00:00
914b1a3873 Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The other name, `AllocatorConfig`, is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the key-value pairs in the environment variable config.

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Configuration values are comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
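
A small parser for the `key:value` format in the last bullet might look like the following. It is a sketch of the format only and is unrelated to the actual `ConfigTokenizer` implementation; `parse_alloc_conf` is a hypothetical name, and values are kept as strings rather than converted to numbers.

```python
def parse_alloc_conf(text: str) -> dict:
    """Parse 'key1:123, key2:[val1,val2]' into a dict of strings and lists."""
    items = []
    depth = 0
    current = ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']'")
        if ch == "," and depth == 0:
            items.append(current)  # a top-level comma ends one key:value pair
            current = ""
        else:
            current += ch
    if depth != 0:
        raise ValueError("unbalanced '['")
    items.append(current)

    result = {}
    for item in items:
        if not item.strip():
            continue
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {item.strip()!r}")
        key, value = key.strip(), value.strip()
        if value.startswith("[") and value.endswith("]"):
            # List value: split the bracketed body on commas.
            result[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            result[key] = value
    return result


print(parse_alloc_conf("key1:123, key2:[val1,val2]"))
```

Tracking bracket depth is what lets the commas inside `[val1,val2]` stay part of one value instead of splitting the pair.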

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
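
The priority rule above amounts to a first-match lookup, sketched below. The relative order of the two legacy variables is an assumption of this sketch (the text only says both rank below `PYTORCH_ALLOC_CONF`), and `resolve_alloc_conf` is a hypothetical helper, not a PyTorch function.

```python
import os
from typing import Mapping, Optional

# Highest priority first. The order of the two legacy names relative to each
# other is an assumption; the commit only says both rank below the default.
ENV_VARS = (
    "PYTORCH_ALLOC_CONF",
    "PYTORCH_CUDA_ALLOC_CONF",
    "PYTORCH_HIP_ALLOC_CONF",
)


def resolve_alloc_conf(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value of the first configured variable, or None."""
    for name in ENV_VARS:
        value = env.get(name)
        if value is not None:
            return value
    return None
```

Passing the environment as a mapping keeps the lookup easy to exercise without modifying the real process environment.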

Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-30 06:36:46 +00:00
7eb5fdb358 [dynamo][guards] Recursive dict tag optimization (#159183)
Design doc here - https://docs.google.com/document/d/1W29DrWID5miGWlZXspsQVN5U0zydE3kjZpziOXrhuaY/edit?tab=t.0#bookmark=id.sba04iw9sp68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159183
Approved by: https://github.com/jansel
2025-07-30 06:01:32 +00:00
f1fb57d854 Add user annotation for FX graph cache key (#159318)
Summary: The AI system co-design team requested a user annotation for the FX graph cache key in PyTorch Kineto traces and Execution traces. With this annotation, they can tell which FX graph each kernel belongs to.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Differential Revision: D79019069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-07-30 05:52:50 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
e785c087c5 [audio hash update] update the pinned audio hash (#159321)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159321
Approved by: https://github.com/pytorchbot
2025-07-30 04:35:01 +00:00
d214901133 Add a title to distributed._dist2.md (#159385)
Sphinx likes titles and complains when they are missing. This adds a title to address this warning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
2025-07-30 04:09:41 +00:00
96ac64d00c Migrate easy q(u)int/bits stuff to torch/headeronly (#159302)
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.

It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-30 03:41:27 +00:00
46d34d6766 (should_fold) gso to guard_or_false when checking whether to fold 3d bmm into 2d mm (#159184)
Switch from guard_size_oblivious to guard_or_false. If we encounter a DDE, we now avoid folding this 3d bmm into a mm.

806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)

## DDE
```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
    if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4472 in should_fold)
```

```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
    return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
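
The behavioral difference can be modeled in plain Python. The sketch below is a toy, not the code in `torch/_decomp/decompositions.py`: `DataDependentError` stands in for `GuardOnDataDependentSymNode`, the toy `guard_or_false` takes a callable for simplicity (an assumption of this sketch, not the real signature), and `should_fold_toy` is a made-up reduction of `should_fold`.

```python
class DataDependentError(Exception):
    """Stand-in for GuardOnDataDependentSymNode in this sketch."""


def guard_or_false(condition) -> bool:
    """Evaluate `condition()`; if it cannot be decided, answer False."""
    try:
        return bool(condition())
    except DataDependentError:
        return False


def undecidable() -> bool:
    # Models an expression such as Eq(12*(u0//2), 0) with an unbacked u0.
    raise DataDependentError("cannot decide at trace time")


def should_fold_toy(checks) -> bool:
    """Fold only if every check can be decided and is true."""
    return all(guard_or_false(check) for check in checks)


print(should_fold_toy([lambda: True, undecidable]))
```

An undecidable check therefore selects the conservative path (no folding) instead of raising during tracing.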

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
2025-07-30 03:12:14 +00:00
clr
880249adbc dynamo: handle AttributeErrors from nn_module when infer_parameters throws. (#158501)
This only handles AttributeError, but in general, any exception coming from
here is a user exception. Let me know if we prefer to catch all exceptions and then re-raise them as observed exceptions.

```
 File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
    initialize_lazy_module(tx, mod, args, kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
    mod._infer_parameters(mod, fake_args, fake_kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
  ...,
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```

Note that we crash with a slightly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
    exp_res = opt_m(x, y)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
    return super().__call__(*args, **kwargs)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
    self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
    self.foo += 1
    ^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
        f"'{type(self).__name__}' object has no attribute '{name}'"
    )
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-07-30 02:41:41 +00:00
846ada4973 [AOTI] disable crashed AOTI UTs on Windows. (#159427)
disable crashed AOTI UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159427
Approved by: https://github.com/angelayi
2025-07-30 02:23:27 +00:00
badd0618e4 Remove unused parameter on CUDA AllocParams (#159159)
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `AllocParams` is defined in cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-30 02:05:25 +00:00
a753a72b14 [BE] Modify PyObjectSlot to assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
2025-07-30 01:36:03 +00:00
b57d1ef110 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR surfaces some mypy errors, but I believe those were already in the code base beforehand and should be fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
2025-07-30 01:36:03 +00:00
dd7c996d5c [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
2025-07-30 01:36:03 +00:00
70d2e9ba45 [MPS] Avoid outputting zeros from exponential_ for MPS (#159386)
Fixes #159103
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159386
Approved by: https://github.com/malfet
2025-07-30 00:20:31 +00:00
62f98dbb44 [CUDA][Convolution] Add tf32_on_and_off decorator to test_deconv_freezing_cuda (#159280)
Blackwell seems to select TF32 kernels for this case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159280
Approved by: https://github.com/zou3519, https://github.com/jingsh, https://github.com/Skylion007
2025-07-29 23:44:10 +00:00
e288c258f7 Revert "Remove tensorexpr tests (#158928)"
This reverts commit d742a2896c571a535003d5928fe80397325575a5.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/yangw-dev due to this breaks bunch of internal dependency since some tests are still using the deleted test files from this pr, the internal reviewer please help fix this using codev ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3134378616))
2025-07-29 23:32:07 +00:00
df58db8831 [dynamo, docs] add recompilation, observability, reporting issues docs (#159062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062
Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305
2025-07-29 23:23:51 +00:00
15bb81ea4f [2/N][CI] Remove MacOS-13 workarounds from tests (#159304)
Part of https://github.com/pytorch/pytorch/issues/159275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #159277, #159278
2025-07-29 23:12:13 +00:00
8d37073bac [ROCm] Update jit_utils.cpp trait modification based on HIP version. (#159292)
The mi355 ci regression and hiprtc kernel compilation is failing due to duplicate definitions of traits leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Checking if using hip version instead of rocm version for the check would help with resolution here as rocm version and hip version aren't synced. ROCm 7.0 Alpha build used in CI is still on HIP 6.5.

Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292

Also, this PR increases the frequency of this MI355 CI to twice a day so we can catch and identify regressions easier if they happen for now.

Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
2025-07-29 22:45:27 +00:00
dc286aef61 Fused RMSNorm Housekeeping (#159317)
Small PR to address comments that were made from the original fused rmsnorm PR that were not landed

Changes:
- Warning message when input.dtype doesn't match weight.dtype
- Ensure default epsilon value is correct

Comments:
https://github.com/pytorch/pytorch/pull/153666#discussion_r2114735005
https://github.com/pytorch/pytorch/pull/153666#discussion_r2223518064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159317
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy
2025-07-29 22:39:18 +00:00
b4619f0272 Pin Helion to 0.0.10 in PyTorch CI (#159420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159420
Approved by: https://github.com/aorenste, https://github.com/malfet
2025-07-29 22:06:50 +00:00
477c2273e1 [dynamo] better way to skip tracing sys.monitoring callables (#159369)
Better approach to https://github.com/pytorch/pytorch/pull/158171, according to https://github.com/python/cpython/issues/137178#issuecomment-3131617493.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159369
Approved by: https://github.com/Skylion007
2025-07-29 21:54:58 +00:00
2176d481c1 [DTensor] dispatch to sharding prop over decomps (#159324)
Fixes #159110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159324
Approved by: https://github.com/ezyang
2025-07-29 21:28:36 +00:00
b97274e8ac [iter] Raise TypeError if iter arg cannot be iterable (#158410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158410
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371, #156416, #156460
2025-07-29 21:24:21 +00:00
f9be65cea4 [iter] Wrap iter(..) call in a ObjectIteratorVariable (#156460)
This object keeps track of when the iterator is exhausted (i.e., when it raises `StopIteration`).
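
To illustrate the idea in plain Python (this is only a sketch, not Dynamo's `ObjectIteratorVariable`; the class name `TrackingIterator` and its `exhausted` attribute are hypothetical), a wrapper can remember that the underlying iterator has already raised `StopIteration`:

```python
class TrackingIterator:
    """Hypothetical wrapper that remembers whether the wrapped iterator is exhausted."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.exhausted:
            # Once exhausted, keep raising without touching the source again.
            raise StopIteration
        try:
            return next(self._it)
        except StopIteration:
            # Record the exhaustion so later calls can short-circuit.
            self.exhausted = True
            raise


it = TrackingIterator([1, 2])
values = list(it)  # consumes both items and flips it.exhausted
```

After `list(it)` returns, `it.exhausted` is `True` and any further `next(it)` raises `StopIteration` immediately.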

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156460
Approved by: https://github.com/zou3519
ghstack dependencies: #156371, #156416
2025-07-29 21:24:20 +00:00
4e3e3dc0a7 [iter] support iter(callable, sentinel) (#156416)
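
For reference, the two-argument form of the builtin calls the callable repeatedly and stops as soon as it returns the sentinel. A minimal plain-Python sketch of that behavior (the helper `make_counter` is hypothetical and exists only for this example):

```python
def make_counter():
    """Return a zero-argument callable that yields 0, 1, 2, ... on successive calls."""
    state = {"n": -1}

    def step():
        state["n"] += 1
        return state["n"]

    return step


step = make_counter()
# iter(callable, sentinel) calls step() until it returns 3; the 3 itself is not yielded.
collected = list(iter(step, 3))  # [0, 1, 2]
```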
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156416
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371
2025-07-29 21:24:20 +00:00
fcf59df2b6 [iter] Add support for sequence protocol in iter(..) (#156371)
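
As a reminder of what the sequence protocol means for `iter(..)` in plain Python: an object without `__iter__` is still iterable if it defines `__getitem__` accepting integers from 0 upward and raises `IndexError` at the end. A minimal sketch (the `Squares` class is hypothetical):

```python
class Squares:
    """Iterable only through the legacy sequence protocol (no __iter__ defined)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if index >= self.n:
            # IndexError tells iter() that the sequence has ended.
            raise IndexError(index)
        return index * index


squares = list(iter(Squares(4)))  # [0, 1, 4, 9]
```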
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156371
Approved by: https://github.com/zou3519
2025-07-29 21:24:20 +00:00
1bcb2f41e0 [BE] Eliminate workspace info in templates with new API (#159055)
Summary: Moves the workspace info calculations to the old TMA API.

Test Plan:
NFC

Rollback Plan:

Differential Revision: D78904434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159055
Approved by: https://github.com/NikhilAPatel
2025-07-29 21:22:36 +00:00
8460131087 [nativert] Add OSS version of ModelRunner (#159268)
Summary: Implement a ModelRunner from scratch with the minimum features for OSS only

Test Plan:
test_export -r NativeRT

Rollback Plan:

Differential Revision: D78979812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159268
Approved by: https://github.com/dolpm
2025-07-29 21:08:14 +00:00
c0c24b61ff Revert "Partitioner: Fix to align partition node order with original graph (#157892)"
This reverts commit 2d1e92307d3e67622f4fe8058d62e44fe4fa2f4e.

Reverted https://github.com/pytorch/pytorch/pull/157892 on behalf of https://github.com/yangw-dev due to fails internal tests : [executorch/backends/xnnpack/partition/xnnpack_partitioner.py:101:24] Incompatible parameter type [6]: In call `Partition.__init__`, for argument `nodes`, expected `Optional[Iterable[Tuple[Node, Optional[int]]]]` but got `dict_keys[Node, str]`. ([comment](https://github.com/pytorch/pytorch/pull/157892#issuecomment-3134004881))
2025-07-29 20:41:45 +00:00
4fac43b21f [BE] Move _freeze.py to torch/fb/utils (#159307)
Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary.

Test Plan:
It's a targets change so CI should suffice

Rollback Plan:

Differential Revision: D78910653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
2025-07-29 20:07:17 +00:00
b794e77b7b Disable cudagraph GCs by default (#158649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158649
Approved by: https://github.com/eellison
ghstack dependencies: #158193
2025-07-29 19:56:11 +00:00
d987a6f7f0 Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)"
This reverts commit abcb24f4de11f8fedf2c2c9ff53b6092ef42306d.

Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))
2025-07-29 19:49:40 +00:00
5d93127c87 Revert "[HOP, map] Rework of map autograd to the new interface (#153343)"
This reverts commit 24b1f10ca13d682430725c511812e43a35fcd6a6.

Reverted https://github.com/pytorch/pytorch/pull/153343 on behalf of https://github.com/yangw-dev due to an older PR that this PR depends on needing to be reverted; rebase it after that one is back in ([comment](https://github.com/pytorch/pytorch/pull/153343#issuecomment-3133816812))
2025-07-29 19:46:42 +00:00
a3a51282db Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-29 19:26:20 +00:00
e557b3d5e5 Revert "[inductor] Fix mm decomposition evaluating symints (#158998)"
This reverts commit 52e180c3799a7638ee668b1291a711865ab8cfec.

Reverted https://github.com/pytorch/pytorch/pull/158998 on behalf of https://github.com/yangw-dev due to it broke trunk with pr_time_benchmark test  ([comment](https://github.com/pytorch/pytorch/pull/158998#issuecomment-3133696775))
2025-07-29 19:04:11 +00:00
f3a9e99036 Fix inductor cuda sort nan behavior (#159308)
Fix for https://github.com/pytorch/pytorch/issues/152423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159308
Approved by: https://github.com/isuruf
2025-07-29 19:02:45 +00:00
f7d6e9f500 [dynamo][guards] More small guard optimizations (#159345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159345
Approved by: https://github.com/williamwen42
ghstack dependencies: #159288
2025-07-29 18:36:49 +00:00
e43e09e6c1 [dynamo][guards] Use lambda guards for object aliasing to improve object aliasing guards (#159288)
# Note - On Lambda guarding of object aliasing
        # We previously installed object‑aliasing guards as relational guards,
        # but that undermined the recursive‑dict guard optimization: placing the
        # aliasing guard at a leaf prevented the parent dict node from
        # qualifying as a recursive‑dict guard root. Because aliasing guards are
        # rare, we now emit them as epilogue guards via a small Python lambda.
        # This repeats the access in Python—adding a bit of work—but the
        # overhead is outweighed by the gains from enabling recursive‑dict guard
        # optimization.
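
A rough plain-Python sketch of the idea described in the note above, assuming nothing about Dynamo's actual guard classes (`make_alias_guard` and the `scope` dictionary are hypothetical): the aliasing check is a small lambda that re-reads both access paths and compares identity, and it runs after the other guards rather than living at a leaf of the guard tree.

```python
def make_alias_guard(get_a, get_b):
    """Build an epilogue-style check that two access paths yield the same object."""
    # Both accessors are re-evaluated in Python each time the guard runs.
    return lambda scope: get_a(scope) is get_b(scope)


shared = [1, 2, 3]
scope = {"x": shared, "cfg": {"y": shared}}

guard = make_alias_guard(lambda s: s["x"], lambda s: s["cfg"]["y"])
ok_before = guard(scope)  # True: both paths reach the same list object

scope["cfg"]["y"] = [1, 2, 3]  # equal contents, but a different object
ok_after = guard(scope)  # False: the aliasing relationship no longer holds
```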

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
2025-07-29 18:36:49 +00:00
2004f8aa10 FXConverter handling of generic output in inductor fallback kernel (#159002) (#159297)
Summary:

A fallback kernel's output may be a non-list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such case.

Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.

Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
    line.codegen_fx(self)(line)
  File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
    inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
 (Full error paste in P1878839403)

With this diff the error is no longer present.
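
The failure mode in the traceback is indexing `indices[0]` on an empty list. A minimal sketch of a defensive version follows; `index_tail` is a hypothetical helper and is not the code in `wrapper_fxir.py`, and the `(type, position)` shape of each entry is an assumption made only for this example:

```python
def index_tail(indices):
    """Hypothetical helper: return the tail of the first index entry, or () if there is none."""
    if not indices:
        # Generic (non-list/tuple) output: there is nothing to index into.
        return ()
    return tuple(indices[0][1:])


empty_case = index_tail([])          # ()
tuple_case = index_tail([(list, 0)]) # (0,)
```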

Rollback Plan:

Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
2025-07-29 18:29:01 +00:00
31b3b38e3a Ensure export joint with descriptors + compile works (#159337)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159337
Approved by: https://github.com/wconstab
ghstack dependencies: #159336
2025-07-29 17:43:52 +00:00
2f0db0444e Track previous MetricsContext edits for ease of debugging. (#159336)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159336
Approved by: https://github.com/wconstab
2025-07-29 17:43:52 +00:00
6162e650b0 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-29 17:40:49 +00:00
5d89634ca8 Graph break with error message (#158800)
Fixes #157452

Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```

### Release Notes

Change to nn.Parameter Constructor Behavior in Dynamo

This is a semantic change to the nn.Parameter constructor. Previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are advised to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. If moving the constructor is not possible, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
2025-07-29 17:34:49 +00:00
52e180c379 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
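
As a plain-Python illustration of why a product can stand in for a loop in the (Mx1)x(1xN) case (this is not the Inductor decomposition itself; `matmul` and `outer_product` are hypothetical helpers), the matrix product of a column and a row is just the elementwise product of the two broadcast against each other:

```python
def matmul(a, b):
    """Reference matrix product with an explicit loop over the shared dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def outer_product(col, row):
    """(M x 1) times (1 x N) written as a single product per output element."""
    return [[a_i[0] * b_j for b_j in row[0]] for a_i in col]


col = [[1], [2], [3]]  # shape 3 x 1
row = [[10, 20]]       # shape 1 x 2
via_loop = matmul(col, row)
via_product = outer_product(col, row)  # [[10, 20], [20, 40], [30, 60]]
```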

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-29 17:29:38 +00:00
c55e72bea1 [Re-land][Inductor] Support native Inductor as backend for MTIA (#159211)
The previous [diff/PR](https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error:
<img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" />
I didn't add the docstring because I thought I wasn't supposed to add a docstring to an EXISTING function.

So this diff/PR is an exact copy of the previous one, except for the added docstring.

-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initializing MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA-specific graph optimization and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits from PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototyping diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2025-07-29 17:03:24 +00:00
750348b579 [NativeRT] Clean up use of TargetDevice in KernelFactory (#159298)
Summary:
Remove use of targetDevice in KernelFactory.

AOTI would infer device when creating AOTIDelegateExecutor.

Test Plan:
CI

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D79007317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
2025-07-29 16:24:33 +00:00
52b9af163c Add avg_pool3d for MPS (#158877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158877
Approved by: https://github.com/malfet
2025-07-29 15:22:22 +00:00
f4bfac11c7 [Precompile] [easy] API For Editable PrecompileCacheArtifacts (#158586)
This adds an option for backend precompile artifacts to be *editable*, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them.

This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(), it takes the new autotuning results at runtime and saves only the necessary triton kernels.

This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.
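
The flow above can be sketched in a few lines of plain Python. This is only an illustration of deferring serialization until after edits; `EditableArtifact`, its methods, and the dictionary layout are hypothetical and are not the `PrecompileCacheArtifact` API:

```python
import pickle
from typing import Any, Callable


class EditableArtifact:
    """Hypothetical stand-in for a cache artifact that is serialized lazily."""

    def __init__(self, content: Any):
        # Keep the live object around instead of pickling it right away.
        self._content = content

    def edit(self, edit_fn: Callable[[Any], Any]) -> None:
        # Apply a caller-supplied edit, e.g. to fold in later autotune results.
        self._content = edit_fn(self._content)

    def serialize(self) -> bytes:
        # Serialization happens only here, so it sees every earlier edit.
        return pickle.dumps(self._content)


artifact = EditableArtifact({"kernels": ["k0", "k1", "k2"], "best": None})
# Pretend autotuning later picked "k1"; keep only the kernel that is needed.
artifact.edit(lambda c: {"kernels": ["k1"], "best": "k1"})
restored = pickle.loads(artifact.serialize())
```

The design point is that `serialize()` is the only place bytes are produced, so anything learned between compilation and saving can still shape what is written.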

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
2025-07-29 14:53:21 +00:00
8d00833fdb [PP] Fix eval step under no_grad() (#159293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159293
Approved by: https://github.com/tianyu-l, https://github.com/wconstab
2025-07-29 14:42:33 +00:00
de529ef002 [ONNX] onnx.md to simplify deprecated entities (#159312)
Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312
Approved by: https://github.com/titaiwangms
2025-07-29 14:24:17 +00:00
61aa2ae20f Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341)"
This reverts commit e469414b59ceeaae2860e36708de8852b9892776.

Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))
2025-07-29 13:56:20 +00:00
9d32aa9789 Help fix numpy detection in cross compiled layouts (#137084)
We had trouble at conda-forge getting numpy to get detected on aarch64 due to our splayed layout and cross compilation needs.

see:
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/256
* https://github.com/conda-forge/pytorch-cpu-feedstock/issues/266
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/267

This is my attempt at making an "upstreamable patch" that tries to follow your structure.

It could introduce a new environment variable `Python_NumPy_INCLUDE_DIR` if you want, but CMake doesn't use it as an environment variable, so I feel like that would be weird.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137084
Approved by: https://github.com/atalman
2025-07-29 12:08:56 +00:00
5cf77a0ea2 Fix redistribution costs for slice_scatter (#159223)
We were previously assuming that the `input_strategy == src_strategy`, which is not true in all cases.

This should fix this.

On the side, I also realized that for `slice_scatter` some DTensorSpecs don't have TensorMeta, e.g., https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L524

It would be good to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159223
Approved by: https://github.com/ezyang, https://github.com/wconstab
2025-07-29 12:00:39 +00:00
efcf87654e [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-29 08:05:56 +00:00
2523e58781 unbacked handling for view_copy (#159244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159244
Approved by: https://github.com/bobrenjc93
2025-07-29 07:10:46 +00:00
222fa451a2 Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-29 05:43:53 +00:00
6de24135e5 Fix flaky test_inductor_multiple_specializations (#159264)
Summary: This test was using do_bench, so it was flaky because performance is non-deterministic.

Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled

Rollback Plan:

Differential Revision: D79098692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
2025-07-29 05:16:55 +00:00
27ae72036d [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable (#159276)
Differential Revision: [D79106238](https://our.internmc.facebook.com/intern/diff/D79106238/)

This is in prep for cutlass upgrade.

More context: https://github.com/NVIDIA/cutlass/issues/2487

Tested in https://github.com/pytorch/pytorch/pull/159115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159276
Approved by: https://github.com/adamomainz, https://github.com/njriasan, https://github.com/Skylion007
2025-07-29 04:40:24 +00:00
e924df23a6 [NativeRT] Strengthen matcher check for StaticDispatch kernel (#159187)
Summary:
Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU.

Previously, we only checked that the output tensor was on CPU. This missed the case where we do a DeviceToHost aten._to_copy.

Prepare for turning on static dispatch kernel by default.
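
A toy sketch of the stricter check, assuming a simplified node representation (`FakeNode` and `is_cpu_only` are hypothetical and unrelated to the real NativeRT matcher): checking only outputs would accept a device-to-host copy, whereas checking inputs, outputs, and device attributes rejects it.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical node: device names for inputs, outputs, and Device-typed attributes."""

    input_devices: list
    output_devices: list
    device_attrs: list = field(default_factory=list)


def is_cpu_only(node):
    """Accept the node only if every tensor and device attribute is on CPU."""
    all_devices = node.input_devices + node.output_devices + node.device_attrs
    return all(device == "cpu" for device in all_devices)


cpu_add = FakeNode(input_devices=["cpu", "cpu"], output_devices=["cpu"])
# A device-to-host copy has a CPU output but a non-CPU input.
d2h_copy = FakeNode(input_devices=["cuda"], output_devices=["cpu"], device_attrs=["cpu"])

accepts_add = is_cpu_only(cpu_add)    # True
accepts_copy = is_cpu_only(d2h_copy)  # False
```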

Test Plan:
I should add some tests before landing.

Rollback Plan:

Differential Revision: D78747600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
2025-07-29 04:03:49 +00:00
67e68e0785 [c10d] Cleanup split_group logic using the newly built splitGroup (#158488)
With https://github.com/pytorch/pytorch/pull/157716 merged, we want to further clean up the Python-side code for the `split_group` API. We still need to keep some of the old global bookkeeping for BC. The rest of the logic now lives in C++. Regarding the change introduced in https://github.com/pytorch/pytorch/pull/152175, we did the cleanup in https://github.com/pytorch/pytorch/pull/158790 (including internal changes), so it can now be safely removed.

Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488
Approved by: https://github.com/d4l3k
2025-07-29 03:27:11 +00:00
775788f93b [BE][PYFMT] migrate PYFMT for test/[i-z]*/ to ruff format (#144556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144556
Approved by: https://github.com/ezyang
2025-07-29 03:26:09 +00:00
19ce1beb05 [AOTInductor] Add test for enabling CUDACachingAllocator for AOTInductor's Weight (#159279)
Summary:
Add test for enabling CUDACachingAllocator for AOTInductor's Weight.
Implementation TBD

Test Plan:
N/A, commit is adding a test.

Rollback Plan:

Differential Revision: D79107507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159279
Approved by: https://github.com/desertfire, https://github.com/jingsh
2025-07-29 02:52:10 +00:00
a91ddea61f Add CPython tests for collections module (#158950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158950
Approved by: https://github.com/zou3519
2025-07-29 02:24:27 +00:00
ffccb90ff4 [dynamo, docs] add fullgraph=False docs (#159050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050
Approved by: https://github.com/svekars, https://github.com/anijain2305
ghstack dependencies: #157985, #158055, #158531
2025-07-29 01:53:47 +00:00
f916f34739 [dynamo, docs] non-strict programming model docs (#158531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531
Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #157985, #158055

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:47 +00:00
c32994ce4b [docs, dynamo] add fullgraph=True, common graph breaks docs (#158055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055
Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305
ghstack dependencies: #157985

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:41 +00:00
433e43cbec [dynamo, docs] programming model dynamo core concepts (#157985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985
Approved by: https://github.com/svekars, https://github.com/anijain2305
2025-07-29 01:53:34 +00:00
e469414b59 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
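
To make the arithmetic concrete, here is a small Python sketch that emulates the 32-bit and 64-bit computations with `ctypes`. The helper names and sample sizes are made up for illustration, and the wraparound shown is only the typical two's-complement outcome (signed overflow is undefined behavior in C++):

```python
import ctypes


def offset_int32(mb_start, n, nb_start):
    """Emulate mb_start * N + nb_start truncated to a 32-bit signed integer."""
    return ctypes.c_int32(mb_start * n + nb_start).value


def offset_int64(mb_start, n, nb_start):
    """Same expression evaluated with 64-bit signed integers."""
    return ctypes.c_int64(mb_start * n + nb_start).value


mb_start, n = 50_000, 50_000  # product is 2.5e9, above 2**31 - 1
wrapped = offset_int32(mb_start, n, 0)  # -1794967296: a negative, invalid offset
exact = offset_int64(mb_start, n, 0)    # 2500000000
```

A negative offset added to `C_data` points before the buffer, which is consistent with the segmentation fault described above.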

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-07-29 01:14:50 +00:00
657e5e9aa6 All custom operators go through Inductor's graph.call_function (#159174)
Fixes #158892

All custom operators should go through the graph.call_function path. The
other fallback path is for aten/prim operations that don't have support
for things (like torch.float8_e8m0fn).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
2025-07-29 00:31:57 +00:00
f02b783aae [1/N] Remove MacOS-13 MPS testing (#159278)
Starts addressing https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159278
Approved by: https://github.com/dcci
ghstack dependencies: #159277
2025-07-28 23:52:47 +00:00
8ad96a563c [inductor] normalize path of the code. (#159255)
Error stack:
<img width="1361" height="345" alt="image" src="https://github.com/user-attachments/assets/50fb2baa-34fd-4a48-a3e7-76e3185391d4" />

After fix:
<img width="1103" height="398" alt="image" src="https://github.com/user-attachments/assets/ece5a9ba-a085-46fe-b061-0c2ebda3a2df" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159255
Approved by: https://github.com/desertfire
2025-07-28 23:42:11 +00:00
59e261bbd8 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit f5130bf339f12ccf5c6296130c47685bdc4858e4.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/yangw-dev due to this PR failing internally when building torchgen with the error: fail: Unknown PyPI project: pyyaml. This seems to be caused by changing PyYAML to pyyaml; please fix it ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3129995414))
2025-07-28 22:02:10 +00:00
08ea8fccaf [ez][docker] Remove some unused vars and scripts (#158680)
`CUDNN_VERSION` isn't used in any Dockerfiles, it's picked automatically based on the cuda version in `install_cuda.sh`

`install_cudnn.sh` isn't used anywhere, cudnn installation happens in `install_cuda.sh`

I didn't find any mentions of `GRADLE_VERSION` or `TENSORRT_VERSION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158680
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-07-28 21:44:47 +00:00
41754539be Add 3.14 triton wheel build (#159261)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159261
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-28 20:34:16 +00:00
716d52779f [BE] Delete non-existing labels (#159277)
As no such runners have been online for the last 2+ months
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159277
Approved by: https://github.com/clee2000
2025-07-28 20:28:57 +00:00
3bf41f26c8 [cutlass] rename EVT args within kernels for code caching (#159243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159243
Approved by: https://github.com/henrylhtsang
2025-07-28 19:01:40 +00:00
19aa8eb4f5 [TF32][Flex Attention] Turn off TF32 for reference computation in test_flex_decoding (#158979)
Seems to avoid threshold (fudge factor) twiddling games as this causes the checks to go down the "very small ref error" path instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158979
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng, https://github.com/nWEIdia
2025-07-28 18:38:23 +00:00
8c0c5c58c7 [benchmarks] Set model name early to keep warmup and main model same (#159231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231
Approved by: https://github.com/williamwen42
ghstack dependencies: #159209
2025-07-28 18:18:16 +00:00
2d1e92307d Partitioner: Fix to align partition node order with original graph (#157892)
Fixes #157891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892
Approved by: https://github.com/ezyang
2025-07-28 17:36:29 +00:00
399c89e15c fix torch/distributed contributing doc (#158934)
Both pointers lead to a page of empty GitHub issues. This moves them to point to all issues tagged with `pt_distributed_rampup`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158934
Approved by: https://github.com/d4l3k
2025-07-28 17:01:05 +00:00
14d67eec05 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 9b4d938f04c95cebe0fbd96974f64c935567e039.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/ZainRizvi due to This was reverted internally. Somehow this PR didn't get reverted alongside it. See D78772867. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3128148475))
2025-07-28 16:58:27 +00:00
9ad7dd54f9 [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string (#158896)
Summary: - Upgrade KernelLauncher kernelLaunchCheck to print help string, following D78440016

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Rollback Plan:

Differential Revision: D78572009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158896
Approved by: https://github.com/atalman
2025-07-28 16:11:13 +00:00
387db86ef1 Name Inductor's Subproc pool threads. (#158815)
Differential Revision: D78710371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158815
Approved by: https://github.com/d4l3k
2025-07-28 16:08:08 +00:00
e5a1d839c5 [nativert] ensure planner once flag is class-local, not static. (#159116)
Summary: att - otherwise only one global planner will be made even though we need it to be per-model if models are colocated.

Differential Revision: D78939141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159116
Approved by: https://github.com/SherlockNoMad
2025-07-28 16:06:21 +00:00
c06164a9c5 [nativert][ez] Remove unused dist collectives ops. (#159220)
Removes the dependency on c10d/ in ExecutionFrame.h. We don't need c10d::Work in the frame.

Differential Revision: [D79041618](https://our.internmc.facebook.com/intern/diff/D79041618/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159220
Approved by: https://github.com/SherlockNoMad, https://github.com/dolpm
2025-07-28 16:03:14 +00:00
c7586d4ed3 typo (#156560)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156560
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-07-28 15:40:06 +00:00
8e07c9870d [dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#157566)
A function defined inside a torch.compile.disable function always triggers recompilation, because a user inner function decorated with torch._dynamo.disable is used as an argument to the resume_in_xx function. In the current implementation, it is always a new object, so the ID_MATCH guard always fails and triggers recompilation.
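
The identity problem can be sketched in plain Python as an analogy only. This is not Dynamo's implementation; `make_inner_uncached`, `make_inner_cached`, and `_cache` are hypothetical names used to show why an identity-based check fails on freshly created function objects and passes once the object is cached.

```python
def make_inner_uncached():
    # A new function object is created on every call.
    def inner():
        return 1
    return inner


_cache = {}


def make_inner_cached():
    # The function object is created once and reused afterwards.
    if "inner" not in _cache:
        def inner():
            return 1
        _cache["inner"] = inner
    return _cache["inner"]


first, second = make_inner_uncached(), make_inner_uncached()
uncached_same = first is second  # identity check fails: two distinct objects

third, fourth = make_inner_cached(), make_inner_cached()
cached_same = third is fourth  # identity check passes: same cached object

print(uncached_same, cached_same)
```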

Fixes https://github.com/pytorch/pytorch/issues/157399
@xmfan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157566
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2025-07-28 12:44:22 +00:00
a76147c9e0 [xla hash update] update the pinned xla hash (#158223)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158223
Approved by: https://github.com/pytorchbot
2025-07-28 11:19:05 +00:00
f3913ea641 [CUDA] fix nansum in non-JIT build (#158633)
This change fixes a crash in:
```
import torch
a = torch.tensor([[1, 2]], dtype=torch.complex32).to('cuda')
b = torch.nansum(a, dim=0)
print(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158633
Approved by: https://github.com/ngimel
2025-07-28 08:11:32 +00:00
1abff80fae Reland D78841818 (#159216)
Summary: Relanding D78841818 with fixes

Test Plan:
Tested all failing tests

buck build --config fbcode.use_link_groups=true --flagfile fbcode//mode/dev-nosan fbcode//sigmoid/core/executor/memory/test:layout_planner_tests

buck test 'fbcode//mode/opt' fbcode//sigmoid/inference/test:test_passes

Rollback Plan:

Reviewed By: hl475

Differential Revision: D79038615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159216
Approved by: https://github.com/dolpm
2025-07-28 07:39:35 +00:00
799303f655 Fix atleast_{1,2,3}d() with no arguments description (#156042)
Fixes #130667

## Test Result

### Before
![image](https://github.com/user-attachments/assets/7e3a6764-872a-4573-8bec-e7219f920a15)
![image](https://github.com/user-attachments/assets/194be00c-9a29-44cf-b6bc-4d261a12d04e)
![image](https://github.com/user-attachments/assets/21cd6a4f-0793-44e3-9073-7b8b801f997c)

### After

![image](https://github.com/user-attachments/assets/fdbaa2ff-f13c-4fa9-bf52-0810faa698bd)
![image](https://github.com/user-attachments/assets/0374b474-4c6b-4b7d-abea-70e3df0c0a06)
![image](https://github.com/user-attachments/assets/9f9dc188-60e2-4c0f-9e23-36a39310008c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156042
Approved by: https://github.com/zou3519
2025-07-28 06:25:23 +00:00
d26ab281d2 Revert "Setup TorchBench in Docker (#158613)"
This reverts commit d72ebefe3fa7d3ee0e9c9b399f5c07611e790664.

Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/XuehaiPan due to checkout_install_torchbench function is removed but still referenced in trunk ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3125695250))
2025-07-28 06:19:00 +00:00
1cffb217ef Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)"
This reverts commit e88f804a2eecf967dbbf95c5643248352626dafd.

Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/XuehaiPan due to Breaks Windows wheels ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3125566269))
2025-07-28 05:29:37 +00:00
c8342b7231 [vllm hash update] update the pinned vllm hash (#159235)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159235
Approved by: https://github.com/pytorchbot
2025-07-28 04:16:31 +00:00
f63673626d [dynamo][guards] Skip guards on constant func.__defaults__ elements (#159209)
Func.__defaults__ is a tuple. Therefore, we can skip guards on immutable elements. Mutable elements are still guarded.
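
A small plain-Python sketch of the property this relies on, assuming a hypothetical function `scale` with one immutable and one mutable default. It only shows how `__defaults__` behaves in Python and does not reflect how Dynamo installs or skips guards.

```python
def scale(x, factor=2, history=[]):
    # `factor` has an immutable default; `history` has a mutable one.
    history.append(x)
    return x * factor


defaults = scale.__defaults__
is_tuple = isinstance(defaults, tuple)

# Elements of the tuple cannot be replaced in place.
try:
    defaults[0] = 3
    item_assignment_allowed = True
except TypeError:
    item_assignment_allowed = False

# A mutable element can still change its contents between calls.
scale(5)
mutated_history = list(scale.__defaults__[1])

print(is_tuple, item_assignment_allowed, mutated_history)
```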

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159209
Approved by: https://github.com/jansel
2025-07-27 22:46:17 +00:00
37638c303e Addressing some linter errors (#158670)
Summary: Addressing the linter errors reported in the changed files.

Test Plan:
```
buck test mode/opt deeplearning/fbgemm:QuantUtilsTest
```
https://www.internalfb.com/intern/testinfra/testrun/11821949118528688

```
buck test mode/opt caffe2/torch/fb/model_transform/splitting/tests:split_dispatcher_test
```
https://www.internalfb.com/intern/testinfra/testrun/7881299627525465

Rollback Plan:

Differential Revision: D78352311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158670
Approved by: https://github.com/excelle08, https://github.com/cyyever, https://github.com/digantdesai
2025-07-27 21:55:50 +00:00
ee2edf3d37 [ROCm][CK][Inductor] enable gfx950 for max autotune with CK (#159195)
+ update inductor config for new gfx arch
+ fixes in codegen for conv2d and ck-tile matmul
+ use appropriate fp8 dtypes
+ test cleanup

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159195
Approved by: https://github.com/chenyang78
2025-07-27 20:47:13 +00:00
51eb41a57e Enable dynamic shapes for foreach operations by default (#158985)
## Summary

This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`.

## Context

The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows.

## Why This Change?

After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations.

### Current behavior:
- Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes`
- Without this flag, foreach operations may fail with dynamic shapes
- This creates friction and confusion

### With this change:
- Foreach operations work seamlessly with dynamic compilation
- No manual configuration needed
- Better "it just works" experience

## Testing

Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1:
-  Various tensor sizes (8, 16, 32, 64, 128)
-  Multiple tensors in operations (tested up to 20)
-  Nested foreach operations
-  Mixed operations (foreach + standard operations)
-  Both CPU and CUDA devices
-  Symbolic shapes with dynamic compilation

## Impact Assessment

- **Performance**: No impact - this only affects compilation behavior
- **Backward Compatibility**: Fully maintained - users can still set to `False`
- **Risk**: Minimal - feature has been stable since August 2024

## References

- Original implementation: PR #134477 by @qchip
- This completes the feature rollout by making it available by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-27 19:56:07 +00:00
ede6186c86 [PP] Allow intermediate nodes in ZB to have multiple grads (#159084)
Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792)

Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing, and I tracked it back to the FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666).

This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together or if the grad value is None, then ignores it). Here is an example where the backward would have two grad values (gI1, gI2):

```python
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x, 2
    @staticmethod
    def backward(ctx, gI1, gI2):
        assert gI2 is None
        return gI1
```
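
The accumulation rule described above (sum the grads, ignore `None`) can be sketched as follows. This is a minimal illustration, not the code in `stage_backward_weight`: the helper `accumulate_grads` is hypothetical, and plain floats stand in for tensors.

```python
from typing import Optional, Sequence


def accumulate_grads(grads: Sequence[Optional[float]]) -> Optional[float]:
    """Sum all non-None grads; return None if every grad is None."""
    total: Optional[float] = None
    for grad in grads:
        if grad is None:
            continue  # a None grad contributes nothing
        total = grad if total is None else total + grad
    return total


print(accumulate_grads([1.5, None]))       # only one real grad
print(accumulate_grads([1.0, 2.0, None]))  # two real grads are summed
```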

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084
Approved by: https://github.com/tianyu-l
2025-07-27 19:16:51 +00:00
6d071bd65d Remove numpy dependency from onnx (#159177)
One should not expect numpy to be there during onnx import
Forward fix for : https://github.com/pytorch/pytorch/pull/157734
Added regression test to `test_without_numpy` function

Test plan: Run `python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"` with/without this fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159177
Approved by: https://github.com/atalman, https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/andrewboldi
2025-07-27 13:23:03 +00:00
cyy
d742a2896c Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-07-27 07:13:27 +00:00
11d6559a58 [inductor] disable failed UTs of test_misc.py (#159210)
Disable failed UTs.

<img width="1195" height="118" alt="image" src="https://github.com/user-attachments/assets/da0933fb-3c4c-44c9-ba85-45971f03405f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159210
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-07-27 05:41:44 +00:00
e7667e5702 [vllm hash update] update the pinned vllm hash (#159217)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159217
Approved by: https://github.com/pytorchbot
2025-07-27 04:16:35 +00:00
cyy
f6c89c1ef3 Detach tensor before clone in SGD optimiser and other code (#159204)
Reverse the pattern of tensor clone followed by detach in SGD and other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159204
Approved by: https://github.com/Skylion007
2025-07-27 03:31:12 +00:00
d72ebefe3f Setup TorchBench in Docker (#158613)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-26 12:56:03 -07:00
46b925681c [inductor] Update to(tl.int8).to(tl.uint8) workaround from #94717 to handle entire range of torch.uint8 (#158567)
https://github.com/pytorch/pytorch/pull/94717/files#r2210265070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-07-26 19:11:37 +00:00
fe0ff12dab Revert "[Inductor] Support native Inductor as backend for MTIA (#158526)"
This reverts commit cd68559d0451185f8521912c23e77b83d76b87cf.

Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))
2025-07-26 17:58:00 +00:00
7dafab6a93 Fix SDPA sharding when return_debug_mask is False (#159205)
If `return_debug_mask` is False (which is the default value for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means that the shardings passed for the batch and CP cases can yield invalid dimensions.

This PR fixes it for `scaled_dot_product_flash_attention_strategy`. Note that `scaled_dot_product_cudnn_attention_strategy` doesn't have this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205
Approved by: https://github.com/wconstab
2025-07-26 17:41:42 +00:00
f5130bf339 [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-26 17:12:29 +00:00
f62772f365 Revert "Remove tensorexpr tests (#158928)"
This reverts commit 517eebc1dd4ae6430a95818b16c5f8b4b10fd1bc.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks trunk test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_frac_cpu_bfloat16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16534544469/job/46768022799) [HUD commit link](517eebc1dd) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3122158944))
2025-07-26 17:01:54 +00:00
e2b2685f84 [inductor] enable compiled autograd on CPU windows - v2 (#159185)
The first version: https://github.com/pytorch/pytorch/pull/158432
Compiled autograd on Windows was disabled in PR #144707 because CUDA on Windows cannot compile this code.
However, this code can be compiled on CPU. This PR enables it on CPU Windows.

The first version changed the logic of the ifdef block, which caused the torch audio build to fail: https://github.com/pytorch/audio/issues/3992

This is version two, which keeps the original logic.

# Local test torch audio build pass:
<img width="874" height="1043" alt="image" src="https://github.com/user-attachments/assets/9657be86-04f7-4c66-b8c6-802ec2a7c5c8" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159185
Approved by: https://github.com/xmfan
2025-07-26 16:21:28 +00:00
3db8623dcb Revert "[NativeRT] Apply Device placement once when loading the graph (#158996)"
This reverts commit 28ee8be5bfeebb2e44daace6551462b52557e451.

Reverted https://github.com/pytorch/pytorch/pull/158996 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158996#issuecomment-3121540050))
2025-07-26 09:05:26 +00:00
cd68559d04 [Inductor] Support native Inductor as backend for MTIA (#158526)
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (Triton kernel + Python wrapper code) similar to CUDA, and the Triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initializing MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach (option 3) aims to provide a PyTorch-native Inductor integration for MTIA, minimizing the onboarding overhead. The downside is that it doesn't leverage MTIA-specific graph optimization and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits from PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototyping diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526
Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison
2025-07-26 08:16:34 +00:00
62a49d929b [vllm hash update] update the pinned vllm hash (#159198)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159198
Approved by: https://github.com/pytorchbot
2025-07-26 04:44:38 +00:00
c6b479bc09 remove guard_or_x from allowlist_for_publicAPI (#159181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159181
Approved by: https://github.com/albanD
2025-07-26 01:22:17 +00:00
cyy
517eebc1dd Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-07-26 01:21:01 +00:00
7f266020de add softmax_backward_strategy missing field (#159167)
Add input_specs in softmax_backward_strategy, as is needed by AutoParallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159167
Approved by: https://github.com/XilunWu
2025-07-26 00:53:53 +00:00
e06798191b Split out C++ code from fused adagrad PR (#159008)
The original fused Adagrad pull request was: PR#153038

This PR contains only the c++ code of that original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159008
Approved by: https://github.com/janeyx99
2025-07-26 00:36:59 +00:00
eqy
c89fa88acb [conv][cuDNN][64-bit indexing] reduce memory usage of depthwise conv 64-bit indexing test (#158981)
Use half instead for reduced memory usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158981
Approved by: https://github.com/soulitzer, https://github.com/Skylion007
2025-07-25 23:58:45 +00:00
f5cf05c983 Throw invalid_argument instead of RuntimeError when parameters exceed… (#158267)
Throw invalid_argument instead of RuntimeError when parameters exceed limits (for torch.int32 dtype)

Fixes #157707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158267
Approved by: https://github.com/albanD
2025-07-25 23:49:46 +00:00
21a95bdf7c [Inductor] [Triton] Enabling TMA for flex-attention for supported device types (#157822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157822
Approved by: https://github.com/drisspg
ghstack dependencies: #159123
2025-07-25 23:45:26 +00:00
fb029accb7 (is_non_overlapping_and_dense) gso to guard_or_false in when checking length 1 (#158894)
Switch from `guard_size_oblivious` to `guard_or_false`. If a DDE is encountered, this falls back to computing elementwise strides.

2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)

We think it's safe because Laith tested whether this fallback would fail any tests. It did not.
https://github.com/pytorch/pytorch/pull/158157

## Data-dependent exceptions (DDE)
```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy
    x_tensor = torch._prims.convert_element_type(x_tensor, dtype)
  ...
  File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta
    if torch._prims_common.is_non_overlapping_and_dense(a):
  File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense
    if guard_size_oblivious(length == 1):
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)).  (Size-like symbols: u0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894
Approved by: https://github.com/pianpwk, https://github.com/laithsakka
2025-07-25 23:43:38 +00:00
26f4dd5160 Scaled MM Fix NVfp4 (#159170)
Fixes mm on B200:
Before:
```Shell
    def _addmm_nvfp4_dispatch(
        a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
        The only difference is whether bias is None or not.
        """
        assert a._data.is_contiguous()
        assert b._data.t().is_contiguous()
        assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
        assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"

        M, K = a.shape[0], a.shape[1]
        N = b.shape[1]

        # Swizzle Dizzle
        if a._is_swizzled_scales:
            a_scale_blocked = a._scale_e4m3  # Already swizzled
        else:
            a_scale = a._scale_e4m3.view(M, K // a._block_size)
            a_scale_blocked = to_blocked(a_scale)

        if b._is_swizzled_scales:
            b_scale_blocked = b._scale_e4m3  # Already swizzled
        else:
            b_scale = b._scale_e4m3.view(N, K // b._block_size)
            b_scale_blocked = to_blocked(b_scale)

        # Merge double quant scales into 1 scale for Scale_In^D
        if a._per_tensor_scale is not None:
            assert b._per_tensor_scale is not None
            scale_result = a._per_tensor_scale * b._per_tensor_scale
        else:
            assert b._per_tensor_scale is None and a._per_tensor_scale is None
            scale_result = None

        # THIS IS A WORKAROUND:
        # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
        # When we have per-tensor scaling, we need to apply it before bias
        # since bias is not quantized
        should_add_bias_separately = (scale_result is not None) and (bias is not None)
        # should_add_bias_separately = bias is not None

>       result = torch._scaled_mm(
            a._data.view(torch.float4_e2m1fn_x2),
            b._data.view(torch.float4_e2m1fn_x2),
            a_scale_blocked.view(torch.float8_e4m3fn),
            b_scale_blocked.view(torch.float8_e4m3fn),
            bias=None if should_add_bias_separately else bias,
            out_dtype=a._orig_dtype,
            # scale_result=scale_result,  # Not supported yet
        )
E       RuntimeError: Invalid scaling configuration.
E       - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E       - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E       - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E       - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E       - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E       - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E       Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
2025-07-25 23:34:03 +00:00
b9e3eb64a7 [Optimus] Support decompose mm with dynamic shapes (#158821)
Summary: The current implementation does not decompose GEMMs with dynamic shapes, so we add one more option that lets users enable this.

Test Plan:
### how to enable

Step 1: Set decompose_mem_bound_mm = false
Step 2:
Add the decompose_mm_pass pattern to the post_grad_fusion_options
json config example:

"post_grad_fusion_options": {
            "decompose_mm_pass": {
              "min_first_dimension_decomposition": 10240, -> default value
              "max_other_dimention_decomposition": 32,  -> default value
             "skip_dynamic_shape_dim_check": true, -> default is false
            }
      },

yaml config example

```
 post_grad_fusion_options:
        decompose_mm_pass:
          skip_dynamic_shape_dim_check: true
```
Note that all these hyper-parameters can be set by users; if none is given, the default value is used.

### unit test

```
buck2 test @mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_dynamic_shape_decompose_addmm
```

Buck UI: https://www.internalfb.com/buck2/a98eb4b3-da1d-4450-9e49-472ba98b2267
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924745731095
Network: Up: 86KiB  Down: 1.3MiB  (reSessionID-96cf35cc-5189-4372-8f25-1fc6a52a3963)
Executing actions. Remaining     0/3                                                       1.4s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:00.6s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

before: aps-DPA_new_v0_amd_20250716-e7927755df
after: aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### qps and NE

{F1980635506}
 {F1980635505}
- 12.5% qps improvement with NE neutral

### trace analysis
baseline:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716-e7927755df%2F0%2Frank-1.Jul_22_22_28_01.4592.pt.trace.json.gz&bucket=aps_traces

{F1980633952}
proposal:https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-DPA_new_v0_amd_20250716_optimus-f2175fc9fb%2F0%2Frank-1.Jul_24_14_37_59.4576.pt.trace.json.gz&bucket=aps_traces

{F1980633966}

```
        unsqueeze_default: "bf16[32*s54, 8, 1][8, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_2, 2)
        unsqueeze_default_1: "bf16[1, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.unsqueeze.default(constant_pad_nd_default_3, 0);  constant_pad_nd_default_3 = None
        mul_tensor: "bf16[32*s54, 8, 8][64, 8, 1]cuda:0" = torch.ops.aten.mul.Tensor(unsqueeze_default, unsqueeze_default_1);  unsqueeze_default = unsqueeze_default_1 = None
```

### what has been decomposed
P1880443593

Rollback Plan:

Differential Revision: D78716034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158821
Approved by: https://github.com/Yuzhen11
2025-07-25 23:19:53 +00:00
69cc99525c [nn]: updated type alias for padding_mode in module/conv.py (#158843)
Fixes #152280

Changed type of `padding_mode` from `str` to `Literal["zeros", "reflect", "replicate", "circular"]`
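
A minimal sketch of what such a `Literal` alias gives you, assuming an alias name (`PaddingMode`) and a runtime helper (`check_padding_mode`) that are illustrative only and not taken from `conv.py`:

```python
from typing import Literal, get_args

# Hypothetical alias name; only the four Literal members come from the PR text.
PaddingMode = Literal["zeros", "reflect", "replicate", "circular"]


def check_padding_mode(mode: str) -> str:
    """Runtime counterpart of the static check: reject unknown modes."""
    valid = get_args(PaddingMode)
    if mode not in valid:
        raise ValueError(f"padding_mode must be one of {valid}, got {mode!r}")
    return mode


print(check_padding_mode("reflect"))
```

A type checker can flag a misspelled mode at the call site, while `get_args` lets the same alias drive a runtime check.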

**cc** @Skylion007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158843
Approved by: https://github.com/mikaylagawarecki
2025-07-25 23:05:02 +00:00
72af19dadf Add aot_autograd.fx_utils (#159005)
See docblock for details.  The API here has been validated by use
in autoparallel but I'm always open to suggestions for tweaks.  One
particular choice I made is to make most of the functions return dicts
by default; this isn't strictly necessary for inputs but it is very
convenient for outputs as the output desc lives on the output node,
not the argument that feeds into the node.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005
Approved by: https://github.com/wconstab
2025-07-25 22:52:33 +00:00
8aebf01287 [bucketing] Rewrite all_gather, reduce_scatter passes via tracing merge_fn (#158663)
Rewrites the bucketing of all_gather and reduce_scatter by defining a "merge graph" through traced torch functions:
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`

(instead of creating nodes and doing FakeTensor propagation manually).
This makes it possible to experiment with the merge function.

The all_gather merge function uses `foreach_copy_`, so this PR adds an Inductor lowering for `foreach_copy_`.

Adds a topological sort after the bucketing passes (comment in post_grad.py):
```
        # Fx collectives bucketing passes require topological sort for the cases:
        # when bucketed collectives have users before the last collective in the bucket
        # AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
        #
        # In this case we can not manually pick the place for bucketed collective insertion.
        # But we are guaranteed by the bucketing (independent collectives in the bucket),
        # that it is possible to reorder nodes to satisfy all ordering requirements.
        #
        # --- before bucketing ---
        # in0 = ...
        # wait_ag0 = ag(in0)
        # user0(wait_ag0)
        # ...
        # pre_in1 = ...
        # in1 = transform(pre_in1)
        # wait_ag1 = ag(in1)
        # user1(wait_ag1)
        #
        # --- after bucketing ---
        #
        # in0 = ...
        # user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
        #
        # pre_in1 = ...
        # in1 = transform(pre_in1)
        # ag_bucket(in0+in1)
        # wait_bucket
        # wait_ag0 = wait_bucket[0]
        # wait_ag1 = wait_bucket[1]
        # user1(wait_ag1)
```
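
The reordering described in the comment can be sketched on the toy graph from the "after bucketing" listing. This is not the pass used in post_grad.py; `reorder_nodes` is a hypothetical helper that repeatedly emits the earliest pending node whose inputs are already placed.

```python
def reorder_nodes(order, deps):
    """Return `order` rearranged so every node comes after its inputs.

    `deps` maps a node name to the names it consumes. Nodes keep their
    original relative order whenever the dependencies allow it.
    """
    placed, result = set(), []
    pending = list(order)
    while pending:
        for name in pending:
            if all(d in placed for d in deps.get(name, ())):
                result.append(name)
                placed.add(name)
                pending.remove(name)
                break  # restart the scan from the earliest pending node
        else:
            raise ValueError("dependency cycle, no valid order exists")
    return result


# Node order right after bucketing: user0 precedes the node that defines wait_ag0.
order = ["in0", "user0", "pre_in1", "in1", "ag_bucket",
         "wait_bucket", "wait_ag0", "wait_ag1", "user1"]
deps = {
    "user0": ["wait_ag0"],
    "in1": ["pre_in1"],
    "ag_bucket": ["in0", "in1"],
    "wait_bucket": ["ag_bucket"],
    "wait_ag0": ["wait_bucket"],
    "wait_ag1": ["wait_bucket"],
    "user1": ["wait_ag1"],
}
sorted_order = reorder_nodes(order, deps)
print(sorted_order)
```

In the sorted order `user0` moves to just after `wait_ag0`, while every other node keeps its relative position.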

Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel:

<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
2025-07-25 22:49:51 +00:00
bc5dbbbb78 support scalar tensor for functional all_gather (#149913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149913
Approved by: https://github.com/H-Huang
ghstack dependencies: #149912
2025-07-25 22:38:08 +00:00
36cf8f1ed8 [BE] Use .md instead of .rst for nn.aliases doc (#158666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158666
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491, #158654
2025-07-25 22:03:55 +00:00
1e79872f2e [BE] More torch.nn docs coverage test (except for torch.nn.parallel) (#158654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158654
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491
2025-07-25 22:03:55 +00:00
9e8f27cc79 [BE] Make torch.nn.modules.* satisfy the docs coverage test (#158491)
Options to address the "undocumented python objects":

1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.

#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear

This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
2025-07-25 22:03:55 +00:00
e65ab9a868 Enable generating generic c_shim that doesn't bypass dispatcher (#158974)
Adds `c_shim_aten.{h/cpp}` and use this for `fill_`

This is the generated `c_shim_aten.cpp` for reference

```cpp

// WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND.
// See 7e86a7c015/torchgen/gen.py (L2424-L2436) for details

// This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py

#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>

#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/CompositeExplicitAutogradFunctions.h>
#include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h>
#include <ATen/CompositeImplicitAutogradFunctions.h>
#else
#include <ATen/ops/fill.h>

#endif // AT_PER_OPERATOR_HEADERS

using namespace torch::aot_inductor;

AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) {
    AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
        at::fill_(
            *tensor_handle_to_tensor_pointer(self), value
        );
    });
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974
Approved by: https://github.com/albanD, https://github.com/janeyx99
2025-07-25 21:59:14 +00:00
bfe6765d6b [export] assert fix in serdes (#159060)
Summary: catch asserts on True

Test Plan:
T232064560

Rollback Plan:

Differential Revision: D78907485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159060
Approved by: https://github.com/yiming0416
2025-07-25 21:46:20 +00:00
e88f804a2e [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a bug in Python 3.12.0-3.12.4 that loses C call events. That bug was fixed by 257c413cd1 in 3.12.5.

So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.

There are ways to fix the problem correctly: for example, we could add a new monitoring callback to compensate for the call events of methods backed by C functions, or we could override the callback registered by `PyEval_SetProfile`.  These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
2025-07-25 21:44:57 +00:00
7ef3c3357d NUMA binding integration with elastic agent and torchrun (#149334)
Implements #148689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334
Approved by: https://github.com/d4l3k

Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>
2025-07-25 21:19:49 +00:00
24b1f10ca1 [HOP, map] Rework of map autograd to the new interface (#153343)
This PR reworks the current autograd implementation of map to the new interface.

@pytorchbot label "topic: not user facing"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343
Approved by: https://github.com/ydwu4
2025-07-25 21:17:06 +00:00
0006dd5c43 [test][torchbind] don't allow set torchbind attr at runtime (#158608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158608
Approved by: https://github.com/zou3519
ghstack dependencies: #158583, #158606, #158607
2025-07-25 20:55:41 +00:00
0f31e9a656 [torchbind] fix fakifying a static tensor accidentally returning dynamic (#158607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158607
Approved by: https://github.com/zou3519
ghstack dependencies: #158583, #158606
2025-07-25 20:55:41 +00:00
0427e439aa [test][torchbind] turn on inductor backend for compile torchbind tests (#158606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158606
Approved by: https://github.com/zou3519
ghstack dependencies: #158583
2025-07-25 20:55:41 +00:00
4aa69ae336 [torchbind] support register_autocast for torchbind custom op (#158583)
Fix https://github.com/pytorch/pytorch/issues/158414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158583
Approved by: https://github.com/zou3519
2025-07-25 20:55:41 +00:00
14c314b30d [nativert] make per-node benchmark work with memory planning (#159117)
Summary: Without this change, the per-node benchmark hits a use-after-free.

Rollback Plan:

Differential Revision: D78934104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159117
Approved by: https://github.com/SherlockNoMad
2025-07-25 20:46:17 +00:00
0b01e11416 [ez][export] add sym_sum to verified ops (#159111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159111
Approved by: https://github.com/angelayi
2025-07-25 20:42:42 +00:00
806d9e3fe7 [Inductor][TMA] Split config-gated and pure compatibility logic for TMA template eligibility checks (#159123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159123
Approved by: https://github.com/drisspg
2025-07-25 20:35:49 +00:00
d90ce83027 add a util function _make_all_gather_out_tensor to reduce code duplication (#149912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149912
Approved by: https://github.com/H-Huang
2025-07-25 20:29:01 +00:00
dfcb07bdfa [Inductor] Temporarily disable failing Windows UTs. (#159163)
Temporarily disable the UTs that fail on Windows.
<img width="1238" height="107" alt="image" src="https://github.com/user-attachments/assets/c8a40408-a793-4016-99bb-19c1bb09860a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159163
Approved by: https://github.com/desertfire
2025-07-25 20:25:36 +00:00
fa0355c18d Fix full_like decomposition to preserve strides (#158898)
Summary:
See original PR at: https://github.com/pytorch/pytorch/pull/144765, which landed internally but was reverted due to test failures. Addressing reviewer comments and trying again.

Rollback Plan:

Differential hack Revision: D78783627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158898
Approved by: https://github.com/eellison
2025-07-25 20:21:36 +00:00
28ee8be5bf [NativeRT] Apply Device placement once when loading the graph (#158996)
Summary:
Placement is leaked to too many classes!

In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement.

After applying placement, the in-memory graph, tensorMeta, weightMeta would already have the re-mapped device.
The subsequent weight loading, sample input loading, and target device inference then look up the re-mapped device from the graph's tensorMeta.

graph's tensorMeta becomes the only ground truth!

Test Plan:
Need to add some tests before landing.
This is a big change.

Rollback Plan:

Differential Revision: D78841818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996
Approved by: https://github.com/henryoier
2025-07-25 20:11:35 +00:00
ed472257d1 [associative_scan] stop manually set example inputs in dynamo (#159065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159065
Approved by: https://github.com/zou3519
ghstack dependencies: #159063, #159064
2025-07-25 20:08:08 +00:00
57eea56a9a [scan] stop manually set example inputs in dynamo (#159064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159064
Approved by: https://github.com/zou3519
ghstack dependencies: #159063
2025-07-25 20:08:08 +00:00
dd681f7f59 [while_loop] stop manually setting example inputs in dynamo (#159063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159063
Approved by: https://github.com/zou3519
2025-07-25 20:08:08 +00:00
0d4d3e8a89 [TCPStore] Allow ping to be retried (#159165)
On client setup we retry connections with server:

f8fafdc7a6/torch/csrc/distributed/c10d/TCPStore.cpp (L313-L350)

I noticed `ping()` raises `TORCH_INTERNAL_ASSERT`, i.e. a runtime error, rather than a `DistNetworkError`. This PR updates that so the ping can be retried as well.

We have seen this pop up internally:
- https://fb.workplace.com/groups/319878845696681/permalink/1478849733132914/
- https://fb.workplace.com/groups/319878845696681/permalink/1479368959747658/
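
Why the exception type matters for retries can be sketched as follows. The `DistNetworkError` class below is a local stand-in, and `retry_on_network_error` and `flaky_ping` are hypothetical names; the real retry loop lives in TCPStore.cpp and is not reproduced here.

```python
import time


class DistNetworkError(RuntimeError):
    """Local stand-in for the retryable network error type named in this description."""


def retry_on_network_error(operation, attempts=3, delay_s=0.0):
    """Call `operation`, retrying only when it raises DistNetworkError."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DistNetworkError:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay_s)


calls = {"n": 0}


def flaky_ping():
    """Hypothetical ping that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DistNetworkError("ping failed")
    return "pong"


print(retry_on_network_error(flaky_ping))
```

Any other exception type, such as a plain `RuntimeError` from an internal assert, propagates on the first failure and is never retried in this sketch.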

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159165
Approved by: https://github.com/d4l3k
2025-07-25 20:03:00 +00:00
ee4c5c7cd2 Add torchcheck for replication_pad3d_backward (#151986)
Fixes #142833

Add a check on the channel dimension, using the same logic as the CUDA implementation 78bbb468c6/aten/src/ATen/native/cuda/ReplicationPadding.cu (L347)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151986
Approved by: https://github.com/mikaylagawarecki
2025-07-25 19:48:51 +00:00
51cd6697cd Fix: Use memory_order_release instead of memory_order_relaxed (#159105)
Addresses #159074 by using `memory_order_release` instead of `memory_order_relaxed` here:

9c10760662/c10/core/DeviceType.cpp (L161)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159105
Approved by: https://github.com/colesbury
2025-07-25 19:39:04 +00:00
ba949c54a7 [inductor] fix test_save_graph_repro on Windows. (#159148)
The issue is caused by the Windows path separator acting as an escape character. Fixed by applying `normalize_path_separator` in the torch front-end codegen.
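
The failure mode can be reproduced in a few lines. The sketch assumes a made-up path and uses a local `to_forward_slashes` helper as a stand-in for the normalization step; it is not the `normalize_path_separator` implementation.

```python
def to_forward_slashes(path: str) -> str:
    """Stand-in for the normalization step: replace backslashes with '/'."""
    return path.replace("\\", "/")


# Hypothetical Windows path; "\t" and "\n" inside it look like escape sequences.
win_path = r"C:\tmp\new_graph.py"

# Naive codegen: paste the path into a string literal in generated source.
naive_source = f'path = "{win_path}"'
naive_ns = {}
exec(naive_source, naive_ns)

# Codegen with normalized separators.
fixed_source = f'path = "{to_forward_slashes(win_path)}"'
fixed_ns = {}
exec(fixed_source, fixed_ns)

print(repr(naive_ns["path"]))  # backslash sequences were interpreted as escapes
print(repr(fixed_ns["path"]))
```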

Error message:
<img width="855" height="542" alt="image" src="https://github.com/user-attachments/assets/ad08b521-05e6-4c93-9507-ad19c68ac7b5" />

Fixed:
<img width="855" height="312" alt="image" src="https://github.com/user-attachments/assets/4a0a142a-2dbe-4226-a4cb-8eacfab2c3fc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159148
Approved by: https://github.com/desertfire
2025-07-25 19:11:08 +00:00
2a528e80ce Add more type hints for _inductor/ir.py (#159049)
Fixes #146167

Incremental step to add type hints for _inductor/ir.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159049
Approved by: https://github.com/Skylion007
2025-07-25 18:56:30 +00:00
56c45f863b Add aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors (#158715)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158715
Approved by: https://github.com/fmassa, https://github.com/wconstab, https://github.com/xmfan
ghstack dependencies: #158624, #158708, #158734
2025-07-25 18:49:00 +00:00
d30f89b9b8 Add host protoc script back (#159157)
Following comment from https://github.com/pytorch/pytorch/pull/158475#issuecomment-3116518904

Also this is a fake issue as protoc is dead anyways: https://github.com/pytorch/pytorch/issues/159156

Also also, macOS cross compilation is not something that is tested :/ But I guess we're ok with that given how niche it is...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159157
Approved by: https://github.com/janeyx99
2025-07-25 18:44:20 +00:00
3fb78501f0 Revert "enable compiled autograd on CPU windows (#158432)"
This reverts commit a369350065493109d1abfbb994695777ab11bcf4.

Reverted https://github.com/pytorch/pytorch/pull/158432 on behalf of https://github.com/atalman due to Broke audio cuda windows builds see: https://github.com/pytorch/audio/issues/3992 ([comment](https://github.com/pytorch/pytorch/pull/158432#issuecomment-3119912177))
2025-07-25 18:29:16 +00:00
8a0508335f [export] Fix public bindings (#159109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159109
Approved by: https://github.com/jbschlosser
2025-07-25 18:18:52 +00:00
4c0d5ad4be Fix docstring for clip_grads_with_norm_ to reflect clamping behavior (#158200)
This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by:

grad = grad * (max_norm / (total_norm + eps))

However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users.

Updated the formula in the docstring to:

grad = grad * min(max_norm / (total_norm + eps), 1.0)

Added a note explaining the rationale for clamping (to prevent gradient amplification).
Ensured consistency with the behavior of clip_grad_norm_.
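
The corrected formula can be checked on plain Python floats. This is a sketch of the arithmetic only, not the `torch.nn.utils` implementation; `clip_coefficient`, `clip_with_norm` and the `eps` default are assumptions made for illustration.

```python
import math


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor from the corrected formula, clamped so it never exceeds 1."""
    return min(max_norm / (total_norm + eps), 1.0)


def clip_with_norm(grads, max_norm, total_norm, eps=1e-6):
    """Scale a flat list of gradient values given a precomputed total norm."""
    coef = clip_coefficient(total_norm, max_norm, eps)
    return [g * coef for g in grads]


# Norm 5.0 exceeds max_norm, so the gradients are scaled down.
large = clip_with_norm([3.0, 4.0], max_norm=1.0, total_norm=5.0)
# Norm 0.5 is below max_norm; the clamp keeps the coefficient at 1.0.
small = clip_with_norm([0.3, 0.4], max_norm=1.0, total_norm=0.5)
print(large, small)
```

Without the `min(..., 1.0)` clamp, the second call would multiply the gradients by roughly 2, which is the amplification the note warns about.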

Fixes #151554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200
Approved by: https://github.com/mikaylagawarecki
2025-07-25 18:07:41 +00:00
316c188a5e Remove torch.functional entries from the doc ignore list (#158581)
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.

This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
    * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
2025-07-25 17:19:01 +00:00
191eca0bf0 Use simple_wraps instead of functools.wraps in AOTAutograd (#158734)
Wrapping is load bearing for things that introspect argument signatures,
but use of functools.wraps to do this is undesirable as this overrides
the name/module of the wrapping function, which is bad for tracking down
exactly what code is actually being run at runtime.  simple_wraps is
like wraps but it doesn't override the name information, so you still
get an appropriate printout.  To see the stack of all functions wrapping
each other, there is now a helper fn_stack.
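
The difference can be seen with the standard library alone. `signature_only_wraps` below is a hypothetical stand-in written for this illustration, not the `simple_wraps` implementation; it relies on the documented behaviour that `functools.update_wrapper` always sets `__wrapped__`, which `inspect.signature` follows.

```python
import functools
import inspect


def signature_only_wraps(wrapped):
    """Hypothetical stand-in: expose `wrapped`'s signature, keep the wrapper's own name."""
    # Restricting `assigned` leaves __name__, __qualname__ and __module__ untouched,
    # while __wrapped__ is still set for signature introspection.
    return functools.wraps(wrapped, assigned=("__doc__",), updated=())


def target(a, b=2):
    return a + b


@functools.wraps(target)
def with_wraps(*args, **kwargs):
    return target(*args, **kwargs)


@signature_only_wraps(target)
def with_signature_only(*args, **kwargs):
    return target(*args, **kwargs)


print(with_wraps.__name__, with_signature_only.__name__)
print(inspect.signature(with_signature_only))
```

Both wrappers report the signature of `target`, but only the second keeps a name that identifies the code actually running.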

I also make some assertions tighter in the descriptor PR.  These didn't
catch any bugs but I figure might as well.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734
Approved by: https://github.com/wconstab
ghstack dependencies: #158624, #158708
2025-07-25 17:08:54 +00:00
74f64d3c84 Add inputs and outputs in Triton Kernel FX Graph segment (#158174)
Summary: Add inputs and outputs in Triton Kernel FX Graph segment

The FX graph segment in Triton kernel does not include the input tensors and return tensors, for example
Python code:
```
  @torchdynamo.optimize("inductor")
  def fn(a, b, c):
      x = torch.nn.functional.linear(a, b)
      x = x.sin()
      x = x.t() + c * 2
      return x
```

```
# %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {})
# %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})

```
The fix is to add the input and output tensors into FX graph segment

```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 2), kwargs = {})
# %add : "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# return %add
```

Differential Revision: D78131358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158174
Approved by: https://github.com/jansel
2025-07-25 17:01:17 +00:00
f8fafdc7a6 Revert "[BE] remove torch deploy - conditionals (#158288)"
This reverts commit ab26d4fbeb5bc4b4e6ef1c37fbec9fab6e5a9edd.

Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
c8316d0e79 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit 6ed2cb6ccd00e64f67fd414d42dff54393140c8f.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
a9f6770edd Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 9c68c4d08f4c4da49f0086b80e382f0cdd518f60.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
5620e617c9 Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)"
This reverts commit 255c0545e7eac2ec6d00a41a3fc9d6d8201f8f39.

Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
ee84ba42ea [Experiment] Run PT2 benchmark twice a day (#159162)
Running every 4 hours seems like too much; lower it to twice a day.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159162
Approved by: https://github.com/desertfire, https://github.com/eellison
2025-07-25 15:58:29 +00:00
561193e5f2 [CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)
3 procs were used for sm86, but we switched to sm89, so the check failed and it fell back to 2

sm90 is H100; idk what unittests we have running there, but I assume they also have a lot of memory

They use larger runners, which have more GPU memory, so it's usually ok.  I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)
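
One reading of that estimate, assuming the GPU memory is split evenly and each process pays about 1GB for its CUDA context (the function name and this interpretation are assumptions, not taken from the CI config):

```python
def usable_memory_per_proc_gb(total_gb, num_procs, cuda_context_gb=1.0):
    """Even split of GPU memory per process, minus that process's CUDA context."""
    return total_gb / num_procs - cuda_context_gb


for procs in (2, 3):
    print(procs, "procs:", round(usable_memory_per_proc_gb(22, procs), 1), "GB each")
```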

I've applied skips to the ones that OOMed

Time decreases from ~2.7hr per test job -> ~2hr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
2025-07-25 15:26:29 +00:00
9535995bbc Revert "Remove tensorexpr tests (#158928)"
This reverts commit a0bc865123dba047aa1507e281bf2462780cf271.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to broke cpp static runtime test? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16517697273/job/46715871457) [HUD commit link](a0bc865123) ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3118554478))
2025-07-25 15:22:51 +00:00
6fcb2b4413 [dynamo] unimplemented -> unimplemented_v2 for user_defined.py (#156652)
For https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156652
Approved by: https://github.com/zou3519

Co-authored-by: Sidharth <ssubbarao8@meta.com>
2025-07-25 15:04:17 +00:00
204eb4da5e Add expanded_def option for FX printing, render descriptor, update tests (#158708)
----

- First, we add a new expanded_def to FX, which will expand the
  definitions of variables into multiple lines, one per variable
  definition.  This makes extremely long args/return lists much
  more readable.

- Next, we extend this mechanism to also print out descriptors on
  placeholders and return values, as comments, if available.  This
  is how we will test descriptors.

- We update tlparse for AOTAutograd to use this format.

- We update expect tests to use this format and update their formats,
  so you can inspect what it looks like.  There may be other tests
  I should update, open to suggestions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708
Approved by: https://github.com/wconstab
ghstack dependencies: #158624
2025-07-25 13:22:32 +00:00
bf311141d6 Track descriptors for all inputs/outputs of AOTAutograd traced graph (#158624)
One of the recurring challenges of working with FX graphs produced by
AOTAutograd is that there is a very intricate input/output calling
convention that is essentially impossible to understand without actually
reverse engineering the AOTAutograd code.  It is so bad that there
is a bit of logic for stashing indices of relevant arguments/outputs
in TracingContext so Inductor can figure out what the correct arguments
are.

This PR introduces the necessary scaffolding to keep track of
"descriptors" of every input/output to a (joint) FX graph produced
by AOTAutograd.  First read through descriptors.py to get a sense for
what is available: for inputs, you can figure out if you have
a plain input, tangent, parameter, or something more exotic like
one of the fields of a subclass or view base.  For outputs, you can
determine if you have a plain output or grad, or something more exotic
like the contents of a mutated input or an intermediate base of several
views that were returned.

There are two distinct parts of this patch: AOTInput tracking, and
AOTOutput tracking.

**AOTInput tracking.**  The way this works is that AOTAutograd starts off
with some Tensor `flat_args` that are the inputs to the graph being
traced, and then updates these arguments as it modifies the input
calling convention.  Anywhere these `args` are passed around, we now add a
new argument `args_descs` which is updated in synchrony with `args`.  Add
a new arg?  Add a new AOTInput to `args_descs`.
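
The "updated in synchrony" invariant can be sketched with two parallel lists. The descriptor classes and `add_extra_arg` below are hypothetical stand-ins invented for this example; the real descriptor types are the ones in descriptors.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainInputDesc:
    """Hypothetical descriptor: the i-th original user input."""
    index: int


@dataclass(frozen=True)
class ExtraInputDesc:
    """Hypothetical descriptor: an argument added by a calling-convention change."""
    reason: str


def add_extra_arg(args, args_descs, value, reason):
    """Return new (args, args_descs) with one argument and its descriptor appended."""
    assert len(args) == len(args_descs), "args and args_descs must stay in sync"
    return [*args, value], [*args_descs, ExtraInputDesc(reason)]


flat_args = ["x", "w"]
flat_descs = [PlainInputDesc(i) for i in range(len(flat_args))]
flat_args, flat_descs = add_extra_arg(flat_args, flat_descs, "seed", "rng state")
for arg, desc in zip(flat_args, flat_descs):
    print(arg, desc)
```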

**AOTOutput tracking.**  Originally, I wanted to also add an `outs_descs`
analogous to `args_descs` tracking output metadata.  However, it is
often difficult to compute what the output will be until you're actually
tracing the function for real (and are able to peek at the real
outputs).  So we only compute `outs_descs` when we actually trace.  To do
this, we change the calling convention of the function we trace to
return not just outputs, but a tuple of `outs` and `outs_descs`.  Before
we bottom out at the `make_fx` invocation, we save `outs_descs` to a
nonlocal and bottom out.

To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624
Approved by: https://github.com/xmfan
2025-07-25 13:22:32 +00:00
92e93bb580 [inductor][cpu] Stop lowering div to reciprocal multiplication to preserve precision when the divisor is a scalar and device is on cpu (#158231)
## Fixes https://github.com/pytorch/pytorch/issues/157959
## mini repro from issue
```python
import torch
from torch import nn

class Foo(nn.Module):

    def __init__(
        self,
        use_parameter: bool
    ) -> None:
        super().__init__()
        self.b = 101
        if use_parameter:
            self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # return x + self.b
        # return x - self.b
        return x / self.b
        # return x * self.b

torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)

models = [
    Foo(False),
    Foo(True),
    torch.compile(Foo(False), fullgraph=True),
    torch.compile(Foo(True), fullgraph=True),
]

for m in models:
    print((m(x) - expected).sum())
```

All outputs equal zero except the result of `torch.compile(Foo(False), fullgraph=True)`.

## summary:
When the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal.
This can lose precision in the C++ kernel, but not in the Triton kernel.
## why:
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>

int main() {
 auto tmp1 = static_cast<float>(0.009900990099009901);
 auto tmp2 = static_cast<double>(0.009900990099009901);
 std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
 std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
    return 0;
}
```
The output is:

```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
`auto tmp1 = static_cast<float>(0.009900990099009901);` rounds tmp1 to about 0.0099009899, which loses precision, so the final result does not match the expected value.
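The effect can also be reproduced without compiling a kernel. The following is a minimal sketch in plain Python, assuming float32 arithmetic can be emulated by rounding through `struct`; the helper name `f32` is hypothetical and not part of Inductor. It compares emulated float32 division by 101 against multiplication by the float32-rounded reciprocal.

```python
import random
import struct


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


DIVISOR = 101.0
# The reciprocal as the C++ kernel sees it: rounded to float32.
recip32 = f32(1.0 / DIVISOR)

random.seed(42)
inputs = [f32(random.random()) for _ in range(1000)]

# Emulated float32 division vs. multiplication by the rounded reciprocal.
by_div = [f32(x / DIVISOR) for x in inputs]
by_mul = [f32(x * recip32) for x in inputs]

mismatches = sum(d != m for d, m in zip(by_div, by_mul))
print(f"float32 reciprocal: {recip32!r}")
print(f"{mismatches} of {len(inputs)} results differ")
```

The rounded reciprocal is off by a relative error of roughly 1.6e-8, which is enough to push some products across a float32 rounding boundary.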
I also found that the bug originates at this location:
86d8af6a6c/torch/_inductor/lowering.py (L6238)

The commit states that the precision loss is expected in the CUDA implementation.
original commit
03439d4c1c
cuda implementation
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)

Interestingly, the Triton kernel produces the correct result, apparently because of the precision of Python's float type.
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 0.009900990099009901
    tmp2 = tmp0 * tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
2025-07-25 08:57:17 +00:00
cyy
a0bc865123 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD
2025-07-25 08:37:51 +00:00
aaa384b2d4 move view_meta to fake impl (#158406)
The Python dispatcher is not always enabled in fake tensors and has to be called explicitly.
It should be, but enabling it everywhere requires some work to get all tests passing.

I have run into several issues where I had to add enable_python_dispatcher (e.g. XLA, Helom, etc.).
To avoid those issues for view specifically, I moved it to the fake tensor impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
2025-07-25 08:21:27 +00:00
0fd5f1c294 [ROCm][CI] upgrade wheels to 6.4.2 patch release (#158886)
Upgrade wheels to ROCm 6.4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158886
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 08:11:41 +00:00
e38a2b3d0f [inductor] add missing ignore_errors parameter for Windows. (#159025)
The original code comments:
```python
# Let's not fail if we can't clean up the temp dir. Also note that for
# Windows, we can't delete the loaded modules because the module binaries
# are open.
```
But the `ignore_errors` parameter is missing for Windows. This PR adds it.
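As an illustration of what the parameter buys, here is a minimal sketch assuming the cleanup goes through `shutil.rmtree`; the commit message does not say which call gained the parameter, and `remove_build_dir` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def remove_build_dir(path: str) -> None:
    """Best-effort cleanup: never raise if files are missing or still locked."""
    # ignore_errors=True makes rmtree swallow OSError, e.g. the PermissionError
    # Windows raises for a loaded module binary that is still open.
    shutil.rmtree(path, ignore_errors=True)


build_dir = tempfile.mkdtemp(prefix="inductor_sketch_")
with open(os.path.join(build_dir, "kernel.txt"), "w") as f:
    f.write("placeholder")

remove_build_dir(build_dir)  # removes the directory
remove_build_dir(build_dir)  # already gone: silently does nothing
print(os.path.exists(build_dir))
```

Without `ignore_errors=True`, the second call would raise `FileNotFoundError`, and on Windows the first call could fail on files that are still open.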

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025
Approved by: https://github.com/jansel
2025-07-25 07:58:22 +00:00
ae183d6092 Aten vector default constructors set to 0, add fnmadd and fnmsub (#158508)
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158508
Approved by: https://github.com/swolchok
2025-07-25 06:55:37 +00:00
659f8fb115 [dynamo][guards] Add some relational guard helpers (#159077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159077
Approved by: https://github.com/jansel
ghstack dependencies: #158995
2025-07-25 06:28:10 +00:00
05a748d287 [dynamo][guards] Expand is_immutable_object to have None (#158995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158995
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-07-25 06:12:05 +00:00
02ca965560 Device agnostic for DCP (#158337)
Enable device-agnostic implementation of DCP-related functionality, allowing the new DCP features to be supported on XPU as well.
Rename `use_cuda_non_blocking_copy` to `use_non_blocking_copy`, because non-blocking copy is supported by most GPUs and is not exclusive to CUDA devices.

Test plan: test cases have not yet been updated to be fully device agnostic; this will be addressed in future work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158337
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/Saiteja64

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-25 05:24:09 +00:00
511d987378 only call re-plan if historic max's were updated. (#159016)
Summary: Re-planning on every call is wasteful. Only update the plan if a new maximum has been found.
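The idea can be sketched as follows. This is a hypothetical illustration, not the code touched by the PR: `MaxTracker`, `observe`, and `replan` are assumed names, and the "plan" is reduced to a counter.

```python
class MaxTracker:
    """Hypothetical tracker: remembers the largest value seen per key."""

    def __init__(self) -> None:
        self.historic_max: dict[str, int] = {}
        self.replans = 0

    def observe(self, sizes: dict[str, int]) -> bool:
        """Record sizes; return True only if some historic max grew."""
        updated = False
        for key, value in sizes.items():
            if value > self.historic_max.get(key, -1):
                self.historic_max[key] = value
                updated = True
        return updated

    def step(self, sizes: dict[str, int]) -> None:
        # Re-planning is the expensive part, so skip it when nothing grew.
        if self.observe(sizes):
            self.replan()

    def replan(self) -> None:
        self.replans += 1  # stand-in for the real planning work


tracker = MaxTracker()
for sizes in [{"a": 4}, {"a": 2}, {"a": 4, "b": 1}, {"a": 8}]:
    tracker.step(sizes)
print(tracker.replans)  # 3: the second step found no new maximum
```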

Test Plan:
ci

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D78859344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159016
Approved by: https://github.com/SherlockNoMad
2025-07-25 05:07:30 +00:00
9685fc36d4 Add missing optional for tensor ops (#159028)
## Test Result

<img width="872" height="340" alt="image" src="https://github.com/user-attachments/assets/20c3f1a2-0160-4ea3-b9f3-14630b4ec06d" />
<img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/68f8d8da-0570-4ae8-8e45-573b2c64cae5" />
<img width="906" height="429" alt="image" src="https://github.com/user-attachments/assets/42d133f6-94eb-4a38-8b4b-5586f52bff88" />
<img width="878" height="285" alt="image" src="https://github.com/user-attachments/assets/d3ad8950-81fa-4c4c-a5b5-621b0d9df99b" />

<img width="889" height="430" alt="image" src="https://github.com/user-attachments/assets/9aabeaff-bb8f-4990-b253-1bb053e72aca" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159028
Approved by: https://github.com/Skylion007
2025-07-25 04:36:55 +00:00
9e5cfd3ee5 [audio hash update] update the pinned audio hash (#159108)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159108
Approved by: https://github.com/pytorchbot
2025-07-25 04:35:21 +00:00
cdf8e9ec1a [MPS] Add support for unsigned types (#159094)
As both Metal and MPS support uint16/uint32 and uint64

Test plan: `python3 -c "import torch;print(torch.randint(55, 66, (16,), device='mps', dtype=torch.uint16)[10:])"`

Fixes https://github.com/pytorch/pytorch/issues/159076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159094
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-07-25 04:31:42 +00:00
bcf34d24eb [vllm hash update] update the pinned vllm hash (#159107)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159107
Approved by: https://github.com/pytorchbot
2025-07-25 04:03:39 +00:00
9b29166f57 [ROCm] add flag torch.backends.miopen.immediate (#158951)
The MIOpen integration has changed over the years.  In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode.  But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True.  This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode.  Thus, immediate mode needs its own flag.  Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags.  This PR adds the first miopen-only flag to control its immediate mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 04:01:51 +00:00
1fced0c7d5 [ROCm] enable hipblaslt on gfx908 for ROCm >= 6.3 (#159092)
Fixes #159030.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159092
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 03:54:30 +00:00
16c0ccd669 [ROCm][CI] upgrade to 6.4.2 patch release (#158887)
Upgrade to ROCm 6.4.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158887
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 03:45:44 +00:00
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
f903bc475c [BE] add noqa for flake8 rule B036: found except BaseException without re-raising (#159043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159043
Approved by: https://github.com/Skylion007
2025-07-25 02:56:34 +00:00
4261e26a8b [OpenReg] move fallback tests into test_openreg.py (#158441)
----

- move fallback tests into test_openreg
- remove the test_cpp_extensions_open_device_registration.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158441
Approved by: https://github.com/albanD
ghstack dependencies: #158415, #158440
2025-07-25 02:39:41 +00:00
b635359e4c [OpenReg] add pyproject.toml for openreg (#158440)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158440
Approved by: https://github.com/albanD
ghstack dependencies: #158415
2025-07-25 02:39:41 +00:00
f1a1aa9490 [OpenReg] Improve README.md and optimize some codes for OpenReg (#158415)
----

- add description for DSO dependencies
- remove unnecessary code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158415
Approved by: https://github.com/albanD
2025-07-25 02:39:41 +00:00
6fc0ad22f0 Using the latest torch.library.register_fake API instead of torch.library.impl_abstract (#158839)
As the title stated.

`torch.library.impl_abstract` has been deprecated in PyTorch 2.4, so switch to the new API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158839
Approved by: https://github.com/jingsh, https://github.com/zou3519
ghstack dependencies: #158838
2025-07-25 02:37:30 +00:00
c60d382870 Add tests for torch.ops.load_library (#158838)
According to this [comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3097899129), adding a related test to keep BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158838
Approved by: https://github.com/zou3519
2025-07-25 02:37:30 +00:00
64cb349b81 Extract a method that filters frames in the captured stack trace (#158266)
Summary: The subclass can override the filtering logic to customize which frames to keep or drop.
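The pattern looks roughly like the sketch below. The class and method names (`StackCapture`, `_keep_frame`) are hypothetical and chosen for illustration; the PR's actual class and hook are not named in this message.

```python
import traceback


class StackCapture:
    """Hypothetical recorder that captures the current Python stack."""

    def capture(self) -> list[traceback.FrameSummary]:
        frames = list(traceback.extract_stack())
        return [f for f in frames if self._keep_frame(f)]

    def _keep_frame(self, frame: traceback.FrameSummary) -> bool:
        # Default policy: keep every frame.
        return True


class UserCodeOnlyCapture(StackCapture):
    """Subclass that drops the recorder's own bookkeeping frames."""

    def _keep_frame(self, frame: traceback.FrameSummary) -> bool:
        return frame.name not in {"capture", "_keep_frame"}


def user_function():
    return UserCodeOnlyCapture().capture()


names = [f.name for f in user_function()]
print(names[-1])  # innermost surviving frame is the caller, user_function
```

Keeping the filter in one overridable method means subclasses change the policy without re-implementing the capture itself.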

Test Plan:
```
buck run caffe2/test:test_export -- -r  test_stack_trace
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r test_custom_obj_list_out
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx  -- -r class_member_back_compat
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266
Approved by: https://github.com/ezyang, https://github.com/yushangdi
2025-07-25 02:22:03 +00:00
a53db90e21 Revert "[inductor] consolidate common GEMM triton param retrieval (#158015)"
This reverts commit 9faef3d17c2e422d5d62f62b266155e2deb52c40.

Reverted https://github.com/pytorch/pytorch/pull/158015 on behalf of https://github.com/henrylhtsang due to breaking tests ([comment](https://github.com/pytorch/pytorch/pull/158015#issuecomment-3115384824))
2025-07-25 00:16:50 +00:00
9c10760662 [SymmMem] Use host/nvshmem_api.h for backward compat (#159061)
Resolves #159045

`nvshmem_host.h` was introduced in 3.3.9.
Use `host/nvshmem_api.h` and `host/nvshmemx_api.h` for prior versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159061
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/fegin
2025-07-24 22:56:26 +00:00
8d2a1d6e18 Revert "Graph break with error message (#158800)"
This reverts commit cae4746952afbb6d26ecf7599cb7c6c449c69ef4.

Reverted https://github.com/pytorch/pytorch/pull/158800 on behalf of https://github.com/clee2000 due to broke some tests on main inductor/test_distributed_patterns.py::DistributedPatternTests::test_nn_param_return4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/16507837934/job/46685704688) [HUD commit link](cae4746952), note to self: bad TD, but also dynamo/test_repros failed but didn't get skipped by TD so maybe a landrace, or I just blaming the wrong commit entirely.. ([comment](https://github.com/pytorch/pytorch/pull/158800#issuecomment-3115224608))
2025-07-24 22:45:58 +00:00
751285cb22 Revert "Move some of vec into headeronly in preparation for Half.h (#158976)"
This reverts commit 5564f2ca2e0836d75c4ee45899b1b981582c3e2d.

Reverted https://github.com/pytorch/pytorch/pull/158976 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78924504 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158976#issuecomment-3115198443))
2025-07-24 22:31:49 +00:00
efc810c7d0 [Bugfix] Fix circular import between export and dynamo from tensor fn map (#158931)
Fixes #158120

The issue was caused by populating a builtin tensor fn map at import time; if torch.export.export was called before any dynamo imports with the `meta` device, this map would not be populated, and so would populate on import time which would try to call `torch.disable`, which would not yet be initialized

The fix is to populate this map lazily.
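A minimal sketch of lazy population, using `functools.cache` as one possible mechanism (the PR may implement it differently). `expensive_registration` and `get_fn_map` are hypothetical stand-ins for the real map builder.

```python
import functools

calls = []


def expensive_registration() -> dict[str, str]:
    """Stand-in for work that needs other modules to be fully imported."""
    calls.append("built")
    return {"add": "tensor_add", "mul": "tensor_mul"}


# The eager pattern would run at import time and can hit a partially
# initialized module:
#   FN_MAP = expensive_registration()


@functools.cache
def get_fn_map() -> dict[str, str]:
    """Build the map on first use, then reuse the cached result."""
    return expensive_registration()


print(len(calls))  # 0: nothing is built when the module is defined
print(get_fn_map()["add"])
print(get_fn_map() is get_fn_map())  # True: built once, then cached
```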

```
python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305
2025-07-24 22:24:57 +00:00
abb0bf45df [AOTI] skip crashed case on Windows temporary. (#158929)
Temporarily skip the crashing case on Windows.

This case crashes the application:
<img width="1053" height="275" alt="image" src="https://github.com/user-attachments/assets/3225e9c8-cbe7-4998-86da-f20fbb12ead2" />

Quick analysis:
<img width="1400" height="261" alt="image" src="https://github.com/user-attachments/assets/9c21fefc-9ed8-40f2-84c5-edde2004777c" />

1. It crashes in OpenMP.
2. The stack is damaged; we need to consider how to debug it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158929
Approved by: https://github.com/desertfire
2025-07-24 22:08:19 +00:00
b533f12120 Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)"
This reverts commit da94023b0205bf98c3da366f2f86e0a443f4db17.

Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @sraikund16 can you please help validate the fix? (See D78845227 for details). You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3115072504))
2025-07-24 21:46:00 +00:00
e20736bf1d Don't GC as often when collecting cudagraphs (#158193)
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s

Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.

We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to not collect more
than every 10s so if we're capturing in a loop we don't burn all our cycles
looking for garbage.
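The throttling idea can be sketched like this. It is a simulation under stated assumptions, not the code in `torch.cuda.graphs`: `ThrottledGC` is a hypothetical class, the clock is injected so the example runs instantly, and the 4 ms spacing between captures is an arbitrary choice.

```python
import gc
import time


class ThrottledGC:
    """Run gc.collect() at most once per `min_interval` seconds."""

    def __init__(self, min_interval: float = 10.0, clock=time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self.last_collect = None  # no collection has happened yet
        self.collections = 0

    def maybe_collect(self) -> bool:
        now = self.clock()
        if self.last_collect is not None and now - self.last_collect < self.min_interval:
            return False  # collected recently; skip the expensive pass
        gc.collect()
        self.last_collect = now
        self.collections += 1
        return True


# Simulate 5427 captures spaced 4 ms apart using a fake clock.
fake_now = [0.0]
throttle = ThrottledGC(min_interval=10.0, clock=lambda: fake_now[0])
for _ in range(5427):
    throttle.maybe_collect()
    fake_now[0] += 0.004
print(throttle.collections)
```

Over the roughly 21.7 simulated seconds the collector runs only a handful of times instead of once per capture.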

(These numbers have a lot of variance from run to run but give the correct
general scale)
```
       | calls | total | synchronize |  gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before |  5427 |   78s |       1.48s | 5427 |  53.22s |       1.21s |    145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after  |  5427 |   24s |          0s |    3 |   1.53s |       0.84s |       592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
  calls - number of times torch.cuda.graphs.graph.__enter__ was called
  synchronize - this is the duration taken by the cuda.synchronize call
  gcs - number of times gc.collect was called
  collect - this is the duration taken by the gc.collect call
  empty cache - this is the duration taken by the torch.cuda.empty_cache call
  sys freed - the number of bytes reported freed by gc.collect
  cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved

So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.

Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
2025-07-24 21:37:11 +00:00
cae4746952 Graph break with error message (#158800)
Fixes #157452

Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```

### Release Notes

Change to nn.Parameter Constructor Behavior in Dynamo

The nn.Parameter constructor's behavior under Dynamo has changed. Previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are advised to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. If that is not possible, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
2025-07-24 21:05:17 +00:00
4a13d4d7d0 [ROCm] Update jit_utils.cpp for compatibility with ROCm7.0 (#158868)
Resolves error when running tests such as `test_nn.py::TestNN::test_L1Loss_no_reduce_complex_cuda` etc. on ROCm7.0:

```
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1016:7: error: no template named 'is_floating_point'; did you mean '__hip_internal::is_floating_point'?
 1016 |       is_floating_point<_Tp>::value,
      |       ^~~~~~~~~~~~~~~~~
      |       __hip_internal::is_floating_point
/tmp/comgr-4cd8ad/include/hiprtc_runtime.h:1481:31: note: '__hip_internal::is_floating_point' declared here
 1481 | template<typename _Tp> struct is_floating_point : public false_type {};
      |                               ^
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:1017:16: error: too few template arguments for class template '__libcpp_complex_overload_traits'
 1017 |       typename __libcpp_complex_overload_traits<_Tp>::_ComplexType
      |                ^
/tmp/comgr-4cd8ad/input/CompileSourceU53Ndb:850:10: note: template is declared here
  847 |   template <class _Tp, bool = is_integral<_Tp>::value,
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  848 |                        bool = is_floating_point<_Tp>::value
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  849 |                        >
      |                        ~
  850 |   struct __libcpp_complex_overload_traits {};
      |          ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx90a.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158868
Approved by: https://github.com/jeffdaily
2025-07-24 21:00:37 +00:00
da35562bba [ONNX] Filter out torchscript sentences (#158850)
Fixes #157300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158850
Approved by: https://github.com/justinchuby, https://github.com/svekars
2025-07-24 20:59:06 +00:00
de85ee73ae Update context in unimplemented_v2 when exception bubbles up to the interpreter (#158924)
Before:
```
.Observed exception
  Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
  Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
  Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.

  Developer debug context:
```

After:
```
Observed exception
  Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
  Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
  Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.

  Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924
Approved by: https://github.com/williamwen42, https://github.com/zou3519
2025-07-24 20:50:22 +00:00
eqy
8573a2beda [CUDA] Fix missing __syncthreads in MultiMarginLoss backward (#158994)
Turns out issue in #158921 is detectable with a simple unit test and adding the missing sync fixes it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158994
Approved by: https://github.com/malfet, https://github.com/Skylion007

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-07-24 20:47:29 +00:00
13398dab79 Revert "Remove tensorexpr tests (#158928)"
This reverts commit a3f9f79f591102afa93145bb67dc7e34df44f9a4.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/clee2000 due to Theres still some references to the things removed in this PR in test.sh, the jobs on this PR are failing because of that but log classifier is probably pointing to a wrong line, should be an easy fix tho ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3114873706))
2025-07-24 20:45:30 +00:00
5564f2ca2e Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-24 20:32:33 +00:00
f3edcac23a [dynamo] Added back weblink generation (#159011)
Added back weblink generation for v2.9 development

Note: It is fine to bring the weblink generation back since v2.9 isn't released for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159011
Approved by: https://github.com/williamwen42
2025-07-24 20:27:11 +00:00
90c241dedd [precompile] Support user defined function calls from bytecode. (#158947)
Previously precompile was implemented under the assumption that dynamo always inlines user code and generates resume functions when a graph break is hit. In cases like nanogpt training, there is a nontrivial amount of code that causes dynamo to fail speculation and stop inlining certain types of user functions. This results in more code objects that must be tracked by CompilePackage.

Since these new code objects are user defined, we also need to serialize the location of this code so that we can load the precompile entries onto these code objects in another process.

With this fix, we are able to run nanogpt inference+training with precompile under torchbench.

Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947
Approved by: https://github.com/jamesjwu
2025-07-24 20:10:57 +00:00
5ab0eb28f7 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. Calling into them is fairly easy; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-24 20:10:51 +00:00
0b2ef76e85 DDE-Free select with unbacked index. (#157605)
When select has a data-dependent index, we can't tell whether the actual index should be index+size or index.
To avoid throwing a DDE (data-dependent error), we allocate a new unbacked symbol to represent the storage offset of the
output view and compute its value dynamically at runtime in the Inductor lowering.
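For concreteness, the runtime computation amounts to something like the sketch below. It is plain Python rather than Inductor code, and `select_storage_offset` is a hypothetical helper; it shows why the offset depends on the sign of the index, which is unknown at trace time when the index is unbacked.

```python
def select_storage_offset(
    base_offset: int, stride: int, size: int, index: int
) -> int:
    """Storage offset of `select(dim, index)` along a dim with `size`/`stride`."""
    if not -size <= index < size:
        raise IndexError(f"index {index} out of range for size {size}")
    # Negative indices wrap around; this branch is the data-dependent choice.
    actual = index + size if index < 0 else index
    return base_offset + actual * stride


# A contiguous 4x5 tensor has stride 5 along dim 0.
print(select_storage_offset(0, 5, 4, 1))   # 5: start of row 1
print(select_storage_offset(0, 5, 4, -1))  # 15: start of the last row
```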

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
2025-07-24 20:08:05 +00:00
9faef3d17c [inductor] consolidate common GEMM triton param retrieval (#158015)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out common logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D78081314](https://our.internmc.facebook.com/intern/diff/D78081314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158015
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
2025-07-24 19:17:48 +00:00
aeaa20083f [profiler] update CUDA runtime kernel identification logic (#157890)
Update CUDA kernel detection to exclude memory API calls

References:
- https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
- https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157890
Approved by: https://github.com/sraikund16
2025-07-24 19:14:08 +00:00
5be7e187ba Support sort and scatter_add strategy (#159022)
Add `sort` and `scatter_add` strategies. I am reusing the strategy for `scatter`-related ops for quick support. The strategy can potentially be improved after we fix index-related strategies.

Minor fix: fix `replicate_op_strategy` to support multiple output tensors, which is required by aten.sort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022
Approved by: https://github.com/XilunWu, https://github.com/wconstab
2025-07-24 18:33:18 +00:00
347a97da66 [MPS] Enable dlpack integration (#158888)
Though testing is a lie and dependent on https://github.com/pytorch/pytorch/pull/153835

Fixes https://github.com/pytorch/pytorch/issues/153789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158888
Approved by: https://github.com/albanD
ghstack dependencies: #158874
2025-07-24 18:05:41 +00:00
78aa3bd6b6 Added Emscripten __assert_fail declaration to Macros.h (#158580)
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D78500790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
2025-07-24 17:10:29 +00:00
ee97dbf2e7 [ROCm][CI] update HIP patch for 6.4.1, again (#159001)
Another fix for hipGraph capture of MIOpen OCL kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159001
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-24 16:36:19 +00:00
b7d41729e0 Add zerotensor design description in code (#158837)
Resolve `TODO: add a note explaining the design decisions` by adding the design description from https://github.com/pytorch/pytorch/issues/69687 to the codebase, making it easier for other developers to find and read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158837
Approved by: https://github.com/soulitzer
2025-07-24 16:35:42 +00:00
abcb24f4de [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
2025-07-24 15:55:18 +00:00
fd48681b6a [DeviceMesh][ez] Make the logic within flatten simpler (#158999)
While looking at the device mesh code, I found that this logic can be simplified. The naming also needs correcting: because this mesh is not "flattened" yet, we can just call it flatten.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158999
Approved by: https://github.com/wz337, https://github.com/wconstab
ghstack dependencies: #158900
2025-07-24 15:40:13 +00:00
cyy
a3f9f79f59 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD
2025-07-24 15:38:36 +00:00
2fc0b1605e [a2av] Make test input more random (#157029)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Use torch.randn to fill input buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157029
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881, #157026
2025-07-24 15:35:12 +00:00
11ea3736dd Revert "[CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)"
This reverts commit 0c0fcb53ff5ee1eb5f0d1f535ed3726d01f8abb5.

Reverted https://github.com/pytorch/pytorch/pull/158691 on behalf of https://github.com/ZainRizvi due to Sorry but these are causing jobs to fail with out of memory errors on trunk ([comment](https://github.com/pytorch/pytorch/pull/158691#issuecomment-3113922186))
2025-07-24 15:31:53 +00:00
43d4ff6851 [a2av] Test dispatch-then-combine (#157026)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Putting both the dispatch API and combine API in battlefield, one following the other, i.e.
```
all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...)

all_to_all_vdev_2d_offset(
    input=out,
    out=combine_out,
    in_splits_offsets=out_splits_offsets,
    out_splits_offsets=combine_out_splits_offsets
)
```
Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine.

Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits.

It works!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026
Approved by: https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881
2025-07-24 15:21:02 +00:00
83957d1c03 [a2av] Add token combine operator (#156881)
Added `all_to_all_vdev_2d_offset`, which:

Perform a 2D AllToAllv operation, with input split and offset
information provided on device. The input offsets need not be the
exact prefix sum of the input splits, i.e. padding is allowed between the
split chunks. The padding, however, will not be transferred to peer
ranks.
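
To illustrate the offsets-with-padding layout described above, here is a minimal plain-Python sketch. It does not call the PyTorch operator; `gather_chunks`, the buffer contents, and the split/offset values are hypothetical and only model which elements would be sent.

```python
# Hypothetical helper, not part of the PyTorch API: returns the chunks that
# would be transferred, given per-chunk lengths (splits) and start positions
# (offsets) into a flat buffer.
def gather_chunks(buf, splits, offsets):
    return [buf[off:off + n] for n, off in zip(splits, offsets)]


# Zeros stand for padding between chunks (illustrative values only).
buf = [1, 2, 0, 0, 3, 4, 5, 0, 6]
splits = [2, 3, 1]
# An exact prefix sum of the splits would be [0, 2, 5]; these offsets leave gaps.
offsets = [0, 4, 8]

chunks = gather_chunks(buf, splits, offsets)
# Only the elements covered by (offset, split) pairs are selected; the padding
# in the gaps is skipped.
print(chunks)
```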

In Mixture of Experts models, this operation can be used to combine tokens
processed by experts on remote ranks. This operation can be viewed as a
"reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles
tokens to experts).

The change may seem a bit dense, sorry.  But it is mainly two changes:
1. templating existing device functions (to use provided input offset or calculate it)
2. generalizing variable names, e.g. npes, ne --> minor_size, major_size,
so that I can use the same alltoall function for matrix of (nranks, ne) as well as matrix of (ne, nranks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743
2025-07-24 15:08:04 +00:00
48fe4ff247 [export] set enable_gqa in export flash->math decomp (#158604)
Differential Revision: D78524147

For `scaled_dot_product_attention(..., enable_gqa=True)`:
- the Math backend passes the flag through, performing the extra [KV broadcast](6e07d6a0ff/aten/src/ATen/native/transformers/attention.cpp (L902)) if set to True
- the Flash backend has no flag, and relies on correct indexing in the C++ kernel
- Export used to default to Math for `enable_gqa=True`, but https://github.com/pytorch/pytorch/pull/157893 landed and enabled Flash. At the same time, there's an export-only [decomp](6e07d6a0ff/torch/_decomp/decompositions.py (L4968)) redirecting flash -> math, calling with `enable_gqa` unset, because that info isn't available. This led to https://fb.workplace.com/groups/1028545332188949/posts/1264609398582540 crashing, calling the Math non-GQA variant, with GQA inputs.

This assumes GQA for seqlen mismatches in the export decomp, setting `enable_gqa = <q seqlen> != <kv seqlen>`, relying on prior backend checks to raise on invalid input shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158604
Approved by: https://github.com/angelayi, https://github.com/drisspg
2025-07-24 14:46:13 +00:00
f55c5d085e [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (e.g. autotuning artifacts) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-24 14:09:54 +00:00
a3025e17b2 Fix inductor non-stable argsort/sort test (#146622)
- Prevent the inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from pytorch but is still a valid argsort output.
- Add functionality to allow alternative assert_equal functions in inductor tests for future cases.
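
As a sketch of what "a valid argsort output" can mean when `stable=False`, the check below accepts any index permutation that sorts the values, rather than requiring an exact match with a reference result. `is_valid_argsort` is a hypothetical helper written for illustration; it is not the assert function added in this PR.

```python
# Hypothetical checker, for illustration only: an index list is a valid
# argsort if it is a permutation of range(len(values)) and gathering the
# values by those indices yields a non-decreasing sequence.
def is_valid_argsort(values, indices):
    if sorted(indices) != list(range(len(values))):
        return False
    gathered = [values[i] for i in indices]
    return all(a <= b for a, b in zip(gathered, gathered[1:]))


values = [3, 1, 1, 2]
# Both orderings of the tied elements (positions 1 and 2) are acceptable.
stable_result = [1, 2, 3, 0]
unstable_result = [2, 1, 3, 0]
print(is_valid_argsort(values, stable_result), is_valid_argsort(values, unstable_result))
```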

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622
Approved by: https://github.com/eellison

Co-authored-by: George Wigley <georgewi@graphcore.ai>
2025-07-24 14:02:12 +00:00
afd6eb0d49 [docker release] Remove build layer as not used (#158988)
[docker release] Remove the build layer, as it is not used in any of the Build Official jobs: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158988
Approved by: https://github.com/oulgen, https://github.com/malfet
2025-07-24 12:22:55 +00:00
3ced1079a4 [inductor] Fix collectives_reordering overwrite real_dep with fake_dep with the same name (#158960)
Differential Revision: [D78839734](https://our.internmc.facebook.com/intern/diff/D78839734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158960
Approved by: https://github.com/wconstab
2025-07-24 11:08:58 +00:00
3e954d3943 better testing for subclasses + compile (#158742)
Fixes #114398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158742
Approved by: https://github.com/ezyang
2025-07-24 10:28:44 +00:00
fb067de550 [NativeRT] Remove device_ member from OpKernel base class (#158944)
Summary:
In general, device_ is not very useful in OpKernel.  Remove it to avoid misuse.

The meaning of `device_` is also ambiguous in the OpKernel.
For StaticDispatch kernels, we always call the CPU kernel.
For C10Kernel, we rely on the input tensor's device and the dispatcher to determine which device to run on.
For ops that involve multiple devices, e.g. aten._to_copy(device), the meaning of device is ill-defined.

Test Plan:
CI

Rollback Plan:

Reviewed By: henryoier, dolpm, kqfu, zhxchen17

Differential Revision: D78704840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944
Approved by: https://github.com/dolpm
2025-07-24 09:21:37 +00:00
693197eed6 [doc] remove FSDP1 developer note (#158991)
This resolves the PyTorch doc audit: we remove the FSDP1 doc and promote FSDP2.

https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158991
Approved by: https://github.com/svekars, https://github.com/mori360
ghstack dependencies: #158989
2025-07-24 08:21:54 +00:00
cyy
65c1109ca2 Remove CUDA 11 CMake code (#156795)
CUDA 11 is no longer supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156795
Approved by: https://github.com/atalman, https://github.com/malfet
2025-07-24 08:00:41 +00:00
70fb5bb6fb [CI] Add smoke test for NVSHMEM availability (#158938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158938
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-07-24 06:34:21 +00:00
30bb7636da removed zero dim cpu logic from fake_tensor.py (#147501)
Fixes #144748
In #144748, the inconsistency between eager mode and inductor mode is reported as a bug.
The root cause is that fake_tensor.py's find-common-device method, 0b0da81021/torch/_subclasses/fake_tensor.py (L833), takes zero-dim CPU tensors into account but the device check in adaption.h doesn't.

This fix adds a list of ops that bypass the zero-dim-cpu-tensor check, to align with eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501
Approved by: https://github.com/ezyang
2025-07-24 06:19:46 +00:00
68349118b5 [doc] add weifengpy to torch distributed pocs (#158989)
<img width="415" height="355" alt="Screenshot 2025-07-23 at 16 02 12" src="https://github.com/user-attachments/assets/35b6bb45-d5ed-4d74-8369-e8e66aaa2618" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158989
Approved by: https://github.com/mori360
2025-07-24 04:42:33 +00:00
e09d80c545 [vllm hash update] update the pinned vllm hash (#158997)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158997
Approved by: https://github.com/pytorchbot
2025-07-24 04:04:17 +00:00
07df6ba7f5 [BE] Remove unused test_python_gloo_with_tls (#158964)
This was last modified in 2021 and has not been invoked since at least the 2.0 release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158964
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #158961, #158962, #158963
2025-07-24 02:34:27 +00:00
d61153a300 Delete mobile merge rule (#158963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158963
Approved by: https://github.com/atalman
ghstack dependencies: #158961, #158962
2025-07-24 02:34:27 +00:00
da9e120e3f [BE] Remove unused build-android action (#158962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158962
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #158961
2025-07-24 02:34:27 +00:00
611b61e758 [BE] Remove android build rules (#158961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158961
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-07-24 02:34:27 +00:00
cyy
d352c28dd1 [2/N] Remove FindPackageHandleStandardArgs.cmake (#156559)
Following #157188, this PR removes FindPackageHandleStandardArgs.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156559
Approved by: https://github.com/albanD
2025-07-24 02:34:10 +00:00
0c0fcb53ff [CI][testing] Use 3 processes for testing on sm89 and sm90 jobs (#158691)
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2

sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory

They use larger runners, which have more GPU memory, so it's usually ok.  I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)
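
The per-process estimate above can be reproduced with a small calculation. This is a sketch under an assumption not stated in the PR: GPU memory is split evenly across processes, and each process pays roughly 1 GB for its CUDA context. The helper name is hypothetical.

```python
# Hypothetical helper: usable GPU memory per test process, assuming an even
# split of total memory and a fixed per-process CUDA context cost.
def usable_gb_per_proc(total_gb, procs, context_gb=1.0):
    return total_gb / procs - context_gb


two_procs = usable_gb_per_proc(22, 2)    # 22 / 2 - 1
three_procs = usable_gb_per_proc(22, 3)  # 22 / 3 - 1
print(round(two_procs, 2), round(three_procs, 2))
```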

I've applied skips to the ones that OOMed

Time decreases from ~2.7hr per test job -> ~2hr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
2025-07-24 01:51:28 +00:00
febf3c475e fix forced loglevel in pytorch oss code (#158820)
Differential Revision: [D78715806](https://our.internmc.facebook.com/intern/diff/D78715806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158820
Approved by: https://github.com/Skylion007, https://github.com/pradeepfn
2025-07-24 00:40:28 +00:00
7001d6fbc9 Skip slow tests for aarch64-inductor-benchmarks (#158842)
This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models take a long time, which leads to the benchmark runs timing out.  [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)

* The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels for aarch64 platforms, which significantly slow down execution, leading to the timeout.

**Action:**
* An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.

To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
      * timm benchmarks
      * torchbench benchmarks

As suggested, the skip is applied at the CPU-arch level, explicitly branching for aarch64 and adding the models which need to be skipped. This keeps the logic clean, but:
* An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.

Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
2025-07-24 00:21:38 +00:00
0118931e27 [Inductor] Fix a user-defined Triton kernel bool param codegen issue (#158845)
Summary: Fixes https://github.com/pytorch/pytorch/issues/158778. When handling a boolean type parameter to a user-defined Triton kernel, we need to treat it differently from integer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158845
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-07-24 00:19:27 +00:00
ebb032a202 [docker release] Fix push nightly tag (#158984)
This is a typo.
I see that this step is not executing in nightly builds:
https://github.com/pytorch/pytorch/actions/runs/16464544564/job/46538759844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158984
Approved by: https://github.com/oulgen
2025-07-23 23:39:49 +00:00
60ac3414eb [a2av] Split in_out_splits into in_splits and out_splits_offsets (#156743)
This makes it easier for users to feed `out_splits_offsets` as input to a combining a2av (coming next).
An example is in #157029.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156743
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235
2025-07-23 23:34:48 +00:00
d34cee4cf3 Revert "[Torch Native] Add test for packaging weight (#158750)"
This reverts commit 85ee2fb8c5c57b513526b0cc968ba13012167572.

Reverted https://github.com/pytorch/pytorch/pull/158750 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing on trunk: inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_with_exporter_weights [GH job link](https://github.com/pytorch/pytorch/actions/runs/16478978095/job/46590552109) [HUD commit link](85ee2fb8c5) ([comment](https://github.com/pytorch/pytorch/pull/158750#issuecomment-3111188266))
2025-07-23 23:24:55 +00:00
5cdb3d896e [FSDP][Replicate] added replicate function that uses FSDP instead of DDP (#158207)
**Summary**
Users would like to use Replicate with TP. Currently, the replicate function uses DDP, which has not been maintained, resulting in a lack of integration options. Since users can use FSDP with TP, we will make the replicate function use FSDP so that users can use replicate with FSDP. To that end I have created a replicate function that uses FSDP instead of DDP. One blocker that I ran into is that the replicate function has a contract which assigns a module a "replicate" attribute in the registry. This would mean that fully_shard's is_composable requirement would not be satisfied, making it impossible to apply fully_shard to a replicate module. The solution was to copy the fully_shard function and state and modify them for replicate. In the future, we should explore making replicate_state inherit from FSDP_state to get rid of the code duplication. I have attached below the profile trace of a replicated Net module.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_270fcc36-194a-42f5-9841-cace984c2132_devgpu263.prn2.facebook.com_1792146.1753232748025155780.pt.trace.json

**Test Case**
1.  pytest test/distributed/_composable/test_replicate_with_fsdp.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158207
Approved by: https://github.com/weifengpy

Co-authored-by: Anshul Sinha <50644008+sinhaanshul@users.noreply.github.com>
2025-07-23 22:53:06 +00:00
0204099762 Raise exception in Dynamo if op fails in the interpreter (#158661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158661
Approved by: https://github.com/williamwen42
ghstack dependencies: #158660
2025-07-23 22:31:51 +00:00
b67f97c166 Correctly handle OP_CONTAINS (#158660)
CPython can fall back to `__iter__` if the object doesn't implement
`__contains__`
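
The fallback can be seen in plain Python, independent of Dynamo. In this minimal sketch the class `OnlyIter` is a hypothetical example type that defines `__iter__` but not `__contains__`; the `in` operator still works by iterating.

```python
# Hypothetical example type: defines __iter__ only, no __contains__.
class OnlyIter:
    def __init__(self, items):
        self.items = items

    def __iter__(self):
        return iter(self.items)


box = OnlyIter([1, 2, 3])
# The membership test iterates over the object and compares each element.
found = 2 in box
missing = 5 in box
print(found, missing)
```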

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158660
Approved by: https://github.com/zou3519
2025-07-23 22:31:51 +00:00
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be cast to uint64 before performing the xor_sum reduction
- Floats will be upcast to double and then bitcast to uint64 before performing the xor_sum reduction
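
The rules above can be modeled in plain Python. This is a sketch of the described behavior only; it does not call `torch.hash_tensor`, and the helper names `float_bits` and `xor_sum_hash` are hypothetical.

```python
import struct
from functools import reduce

UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def float_bits(x: float) -> int:
    # Reinterpret the 8 bytes of a double as an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def xor_sum_hash(values):
    # Hypothetical model: floats are bitcast to uint64, ints are wrapped into
    # the uint64 range, then everything is reduced with xor.
    bits = [float_bits(v) if isinstance(v, float) else v & UINT64_MASK for v in values]
    return reduce(lambda a, b: a ^ b, bits, 0)


int_hash = xor_sum_hash([1, 2, 3])
float_hash = xor_sum_hash([1.0])
print(hex(int_hash), hex(float_hash))
```

One consequence of an xor reduction, visible in this model, is that identical values cancel in pairs: `xor_sum_hash([1.0, 1.0])` is 0.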

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
86df3ff1f1 fix xnnpack build on mac (#158881)
Summary: Fix a bug for not getting the correct sources

Test Plan:
CI

on my mac:
```
buck2 build @//fbobjc/mode/profile --show-full-output //xplat/executorch/examples/portable/executor_runner:executor_runner_opt
File changed: fbsource//xplat/caffe2/third_party/xnnpack.buck.bzl
Buck UI: https://www.internalfb.com/buck2/67b59179-4de8-462a-9202-0b9c34a35aef
Network: Up: 2.4MiB  Down: 1.3KiB  (reSessionID-f687a7cd-5961-4851-bc67-b07043baa52a)
Loading targets.   Remaining     0/1                                                                                                          504 targets declared
Analyzing targets. Remaining     0/42                                                                                                         1960 actions, 2424 artifacts declared
Executing actions. Remaining     0/975                                                                                                        37.2s exec time total
Command: build.    Finished 40 local
Time elapsed: 7.7s
BUILD SUCCEEDED
fbsource//xplat/executorch/examples/portable/executor_runner:executor_runner_opt /Users/maxren/fbsource/buck-out/v2/gen/fbsource/267ffdee31edf15e/xplat/executorch/examples/portable/executor_runner/__executor_runner_opt__/executor_runner_opt
```

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78771697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158881
Approved by: https://github.com/digantdesai
2025-07-23 22:06:27 +00:00
82f8e04f27 Update distributed maintainers (#158900)
I maintain a couple of components of distributed, like devicemesh, c10d, PGNCCL, gloo, etc. Can I be marked as not emeritus? Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158900
Approved by: https://github.com/albanD
2025-07-23 21:53:27 +00:00
5619bf9971 Enable MI355X PyTorch CI testing. (#158889)
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
2025-07-23 21:50:31 +00:00
d8425e9c75 [1/N] support of replication fallback strategy (#158046)
#### 1. Provide a default fallback strategy that can apply to an arbitrary operator whose output is a single tensor.

We can call register_op_strategy to register using the `fallback_op_strategy`:
- For ops without List[Tensor] as input, call:
```
register_op_strategy(op_overload)(replicate_op_strategy)
```
- For ops that take List[Tensor] as input, call:
```
register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy)
```
The strategy will force all input and output to be replicated with the corresponding redistribute_cost.

#### 2. Add a test function as a necessary condition for strategy function.
```
detect_exists_identical_opspec(*args, op, mesh, strategy_function)
```
This function detects if identical strategies will be produced given the sample `args`. It will iterate all combinations of placements for each arg and produce the output strategy from the registered `strategy_function`.

#### 3. Provide a context manager `op_strategy_context` to easily register/unregister strategies for testing.
E.g.,
```
with op_strategy_context(test_op.default, replicate_op_strategy):
    ...
```
#### 4. Fix a bug where TupleStrategy never gets flattened as expected:
9df0176408/torch/distributed/tensor/_op_schema.py (L286)
Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after  `strategy_schema = _wrap_with_op_strategy(op_schema)`.

This is the first implementation. The plan is to next add support for sharding on the batch dim as the output strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2025-07-23 21:14:20 +00:00
633d5faf3f [DeviceMesh] Enable slicing a submesh with warnings (#158899)
We don't create new PGs when slicing in DeviceMesh, so it is relatively safe to relax the requirement that one can only slice from the root mesh. But this does come with a caveat when the slicing is asymmetric, for example when only some ranks have the sliced-out submesh. So aside from removing the requirement, we also add a warning here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
2025-07-23 21:13:41 +00:00
4d5d56a30e [dynamo] lintrunner for gb_registry adds/updates (#158460)
This PR adds automation to adding/updating the JSON registry through the lintrunner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158460
Approved by: https://github.com/williamwen42
2025-07-23 21:02:54 +00:00
64e8d7d66b [BE] bump test dependency z3-solver to drop using deprecated pkg_resources (#158905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158905
Approved by: https://github.com/albanD, https://github.com/ezyang
ghstack dependencies: #158904
2025-07-23 21:01:02 +00:00
b935ad17d5 [BE][Easy] add missing Python 3.14 PyPI classifier (#158904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158904
Approved by: https://github.com/albanD
2025-07-23 21:01:02 +00:00
f7f550649f [cutlass backend] Change default inst level mm config number (#158901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158901
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh, https://github.com/Skylion007
2025-07-23 20:53:22 +00:00
255c0545e7 [BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290, #158291
2025-07-23 20:27:28 +00:00
9c68c4d08f [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR surfaces some mypy errors, but I believe those were already in the code base beforehand and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-23 20:27:28 +00:00
6ed2cb6ccd [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-23 20:27:28 +00:00
ab26d4fbeb [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-23 20:27:28 +00:00
da94023b02 [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but the C call event loss bug in Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
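
For readers who want to check whether an interpreter falls in the affected range named above, a minimal sketch follows. The helper `affected_by_c_call_loss` is hypothetical and is not part of the profiler; it only encodes the 3.12.0-3.12.4 range stated in this message.

```python
import sys


# Hypothetical helper: True if the given (major, minor, micro) version lies in
# the range 3.12.0 through 3.12.4, which this message names as affected.
def affected_by_c_call_loss(version_info):
    return (3, 12, 0) <= tuple(version_info[:3]) <= (3, 12, 4)


print(affected_by_c_call_loss(sys.version_info))
```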

So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.

There are ways to fix the problem correctly: for example, we can add a new monitoring callback to compensate for the call events of methods with C functions, or we can override the callback registered by `PyEval_SetProfile`.  These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-23 20:03:52 +00:00
c996aff6ed [ROCm] UT verifies a runtime error is raised if tensor.item() is captured in a cudagraph (#158878)
Unit test for this PR: https://github.com/pytorch/pytorch/pull/158165

This unit test verifies that a runtime error is raised when tensor.item() operation is captured in a cudagraph. Equally valid for ROCm and CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158878
Approved by: https://github.com/jeffdaily, https://github.com/ngimel
2025-07-23 20:01:50 +00:00
691736ae07 Add kernel options to flex docs (#158875)
Fixes https://github.com/pytorch/pytorch/issues/158741
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158875
Approved by: https://github.com/BoyuanFeng, https://github.com/albanD
2025-07-23 19:05:19 +00:00
fe8f556006 Fix Triton GEMM templates with k=1 (#158650)
Thanks to @davidberard98 for much of the analysis here. For GEMMs with K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads depend only on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict, given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape.
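
As a rough illustration of why a size like 97 is awkward, the sketch below computes the length of the last tile along one dimension for a few block sizes. The block sizes and the helper `last_tile_len` are assumptions chosen for illustration, not values taken from the Triton templates.

```python
# Hypothetical helper: number of valid elements in the last tile when a
# dimension of length `size` is covered by tiles of length `block`.
def last_tile_len(size, block):
    rem = size % block
    return rem if rem else block


# Assumed power-of-two block sizes, for illustration only.
blocks = (16, 32, 64, 128)
partial = {block: last_tile_len(97, block) for block in blocks}
# For size 97, no listed block size yields a full last tile.
print(partial)
```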

One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M and N are multiples of 8 so that these hints are fine and there is no misaligned address, no performance regression is observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
2025-07-23 18:45:51 +00:00
85ee2fb8c5 [Torch Native] Add test for packaging weight (#158750)
Add a test that requires weights to be packaged for torch native

For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.

After we add weight deduping, we should be able to let this config be False.

```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter_weights
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
2025-07-23 18:36:10 +00:00
fef236da69 Add zero_() and empty_like(t) to torch/csrc/stable/ops.h (#158866)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158866
Approved by: https://github.com/janeyx99
2025-07-23 18:31:05 +00:00
76be282e3a Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)"
This reverts commit d898d0d437bfdc0719e6c69d5005606c5e64fca8.

Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))
2025-07-23 18:25:46 +00:00
9905ed616a [Inductor] Expose decomposeK knobs as envvars (#158745)
Fix up decomposeK autotuning by removing the condition that returned more than `k_splits_limit` splits and raising the default from 5 to 10. Make `k_splits_limit` configurable via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS`, and make the threshold at which decompose_k is used configurable via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`.
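
A minimal sketch of setting the two knobs from Python. The helper `decompose_k_env` and the threshold value `32` are made up for illustration; only the environment variable names come from this change. It assumes the variables are read when Inductor's config is first imported, so they are set before importing `torch`:

```python
import os


def decompose_k_env(num_splits=10, threshold=None):
    """Build the environment overrides named in this commit message."""
    env = {"TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS": str(num_splits)}
    if threshold is not None:
        env["TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD"] = str(threshold)
    return env


# Apply the overrides before `import torch` (assumed to be required).
os.environ.update(decompose_k_env(num_splits=10, threshold=32))
```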

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
2025-07-23 18:23:44 +00:00
30b0ad5c68 Revert "Fix decorators skipping NCCL tests (#158846)"
This reverts commit 57024913c409764f129d6a7792625f5b05462e31.

Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking trunk. See distributed/_composable/fsdp/test_fully_shard_logging.py::LoggingTests::test_fsdp_logging [GH job link](https://github.com/pytorch/pytorch/actions/runs/16472103496/job/46564570609) [HUD commit link](57024913c4) ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3109553414))
2025-07-23 17:47:35 +00:00
41b6cdaf76 Revert "Fix Triton GEMM templates with k=1 (#158650)"
This reverts commit 9df0f565972a8a034fd77d65aff2c53e6e9856d1.

Reverted https://github.com/pytorch/pytorch/pull/158650 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158650#issuecomment-3109538827))
2025-07-23 17:42:10 +00:00
1b456c580d [dynamo][guards] Add type info of the guarded value in guard managers (#158765)
tlparse looks like this

<img width="1165" height="226" alt="image" src="https://github.com/user-attachments/assets/04c4e6b1-34a3-4d9d-8304-6eb6d9a94980" />

This will aid in reading guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158765
Approved by: https://github.com/Lucaskabela, https://github.com/StrongerXi
2025-07-23 16:59:15 +00:00
5e386eec94 [AOTI] enable aot inductor on Windows (#158915)
With many PRs landed, we can run the first aot inductor example on Windows.

<img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" />

Let's remove the Windows check on `AotCodeCompiler`.

CC: @angelayi , @desertfire , @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158915
Approved by: https://github.com/desertfire
2025-07-23 16:29:15 +00:00
00da8e63eb CI for Windows Arm64 (#148753)
This pull request adds a new CI workflow for Windows Arm64, named win-arm64-build-test.yml.
It can be triggered on any pull request by including the ciflow/win-arm64 tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148753
Approved by: https://github.com/malfet
2025-07-23 16:12:20 +00:00
576253c476 [math] Trace float.fromhex (#156976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156976
Approved by: https://github.com/zou3519
ghstack dependencies: #156975, #156977
2025-07-23 16:12:08 +00:00
f5314f89c8 [struct] Add struct.pack and struct.unpack polyfills (#156977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156977
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
ghstack dependencies: #156975
2025-07-23 16:12:08 +00:00
671e22a951 [math] Raise exception in Dynamo if constant fold call fail (#156975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156975
Approved by: https://github.com/zou3519
2025-07-23 16:12:08 +00:00
d3d9bc1c31 [inductor] Allow backends to register their own custom config object (#158254)
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
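
To illustrate why the custom config has to feed into the cache key, here is a toy sketch. It is not Inductor's actual hashing code; `cache_key` and the option names are hypothetical:

```python
import hashlib
import json


def cache_key(graph_src, default_config, custom_config=None):
    """Hash the graph together with both config dictionaries."""
    payload = {
        "graph": graph_src,
        "default": default_config,
        "custom": custom_config or {},
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


base = {"max_autotune": False}
key_a = cache_key("graph0", base, {"my_backend.fast_math": False})
key_b = cache_key("graph0", base, {"my_backend.fast_math": True})
print(key_a != key_b)  # a changed backend option must be a cache miss
```

If the custom options were left out of the payload, both calls would hash to the same key and a stale artifact could be reused.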

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
7d296d5c19 [aoti][mps] Enable more tests (#158703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158703
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #158349, #158350, #158351
2025-07-23 15:38:56 +00:00
2a60b8fc97 [export][ez] Fix packaging (#158855)
Summary: As title; seems to be a typo.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78758466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158855
Approved by: https://github.com/henryoier
2025-07-23 15:36:14 +00:00
d898d0d437 [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-23 15:06:54 +00:00
5998cd4eaa [MPS] Speedup torch.full for 1-byte types (#158874)
By using [`fillBuffer:range:value:`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/fillbuffer:range:value:?language=objc) rather than an MPSGraph op, which should be faster and also does not have the INT_MAX limit

Which in turn fixes `test_index_put_accumulate_large_tensor_mps` test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158874
Approved by: https://github.com/dcci
2025-07-23 14:00:40 +00:00
57024913c4 Fix decorators skipping NCCL tests (#158846)
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`

In particular, this avoids starting the tests (which forks subprocesses) only to stop them again (killing the subprocesses), which is what happens in the test setup.

Using `unittest.skip` decorators avoids the starting of the test in the first place.
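
A small standalone sketch of the difference, using only the standard library. `NCCL_AVAILABLE` and `CollectiveTest` are hypothetical stand-ins for the real checks and test classes:

```python
import unittest

NCCL_AVAILABLE = False  # hypothetical availability flag


class CollectiveTest(unittest.TestCase):
    setup_calls = 0

    def setUp(self):
        # Stands in for expensive setup such as spawning subprocesses.
        type(self).setup_calls += 1

    @unittest.skipIf(not NCCL_AVAILABLE, "NCCL not available")
    def test_all_reduce(self):
        self.fail("should never run when NCCL is unavailable")


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CollectiveTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(len(result.skipped), CollectiveTest.setup_calls)
```

Because the skip is attached to the test method, `unittest` records the skip before calling `setUp`, so the setup counter stays at zero.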

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
2025-07-23 13:31:21 +00:00
ee72338f0c [Inductor] MSVC use pointer when generating temporary array pointer (#158913)
MSVC cannot implicitly convert a const iterator to a const pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158913
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-23 13:19:11 +00:00
c665594c1e [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. rename zip index filename, and keep it out of normalize path.
2. normalize output path for extract file.

Extract files successful:
<img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-23 08:00:14 +00:00
255a04baf1 [pt2 event logging] send autotuning data for strides and hinted shapes (#158852)
Summary:
# Why

capture relevant data for offline lookup table generation

# What

report the hinted sizes not just the symbolic sizes

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)

Rollback Plan:

Reviewed By: stashuk-olek

Differential Revision: D77837548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852
Approved by: https://github.com/jingsh
2025-07-23 06:44:27 +00:00
1d302eaee8 [vllm] add vllm test base docker image (#158755)
# description
Add base docker image for vllm.

It seems like we use the base docker image for both the PyTorch build and tests. Configure a base image for vllm against PyTorch CI.

# Others
Added a readme describing how the base docker images are used and how to add one; it also explains which file is the right one to modify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158755
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-23 05:42:44 +00:00
a6b7bea244 [inductor] support linear & layer_norm unbacked (#155267)
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise, it will DDE and we can't take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2 and that falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).
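
The first point can be pictured with a toy model in plain Python. These are hypothetical stand-ins, not the real symbolic-shapes functions: a condition is `True`, `False`, or `None` when it depends on an unbacked value.

```python
from typing import Optional


class DataDependentError(RuntimeError):
    """Toy stand-in for a data-dependent guard failure (DDE)."""


def toy_guard(cond: Optional[bool]) -> bool:
    """Guard-style check: must decide, so an unknown condition raises."""
    if cond is None:
        raise DataDependentError("cannot guard on data-dependent expression")
    return cond


def toy_statically_known_true(cond: Optional[bool]) -> bool:
    """Returns True only when the condition is known; unknown means False."""
    return cond is True


def pick_path(size_is_one: Optional[bool]) -> str:
    # Optimization check: fall back to the general path when unsure.
    if toy_statically_known_true(size_is_one):
        return "fast"
    return "general"


print(pick_path(True), pick_path(None))
```

With the guard-style check, the unknown case would raise instead of reaching the general path.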

### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/ir.py:2797 in create)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
2025-07-23 05:42:01 +00:00
be72bcf828 [vllm hash update] update the pinned vllm hash (#158806)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158806
Approved by: https://github.com/pytorchbot
2025-07-23 04:41:53 +00:00
f80f97d192 [audio hash update] update the pinned audio hash (#158807)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158807
Approved by: https://github.com/pytorchbot
2025-07-23 04:39:50 +00:00
42a69f7c2b [MTIA Aten Backend] Migrate addmm.out / baddbmm.out / bmm.out (#158749)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addmm.out / baddbmm.out / bmm.out to in-tree.

Differential Revision: [D78578483](https://our.internmc.facebook.com/intern/diff/D78578483/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158749
Approved by: https://github.com/albanD, https://github.com/nautsimon
ghstack dependencies: #158748
2025-07-23 03:45:28 +00:00
b87471e66f [MTIA Aten Backend] Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out (#158748)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out to in-tree.

Differential Revision: [D78568103](https://our.internmc.facebook.com/intern/diff/D78568103/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158748
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-07-23 03:45:20 +00:00
f10e4430e2 [AOTI] normalize path and process model files. (#158705)
Continued to https://github.com/pytorch/pytorch/pull/158702 , split `zip_filename_str` and real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-23 02:58:21 +00:00
2dccff7dcf [inductor] pass_fds not supported on Windows, skip them on Windows. (#158830)
<img width="1366" height="806" alt="image" src="https://github.com/user-attachments/assets/ddf3d27a-36da-47ce-9ba9-00c43805bb06" />

Almost all UTs fail with `AssertionError: pass_fds not supported on Windows.`; let's skip them on Windows.
TODO: I will also debug and confirm `pass_fds` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158830
Approved by: https://github.com/jansel
2025-07-23 02:24:35 +00:00
dec0d3101c [export] fix unbacked range deserialization (#158681)
Fixes https://github.com/pytorch/pytorch/issues/151809, by reading shape assertion nodes into ShapeEnv, and deferring instantiation of node example values, to be done node-by-node.

Differential Revision: D78588406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158681
Approved by: https://github.com/ydwu4, https://github.com/avikchaudhuri
2025-07-23 02:13:11 +00:00
9df0f56597 Fix Triton GEMM templates with k=1 (#158650)
Thanks to @davidberard98 for much of the analysis here. For GEMMs with K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, as the load indices depend only on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict, since the load also depends on the k indices. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape.

One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M and N are multiples of 8, these hints are fine and there is no misaligned address; no performance regression is observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
2025-07-23 02:05:57 +00:00
91602a9254 Cleanup old caffe2 scripts (#158475)
Testing on this one is grep-based: if I could not find any reference to a script, I deleted it.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-23 01:21:31 +00:00
cc372ad557 [aoti][mps] Improve tabbing in cpp generation (#158351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158351
Approved by: https://github.com/desertfire, https://github.com/malfet
ghstack dependencies: #158349, #158350
2025-07-23 00:54:53 +00:00
84058d1179 [aoti][mps] Fix cpu kernel generation (#158350)
In the case where we have both mps and cpu code which can be inductor compiled, we need to case on the device -- this requires the device field to be correctly passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158350
Approved by: https://github.com/malfet
ghstack dependencies: #158349
2025-07-23 00:54:53 +00:00
096dc35d77 [aoti][mps] Fix update constants buffer (#158349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158349
Approved by: https://github.com/malfet
2025-07-23 00:54:52 +00:00
56d07d0bde Add merge_rules category for Dynamo; add guilhermeleobas (#158620)
Adds guilhermeleobas to merge_rules for Dynamo and functorch.
Guilherme has done good work on both of these subsystems and I am tired
of him approving my PRs and me not being able to merge them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158620
Approved by: https://github.com/anijain2305
2025-07-23 00:44:27 +00:00
39b54b78d7 [export] runtime asserts for while HOP subgraphs (#158467)
Differential Revision: D78431075

For #158366
- Calls runtime asserts pass for HOP subgraphs (in reenter_make_fx)
- For while_loop only (can be expanded), clones input tensors for subgraph tracing, so unbacked memos (item, nonzero, etc.) aren't reused

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158467
Approved by: https://github.com/ydwu4
2025-07-23 00:34:18 +00:00
3703dabe42 [ROCm] delete un-needed workaround for tensor.item() (#158486)
Deleting unused workaround per discussion here:
https://github.com/pytorch/pytorch/pull/158165#discussion_r2207968880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158486
Approved by: https://github.com/jeffdaily, https://github.com/houseroad
2025-07-23 00:31:57 +00:00
d3f9107d68 Remove top limit for cpython version and fix lint appropriately. (#158853)
As per title.
Sorry for the churn in the main commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158853
Approved by: https://github.com/seemethere, https://github.com/Skylion007, https://github.com/jingsh, https://github.com/malfet, https://github.com/ZainRizvi
2025-07-22 23:59:00 +00:00
cab96b5879 [tests] Reduce sizes of unnecessarily large tensors to reduce OOM flakes (#158456)
Downsizes several tensors that were massively oversized to test the problem at hand, to reduce test flaking.

Fixes #126867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158456
Approved by: https://github.com/desertfire
2025-07-22 23:41:48 +00:00
6100ed457c [ROCm] Improve Type Safety of C10_WARP_SIZE (#158271)
# Background

The `C10_WARP_SIZE`, although always `32` on the CUDA platform, varies across different AMD GPUs.
Therefore, to refer to this value correctly, host code must treat it as a variable instead of a literal defined by a macro or a `constexpr int`.

This PR may cause more compiler errors for third party code on AMD GPU, which is intentional. Having a fixed `C10_WARP_SIZE` value on host code for AMD GPU only defers compile time error to runtime.

This PR is recommended to be included as part of Release Notes to describe an API change for whoever uses this macro.

Users are recommended to use `C10_WARP_SIZE` directly, which adapts for various scenarios, or define a macro to use `C10_WARP_SIZE`. Assignment of this macro to symbols shared by host/device code causes problems on ROCM platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example)

# Behaviors

* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
  + Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
  + Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
    - `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
  + Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`

# `constexpr` variant for host code

For host-compilation cases where a `constexpr` value is needed for warp size (eg. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-07-22 23:19:38 +00:00
badfebf29e Revert "[Inductor] Expose decomposeK knobs as envvars (#158745)"
This reverts commit eac777c4f46b381106f2f2b78fe05b506f8c558c.

Reverted https://github.com/pytorch/pytorch/pull/158745 on behalf of https://github.com/jeffdaily due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/158745#issuecomment-3105071170))
2025-07-22 23:04:16 +00:00
fc5a404eb1 [gtest][listing] fixing caffe2:verify_api_visibility - main (#158229)
Summary: Remove the custom main from this test file

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/9570149303161031

Rollback Plan:

Reviewed By: patskovn

Differential Revision: D78015676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158229
Approved by: https://github.com/Skylion007
2025-07-22 22:45:28 +00:00
04a393507b Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
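
For reference, a plain-Python sketch of the computation being fused, following the definition documented for `torch.nn.RMSNorm` (epsilon added inside the square root). This only illustrates the math and is not the fused kernel:

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """y_i = weight_i * x_i / sqrt(mean(x^2) + eps) over one vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(y)
```

With unit weights and `eps=0`, the mean of the squared outputs is 1, which is a quick sanity check on any implementation.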

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests run on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-07-22 22:25:44 +00:00
a626dc8f16 [AOTI] windows package load dev (#158671)
changes:
1. add extract file fail handler for Windows develop.
2. normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-07-22 21:35:57 +00:00
fd47401536 [doc] Updates to distributed.md for XCCL backend (#155834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
e44e05f7ae [dynamo] Move skipIf decorator to class level in test_fx_graph_runnable (#157594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157594
Approved by: https://github.com/xmfan
ghstack dependencies: #157162
2025-07-22 20:41:49 +00:00
ddd74d10fc More fixes to MakeTensor::computeStorageSize() (#158813)
Followup after https://github.com/pytorch/pytorch/pull/158690 that fixes similar logic if `strides` are not explicitly specified
Expanded testing to cover both cases
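
As a general illustration of the two cases (explicit strides versus defaulted contiguous strides), here is a sketch of how a strided storage size is usually derived. The function names are hypothetical, and this is not a copy of `MakeTensor::computeStorageSize()`:

```python
def contiguous_strides(sizes):
    """Row-major strides, used when no strides are given."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= max(size, 1)
    return list(reversed(strides))


def storage_nbytes(sizes, itemsize, strides=None):
    """Bytes needed to reach the last addressable element, plus one."""
    if strides is None:
        strides = contiguous_strides(sizes)
    if any(size == 0 for size in sizes):
        return 0  # an empty tensor needs no storage
    last = sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return (last + 1) * itemsize


print(storage_nbytes([2, 3], itemsize=4))                   # default strides
print(storage_nbytes([2, 3], itemsize=4, strides=[4, 1]))   # padded rows
```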

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158813
Approved by: https://github.com/ZainRizvi, https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #158690
2025-07-22 20:36:12 +00:00
823e223893 [ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)
Fixes #156012

This is a temporary solution that makes context parallelism work until the logsumexp behavior changes land in AOTriton.

After discussion, we are not going to release AOTriton 0.10.1 to fix this because:
* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.
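
A small sketch of the rescaling implied by the title, assuming the kernel returns logsumexp in base 2 (as kernels built on `exp2` commonly do). The function names are illustrative only:

```python
import math


def lse_natural(scores):
    """Reference natural-base logsumexp, computed stably."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def lse_base2(scores):
    """Base-2 variant: log2(sum(2 ** (s * log2(e)))), computed stably."""
    scaled = [s * math.log2(math.e) for s in scores]
    m = max(scaled)
    return m + math.log2(sum(2.0 ** (s - m) for s in scaled))


def to_natural(lse2):
    """Scale a base-2 logsumexp back to the natural base."""
    return lse2 * math.log(2.0)


scores = [0.5, -1.25, 3.0]
print(math.isclose(to_natural(lse_base2(scores)), lse_natural(scores)))
```

The identity used is `log2(s) * ln(2) == ln(s)`, so a single multiply restores the natural-base value.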

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 20:32:34 +00:00
6499420e45 [DeviceMesh] Make the repr shorter when debug ENV not set (#158822)
Users want a shorter repr, so this PR addresses that when TORCH_DISTRIBUTED_DEBUG is not set to DETAIL. Feedback and discussion are welcome. I found that torch.set_printoptions is global, so I am hesitant to use it.
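
The idea can be sketched with a toy class. `ToyMesh` and its output format are invented for illustration and do not match the real DeviceMesh repr; only the `TORCH_DISTRIBUTED_DEBUG` check mirrors the description above:

```python
import os


class ToyMesh:
    def __init__(self, device_type, shape, dim_names, ranks):
        self.device_type = device_type
        self.shape = tuple(shape)
        self.dim_names = tuple(dim_names)
        self.ranks = ranks

    def __repr__(self):
        dims = ", ".join(f"{n}={s}" for n, s in zip(self.dim_names, self.shape))
        text = f"ToyMesh(({dims}), device: '{self.device_type}'"
        # Only include the full rank layout when detailed debugging is on.
        if os.environ.get("TORCH_DISTRIBUTED_DEBUG", "") == "DETAIL":
            text += f", ranks: {self.ranks}"
        return text + ")"


mesh = ToyMesh("cuda", (2, 2), ("dp", "tp"), [[0, 1], [2, 3]])
print(repr(mesh))
```

Reading the environment variable inside `__repr__` keeps the choice local to this class, which avoids touching a global setting such as `torch.set_printoptions`.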

The printed output now looks like

<img width="435" height="79" alt="image" src="https://github.com/user-attachments/assets/8f173287-7138-4fbe-a4a3-8483523b21e4" />

or

<img width="485" height="104" alt="image" src="https://github.com/user-attachments/assets/21e34db9-56b5-47e2-9767-750d6105a273" />

or

<img width="675" height="97" alt="image" src="https://github.com/user-attachments/assets/53aa763e-7edd-4622-9cdb-37e2af8ec11f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158822
Approved by: https://github.com/wz337, https://github.com/wconstab, https://github.com/xmfan
2025-07-22 20:31:44 +00:00
e17538022a Making input dynamically adjust. (#157324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157324
Approved by: https://github.com/Skylion007, https://github.com/d4l3k
2025-07-22 20:14:05 +00:00
37ded2ac90 Using torch.accelerator in comm_mode_features_example.py and visualize_sharding_example.py (#157317)
Continuation of https://github.com/pytorch/pytorch/pull/153213  .

 @guangyey
 @kwen2501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157317
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 19:58:48 +00:00
767791943d [ONNX] Set default opset to 20 (#158802)
Bump the default opset to 20, which is newer and is the highest opset the TorchScript exporter supports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158802
Approved by: https://github.com/titaiwangms
2025-07-22 19:55:05 +00:00
c917c63282 [ROCm][tunableop] UT tolerance increase for matmul_small_brute_force_tunableop at FP16 (#158788)
TunableOp will sometimes find a less precise solution due to the small input vectors used in this UT. Bumping op tolerance to eliminate flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158788
Approved by: https://github.com/jeffdaily
2025-07-22 19:45:35 +00:00
659bfbf443 Revert "We do support 3.14" (#158856)
Reverting to fix lint
This reverts commit 2a249f1967d29626fe6ac6a07f28440348d1cc93.

An emergency fix since the change needed to fix this is a little more complex than expected (see https://github.com/pytorch/pytorch/pull/158853 for reference)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158856
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-07-22 19:40:53 +00:00
832ab990c9 Use init_device_mesh API for select tests where possible (#158675)
This addresses reviews made for:
#158538
#108749

It replaces all the direct DeviceMesh constructor calls with the API provided by the test cases, to improve abstraction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158675
Approved by: https://github.com/wconstab
2025-07-22 19:28:42 +00:00
56df025d51 Add caching for _rename_without_collisions (#158594)
Fixes #158357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158594
Approved by: https://github.com/pianpwk
2025-07-22 19:19:13 +00:00
55ff4f85e9 [FP8][CUTLASS] xFail honor_sm_carveout on sm100 (#152378)
CUTLASS only supports SM carveout via green contexts on `sm100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152378
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/nWEIdia
2025-07-22 18:39:50 +00:00
7d2ceaff21 [dynamo] skip tracing functions registered in sys.monitoring (#158171)
Fixes https://github.com/pytorch/pytorch/issues/158164

This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.
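
For context, here is a minimal sketch of what "a function registered to `sys.monitoring`" looks like on Python 3.12+. The tool name `demo-tool` and the functions `on_py_start` and `target` are illustrative and not taken from the PR. The sketch only shows the kind of callback that would be skipped; it does not reproduce the `skip_code_recursive` logic.

```python
import sys

seen = []  # names of Python functions whose PY_START event was observed


def on_py_start(code, instruction_offset):
    # Hypothetical monitoring callback: record which function started.
    seen.append(code.co_name)


def target():
    return 42


# sys.monitoring only exists on Python 3.12 and later.
HAVE_MONITORING = hasattr(sys, "monitoring")

if HAVE_MONITORING:
    mon = sys.monitoring
    tool_id = mon.PROFILER_ID
    mon.use_tool_id(tool_id, "demo-tool")
    mon.register_callback(tool_id, mon.events.PY_START, on_py_start)
    mon.set_events(tool_id, mon.events.PY_START)
    try:
        target()
    finally:
        # Unregister so the callback does not outlive the example.
        mon.set_events(tool_id, mon.events.NO_EVENTS)
        mon.register_callback(tool_id, mon.events.PY_START, None)
        mon.free_tool_id(tool_id)
```

Callbacks like `on_py_start` run inside the interpreter's own event dispatch, so tracing them would add overhead without producing a useful graph.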

Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:

BEFORE:
```
requires_grad=False
eager    7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)

requires_grad=True
eager    8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)

inference_mode()
eager    6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```

AFTER:
```
requires_grad=False
eager    7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)

requires_grad=True
eager    9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)

inference_mode()
eager    6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)

Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
2025-07-22 18:02:30 +00:00
2a249f1967 We do support 3.14
This has been added a bit back.
2025-07-22 10:40:18 -07:00
52c294008e [hop] allow non fake inputs when check input alias and mutation (#158798)
https://github.com/pytorch/pytorch/pull/154193 was reverted due to a test failure. The root cause is that an executorch pass turns int inputs into scalar tensors in cond's subgraph. The pass has been on the critical path of executorch for two years, so changing it would be difficult. Instead, we allow non-fake inputs when checking input mutation and aliasing, which shouldn't affect the correctness of the analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158798
Approved by: https://github.com/pianpwk
2025-07-22 17:22:37 +00:00
0971637c11 Fix torch.tensor warning in ONNX symbolic_opset10 export (#158835)
Fix PyTorch tensor copying warning in ONNX export

## Problem

The PyTorch ONNX exporter was generating a warning about an incorrect tensor copying method:

```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158835
Approved by: https://github.com/justinchuby
2025-07-22 16:32:49 +00:00
7d6f340238 Revert "[AOTI] Add more default options to compile_standalone (#158560)"
This reverts commit a991e285ae35159680b0ad4be24669906a6fa256.

Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))
2025-07-22 16:20:17 +00:00
4060f30042 [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-22 16:01:51 +00:00
9a28e23d97 Revert "removed zero dim cpu logic from fake_tensor.py (#147501)"
This reverts commit 9e0473b56621162bd85e94943a516be4727e5651.

Reverted https://github.com/pytorch/pytorch/pull/147501 on behalf of https://github.com/ZainRizvi due to Seems to have broken ROCm. See inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_standalone_cos [GH job link](https://github.com/pytorch/pytorch/actions/runs/16428359564/job/46426243808) [HUD commit link](a991e285ae) ([comment](https://github.com/pytorch/pytorch/pull/147501#issuecomment-3103494041))
2025-07-22 15:45:34 +00:00
d0c00d9a69 [MPS] Do not crash if tensor dim > INT_MAX (#158824)
Looks like all MPS operations will crash if one of the tensor dimensions is
greater than `2**31-1`

Change it into a structured exception, by checking tensor size before
attempting to create MPS Tensor
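
The idea of validating sizes up front can be sketched in Python as follows. This is only an illustration of the check, not the PR's code: the helper name `check_mps_dims` and the error message are hypothetical.

```python
INT_MAX = 2**31 - 1  # largest dimension length the assertion above tolerates


def check_mps_dims(shape):
    # Hypothetical pre-check: raise a regular exception instead of letting
    # the framework hit a failed assertion and abort the process.
    for dim, size in enumerate(shape):
        if size > INT_MAX:
            raise RuntimeError(
                f"dimension {dim} has size {size}, which exceeds INT_MAX ({INT_MAX})"
            )
```

A caller can then catch the `RuntimeError`, which is not possible with a process abort.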

Add a regression test for it. Before this change, running the following would abort with a failed assertion:
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort      python3 -c·
```

Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
2025-07-22 15:12:26 +00:00
371ffaf415 [bucketing] Support case of several pgs in graph (#158632)
Main changes:
- bucketing collectives only from the same process_group by group_name
- Support groups like [0,2,4,6], [0,1,3,5] by using `rank_idx_dict` for in-pass operations such as computing slice indices.
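
A minimal sketch of the index mapping, assuming `rank_idx_dict` maps each global rank in a process group to its position within that group. The helpers `build_rank_idx_dict` and `local_slice` are illustrative names, not the PR's code.

```python
def build_rank_idx_dict(group_ranks):
    # Map global rank -> position inside the (possibly non-contiguous) group.
    return {rank: idx for idx, rank in enumerate(group_ranks)}


def local_slice(rank, group_ranks, shard_size):
    # Slice of a bucketed buffer owned by `rank`, addressed by the rank's
    # index within the group rather than by its global rank.
    idx = build_rank_idx_dict(group_ranks)[rank]
    return slice(idx * shard_size, (idx + 1) * shard_size)


group = [0, 2, 4, 6]
rank_idx_dict = build_rank_idx_dict(group)
```

With the group `[0, 2, 4, 6]`, global rank 4 sits at index 2, so with a shard size of 8 its slice starts at offset 16 rather than 32.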

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632
Approved by: https://github.com/wconstab
2025-07-22 14:50:39 +00:00
1b772de397 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre-save the autotune results in the precompile artifact.

It would be even better to pre-trim the cuda kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-22 14:12:21 +00:00
8e99714204 [EZ][BE][MPS] Remove unused ndArrayFromTensor (#158823)
And `printTensorNDArray`, both of which according to https://github.com/search?type=code&q=ndArrayFromTensor+org%3Apytorch are not used anywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158823
Approved by: https://github.com/dcci
ghstack dependencies: #158690
2025-07-22 14:06:42 +00:00
9b4d938f04 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-07-22 11:26:54 +00:00
0142d5f4e2 Revert "Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)"
This reverts commit f09a484b8164aaadd57a79354f0ccf47733f365e.

Reverted https://github.com/pytorch/pytorch/pull/158682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158682#issuecomment-3101648365))
2025-07-22 08:33:08 +00:00
91b69deeb0 [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 08:04:59 +00:00
392fa75411 Change from import trace to import config (#158796)
Summary:
for this particular instance, we're doing

 from torch._inductor.config import trace

...trace.provenance_tracking...

but for all other call sites, we're doing

from torch._inductor import config
... config.trace.provenance_tracking....

Test Plan:
CI

Rollback Plan:

Differential Revision: D78699876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158796
Approved by: https://github.com/c00w
2025-07-22 06:10:38 +00:00
3a67bf9c62 [PGNCCLx] Bring split and merge for PGNCCLx (#158790)
Summary: We added group split in D78300794 and remote_group_merge in D78450094. We first want to upstream this change to PGNCCLx as well so that NCCLx can use this new API and we can continue our c10d clean up in https://github.com/pytorch/pytorch/pull/158488.

Test Plan:
CI

```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```

Rollback Plan:

Differential Revision: D78521060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158790
Approved by: https://github.com/d4l3k
2025-07-22 06:05:00 +00:00
d984143a74 [ci][cutlass backend] Add ci for cutlass backend tests (#156626)
redo of https://github.com/pytorch/pytorch/pull/156136

Differential Revision: [D77327309](https://our.internmc.facebook.com/intern/diff/D77327309)

I want to try landing the full version first. If CI takes too long, we can go back to testing only a few names.
```
 -k 'test_max_autotune_cutlass_backend_regular_mm and not test_max_autotune_cutlass_backend_regular_mm_streamk'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156626
Approved by: https://github.com/huydhn, https://github.com/mlazos
2025-07-22 05:18:13 +00:00
21c97bd565 [reland] Transfer "stack_trace" in post_grad passes (#158752)
Summary:
We transfer stack trace in post_grad passes.

We shouldn't add "stack_trace" to _COPY_META_FIELDS because _COPY_META_FIELDS is used in proxy.py where stack_trace is explicitly set.

Since the stack_trace is being used by more and more debugging tools, we should also start testing it more rigorously. This PR starts by adding a first test that checks the stack trace is preserved through post_grad_passes.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_pattern_matcher_transfer_meta

buck run mode/dev-nosan
 fbcode//caffe2/test/inductor:auto_functionalize -- --rcaffe2/test/inductor:auto_functionalize_old
```

Rollback Plan:

Differential Revision: D78669729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158752
Approved by: https://github.com/jingsh
2025-07-22 03:49:13 +00:00
a155f742ad [benchmark] allow default mode for compile (#158792)
Allow default mode for compile when users cannot run "max-autotune-no-cudagraphs" due to compilation time. Overall, "default" mode is slower than "[max-autotune-no-cudagraphs](https://github.com/pytorch/pytorch/pull/158536)" depending on input shapes.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/5d25c0e4-6714-42bb-a544-b7ef9cbc1b17" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/40e0bbf9-657f-48f2-ac0c-1f0fd6a0ac1d" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/db582bb2-d8d4-414a-9de7-b9af061ad0cd" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/2ce18bd8-73fc-434a-820f-46aa9ad9ddce" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/f4cb5f4b-93d3-4d96-973f-37643912325a" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/231c5805-b156-4587-9c5f-504a33b60883" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/f651c578-813b-4a8e-bffc-b5b34bd879fc" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/bfdcc043-4370-4355-af84-9f463426b21a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158792
Approved by: https://github.com/zou3519
2025-07-22 03:07:22 +00:00
cyy
3639d29ea1 Fix warnings of unused-variable (#158627)
Fixes
```
/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_kernel.cpp:42:22: error: unused variable 'verification_pattern' [-Werror,-Wunused-variable]
```
and also extra semicolons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158627
Approved by: https://github.com/albanD
2025-07-22 02:49:06 +00:00
aee8a2e985 Remove duplicated installation for python dependencies. (#158339)
As the title states.

The `Common` section has already installed the Python dependencies:
1b389025ba/README.md (L247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158339
Approved by: https://github.com/ezyang
2025-07-22 02:39:28 +00:00
eac777c4f4 [Inductor] Expose decomposeK knobs as envvars (#158745)
Fix up decomposeK autotuning by removing the condition to return more than `k_splits_limit` and setting the default to 10 instead of 5. Make `k_splits_limit` user-configurable via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS`, and let users configure the threshold at which decompose_k is used via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`.
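
As a usage sketch, the two environment variables named above can be set before Inductor's config module is first imported. The values `20` and `64` are arbitrary placeholders rather than recommended settings, and the assumption that the variables are read at import time is not confirmed by this description.

```python
import os

# Set the knobs before `torch._inductor.config` is first imported
# (assumed here to be the point at which the variables are read).
os.environ["TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS"] = "20"  # placeholder value
os.environ["TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD"] = "64"   # placeholder value

# import torch  # import only after the environment is configured
```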

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
2025-07-22 01:59:51 +00:00
1a6b21c59f [AOTI] fix load_pt2 split wrong model name on Windows (#158711)
Fix `load_pt2` splitting out the wrong model name on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158711
Approved by: https://github.com/jansel
2025-07-22 01:54:44 +00:00
abe0c9538a [BE] Fix extra-semi warnings (#158730)
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what the reason was for adding the warning but not erroring on it when building with -Werror, introduced by https://github.com/pytorch/pytorch/pull/140236)

300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that need updates:
 - TensorPipe
 - LLVM
 - FBGEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
2025-07-22 01:05:03 +00:00
95b658427d Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 1179e333237b02ed8fe2ba10cb9a23adf98d7d7a.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
6341311333 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 2ad5c25cfc603c3656e6699d6137419dbb009495.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
350d6af52c [AOTI] add windows support for get_cpp_compile_command (#158732)
add windows support for `get_cpp_compile_command`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158732
Approved by: https://github.com/desertfire
2025-07-22 00:23:10 +00:00
9281625a9b Revert "Setup TorchBench in Docker (#158613)"
This reverts commit cab28330f8c49cdb66d6a299755dc09c87c14a9d.

Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/ZainRizvi due to Seems to have broken trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16429779764/job/46430634676) [HUD commit link](b3c868d603) ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3100023071))
2025-07-22 00:12:49 +00:00
2c37acfd89 [AOTI][CPU] Consider bias=None case for fbgemm_linear_fp16_weight (#158535)
Test Plan:

Rollback Plan:

Differential Revision: D78458214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158535
Approved by: https://github.com/houseroad, https://github.com/henryoier, https://github.com/jingsh
2025-07-21 23:42:44 +00:00
08540b13c6 Use cuda error code instead of error text in get_cuda_error_help (#158688)
Use cudaError_t and switch through the enum to prevent impact by upstream changes in wording
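
The design choice can be illustrated with a small Python sketch (the actual change is in C++). The enum below models only three `cudaError_t` values, and the help strings and the function name `get_error_help` are hypothetical.

```python
from enum import IntEnum


class CudaError(IntEnum):
    # A small subset of cudaError_t values from the CUDA runtime API.
    SUCCESS = 0
    INVALID_VALUE = 1
    MEMORY_ALLOCATION = 2


# Hypothetical help strings, keyed by error code instead of message text.
_HELP = {
    CudaError.INVALID_VALUE: "An argument passed to a CUDA call was invalid.",
    CudaError.MEMORY_ALLOCATION: "Out of device memory; try a smaller batch.",
}


def get_error_help(code):
    # Look up by numeric code, so a reworded upstream error message
    # cannot silently break the match.
    try:
        return _HELP.get(CudaError(code), "")
    except ValueError:  # code outside the subset modeled here
        return ""
```

Matching on the code keeps the lookup stable even if the text returned for an error changes between CUDA releases.
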
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158688
Approved by: https://github.com/q10, https://github.com/aorenste
2025-07-21 23:34:50 +00:00
187c2deb40 Fix clamp(min/max) strategy (#158619)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158619
Approved by: https://github.com/wanchaol
2025-07-21 23:26:08 +00:00
67be2f27e1 [CI][lintrunner] Only run on non deleted changed files (#158794)
My PR was failing lint because I removed a file, and then lintrunner would try to run on the deleted file and error, so this changes how the changed files are retrieved to only retrieve changed files that have not been removed.

I don't think this is possible through `gh pr view`, so instead it uses `gh api`
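
A minimal sketch of the filtering step, assuming the response shape of GitHub's "list pull request files" REST endpoint, where each entry carries a `filename` and a `status`, and deleted files have the status `removed`. The sample payload is made up, and the PR's actual `gh api` query may differ.

```python
import json


def non_deleted_files(files_json):
    # Keep every changed file except those the PR deletes.
    files = json.loads(files_json)
    return [f["filename"] for f in files if f["status"] != "removed"]


# Made-up payload in the shape of the "list pull request files" response.
sample = json.dumps(
    [
        {"filename": "torch/a.py", "status": "modified"},
        {"filename": "torch/old.py", "status": "removed"},
        {"filename": "torch/new.py", "status": "added"},
    ]
)
changed = non_deleted_files(sample)
```

Only the surviving paths in `changed` would then be handed to lintrunner.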

Testing: https://github.com/pytorch/pytorch/pull/158795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158794
Approved by: https://github.com/seemethere
2025-07-21 23:22:37 +00:00
d293022c47 [cutass backend] memorize parts of cache key to reduce general overhead (#158311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158311
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #156781
2025-07-21 23:21:12 +00:00
ee5a434f8c Revert "[BE] remove torch deploy - conditionals (#158288)"
This reverts commit 1a4268b8113d5160d71225bab980f03c2318a0a4.

Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
4c18e85300 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit a6de309ca15cda6b2792fc74e82814dc8d2f9dd9.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
920f26c761 Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 0b9fb91f17edfbc51ae36584dcb8350b2d8bb23b.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
99cc3633f6 Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)"
This reverts commit d9426a81d2ab54f809a3b32a6ab2e606073fe66f.

Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
15a50dcf1c Revert "[BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)"
This reverts commit eb7365072315be2bc4259114e25e269801441748.

Reverted https://github.com/pytorch/pytorch/pull/158427 on behalf of https://github.com/ZainRizvi due to Reverting this as part of reverting the stack for https://github.com/pytorch/pytorch/pull/158288 ([comment](https://github.com/pytorch/pytorch/pull/158427#issuecomment-3099815367))
2025-07-21 23:14:57 +00:00
1227ed6674 [dynamic shapes] fix _maybe_evaluate_static axioms bug (#158672)
Summary: couldn't get a minimal repro, but xref for size change during dict iteration error: https://fb.workplace.com/groups/1075192433118967/posts/1709439696360901

Test Plan:
-

Rollback Plan:

Differential Revision: D78047846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158672
Approved by: https://github.com/bobrenjc93
2025-07-21 23:14:19 +00:00
2bb684304d Fix the typos in the right nav by pulling the latest theme (#158746)
This will fix broken links in the right nav.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158746
Approved by: https://github.com/malfet
2025-07-21 22:51:07 +00:00
f09a484b81 Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)
Summary:
**Changes**
*   Deleted function import from build definition utilities
    *   Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
*   Replaced is_arvr_mode() function calls with direct references to configuration flags
    *  Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
*   Changed conditional expressions to Buck `select()` statements

Test Plan:
Check if CI passes

Rollback Plan:

Differential Revision: D78520947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
2025-07-21 22:49:26 +00:00
feaa02f9ad Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a78fb63dbdf98a1db219095293de1a11005e0390.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to It still breaks inductor-perf-nightly, see https://github.com/pytorch/pytorch/actions/runs/16425364208/job/46417088208, I'm going to dismiss all previous reviews ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3099706457))
2025-07-21 22:46:53 +00:00
b3c868d603 [vllm]Add vllm.txt for pinned commit (#158754)
It seems that nightly.yml won't auto-generate the txt file when it does not exist, so this adds the file with the latest merged commit from vllm:

[vllm commit](https://github.com/vllm-project/vllm/commits/main)

Error:
https://github.com/pytorch/pytorch/actions/runs/16405915719/job/46351847504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158754
Approved by: https://github.com/huydhn
2025-07-21 22:41:07 +00:00
cab28330f8 Setup TorchBench in Docker (#158613)
This reduces the time spent setting up TorchBench on A100/H100 by another half an hour

### Testing

* H100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396172453.  Once this is done, I will review the results on [HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2011%20Jul%202025%2023%3A01%3A24%20GMT&stopTime=Fri%2C%2018%20Jul%202025%2023%3A01%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/huydhn/6/head&lCommit=14a38c719b29a19f518239b5edb084838ac5d2fb&rBranch=main&rCommit=0a99b026d6bd0f67dc2c0a20fe3228ddc4144854) to confirm that all models are there
* A100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396173932

Signed-off-by: Huy Do <huydhn@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158613
Approved by: https://github.com/janeyx99
2025-07-21 22:34:08 +00:00
4366610f5a [c10d] block_current_stream: correctness fixes (#158757)
This fixes a number of issues that were present in https://github.com/pytorch/pytorch/pull/156883 as pointed out by @ngimel

Test plan:

Expanded tests to cover use after free behavior + non-default stream

```
pytest test/distributed/test_c10d_pypg.py -v -k block_current_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158757
Approved by: https://github.com/ngimel
2025-07-21 22:23:44 +00:00
dd0adc9386 [SymmMem] Add NVSHMEM broadcast support into Triton (#158514)
Adds broadcast collective operation for distributing data from root PE to all other PEs in NVSHMEM Triton kernels.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_broadcast`
<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 0] Starting broadcast test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
============================================================
[Rank 1] Starting broadcast test with world_size=2
============================================================
[Rank 1] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
[Rank 1] Non-root source data: [-1, -1, -1, -1]
[Rank 0] Root source data: [100, 101, 102, 103]
[Rank 1] Initial destination: [-999, -999, -999, -999]
[Rank 0] Initial destination: [-999, -999, -999, -999]
[Rank 0] Executing broadcast operation...
[Rank 1] Executing broadcast operation...
[Rank 0] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results after broadcast:
[Rank 0] Results after broadcast:
[Rank 1] Destination buffer: [100, 101, 102, 103]
[Rank 1] Expected: [100, 101, 102, 103]
[Rank 0] Destination buffer: [100, 101, 102, 103]
[Rank 0] Expected: [100, 101, 102, 103]
[Rank 1] Match: ✓
[Rank 0] Match: ✓
[Rank 1] ============================================================
[Rank 1] Broadcast test PASSED ✓
[Rank 1] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Broadcast test PASSED ✓
[Rank 0] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 0] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158514
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512, #158513
2025-07-21 22:23:26 +00:00
734826d88e Revert "[AOTI] windows package load dev (#158671)"
This reverts commit d42c40976727fed4c9908d4194f26917d0a3da66.

Reverted https://github.com/pytorch/pytorch/pull/158671 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @angelayi can you please help them validate the fixes internally? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158671#issuecomment-3099570374))
2025-07-21 22:20:46 +00:00
5a56e6a72b Revert "[AOTI] fix extract file failed on Windows. (#158702)"
This reverts commit 7cc1a9546c135f8e7635e0d38aa2bba797f8907d.

Reverted https://github.com/pytorch/pytorch/pull/158702 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158702#issuecomment-3099556215))
2025-07-21 22:18:19 +00:00
e8af168ee0 Revert "[AOTI] normalize path and process model files. (#158705)"
This reverts commit ff0da08f4bc5ee135b495926cd58a36a1c0e1a5b.

Reverted https://github.com/pytorch/pytorch/pull/158705 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158705#issuecomment-3099532516))
2025-07-21 22:16:03 +00:00
97d7dc197f Revert "[AOTI] Convert C-struct zip handling to RAII container (#158687)"
This reverts commit 8ed5e1844c77d952bcea89ca7d0225d876fec4e8.

Reverted https://github.com/pytorch/pytorch/pull/158687 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158687#issuecomment-3099515618))
2025-07-21 22:13:26 +00:00
9498d95b9c [Dynamo][BetterEngineering] Type trace_rules.py (#158679)
As part of better engineering week, we would like to improve our type support to improve dev experience in dynamo

This PR adds strict typing support to a core file, `trace_rules.py`
Running
```
mypy torch/_dynamo/trace_rules.py   --linecount-report /tmp/coverage_log
```
| Branch | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2564 | 3997 | 64.15% | 34 | 53 | 64.15% |
| This PR | 4022 | 4022 | 100.00% | 53 | 53 | 100.00% |
| Delta    | +1458 | +25 | +35.85% | +19 | 0 | +35.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158679
Approved by: https://github.com/williamwen42
2025-07-21 22:12:59 +00:00
0e46f54286 [ROCm][CI] update HIP patch for 6.4.1 (#158651)
The patch is intended to fix hipGraph capture for some MIOpen kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-21 22:09:36 +00:00
216ba6e5f2 Fix MaskedTensor to device ignored mask (#151205)
Fixes #147140

## Changes

- Add a `to` implementation in `MaskedTensor` that also moves the `mask` to the target device

## Test Result

```python
In [1]: import torch
   ...: from torch.masked import as_masked_tensor
   ...: data = torch.tensor([1,2,3])
   ...: mask = torch.tensor([True,False,True])
   ...: mt = as_masked_tensor(data, mask).to('cuda')
   ...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))

In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)

```

```bash
pytest test/test_maskedtensor.py -vv
```

![image](https://github.com/user-attachments/assets/640b809c-b4f0-4aca-a09e-04049017a745)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
2025-07-21 21:44:49 +00:00
c774180e59 Bump requests from 2.32.2 to 2.32.4 in /tools/build/bazel (#158006)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.2&new-version=2.32.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-21 21:35:38 +00:00
a991e285ae [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-07-21 21:16:48 +00:00
9e0473b566 removed zero dim cpu logic from fake_tensor.py (#147501)
Fixes #144748
In #144748, an inconsistency between eager mode and inductor mode is reported as a bug.
The root cause is that the find-common-device method in fake_tensor.py, 0b0da81021/torch/_subclasses/fake_tensor.py (L833), takes zero-dim CPU tensors into account, but the device check in adaption.h doesn't.

This fix adds a list of ops that bypass the zero-dim-cpu-tensor check, to align with eager mode.
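
For illustration only, here is a minimal pure-Python sketch of one possible reading of this change: common-device resolution that normally lets a zero-dim CPU tensor defer to another device, with a per-op opt-out. The names `TensorInfo`, `STRICT_DEVICE_OPS`, `aten.example_op`, and `find_common_device` are hypothetical and are not taken from fake_tensor.py.

```python
from typing import Iterable, NamedTuple


class TensorInfo(NamedTuple):
    """Hypothetical stand-in for the metadata a fake tensor carries."""
    device: str
    ndim: int


# Hypothetical list of ops that must NOT get the zero-dim CPU exception.
STRICT_DEVICE_OPS = {"aten.example_op"}


def find_common_device(op_name: str, tensors: Iterable[TensorInfo]) -> str:
    """Pick one device for all inputs, or raise on a real mismatch."""
    allow_zero_dim_cpu = op_name not in STRICT_DEVICE_OPS
    common = None
    common_is_zero_dim_cpu = False
    for t in tensors:
        is_zero_dim_cpu = t.device == "cpu" and t.ndim == 0
        if common is None:
            common, common_is_zero_dim_cpu = t.device, is_zero_dim_cpu
        elif t.device == common:
            # Same device: the "scalar-like" status only survives if both are.
            common_is_zero_dim_cpu = common_is_zero_dim_cpu and is_zero_dim_cpu
        elif allow_zero_dim_cpu and is_zero_dim_cpu:
            continue  # a zero-dim CPU tensor defers to the current device
        elif allow_zero_dim_cpu and common_is_zero_dim_cpu:
            common, common_is_zero_dim_cpu = t.device, False
        else:
            raise RuntimeError(
                f"{op_name}: found at least two devices, {common} and {t.device}"
            )
    if common is None:
        raise RuntimeError(f"{op_name}: no tensor inputs")
    return common
```

With this sketch, an op outside the strict list resolves `[cuda matrix, cpu scalar]` to `cuda`, while an op in the list raises a device-mismatch error for the same inputs.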

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501
Approved by: https://github.com/ezyang
2025-07-21 21:11:10 +00:00
5e17932c22 [DCP] Add support for ShardedTensor to PgTransport (#158573)
Add support for ShardedTensors when PGTransport is used to send/recv checkpoints

Test is pulled from https://github.com/pytorch/pytorch/pull/157963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158573
Approved by: https://github.com/meetv18
2025-07-21 21:04:23 +00:00
6b0526a2c4 ban fusion of large amount of reads (#158667)
This is a reland attempt of https://github.com/pytorch/pytorch/pull/157563, but instead of introducing the `realize_acc_reads_size_threshold` config with a default value, we set it to `None` for now to unblock an internal use case. We will deep dive into the issue and harden the logic in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158667
Approved by: https://github.com/yf225
2025-07-21 21:00:40 +00:00
bc379aebe2 Revert "Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)"
This reverts commit 8e57cdb746b4ab28865fdf01532f87b0d21700e9.

Reverted https://github.com/pytorch/pytorch/pull/158048 on behalf of https://github.com/jeffdaily due to rocm failures due to unit test introduced in this PR, but no pre-merge signal available ([comment](https://github.com/pytorch/pytorch/pull/158048#issuecomment-3098746624))
2025-07-21 20:45:21 +00:00
b1a0c34dd3 [pt2 event logging] add configurable prefix (#157678)
Summary:
# Why

make experiments easier to find

# What

- dynamo config to provide a prefix
- use the prefix when sending data to scuba through the self.id_ field
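
A minimal sketch of the idea, assuming the run id is a UUID with the optional prefix joined by a dash (which matches the shape of the `run_id` values in the Test Plan). The helper name `make_run_id` is hypothetical and not the actual dynamo config or logging code.

```python
import uuid
from typing import Optional


def make_run_id(prefix: Optional[str] = None) -> str:
    """Return a unique run id, optionally tagged with an experiment prefix."""
    run_id = str(uuid.uuid4())
    # Without a prefix the id is just the UUID, so existing queries keep working.
    return f"{prefix}-{run_id}" if prefix else run_id


print(make_run_id("coconutruben-02"))
```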

Test Plan:
```
# code edited to set the prefix as `coconutruben-02`
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

on scuba

```
| autotune_dtypes | autotune_offset | autotune_shape | autotune_strides | event | run_id |
| -----| -----| -----| -----| -----| ----- |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-e6bdccc5-6dcf-4d68-9a04-b34f2c6d94fd" |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-14165153-5842-4eaa-9e6c-3b0cbc016375" |

```

Rollback Plan:

Differential Revision: D77837550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157678
Approved by: https://github.com/stashuk-olek
2025-07-21 20:41:03 +00:00
851e953f68 ci: Only run lint jobs on relevant files (#158773)
Conditionally run lint jobs on relevant files. This
is mainly targeted at clangtidy since it takes a long time,
but also includes mypy since that's an additional 4 minutes
of runtime that we can save.
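
As a rough sketch of the selection logic (not the actual workflow change), the decision can be modeled as matching changed paths against per-job patterns. The `LINT_JOB_PATTERNS` table and the `jobs_to_run` helper are hypothetical, and the patterns are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical mapping from lint job to the file patterns it cares about.
LINT_JOB_PATTERNS = {
    "clangtidy": ["*.c", "*.cc", "*.cpp", "*.cu", "*.h", "*.hpp"],
    "mypy": ["*.py", "*.pyi"],
}


def jobs_to_run(changed_files: Iterable[str]) -> List[str]:
    """Return the lint jobs that have at least one relevant changed file."""
    files = list(changed_files)
    return sorted(
        job
        for job, patterns in LINT_JOB_PATTERNS.items()
        if any(fnmatchcase(path, pat) for path in files for pat in patterns)
    )


print(jobs_to_run(["torch/_dynamo/trace_rules.py", "README.md"]))
```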

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158773
Approved by: https://github.com/malfet
2025-07-21 20:21:34 +00:00
b66f429827 Fix torch.randint, torch.mul param missing description (#158731)
A wrong separator caused the parameter descriptions to be truncated.

- Change the separator between each param and its description
- Remove the quotes so that `torch.dtype` displays as a reference to the class

## Test Result

### Before

<img width="1092" height="784" alt="image" src="https://github.com/user-attachments/assets/e8d96b26-07e9-40ff-9392-fa6665d4bbe4" />
<img width="1111" height="457" alt="image" src="https://github.com/user-attachments/assets/a3c2e333-f861-4aeb-b4fb-05c8d880ae81" />

### After

<img width="897" height="820" alt="image" src="https://github.com/user-attachments/assets/d1b5cefa-717a-4223-84b0-4346b7eecf44" />
<img width="872" height="409" alt="image" src="https://github.com/user-attachments/assets/96223c37-cd9d-4656-9e55-032d09cbe5c1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158731
Approved by: https://github.com/ngimel
2025-07-21 20:17:27 +00:00
ea5b06ed5b [Dynamo][BetterEngineering] Type side_effects.py (#158605)
As part of better engineering week, we would like to improve our type support to improve dev experience in dynamo

This PR adds strict typing support to a core file, `side_effects.py`
Running
```
mypy torch/_dynamo/side_effects.py   --linecount-report /tmp/coverage_log
```
| Branch | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  365 | 1166 | 31.30% | 16 | 51 | 31.37% |
| This PR | 1185 | 1185 | 100.00% | 51 | 51 | 100.00% |
| Delta    | +820 | +19 | +68.70% | +35 | 0 | +68.63% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158605
Approved by: https://github.com/StrongerXi
2025-07-21 19:34:14 +00:00
25fbf09d5f Use more fine-grained locks in sym mem kernels (#158523)
Summary: Use only acquire at the beginning of the kernel, and only release at the end

Test Plan:
Existing tests

Rollback Plan:

Differential Revision: D78458020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158523
Approved by: https://github.com/drisspg, https://github.com/kwen2501
2025-07-21 19:23:47 +00:00
22920c9138 Grab bag of (mostly) typing improvements (#158075)
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:

1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`.
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
2025-07-21 19:17:01 +00:00
ad2dec1997 [SymmMem] Add NVSHMEM alltoall support into Triton (#158513)
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`

<details>
<summary>Quick debug print for sanity check</summary>

```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems_per_pe: 2
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
  - Data for PE 0: [0, 0] (indices 0-1)
  - Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
  - Data for PE 0: [100, 100] (indices 0-1)
  - Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [0, 0]
    Actual:   [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [1, 1]
    Actual:   [1, 1]
    Match:    ✓
    Match:    ✓
  - From PE 1 (indices 2-3):
    Expected: [100, 100]
  - From PE 1 (indices 2-3):
    Expected: [101, 101]
    Actual:   [100, 100]
    Actual:   [101, 101]
    Match:    ✓
    Match:    ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
  - Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
  - Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
  - Each rank sends 2 elements to every other rank
  - Rank 0 received: [0, 0, 100, 100]
  - My data for PE 0 (0) went to PE 0's buffer
  - I received PE 0's data for me (0)
  - My data for PE 1 (1) went to PE 1's buffer
  - Rank 1 sent: [100, 100, 101, 101]
  - I received PE 1's data for me (100)
[Rank 0] ============================================================
  - Rank 1 received: [1, 1, 101, 101]
  - My data for PE 0 (100) went to PE 0's buffer
  - I received PE 0's data for me (1)
  - My data for PE 1 (101) went to PE 1's buffer
  - I received PE 1's data for me (101)
[Rank 1] ============================================================
```

</details>
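
The data movement in the debug output can be reproduced with a small host-side simulation. This is only a sketch of alltoall semantics in plain Python, not the NVSHMEM Triton kernel; the function name `simulate_alltoall` is hypothetical.

```python
from typing import List


def simulate_alltoall(send_buffers: List[List[int]], nelems_per_pe: int) -> List[List[int]]:
    """Chunk `dst` of PE `src`'s source buffer lands in slot `src` of PE `dst`'s destination."""
    n_pes = len(send_buffers)
    recv = [[-1] * (n_pes * nelems_per_pe) for _ in range(n_pes)]
    for src, buf in enumerate(send_buffers):
        for dst in range(n_pes):
            chunk = buf[dst * nelems_per_pe:(dst + 1) * nelems_per_pe]
            recv[dst][src * nelems_per_pe:(src + 1) * nelems_per_pe] = chunk
    return recv


# Same source buffers as in the log: rank 0 and rank 1 with 2 elements per PE.
result = simulate_alltoall([[0, 0, 1, 1], [100, 100, 101, 101]], nelems_per_pe=2)
print(result)
```

The simulated destination buffers should agree with the "Destination buffer" lines logged by each rank.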

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
2025-07-21 19:14:47 +00:00
662dd7db5b [cutlass backend] cache maybe_append_choices (#156781)
This PR attempts to cache:
* codegen for the cutlass backend for the same kernel, even if runtime params are different.

From some profiling, most of the time is spent on render, so we only target caching that part for now.

The output of render is `code`, and we are able to cache that easily. Also, I have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.

make_key is doing most of the work here: We are hashing on input node layouts, output node layout and op.configuration_name() (this is what hash(op) would do anyway).
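
A minimal sketch of this caching scheme, assuming layouts and the configuration name can be represented as plain strings. `make_key` here is a simplified stand-in for the one in the PR, and `cached_render` and `_render_cache` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Sequence, Tuple

# Maps a key to the cached (code, size_args) pair produced by render.
_render_cache: Dict[str, Tuple[str, tuple]] = {}


def make_key(input_layouts: Sequence[str], output_layout: str, configuration_name: str) -> str:
    """Hash everything that determines the generated code."""
    payload = repr((tuple(input_layouts), output_layout, configuration_name))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_render(
    input_layouts: Sequence[str],
    output_layout: str,
    configuration_name: str,
    render: Callable[[], Tuple[str, tuple]],
) -> Tuple[str, tuple]:
    """Call `render` only on a cache miss; reuse code and size_args otherwise."""
    key = make_key(input_layouts, output_layout, configuration_name)
    if key not in _render_cache:
        _render_cache[key] = render()
    return _render_cache[key]
```

Caching `size_args` together with `code` keeps the two consistent, since both come from the same render call.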

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
2025-07-21 19:02:39 +00:00
72db0a98a3 Revert "[DTensor] Assert DTensorSpec has valid placements (#158133)"
This reverts commit 1839e8d04b81ee6eda0cff6fbfc218a7a600f6f7.

Reverted https://github.com/pytorch/pytorch/pull/158133 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78496151 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158133#issuecomment-3097994857))
2025-07-21 18:54:07 +00:00
8ed5e1844c [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-21 18:53:14 +00:00
393fecb2cc [Optimus][Unit test] clean up the unit test (#158696)
Summary: We should only patch the specific pattern(s) for each unit test.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```

Buck UI: https://www.internalfb.com/buck2/f8d37674-91c4-4244-90fa-f24fc3f91e4b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275088644915
Network: Up: 100KiB  Down: 233KiB  (reSessionID-92039f44-bc6f-4e78-87b1-93bca1bd1c66)
Analyzing targets. Remaining     0/296
Executing actions. Remaining     0/20196                                                                    5.8s exec time total
Command: test.     Finished 2 local, 2 cache (50% hit)                                                      4.6s exec time cached (79%)
Time elapsed: 3:55.1s
Tests finished: Pass 13. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D78598127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158696
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-07-21 18:05:09 +00:00
9285b8245c [BE][testing] fix test_cat_max_autotune_triton (#158589)
Summary: This test often fails internally -- looks like it's because autotuning sometimes chooses not to do the epilogue tuning. Turning off `benchmark_epilogue_fusion` seems to fix it.

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
2025-07-21 18:02:18 +00:00
637e75433c [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2025-07-21 17:56:05 +00:00
a78fb63dbd [build] pin setuptools>=77 to enable PEP 639 (#158104)
For reference here is the link PEP 639: [peps.python.org/pep-0639](https://peps.python.org/pep-0639/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-21 17:46:40 +00:00
7205458b85 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
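
A rough sketch of the general idea, assuming the library is loaded through `ctypes.CDLL`: catch the low-level loader failure and re-raise it with a message that names the library. The helper name `load_library_with_clear_error` and the message wording are hypothetical and not the text used in this PR.

```python
import ctypes


def load_library_with_clear_error(path: str) -> ctypes.CDLL:
    """Load a shared library, surfacing the failing path in the error."""
    try:
        return ctypes.CDLL(path)
    except OSError as exc:
        # Chain the original error so the missing-symbol detail is not lost.
        raise OSError(f"Could not load this library: {path}") from exc
```

Failing at load time points at the broken library itself, instead of at a later "operator does not exist" error.
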
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-21 17:32:31 +00:00
35f1b4ad9e Revert "Fused RMSNorm implementation (#153666)"
This reverts commit 15ef4f28df0a14e9f0d55a57a4e2db415a303be7.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change?You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts.  See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935))
2025-07-21 17:31:42 +00:00
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
9894d43b6c [AOTI] explicit aoti wrapper functions for Windows. (#158713)
On Windows, we need to explicitly declare the exported APIs, because the package loader calls these APIs via GetProcAddress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158713
Approved by: https://github.com/desertfire
2025-07-21 15:59:44 +00:00
f168cf49a8 [BE] Always use python 3.9 for pre-push hook's lintrunner (#158693)
A follow up to https://github.com/pytorch/pytorch/pull/158389

Sets up the pre-push lintrunner to always use python 3.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158693
Approved by: https://github.com/atalman
2025-07-21 15:19:27 +00:00
393377d215 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit a527e816935957a164d74dd7c5069310b2857695.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/malfet due to This broke lint, see 8e57cdb746/1 ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3096893256))
2025-07-21 13:58:50 +00:00
8e57cdb746 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre save the autotune results in the precompile artifact.

It would be even better to pre trim the cuda kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-21 13:35:46 +00:00
d5a29fc58a De-abstract premature generalization with InductorWrapper (#158528)
See docblock on InductorWrapper for the distinction.  This will matter
on a later refactor PR where I will change the signature for one of
these but not the other.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158528
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158449
2025-07-21 13:27:07 +00:00
979fae761c Rename modules in AOTAutograd (#158449)
Fixes https://github.com/pytorch/pytorch/issues/158382

```
renamed:    torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed:    torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed:    torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```

Everything else is ONLY import changes. I did not rename any functions
even if we probably should have.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
2025-07-21 13:27:07 +00:00
1eb6b2089f [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-21 12:46:05 +00:00
bbc32d680f [SymmMem] Add NVSHMEM sync_all support into Triton (#158512)
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
2025-07-21 10:27:59 +00:00
a527e81693 [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-21 09:24:29 +00:00
1c6328a588 [EZ][BE] Fix compilation warning in Pooling.metal (#158729)
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although this codepath should never be hit in practice, it's nicer not to emit a warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
2025-07-21 04:34:14 +00:00
70b4a8880b [SymmMem] Add NVSHMEM barrier_all, my_pe, n_pes support into Triton (#158511)
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.

We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`

Also tested that if you remove the barrier, you get an assertion error/race conditions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
2025-07-21 02:37:33 +00:00
5e1232871b Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a4ec381302f8acd279033707b182bed30ffd2091.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to This breaks inductor-perf-nightly-macos by failing to build torchvision, see https://github.com/pytorch/pytorch/issues/158728 ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3095048940))
2025-07-21 02:24:11 +00:00
ff0da08f4b [AOTI] normalize path and process model files. (#158705)
Continuation of https://github.com/pytorch/pytorch/pull/158702: separate `zip_filename_str` from the real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-21 01:08:59 +00:00
2cdafab0bd [BE] Raise ValueError from torch.cat meta func (#158249)
Followup after https://github.com/pytorch/pytorch/pull/155460

From [Python documentation](https://docs.python.org/3/library/exceptions.html#ValueError):
> Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.

Raise [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when input-output types are incompatible with each other
> Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.

> This exception may be raised by user code to indicate that an attempted operation on an object is not supported, and is not meant to be. If an object is meant to support a given operation but has not yet provided an implementation, [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) is the proper exception to raise.
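
The split between the two exception types can be sketched in plain Python. The helper below is a hypothetical stand-in, not the actual `torch.cat` meta function: shapes are modeled as tuples, dtypes as strings, and the dtype check is much stricter than real type promotion.

```python
# Hypothetical validator illustrating the ValueError/TypeError split.
# This is NOT PyTorch's meta function; names and checks are assumptions.
def check_cat_args(shapes, dtypes, dim, out_dtype=None):
    """Return the concatenated shape, raising on invalid arguments."""
    if not shapes:
        # Right type (a list), inappropriate value (empty).
        raise ValueError("expected a non-empty list of tensors")
    ndim = len(shapes[0])
    if not -ndim <= dim < ndim:
        # An out-of-range index has a more precise exception.
        raise IndexError(f"dim {dim} is out of range for {ndim}-d tensors")
    dim %= ndim
    for shape in shapes[1:]:
        same_rank = len(shape) == ndim
        if not same_rank or any(
            a != b for i, (a, b) in enumerate(zip(shape, shapes[0])) if i != dim
        ):
            raise ValueError(f"sizes must match except in dim {dim}: {shape}")
    if out_dtype is not None and any(d != out_dtype for d in dtypes):
        # Input and output types are incompatible with each other.
        raise TypeError(f"cannot write inputs {dtypes} to output {out_dtype}")
    result = list(shapes[0])
    result[dim] = sum(s[dim] for s in shapes)
    return tuple(result)
```

Callers that previously caught `RuntimeError` for these cases would need to catch the more specific types instead.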

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158249
Approved by: https://github.com/jbschlosser, https://github.com/Skylion007, https://github.com/albanD
2025-07-20 23:49:18 +00:00
4b02bd76d3 DCP safetensors test fix (#158685)
https://github.com/pytorch/pytorch/pull/158069 removed the consolidated output path argument without updating the test. Reported by a user here https://github.com/pytorch/pytorch/pull/156705#issuecomment-3090748034.
Adding back the logic from the original PR https://github.com/pytorch/pytorch/pull/158069 and fixing the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158685
Approved by: https://github.com/teja-rao
2025-07-20 22:52:54 +00:00
2e038793ef [inductor][templates] Finalize all registered hooks (#157270)
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining` a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.
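
The idea of finalising some hooks by hand and sweeping up the rest can be sketched as follows. This is a simplified, hypothetical model: the class name, placeholder strings, and fields are assumptions for illustration and do not match the real Inductor classes.

```python
# Simplified, hypothetical model of template hooks; the real Inductor
# classes have different fields and signatures.
class PartialRenderSketch:
    def __init__(self, code, hooks):
        self.code = code            # template text containing placeholders
        self.hooks = dict(hooks)    # placeholder -> callable returning text
        self.finalized = set()

    def finalize_hook(self, name):
        """Substitute one placeholder with the output of its hook."""
        if name in self.finalized:
            return
        self.code = self.code.replace(name, self.hooks[name]())
        self.finalized.add(name)

    def finalize_remaining(self):
        """Finalize every hook that was not finalized manually."""
        for name in self.hooks:
            self.finalize_hook(name)
        return self.code


render = PartialRenderSketch(
    "<DEF_KERNEL>\n<CUSTOM_HOOK>",
    {
        "<DEF_KERNEL>": lambda: "def kernel(): ...",
        "<CUSTOM_HOOK>": lambda: "# custom epilogue",
    },
)
render.finalize_hook("<DEF_KERNEL>")      # what a caller might do by hand
final_code = render.finalize_remaining()  # also resolves <CUSTOM_HOOK>
```

Without the final sweep, `<CUSTOM_HOOK>` would survive as a raw placeholder in the generated code, which is the failure mode the new test is meant to catch.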

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
2025-07-20 22:07:32 +00:00
5e149a6482 Add deprecation warning (#158203)
Summary: export_for_training exists because we couldn't migrate internal usages of export to the final IR. Now that the migration is complete, we should deprecate and delete this API.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78240836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
2025-07-20 17:02:01 +00:00
badf002014 [Reland] Add warning about removed sm50 and sm60 arches (#158700)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Please note I reverted the original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland with an added check for a non-empty torch.cuda.get_arch_list().
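
The condition described above can be sketched as a small pure-Python predicate. The function name and signature are hypothetical, and treating an empty arch list as "skip the check" is an assumption about what the reland's extra check does; the real logic lives in `torch/cuda/__init__.py` and may differ.

```python
# Hypothetical helper mirroring the described condition, not the actual
# PyTorch implementation.
def should_warn_removed_arch(cuda_version, capability_major, arch_list):
    """cuda_version is a (major, minor) tuple such as (12, 8)."""
    if not arch_list:
        # Assumption: with no arch list there is nothing to compare against.
        return False
    # Maxwell is compute capability 5.x, Pascal is 6.x.
    is_maxwell_or_pascal = capability_major in (5, 6)
    return cuda_version >= (12, 8) and is_maxwell_or_pascal
```
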
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
2025-07-20 14:57:46 +00:00
4869f71170 don't set CUDA_MODULE_LOADING (#158712)
If needed, it'll be set in `_C._cuda_init()`. setenv is not threadsafe, so this can cause segfaults due to getenv/setenv races.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158712
Approved by: https://github.com/eqy
2025-07-20 01:36:26 +00:00
b4abf41425 Raise BufferError for DLPack buffer-related errors. (#150691)
This PR addresses the Array API documentation for [`__dlpack__`][1] and
[`from_dlpack`][2] by making some buffer-related errors `BufferError`
instead of `RuntimeError`, e.g. incompatible dtype, strides, or device.

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
[2]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.from_dlpack.html#from-dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150691
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #150216, #150217, #150218
2025-07-20 00:46:21 +00:00
a10f15718d [DLPack] Add support for missing keyword-arguments. (#150218)
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.

In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:

- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
  refactoring of the `getDLDevice()` function at
  _aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
  _torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
  method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
2025-07-20 00:46:20 +00:00
1d526fe78f Fix DLPack stream logic. (#150217)
This PR fixes the logic for dealing with CUDA and ROCm streams whenever
we are trying to create a DLPack capsule from a tensor.

In summary, this PR:

- Uses the legacy default stream if `tensor.__dlpack__(stream=None)` is
  called for a CUDA tensor.
- Errors if `tensor.__dlpack__(stream=2)` is called for a CUDA tensor:
  PyTorch doesn't support the per-thread default stream.
- Errors if `tensor.__dlpack__(stream=stream)`, where `stream` is 1 or
  2, is called for a CUDA tensor using ROCm.

For more details, see [the documentation][1].

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150217
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #150216
2025-07-20 00:46:20 +00:00
b64f338da4 [DLPack] add NumPy exchange tests. (#150216)
This PR resolves an old TODO that requested NumPy DLPack exchange tests
once version 1.22 was required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150216
Approved by: https://github.com/msaroufim, https://github.com/albanD
2025-07-20 00:46:20 +00:00
a1cfe7f1df [nativert] benchmark util (#158678)
Differential Revision: D78514241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158678
Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips
2025-07-20 00:28:09 +00:00
d36afac83b Build domain libraries for all workflows with TorchBench config (#158601)
They are expensive GPU runners and should not spend time building packages

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158601
Approved by: https://github.com/ZainRizvi
2025-07-19 21:51:39 +00:00
7cc1a9546c [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. rename zip index name, and keep it out of normalize path.
2. normalize output path for extract file.

Extract files successful:
<img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-19 08:58:42 +00:00
7cc5d03dfc Document the rest of the specific optimizer module APIs (#158669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158669
Approved by: https://github.com/albanD
ghstack dependencies: #158483
2025-07-19 07:27:15 +00:00
f73594164a [BE] document Adadelta and Adagrad APIs properly (#158483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158483
Approved by: https://github.com/albanD
2025-07-19 07:27:15 +00:00
a9f84021fb [CI] Fixes CI for CUDA Version > 12.9 (#157385)
Compute capabilities up to and including Volta are no longer supported in CUDA versions > 12.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157385
Approved by: https://github.com/eqy
2025-07-19 06:51:57 +00:00
22d82222c6 GenAI Layer Benchmark (#158536)
This PR adds GenAI layer benchmark. It compares pytorch eager, pytorch compiler, liger, and quack.

It covers all kernels supported by [quack](https://github.com/Dao-AILab/quack?tab=readme-ov-file#kernels-) (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd) and LayerNormBwd.

## Motivations

- Many OSS users asked how to properly benchmark torch.compile generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size = 1) and benchmark it on another shape (e.g., batch size = 1024), which leads to bad performance. This provides a simple and clear example of proper benchmarking.
- We recently added a GenAI model benchmark (based on [vLLM](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm)). But it's usually hard to optimize models directly due to their complexity. Layer benchmarks are easier to reason about and optimize.

## Key Settings

- Avoid reusing a kernel specializing on 1 shape for benchmark on another shape.
```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
```

- For forward, people may mark batch size as dynamic to avoid runtime recompilation. We respect the setting in this kernel-level benchmark.
```python
torch._dynamo.mark_dynamic(x, 0)
```

GPU: H100 (devvm006.dkl0)

Results: [P1874246170](https://www.internalfb.com/phabricator/paste/view/P1874246170)

Note: for numerical accuracy, we use the default tolerance of torch.testing.assert_close (i.e., for `torch.bfloat16`, use rtol `1.6e-2` and atol `1e-5`). It shows numerical issues for some backends and kernels.
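
For reference, the closeness criterion documented for `torch.testing.assert_close` is `|actual - expected| <= atol + rtol * |expected|`. A minimal pure-Python sketch of that check with the bfloat16 defaults quoted above (the helper name is made up for illustration and works on plain floats, not tensors):

```python
# Documented criterion: |actual - expected| <= atol + rtol * |expected|
BF16_RTOL = 1.6e-2  # default rtol for torch.bfloat16
BF16_ATOL = 1e-5    # default atol for torch.bfloat16


def within_bf16_tolerance(actual, expected, rtol=BF16_RTOL, atol=BF16_ATOL):
    """Scalar version of the check; the real function compares tensors."""
    return abs(actual - expected) <= atol + rtol * abs(expected)
```

Note that when `expected` is close to zero the bound collapses to `atol` alone, so small absolute differences near zero can fail even though the relative tolerance is fairly loose.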

Next step is to add roofline analysis, add to ci for checking regression, cover more GenAI Kernels, and include GenAI Layers for common fusion patterns.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/7aa77ad1-83eb-41ea-a27d-50fd5b1dd6be" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/a26ec028-3791-4a41-a12a-05e10f60e9aa" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/cc6673ed-c148-4dd2-a729-5f02e717ab3e" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/f71f9f9d-7b45-4ce7-89d0-e9bce727efae" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/e012821a-b7e6-4e83-a24c-c97fa8cd37b5" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/2d52ee1e-9a8c-4bd1-a180-97b93f07171d" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/02aad056-3ce1-4b40-8cfe-adae81fd017a" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/779f6b0d-a102-4164-8300-86fff0329ddf" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158536
Approved by: https://github.com/yf225, https://github.com/eellison
2025-07-19 05:41:01 +00:00
5cde34473c Fix MakeTensor::computeStorageSize() (#158690)
For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
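
The two sizes in the error message can be reproduced with a short sketch. `required_storage_bytes` follows the bounds rule implied by the message; `pre_fix_storage_bytes` is an assumption about the old behaviour (offset added without scaling by the item size), inferred only from the reported size of 15. Both function names are hypothetical, and the sketch assumes no dimension has size 0.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes needed so that the last addressable element is in bounds."""
    last_index = storage_offset + sum(
        (n - 1) * stride for n, stride in zip(sizes, strides)
    )
    return (last_index + 1) * itemsize


def pre_fix_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Assumed pre-fix behaviour: offset added in elements, not bytes."""
    span = 1 + sum((n - 1) * stride for n, stride in zip(sizes, strides))
    return span * itemsize + storage_offset


# Values taken from the error message above.
needed = required_storage_bytes([3, 3], [0, 1], 3, 4)
old = pre_fix_storage_bytes([3, 3], [0, 1], 3, 4)
print(needed, old)
```
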
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158690
Approved by: https://github.com/angelayi
2025-07-19 05:21:33 +00:00
fac0be7b9c [async-TP] Turn asserts back into silent skips (#158572)
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-matching+replacement would just be skipped, but now these are asserts that fail and raise.

This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.

This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
2025-07-19 04:54:38 +00:00
64dabb2cf5 only fail regressions>10% on pr_time benchmarks (#158577)
We are moving to a new framework; maintaining the pr_time benchmark test is currently hard and it breaks often.
1. Only fail PRs on regressions >10%.
2. Monitor the pr_time benchmarks dashboard post-land (oncall), and update expected results (frequently, or on big changes)
(we are supposed to be doing this already: https://www.internalfb.com/unidash/dashboard/pt2_diff_time_metrics)
3. Set up detectors that trigger warnings on regressions and notify internally post-land
https://www.internalfb.com/monitoring/detector/1140915271179237
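
The ">10%" rule in item 1 amounts to a relative-change threshold. A minimal sketch, assuming a larger value means a regression; the function name and the exact comparison used by the pr_time benchmark harness are assumptions:

```python
# Hypothetical check, not the benchmark harness's actual code.
def fails_pr(expected, actual, threshold=0.10):
    """Fail only when the measured value regresses by more than `threshold`."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (actual - expected) / expected > threshold
```

Smaller regressions and improvements pass the PR check and are left to the post-land monitoring described in items 2 and 3.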

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158577
Approved by: https://github.com/xmfan, https://github.com/janeyx99
2025-07-19 04:35:31 +00:00
ab557421a4 [cca] [c10d] Refactor CUDAEventCache into separate files (#158616)
Summary:
Refactored CUDAEventCache from ProcessGroupNCCL.hpp/.cpp into dedicated header and implementation files for better code organization and maintainability.

Split out CUDAEventCache into:
- New header file: CUDAEventCache.hpp
- New implementation file: CUDAEventCache.cpp
- Updated build_variables.bzl to include the new file

This change improves code maintainability, readability, and follows better code organization practices.
---
> Generated by [Confucius Code Assist (CCA)](https://www.internalfb.com/wiki/Confucius/Analect/Shared_Analects/Confucius_Code_Assist_(CCA)/)
[Session](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Chat), [Trace](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Trace)

Test Plan:
Verified build with:
```
buck build //caffe2/test/distributed:c10d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158616
Approved by: https://github.com/fduwjj
2025-07-19 02:51:28 +00:00
90b082e207 enable_caching_generated_triton_templates=True by default (#158592)
This carries some risk, but it is a good way to catch any issues, and a single flag flip is easy to revert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158592
Approved by: https://github.com/eellison
2025-07-19 02:19:34 +00:00
a741094159 Build domain libraries on the build job (#158600)
By setting the name of the domain libraries to build via the `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there. Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leaves the GPU resource idle.

Fixes https://github.com/pytorch/pytorch/issues/152024

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
2025-07-19 02:03:50 +00:00
2955acaed6 Clean up some unused build env variables (#158599)
* Parameter build-with-debug isn't needed, it isn't even passed into Docker. Debug build is detected via the build environment name
* AWS_DEFAULT_REGION is a leftover from ARC and isn't used anywhere in .ci/pytorch nor .github

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158599
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
ghstack dependencies: #158598
2025-07-19 01:59:00 +00:00
2c16eb9f3d [dynamo] Support more basic output types for nonstrict_trace (#157969)
Fixes #157397 and improves the user-facing error message for remaining
unsupported cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157969
Approved by: https://github.com/zou3519
2025-07-19 00:59:54 +00:00
c2c88846a9 Revert "[Easy] Show some clear error when torch.ops.load_library fails. (#157524)"
This reverts commit 555f3562541992b66a550eca8e8740884b1247f8.

Reverted https://github.com/pytorch/pytorch/pull/157524 on behalf of https://github.com/wdvr due to reverting for now to reopen the discussion ([comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3091317252))
2025-07-19 00:45:31 +00:00
5b40f6581e Revert "Add warning about removed sm50 and sm60 arches (#158301)"
This reverts commit fb731fe371cb1b5bf95de84b19c213590526acb2.

Reverted https://github.com/pytorch/pytorch/pull/158301 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158301#issuecomment-3091307023))
2025-07-19 00:32:04 +00:00
d42c409767 [AOTI] windows package load dev (#158671)
Changes:
1. Add a handler for file-extraction failures for Windows development.
2. Normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi
2025-07-19 00:06:40 +00:00
a3aacd6cb2 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |
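
How the dims of `self` and `src` line up under right-aligned broadcasting can be sketched without DTensor at all. The helper below is hypothetical and only computes the alignment and the set of broadcast dims; it does not choose placements and is not the strategy code from this PR.

```python
def align_for_copy(self_shape, src_shape):
    """Right-align src against self, as broadcasting does.

    Returns (dim_map, broadcast_dims): dim_map maps a self dim to the src
    dim aligned with it; broadcast_dims lists the self dims that src is
    broadcast along (src dim missing, or size 1 against a larger size).
    """
    offset = len(self_shape) - len(src_shape)
    if offset < 0:
        raise ValueError("src has more dims than self")
    dim_map, broadcast_dims = {}, []
    for self_dim, self_size in enumerate(self_shape):
        src_dim = self_dim - offset
        if src_dim < 0:
            broadcast_dims.append(self_dim)  # src has no such dim
            continue
        dim_map[self_dim] = src_dim
        src_size = src_shape[src_dim]
        if src_size != self_size:
            if src_size != 1:
                raise ValueError(f"cannot broadcast {src_shape} to {self_shape}")
            broadcast_dims.append(self_dim)
    return dim_map, broadcast_dims


# Shapes from the example program above.
dim_map, broadcast_dims = align_for_copy((2, 3, 4, 5), (4, 1))
```

For these shapes the sketch reports broadcasting on dims 0, 1 and 3 of self, and aligns self dims 2 and 3 with src dims 0 and 1.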

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
2025-07-18 23:44:43 +00:00
36bddcd18c [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0), output:Replicate() for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-07-18 23:44:42 +00:00
15ef4f28df Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the Triton compile tests, run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
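To read the two listings side by side, a small sketch like the following can compute per-shape ratios. Only three rows are copied from the listings above; the `speedup` helper and the dictionary layout are illustrative assumptions, not part of the PR.

```python
# Rows copied from the "main" and "current branch" listings above:
# (batch, hidden) -> average ms per forward+backward iteration.
main = {(32, 4096): 0.17019, (256, 8192): 2.48746, (256, 32768): 11.97932}
branch = {(32, 4096): 0.14356, (256, 8192): 2.10571, (256, 32768): 12.98581}


def speedup(main_ms, branch_ms):
    """Return main/branch; values above 1.0 mean the branch is faster."""
    return main_ms / branch_ms


ratios = {shape: speedup(main[shape], branch[shape]) for shape in main}
for (batch, hidden), ratio in sorted(ratios.items()):
    print(f"{batch:>4} {hidden:>6} {ratio:.3f}x")
```

With only ten timed iterations per shape, small differences in either direction are likely within noise.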
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/albanD
2025-07-18 23:24:21 +00:00
60b9b06a53 [caffe2] Fix Missing override in get_buffer of NCCLSymmetricMemory (#158597)
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```

Test Plan:
See D78520305

Rollback Plan:

Differential Revision: D78517953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
2025-07-18 23:12:29 +00:00
a835dbc096 [c10d][ez] Fix error message to reflect the correct API name (#158668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158668
Approved by: https://github.com/VieEeEw
2025-07-18 23:10:47 +00:00
f76f4abf3f Track monitor (#156907)
Track GPU memory allocation. We were tracking the GPU memory bandwidth, but memory allocation is the metric that reflects whether the GPU is out of memory (OOM). A UI fix is upcoming.

UI fix: https://github.com/pytorch/test-infra/pull/6878/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156907
Approved by: https://github.com/huydhn
2025-07-18 22:54:13 +00:00
be483a5481 setup pinned commit for vllm in pytorch ci (#158591)
Set up pinned commit for vllm in nightly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158591
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-18 22:30:20 +00:00
bc7b1f5252 [AOTI] Use libstdc++ only for fbcode cpu case (#158659)
Differential Revision: D78567218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158659
Approved by: https://github.com/kflu, https://github.com/zoranzhao
2025-07-18 22:27:10 +00:00
07c4c2a792 [dynamo][be] hide warnings without invalidating warnings cache (#158520)
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.

So we could whack-a-mole all the warning muters in compile so that they simply don't display warnings, and we wouldn't invalidate the warnings cache. This PR adds that for torch/_dynamo, and I didn't find any mutation of the warnings version in torch/_inductor.

There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turns all warnings into errors
    compiled_fn()  # will throw if any of the muted warnings fire
```

FIXES https://github.com/pytorch/pytorch/issues/128427

A note for the future: The warnings module doesn't offer a thread safe way of using it. Even regular filters have this problem, directly editing `__warningregistry__` would be very bad, and this PR would mute all threads. Someone will need to build a thread safe warnings interface.
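As a rough illustration of the `warnings.showwarning` approach described above, here is a minimal sketch using only the standard library. The helper name `hide_warning_output` and the `noisy` function are hypothetical and are not the code added to torch/_dynamo. The sketch only suppresses display, so filters such as `simplefilter("error")` still apply. Like the PR, it affects all threads.

```python
import contextlib
import io
import warnings


@contextlib.contextmanager
def hide_warning_output():
    """Temporarily swap warnings.showwarning for a no-op (hypothetical helper)."""
    original = warnings.showwarning

    def _discard(message, category, filename, lineno, file=None, line=None):
        # Drop the warning text instead of writing it to stderr.
        pass

    warnings.showwarning = _discard
    try:
        yield
    finally:
        warnings.showwarning = original


def noisy():
    warnings.warn("deprecated path", UserWarning)
    return 42


captured = io.StringIO()
with contextlib.redirect_stderr(captured), warnings.catch_warnings():
    warnings.simplefilter("always")
    with hide_warning_output():
        result = noisy()
print(result, repr(captured.getvalue()))
```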

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-07-18 22:02:31 +00:00
89850bbc07 [Dynamo] Use proper sources for constructing dataclass defaults (#157993)
Partially fixes https://github.com/pytorch/pytorch/issues/154009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157993
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-07-18 21:51:40 +00:00
3bb729df97 Revert "Fix test consolidate hf safetensors (#157386)"
This reverts commit fa1c20ae9285f7994a73d2d06025065f96b67a57.

Reverted https://github.com/pytorch/pytorch/pull/157386 on behalf of https://github.com/jithunnair-amd due to Need to revert this so we can revert PR 156705, which introduced errors on ROCm CI. These errors were not seen on CUDA CI because CUDA CI docker images do not have safetensors installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/157386#issuecomment-3090706074))
2025-07-18 21:00:12 +00:00
e3351b3ddf Revert "[DCP][HF] [ez]Change where sharded tensors are saved (#158069)"
This reverts commit 627ba411366bcc15019c49756d3f22fd3914bd50.

Reverted https://github.com/pytorch/pytorch/pull/158069 on behalf of https://github.com/jithunnair-amd due to Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py; CUDA runs do not surface issue because safetensors is not installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/158069#issuecomment-3090692336))
2025-07-18 20:54:19 +00:00
1ab1ab38a0 Use linux.12xlarge.memory to build for H100/sm_90 (#158598)
Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100 or newer GPUs, so it doesn't benefit much from existing compiler cache from trunk. Also use a memory-intensive runner here because memory is usually the bottleneck

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158598
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-07-18 20:31:56 +00:00
8b2a650572 pt2_remote_cache: Log sample for failures, and log the explicit reason we're failing. (#156874)
Summary: This allows us to start alerting on cache failures, based on scuba data

Test Plan:
Added new tests explicitly for the Remote Cache API.

Note that we have existing tests for memcache, but not for manifold AFAICT.

There are two potential wrinkles. One is that we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).

The other one is the implicit api contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I think won't cause any breakages.

Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243

Rollback Plan:

Differential Revision: D77054339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-07-18 20:28:27 +00:00
ec0b538961 [inductor] Make times and repeat parameters command line args (#158590)
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.

Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.
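For readers unfamiliar with the pattern, exposing two such values as command line flags might look like the sketch below. The flag names and the default of 10 come from the text above. The parser itself is a hypothetical stand-in, not the benchmark's actual code.

```python
import argparse


def build_parser():
    # Flag names and defaults follow the commit message; the rest is illustrative.
    parser = argparse.ArgumentParser(description="hypothetical benchmark driver")
    parser.add_argument("--times", type=int, default=10)
    parser.add_argument("--repeat", type=int, default=10)
    return parser


defaults = build_parser().parse_args([])
once = build_parser().parse_args(["--times=1", "--repeat=1"])
print(defaults.times, defaults.repeat)
print(once.times, once.repeat)
```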

Rollback Plan:

Reviewed By: malfet

Differential Revision: D78458680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
2025-07-18 20:07:55 +00:00
599f94e7b9 [AOTI] add Windows file ext to package loader. (#158578)
Add `object` and `extension` file type for Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158578
Approved by: https://github.com/angelayi
2025-07-18 19:57:12 +00:00
04ac258cf6 [BE][testing] Fix test_cudacodecache.py (#158259)
Summary: According to internal test failures, looks like we're missing a check for cuda: https://fburl.com/testinfra/eznzkyha

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158259
Approved by: https://github.com/exclamaforte, https://github.com/BoyuanFeng
2025-07-18 19:56:13 +00:00
1b5fdb23b9 [BE] Add pre-push hook for lintrunner to the PyTorch repo (#158389)
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.

### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
  - Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated Lintrunner Environment**: During pre-push, lintrunner runs in its own virtual environment that contains all lintrunner dependencies. No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified

### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.

### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
  - Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
  - You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
  - This runs `git stash` before running the pre-commit hooks and pops back your changes afterwards, so only the changes actually being pushed will be tested

### Downsides
- No streaming UI updates
  - While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
  - This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.

### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
  - Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
  - You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.

### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
  - Our `CONTRIBUTING.md` file tells devs to run `make setup-env`.  That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
2025-07-18 19:55:35 +00:00
75e2628782 Add lower bounds for fsspec and networkx dependencies (#158565)
Fixes #156587

This sets lower bounds for fsspec and networkx in both setup.py and requirements.txt.

- fsspec>= 0.8.5 (released December 15, 2020)
- networkx>= 2.5.1 (released April 3, 2021)

These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.

Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow-up later if needed.
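A lower bound of this kind is a numeric comparison on version components, which the following sketch illustrates. The two minimum versions are taken from the list above. The helper names are hypothetical, and the parser only handles purely numeric dotted versions (no pre-release or local suffixes), so it is not a substitute for a real version parser.

```python
# Minimum versions as stated in the commit message.
LOWER_BOUNDS = {"fsspec": "0.8.5", "networkx": "2.5.1"}


def parse_version(text):
    """Turn a purely numeric dotted version such as '2.5.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_lower_bound(package, installed):
    """Compare component-wise as integers, so '0.10.0' is newer than '0.8.5'."""
    return parse_version(installed) >= parse_version(LOWER_BOUNDS[package])


print(satisfies_lower_bound("fsspec", "0.10.0"))
print(satisfies_lower_bound("networkx", "2.5"))
```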

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
2025-07-18 19:42:09 +00:00
79e49efadd Pull latest Sphinx theme (#158595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158595
Approved by: https://github.com/albanD
2025-07-18 18:46:47 +00:00
b87e50db5e [BE][testing] Fix internal test failures in test/dynamo/test_unspec (#158485)
Summary: These tests fail internally because the number of underlying calls to the RNG differs by virtue of various library initializations that get pulled in with an internal build.

Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
2025-07-18 18:41:03 +00:00
656885b614 [Dynamo][Better Engineering] Type devices, resume_execution and testing utils (#158593)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a set of utilities in dynamo: `device_interface.py`, `resume_execution.py`, `tensor_version_op.py`, `test_case.py`, and `test_minifier_common.py`

Running
```
mypy torch/_dynamo/device_interface.py torch/_dynamo/resume_execution.py torch/_dynamo/tensor_version_op.py torch/_dynamo/test_case.py torch/_dynamo/test_minifier_common.py  --linecount-report /tmp/coverage_log
```

| Branch | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  976 | 1672 | 58.37% | 76 | 112 | 67.86% |
| This PR | 1719 | 1719 | 100.00% | 112 | 112 | 100.00% |
| Delta    | +743 | +47 | +41.63% | +36 | 0 | +32.14% |
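The percentages in the table can be checked with a few lines of arithmetic. This sketch reads the first count column as the number of annotated lines (or functions), since 976 / 1672 matches the 58.37% shown for main. The `percent` helper is an illustrative name.

```python
def percent(part, total):
    """Coverage as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


main_lines = percent(976, 1672)   # lines on main
main_funcs = percent(76, 112)     # functions on main
pr_lines = percent(1719, 1719)    # lines with this PR

print(main_lines, main_funcs, pr_lines)
print(round(pr_lines - main_lines, 2), round(100.0 - main_funcs, 2))
```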

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158593
Approved by: https://github.com/mlazos
2025-07-18 18:22:06 +00:00
6e07d6a0ff [Dynamo][Better Engineering] Add typing support for _dynamo/repro and debug_utils (#158504)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`

Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```

| Branch | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta    | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
2025-07-18 18:15:55 +00:00
b4358c5e87 [inductor] Explicitly link c10 in inductor. (#158622)
MSVC reports an "unresolved external symbol" error when compiling inductor. Explicitly link c10 in inductor to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158622
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-18 18:00:50 +00:00
86675af3f0 Revert "[ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)"
This reverts commit 9308261a2afb69d807ea06508bb8582b066d9ccd.

Reverted https://github.com/pytorch/pytorch/pull/158602 on behalf of https://github.com/ZainRizvi due to The lint job failure was hiding a real lint failure. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/16375911199/job/46275682191) [HUD commit link](6f73e06796) ([comment](https://github.com/pytorch/pytorch/pull/158602#issuecomment-3090209891))
2025-07-18 17:46:11 +00:00
725cdb218e Name threads in caffe2/torch/distributed/checkpoint AsyncCheckpointExecutor (#158612)
Differential Revision: D78493333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158612
Approved by: https://github.com/d4l3k
2025-07-18 17:33:12 +00:00
8c3f84908b [aot] fix greater_than_max build fail on Windows. (#158479)
Error snapshot:
<img width="937" height="110" alt="image" src="https://github.com/user-attachments/assets/10195f84-83c4-42db-af3c-76f875a6a983" />

Reason:
`std::numeric_limits::max` conflicts with the `max(a, b)` macro from windef.h

Fix code:
<img width="488" height="269" alt="image" src="https://github.com/user-attachments/assets/3328c37b-7c89-435e-944c-4ca7c9b6c5b6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158479
Approved by: https://github.com/desertfire
2025-07-18 17:18:10 +00:00
6f73e06796 [iter] exhaust ListIterator when unpack_var_sequence is called (#156370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156370
Approved by: https://github.com/zou3519
ghstack dependencies: #156369
2025-07-18 16:48:27 +00:00
acffd1a297 [iter] Update some of the tests to not call pickle (#156369)
Some tests in test_iter only fail because of pickle. I'm skipping the pickle section as Dynamo doesn't support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156369
Approved by: https://github.com/zou3519
2025-07-18 16:48:27 +00:00
bf4aa78279 Revert "[DTensor] Fix default_strategy and rename for clarity (#158490)"
This reverts commit d8b084312b54e97bdbaf6a178fe2fc628a23243b.

Reverted https://github.com/pytorch/pytorch/pull/158490 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
50f33a6fca Revert "[DTensor] fix copy_ strategy (#158538)"
This reverts commit 7b05bdd925f0f4b49e68662f9761fabaa27f2faf.

Reverted https://github.com/pytorch/pytorch/pull/158538 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
35df895d05 [AOTI] package loader normalize path separator (#158630)
Add `normalize_path_separator` to simplify Windows path handling.

This solution is working well on `torch/_inductor/cpp_builder.py`: a00cd8cf25/torch/_inductor/cpp_builder.py (L406-L409)

Let's copy it to package loader.
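The idea can be sketched in Python as follows, assuming the helper simply rewrites Windows backslash separators as forward slashes. The body below is an illustration of that assumption, not a copy of the `cpp_builder.py` code or of the package loader change.

```python
def normalize_path_separator(path):
    """Assumed behavior: rewrite Windows backslashes as forward slashes."""
    return path.replace("\\", "/")


print(normalize_path_separator("C:\\tmp\\model\\data.o"))
print(normalize_path_separator("/tmp/model/data.o"))
```

Paths that already use forward slashes pass through unchanged, so the helper can be applied unconditionally.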

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158630
Approved by: https://github.com/angelayi
2025-07-18 15:55:24 +00:00
193b29ee0c [BE][EZ] Minor doc fixes (#158574)
[BE] Minor doc fixes
2025-07-18 10:34:55 -05:00
036eb1f65d [precompile] Filter out ID_MATCH family of guards with caching_precompile. (#158368)
Summary: For cases like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff adds this behavior when this global flag is toggled on, so that ID_MATCH guards are excluded from compilation and serialization.

Test Plan:
test_dynamo -- -k test_id_match_with_config

Rollback Plan:

Differential Revision: D78363609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
2025-07-18 14:47:11 +00:00
e882c761dd Add STD_TORCH_CHECK to headeronly (#158377)
Differential Revision: [D78366519](https://our.internmc.facebook.com/intern/diff/D78366519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158377
Approved by: https://github.com/albanD
2025-07-18 14:35:20 +00:00
0eae6b68f4 Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-18 14:05:52 +00:00
a4ec381302 [build] pin setuptools>=77 to enable PEP 639 (#158104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-18 11:49:54 +00:00
27af877f84 [ATen][CUDA][SDPA] Flash Attention: Refactor sm version checks (#158558)
The architecture version checks are unnecessarily fine-grained in PyTorch. Since PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check for the lower bound.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
2025-07-18 09:59:41 +00:00
7b05bdd925 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
2025-07-18 09:59:37 +00:00
ead80f3202 Fix s390x CI: ensure that all python dependencies are installed when … (#158552)
…building pytorch for tests on s390x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158552
Approved by: https://github.com/huydhn
2025-07-18 09:13:41 +00:00
32aade9d8d Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit 39ac189808c61588f3594dbc2fc1d69bb6194c47.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/jithunnair-amd due to Ignored ROCm failures while ROCm was unstable, but HUD clearly shows this PR introduced failures on trunk ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3087982975))
2025-07-18 07:47:46 +00:00
be896d6b41 Revert "Forward-fix unused variables warning/error (#158549)"
This reverts commit eeda1a75ace75ce8a6763050fb91d236a6d3287b.

Reverted https://github.com/pytorch/pytorch/pull/158549 on behalf of https://github.com/jithunnair-amd due to Sorry, need to revert this first, so we can revert PR 158037, which broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/158549#issuecomment-3087942475))
2025-07-18 07:44:14 +00:00
a3396a9b85 [hop] set capture_scalar_outputs=True by default for compiled hops (#158480)
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with inductor, where we decompose them into while_loop and use the idx tensor's .item() to select a slice of the output buffer and write into it. This PR turns on the flag by default.
2. A graph break caused by capture_scalar_outputs=False would cause the hop to fail, and we should turn it on by default so that the error message is more meaningful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
2025-07-18 07:16:50 +00:00
fda3f3b2ec [while_loop] fix constant tensor used as carried inputs (#158381)
Addresses the second part of #158366, where torch.tensor(0) is treated as a constant tensor and its .item() gets specialized to 0, which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
2025-07-18 07:08:11 +00:00
a00cd8cf25 Add a way to disable compile for debugging flex-attention (#158534)
Finally got around to doing this, this flag lets us do:

```Python

#!/usr/bin/env python3
"""
FlexAttention Debug: Using breakpoints and unwrap
"""

import torch
import torch.nn.attention.flex_attention as fa

unwrap = torch._C._functorch.get_unwrapped

def score_mod(score, batch, head, q_idx, kv_idx):
    # Set breakpoint here to debug
    breakpoint()

    # In debugger, unwrap to see actual tensor values:
    # >>> actual_score = unwrap(unwrap(unwrap(unwrap(score))))
    # >>> actual_batch = unwrap(batch)
    # >>> actual_head = unwrap(head)
    # >>> actual_q_idx = unwrap(q_idx)
    # >>> actual_kv_idx = unwrap(kv_idx)
    # >>> print(actual_score)
    # >>> print(f"q_idx: {actual_q_idx}, kv_idx: {actual_kv_idx}")

    return torch.where(q_idx >= kv_idx, score, torch.tensor(float('-inf')))

def main():
    # Enable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = True

    # Small example
    B, H, S, D = 1, 2, 4, 8
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    # Run - will hit breakpoint
    output = fa.flex_attention(q, k, v, score_mod=score_mod)

    # Disable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = False

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158534
Approved by: https://github.com/Chillee, https://github.com/zou3519
2025-07-18 05:33:45 +00:00
eb73650723 [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from  the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-18 05:23:00 +00:00
9308261a2a [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-18 05:02:31 +00:00
9a7c2f1f64 Revert "Add torch compile force disable caches alias (#158072)"
This reverts commit 2ecf083b7247f265a03ec296ba9d7b795f035118.

Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))
2025-07-18 04:58:24 +00:00
d8b084312b [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output(Replicate) for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.
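
The first two fixes above can be pictured with a toy model. This is only a sketch, not the DTensor implementation: `ToyStrategy`, `toy_redistribute_cost`, and the string placements are hypothetical stand-ins, and the 0/1 cost model is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyStrategy:
    """Hypothetical stand-in for one strategy of a single-input op."""
    input_spec: str
    output_spec: str
    redistribute_costs: List[float]


def toy_redistribute_cost(src: str, dst: str) -> float:
    # Assumed cost model: free if the placement already matches, 1.0 otherwise.
    return 0.0 if src == dst else 1.0


def propagate_single_input(input_placements: List[str]) -> List[ToyStrategy]:
    strategies = []
    for placement in input_placements:
        # Input and output specs come from the SAME candidate, so they cannot be
        # mixed across candidates (no input:Shard(0) with output:Replicate).
        costs = [toy_redistribute_cost(src, placement) for src in input_placements]
        strategies.append(ToyStrategy(placement, placement, costs))
    return strategies


strategies = propagate_single_input(["Shard(0)", "Replicate"])
```

Each entry of `redistribute_costs` is the cost of moving the single input from one candidate placement to the placement this strategy requires, so only the matching candidate is free.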

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
2025-07-18 04:09:32 +00:00
1e86fa2e5b Add stack trace to Inductor IR nodes if inductor.config.trace.provenance_tracing=True (#158576)
Summary:
- Split `create_mapping` into `create_mapping_pre_post_grad_nodes` and `create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace` (such as `linear` -> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
  - A post grad graph node might not have a stack trace if it corresponds to multiple pre grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))
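
The lookup order described in the last bullet can be sketched roughly as follows. `ToyNode`, `collect_stack_traces`, and the two mapping dictionaries are hypothetical names for illustration; the real Inductor data structures may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ToyNode:
    """Hypothetical stand-in for a post-grad graph node."""
    name: str
    stack_trace: Optional[str] = None


def collect_stack_traces(
    origins: List[ToyNode],
    post_to_pre: Dict[str, List[str]],
    pre_grad_traces: Dict[str, str],
) -> Set[str]:
    traces: Set[str] = set()
    for node in origins:
        if node.stack_trace:
            # Preferred: the post-grad node carries its own stack trace.
            traces.add(node.stack_trace)
            continue
        # Fallback: use the traces of the pre-grad nodes this node maps to.
        for pre_name in post_to_pre.get(node.name, []):
            trace = pre_grad_traces.get(pre_name)
            if trace:
                traces.add(trace)
    return traces


origins = [ToyNode("mul_1", "File a.py, line 3"), ToyNode("addmm_1")]
post_to_pre = {"addmm_1": ["linear_1"]}
pre_grad_traces = {"linear_1": "File a.py, line 7"}
traces = collect_stack_traces(origins, post_to_pre, pre_grad_traces)
```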

Example:

```
scheduling ExternKernelOut(
  python_kernel_name='extern_kernels.mm',
  name=buf0,
  layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
  inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
    StorageBox(
      ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
    ),
    FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
    origins=OrderedSet([mm_default_1]),
    stack_traces = {,
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
        x = self.fc1(x),
      File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
        return F.linear(input, self.weight, self.bias),
    }
  )],
  constant_args=(),
  kwargs={},
  output_view=None,
  python_kernel_name=extern_kernels.mm,
  cpp_kernel_name=at::mm_out,
  ordered_kwargs_for_cpp_kernel=(),
  op_overload=None,
  arg_properties=[{}, {}],
  allarg_properties={},
  kwarg_properties=None,
  unbacked_bindings={},
  mutation_outputs=[],
  origin_node=mm_default_1,
  origins=OrderedSet([mm_default_1]),
  stack_traces = {,
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
      x = self.fc1(x),
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
      return F.linear(input, self.weight, self.bias),
  }
)
```

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78365534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
2025-07-18 04:05:17 +00:00
86dbc0ef67 [NativeRT] Remove makeProxyExecutor from ModelRunner interface (#158587)
Summary: makeProxyExecutor shouldn't be exposed to ModelRunner Interface.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78501011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158587
Approved by: https://github.com/yiming0416, https://github.com/henryoier
2025-07-18 03:20:40 +00:00
89d842fec5 Make torch.distributed.breakpoint() set a long timeout (#158481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158481
Approved by: https://github.com/d4l3k
ghstack dependencies: #158469
2025-07-18 02:18:43 +00:00
ce4554352b Shunt fx_interpreter graphmodule print on error into tlparse (#158469)
Include both the error stacktrace and the graphmodule in a new
structured trace artifact.  Log the shortened version to the console,
and also log a hint to look at the tlparse for more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
2025-07-18 02:18:43 +00:00
583138d170 [Dynamo][Better Engineering] Add typing for comptime, cache, and convert_frame (#158379)
As part of better engineering week, we would like to improve our type support to improve dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for `comptime.py` but also `cache_size.py` and `convert_frame.py`.

Running
```
mypy torch/_dynamo/comptime.py torch/_dynamo/cache_size.py torch/_dynamo/convert_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1837 | 2215 | 82.93% | 45 | 82 | 54.88% |
| This PR | 2230 | 2230 | 100.00% | 82 | 82 | 100.00% |
| Delta    | +393 | +15 | +17.07% | +37 | 0 | +45.12% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158379
Approved by: https://github.com/mlazos
2025-07-18 02:11:57 +00:00
eqy
6fd6fc418d [B200] Fix flex-attention heuristic for test_tma_with_customer_kernel_options_cuda (#158494)
Otherwise fails with
```
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused__to_copy_ones_sort_sum_zeros_2 Required: 264224 Hardware limit: 232448 Reducing block sizes or `num_stages` may help.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158494
Approved by: https://github.com/drisspg
2025-07-18 02:03:49 +00:00
ddbecdfb66 [DTensor] Document redistribute_costs (#158495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158495
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-18 01:43:38 +00:00
ef38edb284 Add stride check for attn_mask on non-cpu device (#158424)
Fixes #158374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158424
Approved by: https://github.com/Valentine233, https://github.com/drisspg, https://github.com/atalman
2025-07-18 01:10:58 +00:00
6673ac746c Fix test linalg for MKL upgrading (#158312)
Fixes #158054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158312
Approved by: https://github.com/albanD
2025-07-18 01:08:33 +00:00
7b72e5b3ad Fix Pandas version mismatch upon reinstalling numpy (#158584)
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package            Version
------------------ -----------
attrs              25.3.0
build              1.2.2.post1
certifi            2025.7.14
charset-normalizer 3.4.2
cmake              4.0.3
exceptiongroup     1.3.0
expecttest         0.3.0
filelock           3.18.0
fsspec             2025.5.1
hypothesis         6.135.32
idna               3.10
importlib_metadata 8.7.0
Jinja2             3.1.6
lintrunner         0.12.7
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.2.1
ninja              1.11.1.4
opt-einsum         3.3.0
optree             0.16.0
packaging          25.0
pip                25.1
psutil             7.0.0
pyproject_hooks    1.2.0
python-dateutil    2.9.0.post0
pytz               2025.2
PyYAML             6.0.2
requests           2.32.4
setuptools         78.1.1
six                1.17.0
sortedcontainers   2.4.0
sympy              1.14.0
tomli              2.2.1
typing_extensions  4.14.0
tzdata             2025.2
urllib3            2.5.0
uv                 0.7.21
wheel              0.45.1
zipp               3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
  Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
  Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun  5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
2025-07-18 00:14:16 +00:00
33c9b414aa [CI][MPS] Enable test_indexing on MPS (#158582)
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`

- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
2025-07-17 23:33:52 +00:00
b0e325c2c8 [Dynamo][Better Engineering] Add type coverage to decorators (#158509)
As part of better engineering week, we would like to improve our type support to improve dev experience in dynamo

This PR adds strict typing support to an important file in dynamo, `decorators.py`

NOTE: Some functions remain untyped because of a conflict with `__init__.py` in compiler, so we can't type them at this time

Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta    | +661 | +35 | +76.98% | +27 | 0 | +76.92% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
2025-07-17 23:31:26 +00:00
f63988ae00 [BE]Clean up old APIs in AOTI c shim (#158400)
Summary:
The shims for aten ops are now generated by torchgen. But there are some still old APIs in `aoti_torch/c/shim.h`

This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`

The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`

Test Plan:
CI

Rollback Plan:

Differential Revision: D78378373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-07-17 23:24:50 +00:00
2df2e3bb51 [ROCm][CI] Last known good HIP patch (#158596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158596
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-17 22:52:16 +00:00
0ecfb93a0b Avoid globally modifying torch.testing._internal.common_methods_invocations.wrapper_set_seed (#158548)
Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, running `pytest test_torchinductor_opinfo.py test_other_module.py` in a single process may change the behaviour of tests in `test_other_module.py` that depend on `wrapper_set_seed`.
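
One generic way to avoid this kind of leak is to scope the override instead of assigning it at import time. The sketch below uses `unittest.mock.patch.object` on a hypothetical stand-in namespace (`shared`); it illustrates the idea only and is not necessarily how this PR changes the test file.

```python
import types
import unittest.mock as mock

# Hypothetical stand-in for a shared test-utility module.
shared = types.SimpleNamespace(
    wrapper_set_seed=lambda op, *args: ("original", op(*args))
)


def patched_wrapper(op, *args):
    return ("patched", op(*args))


# A module-level `shared.wrapper_set_seed = patched_wrapper` would change the
# function for every importer in the same process. A scoped patch does not:
with mock.patch.object(shared, "wrapper_set_seed", patched_wrapper):
    inside = shared.wrapper_set_seed(abs, -2)

# The original definition is restored once the block exits.
outside = shared.wrapper_set_seed(abs, -2)
```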

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
2025-07-17 22:31:59 +00:00
74f4cf4bd5 Add missing <vector> in c10/util/WaitCounter.h (#158354)
It seems that `#include <vector>` is being pulled in indirectly, but it is being used directly, so it is best to explicitly include it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158354
Approved by: https://github.com/janeyx99
2025-07-17 22:23:05 +00:00
cyy
1b91954b9f Suppress volatile type error (#158435)
Fixes
```
/var/lib/jenkins/workspace/torch/csrc/dynamo/guards.cpp:5320:10:
error: compound assignment to object of volatile-qualified type 'volatile char' is deprecated [-Werror,-Wdeprecated-volatile]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158435
Approved by: https://github.com/janeyx99
2025-07-17 22:21:04 +00:00
41b2c4d119 Reduce random reads for offset metadata when calling torch.load under FakeTensorMode (#157931)
We already test the `_get_offset` functionality with that TORCH_SERIALIZATION_DEBUG flag that is set in CI, so I didn't add more testing specifically for FakeTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157931
Approved by: https://github.com/albanD
2025-07-17 22:17:52 +00:00
af6624023e [dynamo] Skip training flag check if already guarding on nn modules (#158492)
This might help some legacy models that still have
inline_inbuilt_nn_modules False for some reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158492
Approved by: https://github.com/StrongerXi
2025-07-17 21:42:19 +00:00
a00442421a [CI][TD] Enable TD on all test configs (#158163)
I think the main one that was missing is dynamo_wrapped

There's also slow and inductor, but the filter later for workflows stops TD from running on those anyways

dynamo_wrapped is the second longest job for pull right now
<img width="1265" height="311" alt="image" src="https://github.com/user-attachments/assets/d4ca034c-a8f0-4b31-a80f-0f4f21fce32a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158163
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-07-17 21:05:25 +00:00
ced5cf042d Revert "Cleanup old caffe2 scripts (#158475)"
This reverts commit 94d7f0c1ef9a4cb4db0eb5d6b1ffc55941cbeab1.

Reverted https://github.com/pytorch/pytorch/pull/158475 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158475#issuecomment-3085447409))
2025-07-17 20:58:34 +00:00
1b88da1cac [MPS] Improve performance of max_pool3d (#157875)
To check how the changes from this PR affect performance, I wrote a script here: 55ef32a127/max_pool_mps/perf.py.

Before this PR, I get this:

```
===================
max_pool3d
===================
0: 0.013105 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.038003 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.212963 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 1.224645 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 7.317867 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 34.679233 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 34.626383 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 44.835892 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.083579 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.936575 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 5.329883 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 11.713617 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 25.450454 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.058375 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 3.757558 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 33.451588 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

After this PR, I get this:

```
===================
max_pool3d
===================
0: 0.007202 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.018596 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.130717 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 0.966795 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 4.095804 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 12.833446 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 12.859346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 14.080529 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.029283 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.175700 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 0.742750 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 1.939596 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 4.074821 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.028425 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 0.384375 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 2.623346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

Every case is improved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157875
Approved by: https://github.com/malfet
2025-07-17 20:34:12 +00:00
66c9bc5062 [export] Add runnable code to export docs (#158506)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/158506/export.html

Yay I can add runnable code to export docs now
Also moved export API reference to a different file.

With these changes, we can start to consolidate the [export tutorial](https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html) with the docs on pytorch docs. We just need to move the section on DDE and 0/1 specialization, and then I think we can delete the export tutorial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158506
Approved by: https://github.com/pianpwk, https://github.com/svekars
2025-07-17 20:15:22 +00:00
80ac73c057 [ca] reset between tests (#158418)
CA reset is much faster than dynamo reset, so it's probably okay to run it every time. I'm not sure if this will fix the flaky autograd tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158418
Approved by: https://github.com/jansel
2025-07-17 20:14:29 +00:00
eeb0783fe6 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-17 20:04:42 +00:00
ef256ad17b Make Inductor imports TYPE_CHECKING only (#158524)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158524
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-17 19:55:19 +00:00
fd51bcdd21 check if USE_ROCM is defined (#158571)
Summary:
check if USE_ROCM is defined

D78424375 broke some builds: see T231304402

Test Plan:
rerunning failed builds

Rollback Plan:

Reviewed By: Camyll

Differential Revision: D78493019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158571
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-07-17 19:48:26 +00:00
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)" (#158550)
This reverts commit 8554c8007ddaa8029e7e01bb1af12f358bf597c2 #157563 due to causing a few breakages on ROCm

Reverted expected_results.csv to 26807dcf277feb2d99ab88d7b6da526488baea93

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in non bc-breaking way, or sanity checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
8dcebaa7b0 [AOTI] add WIN32 implement for create_temp_dir (#158570)
Add a Windows implementation of `create_temp_dir`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158570
Approved by: https://github.com/angelayi
2025-07-17 19:22:59 +00:00
7e34f9c292 Add torch._C._log_api_usage_once to datapipes (mapper) (#155489)
This is to get a better understanding of how datapipes is used right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155489
Approved by: https://github.com/ramanishsingh
2025-07-17 19:01:49 +00:00
25f4d7e482 Use new type statement to fix public API of types (#158487)
Since the `type` statement breaks older Python versions, this tries to find equivalent behavior without the `type` mechanics.
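
For context, a minimal sketch of an alias that avoids the `type` statement. The alias name `IntOrStr` and the helper `describe` are hypothetical, and this is not necessarily the approach the PR takes.

```python
from typing import List, Union

# Python 3.12+ only (a syntax error on older interpreters):
#     type IntOrStr = int | str
# A plain assignment works on older versions and is still understood by type
# checkers as an alias.
IntOrStr = Union[int, str]


def describe(values: List[IntOrStr]) -> List[str]:
    # Format each value together with its runtime type name.
    return [f"{type(v).__name__}:{v}" for v in values]


result = describe([1, "a"])
```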
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158487
Approved by: https://github.com/andrewor14
2025-07-17 18:46:44 +00:00
ad223a6c5f Add FP8 Types (#158430)
Summary: Add FP8 Types

Test Plan:
sandcastle

Rollback Plan:

Differential Revision: D78395110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158430
Approved by: https://github.com/henryoier
2025-07-17 18:09:56 +00:00
f92a2035e4 ci: Update lint workflow to only run on changed files for PRs (#158518)
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.

This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.

This also leaves the default behavior as a fallback for when
you're not running on a pull request.

Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.

This also splits mypy into its own job since mypy needs to run on all-files all the time.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
2025-07-17 18:00:44 +00:00
bff69f25c2 [BE][testing] fix test/dynamo/test_repros:test_longtensor_list (#158458)
Summary: This test is failing internally because the number of underlying calls to the rng differ by virtue of various library initializations that get sucked in with an internal build.

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
2025-07-17 17:27:00 +00:00
6d31d38965 recovering node source from dict (#158373) (#158473)
Summary:

This diff recovers a NodeSource object from its dict representation, which is crucial for NodeSource serde.
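
A round trip of this kind can be sketched with a toy class. `ToyNodeSource` and its fields are assumptions for illustration and do not mirror the real `NodeSource` definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyNodeSource:
    """Hypothetical stand-in with a nested provenance chain."""
    name: str
    pass_name: str
    from_node: List["ToyNodeSource"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "name": self.name,
            "pass_name": self.pass_name,
            "from_node": [src.to_dict() for src in self.from_node],
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyNodeSource":
        # Rebuild nested sources recursively so the whole chain is recovered.
        return cls(
            name=data["name"],
            pass_name=data["pass_name"],
            from_node=[cls.from_dict(d) for d in data.get("from_node", [])],
        )


original = ToyNodeSource("addmm", "post_grad", [ToyNodeSource("linear", "pre_grad")])
restored = ToyNodeSource.from_dict(original.to_dict())
```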

Test Plan:
ci

Rollback Plan:

Differential Revision: D78434648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158473
Approved by: https://github.com/angelayi
2025-07-17 17:00:19 +00:00
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
94d7f0c1ef Cleanup old caffe2 scripts (#158475)
Testing on this one is grep-based: if I could find no reference to a script, I deleted it.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-17 16:50:06 +00:00
23550ab735 Revert "DDE-Free select with unbacked index. (#157605)"
This reverts commit 79d7c754ab8ae0e5c3a614521632d2cfbfa0fdba.

Reverted https://github.com/pytorch/pytorch/pull/157605 on behalf of https://github.com/laithsakka due to fail pr time benchmarks  ([comment](https://github.com/pytorch/pytorch/pull/157605#issuecomment-3084663020))
2025-07-17 16:20:02 +00:00
16b21fa8b2 [AOTI] skip ld and objcopy on Windows. (#158545)
Skip `ld` and `objcopy` on Windows. They are not supported on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158545
Approved by: https://github.com/desertfire
2025-07-17 15:43:24 +00:00
2ecf083b72 Add torch compile force disable caches alias (#158072)
A bunch of people keep thinking the current alias only disables the Inductor cache because it has the name inductor in it. Let's globalize the name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-07-17 15:40:36 +00:00
813c76b98d Revert "Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)"
This reverts commit 58c7cf9ede6311da5533dbcaf238a912176a6a85.

Reverted https://github.com/pytorch/pytorch/pull/158537 on behalf of https://github.com/albanD due to This broke C++ tests ([comment](https://github.com/pytorch/pytorch/pull/158537#issuecomment-3084425920))
2025-07-17 15:06:43 +00:00
288bf54a23 Revert "Move off of deprecated API in 2.9 (#158527)"
This reverts commit 9636e2cfd3e995ef977f670ad47e8e895296d992.

Reverted https://github.com/pytorch/pytorch/pull/158527 on behalf of https://github.com/albanD due to breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/158527#issuecomment-3084385585))
2025-07-17 14:55:28 +00:00
da4c7b4ced [AOTI] align signature to model_base.h (#158554)
Remove `const` keyword, align its signature to `model_base.h` eeda1a75ac/torch/csrc/inductor/aoti_runtime/model_base.h (L51-L53)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158554
Approved by: https://github.com/desertfire
2025-07-17 14:44:32 +00:00
a04bd11895 [AOTI] Use format_consts_to_cpp on Windows. (#158543)
`format_consts_to_asm` is not supported on Windows, so force the use of `format_consts_to_cpp` on Windows.
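
The platform switch amounts to something like the sketch below. `pick_consts_formatter` is a hypothetical helper that only returns the formatter name; the real selection in Inductor may depend on other configuration as well.

```python
import sys


def pick_consts_formatter(platform: str = sys.platform) -> str:
    # Assumption: the assembly-based path is unavailable on Windows, so the
    # C++ formatter is always chosen there.
    if platform == "win32":
        return "format_consts_to_cpp"
    return "format_consts_to_asm"


windows_choice = pick_consts_formatter("win32")
linux_choice = pick_consts_formatter("linux")
```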

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158543
Approved by: https://github.com/desertfire
2025-07-17 14:40:34 +00:00
58c7cf9ede Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-17 13:39:25 +00:00
38c04415a9 [oss][hf][bug fix] Remove buggy consolidation logic (#158380)
Summary: I tried to add logic that handled the non-row-wise sharded case more efficiently, but it has some bugs. This removes it for now; a better algorithm for finding the maximum number of bytes that can be written at a time in the non-row-wise sharded case will follow.

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D78366701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
2025-07-17 13:05:06 +00:00
7892f5a007 [inductor][triton] Update HAS_WARP_SPEC to check triton.Config params. Update Triton Hash to top of release/3.4.x stack (#158459)
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.

Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.
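
A parameter check of this kind can be done with `inspect.signature`. The sketch below uses two hypothetical stand-in classes rather than `triton.Config`, and `accepts_params` is an illustrative helper, not the code in this PR.

```python
import inspect
from typing import Iterable


class OldStyleConfig:
    """Hypothetical stand-in for a Config that still takes warp-spec parameters."""

    def __init__(self, kwargs, num_warps=4, num_consumer_groups=0,
                 num_buffers_warp_spec=0):
        self.kwargs = kwargs


class NewStyleConfig:
    """Hypothetical stand-in for a Config where the compiler picks these values."""

    def __init__(self, kwargs, num_warps=4):
        self.kwargs = kwargs


def accepts_params(cls: type, names: Iterable[str]) -> bool:
    # Inspect the constructor signature instead of relying on a version string.
    params = inspect.signature(cls.__init__).parameters
    return all(name in params for name in names)


WARP_SPEC_PARAMS = ("num_consumer_groups", "num_buffers_warp_spec")
old_has_warp_spec = accepts_params(OldStyleConfig, WARP_SPEC_PARAMS)
new_has_warp_spec = accepts_params(NewStyleConfig, WARP_SPEC_PARAMS)
```

Checking the signature keeps the flag correct across releases whose version numbers do not reveal which interface they expose.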

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
2025-07-17 12:50:46 +00:00
d5af0eca8d [BE][3/5] fix typos in aten/ (aten/src/ATen/native/) (#157552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157552
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550, #157551
2025-07-17 12:08:34 +00:00
f57ef62ebc [BE][2/5] fix typos in aten/ (aten/src/ATen/native/) (#157551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157551
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550
2025-07-17 12:08:33 +00:00
4c8b408d16 [BE][1/5] fix typos in aten/ (#157550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157550
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637
2025-07-17 12:08:33 +00:00
c8d43cbc6e [BE][3/6] fix typos in test/ (#157637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157637
Approved by: https://github.com/yewentao256, https://github.com/albanD
ghstack dependencies: #156605
2025-07-17 12:08:33 +00:00
3f8e2e91ad [BE][15/16] fix typos in torch/ (torch/distributed/tensor/) (#156605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156605
Approved by: https://github.com/wanchaol, https://github.com/albanD
2025-07-17 12:08:33 +00:00
eeda1a75ac Forward-fix unused variables warning/error (#158549)
Introduced in https://github.com/pytorch/pytorch/pull/158037, didn't seem to trigger on PR, but trunk CI is failing in some `linux-jammy-cpu-py3.12-gcc11-inductor-*` jobs where this warning is turned into an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158549
Approved by: https://github.com/danthe3rd
2025-07-17 09:44:19 +00:00
f4d8bc46c7 Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520)
### Description

This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.

```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```

The related kernels and UTs are updated. The wrapper `bf32_on_and_off` is renamed to `reduced_f32_on_and_off`, and it now runs tests 3 times: once with reduced_f32 OFF and twice with reduced_f32 ON (`bf32 ON` and `tf32 ON`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-17 08:57:34 +00:00
39ac189808 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-17 08:26:27 +00:00
d76323d417 [NativeRT] Remove normalizeDevice (#158489)
Summary:
In PyTorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0").

tensor.to("cuda") will read from thread local DeviceGuard, aka cuda::current_device(), to infer the device index.

TBEPermute is relying on this behavior to route output tensor to a device specified by current thread.

For this reason, we remove the normalizeDevice(), and disallow index-less cuda device in Placement.

Device-to-device mapping must be done between concrete devices!
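
The rule above can be illustrated with a small sketch. It is not the NativeRT implementation: `parse_device` and `validate_placement` are hypothetical helpers, and the mapping is modelled as a plain dict of device strings.

```python
from typing import Dict, Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split a device string such as "cuda:0" into (type, index)."""
    device_type, sep, index = spec.partition(":")
    # An index-less spec like "cuda" has no concrete device index.
    return device_type, int(index) if sep else None


def validate_placement(mapping: Dict[str, str]) -> None:
    """Reject index-less CUDA devices on either side of a device mapping."""
    for src, dst in mapping.items():
        for spec in (src, dst):
            device_type, index = parse_device(spec)
            if device_type == "cuda" and index is None:
                raise ValueError(f"index-less CUDA device not allowed: {spec!r}")
```

With this check, `{"cuda:0": "cuda:1"}` is accepted, while `{"cuda": "cuda:0"}` is rejected because "cuda" would only resolve to an index through the current thread's device.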

Test Plan:
CI

Rollback Plan:

Differential Revision: D78443109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
2025-07-17 06:48:25 +00:00
04349f9ee5 [PT2]: Skip AOTI Weight Loading during Init (#158416)
Summary: AOTI already has weights embedded in the .so file, so there is no need to load the weights again on the initial load. This allows lowered modules to have different sets of weights on different hardware.

Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge

buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice ""  --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```

Rollback Plan:

Differential Revision: D78383881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
2025-07-17 06:47:47 +00:00
09db3a22e8 [BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453)
BUILD_SPLIT_CUDA logic has been removed for a while

Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453
Approved by: https://github.com/albanD
ghstack dependencies: #158358, #158365
2025-07-17 06:47:10 +00:00
a38f433be2 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

ROCm and XPU builds are not affected, since they already use Miniforge.

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the possible solutions are:
1. Use `` conda tos accept --override-channels --channel defaults``
2. Use Miniforge instead of Miniconda.

Using solution 2.

Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``
2. Using an older version of Miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-17 06:33:08 +00:00
9f37cce693 Revert "[Docker builds] Move from Miniconda to Miniforge (#158370)"
This reverts commit 0a99b026d6bd0f67dc2c0a20fe3228ddc4144854.

Reverted https://github.com/pytorch/pytorch/pull/158370 on behalf of https://github.com/laithsakka due to this fail pr time benchmarks ([comment](https://github.com/pytorch/pytorch/pull/158370#issuecomment-3082744071))
2025-07-17 06:28:49 +00:00
9636e2cfd3 Move off of deprecated API in 2.9 (#158527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158527
Approved by: https://github.com/danielvegamyhre
2025-07-17 06:18:13 +00:00
d9426a81d2 [BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290, #158291
2025-07-17 05:56:26 +00:00
0b9fb91f17 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-17 05:56:26 +00:00
a6de309ca1 [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-17 05:56:18 +00:00
1a4268b811 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-17 05:56:07 +00:00
79d7c754ab DDE-Free select with unbacked index. (#157605)
When select has a data-dependent index, we can't tell whether the actual index should be index+size or index.
To avoid throwing a DDE (data-dependent error), we allocate a new unbacked symbol to represent the storage offset of the
output view and compute its value dynamically at runtime when inductor is lowered.
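
The runtime arithmetic behind that storage offset can be sketched with plain integers. This is an illustration under assumptions: `select_storage_offset` is a hypothetical name, and the real computation is emitted by Inductor on symbolic values rather than written as a Python function.

```python
def select_storage_offset(base_offset: int, size: int, stride: int, index: int) -> int:
    """Storage offset of the view produced by selecting `index` along one dim."""
    if not -size <= index < size:
        raise IndexError(f"index {index} out of range for size {size}")
    # A negative index counts from the end, so it wraps to index + size.
    wrapped = index + size if index < 0 else index
    return base_offset + wrapped * stride
```

Because the branch on the sign of `index` happens only when the value is known, nothing has to be decided about it at trace time.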

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
2025-07-17 05:08:11 +00:00
415dfabe9b [Easy] Fix the format (#158450)
When I modified the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated format errors occurred.

```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:

  Error (CODESPELL) spelling error
    Failed due to ValueError:
    /pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
    ==> different

    Please either fix the error or add the word(s) to the dictionary file.
    HINT: all-lowercase words in the dictionary can cover all case variations.

Lint for torch/fx/traceback.py:

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "str", variable has
    type "None")

        101  |
        102  |    def _get_action_string(self):
        103  |        if self._action_string is None:
        104  |            self._action_string = "+".join([a.name.lower() for a in self.action])
        105  |        return self._action_string
        106  |
        107  |    def print_readable(self, indent=0):

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "dict[str, Any]",
    variable has type "None")

        121  |        if self._dict is None:
        122  |            # Convert the object to a dictionary
        123  |            action_string = self._get_action_string()
        124  |            self._dict = {
        125  |                "name": self.name,
        126  |                "target": self.target,
        127  |                "graph_id": self.graph_id,

  Error (MYPY) [return-value]
    Incompatible return value type (got "None", expected "dict[Any, Any]")

        130  |                "from_node": [node.to_dict() for node in self.from_node],
        131  |            }
        132  |
        133  |        return self._dict
        134  |
        135  |    def __eq__(self, other: object):
        136  |        if not isinstance(other, NodeSource):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
2025-07-17 04:56:10 +00:00
8eaa9f2701 Fix mask construction when dispatching index_put to masked_fill (#158472)
Fixes #158413
Previously trailing Nones in the index were incorrectly handled as implicit broadcasting dims in the mask, whereas they should just be ignored.
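
As a rough illustration of "just ignore trailing Nones", the helper below drops `None` entries at the end of an index list while keeping interior ones. `strip_trailing_nones` is a hypothetical name and this is not the code from the fix, only a sketch of the idea.

```python
def strip_trailing_nones(indices):
    """Drop None entries at the end of an index list; keep interior ones."""
    end = len(indices)
    # Walk back from the end until a non-None entry is found.
    while end > 0 and indices[end - 1] is None:
        end -= 1
    return list(indices[:end])
```
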
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158472
Approved by: https://github.com/ezyang
2025-07-17 04:21:43 +00:00
ebf83b8b77 [audio hash update] update the pinned audio hash (#158402)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158402
Approved by: https://github.com/pytorchbot
2025-07-17 04:19:06 +00:00
24b49b9881 [Fix] Rework CUDA error explanation framework to be less destructive … (#158484)
…in fbsource

Fix-forward for #158395

Added `std::string c10::cuda::get_cuda_error_help(const char* error_string)` to provide a framework for appending clarifying messages to CUDA errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158484
Approved by: https://github.com/aorenste
2025-07-17 03:36:47 +00:00
1839e8d04b [DTensor] Assert DTensorSpec has valid placements (#158133)
This helped identify buggy sharding rules during debugging, why not
check it in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158133
Approved by: https://github.com/XilunWu, https://github.com/zpcore
ghstack dependencies: #158132
2025-07-17 02:32:26 +00:00
2ad5c25cfc Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-07-17 01:56:01 +00:00
1179e33323 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace, which leads to [a lot of if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to cover these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
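
The call path can be pictured with a small Python analogy. The real classes are C++; everything below (`FakeCUDAAllocator`, the registry dict, `register_device_allocator`, `get_device_allocator`) is a hypothetical stand-in used only to show the base-class dispatch pattern.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DeviceAllocator(ABC):
    """Python analogy of the base allocator interface."""

    @abstractmethod
    def empty_cache(self) -> None: ...


class FakeCUDAAllocator(DeviceAllocator):
    """Hypothetical backend allocator that tracks a count of cached blocks."""

    def __init__(self) -> None:
        self.cached_blocks = 3

    def empty_cache(self) -> None:
        self.cached_blocks = 0


# Registry keyed by device type; its name and shape are assumptions.
_ALLOCATORS: Dict[str, DeviceAllocator] = {}


def register_device_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _ALLOCATORS[device_type]


def empty_cache(device_type: str) -> None:
    """Device-agnostic entry point: look up the allocator, then dispatch."""
    get_device_allocator(device_type).empty_cache()
```

Callers only touch the device-agnostic `empty_cache`, so adding another backend means registering one more subclass rather than adding another if-else branch.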

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-07-17 01:56:01 +00:00
f6d138807f Always disable ShardingPropagation cache if compiling (#156868)
Fixes #151106

Addresses issue (2) in #152963 for the DTensor sharding propagation cache being brittle under compile. The existing `_are_we_tracing` from `distributed._functional_collectives`, which mostly determines if currently tracing based on Fake Tensor dispatch mode, is reused here.
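
The idea of bypassing a cache while tracing can be sketched as below. This is not the DTensor code: the `_state` flag stands in for a real `_are_we_tracing` check, and `propagate` with its string result is a made-up placeholder for sharding propagation.

```python
from functools import lru_cache

_state = {"tracing": False}  # stand-in for a real "are we tracing?" check
calls = []  # records every uncached computation


def set_tracing(flag: bool) -> None:
    _state["tracing"] = flag


def _propagate_uncached(op: str) -> str:
    calls.append(op)
    return f"sharding({op})"


# Cached variant used only outside of tracing.
_propagate_cached = lru_cache(maxsize=None)(_propagate_uncached)


def propagate(op: str) -> str:
    """Use the cache in eager mode; always recompute while tracing."""
    if _state["tracing"]:
        return _propagate_uncached(op)
    return _propagate_cached(op)
```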

**Test Plan**:
There are already tests for DTensor + Compile with dynamic shape ([test_dtensor_dynamic](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L260),
[test_dynamo_dtensor_from_local_dynamic_shapes](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L402)) that cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156868
Approved by: https://github.com/xmfan
2025-07-17 01:33:53 +00:00
c09eba877f [Device] Add support for PrivateUse1 device type in parse_type function (#157609)
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.

This helps migrate existing backends to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
2025-07-17 01:27:44 +00:00
2179afd714 [easy][guards] Add developer comment for posterity (#158471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158471
Approved by: https://github.com/StrongerXi
2025-07-17 01:17:04 +00:00
d7e1b8b11d [dynamo] Constant fold torch.autograd._profiler_enabled (#158482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158482
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-07-17 01:07:42 +00:00
b6454a9058 [AOT_inductor] model_base.h add Windows include files. (#158477)
model_base.h add Windows include files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158477
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-17 00:57:48 +00:00
e9367a7a42 ci: Add reusable workflow to get changed files in PRs (#158517)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158517
Approved by: https://github.com/huydhn
2025-07-17 00:57:43 +00:00
clr
e78f2ac92b inductor: Fix crash in split_cat when tensors is a Node (#157155)
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.

Example stack
```
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
    self.handler(match, *match.args, **match.kwargs)
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
    if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```

This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15

I've attached a test which reproduces this issue.
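
One way to guard against this class of crash is to normalize the argument before calling `len()`. The sketch below is an assumption about the general approach, not the actual change in this PR; `Node` is a minimal stand-in (not `torch.fx.Node`) and `normalize_cat_inputs` is a hypothetical helper.

```python
class Node:
    """Minimal stand-in for an FX node; not torch.fx.Node."""

    def __init__(self, name: str) -> None:
        self.name = name


def normalize_cat_inputs(cat_inputs):
    """Wrap a lone Node in a list so len() and iteration are always valid."""
    if isinstance(cat_inputs, (list, tuple)):
        return list(cat_inputs)
    return [cat_inputs]
```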

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
2025-07-17 00:57:38 +00:00
82a1ee1135 Refactor Provenance Tracking (#158399)
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`

- change the guard flag from `trace.enabled` to `trace.provenance_tracking`.  It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.

In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.

See more motivation in internal Diff

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D78287976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
2025-07-17 00:23:00 +00:00
306dd19216 update expeced results (#158497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158497
Approved by: https://github.com/xmfan
2025-07-17 00:02:52 +00:00
1d58476162 [PP] Add eval() API to schedule (#157795)
These changes add an `eval()` API to PP schedules.

## Context

Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.

However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.

```python
if self.rank == 0:
    schedule.eval(x)
elif self.rank == self.world_size - 1:
    losses = []
    schedule.eval(target=target, losses=losses)
else:
    schedule.eval()
```

TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
2025-07-16 23:48:45 +00:00
a4d753295e [Dynamo][Better Engineering] Add enhanced typing support to _dynamo/eval_frame.py (#158276)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to the main entrypoint for dynamo, `eval_frame.py`

Running
```
mypy torch/_dynamo/eval_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  623 | 2232 | 27.91% | 19 | 68 | 27.94% |
| This PR | 2285 | 2285 | 100.00% | 68 | 68 | 100.00% |
| Delta    | +1662 | +63 | +72.09% | +49 | 0 | +72.06% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158276
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <williamwen@meta.com>
2025-07-16 23:31:10 +00:00
a9f902add0 [CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)
Reopen https://github.com/pytorch/pytorch/pull/156097

Fixes https://github.com/pytorch/pytorch/issues/154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR https://github.com/pytorch/pytorch/pull/156097 and https://github.com/pytorch/pytorch/pull/154097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-16 23:14:36 +00:00
e311886e3d Add transpose to torch/csrc/stable (#158160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158160
Approved by: https://github.com/janeyx99
2025-07-16 22:50:57 +00:00
3cb11877aa [aoti][mps] Enable test_aot_inductor.py tests (#155598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155598
Approved by: https://github.com/yushangdi
2025-07-16 22:26:57 +00:00
5951fcd50a [Dynamo][Better Engineering] Support typing in codegen.py (#158386)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for `codegen.py` but also `config.py`

Running
```
mypy torch/_dynamo/codegen.py torch/_dynamo/config.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  347 | 1330 | 26.09% | 24 | 50 | 48.00% |
| This PR | 1334 | 1334 | 100.00% | 50 | 50 | 100.00% |
| Delta    | +987 | +4 | +73.91% | +26 | 0 | +52.00% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158386
Approved by: https://github.com/StrongerXi
2025-07-16 22:09:01 +00:00
ada44e5ba7 [Dynamo][Better Engineering] Add typing to bytecode analysis and transform (#158293)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, `bytecode_transformation.py` and by extension, `bytecode_analysis.py`

Running
```
mypy torch/_dynamo/bytecode_transformation.py torch/_dynamo/bytecode_analysis.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1422 | 1920 | 74.06% | 73 | 93 | 78.49% |
| This PR | 1968 | 1968 | 100.00% | 93 | 93 | 100.00% |
| Delta    | +546 | +48 | +25.94% | +20 | 0 | +21.51% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158293
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007
2025-07-16 21:50:55 +00:00
9df0176408 [BE][testing] Disable test_static_cuda_launcher:test_floats internally (#158296)
Summary: it seems the check for 'Offd' vs. 'Offf' doesn't work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158296
Approved by: https://github.com/davidberard98
2025-07-16 21:27:40 +00:00
94c746bb43 [DTensor][BE] add document to ShardingPropagator.register_op_strategy (#158362)
**Summary**
Add document to `ShardingPropagator.register_op_strategy` on how to draft
`strategy_func` and when to use `schema_info`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158362
Approved by: https://github.com/zpcore
2025-07-16 21:08:59 +00:00
473208cb18 [ez][lint] Add pr_time_benchmarks to merge conflictless csv linter (#158353)
Discovered this when looking at a PR I was trying to revert and was surprised that the PR got rid of the spaces but didn't trigger the linter.  Turns out the file was following the rule but wasn't actually being checked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158353
Approved by: https://github.com/seemethere, https://github.com/Camyll
2025-07-16 20:31:07 +00:00
fb731fe371 Add warning about removed sm50 and sm60 arches (#158301)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
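
The detection condition can be sketched like this. It is an illustration only: `removed_arch_warning` is a hypothetical function, the version and capability are passed in as plain tuples instead of being queried from the runtime, and the message text is abbreviated.

```python
from typing import Optional, Tuple

ISSUE_URL = "https://github.com/pytorch/pytorch/issues/157517"


def removed_arch_warning(
    cuda_version: Tuple[int, int], capability: Tuple[int, int]
) -> Optional[str]:
    """Return a warning for Maxwell (5.x) or Pascal (6.x) GPUs on CUDA 12.8+ builds."""
    # Compute capability major 5 is Maxwell, major 6 is Pascal.
    if cuda_version >= (12, 8) and capability[0] in (5, 6):
        return (
            "Support for Maxwell and Pascal architectures is removed for "
            f"CUDA 12.8+ builds. Please see {ISSUE_URL}. Please install "
            "CUDA 12.6 builds if you require Maxwell or Pascal support."
        )
    return None
```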

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2025-07-16 20:11:18 +00:00
a9ee4250d5 [4/n] Remove references to TorchScript in PyTorch docs (#158317)
Summary: jit.rst

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158317
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-16 20:01:34 +00:00
14ecc03361 Revert "recovering node source from dict (#158373)"
This reverts commit 4d055982e38f59fdb2a4c9d8855e58548bc42c12.

Reverted https://github.com/pytorch/pytorch/pull/158373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158373#issuecomment-3080093479))
2025-07-16 19:55:21 +00:00
1cc62c2cb9 [export] Update docs (#157750)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html

Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
2025-07-16 19:53:12 +00:00
f58a680d09 [c10d]Prototype of remote_group_merge (#158287)
Tentative implementation of merge_remote_group per the proposal here: [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158287
Approved by: https://github.com/d4l3k
ghstack dependencies: #157716
2025-07-16 19:33:57 +00:00
944a140e90 Revert "[cuda][cupy] Improve cupy device placement when device is provided (#158320)"
This reverts commit 59f9b25f3cfc635053843372ea29ff4bf754da3f.

Reverted https://github.com/pytorch/pytorch/pull/158320 on behalf of https://github.com/wdvr due to reverting because most likely causing test/test_numba_integration.py::TestNumbaIntegration::test_from_cuda_array_interface_inferred_strides to fail ([comment](https://github.com/pytorch/pytorch/pull/158320#issuecomment-3079960616))
2025-07-16 19:15:33 +00:00
cyy
79ab84e9b8 Fix invalid formatting (#158436)
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed, the printed value is treated as a format string, and it may contain special characters in some cases. While this is not true in our case, it can't be determined at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
2025-07-16 18:47:09 +00:00
2b0f9b1f61 Move c10/macros/Macros.h to headeronly (#158365)
^

Differential Revision: [D78361893](https://our.internmc.facebook.com/intern/diff/D78361893/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158365
Approved by: https://github.com/swolchok
ghstack dependencies: #158358
2025-07-16 18:46:52 +00:00
b40f48d191 Move the rest of c10/macros/Export.h (#158358)
Differential Revision: [D78356975](https://our.internmc.facebook.com/intern/diff/D78356975/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158358
Approved by: https://github.com/swolchok
2025-07-16 18:46:52 +00:00
4d055982e3 recovering node source from dict (#158373)
Summary: This diff recovers a NodeSource object from its dict representation, which is crucial for NodeSource serialization and deserialization (serde).
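
A minimal sketch of such a dict round trip, assuming a simplified record. `ToyNodeSource` and its field names are hypothetical and do not reproduce the real NodeSource layout.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyNodeSource:
    # Simplified stand-in; the field names are assumptions.
    name: str
    pass_name: str
    from_node: list[ToyNodeSource] = field(default_factory=list)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "pass_name": self.pass_name,
            "from_node": [n.to_dict() for n in self.from_node],
        }

    @classmethod
    def from_dict(cls, d: dict) -> ToyNodeSource:
        # Rebuild nested sources recursively so the provenance chain survives.
        return cls(
            name=d["name"],
            pass_name=d["pass_name"],
            from_node=[cls.from_dict(n) for n in d["from_node"]],
        )


original = ToyNodeSource("add_1", "fuse", [ToyNodeSource("add", "trace")])
restored = ToyNodeSource.from_dict(original.to_dict())
```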

Test Plan:
ci

Rollback Plan:

Differential Revision: D78363882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158373
Approved by: https://github.com/yushangdi
2025-07-16 18:46:09 +00:00
bc9091a524 Fix indexing with multi-dimensional boolean mask (#158369)
Fixes #71673

This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False)))  # (2x2 boolean mask)

>>> x[m].shape  # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])

>>> x[m, 0]  # this should produce a tensor of shape (1,)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1

>>> x[m, ::2]  # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1

>>> x[m, None]  # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
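
The shapes that the comments above call for can be derived with a small pure-Python helper. This is only a sketch of the intended rule (the mask consumes its own leading dimensions and contributes one dimension equal to its number of `True` entries); `expected_shape` is a hypothetical helper, not the ATen logic.

```python
def expected_shape(shape, mask, rest=()):
    """Shape of x[mask, *rest] when the mask comes first (sketch).

    `mask` is a nested list of booleans covering the leading dims of `shape`;
    `rest` may hold ints, slices, or None. Other index kinds are not handled.
    """
    def flatten(m):
        return [b for row in m for b in flatten(row)] if isinstance(m, list) else [m]

    def depth(m):
        return 1 + depth(m[0]) if isinstance(m, list) else 0

    out = [sum(flatten(mask))]              # one dim: the number of True entries
    remaining = list(shape[depth(mask):])   # dims the mask did not consume
    for idx in rest:
        if idx is None:
            out.append(1)                   # new axis
        elif isinstance(idx, int):
            remaining.pop(0)                # an integer drops the dim
        else:
            out.append(len(range(remaining.pop(0))[idx]))  # slice length
    return tuple(out + remaining)


m = [[True, False], [False, False]]
shapes = [
    expected_shape((2, 2, 3), m),
    expected_shape((2, 2, 3), m, (0,)),
    expected_shape((2, 2, 3), m, (slice(None, None, 2),)),
    expected_shape((2, 2, 3), m, (None,)),
]
```
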
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
2025-07-16 18:30:57 +00:00
a26bf38927 Don't need to handle PyTrace_EXCEPTION in pyProfileFn (#154392)
According to the [document](https://python.readthedocs.io/fr/stable/c-api/init.html#c.PyTrace_EXCEPTION) and [comment](https://github.com/python/cpython/blob/3.9/Modules/_lsprof.c#L407), we don't need to handle PyTrace_EXCEPTION in pyProfileFn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154392
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 18:00:11 +00:00
da05b7fb94 [cond] add _FlopCounterMode support for cond (#158067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158067
Approved by: https://github.com/zou3519
ghstack dependencies: #158077
2025-07-16 17:26:20 +00:00
82b1c48292 [hop] add supports_higher_order_operators flag to TorchDispatchMode (#158077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158077
Approved by: https://github.com/zou3519
2025-07-16 17:26:20 +00:00
a369350065 enable compiled autograd on CPU windows (#158432)
Compiled autograd on Windows was disabled in PR #144707 because the CUDA Windows build cannot compile this code.
However, the code does compile on CPU, so this PR enables it for CPU Windows builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158432
Approved by: https://github.com/jansel, https://github.com/xmfan

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-16 17:22:37 +00:00
ff611d971f [ROCm] check stream graph capture status in memcpy_and_sync inline function (#158165)
Check for stream graph capture when using hipMemcpyWithStream.

Fixes https://github.com/pytorch/pytorch/issues/155684, https://github.com/pytorch/pytorch/issues/155231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158165
Approved by: https://github.com/jeffdaily
2025-07-16 17:17:34 +00:00
4805a6ead6 [aot][XPU] switch xpu to use consts cpp build. (#158425)
The Intel compiler does not support `format_consts_to_asm`, so use `format_consts_to_cpp` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158425
Approved by: https://github.com/jansel
2025-07-16 16:19:33 +00:00
a8b9736737 [BE][testing] disable test_custom_op_square internally (#158367)
Summary: test is failing with `ld.lld: error: unable to find library -laoti_custom_ops`

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:test_aot_inductor_custom_ops -- --exact 'caffe2/test/inductor:test_aot_inductor_custom_ops - test_custom_op_square_cuda (caffe2.test.inductor.test_aot_inductor_custom_ops.AOTInductorTestABICompatibleCuda)' --run-disabled`

Differential Revision: [D78364617](https://our.internmc.facebook.com/intern/diff/D78364617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158367
Approved by: https://github.com/desertfire
2025-07-16 16:16:14 +00:00
4b11428cb5 [BE][testing] Skip test_repeated_masked_load internally (#158355)
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
2025-07-16 16:15:44 +00:00
a04a13c449 [BE][testing] Skip test_triton_interpret internally (#158260)
Summary: Subprocesses in fbcode are tricky because of .par files. I'm thinking it's not an important enough test to get it running and skipping is fine.

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158260
Approved by: https://github.com/eellison
2025-07-16 16:14:44 +00:00
a23f4471b9 [ROCm][Windows] Fix finding ROCm/HIP version (#156486)
This commit fixes a Windows build issue caused by trying to use rocm-core, which doesn't exist in the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156486
Approved by: https://github.com/jeffdaily, https://github.com/stellaraccident
2025-07-16 15:31:43 +00:00
06a67a8948 Fix sha256 for aotriton ROCm7.0 tarball (#158420)
Fixes the following issue when building PyTorch with ROCm 7.0:
```
-- verifying file...
       file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
    /var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
  does not match expected value
    expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
      actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
  Each download failed!
```
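
The check that fails in the log above boils down to comparing a digest of the downloaded bytes with a pinned value. A minimal sketch, assuming an in-memory payload in place of the real tarball; `verify_sha256` is a hypothetical helper and not CMake's implementation.

```python
import hashlib


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    # Compare the digest of the downloaded bytes with the pinned value.
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


payload = b"stand-in for the tarball contents"   # hypothetical data
pinned = hashlib.sha256(payload).hexdigest()       # the value the build file should pin
stale = "7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838"

matches_pinned = verify_sha256(payload, pinned)
matches_stale = verify_sha256(payload, stale)      # a stale pin is rejected
```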

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
2025-07-16 15:24:20 +00:00
9513b9d03f Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit bc65253369933160a2da3fc786d027a572faf6b7.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/lw due to OSX failures are real ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3079042171))
2025-07-16 15:04:10 +00:00
0b19d463d9 forward fix lint (#158448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158448
Approved by: https://github.com/adamomainz
2025-07-16 14:55:33 +00:00
5763ec5f8d [BE] Replace lib with TORCH_INSTALL_LIB_DIR (#158235)
Their values are actually the same. Just staying in line with other `INSTALL` commands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158235
Approved by: https://github.com/Skylion007
ghstack dependencies: #158234
2025-07-16 14:20:19 +00:00
2043f6911e [BE] Rename libnvshmem_extension to libtorch_nvshmem (#158234)
`libnvshmem_extension.so` gives the impression that it is a shared library from NVSHMEM. In fact, it is built from torch source code for symmetric tensor infrastructure and operations, though it leverages NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
2025-07-16 14:20:19 +00:00
bc65253369 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-16 13:54:09 +00:00
51a708ffc6 [nativert] libtorch kernel registry (#157150)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157150
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-07-16 12:36:55 +00:00
55d888a616 Add framework for explanations for common CUDA errors (#158395)
As popularly requested in user groups.

Test plan:
```
import torch

a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```

Before:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

After:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
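
One way to picture such a framework is a lookup table from error strings to hints. The sketch below is an assumption about the general shape only; `EXPLANATIONS` and `explain` are hypothetical names, and only the single hint text is taken from the output above.

```python
# Hypothetical lookup table; only this entry's text comes from the message above.
EXPLANATIONS = {
    "invalid device ordinal": "GPU device may be out of range, do you have enough GPUs?",
}


def explain(error_string: str) -> str:
    lines = [f"CUDA error: {error_string}"]
    hint = EXPLANATIONS.get(error_string)
    if hint is not None:
        lines.append(hint)  # extra line only when an explanation is registered
    return "\n".join(lines)


known = explain("invalid device ordinal")
unknown = explain("some other failure")
```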

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-07-16 12:31:18 +00:00
0a99b026d6 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

ROCm and XPU builds are not affected, since they already use Miniforge.

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the possible solutions are:
1. Use `` conda tos accept --override-channels --channel defaults``
2. Use Miniforge instead of Miniconda.

Using solution 2.

Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``

2. Using an older version of Miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-16 10:52:47 +00:00
ac706bfc7f disable multi kernel rocm (#158299)
Fixes https://github.com/pytorch/pytorch/issues/158274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158299
Approved by: https://github.com/huydhn
2025-07-16 10:20:09 +00:00
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
To generalize Distributed test cases for non-CUDA devices

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard coded device names with torch.accelerator.current_accelerator

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

support for hccl backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
ea74fdd24a [Inductor][Triton] Update TMA Compatibility Requirements (#157881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157881
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-07-16 09:31:44 +00:00
e71bb021b9 Add a periodic test for older NVIDIA driver (#158300)
This is needed because of the botched landing of https://github.com/pytorch/pytorch/pull/156097 which crashed on older NVIDIA drivers `525.*`.  I add a periodic job that installs driver `525.105.17` on CI, then runs:

1. A smoke test to make sure that CUDA can be initialized
2. The whole test suite on the older driver
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158300
Approved by: https://github.com/ngimel
2025-07-16 08:18:18 +00:00
fb9a5d248f Fix torch._numpy to match NumPy when empty ellipsis causes advanced indexing separation (#158297)
Fixes #141563

In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch, an empty ellipsis doesn't cause a separation. This leads to differing behavior between NumPy and PyTorch in this edge case.

This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```

Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.
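
The separation rule can be sketched in plain Python. This models only the rule as described in this message (tuples and lists stand in for index arrays); `is_separated` and `drop_empty_ellipsis` are hypothetical helpers and not the code in torch._numpy.

```python
def advanced_positions(index):
    # Tuples and lists stand in for index arrays in this sketch.
    return [i for i, idx in enumerate(index) if isinstance(idx, (tuple, list))]


def is_separated(index) -> bool:
    """True if a basic index (slice, Ellipsis, None) sits between advanced ones."""
    pos = advanced_positions(index)
    return bool(pos) and pos[-1] - pos[0] + 1 != len(pos)


def drop_empty_ellipsis(index, ndim):
    # An ellipsis is empty when the other entries already consume every dim.
    consumed = sum(1 for idx in index if idx is not Ellipsis and idx is not None)
    if consumed == ndim:
        return tuple(idx for idx in index if idx is not Ellipsis)
    return index


index = (slice(None), (0, 1), Ellipsis, (0, 1))
numpy_view = is_separated(index)                          # ellipsis still separates
torch_view = is_separated(drop_empty_ellipsis(index, 3))  # ellipsis vanished first
```

When the indices count as separated, NumPy puts the broadcast dimension first, giving `(2, 3)` in the example above; when they count as adjacent, the dimension stays in place, giving `(3, 2)`.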

Notice that there are still some other bugs in PyTorch's advanced indexing that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
2025-07-16 08:11:53 +00:00
ddf502c988 [AOTI] add -lstdc++ into aoti link cmd for Meta internal (#158325)
Differential Revision: D78123716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158325
Approved by: https://github.com/desertfire
2025-07-16 07:55:08 +00:00
555f356254 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
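
A sketch of the general idea, using `ctypes` directly: catch the loader's low-level error and re-raise it with a message that names the offending file. The wrapper name, the message wording, and the path are assumptions for illustration, not the exact change in this PR.

```python
import ctypes


def load_library_verbose(path: str):
    # Sketch only: wraps the loader's OSError in a message that names the file.
    try:
        return ctypes.CDLL(path)
    except OSError as exc:
        raise OSError(f"Could not load this library: {path}") from exc


try:
    load_library_verbose("/nonexistent/libexample.so")  # hypothetical path
    error_text = ""
except OSError as exc:
    error_text = str(exc)
```
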
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-16 07:33:22 +00:00
59f9b25f3c [cuda][cupy] Improve cupy device placement when device is provided (#158320)
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step, auto-inferring the device from `cudaPointerGetAttributes`, can give a wrong (undesired) result. See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.

I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy to be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.

I will leave it to pytorch developers to see if the improvement is worthwhile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
2025-07-16 07:12:36 +00:00
fedbd1a48e Enable ROCm 7.0 Alpha docker builds for PyTorch CI (#158390)
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.

Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-16 06:09:37 +00:00
5484890539 Add better typing to available kernel options for flex attention (#158383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158383
Approved by: https://github.com/joydddd, https://github.com/BoyuanFeng
2025-07-16 06:06:29 +00:00
61a7b09ef3 [BE][Easy] split build system requirements.txt to a separate file (#158111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158111
Approved by: https://github.com/ezyang
2025-07-16 05:03:30 +00:00
e92e3eaf4e [Profiler] the doc of _ExperimentalConfig is incorrectly truncated by commas (#156586)
Hi team,

Please help review this trivial fix.

Without this change:

``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

    capture_overload_names (bool) : whether to include ATen overload names in the profile
```

With this change:

```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
    profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events.
       If this list contains values Kineto runs in CUPTI profiler mode
    profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
       or for the entire measurement duration.
    verbose (bool) : whether the trace file has `Call stack` field or not.
    performance_events : a list of profiler events to be used for measurement.
    enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
       that expose CUDA device, stream and event synchronization activities. This feature is new
       and currently disabled by default.
    adjust_profiler_step (bool) : whether to adjust the profiler step to
       match the parent python event duration. This feature is new and currently disabled by default.
    disable_external_correlation (bool) : whether to disable external correlation
    profile_all_threads (bool) : whether to profile all threads
    capture_overload_names (bool) : whether to include ATen overload names in the profile

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 04:10:49 +00:00
0a9d450168 [DTensor] implement histc (#158298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158298
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-16 04:10:32 +00:00
e265b719bd Extract out prepare_aot_module_simplified for use in next PR (#158319)
Also a small amount of extra code cleanup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158319
Approved by: https://github.com/jingsh
ghstack dependencies: #158149, #158150, #158173, #158176, #158213, #158251
2025-07-16 03:59:41 +00:00
7637c9718a Move functions from torch._functorch.aot_autograd that are not frontend functions to frontend_utils (#158251)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158251
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176, #158213
2025-07-16 03:59:41 +00:00
49d0332cef Introduce stages to aot_dispatch (#158213)
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it.  Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand, and it has led to a bit of duplicate code for the export path.

So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
  * create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
    * One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
      * Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
      * Do some base/export/autograd specific post-processing on the graph

Notice that the pattern of nested function invocations means there is no easy way to get the graph capture result from the autograd case; furthermore, the export path is "bolted on", forcing the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.

Here is the new structure:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
  * Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
  * Call create_aot_state to do initial setup and set up all the context managers on the stack. These context managers do NOT exit when this call returns.
  * Call aot_stage1_graph_capture to do the graph capture
  * Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want

With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
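
The flat flow can be sketched with toy stand-ins. The functions below borrow the stage names from the description above, but their signatures, return values, and bodies are assumptions for illustration, and the context-manager handling is left out.

```python
# Toy stand-ins; signatures and return values are assumptions for illustration.
def create_aot_state(fn, example_input):
    return {"fn": fn, "example_input": example_input}


def aot_stage1_graph_capture(state):
    # "Capture" here just records the function's name.
    return {"ops": [state["fn"].__name__], "state": state}


def aot_stage2_compile(graph):
    fn = graph["state"]["fn"]
    return lambda x: fn(x)


def double(x):
    return 2 * x


state = create_aot_state(double, 1)
graph = aot_stage1_graph_capture(state)
graph["ops"].append("inserted_by_caller")   # inspect or modify between stages
compiled = aot_stage2_compile(graph)
```

Because each stage returns to the caller, the captured graph is available between stage 1 and stage 2, which the nested call structure did not allow.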

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
2025-07-16 03:59:32 +00:00
84dec060b7 Hoist choose_dispatcher to top level, remove unnecessary returns (#158176)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158176
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173
2025-07-16 03:56:25 +00:00
5b0df2565e Pipeline _create_aot_dispatcher_function (#158173)
Two main things of note:

- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
  stages, I am using the ExitStack trick: there is an ExitStack which is
  in scope for the entire pipeline, and inside of the individual
  pipeline stages we push context managers onto this stack when we want
  them to survive into the next pipeline stage.  This is not obviously
  what the best final form of the code is, but
  create_aot_dispatcher_function is called from multiple locations so I
  can't just inline the context managers into the call site.
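
The ExitStack trick described above can be sketched as follows. The stage functions and the `fake_mode` label are hypothetical; only `contextlib.ExitStack` and `contextlib.contextmanager` are real APIs here.

```python
from contextlib import ExitStack, contextmanager

events = []


@contextmanager
def tracked(name):
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")


def stage_one(stack: ExitStack) -> str:
    # Pushed onto the shared stack, so it stays active after this function returns.
    stack.enter_context(tracked("fake_mode"))
    return "captured graph"


def stage_two(graph: str) -> str:
    events.append("stage two runs")
    return f"compiled({graph})"


with ExitStack() as stack:
    result = stage_two(stage_one(stack))
```

The context entered in `stage_one` is still active while `stage_two` runs and is only exited when the outer `with` block ends.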

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
2025-07-16 03:56:25 +00:00
0cb36e2d62 cache dict and string rep for better perf (#158372)
Summary: A NodeSource should not be updated after it is created, so caching its dict and string representations gives better performance.
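
A minimal sketch of this kind of caching, assuming an immutable record and `functools.cached_property`. `CachedSource` and its fields are hypothetical; the actual change may cache differently.

```python
from functools import cached_property


class CachedSource:
    # Hypothetical immutable record; not the real NodeSource class.
    def __init__(self, name: str, pass_name: str) -> None:
        self._name = name
        self._pass_name = pass_name
        self.dict_builds = 0  # counts how often the dict is actually built

    @cached_property
    def as_dict(self) -> dict:
        self.dict_builds += 1
        return {"name": self._name, "pass_name": self._pass_name}

    @cached_property
    def as_text(self) -> str:
        return f"{self._name} (from pass {self._pass_name})"


src = CachedSource("add_1", "fuse")
first = src.as_dict
second = src.as_dict   # served from the cache, no rebuild
```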

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78298501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
2025-07-16 02:15:32 +00:00
584a0510b3 [inductor] fix windows path for fresh cache. (#158324)
`normalize_path_separator` for windows path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158324
Approved by: https://github.com/jansel
2025-07-16 01:54:35 +00:00
9768d393fa add sfdp pattern (#155792)
Add an sfdp pattern for MBartForCausalLM/PLBartForCausalLM in transformers==4.44.2.
This improves the inference performance of these models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155792
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-07-16 01:52:05 +00:00
900fba4c07 Update warning of TF32 (#158209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158209
Approved by: https://github.com/jansel
2025-07-16 01:28:50 +00:00
03852ddc22 Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)"
This reverts commit 1ea9cde598ead20194dbb6c5cb26e74e36e6ad55.

Reverted https://github.com/pytorch/pytorch/pull/156903 on behalf of https://github.com/atalman due to Breaks torchao and torchtitan nightly builds ([comment](https://github.com/pytorch/pytorch/pull/156903#issuecomment-3076423488))
2025-07-16 01:28:46 +00:00
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient, as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
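
A back-of-the-envelope comparison makes the gap concrete. The sizes below are hypothetical, float32 is assumed, and the tensor counts follow the description above (two live tensors unfused, at least `r` fused).

```python
def peak_bytes(n: int, live_tensors: int, bytes_per_element: int = 4) -> int:
    # Each tensor is n-by-n; float32 is assumed (4 bytes per element).
    return live_tensors * n * n * bytes_per_element


N, r = 4096, 32              # hypothetical sizes
unfused = peak_bytes(N, 2)   # `total` plus the current `x`
fused = peak_bytes(N, r)     # x_1 ... x_r all alive before the single fused add
ratio = fused // unfused
```

With these numbers the unfused loop peaks at 128 MiB while the fused form needs about 2 GiB, a factor of `r / 2`.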

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads would be accumulated. This is in addition to some existing logic during torch.compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion is performed, so we also need to catch this pattern there. The decisions are implemented in `choices.py`.
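
The two limits can be pictured as a single predicate. This is a sketch under stated assumptions: `should_realize` is a hypothetical helper, the count default of 8 comes from the text, and the 1 GiB size default and the strict comparisons are made up for illustration.

```python
def should_realize(num_reads: int, read_bytes: int,
                   count_threshold: int = 8,
                   size_threshold: int = 1 << 30) -> bool:
    """Stop accumulating (materialize the buffer) once either limit is exceeded.

    The count default of 8 comes from the text; the 1 GiB size default and the
    strict `>` comparisons are assumptions made for this sketch.
    """
    return num_reads > count_threshold or read_bytes > size_threshold


few_small = should_realize(num_reads=4, read_bytes=4 * (1 << 20))
few_large = should_realize(num_reads=4, read_bytes=4 * (1 << 30))
many_small = should_realize(num_reads=9, read_bytes=9 * (1 << 10))
```

The `few_large` case is the one a count-only threshold misses: few buffers, but a large total size.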

**Results:**
For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
651b4a68f2 [hop][dynamo] track run-ahead sym variables in side effects (#158273)
Before the PR, for code like this:
```
        class Example2(torch.nn.Module):
            def forward(self, x, trigger, target):
                return torch.cond(
                    trigger == 1,
                    lambda: x + target,
                    lambda: x * target,
                    (),
                )

        m = Example2()
        x = torch.randn(2)
        trigger = 0
        target = 2
        args = (x, trigger, target)
        ep = torch.export.export(
            m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
        )
```
Dynamo wraps `target` (a SymInt) twice. The first time is when we speculate the first lambda: Dynamo finds that `target` is a SymInt and decides to wrap it, creating a new SymNodeVariable and a placeholder input to the top-level graph.

The second time is when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects, so that an object with the same id (though from different sources) is mapped to the same TensorVariable. For SymInts, two things are missing:
1. They are not in the _can_lift_attrs_to_input list (the change in builder.py).
2. They are not tracked by runahead_side_effects, so when speculate_subgraph finishes, they are discarded (the change in side_effects.py).

Note: the auto-lifting mechanism for HOPs happens at the proxy level when we trace the subgraph, which is after the SymNodeVariables are created (they are created when realizing the args and binding them to the subgraph). At that point, the builder has already created two unique SymNodeVariables for the same SymInt, so the auto lifting in HOPs cannot de-duplicate them.
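
The id-based de-duplication described above can be sketched with a toy registry. This is an illustration only, assuming nothing about Dynamo's real data structures: `SideEffectsSketch`, `FakeSymInt`, and the source labels are hypothetical names.

```python
class SideEffectsSketch:
    """Toy registry mapping an object's id to the single wrapper created for it."""

    def __init__(self):
        self._by_id = {}
        self._keepalive = []  # hold references so ids cannot be reused

    def wrap(self, obj, source):
        key = id(obj)
        if key not in self._by_id:
            # First sighting: create the wrapper and remember it.
            self._by_id[key] = {"source": source, "value": obj}
            self._keepalive.append(obj)
        # Later sightings, even from another source, reuse the same wrapper.
        return self._by_id[key]


class FakeSymInt:
    """Stand-in for a symbolic integer; not a real torch.SymInt."""


target = FakeSymInt()
tracker = SideEffectsSketch()
first = tracker.wrap(target, source="true_branch")
second = tracker.wrap(target, source="false_branch")
```

Here `first` and `second` are the same wrapper object, which is the property the PR restores for SymInts across the two speculated branches.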

Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-07-15 23:48:20 +00:00
144965ca9a [BE][S538760] get rid of TORCH_CHECK_.* and CHECK macros (#158269)
Summary: check will be crit, causing program to exit, which is quite dangerous

Test Plan:
CI

Rollback Plan:

Differential Revision: D78050595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158269
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-07-15 22:04:12 +00:00
ee0992871c Add test for user-managed weights with load_state_dict (#157496)
Summary:
Adds a unit test to verify that when 'user_managed=True' is passed to 'update_constant_buffer', the compiled AOTI model properly shares parameter storage with the eager model.

The test specifically covers the following:
1. Passes model weights to the AOTI model with 'user_managed=True'.
2. Updates the eager model weights using 'load_state_dict()', which performs an in-place update.
3. Asserts that the compiled AOTI model reflects the updated weights, confirming shared memory behavior.

Fixes: #157474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157496
Approved by: https://github.com/desertfire
2025-07-15 21:17:24 +00:00
05dfd312cf [3/n] Remove references to TorchScript in PyTorch docs (#158315)
Summary:
- cpp_index.rst
- fx.md
- jit_builtin_functions.rst
- jit_python_reference.md
- jit_unsupported.md

cpu_threading
large_scale_deployment

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158315
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 21:14:18 +00:00
abeae997a3 Use brew suggested miniconda install command (#158347)
Use ```brew install --cask miniconda``` as specified by https://formulae.brew.sh/cask/miniconda

Forward fix After: https://github.com/pytorch/pytorch/pull/156898#issuecomment-3074207175

Seeing in CI:
```
Run if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
==> Caveats
Please run the following to setup your shell:
  conda init "$(basename "${SHELL}")"

Alternatively, manually add the following to your shell init:
  eval "$(conda "shell.$(basename "${SHELL}")" hook)"

==> Downloading https://repo.anaconda.com/miniconda/Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
Already downloaded: /Users/ec2-user/Library/Caches/Homebrew/downloads/2e356e8b147647692e4da77ce4c0c14eefee65ec86f29cc7e8c21a26ac9397ca--Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
==> Installing Cask miniconda
==> Running installer script 'Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh'
PREFIX=/opt/homebrew/Caskroom/miniconda/base
Unpacking payload ...
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working...
done
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
installation finished.
==> Linking Binary 'conda' to '/opt/homebrew/bin/conda'
🍺  miniconda was successfully installed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158347
Approved by: https://github.com/seemethere
2025-07-15 21:08:25 +00:00
3f83e3eeca [ONNX] Remove legacy registration and dispatcher (#158283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158283
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262, #158282
2025-07-15 21:00:49 +00:00
0640cfa38c [2/n] Remove references to TorchScript in PyTorch docs (#158306)
Summary: Removed jit_language_reference.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 20:57:23 +00:00
e4c17d5e1c [ONNX] Remove fx_onnx_interpreter.py (#158282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158282
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262
2025-07-15 20:46:06 +00:00
cc0faeb80f [dynamo][guards] Instruction count for guard eval for development work (#158214)
It is turned off by default. The code is even hidden behind a preprocessor define flag. It will be used only for development work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158214
Approved by: https://github.com/StrongerXi
ghstack dependencies: #158215
2025-07-15 20:29:23 +00:00
205241a0d5 [ONNX] Remove legacy dynamo graph extractor (#158262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158262
Approved by: https://github.com/justinchuby
ghstack dependencies: #158258
2025-07-15 20:21:49 +00:00
19625daf88 [1/n] Remove references to TorchScript in PyTorch docs (#158305)
Summary: Removed jit_language_reference_v2.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158305
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-07-15 20:16:53 +00:00
dbf7d421da [BE][testing] fix aot_inductor_package internally (#158270)
Summary: We have internal test failure for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```

To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

This PR changes to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
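
The intended rewrite can be sketched as follows. This is a hedged illustration rather than the code in the PR: `rewrite_linker_script_arg` is a hypothetical helper, and the paths are shortened stand-ins for the ones shown above. The point is that the old directory is replaced outright instead of having the new absolute path appended to it.

```python
import posixpath

PREFIX = "-Wl,--script="


def rewrite_linker_script_arg(arg: str, new_dir: str) -> str:
    """Point a -Wl,--script=<path> argument at the same file name inside new_dir."""
    if not arg.startswith(PREFIX):
        return arg  # leave unrelated arguments untouched
    script_name = posixpath.basename(arg[len(PREFIX):])
    # Replace the old directory entirely instead of appending new_dir to it.
    return PREFIX + posixpath.join(new_dir, script_name)


old = "-Wl,--script=/build/link-tree/torch/_inductor/script.ld"
new = rewrite_linker_script_arg(old, "/tmp/pkg/data/aotinductor/model")
```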

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
2025-07-15 20:15:18 +00:00
b86d5cef68 [dynamo][tensor] Skip HASATTR attribute on tensor guards (#158215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158215
Approved by: https://github.com/StrongerXi
2025-07-15 20:10:47 +00:00
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above, also changes a bunch of the build files to be better

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
250ae2531c Fix types in graphs.py (#158192)
Added type annotations for torch/cuda/graphs.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158192
Approved by: https://github.com/oulgen
2025-07-15 19:49:38 +00:00
011026205a make node source hashable (#158322)
Summary: as title

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78296410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158322
Approved by: https://github.com/yushangdi
2025-07-15 19:31:00 +00:00
4657a84bc5 [Optimus][fp8_activation_quantization] Only log when there's some node to be quantized (#158129)
Summary:
We add an extra check for whether any node has been marked for quantization; if none has, we skip the quantization and the tlparse log.

Rollback Plan:

Differential Revision: D78173788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158129
Approved by: https://github.com/Skylion007, https://github.com/avicizhu
2025-07-15 19:22:26 +00:00
5606c516fd [ONNX] Remove legacy Dort (#158258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-07-15 19:14:06 +00:00
7afb834f93 Inline dispatch_and_compile into its call site. (#158150)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158150
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149
2025-07-15 19:08:55 +00:00
148789ddd8 Avoid AOTAutogradCache.load in stack trace on cache miss path (#158149)
The general context for the upcoming stack of commits is I am attempting
to "pipeline" AOTAutograd.  Instead of having function f call function g
which is the next "stage" of compilation, instead f should return with
its outputs, which are then piped to g for the next stage.  This will
make it easier to implement early exit / resume pipeline without forcing
callback structure, which is good for export-style use cases.  It also
reduces the size of our stack traces, which makes tools like Perfetto
happy.
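
A minimal sketch of the "pipelined" shape described above, using toy stages rather than real AOTAutograd code (`stage_f`, `stage_g`, and `run_pipeline` are hypothetical names):

```python
def stage_f(x):
    """First stage: does its work and returns, without calling the next stage."""
    return x + 1


def stage_g(y):
    """Second stage: consumes the output of stage_f."""
    return y * 2


def run_pipeline(value, stages, stop_after=None):
    """Feed each stage's output into the next; optionally exit early after N stages."""
    for index, stage in enumerate(stages, start=1):
        value = stage(value)
        if stop_after is not None and index == stop_after:
            break
    return value


full = run_pipeline(3, [stage_f, stage_g])                    # (3 + 1) * 2
partial = run_pipeline(3, [stage_f, stage_g], stop_after=1)   # 3 + 1
```

Because no stage calls the next one, the driver loop can stop or resume between stages, and the call stack never grows beyond one stage deep.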

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
2025-07-15 19:08:55 +00:00
3beb915004 Update CODEOWNERS for dataloading (#158348)
Adding Scott

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158348
Approved by: https://github.com/scotts, https://github.com/janeyx99
2025-07-15 19:06:18 +00:00
cf3247b74a Standalone compile API in _Exporter (#158139)
Given a `package: _ExportPackage`, users can get a ready-to-use workspace in `tmp_dir` by calling:
```python
package._compiled_and_package(
                tmp_dir + "/pt2_pacakge_name.pt2", True, package_example_inputs = True
            )
```

`tmp_dir` will contain:
- `main.cpp` (an example cpp file that creates the models; if package_example_inputs is True, it will also load the example inputs and run the models)
- `CMakeLists.txt`
- `pt2_pacakge_name/` (this is where the models are)
- `pt2_pacakge_name.pt2`
- `inputs.pt` files if package_example_inputs is True

Remaining TODOs
- support loading constants/weights
- the `package_example_inputs = True` option only supports a list of Tensors for now
- eventually we should remove the `torch` dependency, and use `SlimTensor`/`StableIValue` instead.

Test Plan:
```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter
```

Example generated `main.cpp`:

```cpp
#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <torch/torch.h>
#include <vector>
#include <torch/csrc/inductor/aoti_torch/tensor_converter.h>
#include "package/data/aotinductor/Plus__default/Plus__default.h"
#include "package/data/aotinductor/Minus__default/Minus__default.h"

using torch::aot_inductor::AOTInductorModelPlus__default;
using torch::aot_inductor::AOTInductorModelMinus__default;
using torch::aot_inductor::ConstantHandle;
using torch::aot_inductor::ConstantMap;

int main(int argc, char* argv[]) {
    std::string device_str = "cpu";
    try {
        c10::Device device(device_str);
        // Load input tensors for model Plus__default
        std::vector<at::Tensor> input_tensors1;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Plus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors1.push_back(ivalue.toTensor().to(device));
        }

        // Load input tensors for model Minus__default
        std::vector<at::Tensor> input_tensors2;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Minus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors2.push_back(ivalue.toTensor().to(device));
        }

// Create array of input handles
        auto input_handles1 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors1);
        auto input_handles2 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors2);

// Create array for output handles
        AtenTensorHandle output_handle1;
        AtenTensorHandle output_handle2;

// Create and load models
        auto constants_map1 = std::make_shared<ConstantMap>();
        auto constants_array1 = std::make_shared<std::vector<ConstantHandle>>();
        auto model1 = AOTInductorModelPlus__default::Create(
            constants_map1, constants_array1, device_str,
            "package/data/aotinductor/Plus__default/");
        model1->load_constants();
        auto constants_map2 = std::make_shared<ConstantMap>();
        auto constants_array2 = std::make_shared<std::vector<ConstantHandle>>();
        auto model2 = AOTInductorModelMinus__default::Create(
            constants_map2, constants_array2, device_str,
            "package/data/aotinductor/Minus__default/");
        model2->load_constants();

// Run the models
        torch::aot_inductor::DeviceStreamType stream1 = nullptr;
        model1->run(&input_handles1[0], &output_handle1, stream1, nullptr);
        torch::aot_inductor::DeviceStreamType stream2 = nullptr;
        model2->run(&input_handles2[0], &output_handle2, stream2, nullptr);

// Convert output handles to tensors
        auto output_tensor1 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle1, 1);
        auto output_tensor2 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle2, 1);

// Validate outputs
        std::cout << "output_tensor1" << output_tensor1 << std::endl;
        std::cout << "output_tensor2" << output_tensor2 << std::endl;
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
}

```

Rollback Plan:

Differential Revision: D78124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158139
Approved by: https://github.com/desertfire
2025-07-15 18:47:56 +00:00
46915b1361 Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601)"
This reverts commit 1e8e9f745e43fa38bbfc7b67b30bc66c0e7ebbd6.

Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/huydhn due to See https://github.com/pytorch/pytorch/pull/149601#discussion_r2208325379 ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3074965720))
2025-07-15 18:40:59 +00:00
8c3f206457 Fix AArch64 segfaults by disabling strict-aliasing in GridSamplerKernel for GCC 12 and above (#158117)
This PR disables `strict-aliasing` GCC C++ optimization flag on all AArch64 cpus for GCC versions 12 and above.

Pull Request #152825 upgraded the GCC version from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy GCC version has not been updated yet).

We identified that the problem also exists in GCC 12, hence the `__GNUC__ >= 12` check.

Fixes #157626

This fixes the following test failures when PyTorch is built with GCC 12 and above:
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
2025-07-15 18:26:38 +00:00
41971335c9 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit e241a07e6b88aa49d604803bc5a6562f0d9f94d2.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
ea5f88dca6 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit e40ade5182233f548b25f2732effe3719d16e9ad.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
f2ecf6145f Revert "Enable AcceleratorAllocatorConfig key check (#157908)"
This reverts commit 65fcca4f8c97de82d35d51ad9b790d10433e9b91.

Reverted https://github.com/pytorch/pytorch/pull/157908 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internally per https://github.com/pytorch/pytorch/pull/157908#discussion_r2208204782 ([comment](https://github.com/pytorch/pytorch/pull/157908#issuecomment-3074833696))
2025-07-15 18:17:43 +00:00
b26da7741b Revert "[CI] Fixes CI for CUDA Version > 12.9 (#157385)"
This reverts commit 6c5227ba00a2904365af566c24b4681cd01a041c.

Reverted https://github.com/pytorch/pytorch/pull/157385 on behalf of https://github.com/clee2000 due to broke some slow tests test_cpp_extensions_jit.py::TestCppExtensionJIT::test_jit_cuda_archflags [GH job link](https://github.com/pytorch/pytorch/actions/runs/16286465717/job/45986677885) [HUD commit link](6c5227ba00) ([comment](https://github.com/pytorch/pytorch/pull/157385#issuecomment-3074737541))
2025-07-15 18:06:52 +00:00
243b12e565 [Optimus] add einsum_to_pointwise_pass pattern (#155666)
Summary: More context: https://docs.google.com/document/d/1ipiskqG13ZKNX1SGygB3QnHcSyXNQ8pACazPIcS4bnI/edit?tab=t.0

Test Plan:
### how to enable

```
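import torch._inductor.config

# Enable the pass with default options (no trailing comma after the dict,
# which would otherwise turn the assigned value into a tuple).
torch._inductor.config.pre_grad_fusion_options = {
    "einsum_to_pointwise_pass": {},
}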
```

### unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test 'fbcode//mode/dev-nosan' //caffe2/test/inductor:kernel_optimization
```
Buck UI: https://www.internalfb.com/buck2/267263ff-6f5b-4fff-bfc0-d8f013440ba0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5629499820839168
Network: Up: 61KiB  Down: 675KiB  (reSessionID-fda8edfc-6eef-4bf0-b268-0f8d2e666571)
Loading targets.   Remaining     0/1                                                            1 dirs read, 2310 targets declared
Analyzing targets. Remaining     0/345                                                          284 actions, 329 artifacts declared
Executing actions. Remaining     0/18334                                                        8.0s exec time total
Command: test.     Finished 6 local
Time elapsed: 1:15.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### local reproduce

baseline:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 196.06 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 39.13       |
| MFU                   | 4.89%       |
| Activation/example    | 1.51 MB     |
| CPU time total        | 602.28 ms   |
| GPU time total        | 798.60 ms   |
| Estimated avg BW      | 234.62 GB/s |
| Estimated avg BW util | 9.78%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_09_22_12_38_trace.json.gz&bucket=pyper_traces

with the pattern:

| Metric                | Value       |
|:----------------------|:------------|
| Batch size            | 4096        |
| GPU type              | H100        |
| Latency               | 184.94 ms   |
| Model size            | 1205.21 MB  |
| Flops                 | 7671.30 G   |
| Flops/example         | 1.87 G      |
| TFLOPS/sec            | 41.48       |
| MFU                   | 5.18%       |
| Activation/example    | 1.15 MB     |
| CPU time total        | 562.44 ms   |
| GPU time total        | 754.36 ms   |
| Estimated avg BW      | 201.40 GB/s |
| Estimated avg BW util | 8.39%       |
Trace link: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/efficient_module_suite/fused_attention_mlp.Jun_10_22_03_34_trace.json.gz&bucket=pyper_traces

### E2E

baseline: f713998364
with pattern:

Rollback Plan:

Differential Revision: D76400889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155666
Approved by: https://github.com/Yuzhen11
2025-07-15 17:50:23 +00:00
b7b1109f49 Expose opt_einsum in torch.backends (#157740)
Fixes the following issue:
```
:/tmp# python -c "import torch; print(torch.__version__)"
2.7.1+cu126
:/tmp# python -c "import torch; print(torch.backends.opt_einsum.is_available())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'torch.backends' has no attribute 'opt_einsum'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157740
Approved by: https://github.com/Skylion007, https://github.com/benjaminglass1
2025-07-15 17:46:43 +00:00
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a3598d27c2d6572db7c0f4ff90a84cc84.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
4f36743f5e Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)"
This reverts commit 5a54db14e3843cfa87fd8d27487dbf2f2dfb6c47.

Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry I want to revert something else and this is causing a merge conflict, all you should need to do is rebase and remerged ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140))
2025-07-15 16:31:13 +00:00
05d7288e31 Fix incorrect bin edge description in histogramdd docs (#158275)
Fixes #124435

This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. There was a previous PR addressing this but it was closed due to inactivity. This picks that up and applies the fix.
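
The left-inclusive convention can be illustrated for a single dimension with a small stdlib sketch. This is not how `torch.histogramdd` is implemented; `bin_index` is a hypothetical helper, and treating the last bin as also including its right edge is an assumption based on the usual NumPy-style convention.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the bin for value given sorted edges, or None if it falls outside."""
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2  # assumed: the last bin also includes its right edge
    # bisect_right places a value equal to an edge into the bin that starts there.
    return bisect_right(edges, value) - 1


edges = [0.0, 1.0, 2.0]
on_inner_edge = bin_index(1.0, edges)  # lands in the bin that starts at 1.0
```

A value that sits exactly on an interior edge is counted in the bin to its right, which is what "inclusive of the left edge" means.
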
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
2025-07-15 16:25:01 +00:00
5a54db14e3 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-15 14:27:57 +00:00
90618581e9 Fix grouped MM output strides when compiled but not max-autotuned (#158143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158143
Approved by: https://github.com/ngimel
2025-07-15 11:53:13 +00:00
4e13eca713 [BE] Remove CUDA 11.8 artifacts (#158303)
We are including cufile by default in all CUDA 12+ builds. Since CUDA 11.8 is removed we can safely remove this code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158303
Approved by: https://github.com/Camyll, https://github.com/cyyever
2025-07-15 11:52:08 +00:00
156a377f4c [AOTI][CPP] add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL (#157949)
Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time.
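
As a rough illustration of an opt-in environment flag of this kind: the sketch below is an assumption about the general pattern, not Inductor's actual parsing code, and `force_inline_kernel_enabled` is a hypothetical helper. Only the variable name, the value `1`, and the disabled-by-default behaviour come from the summary above.

```python
import os


def force_inline_kernel_enabled(environ=os.environ) -> bool:
    """Return True only when the flag is explicitly set to "1"; off by default."""
    return environ.get("TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL", "0") == "1"


default_state = force_inline_kernel_enabled({})
enabled_state = force_inline_kernel_enabled({"TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL": "1"})
```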

Differential Revision: D77915987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
2025-07-15 10:51:43 +00:00
6200584193 [cutlass backend][BE] remove force disable cache in tests (#158053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158053
Approved by: https://github.com/coconutruben
2025-07-15 10:35:34 +00:00
e40ade5182 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #150312
2025-07-15 10:14:35 +00:00
e241a07e6b Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
2025-07-15 10:14:35 +00:00
7f9fc7e67c [Inductor] Add CPU_MAX_FIRST_DIMENSION_DECOMPOSITION and CPU_MAX_OTHER_DIMENSION_DECOMPOSITION for decompose_mm_pass (#158183)
Differential Revision: D78209993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158183
Approved by: https://github.com/houseroad
2025-07-15 10:07:25 +00:00
1b389025ba Refactor and Improve the OpenReg Module (#158090)
----
# Refactor and Improve the OpenReg Module

## Background

Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following:

- https://github.com/pytorch/pytorch/issues/155864
- https://github.com/pytorch/pytorch/issues/144955
- https://github.com/pytorch/pytorch/issues/144845

Taking these requests into consideration and combining them with the position of OpenReg, which is currently used as the test backend for PrivateUse1, I'm planning to make the following optimizations:

- Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code.
- Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual.

## Design Principles:

- Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference.
- Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch.

## More Information:

Please refer to [this](6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md) for more information about `OpenReg`.

## Current Progress:
- Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand.
- Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring.

## Next Steps:
- Add more features to cover all integration points.
- Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090
Approved by: https://github.com/seemethere, https://github.com/albanD
2025-07-15 08:10:05 +00:00
6c5227ba00 [CI] Fixes CI for CUDA Version > 12.9 (#157385)
Compute capabilities up to and including Volta are no longer supported in CUDA versions > 12.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157385
Approved by: https://github.com/huydhn
2025-07-15 07:04:54 +00:00
c8c221c0b3 [Inductor][Float8] Add float8_e4m3fn into assertion dtype list. (#157684)
Fix assert issue.
Add float8_e4m3fn into dtype list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157684
Approved by: https://github.com/Xia-Weiwen, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-15 06:02:01 +00:00
3341c131b7 [SymmMem] Fix NCCL Hang in NVSHMEM Triton Wait Until Test (#158167)
The `test_triton_wait_until` test was hanging due to an NCCL synchronization issue stemming from mismatched NVSHMEM operations. Specifically, the flag variable was updated using `nvshmemx_signal_op` (a signaling operation), but waited on with `nvshmem_wait_until` (intended for put/get updates). Per NVSHMEM documentation (see documentation reference section below), signal-updated variables require `nvshmem_signal_wait_until` for proper completion guarantees, so the mismatch caused a deadlock and NCCL hang.

**Fix:**
- A simple fix was to replace the flag update with a regular `nvshmem_putmem_block` (via `put_kernel`) to match `nvshmem_wait_until`. I also added a fence (`nvshmem_fence`) between data and flag puts on the sender (Rank 1) for ordered delivery.

- In a follow-up PR I will add a kernel/test to demonstrate usage of `nvshmemx_signal_op`

**Testing:**
- I ran `python test/distributed/test_nvshmem_triton.py` and  `python test/distributed/test_nvshmem_triton.py  -k test_triton_wait_until`

- I also verified with debug prints (Sender completes puts/fence before receiver's wait returns, and assertions confirm correct state). Multiple runs show no hangs or failures.

**Documentation Referenced:**
- [NVSHMEM Point-To-Point Synchronization](https://docs.nvidia.com/nvshmem/api/gen/api/sync.html) explicitly states: *"the sig_addr object at the calling PE is expected only to be updated as a signal, through the signaling operations available in Section NVSHMEM_PUT_SIGNAL and Section NVSHMEM_PUT_SIGNAL_NBI"*
- [NVIDIA's Official Ring Broadcast Example](https://docs.nvidia.com/nvshmem/api/examples.html) demonstrates the correct pairing: `nvshmemx_signal_op` with `nvshmem_signal_wait_until` (not `nvshmem_wait_until`)
- [NVSHMEM Signaling Operations](https://docs.nvidia.com/nvshmem/api/gen/api/signal.html) documents that signal operations work on special "signal data objects" with specific atomicity guarantees distinct from regular RMA operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158167
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-07-15 05:57:27 +00:00
9cd521de4d Fix torchrec multiprocess tests (#158159)
Summary: The new version of `get_device_tflops` imported something from testing, which imported common_utils.py, which disabled global flags.

Test Plan:
Fixing existing tests

Rollback Plan:

Differential Revision: D78192700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158159
Approved by: https://github.com/nipung90, https://github.com/huydhn
2025-07-15 05:44:37 +00:00
058fb1790f Fix compilation and "import torch" issues for cpython 3.14 (#158184)
Beginning of process for 3.14 bringup.

State of things from this PR:
- Nothing too scary looking from the Dynamo CPython side, nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() nicely fail is working as expected. So all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious, we should investigate what is the reason and impact of that, in particular for our public API checking @jbschlosser
- Leaving the weakref.py thread safety change as a follow up to keep this a bit simpler. I vendored the whole struct in the meantime FYI @ezyang

EDIT: The `__module__` change is even more cursed than I thought due to changes to the Union and Optional types where the `__module__` field cannot be changed anymore. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting for 3.14 which will trip the public API checks. Will revisit once I have a final answer on the cpython issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
2025-07-15 05:06:55 +00:00
add0b450bd [DTensor][BE] improve DTensor ops correctness check utils (#158112)
**Summary**
Implemented the test pattern described in https://github.com/pytorch/pytorch/pull/157991#discussion_r2196363170 as a util method in `DTensorTestBase`. The differences from `DTensorTestBase._test_op` are:
1. It allows users to specify the `Partial` placement.
2. It supports tree-like output structures.

**Test**
So far, `DTensorTestBase._test_op_on_dtensor` is only adopted in `DistTensorOpsTest.test_split_on_partial`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
2025-07-15 04:50:34 +00:00
4c1fabf2c9 [DTensor] have split_strategy return OpStrategy instead of TupleStrategy (#158051)
**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.

However, `TupleStrategy`'s not a good fit for `split` op. `TupleStrategy` was
initially introduced to handle the sharding strategy of `foreach_*` ops where
the input args can be split into independent subsets regarding sharding decisions,
so are the outputs.

To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns also makes a
similar assumption, so I think it's fine to unblock in such a way.

Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.

**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158051
Approved by: https://github.com/wconstab, https://github.com/zpcore
2025-07-15 04:50:34 +00:00
a2ad16be72 [ONNX] Remove legacy Dort tests (#158294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158294
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255, #158256, #158257
2025-07-15 04:44:14 +00:00
5fb07acbc3 [ONNX] Remove legacy modularization (#158257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158257
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255, #158256
2025-07-15 04:36:01 +00:00
336bff6d58 [ONNX] Remove legacy graph passes (#158256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158256
Approved by: https://github.com/justinchuby
ghstack dependencies: #158255
2025-07-15 04:27:30 +00:00
12151c96d9 [ONNX] Remove legacy io_adapter (#158255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158255
Approved by: https://github.com/justinchuby
2025-07-15 03:39:18 +00:00
4486a6dbfd [DTensor] Fix grouped_mm strategy for invalid stride cases (#158245)
local_tensor input to grouped_mm has a stride requirement.

(see `_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp)

Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158245
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-15 03:29:49 +00:00
a5e68814d5 Allow dynamic shapes for DTensor slice (#157953)
This PR allows for symints in `gen_slice_strategy` which is the strategy for `aten.slice.Tensor`. Previously, using dynamic shapes with slicing would result in
```
   File ".../pytorch/torch/distributed/tensor/_ops/_tensor_ops.py", line 348, in gen_slice_strategy
     assert isinstance(end, int)
 torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(DTensor(local_tensor=FakeTensor(..., device='cuda:0', size=(s3, 2)), device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Shard(dim=0),)), slice(None, (s77//2), None)), **{}): got AssertionError()
```

Questions before merge:
1. `dim` is still asserted to be int. Is this fine, or is this potentially dynamic as well?
2. I'm using argtype ignore for `normalize_dim`. Should I instead change types for `normalize_dim` and further dependency to be `IntLike` as well?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157953
Approved by: https://github.com/wconstab
2025-07-15 00:54:01 +00:00
ef4cca2d79 [precompile] Increment frame and add compile ids when loading packages (#158028)
When loading a package and calling package.install(backends), we create a new frame and compile id for each package load, so that tlparse and chromium events still show compile times on warm start.

There is an argument for not doing this in AOT precompile, as no "compile" occurs. So for now, we put it in `package.install`, which hopefully won't be a thing for AOT precompile.

## Recompiles
Recompiles get saved to the same frame and code entry, so on warm start, each recompile will get collapsed into the same entry. Therefore, dynamo compiles that have recompiles on cold start (0/0, 0/1, 0/2, etc) will all get collapsed into a single compile id (0/0), as warm start will load all of the entries properly.

## Graph breaks
Graph breaks get their own compile id, and therefore their own code entry. These are replicated on warm start, so if you had 4 different graphs (and therefore 4 compile ids) on cold start, you'll have 4 compile ids on warm start as well.

## Test plan
Added a frame counter check to existing unit tests for automatic dynamic, showing that the frame counter is the same between the old and new load.

This is the chromium event for test_automatic_dynamo_graph_breaks_device_cuda:
```
python test/dynamo/test_package.py -k test_automatic_dynamo_graph_breaks_device_cuda
```

<img width="2216" height="508" alt="image" src="https://github.com/user-attachments/assets/f604ed33-5c31-464b-9320-d67b2e6f57a1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158028
Approved by: https://github.com/oulgen
2025-07-15 00:53:52 +00:00
1c6057fd17 add eq function to NodeSource (#158170)
Summary: add an eq function to NodeSource that compares the dict representations.
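
A minimal sketch of equality by dict representation. The class name `NodeSourceSketch`, its fields, and `to_dict` are hypothetical stand-ins chosen for illustration, not the real `NodeSource` API:

```python
class NodeSourceSketch:
    """Hypothetical stand-in for a provenance record attached to a graph node."""

    def __init__(self, name, target, from_node=None):
        self.name = name
        self.target = target
        self.from_node = from_node or []

    def to_dict(self):
        # Recursively serialize this record and the records it was derived from.
        return {
            "name": self.name,
            "target": self.target,
            "from_node": [n.to_dict() for n in self.from_node],
        }

    def __eq__(self, other):
        # Two records are equal when their dict representations match.
        if not isinstance(other, NodeSourceSketch):
            return NotImplemented
        return self.to_dict() == other.to_dict()
```

Note that defining `__eq__` without `__hash__` makes instances unhashable in Python, which may or may not matter for the real class.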

Test Plan:
ci

Rollback Plan:

Differential Revision: D78200762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158170
Approved by: https://github.com/ezyang, https://github.com/yushangdi
2025-07-15 00:50:06 +00:00
7e433d5f42 [cutlass backend] cache a few things for codegen and properties (#158158)
Differential Revision: [D78193404](https://our.internmc.facebook.com/intern/diff/D78193404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158158
Approved by: https://github.com/ColinPeppler
2025-07-15 00:18:31 +00:00
b7def5ff1c dist2: add support for passing custom configs directly to PG (#158147)
This is intended to make it easier for users to pass backend-specific "hints" about certain options.

```py
import torch.distributed._dist2 as dist2

pg = dist2.new_group(backend="my_custom_backend", device=..., timeout=..., foo=1234, bar="1234")
pg.allreduce(...)
```

Test plan:

```
pytest test/distributed/test_dist2.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158147
Approved by: https://github.com/fduwjj
2025-07-15 00:02:54 +00:00
7cf31b4a42 [dynamo] fix NamedTupleVariable cloning (#158190)
FIXES https://github.com/pytorch/pytorch/issues/157945

## Explanation
1. Some VTs add additional attrs e.g. NamedTupleVariable has "dynamic_attributes"
a0308edb6c/torch/_dynamo/variables/lists.py (L1048-L1051)

2. VT.clone passes everything by dict, including "dynamic_attributes"
a0308edb6c/torch/_dynamo/variables/base.py (L255-L259)

3. Non-handled args become kwargs in VT's `__init__`, `super().__init__()` passes kwargs to Base VT
a0308edb6c/torch/_dynamo/variables/lists.py (L1048-L1051)

4. Base VT's `__init__` gets unexpected "dynamic_attributes" kwarg
a0308edb6c/torch/_dynamo/variables/base.py (L609-L613)

You could also let Base VT's `__init__` ignore additional kwargs, but that seemed a bit too permissive, and I don't think many VT's add these derived class only attrs.
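
The failure mode in steps 1-4 can be reproduced with a small sketch. The classes below (`BaseVT`, `BrokenNamedTupleVT`, `FixedNamedTupleVT`) are hypothetical stand-ins, not the real Dynamo classes, and the fix shown (the subclass consuming its own extra attribute) is only one plausible resolution, assumed here for illustration:

```python
class BaseVT:
    """Hypothetical stand-in for the base VariableTracker."""

    def __init__(self, source=None):
        self.source = source

    def clone(self, **overrides):
        # Like VT.clone: rebuild the object from its full attribute dict.
        args = dict(self.__dict__)
        args.update(overrides)
        return self.__class__(**args)


class BrokenNamedTupleVT(BaseVT):
    def __init__(self, items, **kwargs):
        # On clone(), "dynamic_attributes" lands in kwargs and reaches BaseVT.
        super().__init__(**kwargs)
        self.items = items
        self.dynamic_attributes = {}


class FixedNamedTupleVT(BaseVT):
    def __init__(self, items, dynamic_attributes=None, **kwargs):
        # Consume the subclass-only attribute before calling the base class.
        super().__init__(**kwargs)
        self.items = items
        self.dynamic_attributes = dynamic_attributes or {}


def clone_fails(vt):
    """Return True if cloning raises the unexpected-kwarg TypeError."""
    try:
        vt.clone()
    except TypeError:
        return True
    return False
```

With these definitions, `clone_fails(BrokenNamedTupleVT([1]))` is true while the fixed variant clones cleanly and keeps its `dynamic_attributes`.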

## After fix

```python
 ===== __compiled_fn_1_7f9541ed_e166_43fe_8322_c5225ce4207f =====
 /home/xmfan/core/miniconda3/envs/0712/lib/python3.12/site-packages/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, L_x_: "f32[4, 8, 6][48, 6, 1]cpu"):
        l_x_ = L_x_

         # File: /home/xmfan/core/a/torchtitan/wtf.py:10 in forward, code: U, S = torch.linalg.svd(x)[:2]
        linalg_svd = torch._C._linalg.linalg_svd(l_x_);  l_x_ = None
        U: "f32[4, 8, 8][64, 1, 8]cpu" = linalg_svd[0]
        S: "f32[4, 6][6, 1]cpu" = linalg_svd[1];  linalg_svd = None

         # File: /home/xmfan/core/a/torchtitan/wtf.py:11 in forward, code: reduced = U[:, :, :self.k] @ torch.diag_embed(S[:, :self.k])
        getitem_3: "f32[4, 8, 5][64, 1, 8]cpu" = U[(slice(None, None, None), slice(None, None, None), slice(None, 5, None))];  U = None
        getitem_4: "f32[4, 5][6, 1]cpu" = S[(slice(None, None, None), slice(None, 5, None))];  S = None
        diag_embed: "f32[4, 5, 5][25, 5, 1]cpu" = torch.diag_embed(getitem_4);  getitem_4 = None
        reduced: "f32[4, 8, 5][40, 5, 1]cpu" = getitem_3 @ diag_embed;  getitem_3 = diag_embed = None
        return (reduced,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158190
Approved by: https://github.com/StrongerXi
2025-07-14 23:39:25 +00:00
08799217ae [CI] Move main branch rocm binary builds to its own workflow (#158161)
Petition to move out of ciflow/trunk and into ciflow/rocm because it's a long pole for TTS

<img width="1192" height="312" alt="image" src="https://github.com/user-attachments/assets/b12a097a-3763-4c62-b09f-094ee9ae1c37" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158161
Approved by: https://github.com/seemethere
2025-07-14 23:07:49 +00:00
48315181c7 [CI] Do not run inductor rocm on ciflow/inductor (#158162)
Petition to only run inductor-rocm on ciflow/inductor-rocm and not ciflow/inductor because it's a long pole for TTS
<img width="1266" height="315" alt="image" src="https://github.com/user-attachments/assets/b3587bf7-b1a6-45f3-9b6a-c0e6d473d13b" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158162
Approved by: https://github.com/seemethere
2025-07-14 23:07:45 +00:00
38371f693b ci: Switch lintrunner-noclang to use linter image (#158261)
This changes the image the lintrunner jobs utilizes to be the base linter image
instead of the CUDA image. This is done to reduce the image size and speed up the
build time.

This was switched in https://github.com/pytorch/pytorch/pull/110502 when
clang used to run in the lintrunner jobs but it is now split out so we can
use the default image for non-clang jobs.

Difference in pull time (from running job): ~5min --> ~1min (80% reduction), this should result in an overall runtime decrease of ~25min --> ~20min (20% reduction)

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158261
Approved by: https://github.com/Camyll, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/Skylion007
2025-07-14 22:54:51 +00:00
c062550a35 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion is performed, so we also need to capture this pattern there. The decisions are implemented in `choices.py`.
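
The two thresholds can be sketched as a single predicate. This is a simplified illustration, not the Inductor implementation: the function name `should_realize`, its parameters, and the choice of strict `>` comparisons are assumptions, and only the default of 8 for the count threshold comes from the description above:

```python
def should_realize(read_sizes_bytes, count_threshold=8, size_threshold_bytes=None):
    """Return True if fusion should stop accumulating reads (illustrative only)."""
    # Count-based rule: too many distinct buffers are read by one operator.
    if len(read_sizes_bytes) > count_threshold:
        return True
    # Size-based rule: the total bytes read exceed the optional budget.
    if size_threshold_bytes is not None and sum(read_sizes_bytes) > size_threshold_bytes:
        return True
    return False
```

The point of the second rule is that a handful of very large buffers can exhaust memory long before the count limit is reached.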

**Results:**
For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-14 22:27:21 +00:00
9345279c6e skip inductor/test_torchinductor_opinfo in windows (#158225)
While enabling inductor CI on Windows, `test_torchinductor_opinfo.py` took too much time (about 12 hours), far exceeding the CI time limit. Our analysis showed that compiler builds are about 4x slower on Windows than on Linux.

Thus, we decided to skip the UT temporarily; @xuhancn will keep looking for a solution to the slow compiler builds on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158225
Approved by: https://github.com/jansel

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-14 22:14:52 +00:00
194539e9c3 Address NaNs if SDPA is called with all values masked from query (#157727)
Fixes #156707

Detect if all values along the softmax axis are infs and overwrite the outputs for those computations with zeros before the final matmul. The behavior should be aligned with the CPU implementation.

According to the original issue, such cases, where all values along a dimension of the attention mask are false and the softmax output is therefore undefined, occur with left-padded batches for generation in HF transformers.
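
A minimal pure-Python sketch of the idea for a single attention row. It is not the kernel code from this PR; the function name `masked_softmax_row` is hypothetical and the sketch only shows why a fully masked row produces NaN and how zeroing avoids it:

```python
import math


def masked_softmax_row(scores, mask):
    """Softmax over one row; a fully masked row yields zeros instead of NaN."""
    # Masked-out positions are replaced by -inf before the softmax.
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    row_max = max(masked)
    if row_max == -math.inf:
        # Every position is masked: (-inf) - (-inf) would be NaN, so emit zeros.
        return [0.0] * len(scores)
    exps = [math.exp(s - row_max) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Zero weights for such a row mean the subsequent matmul with the values contributes nothing for that query, rather than propagating NaN.
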
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157727
Approved by: https://github.com/malfet
2025-07-14 22:09:35 +00:00
bcf50636ba [CI] Removing --user flag from all pip install commands (#154900)
Related to https://github.com/pytorch/pytorch/issues/148335

python virtualenv doesn't support using `--user` flag:

```
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
+ python3 -m pip install --progress-bar off --user ninja==1.10.2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154900
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-07-14 21:09:42 +00:00
6b2bef10af [c10d] Prototype of group_split for dist2 work (#157716)
This is to implement group_split as proposed in [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157716
Approved by: https://github.com/d4l3k
2025-07-14 21:04:12 +00:00
1e4d8b5a4a Fix land race typos from #157290 (#158272)
TSIA; these come from a new grammar linter that was added recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158272
Approved by: https://github.com/clee2000
2025-07-14 20:55:13 +00:00
725c327284 [nativert] add memory overlap debug assertion (#157290)
Summary: better safe than sorry. will throw if memory overlap detected when using planned tensors and debug mode is enabled -- this will make our planning unit tests more robust.

Test Plan:
ci

Rollback Plan:

Differential Revision: D77327841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157290
Approved by: https://github.com/SherlockNoMad, https://github.com/zhxchen17
2025-07-14 19:12:41 +00:00
f87d117939 redo of [Inductor][Cutlass] verify cutlass has cache_file attribute before moving...resolves cutlass cute exception (#158206)
trying to land https://github.com/pytorch/pytorch/pull/156672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158206
Approved by: https://github.com/lessw2020, https://github.com/Skylion007
2025-07-14 18:50:23 +00:00
5633283574 [reland][DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#158204)
This PR is identical to https://github.com/pytorch/pytorch/pull/157216, which got reverted because of removing an outdated import of `torch._dynamo` https://www.internalfb.com/diff/D78021229?transaction_fbid=1713683499308113

The issue has been fixed by @weifengpy by D78199546, so this PR should be good to re-land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158204
Approved by: https://github.com/weifengpy
2025-07-14 18:07:21 +00:00
5b10b0a96f Slightly improve error message from repeat_interleave kernel (#157996)
Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:
```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```

This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q

```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```

In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the length of one of the features.
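
For readers who want to catch the mismatch before it reaches the kernel, the same consistency check can be sketched on the host side. This is an illustrative Python analogue, not the CUDA code changed in this diff; the helper name `check_repeat_interleave_sizes` is hypothetical and the message text mirrors the one quoted above:

```python
def check_repeat_interleave_sizes(repeats, output_size):
    """Raise if `output_size` disagrees with the sum of `repeats`."""
    total = sum(repeats)
    if total != output_size:
        raise ValueError(
            "Invalid input! In `repeat_interleave`, the `output_size` argument "
            f"({output_size}) must be the same as the sum of the elements in the "
            f"`repeats` tensor ({total})."
        )
    return total
```

Reporting both numbers is what makes the difference between them (200 in the example above) visible to the model author.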

## What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout.
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too.

Test Plan:
* Built an ephemeral package with my changes:
https://www.internalfb.com/intern/servicelab/build/736441058/

* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg

* I will export to GH and run CI/CD tests.

Rollback Plan:
steps:
  - manual.note:
      content: >-
        Just reverting this diff should be sufficient. Since this change is in
        CUDA kernels, I do not believe there is a way to change the error
        message via a JK.

Reviewed By: mradmila

Differential Revision: D77904753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157996
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-07-14 17:55:14 +00:00
fb462cec8d Normalize placeholder names in AOTAutogradCache (#157916)
This PR adds a pass to sanitize_gm_for_cache which normalizes all placeholder names across input dynamo graphs to AOTAutogradCache. This is safe because nothing underneath AOTAutograd uses the node names on the
original dynamo graph: AOTAutograd re-traces with its own nodes, and guards are
in terms of original sources rather than placeholder names.

Note that the dynamo output graphs traced by tlparse will not show this change because it's done before this sanitization step. The aot autograd outputs also will not change because AOTAutograd's own traced graphs don't use the original placeholders of the dynamo graph. Thus, this change is essentially a no-op from everyone's perspective except for cache key checks.
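
The effect on cache keys can be illustrated with a toy graph representation. The tuple format `(op, name, args)`, the function `normalize_placeholders`, and the `p_0, p_1, ...` naming scheme are all assumptions made for this sketch and do not reflect how `sanitize_gm_for_cache` is actually written:

```python
def normalize_placeholders(nodes):
    """Rename placeholders to p_0, p_1, ... and rewrite references to them."""
    rename = {}
    for op, name, _args in nodes:
        if op == "placeholder":
            # Assign names by position so the original spelling no longer matters.
            rename[name] = f"p_{len(rename)}"
    return [
        (op, rename.get(name, name), tuple(rename.get(a, a) for a in args))
        for op, name, args in nodes
    ]
```

Two graphs that differ only in how their inputs are named then normalize to the same structure, so they would produce the same cache key.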

Fixes #157792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157916
Approved by: https://github.com/zou3519
2025-07-14 17:45:11 +00:00
9b0013c6bb [CI] Update mobile build docker image (#158153)
The docker image got removed and then the job started building its own -> takes a long time

I don't know why it uses the asan image

<img width="1906" height="330" alt="image" src="https://github.com/user-attachments/assets/72fbf40c-3cd6-44ea-b61b-6335d2a4b589" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158153
Approved by: https://github.com/Skylion007
2025-07-14 17:35:58 +00:00
6ea91f0672 Revert "[Inductor] Set the default value of min_chunk_size to 512 (#150762)"
This reverts commit 3321acc92e24859dbe2ac6499067d1afde5622c3.

Reverted https://github.com/pytorch/pytorch/pull/150762 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but an inductor compilation error shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150762#issuecomment-3070286787))
2025-07-14 16:58:13 +00:00
6fe7456aa1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit 03b307575a98dc1d953c9d3521a9489e0e61e70c.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
e8cca7bac7 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 85857181ebca86e9c709e9922a9d9ef41a9c4ef9.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
59c3cac454 Tag CPython test files with the commit or tag they were copied from. (#158038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158038
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801, #157802, #156981
2025-07-14 15:42:19 +00:00
826f12b829 [SymmMem] Avoid library mismatch in CMake search (#157836)
Before, if NVSHMEM is installed at *BOTH* the system location (e.g. `/usr/local`) and the conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`), there can be a mismatch in where the host lib and device lib are found:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```

The reason is that CMake prioritizes name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
      # In pip install case, the lib suffix is `.so.3` instead of `.so`
      NAMES nvshmem_host nvshmem_host.so.3
      HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
      PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```

This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.
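
The difference between the two search orders can be simulated in a few lines of Python. This is a deliberately simplified model (real CMake has several search stages, prefixes, and suffixes); the function `find_library` below and the shortened directory names are hypothetical:

```python
def find_library(names, dirs, files, names_per_dir=False):
    """Simulate the lookup; `files` is a set of (directory, filename) pairs."""
    if names_per_dir:
        # One directory at a time, trying every name in it.
        order = [(d, n) for d in dirs for n in names]
    else:
        # Default: one name at a time, trying every directory for it.
        order = [(d, n) for n in names for d in dirs]
    for d, n in order:
        if (d, n) in files:
            return f"{d}/{n}"
    return None


names = ["libnvshmem_host.so", "libnvshmem_host.so.3"]
dirs = ["conda/nvshmem/lib", "/usr/local/lib"]  # hinted directory first
files = {
    ("/usr/local/lib", "libnvshmem_host.so"),
    ("conda/nvshmem/lib", "libnvshmem_host.so.3"),
}
default_hit = find_library(names, dirs, files)
per_dir_hit = find_library(names, dirs, files, names_per_dir=True)
```

In this model the default order picks the system copy because the first name matches there, while the per-directory order picks the `.so.3` file in the hinted directory.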

After this PR:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
2025-07-14 14:13:02 +00:00
86d8af6a6c Add sm_70 to windows 12.9 build (#158126)
Please see: https://github.com/pytorch/pytorch/issues/157517
Volta architectures will be kept for 12.8/12.9 builds for release 2.8 (12.8 win build does not need change since already including sm70)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158126
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-07-14 13:11:10 +00:00
0bb733ba23 Add cuda 12.4 build in CI (#157958)
Fixes to https://github.com/pytorch/pytorch/issues/156747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157958
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-07-14 13:01:16 +00:00
0f21fa84fb Documentation Fix: torch.empty_like memory preservation (#158050)
updated docs for torch.empty_like to reflect view and dense memory behavior

Fixes #158022

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158050
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-07-14 06:02:54 +00:00
aa11628576 Issue warning with reference to user code rather than torch (#155112)
Re-raising of #129959 as that was closed.

Warning message before:
```
/home/admin/.local/share/hatch/env/virtual/toms-project-1/Qv9k_r_5/dev/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
```

Warning message after:
```
/path/to/my/code:91: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
```

Helps the user find where the issue stems from in their code. What do you think?

(Looks like "skip_file_prefixes" is not available until Python 3.12 minimum...)
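
A minimal sketch of attributing a warning to the caller, assuming the change relies on the standard `stacklevel` argument of `warnings.warn`. The functions `library_function` and `user_code` are hypothetical:

```python
import warnings


def library_function():
    # stacklevel=2 attributes the warning to whoever called library_function.
    warnings.warn("feature is enabled, but unavailable. Disabling.", UserWarning, stacklevel=2)


def user_code():
    library_function()  # the warning is reported at this line


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()
```

A fixed `stacklevel` only works when the number of library frames between the user and the `warn` call is known, which is the limitation that `skip_file_prefixes` addresses on newer Python versions.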

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155112
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-14 05:24:23 +00:00
9ca080db87 [MPS] Extend atomic operations to all int types (#158179)
That fixes `index_put(..., accumulate=True)` for all dtypes

The int64 operation is not really atomic, but it is eventually consistent from the `index_put_accumulate` kernel's point of view, i.e. by the end of the operation the results in global memory are indeed the accumulation of the operands at the given indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
2025-07-14 04:25:05 +00:00
1ea9cde598 [ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)
Fixes #156012

This is a temporary solution that makes context parallelism work until the logsumexp behavior changes land in AOTriton.
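
As a worked illustration of what "scaling back to natural base" can mean, the sketch below assumes the kernel returns the log-sum-exp in base 2; that assumption is not stated in this description, and the helper name `lse_base2_to_natural` is hypothetical. Under that assumption the conversion is a multiplication by ln(2):

```python
import math


def lse_base2_to_natural(lse_base2):
    """Convert log2(sum(exp(x))) to ln(sum(exp(x)))."""
    # log2(y) * ln(2) == ln(y)
    return lse_base2 * math.log(2.0)


scores = [0.5, 1.0, -2.0]
total = sum(math.exp(s) for s in scores)
natural = math.log(total)
recovered = lse_base2_to_natural(math.log2(total))
```

Merging partial attention results across ranks needs all log-sum-exp values in the same base, which is why the rescaling matters for context parallelism.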

After discussion, we are not going to release AOTriton 0.10.1 to fix this because:
* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily, https://github.com/XilunWu
2025-07-14 02:50:36 +00:00
edb92e16ba feat(dynamo): raise UnsupportedError for ndarray.astype(object) (#157810)
Fixes #157720

###  What's in this PR?

This PR improves the error handling in `torch.compile` for `ndarray.astype('O')` (or `object`). It now explicitly raises a `torch._dynamo.exc.Unsupported` exception with a clear explanation, instead of failing with a less intuitive error during fake tensor propagation.

This is achieved by adding a check within `NumpyNdarrayVariable.call_method` for this specific `astype` pattern.

A new test, `test_ndarray_astype_object_graph_break`, is also added to `test/test_numpy_interop.py` to verify this new behavior.

### Background

Previously, attempting to `torch.compile` a function containing `ndarray.astype('O')` would result in a `TorchRuntimeError` wrapping a `TypeError: data type 'O' not understood`. This error message, originating deep within the tensor mechanism, was not very user-friendly and didn't clearly state *why* it was unsupported.

This change makes the failure more explicit and provides a better user experience by giving a direct, actionable error message.

**Old Behavior (Error Traceback):**
```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: ... got TypeError("data type 'O' not understood")
```

**New Behavior (Error Message):**
```
torch._dynamo.exc.Unsupported: ndarray.astype(object)
Explanation: ndarray.astype('O') or ndarray.astype(object) is not supported by torch.compile, as there is no equivalent to object type in torch.
```

### Testing

A new test has been added to `test_numpy_interop.py` which decorates a function containing `ndarray.astype("O")` with `torch.compile`. The test asserts that a `torch._dynamo.exc.Unsupported` exception is raised, confirming the new error handling works as expected.

The test can be run with:
`pytest test/test_numpy_interop.py -k test_ndarray_astype_object_graph_break`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157810
Approved by: https://github.com/jansel
2025-07-14 01:22:49 +00:00
3321acc92e [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-14 01:14:30 +00:00
1f57e0e04d [CPU] Support GQA for flash attention (#157893)
As many models require GQA, we support it in flash attention for CPU path.
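
In grouped-query attention, several query heads share one key/value head. A minimal sketch of that head mapping, assuming the number of query heads is a multiple of the number of key/value heads; the function name is illustrative and is not taken from the CPU kernel.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the key/value head that a given query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared KV head
    return q_head // group_size


# 8 query heads sharing 2 key/value heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```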

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157893
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-13 09:49:02 +00:00
c68af9af1b Fix XPU CI UT test_circular_dependencies (#158189)
# Motivation
fix https://github.com/pytorch/pytorch/issues/110040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158189
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-13 09:30:57 +00:00
5aee022d8b [BE] Move repeated code into helper functions (#158178)
Namely `index_get_offsets`, which, given a thread index, computes offsets into
the input, output and indices tensors,
and `index_apply_indices`, which applies offsets to either the input or the output
tensor index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158178
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064
2025-07-12 18:24:12 +00:00
31326a9ad7 Fix typo in torch.set_float32_matmul_precision docs (#158191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158191
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-12 18:23:11 +00:00
a0308edb6c [build] remove wheel from build requirements (#158027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158027
Approved by: https://github.com/Skylion007
2025-07-12 16:45:51 +00:00
9508d73307 remove allow-untyped-defs from torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py (#157848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157848
Approved by: https://github.com/Skylion007
ghstack dependencies: #157847
2025-07-12 15:42:12 +00:00
066bf29334 remove allow-untyped-defs from torch/_higher_order_ops/run_const_graph.py (#157847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157847
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-07-12 15:42:12 +00:00
5221448574 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.
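
Points 1 to 3 can be sketched in plain Python. This is a toy model under stated assumptions, not the `MultiKernelCall` implementation: the class name `HintedMultiKernel`, the injected `benchmark` callable, and the per-shape cache are all hypothetical.

```python
class HintedMultiKernel:
    """Toy dispatcher: one kernel per size hint, best one chosen per runtime shape."""

    def __init__(self, kernels_by_hint, benchmark):
        self.kernels_by_hint = dict(kernels_by_hint)  # hint -> callable
        self.benchmark = benchmark                    # (kernel, args) -> cost
        self.best_hint_for_shape = {}                 # runtime shape -> winning hint

    def __call__(self, shape, *args):
        hint = self.best_hint_for_shape.get(shape)
        if hint is None:
            # First time this shape is seen: benchmark every variant once.
            hint = min(
                self.kernels_by_hint,
                key=lambda h: self.benchmark(self.kernels_by_hint[h], args),
            )
            self.best_hint_for_shape[shape] = hint
        return self.kernels_by_hint[hint](*args)


def make_kernel(hint):
    def kernel(values):
        return sum(values)  # every variant computes the same result
    kernel.hint = hint
    return kernel


def fake_benchmark(kernel, args):
    # Stand-in for a real timing run: pretend a kernel is fastest near its own hint.
    return abs(kernel.hint - len(args[0]))


dispatch = HintedMultiKernel({h: make_kernel(h) for h in (64, 256, 4096)}, fake_benchmark)
result = dispatch((300,), list(range(300)))
```

Caching the winner per shape keeps the benchmarking cost to the first call for each distinct runtime shape.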

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning, and I plan to investigate further to improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-12 15:08:21 +00:00
191693ac85 adding arg values and arg types to Strobelight USDT (#155185)
Summary: This diff makes changes to the USDT added by RihamSelim in D44636587. The "operator_start" USDT passes in the memory addresses of operator arguments and the argument types. This is so we can record argument values and types in the Strobelight GPUEvent Profiler. The previous diff records the ATEN operator, and this diff lays the groundwork to record ATEN op arguments.

Test Plan: I ensured this code builds by running the example in this diff, and testing profiler changes in this diff.

Reviewed By: RihamSelim

Differential Revision: D75606556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155185
Approved by: https://github.com/malfet
2025-07-12 12:00:08 +00:00
aacb944079 [aot inductor] fix clang-asan for consts_cpp. (#158175)
In the previous PR, https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build the consts bytes.

But it still raises a clang ASAN `stack allocation` error when building large consts.

This PR:
1. Adds `test_aot_inductor_consts_cpp_build` to the stack allocation skip list.
2. Adds ATTRIBUTE_NO_SANITIZE_ADDRESS to skip the ASAN check, because the consts array is located in the global area.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
2025-07-12 07:14:05 +00:00
6b84cb29f9 [dynamo] trace through torch.get_device_module (#157980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157980
Approved by: https://github.com/anijain2305
2025-07-12 06:25:46 +00:00
7f14b42adf [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
e90148c91d Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit 4b9a6f7211123511e856ac8c8524bc332a741241.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))
2025-07-12 04:52:11 +00:00
a529a5daf5 [test][distributed][vllm] stabilize the p2p sharing through ipc (#158089)
vLLM's RLHF integration cf75cd2098/examples/offline_inference/rlhf_utils.py (L93) depends on this hidden feature. This PR adds a test so that PyTorch will not break it in a backward-incompatible way.

The goal is to create p2p shared tensors across devices, for example sharing process 0's memory on GPU 0 with process 1's memory space on GPU 1, when GPU 0 and GPU 1 can use GPU direct p2p access.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158089
Approved by: https://github.com/houseroad, https://github.com/ngimel
2025-07-12 04:41:13 +00:00
e15f4248ad Revert "[BE][2/16] fix typos in torch/ (torch/_*/) (#156312)"
This reverts commit 7a92b5119654c07d15f5c0818e6ae804b01e836c.

Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250))
2025-07-12 04:40:52 +00:00
9056279f81 don't error out in empty_cache under mempool context (#158152)
Now, instead of erroring out when `empty_cache` is called during graph capture or under a mempool context, we silently do nothing. This was already the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
2025-07-12 04:37:05 +00:00
f45f6e86b9 Fix torch._numpy advanced indexing to match NumPy when indices are separated (#157676)
Written with Claude Code.

Fixes https://github.com/pytorch/pytorch/issues/157569
Fixes https://github.com/pytorch/pytorch/issues/158134

NumPy and PyTorch handle advanced indexing differently when advanced indices are separated by slices (e.g., arr[:, [0], :, 0]). PyTorch uses "outer" indexing, placing result dimensions in their original positions, while NumPy uses "vectorized" indexing, moving the advanced index dimensions to the front.

This adds _numpy_style_advanced_indexing() to detect separated advanced indices and transpose results to match NumPy's dimension ordering, ensuring torch._numpy maintains compatibility with NumPy's indexing behavior.

Fixes cases like:
- arr[:, [0], :, 0] now returns shape (1, 5, 7) instead of (5, 1, 7)
- arr[:, [0, 1], :, 0] now returns shape (2, 5, 7) instead of (5, 2, 7)
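
The two shape conventions can be modelled without either library. The sketch below is a simplified model, not the `_numpy_style_advanced_indexing()` code: it only handles indices built from full slices, ints, and 1-D int lists, and it assumes an input of shape (5, 6, 7, 8), which this message does not state. The helper name and the `ints_are_advanced` flag are hypothetical.

```python
def indexed_shape(shape, index, ints_are_advanced):
    """Shape of arr[index] for indices made of `:`, ints, and 1-D int lists only."""
    if len(shape) != len(index):
        raise IndexError("this sketch needs one index per dimension")
    if any(isinstance(i, slice) and i != slice(None) for i in index):
        raise IndexError("this sketch only supports full slices")
    if not ints_are_advanced:
        # Model of the old behaviour: an int selects (drops) its dimension first.
        pairs = [(d, i) for d, i in zip(shape, index) if not isinstance(i, int)]
        shape = [d for d, _ in pairs]
        index = [i for _, i in pairs]
    list_lens = [len(i) for i in index if isinstance(i, list)]
    if not list_lens:
        # Basic indexing only: ints drop their dimension, slices keep it.
        return tuple(d for d, i in zip(shape, index) if not isinstance(i, int))
    n = max(list_lens)  # broadcast length of the advanced indices
    if any(length not in (1, n) for length in list_lens):
        raise IndexError("index lists do not broadcast")
    adv = [p for p, i in enumerate(index) if isinstance(i, (list, int))]
    sliced = [(p, shape[p]) for p, i in enumerate(index) if isinstance(i, slice)]
    if adv[-1] - adv[0] + 1 == len(adv):
        # Adjacent advanced indices: the broadcast dimension stays in place.
        before = [d for p, d in sliced if p < adv[0]]
        after = [d for p, d in sliced if p > adv[-1]]
        return tuple(before + [n] + after)
    # Separated advanced indices: the broadcast dimension moves to the front.
    return tuple([n] + [d for _, d in sliced])


full = slice(None)
numpy_like = indexed_shape((5, 6, 7, 8), (full, [0], full, 0), ints_are_advanced=True)
old_result = indexed_shape((5, 6, 7, 8), (full, [0], full, 0), ints_are_advanced=False)
```

Here `numpy_like` is `(1, 5, 7)` and `old_result` is `(5, 1, 7)`, matching the first case listed above.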

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157676
Approved by: https://github.com/manuelcandales

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-12 04:35:04 +00:00
9c189ed29a Revert "multi-kernel matmuls based on varying hint sizes (#156628)"
This reverts commit 6c795306378c47341d58109da03371bba2bec46e.

Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))
2025-07-12 03:48:39 +00:00
2eff14c445 [ONNX] Delete torch.onnx.dynamo_export (#158130)
It has been deprecated since torch==2.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158130
Approved by: https://github.com/justinchuby
2025-07-12 02:30:47 +00:00
7a92b51196 [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 01:47:22 +00:00
8b97e4dd8c #IS157973/numpy version issue (#158036)
Fixes #157973

`THPUtils_unpackNumberAsBool` now recognises `numpy.bool_` scalars explicitly (using `torch::utils::is_numpy_bool`).
If the object is a NumPy boolean, we retrieve its truth value via `PyObject_IsTrue` and return it, avoiding the previous failing path that attempted to treat it as an integer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158036
Approved by: https://github.com/jansel
2025-07-12 01:36:28 +00:00
627ba41136 [DCP][HF] [ez]Change where sharded tensors are saved (#158069)
Summary: Previously, sharded tensors were saved to the same directory as full tensors. This doesn't make sense, because on load() you would be loading from a directory that contains both, with no way to distinguish them, so they should be in separate folders.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D78108144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
2025-07-12 01:02:17 +00:00
f4406689b8 fix MPCT destroy_pg call (#157952)
I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case.
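
The control flow described here can be sketched with `try`/`else`. This is an illustrative pattern only; `run_test_process` and its two callables are hypothetical stand-ins, not the `MultiProcessContinuousTest` code.

```python
def run_test_process(test_body, destroy_process_group):
    """Run test_body; tear the process group down only if it finished cleanly."""
    try:
        test_body()
    except Exception:
        # Skip teardown so the failure propagates to the caller.
        raise
    else:
        destroy_process_group()


calls = []
run_test_process(lambda: None, lambda: calls.append("destroy"))


def failing():
    raise RuntimeError("boom")


try:
    run_test_process(failing, lambda: calls.append("destroy"))
except RuntimeError:
    calls.append("raised")
```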

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952
Approved by: https://github.com/fduwjj
ghstack dependencies: #157589
2025-07-12 00:46:19 +00:00
7444debaca Revert "Fix logdet returning finite values for singular matrices on CUDA (#157910)"
This reverts commit 7d4228dbfd13d1ac8fac2c78c042dbb8314f042d.

Reverted https://github.com/pytorch/pytorch/pull/157910 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some internal tests accuracy ([comment](https://github.com/pytorch/pytorch/pull/157910#issuecomment-3064368647))
2025-07-12 00:22:51 +00:00
8c928372b3 Make Q Indices optional (#157997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157997
Approved by: https://github.com/BoyuanFeng, https://github.com/Chillee
2025-07-12 00:16:20 +00:00
22f3347fd9 [MTIA Aten Backend] Change relu / relu_ back to use relu kernel (#158101)
# Context
In D75803582, we migrated relu/relu_ from out-of-tree to pytorch in-tree. With that, we also changed it to use the ATen op-layer logic:
https://www.internalfb.com/code/fbsource/[04ec3fcd0b09b601ae26a785e595ab960a6ba684]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

To summarize:
**The behavior before D75803582:**
The Relu operator calls this code(https://fburl.com/code/pezspv40) and launches Relu kernel.

**The behavior after D75803582:**
The Relu operator uses the ATen logic, which delegates to the clamp_min operator, and no longer launches the Relu kernel.

-----------------

But according to my discussion with @vvk, we should keep using the Relu kernel, instead of adopting ATen logic that delegates to clamp_min, because MTIA's Relu kernel has special optimization for MTIA device.

# This diff

Change relu / relu_ to launch the relu kernel, which is the same as the original behavior before D75803582.

Note: this does not revert D75803582, because we still want to move relu/relu_ in-tree.

Differential Revision: [D78109262](https://our.internmc.facebook.com/intern/diff/D78109262/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158101
Approved by: https://github.com/albanD
2025-07-12 00:12:29 +00:00
0d77364ee3 dist2: cleanup non-option methods on PG (missing, timeouts) (#158123)
This updates the ProcessGroup.* API to include timeouts on all non-option based overloaded methods. This also adds 2 missing ones `alltoall_base` and `barrier`.

Following design in: https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

Test plan:

```
pytest test/distributed/test_dist2.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158123
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-07-12 00:06:37 +00:00
f44a9eee47 [AOTI] Add missing ops to set of C-shim ops which can have nullptr returns (#158073)
Most added ops are backwards ops, which have not been well-tested previously (thus why they were missed). Necessary ops were identified by manual examination of torch/_meta_registrations.py return values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158073
Approved by: https://github.com/desertfire
2025-07-11 23:35:26 +00:00
ff7dd1776f [cutlass backend] Global filter ops before situation based filter ops (#157866)
The idea of this PR is that we sometimes filter ops based on criteria that are not node-specific. For example, we always filter out simt ops. So I want to group these checks together into a global filtering function.

This can help shrink the config space as well. 20s -> 6s for instantiation 3332.
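
A two-stage filter of this kind might look like the sketch below. The op names, the dtype-suffix convention, and both function names are made up for illustration; the real CUTLASS backend uses its own op objects and filters.

```python
from functools import lru_cache

# Made-up op names; the real backend enumerates CUTLASS op objects.
ALL_OPS = ("simt_gemm_f32", "tensorop_gemm_f16", "tensorop_gemm_bf16")


@lru_cache(maxsize=None)
def globally_allowed_ops():
    """Node-independent filter, evaluated once: drop every simt op."""
    return tuple(op for op in ALL_OPS if not op.startswith("simt"))


def ops_for_node(node_dtype):
    """Node-specific filter, applied to the already reduced global list."""
    return [op for op in globally_allowed_ops() if op.endswith("_" + node_dtype)]


candidates = ops_for_node("f16")
```

Running the global stage once means each node-specific pass starts from a smaller list.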

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157866
Approved by: https://github.com/ColinPeppler
2025-07-11 23:13:20 +00:00
2a8795a981 [c10d] ProcessGroupGloo: support per operation timeouts (#158128)
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.

* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
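
The first bullet amounts to a sentinel check. A minimal Python sketch of that logic, assuming a negative duration as the sentinel; `UNSET_TIMEOUT` and `effective_timeout` are illustrative names and not part of the c10d Python API.

```python
from datetime import timedelta

# Stand-in for the kUnsetTimeout sentinel; the concrete value is an assumption.
UNSET_TIMEOUT = timedelta(milliseconds=-1)


def effective_timeout(op_timeout, context_default):
    """Use the per-operation timeout if one was set, else the context default."""
    return context_default if op_timeout == UNSET_TIMEOUT else op_timeout


default = timedelta(seconds=30)
unset_case = effective_timeout(UNSET_TIMEOUT, default)
explicit_case = effective_timeout(timedelta(seconds=5), default)
```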

Test plan:

```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2025-07-11 23:09:50 +00:00
a8ec7babcf [dynamo] expand_hints does exc() to expand graph_break_hints (#158078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158078
Approved by: https://github.com/williamwen42
2025-07-11 22:51:28 +00:00
beed033b6e [MPS] Fix index_kernel for large tensors (#158064)
Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator

Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require an additional tensor for indices nor special handling for 32 vs 64-bit offsets, which resulted in an almost 2x perf gain for a 2000x2000 tensor. Results before:
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   383.5    |    379.8     |    470.9     |     1232.9     |     4410.3
      __getitem__ (torch.float16, torch.int64)  |   379.6    |    354.5     |    533.2     |     1290.3     |     4442.2
      __getitem__ (torch.float32, torch.int64)  |   360.8    |    338.6     |    478.6     |     1348.9     |     4870.4

Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   349.8    |    330.5     |    432.6     |     764.5      |     1961.2
      __getitem__ (torch.float16, torch.int64)  |   342.5    |    330.7     |    434.7     |     741.0      |     1969.4
      __getitem__ (torch.float32, torch.int64)  |   332.2    |    326.1     |    445.4     |     751.3      |     1972.6

Times are in microseconds (us).
```

While migrating, also fixed index_put_accumulate for boolean types by using a compare_and_exchange trick over uint

Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
2025-07-11 22:35:44 +00:00
93854e83b7 [DTensor] Rewrite doc of TupleStrategy (#158132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158132
Approved by: https://github.com/XilunWu
2025-07-11 22:08:57 +00:00
4b9a6f7211 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient, as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
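
A rough back-of-the-envelope calculation makes the gap concrete. This sketch assumes float32 elements (4 bytes, the default dtype of `torch.rand`) and counts only the N-by-N tensors named above; `N = 4096` and `r = 16` are illustrative values, not numbers from this PR.

```python
def peak_bytes(n, r, bytes_per_element=4):
    """Approximate peak memory of the unfused and fused schedules."""
    tensor_bytes = n * n * bytes_per_element
    unfused = 2 * tensor_bytes  # `total` plus the current `x`
    fused = r * tensor_bytes    # x_1 ... x_r are all alive before the single add
    return unfused, fused


unfused, fused = peak_bytes(n=4096, r=16)
unfused_mib = unfused // 2**20  # 128 MiB
fused_mib = fused // 2**20      # 1024 MiB
```

The fused schedule grows linearly with `r`, while the unfused one stays at two tensors regardless of the loop count.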

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion is performed, so we also need to capture this pattern there. The decisions are implemented in `choices.py`.
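
The interplay of the two thresholds can be sketched as a single predicate. This is a hypothetical model, not the code in `ir.py` or `choices.py`: the function name, the strict `>` comparisons, and treating an unset size threshold as "no limit" are all assumptions.

```python
def should_realize(read_sizes_bytes, count_threshold=8, size_threshold_bytes=None):
    """Return True when accumulated reads should be materialized, not fused further."""
    if len(read_sizes_bytes) > count_threshold:
        return True  # too many buffers, regardless of their size
    if size_threshold_bytes is not None and sum(read_sizes_bytes) > size_threshold_bytes:
        return True  # few buffers, but too many bytes in total
    return False


MIB = 2**20
reads = [64 * MIB] * 4  # four large buffers, below the count threshold of 8

count_only = should_realize(reads)
with_size_limit = should_realize(reads, size_threshold_bytes=128 * MIB)
```

With only the count-based check, the four large reads pass unnoticed (`count_only` is `False`); the size-based check catches them (`with_size_limit` is `True`).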

**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-11 21:07:57 +00:00
4ff9b7fa31 Fix diagnostic message for CUDA version mismatch in cuda.cmake (#157370)
This PR fixes  #157354

It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.

The problem was caused by this line:

set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})

This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:

set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})

Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
2025-07-11 20:58:35 +00:00
eqy
00ae620b9f [CUDA] Allow cuDNN or flash attn in test_activation_checkpointing pattern match check (#153272)
Seems more robust than maintaining a mirror of dispatch condition based on compute capability etc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153272
Approved by: https://github.com/soulitzer
2025-07-11 20:58:12 +00:00
702a304b07 Revert "[CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit 9a5278225fc5e7b46d54a65ae1a3f049ee49824f.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/ngimel due to breaks 525 driver installs ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-3063742807))
2025-07-11 20:36:36 +00:00
eqy
9963845a4e [CUDA] Support family-conditional compute capabilies in TORCH_CUDA_ARCH_LIST (#157999)
Similar to arch-conditionals, such as 9.0a  and 10.0a, family conditionals such as 10.0f enable features specific to a family of architectures, such as between sm100 and sm103
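
As an illustration of the naming scheme, the sketch below splits an arch-list entry into its numeric part and optional suffix. It is a hypothetical parser, not the logic PyTorch's build uses: the `sm_` target spelling and the "baseline" label are assumptions, and `+PTX` entries are not handled.

```python
import re

_ARCH_RE = re.compile(r"^(\d+)\.(\d)([af]?)$")

_KINDS = {"": "baseline", "a": "arch-conditional", "f": "family-conditional"}


def parse_arch(entry):
    """Split an entry such as '10.0f' into a target name and its kind."""
    match = _ARCH_RE.match(entry)
    if match is None:
        raise ValueError(f"unsupported arch entry: {entry!r}")
    major, minor, suffix = match.groups()
    return f"sm_{major}{minor}{suffix}", _KINDS[suffix]


parsed = [parse_arch(e) for e in "8.6;9.0a;10.0f".split(";")]
```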

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157999
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 20:34:59 +00:00
6c79530637 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```
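
One way to read the two tables is to compare, for each runtime size, the largest and smallest entry across hints. The sketch below does that arithmetic on the numbers copied from the tables above; the helper name is illustrative.

```python
# Rows are hints (64, 256, 4096); columns are runtime sizes (64, 256, 4096).
before = {
    64: [0.0948, 0.3124, 4.9477],
    256: [0.2243, 0.2256, 3.3880],
    4096: [0.3384, 0.3404, 3.3010],
}
after = {
    64: [0.0951, 0.2289, 3.3013],
    256: [0.0952, 0.2258, 3.4045],
    4096: [0.0957, 0.2231, 3.3146],
}


def spread_per_runtime_size(table):
    """For each runtime size (column), ratio of the largest to the smallest entry."""
    return [max(col) / min(col) for col in zip(*table.values())]


before_spread = spread_per_runtime_size(before)
after_spread = spread_per_runtime_size(after)
```

In the "Before" table every column varies by more than 1.4x depending on the hint; in the "After" table every column stays within 1.04x.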

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-11 19:38:10 +00:00
bd364c901d Fix serialization of nans in torch.export (#155359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155359
Approved by: https://github.com/angelayi
2025-07-11 19:33:15 +00:00
b487003182 [PyTorch Core] MTIA supports arbitrary strides (#157883)
Summary:
Currently, on MTIA the following case will return false

```
options.device().supports_as_strided()
```
As a result, whenever moving a tensor from CPU to MTIA, strides will not be preserved ([see here](e5edd013ab/aten/src/ATen/native/TensorConversions.cpp (L351))). This is a primary reason why deserializing tensors from .pt files will be contiguous.

Reviewed By: egienvalue, andyanwang

Differential Revision: D77843224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157883
Approved by: https://github.com/albanD, https://github.com/andyanwang
2025-07-11 18:54:21 +00:00
cyy
b0556110e5 Remove unsafe PyTorchError constructor (#154961)
Use libfmt in call sites of PyTorchError.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154961
Approved by: https://github.com/albanD
2025-07-11 18:22:53 +00:00
1cb0597a89 [PyTorch] Deprecate numpy serialization for MTIA (#157884)
Summary:
NumPy based tensor rebuilding from serialization has been deprecated by other backends (eg. [XLA](https://github.com/pytorch/pytorch/pull/137444)). The new flow has CPU storage being constructed with data from the file and then moved to the target backend device.

Furthermore, relying on numpy for serialization will fail loudly when torch.load flips weights_only.

Reviewed By: andyanwang

Differential Revision: D77843238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157884
Approved by: https://github.com/albanD
2025-07-11 17:57:33 +00:00
157683d862 [Reducer] Remove custom handling of view tensors for MTIA (#157882)
Summary: Following the implementation of the updated ATen backend for MTIA, and the diffs enabling in-tree view ops (D75266206, D75385411), we can remove the custom logic from the reducer that handles MTIA view operations.

Test Plan:
CI

Rollback Plan:

Reviewed By: egienvalue

Differential Revision: D77843212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157882
Approved by: https://github.com/albanD, https://github.com/andyanwang
2025-07-11 17:56:45 +00:00
92ee5bd9f6 Revert "[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)"
This reverts commit d75d30eeb610b164e69d0678a2e2b2dea81eec0f.

Reverted https://github.com/pytorch/pytorch/pull/157216 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it turns out that the internal failure was legit ([comment](https://github.com/pytorch/pytorch/pull/157216#issuecomment-3063075001))
2025-07-11 17:07:26 +00:00
c4cdcda754 [aot] add format_consts_to_cpp function for further development. (#157608)
Changes:
1. Split the `format_consts_to_asm` function, which is the current way to convert consts to an object.
2. Add the `format_consts_to_cpp` function, which will support more compilers, such as `msvc` and `icx`.
3. Add `config.aot_inductor.use_consts_asm_build` to select between `format_consts_to_asm` and `format_consts_to_cpp`.
4. Add a UT for `format_consts_to_cpp`.

For `format_consts_to_cpp`, I tested it locally:
Case: https://docs.pytorch.org/docs/main/torch.compiler_aot_inductor.html
Run it and `cat` the generated cpp code:
<img width="674" alt="image" src="https://github.com/user-attachments/assets/d47ccf84-06d2-47f5-8a0d-9a43a9020aa3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157608
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-11 17:02:41 +00:00
bb3c911c2d [DTensor] support split op on Partial placement (#157991)
**Summary**
To enable the use case where the input DTensor to the `split` op has a `Partial()` placement,
this PR treats `Partial()` the same way as `Replicate()`. That is, the `split` op
only unshards `Shard(dim=x)` if `x == split_dim` and keeps other placements
untouched.
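
The placement rule described above can be sketched in plain Python. The `Shard`, `Replicate`, and `Partial` classes below are simplified stand-ins written for this illustration, not the real `torch.distributed.tensor` placement types, and `split_input_placements` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replicate:
    pass


@dataclass(frozen=True)
class Partial:
    reduce_op: str = "sum"


@dataclass(frozen=True)
class Shard:
    dim: int


def split_input_placements(placements, split_dim):
    """Return the placements the input should have before splitting on split_dim."""
    result = []
    for p in placements:
        if isinstance(p, Shard) and p.dim == split_dim:
            # Sharded along the split dimension: unshard it first.
            result.append(Replicate())
        else:
            # Replicate, Partial, and Shard on other dims are left untouched.
            result.append(p)
    return tuple(result)


print(split_input_placements((Shard(0), Partial(), Shard(1)), split_dim=0))
```

In this sketch only the `Shard(0)` entry changes; `Partial()` passes through exactly as `Replicate()` would.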

**Test**
Added a new test because `test_dtensor_ops` doesn't test `Partial()` placement.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157991
Approved by: https://github.com/zpcore
2025-07-11 16:19:31 +00:00
1f1f22991d Restore fake device (#157972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157972
Approved by: https://github.com/ezyang
2025-07-11 16:12:01 +00:00
27c50799c1 Use new cuBLAS row-wise fp8 matmul for scaled-mm (#157905)
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and ran `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is how cuBLAS calls this scaling mode).

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU. I used the same approach as in #134781, and obtained these speed-ups:
![image](https://github.com/user-attachments/assets/43dfb816-9ccf-40c5-8b2a-571ce9cb511d)
![image](https://github.com/user-attachments/assets/be7ac6f2-e16c-479b-ad5c-f8039caba4b1)

We see that the two kernels perform very closely (I'm surprised, I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better.

I guess the questions are whether we consider this a net-zero change (given that there's improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
2025-07-11 16:11:55 +00:00
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
7a08755c5f [BE][Ez]: Update ruff to 0.12.2 (#157937)
Update to the latest version of ruff, apply some fixes that it flagged, and silence a few new lints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157937
Approved by: https://github.com/ezyang
2025-07-11 15:16:20 +00:00
0d17029fea [BE][6/6] fix typos in test/ (test/distributed/) (#157640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157640
Approved by: https://github.com/yewentao256, https://github.com/malfet
2025-07-11 14:09:37 +00:00
4283d96bcd [build] pin setuptools>=70.1.0 for integrated bdist_wheel command (#157783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157783
Approved by: https://github.com/Skylion007
2025-07-11 12:10:42 +00:00
b4476ca378 Add cudaMallocAsync/cudaFreeAsync to cuda_to_hip_mappings (#158056)
Summary: Adding both functions as they're required for Hipification of https://fburl.com/code/165r7qhr

Test Plan:
Tested in D78090513

Rollback Plan:

Reviewed By: malfet, jiangyurong609

Differential Revision: D78090693
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158056
Approved by: https://github.com/Skylion007
2025-07-11 11:48:19 +00:00
85857181eb Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-11 11:41:34 +00:00
03b307575a Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-11 11:25:43 +00:00
8088958793 port 4 dynamo test files to Intel GPU (#157779)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Six dynamo test files were ported in PRs [#156056](https://github.com/pytorch/pytorch/pull/156056) and [#156575](https://github.com/pytorch/pytorch/pull/156575). In this PR we port 4 more dynamo test files.
We enable Intel GPU with the following methods and try our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- added XPU support in decorators like @requires_gpu
- enabled XPU for some test path
- added xfailIfXPU to skip xpu test when there is a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157779
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-07-11 10:11:49 +00:00
e1a20988f3 [Quant][CPU] Enable fp8 qconv (#157076)
**Summary**
Enable fp8 qconv on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qconv op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qconv op is not changed either.

So, the FP8 qconv shares the same op as INT8 qconv and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.

Note:
OneDNN does not support quantized fp8 convolution until v3.9, but the version used in PyTorch is v3.7.2, so the op goes to the reference kernel for now. We have also updated the oneDNN path so that it is compatible with the fp8 dtype. Once oneDNN is upgraded to v3.9 or newer, minimal changes are needed to enable the oneDNN path. We have also ensured that the behavior of the reference kernel matches the new oneDNN implementation.
- oneDNN version < 3.9 (now)
  - Always go to the reference kernel
- oneDNN version >= 3.9 (future)
  - Go to reference kernel on old platforms (without AMX)
  - Use oneDNN on new platforms (with AMX)
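
The dispatch rules listed above can be summarized as a small decision function. This is only a sketch of the stated policy: `choose_fp8_qconv_kernel` is a hypothetical name, and representing the oneDNN version as a tuple and AMX availability as a boolean are assumptions made for the example.

```python
def choose_fp8_qconv_kernel(onednn_version, has_amx):
    """Pick a kernel for fp8 qconv following the rules in the note above.

    onednn_version: tuple such as (3, 7, 2); has_amx: True if the CPU supports AMX.
    """
    if onednn_version < (3, 9):
        # oneDNN older than 3.9 has no quantized fp8 convolution.
        return "reference"
    # oneDNN 3.9 or newer: use it only on platforms with AMX.
    return "onednn" if has_amx else "reference"


print(choose_fp8_qconv_kernel((3, 7, 2), has_amx=True))
print(choose_fp8_qconv_kernel((3, 9, 0), has_amx=True))
print(choose_fp8_qconv_kernel((3, 9, 0), has_amx=False))
```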

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-07-11 10:00:57 +00:00
ed508cc018 [inductor][triton] Add experimental use_tensor_descriptor config option (#157906)
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.

Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block dimension loads at least 16 bytes of data. For example, if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires XBLOCK >= 4. The triton heuristics therefore need to be aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.
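
The 16-byte rule can be worked through with a short calculation. The helpers below are hypothetical and are not the `TMACompatibilityChecker` implementation; they only show how a minimum innermost block size follows from the dtype size and how candidate configs could then be filtered. Representing a config as a dict with an `XBLOCK` key is an assumption made for the example.

```python
def min_inner_block(itemsize_bytes, min_bytes=16):
    """Smallest innermost block size that loads at least min_bytes of data."""
    # Ceiling division without importing math.
    return -(-min_bytes // itemsize_bytes)


def filter_configs(configs, itemsize_bytes):
    """Keep only the configs whose XBLOCK satisfies the minimum block size."""
    min_x = min_inner_block(itemsize_bytes)
    return [cfg for cfg in configs if cfg["XBLOCK"] >= min_x]


candidates = [{"XBLOCK": 2}, {"XBLOCK": 4}, {"XBLOCK": 8}, {"XBLOCK": 64}]
print(min_inner_block(4))             # float32 has 4-byte elements
print(filter_configs(candidates, 4))  # drops XBLOCK=2
print(filter_configs(candidates, 2))  # 2-byte dtypes need XBLOCK >= 8
```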

Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.

This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
2025-07-11 09:32:40 +00:00
02724b5f64 [Bugfix][Inductor] Fix dependency list merged incorrectly for a custom op with multiple mutated inputs and None return type. (#157133)
This is an attempt to fix a memory allocation issue when using `torch.compile` with a custom layernorm kernel in vllm:
```C++
  // In-place fused Add and RMS Normalization.
  ops.def(
      "fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, "
      "float epsilon) -> ()");
  ops.impl("fused_add_rms_norm", torch::kCUDA, &fused_add_rms_norm);
```
We observed abnormal extra memory allocations with this op enabled using `torch.compile`:
<img width="738" alt="{374E9FCF-FB46-4750-8B60-D31E3ADCE00A}" src="https://github.com/user-attachments/assets/6c45e1aa-ccde-4c56-99dc-bf4776d699d5" />
and without this op:
<img width="738" alt="{9BB08EFE-FFE3-4D06-82C0-C70BBE6ADD56}" src="https://github.com/user-attachments/assets/56e2ee43-ab87-492d-834c-69e9cafbb0df" />

After investigation, we found that this is because the compiler considers that the buffers for the two mutated inputs, `Tensor input` and `Tensor residual`, should share the same dependency list, which prevents it from reusing the buffer of `Tensor input`.
```
buf1.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
buf16.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
```
```
op13: ExternKernelSchedulerNode(FallbackKernel)
op13.writes =
    [   StarDep(name='buf17', mode=None),
        StarDep(name='buf18', mode=None),
        StarDep(name='buf19', mode=None)]
op13.unmet_dependencies =
    [   StarDep(name='buf13', mode=None),
        StarDep(name='buf16', mode=None),
        WeakDep(name='buf11', mutating_buf='buf18'),
        WeakDep(name='buf12', mutating_buf='buf18'),
        WeakDep(name='buf13', mutating_buf='buf18'),
        WeakDep(name='buf2', mutating_buf='buf18'),
        WeakDep(name='buf3', mutating_buf='buf18')]
op13.met_dependencies = [StarDep(name='arg11_1', mode=None)]
op13.outputs = [
    buf17: FallbackKernel
    buf17.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf17.aliases = ['buf16', 'buf1']
    buf17.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op2'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op9'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op13'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=False),
    ]
    buf18: MutationOutput
    buf18.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf18.mutations = ['buf16']
    buf18.users = [
        NodeUser(node=ExternKernelSchedulerNode(name='op14'), can_inplace=False, is_weak=False),
        NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op24'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op31'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op35'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op42'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op46'), can_inplace=False, is_weak=True),
        NodeUser(node=ExternKernelSchedulerNode(name='op53'), can_inplace=False, is_weak=True),
    ]
    buf19: MutationOutput
    buf19.layout = NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0])
    buf19.mutations = ['buf1']
    buf19.users = [NodeUser(node=ExternKernelSchedulerNode(name='op20'), can_inplace=False, is_weak=False)]
]
op13.node.kernel = torch.ops._C.fused_add_rms_norm.default
```
Here we can see that `buf16` shares the same dependency list as `buf1` because both `buf16` and `buf1` are in the aliases list of `buf17`. This is incorrect since they are two separate tensors, and it prevents the compiler from reusing `buf16` for subsequent ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157133
Approved by: https://github.com/jansel
2025-07-11 09:06:31 +00:00
44303caabf [APS] Expose max_autotune lookup table config to frontend (#158070)
Summary: As titled. We reuse optimus config to receive the yaml config file from users

Test Plan:
### how to enable max_autotune lookup table hardcode config

```
            inductor.config.post_grad_fusion_options = {
                "inductor_autotune_lookup_table":  <your yaml manifold path>
            }
```
for example, "manifold://ads_training_p9e/tree/max_autotune/mast_omnifm_v3_1kgpu/mast_omnifm_v3_lookup_table.yaml",

see D78052050

Rollback Plan:

Reviewed By: PaulZhang12, jackiexu1992

Differential Revision: D77202285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158070
Approved by: https://github.com/Mingming-Ding
2025-07-11 09:02:52 +00:00
11d6ad8b2e [Docs] Update PT2 Profiler Torch-Compiled Region Image (#158066)
Summary: In PyTorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region now has its frame id and frame compile id associated with it. Update the image in the doc and add a description of this to the doc itself.

Test Plan:
{F1980179183}

Rollback Plan:

Differential Revision: D78118228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066
Approved by: https://github.com/aaronenyeshi
2025-07-11 07:56:45 +00:00
cd80f9a4c3 xpu: support custom ops with torch.library on xpu backend (#152879)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1626

This PR starts enabling tests for `torch.library`, but more work is needed. The tests use the deprecated `torch._custom_ops` API, which was planned for removal in PyTorch 2.6 (not done yet). I think cleaning this up in PyTorch would be nice before enabling more tests for xpu.
a2ccda3c60/torch/_custom_op/impl.py (L47)

CC: @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152879
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/guangyey, https://github.com/albanD
2025-07-11 07:36:04 +00:00
442aca44d6 Fix XPU broken CI (#158092)
# Motivation
https://github.com/pytorch/pytorch/pull/157739 introduces the new UT `test_sdpfa`, which blocks XPU CI since `_scaled_dot_product_flash_attention` is not supported on XPU yet.

# Additional Context
See https://github.com/pytorch/pytorch/actions/runs/16201010860/job/45741815895?pr=138222#step:15:6399
fix https://github.com/pytorch/pytorch/issues/158095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158092
Approved by: https://github.com/jansel, https://github.com/malfet
2025-07-11 07:23:27 +00:00
d89f30ad45 [MPS] Avoid calling tensor ops in max_pool3d impl (#157874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157874
Approved by: https://github.com/malfet
2025-07-11 06:47:29 +00:00
b4fc42ca80 Add torch.segment_reduce docs (#154352)
Fixes #153138

## Test Result

![image](https://github.com/user-attachments/assets/62346d62-d048-4259-906b-f8261e10b4cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154352
Approved by: https://github.com/albanD
2025-07-11 06:16:38 +00:00
cec59b76ca [2/N] cost coverage improvement (#157738)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Details:
1. Fill in missing redistribute_cost in `cat` and `slice_scatter`;
2. Expand the `cat` strategy based on the placement of each input tensor. Previously `cat` only output one strategy. Now it outputs strategies at the level of number_of_input_tensor*number_OpSpec_each_tensor_input_strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157738
Approved by: https://github.com/wconstab
2025-07-11 05:54:16 +00:00
ecd73c58ee Revert "[BE] Replace std::runtime_error with TORCH_CHECK [2/N] (#152080)"
This reverts commit b85f10ea5006e8ae8fc769f48659ab7ad5eafb69.

Reverted https://github.com/pytorch/pytorch/pull/152080 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some internal tests ([comment](https://github.com/pytorch/pytorch/pull/152080#issuecomment-3060337857))
2025-07-11 03:58:31 +00:00
94995eba07 [Log] add a hook for recompile user context (#157961)
Users may want to add compile-related but customized logging info to dynamo_compile. One example is logging the current training iteration index when recompilation happens. In general, the current training iteration index is not available to the compiler, since the same compiled function may be called multiple times in the same training iteration. The user can provide the training iteration index in a user hook, and torch.compile logs it when recompilation happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157961
Approved by: https://github.com/masnesral
2025-07-11 03:41:33 +00:00
11a86ad2fa Remove pytorch quant docs since we are moving to torchao (#157766)
Summary:
att

Test Plan:
doc page generated from CI

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766
Approved by: https://github.com/Skylion007
2025-07-11 03:21:47 +00:00
dd93883231 [exported_program] Remove _postprocess_graph_module_outputs (#158059)
Summary: Appears to be dead as of https://github.com/pytorch/pytorch/pull/120019.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78112302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158059
Approved by: https://github.com/angelayi
2025-07-11 02:40:15 +00:00
326e751d07 [AOTI] Add device guard when launching autotune kernels (#158034)
Summary: Fix https://github.com/pytorch/pytorch/issues/157737. When launching Triton kernels in the autotune block, we need to consider the fact that the model may not always be on device 0. The reason this was not caught on CI is because test_on_gpu_device1 requires multi_gpu and was not run on a multi_gpu instance. Added test_on_gpu_device1 and other similar multi_gpu tests back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158034
Approved by: https://github.com/eqy, https://github.com/yushangdi
2025-07-11 02:34:31 +00:00
7d4228dbfd Fix logdet returning finite values for singular matrices on CUDA (#157910)
Fixes https://github.com/pytorch/pytorch/issues/154312

PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.

**Problem:**
The matrix from issue https://github.com/pytorch/pytorch/issues/154312 returns a finite value instead of -inf, even though it is singular.

**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal:

1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout
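
The two-tier idea can be illustrated with a pure-Python sketch on small list-of-lists matrices. This is not the code from the PR: it runs on the CPU with scalar operations, `slogdet_sketch` is a hypothetical name, and the tier-two threshold (machine epsilon times the matrix size times the largest entry) is an assumption chosen for the example.

```python
import math
import sys


def slogdet_sketch(matrix):
    """Return (sign, logabsdet) via LU with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    max_abs = max((abs(v) for row in a for v in row), default=0.0)
    # Tier 2 threshold (assumed form): pivots below this are treated as zero.
    threshold = sys.float_info.epsilon * n * max_abs
    sign, logabsdet = 1.0, 0.0
    for k in range(n):
        pivot_row = max(range(k, n), key=lambda r: abs(a[r][k]))
        pivot = a[pivot_row][k]
        # Tier 1: exactly zero pivot (analogous to a nonzero LAPACK info).
        # Tier 2: pivot that is nonzero only because of rounding error.
        if pivot == 0.0 or abs(pivot) <= threshold:
            return (0.0, -math.inf)
        if pivot_row != k:
            a[k], a[pivot_row] = a[pivot_row], a[k]
            sign = -sign  # a row swap flips the determinant's sign
        if pivot < 0:
            sign = -sign
        logabsdet += math.log(abs(pivot))
        for r in range(k + 1, n):
            factor = a[r][k] / pivot
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return (sign, logabsdet)


print(slogdet_sketch([[1, 2], [2, 4]]))                   # singular
print(slogdet_sketch([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # singular
print(slogdet_sketch([[2, 0], [0, 3]]))                   # det = 6
```

Both singular inputs come back as `(0.0, -inf)`, whether the last pivot is exactly zero or only a rounding-error residue.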

**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:

- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)

**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks

**Results:**
- All singular matrices now correctly return -inf on both CPU and CUDA
- Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
- Results match NumPy's slogdet behavior exactly
- Zero GPU synchronization points for improved performance
- Comprehensive edge case testing added

**Verification:**
Before: torch.linalg.slogdet(singular_matrix) → finite values (incorrect)
After:  torch.linalg.slogdet(singular_matrix) → (sign=0, logabsdet=-inf) 

The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-11 02:23:46 +00:00
65fcca4f8c Enable AcceleratorAllocatorConfig key check (#157908)
# Motivation
Add a mechanism that raises an error if a key in the allocator config is unrecognized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
2025-07-11 02:11:08 +00:00
905b084690 Add size_hints to cache key (#158026)
Differential Revision: D78089705

Previously, to support overriding autotune configs for post-fusion kernels in Inductor with a lookup table, we keyed only on the source code. However, the same source code can have multiple optimal configs depending on the input sizes. As a result, we had many collisions in our lookup table, leading to subpar configs. A way around this is to add the size_hints to the lookup key as well.
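
The collision problem and the fix can be sketched as follows. This is an illustration only: `lookup_key` is a hypothetical helper, hashing with SHA-256 is an assumption, and `size_hints` is modeled as a plain dict; none of this is taken from the Inductor implementation.

```python
import hashlib


def lookup_key(source_code, size_hints=None):
    """Build a lookup-table key from kernel source and, optionally, size hints."""
    h = hashlib.sha256(source_code.encode("utf-8"))
    if size_hints is not None:
        # Sort the items so the key does not depend on dict ordering.
        h.update(repr(sorted(size_hints.items())).encode("utf-8"))
    return h.hexdigest()


src = "def kernel(x): ..."  # placeholder for the generated kernel source
small = {"x": 1024}
large = {"x": 1048576}

# Keyed on source only: both input sizes collide on the same entry.
print(lookup_key(src) == lookup_key(src))
# Keyed on source plus size_hints: each input size gets its own entry.
print(lookup_key(src, small) == lookup_key(src, large))
```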

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158026
Approved by: https://github.com/jansel
2025-07-11 01:47:50 +00:00
37ccc532f7 Update 'unit_batch_dynamic_prepacked' tests to use ASSERT_NEAR instead of ASSERT_EQ (#157860) (#157861)
Summary:

Replaced ASSERT_FLOAT_EQ, which uses a fixed tolerance of kMaxUlps (4 ULPs; see gtest-internal.h), with ASSERT_NEAR, which lets us set an absolute tolerance of 1e-3. This makes the comparison tolerance tunable.

Test Plan:
**Before Fix**

✗ Fail:
qnnpack:pytorch_qnnpack_testApple - FULLY_CONNECTED_SPARSE_OP_8x1/unit_batch_dynamic_prepacked (0.0s)
'Expected equality of these values:
  output_dynamic[i * outputChannels() + c]
    Which is: 9.9160004
  accumulators_float[i * outputChannels() + c]
    Which is: 9.9159956
at 0, 17: reference = 9.9159955978393555, optimized = 9.9160003662109375
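
To put the failure above in ULP terms, the sketch below measures the float32 ULP distance between the two reported values. It is an illustrative calculation only, not part of the test; the helper names are made up for this example, and subtracting bit patterns this way is only valid for finite values of the same sign.

```python
import struct


def float32_bits(x):
    """Reinterpret x, rounded to float32, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ulp_distance(a, b):
    """Number of representable float32 values between a and b (same sign only)."""
    return abs(float32_bits(a) - float32_bits(b))


reference = 9.9159955978393555
optimized = 9.9160003662109375
print(ulp_distance(reference, optimized))  # prints 5
```

A distance of 5 ULPs exceeds the 4-ULP limit of `ASSERT_FLOAT_EQ`, which is consistent with the reported failure.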

------------------------------

**After Fix**

Everything passes

Rollback Plan:

Differential Revision: D77911682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157861
Approved by: https://github.com/kimishpatel, https://github.com/lucylq, https://github.com/malfet
2025-07-11 01:05:50 +00:00
7599bebead Add CPython test test_itertools (#156981)
Test the itertools module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156981
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801, #157802
2025-07-11 00:12:50 +00:00
397ca98510 Add CPython test test_with (#157802)
Test with statement behavior and dunder methods __enter__ and __exit__
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157802
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800, #157801
2025-07-11 00:12:50 +00:00
4809f43867 Add CPython test test_numeric_tower (#157801)
Test abstract numeric types and dunder methods like __int__, __float__, __index__, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157801
Approved by: https://github.com/zou3519
ghstack dependencies: #157799, #157800
2025-07-11 00:12:50 +00:00
0ebf2447da Add CPython test test_operator (#157800)
Test operators via operator module like add, sub, eq, lt, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157800
Approved by: https://github.com/zou3519
ghstack dependencies: #157799
2025-07-11 00:12:50 +00:00
91041f559d Add CPython test test_bool (#157799)
Test dunder methods `__bool__` and `__len__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157799
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-11 00:12:50 +00:00
ae86e8f6c8 [1/N] cost coverage improvement (#157504)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Details:
1. Fill in missing redistribute_cost for ops like `aten::detach`, `aten::bernoulli`, `aten::_to_copy`, `aten::bucketize.Tensor`, `aten::stack`, `aten::clone`, `aten::copy_`, `aten::zero_`.
2. Fix redistribute_cost error in new_factory_strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157504
Approved by: https://github.com/wconstab
2025-07-10 23:55:45 +00:00
8b68e5b1bb [ROCm][Inductor][CK] update API for gemm-multiD change (#156122)
Fixes for the compilation errors in the generated code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156122
Approved by: https://github.com/chenyang78
2025-07-10 23:12:20 +00:00
e517066f41 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 178fe7aa98987111a73534375099f4ad255e8b59.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/huydhn due to This fails some internal tests and needs to be relanded ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3059463896))
2025-07-10 23:11:18 +00:00
1a195bf7d6 Tests for #158030 (#158033)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158033
Approved by: https://github.com/bdhirsh, https://github.com/albanD
ghstack dependencies: #158030
2025-07-10 22:51:28 +00:00
bfcababbcb [OrderedDict] Implement explicit OrderedDict dunder method call (#154943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154943
Approved by: https://github.com/zou3519
ghstack dependencies: #154003, #154793, #154794, #154942
2025-07-10 22:50:39 +00:00
ba71eb496b [dict] Implement dict.__eq__ and dict.__ne__ (#154942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154942
Approved by: https://github.com/zou3519
ghstack dependencies: #154003, #154793, #154794
2025-07-10 22:50:39 +00:00
ba8d19ec02 [dict] Allow Dynamo to trace through explicit dict dunder method call (#154794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154794
Approved by: https://github.com/mlazos
ghstack dependencies: #154003, #154793
2025-07-10 22:50:39 +00:00
57d64298a0 [dict] Add dict.popitem (#154793)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154793
Approved by: https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #154003
2025-07-10 22:50:39 +00:00
e84710d1e7 [dict] Raise TypeError in dict methods (#154003)
Raise TypeError in the following scenarios:
* #args mismatch
* arg is unhashable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154003
Approved by: https://github.com/mlazos, https://github.com/zou3519
2025-07-10 22:50:39 +00:00
9bf41633d7 Allow Custom Time Unit When Printing Profiler Table (#157913)
## Overview
This PR adds a kwarg to the `table()` method of the profiler allowing users to specify a time unit to be used for all results in the profiling table. The available options are: `s`, `ms` and `us`. If an invalid unit or no unit is provided, then a time unit is selected based on the size of the value (current default behaviour).
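
The unit selection described above can be sketched in plain Python. This is only an illustration of the stated behaviour, not the profiler's code: `format_time`, `US_IN_MS` and `US_IN_SECOND` are hypothetical names, and the thresholds (1 ms and 1 s) are an assumption inferred from the example output below.

```python
US_IN_MS = 1_000.0
US_IN_SECOND = 1_000_000.0


def format_time(time_us, time_unit=None):
    """Format a duration given in microseconds (hypothetical helper)."""
    if time_unit not in ("s", "ms", "us"):
        # Invalid or missing unit: pick one from the magnitude of the value.
        if time_us >= US_IN_SECOND:
            time_unit = "s"
        elif time_us >= US_IN_MS:
            time_unit = "ms"
        else:
            time_unit = "us"
    if time_unit == "s":
        return f"{time_us / US_IN_SECOND:.3f}s"
    if time_unit == "ms":
        return f"{time_us / US_IN_MS:.3f}ms"
    return f"{time_us:.3f}us"
```

With an explicit unit every row uses the same scale, which makes columns easier to compare; without one, small and large values each get a unit suited to their magnitude.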

## Testing
A unit test has been added to verify this works correctly.

## Documentation
I couldn't find any documentation specific to the `table()` function beyond doc strings which have been updated.

## Example Output
```
import torch
from torch.profiler import profile

with profile() as prof:
    res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))

print(prof.key_averages().table(time_unit="s"))
print(prof.key_averages().table(time_unit="ms"))
print(prof.key_averages().table(time_unit="us"))
print(prof.key_averages().table())

```

```
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%        0.000s        10.36%        0.014s        0.007s             2
           aten::empty         0.04%        0.000s         0.04%        0.000s        0.000s             2
        aten::uniform_        10.27%        0.014s        10.27%        0.014s        0.007s             2
              aten::mm        89.64%        0.119s        89.64%        0.119s        0.119s             1
    aten::resolve_conj         0.00%        0.000s         0.00%        0.000s        0.000s             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 0.133s

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%       0.055ms        10.36%      13.735ms       6.868ms             2
           aten::empty         0.04%       0.054ms         0.04%       0.054ms       0.027ms             2
        aten::uniform_        10.27%      13.626ms        10.27%      13.626ms       6.813ms             2
              aten::mm        89.64%     118.892ms        89.64%     118.896ms     118.896ms             1
    aten::resolve_conj         0.00%       0.004ms         0.00%       0.004ms       0.001ms             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132.631ms

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%      55.495us        10.36%   13735.202us    6867.601us             2
           aten::empty         0.04%      54.121us         0.04%      54.121us      27.061us             2
        aten::uniform_        10.27%   13625.586us        10.27%   13625.586us    6812.793us             2
              aten::mm        89.64%  118892.284us        89.64%  118895.981us  118895.981us             1
    aten::resolve_conj         0.00%       3.697us         0.00%       3.697us       1.232us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132631.183us

----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
            aten::rand         0.04%      55.495us        10.36%      13.735ms       6.868ms             2
           aten::empty         0.04%      54.121us         0.04%      54.121us      27.061us             2
        aten::uniform_        10.27%      13.626ms        10.27%      13.626ms       6.813ms             2
              aten::mm        89.64%     118.892ms        89.64%     118.896ms     118.896ms             1
    aten::resolve_conj         0.00%       3.697us         0.00%       3.697us       1.232us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 132.631ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157913
Approved by: https://github.com/sraikund16
2025-07-10 22:44:34 +00:00
83700b4488 dist2: add group context manager (#157988)
This adds new context-manager-based PG management to dist2. It allows managing the active process group in much the same way as a stream.

```py
with dist2.process_group(pg):
   dist2.current_process_group().allreduce(...).wait()
```

matches

```py
with torch.cuda.stream(stream):
    torch.cuda.current_stream().synchronize()
```

Test plan:

```
pytest test/distributed/test_dist2.py -k context
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157988
Approved by: https://github.com/fduwjj
2025-07-10 22:30:19 +00:00
fca7013f85 Fix DCE eliminating random operations by improving is_impure() (#151524) (#157981)
DCE was incorrectly eliminating unused random operations like torch.rand() that have global RNG side effects, causing inconsistent results between eager and compiled execution modes.

**Root cause**: Python random functions (torch.rand, torch.randn, etc.) don't have the _nondeterministic_seeded attribute, so node.is_impure() returns False, allowing DCE to eliminate them despite advancing global RNG state.

**Solution**: Enhanced is_impure() in torch/fx/node.py to recognize Python random functions and mark them as impure when they use global RNG, regardless of the impure_random parameter setting. This ensures consistency between eager and compiled execution even when config.fallback_random=False.

**Key features**:
- Handles comprehensive list of random functions: rand, randn, randint, randperm, rand_like, randn_like, randint_like, normal, poisson, bernoulli, multinomial
- Generator optimization: Only marks as impure when using global RNG (no generator or generator=None). Operations with explicit generators don't affect global state and can be optimized.
- Works with both impure_random=True and impure_random=False cases
- Cleaner architecture: addresses root cause rather than working around it
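
The rule in the list above can be illustrated with a toy dead-code-elimination pass in plain Python. This is a sketch under stated assumptions, not the code in `torch/fx/node.py`: nodes are hypothetical dicts, and `RANDOM_OPS`, `is_impure` and `eliminate_dead_code` are illustrative names only.

```python
# Subset of the random functions named in the list above.
RANDOM_OPS = {"rand", "randn", "randint", "randperm", "normal", "bernoulli"}


def is_impure(node):
    """A random op is impure only when it draws from the global RNG."""
    if node["op"] in RANDOM_OPS:
        return node.get("kwargs", {}).get("generator") is None
    return False


def eliminate_dead_code(nodes, outputs):
    """Drop nodes whose results are unused, unless they are impure."""
    live = set(outputs)
    kept = []
    for node in reversed(nodes):
        if node["name"] in live or is_impure(node):
            kept.append(node)
            live.update(node.get("args", ()))
    kept.reverse()
    return kept


graph = [
    {"name": "r", "op": "rand", "args": ()},  # unused, global RNG
    {"name": "g", "op": "rand", "args": (), "kwargs": {"generator": "gen"}},
    {"name": "x", "op": "placeholder", "args": ()},
    {"name": "y", "op": "mul", "args": ("x",)},
]
kept_names = [n["name"] for n in eliminate_dead_code(graph, ["y"])]
```

Here the unused `rand` that advances the global RNG survives, while the one with an explicit generator is removed.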

**Tests**: Enhanced test_impure_random to verify both FX tracing and AOT compilation codepaths, ensuring random operations are preserved and eager/compiled execution consistency is maintained.

🤖 Generated with [Claude Code](https://claude.ai/code)

Fixes https://github.com/pytorch/pytorch/issues/151524

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157981
Approved by: https://github.com/mlazos

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-10 22:24:29 +00:00
590607c599 [cuDNN][SDPA] Bump cuDNN frontend submodule version to 1.12.1 (#158044)
Really we are just interested in this change which fixes an apparent regression for d=256 support on Hopper bc5f4fd88d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158044
Approved by: https://github.com/Skylion007
2025-07-10 22:01:18 +00:00
5f1225ef48 [EZ][BE] Delete redundant header (#157966)
Not sure why it was there in the first place. And why `Indexing.m` needed to include QScheme.h is also unclear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157966
Approved by: https://github.com/Skylion007
2025-07-10 21:59:36 +00:00
96897e721b Return false in statically_known_multiple_of if numerator has more than 20 unique symbols (#157855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157855
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #155590, #157845
2025-07-10 21:00:57 +00:00
d7e0098bf3 Fix is_unaligned usage of statically_known_true (#157845)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157845
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-10 21:00:57 +00:00
76ca23c41c [dynamo] Add FakeProcessGroup support for fx_graph_runnable with distributed collectives (#157162)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Summary:
- Modified generate_compiler_repro_string() to automatically detect distributed operations and inject FakeProcessGroup setup code
- Added distributed collective tests in test/dynamo/test_fx_graph_runnable.py using FakeProcessGroup API to test distributed collective operations
- Generated fx_graph_runnable code now runs successfully standalone when containing distributed operations

```
import os
os.environ['TORCHINDUCTOR_CACHE_DIR'] = '/var/folders/fd/kcv8m1kn0lqgxz42wvgr46sc0000gn/T/torchinductor_skarjala'

import torch
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
import torch._inductor.inductor_prims
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

import torch._dynamo.config
import torch._inductor.config
import torch._functorch.config
import torch.fx.experimental._config

torch._functorch.config.functionalize_rng_ops = False
torch._functorch.config.fake_tensor_allow_unsafe_data_ptr_access = True
torch._functorch.config.unlift_effect_tokens = True

isolate_fails_code_str = None

# torch version: 2.9.0a0+gitf23d314
# torch cuda version: None
# torch git version: f23d31463ca452918e23063409a2bdc55efc0d46

# torch.cuda.is_available()==False, no GPU info collected

from torch.nn import *
class Repro(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, arg0_1):
        all_reduce = torch.ops._c10d_functional.all_reduce.default(arg0_1, 'sum', '0')
        wait_tensor = torch.ops._c10d_functional.wait_tensor.default(all_reduce);  all_reduce = None
        mul = torch.ops.aten.mul.Tensor(wait_tensor, 2)
        copy_ = torch.ops.aten.copy_.default(arg0_1, wait_tensor);  arg0_1 = wait_tensor = copy_ = None
        return (mul,)

def load_args(reader):
    buf0 = reader.storage(None, 64)
    reader.tensor(buf0, (4, 4), is_leaf=True)  # arg0_1
load_args._version = 0
mod = Repro()
if __name__ == '__main__':
    from torch._dynamo.repro.after_aot import run_repro
    # Initialize FakeProcessGroup for distributed operations
    store = FakeStore()
    dist.init_process_group(
        backend="fake",
        rank=0,
        world_size=2,
        store=store
    )
    with torch.no_grad():
        run_repro(mod, load_args, accuracy=False, command='run', save_dir=None, tracing_mode='real', check_str=None)
        # To run it separately, do
        # mod, args = run_repro(mod, load_args, accuracy=False, command='get_args', save_dir=None, tracing_mode='real', check_str=None)
        # mod(*args)
    dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157162
Approved by: https://github.com/xmfan
2025-07-10 20:30:27 +00:00
a3ec6d64b2 Update test after CUTLASS upgrade (#157903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157903
Approved by: https://github.com/ngimel
2025-07-10 20:10:20 +00:00
8c5b070d1f Documentation Fix: torch.tensor.scatter_ docs (#157929)
updated torch.tensor.scatter_ docs to reflect proper broadcasting behavior

Fixes #157419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157929
Approved by: https://github.com/albanD
2025-07-10 19:22:52 +00:00
da4e7c77a1 [caffe2] Enable auto vectorization (#157984)
Summary:
We are testing re-enabling autovectorization in some codepaths.
It caused crashes when compiling with clang17; we now rely on clang19.

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

We are going to deploy it on ads workloads

Rollback Plan:

Differential Revision: D77448445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157984
Approved by: https://github.com/Skylion007
2025-07-10 19:19:45 +00:00
5bd7804be2 Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface (#157990)
Summary: Essentially, treat joint_custom_pre_pass/joint_custom_post_pass the same as post_grad_custom_post_pass/post_grad_custom_pre_pass.

Test Plan: More unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157990
Approved by: https://github.com/oulgen
2025-07-10 19:17:11 +00:00
e172309880 Documentation Fix: Torch gather broadcasting (#157920)
updated torch gather docs to reflect proper broadcasting behavior for specific backends

Fixes #157425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157920
Approved by: https://github.com/albanD
2025-07-10 19:08:51 +00:00
e2f64eedaf Fix DTensor handling of conjugate bit. (#158030)
Fixes https://github.com/pytorch/pytorch/issues/130646 specifically for DTensor

Fixes https://github.com/pytorch/torchtitan/issues/267

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158030
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2025-07-10 18:28:12 +00:00
2db1a54465 Add deprecation hint for accelerator APIs (#158013)
[torch.accelerator.set_device_idx](https://docs.pytorch.org/docs/stable/generated/torch.accelerator.set_device_idx.html#torch.accelerator.set_device_idx) and [torch.accelerator.current_device_idx](https://docs.pytorch.org/docs/stable/generated/torch.accelerator.current_device_idx.html#torch.accelerator.current_device_idx) are deprecated, but this is not reflected in their docs.

## Test Result

### Before
![image](https://github.com/user-attachments/assets/6e0d8c4a-d5e5-420c-8f3a-b2742f0fe263)
![image](https://github.com/user-attachments/assets/4bd99b15-31dc-4043-82e8-3d2c1dfcb57b)
![image](https://github.com/user-attachments/assets/a3d342da-79f2-4950-b17a-d01257603c97)

### After

![image](https://github.com/user-attachments/assets/faf138a8-bd92-4f31-bd7c-4414aee6da5b)
![image](https://github.com/user-attachments/assets/212456bc-1c6b-48c6-9d8c-075d5096b900)
![image](https://github.com/user-attachments/assets/49bb9c8c-203e-424e-bdc0-0f197239146e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158013
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-07-10 18:09:22 +00:00
e3f8141c25 Fix UB in BFloat16 round_to_nearest_even (#157942)
Type punning using unions is undefined behavior in C++ (you may not access a member of a union that is not the active member). bit_cast is the right way.
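
For illustration, here is a Python sketch of the same idea: reinterpret the float's bits with a well-defined conversion, then round to the nearest bfloat16. `struct` plays the role of `bit_cast`. The bias formula is the common round-to-nearest-even trick and is assumed, not confirmed from this PR, to match the C++ function; the helper names are hypothetical.

```python
import math
import struct


def bits_from_float(x):
    """Reinterpret a float32 value as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def float_from_bits(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_round_to_nearest_even(x):
    """Return the 16-bit bfloat16 pattern nearest to x (ties go to even)."""
    if math.isnan(x):
        return 0x7FC0  # canonical quiet NaN
    u32 = bits_from_float(x)
    # Add 0x7FFF, plus 1 when the kept LSB is odd, so exact ties round to even.
    rounding_bias = ((u32 >> 16) & 1) + 0x7FFF
    return ((u32 + rounding_bias) >> 16) & 0xFFFF
```

An exact tie such as the bit pattern `0x3F808000` rounds down to `0x3F80` (even), while `0x3F818000` rounds up to `0x3F82`.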

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157942
Approved by: https://github.com/Skylion007
2025-07-10 18:03:39 +00:00
a9ac9f2635 [cutlass backend] Change serialization protocol to use more json and cache (#157840)
Differential Revision: [D77949177](https://our.internmc.facebook.com/intern/diff/D77949177/)

What this diff does:
* use lru_cache for serialization and deserialization
* json dumps more. This seems to help perf.
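
A minimal sketch of the caching pattern, assuming the serialized objects can be reduced to hashable inputs. The function names and the config shape are hypothetical and are not the ones in `serialization.py`.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def serialize(config_items):
    """Serialize a tuple of (key, value) pairs; repeated inputs hit the cache."""
    return json.dumps(dict(config_items), sort_keys=True)


@functools.lru_cache(maxsize=None)
def deserialize(payload):
    """Parse a JSON string; repeated payloads hit the cache."""
    return json.loads(payload)


items = (("tile", 128), ("stages", 3))
first = serialize(items)
second = serialize(items)  # served from the cache
```

One caveat of this sketch: `deserialize` hands out the same cached dict on every hit, so callers should treat the result as read-only.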

For instantiation level 3332, the loading time decreases from 33s to 20s (a roughly 40% decrease).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157840
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #157839
2025-07-10 17:44:33 +00:00
1d0f45d5d1 [c10d][PGNCCL] Cleanup unused params for nccl comm split (#157978)
Previously we added global ranks as an input param for nccl comm split. This is no longer needed, so let's clean it up.

Differential Revision: [D78051047](https://our.internmc.facebook.com/intern/diff/D78051047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157978
Approved by: https://github.com/Skylion007
2025-07-10 17:36:23 +00:00
b40c0b61eb Make guard collective logging less chatty (#157995)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157995
Approved by: https://github.com/Microve, https://github.com/albanD, https://github.com/Skylion007
2025-07-10 17:18:37 +00:00
fb45649df7 [cutlass backend] Make config request key depend on serialization.py and cutlass_utils.py (#157839)
Differential Revision: [D77893241](https://our.internmc.facebook.com/intern/diff/D77893241/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157839
Approved by: https://github.com/ColinPeppler
2025-07-10 17:09:32 +00:00
7caf6c801d [ez][CI] Add docker instructions for linux build (#157974)
Copied from linux-test.yml

I'm not sure how necessary this is because the wiki also has this info, and has more details about it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157974
Approved by: https://github.com/huydhn
2025-07-10 16:15:28 +00:00
493bd625e2 Revert "[BE]: Reduce binary size 40% using aggressive fatbin compression. (#157791)"
This reverts commit 9bdf87e8918b9a3f78d7bcb8a770c19f7c82ac15.

Reverted https://github.com/pytorch/pytorch/pull/157791 on behalf of https://github.com/albanD due to Reverting to avoid regressing on the driver supported ([comment](https://github.com/pytorch/pytorch/pull/157791#issuecomment-3058091176))
2025-07-10 16:14:06 +00:00
4781d72faa [AOTI] codegen for static linkage (#157129)
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)

- Add codegen for static linkage
- refactor test code for test_compile_after_package tests

For now,  the following options must be used together with `"aot_inductor.compile_standalone": True`.
"aot_inductor.package_cpp_only": True,

Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR.

```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
2025-07-10 16:03:50 +00:00
9bdf87e891 [BE]: Reduce binary size 40% using aggressive fatbin compression. (#157791)
NVCC apparently has a [compression-mode flag](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#compress-mode-default-size-speed-balance-none-compress-mode) to tell it how you want to compress the fatbinary since 12.4. This mode defaults to speed (pick a low compression mode that loads the file quickly). Since we are running into PyPi size issues, this will allow us to upload smaller wheel files.

From: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compress-mode-default-size-speed-balance-none-compress-mode
```
size
Uses a compression mode more focused on reduced binary size, at the cost of compression and decompression time.
```

Up to 37.2%  reduction in binary size with virtually no drawback (except potentially a little slower loading of the .so at PyTorch startup).

694 MB for CUDA 12.9 builds with 6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX
vs
1.08GB for CUDA 12.9 builds with 7.5;8.0;8.6;9.0;10.0;12.0+PTX

CUDA 12.9 ***694MB*** vs ***1.08GB***

CUDA 12.8 ***604MB*** vs ***845MB***

This ends up saving PyPi.org approximately 19.6 PiB of bandwidth per month for the CUDA 12.9 case.

This will also allow us to add back CUDA 12.8 12.0+PTX which will make the package forward compatible on newer GPUs. Undoing the need for PR https://github.com/pytorch/pytorch/pull/157516 and https://github.com/pytorch/pytorch/pull/157634

<img alt="Screenshot 2025-07-08 at 5 36 44 PM" width="1061" src="https://private-user-images.githubusercontent.com/7563158/463890713-a53ec774-b036-4c0b-a5d5-301756e3644f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTIwNzY3OTIsIm5iZiI6MTc1MjA3NjQ5MiwicGF0aCI6Ii83NTYzMTU4LzQ2Mzg5MDcxMy1hNTNlYzc3NC1iMDM2LTRjMGItYTVkNS0zMDE3NTZlMzY0NGYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDcwOSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTA3MDlUMTU1NDUyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Yzg1OGExN2VjYmI3ZDFhNjIwZDk0NTBjOWFlZDIzYzY3MmExYTFiOGZhZjc0NTI1ZTk2YzM3YzdhYzkyYzZlMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.2-YmmfXrBFuXCrjDCQ_iTgbtbwv9xNFqM6Goc_liDKE">

More details can be found in Nvidia's technical blog for CUDA 12.4: https://developer.nvidia.com/blog/runtime-fatbin-creation-using-the-nvidia-cuda-toolkit-12-4-compiler/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157791
Approved by: https://github.com/malfet, https://github.com/atalman
2025-07-10 15:51:04 +00:00
f85954e043 Update OpenBLAS commit (#151547)
Motivation: Update OpenBLAS and change build script to enable SBGEMM kernels. Update pytorch `jammy` builds for aarch64 to use `install_openblas.sh` instead of `conda_install`

Link to full [TorchInductor Performance Dashboard AArch64](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2006%20Jun%202025%2009%3A46%3A35%20GMT&stopTime=Fri%2C%2013%20Jun%202025%2009%3A46%3A35%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/update_openblas&lCommit=0218b65bcf61971c1861cfe8bc586168b73aeb5f&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)

1. This shows a promising speedup across most of the HF models in benchmark, specifically giving a significant boost to SDPA layers.
2. Overall torch-bench pass-rate (cpp_wrapper mode) increased `[87%, 65/75 → 96%, 72/75]`

<img width="676" alt="Screenshot 2025-06-20 at 17 05 15" src="https://github.com/user-attachments/assets/2ca9c1bc-80c6-464a-8db6-b758f2476582" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151547
Approved by: https://github.com/malfet, https://github.com/snadampal, https://github.com/fadara01

Co-authored-by: Christopher Sidebottom <chris.sidebottom@arm.com>
Co-authored-by: Ryo Suzuki <ryo.suzuki@arm.com>
Co-authored-by: Ye Tao <ye.tao@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-07-10 14:58:12 +00:00
7702855228 [logging] dynamo_timed the synchronize in CachingAutotuner make_launchers (#157747)
Summary: There's some evidence that some very long compile times are actually attributable to the sync. This should make it easier to say for sure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157747
Approved by: https://github.com/aorenste, https://github.com/mlazos
2025-07-10 14:48:51 +00:00
9a5278225f [CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-10 14:38:18 +00:00
8532033679 RPC tutorial audit (#157938)
Fix [T228333894](https://www.internalfb.com/intern/tasks/?t=228333894)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157938
Approved by: https://github.com/AlannaBurke
2025-07-10 14:15:37 +00:00
8dff457f42 [simple_fsdp] Port fx pass to bucket reduce_scatters (#157780)
Porting fx passes for reduce_scatters bucketing (similar to all_gather bucketing) for simple_fsdp and autoparallel testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157780
Approved by: https://github.com/wconstab
2025-07-10 14:04:43 +00:00
a9537b626c [standalone_compile] Fix single Tensor outputs from split_module (#157803)
We assumed that the output in an FX graph would always just be a
list[Tensor], even in the single tensor return case.
It is possible for the output to be a single Tensor. This can happen
by calling torch.fx.split_module on the module.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157803
Approved by: https://github.com/oulgen
2025-07-10 12:49:03 +00:00
82765dad16 Fix logging of config_suppress_errors and config_inline_inbuilt_nn_modules (#157947)
Currently ~50% of the time we fail or crash before logging metrics, so moving where this is logged will let us have more comprehensive (less-null) data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157947
Approved by: https://github.com/masnesral, https://github.com/jovianjaison
2025-07-10 12:05:43 +00:00
cd995bfb2a [inductor] re-enable TMA templates w/ AOTI (#157819)
Follow-up from #155896: now that AOTI can codegen non-null TMA workspace args, we can re-enable TMA templates w/ AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157819
Approved by: https://github.com/drisspg
2025-07-10 08:35:29 +00:00
1e8e9f745e Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The other name, `AllocatorConfig`, is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the key-value pairs in the env variable config.

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
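
To make the format concrete, here is a small Python sketch of a parser for such a string. It is not the C++ `ConfigTokenizer`: `parse_alloc_conf` is a hypothetical helper, and leaving scalar values as strings is a simplification for illustration.

```python
def parse_alloc_conf(conf):
    """Parse 'key1:123, key2:[val1,val2]' into a dict (illustrative only)."""
    result = {}
    i, n = 0, len(conf)
    while i < n:
        # Skip separators and whitespace between pairs.
        while i < n and conf[i] in ", ":
            i += 1
        if i >= n:
            break
        colon = conf.find(":", i)
        if colon == -1:
            raise ValueError(f"expected 'key:value' at position {i}")
        key = conf[i:colon].strip()
        i = colon + 1
        while i < n and conf[i] == " ":
            i += 1
        if i < n and conf[i] == "[":
            # List value: commas inside the brackets belong to the list.
            close = conf.find("]", i)
            if close == -1:
                raise ValueError(f"unterminated list for key {key!r}")
            result[key] = [v.strip() for v in conf[i + 1:close].split(",") if v.strip()]
            i = close + 1
        else:
            comma = conf.find(",", i)
            end = n if comma == -1 else comma
            result[key] = conf[i:end].strip()
            i = end
    return result
```

The brackets are what let a comma appear inside a value without ending the pair.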

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
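
The priority order can be sketched as a simple lookup. `resolve_alloc_conf` is a hypothetical helper, and the relative order of the two legacy variables is an assumption; the description only says both rank below `PYTORCH_ALLOC_CONF`.

```python
import os

# Highest priority first. The order of the two legacy names is assumed.
ALLOC_CONF_VARS = (
    "PYTORCH_ALLOC_CONF",
    "PYTORCH_CUDA_ALLOC_CONF",
    "PYTORCH_HIP_ALLOC_CONF",
)


def resolve_alloc_conf(environ=None):
    """Return (variable name, value) for the first variable that is set."""
    env = os.environ if environ is None else environ
    for name in ALLOC_CONF_VARS:
        value = env.get(name)
        if value is not None:
            return name, value
    return None, None
```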

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-10 07:05:39 +00:00
af3d069094 [BE][Easy] remove unused build-time dependency astunparse and change astunparse.unparse -> ast.unparse (#157907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157907
Approved by: https://github.com/Skylion007
2025-07-10 07:04:42 +00:00
ba0d0de5e6 Enable set SDPA backend by torch.nn.attention.sdpa_kernel on XPU (#156669)
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism was added when a specific backend is unavailable, enhancing user configurability.

### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)

### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior for `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`)

### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-07-10 06:52:22 +00:00
4cc13c4af6 [dynamic shapes] avoid unnecessary slices (#157528)
Fixes #157289, by extending optimization to slices where the end index exceeds the size.
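
As a plain-Python illustration of why such a slice is unnecessary: when the start is 0 and the end is at or past the size, the slice returns every element. `is_full_slice` is a hypothetical helper, not the check used in the PR, and it only handles non-negative integer bounds.

```python
def is_full_slice(start, end, size, step=1):
    """True when x[start:end:step] covers all `size` elements in order."""
    return step == 1 and start == 0 and end >= size


data = list(range(4))
clamped = data[0:10]  # the end index is clamped to the size
```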

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157528
Approved by: https://github.com/angelayi
2025-07-10 06:34:46 +00:00
565fd07909 [Easy] Make the error message shown by THPUtils_unpackLong to be clearer (#157886)
As the title stated.

The error message of `THPUtils_unpackLong` is the same as `THPUtils_unpackInt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157886
Approved by: https://github.com/Skylion007
2025-07-10 06:26:13 +00:00
b85f10ea50 [BE] Replace std::runtime_error with TORCH_CHECK [2/N] (#152080)
Part of: #148114

Related commits:

- #151880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-10 06:02:47 +00:00
fadc936fad Updates to build and test on Noble (Ubuntu24.04) and py3.12 (#152240)
This PR enables Ubuntu24.04 testing on CI:
* Builds a base docker image using Noble (Ubuntu24.04) and py3.12 for ROCm N version
* Builds and tests PyTorch on Ubuntu24.04 as part of the `rocm-mi300` workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152240
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2025-07-10 05:55:42 +00:00
b7860c7863 Implement fast exp for AVX2 and AVX512 for the flash attention (#151441)
**Implement fexp for avx2 and avx512**

Malossi et al. propose a clever exp using the IEEE representation with fine control of the precision, especially useful
for the mixed-precision computation of flash attention.

- Implements "Fast Exponential Computation on SIMD Architectures",
  A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, and Alessandro Curioni
- AVX2 and AVX512, float only; up to 20% faster for mixed-precision flash attention
  than the current implementation.
- Other types use the legacy implementation.
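
The general idea can be sketched in scalar Python: write `exp(x)` as `2**i * 2**f`, build `2**i` by placing `i` directly in the IEEE exponent field, and approximate the fractional part with a short polynomial. The polynomial below is a plain degree-5 Taylor expansion chosen for illustration. It is not the set of coefficients from the paper or from this PR, and the sketch assumes the result stays in the normal float32 range.

```python
import math
import struct

LOG2E = 1.4426950408889634  # log2(e)
LN2 = 0.6931471805599453    # ln(2)


def pow2_int(i):
    """Build 2**i as a float32 by writing the biased exponent field directly."""
    bits = (i + 127) << 23  # valid for -126 <= i <= 127
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fast_exp(x):
    """Approximate exp(x) as 2**i * 2**f with i an integer and 0 <= f < 1."""
    y = x * LOG2E
    i = math.floor(y)
    f = y - i
    t = f * LN2  # 2**f == e**t, with 0 <= t < ln(2)
    # Degree-5 Taylor polynomial of e**t in Horner form.
    p = 1 + t * (1 + t * (1 / 2 + t * (1 / 6 + t * (1 / 24 + t / 120))))
    return pow2_int(i) * p
```

A SIMD version would apply the same steps per lane; the precision is tuned through the degree and coefficients of the polynomial.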

**Precision**

1 ULP, valid only in hybrid mode (fp32 -> f16), due to the cast during the
store operation in flash attention.

**Benchmark**

Machine Xeon 6972P, results in TOPs, Python forward pass flash attention

numhead 16, Head dimension 64

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 0.8  | 1.3  |
| 1024  | 1.7  | 1.7  |
| 2048  | 6    | 6.1  |
| 4096  | 16   | 16.8 |
| 8192  | 30.6 | 32.3 |
| 16384 | 40   | 40.8 |
| 32768 | 44.9 | 51.4 |
| 65536 | 45.8 | 54.4 |

numhead 16, Head dimension 128

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 2.5  | 4.1  |
| 1024  | 3.3  | 4    |
| 2048  | 11.4 | 10.5 |
| 4096  | 27.4 | 28.4 |
| 8192  | 44.4 | 46   |
| 16384 | 64.2 | 68.1 |
| 32768 | 77.8 | 83   |
| 65536 | 82.1 | 88.1 |

numhead 16, Head dimension 256

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 1.7  | 3.4  |
| 1024  | 4.2  | 6.5  |
| 2048  | 14.6 | 16.1 |
| 4096  | 30.1 | 31.1 |
| 8192  | 60   | 62   |
| 16384 | 83.3 | 87.3 |
| 32768 | 98.7 | 106  |
| 65536 | 102.2| 107.1|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151441
Approved by: https://github.com/mingfeima
2025-07-10 05:51:31 +00:00
9222552572 [non-strict export] uncovered cases of select and slice (#157821)
Summary:
`None` and `Ellipsis` in multi-dimensional indexing were previously not covered.

Moreover, we introduce a small optimization for `slice(None)` and a passthrough when symints do not appear in the indexing.

The remaining case is where indexing is by tensor, which is fairly complicated; we passthrough in that case.

Test Plan:
added tests

Rollback Plan:

Differential Revision: D77943929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157821
Approved by: https://github.com/pianpwk
2025-07-10 05:48:12 +00:00
3584e84c24 Fixed the function to get the origin nodes of fused triton kernel. (#157578)
Summary:
This diff fixes the following issue:
In the Python source code for CompiledFxGraph, the FX graph segment for the Triton kernel is broken. For example, take the following function:
  def fn(a, b, c):
      x = torch.nn.functional.linear(a, b)
      x = x.sin()
      x = x.t() + c
      return x
Inductor compiled this FX graph into two nodes: the first one is mm, the second one is a triton kernel for sin + transpose + add. The FX graph segment for the triton kernel is like the following:
Graph fragment:
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %arg2_1), kwargs = {})
That is, only the "add" node appears in the FX graph segment.
The root cause is that the function caffe2/torch/_inductor/utils.py:gather_origins does not detect realized nodes correctly.
To fix this issue, the IRNode is checked against the following IRNode types:
    ir.ComputedBuffer,
    ir.InputsKernel,
    ir.InputBuffer,
    ir.ReinterpretView,
    ir.TemplateBuffer,

If it is one of them, it is realized; otherwise, it is not.

Test Plan:
buck2 run mode/opt caffe2/test/inductor:provenance_tracing -- caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact.test_triton_kernel_to_post_grad_tracing_cuda

Rollback Plan:

Differential Revision: D77748371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157578
Approved by: https://github.com/mlazos
2025-07-10 05:34:50 +00:00
b146ca74f0 docs: add get_default_backend_for_device to distributed documentation (#156783)
The `torch.distributed.get_default_backend_for_device()` API was added in torch 2.6, but is still missing from the distributed documentation. This commit addresses the gap.

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-10 05:11:30 +00:00
eddddea908 Upgrade MKL in CI (#154198)
This PR upgrades MKL in CI: the PyTorch release uses MKL 2024.2, while CI uses MKL 2021.4, so CI cannot catch issues such as https://github.com/pytorch/pytorch/issues/154477 that are caused by the MKL upgrade in the release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154198
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
ghstack dependencies: #154585
2025-07-10 05:09:51 +00:00
80bcaa4195 have dynamic sources only apply to sizes and not strides (#157960)
@animesh pointed out that using the whitelist for strides can result in confusing graphs such as the following

```
s60: "Sym(s60)", L_hidden_states_: "bf16[1, 4096, 3072][s60, 3072, 1]cuda:0"
```

We probably want to capture the relationship between sizes and strides anyway, so let's make it so the whitelist only makes the sizes dynamic. That same graph now looks like this

```
L_hidden_states_: "bf16[1, 4096, 64][262144, 64, 1]cuda:0"
```
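
The relationship between sizes and strides mentioned above can be illustrated with a small pure-Python sketch. It assumes a contiguous (row-major) layout; `contiguous_strides` is a hypothetical helper written for this note, not a PyTorch function.

```python
def contiguous_strides(sizes):
    """Return row-major (contiguous) strides, in elements, for the given sizes."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)  # stride of this dim = product of the sizes to its right
        running *= size
    return list(reversed(strides))


if __name__ == "__main__":
    # Sizes taken from the second graph above.
    print(contiguous_strides([1, 4096, 64]))  # → [262144, 64, 1]
```

Because each stride is a product of trailing sizes, making only the sizes dynamic lets the strides follow from them instead of appearing as an unrelated symbol.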

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157960
Approved by: https://github.com/pianpwk
2025-07-10 05:03:51 +00:00
88cd9f34b0 [audio hash update] update the pinned audio hash (#157873)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157873
Approved by: https://github.com/pytorchbot
2025-07-10 04:59:50 +00:00
2b19d85d70 FractionalMaxPool3d add kernel_size check (#155549)
Fixes #96316

## Test Result

```python
>>> import torch
>>> from torch.func import jacrev, grad, vmap
>>>
>>> torch.manual_seed(420)
<torch._C.Generator object at 0x7fe4767810d0>
>>>
>>> input = torch.randn(1, 1, 5, 5, 5, requires_grad=True)
>>>
>>> def func(input):
...     model = torch.nn.FractionalMaxPool3d(kernel_size=0, output_size=(1, 1, 1))
...     output = model(input)
...     return output
...
>>>
>>> func(input).sum().backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in func
  File "/home/zong/code/pytorch/torch/nn/modules/pooling.py", line 1054, in __init__
    raise ValueError(f"kernel_size must greater than 0, but got {kernel_size}")
ValueError: kernel_size must greater than 0, but got 0

```

![image](https://github.com/user-attachments/assets/52780ce7-3951-4d1c-95a4-5ce2bf65c727)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155549
Approved by: https://github.com/albanD
2025-07-10 04:55:06 +00:00
06a40b6850 Fix MKL error: Inconsistent configuration parameters (#154585)
Fixes #154477.

The PyTorch release uses MKL 2024.2, which changes how DFTI must be used: if `DFTI_NUMBER_OF_TRANSFORMS > 1`, `DFTI_INPUT_DISTANCE` and `DFTI_OUTPUT_DISTANCE` also need to be explicitly set to a positive integer. In addition, the requirement "the datasets to be transformed cannot contain common elements" should also be satisfied. This means that we need to avoid the case where the input strides contain 0.
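
A minimal sketch of the zero-stride condition, assuming sizes and strides are given as plain sequences of integers. `has_zero_stride_overlap` is a hypothetical helper for illustration; the actual fix lives in the C++ MKL FFT path and is not shown here.

```python
def has_zero_stride_overlap(sizes, strides):
    """Return True if some dimension with more than one element has stride 0.

    In that case several logical elements alias the same memory location, so
    batched datasets would "contain common elements".
    """
    return any(size > 1 and stride == 0 for size, stride in zip(sizes, strides))


if __name__ == "__main__":
    print(has_zero_stride_overlap([4, 8], [8, 1]))  # contiguous batch of 4 rows
    print(has_zero_stride_overlap([4, 8], [0, 1]))  # one row expanded 4 times
```

An input flagged by such a check would need to be materialized into non-overlapping memory before being handed to a batched transform.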

See https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2024-2/configuring-data-layouts.html and https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-2/dfti-number-of-transforms.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154585
Approved by: https://github.com/leslie-fang-intel, https://github.com/soumith, https://github.com/malfet
2025-07-10 03:42:38 +00:00
0a624c2dc5 Fix from_node's graph_id in unlift() (#157943)
Summary: We should use the node before deepcopy in NodeSource

Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```

Rollback Plan:

Differential Revision: D78022070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157943
Approved by: https://github.com/angelayi, https://github.com/Gasoonjia
2025-07-10 03:23:55 +00:00
4cfc0a3208 [Inductor] Introduce Lookup Table for Overriding Triton Kernel autotune configs post fusion (#157924)
Summary:
Introduce lookup table for kernels post fusion, hashing on inductor generated source code
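
A lookup table keyed on generated source code might look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the SHA-256 choice, and the config keys are hypothetical and are not taken from the Inductor implementation.

```python
import hashlib


def kernel_key(source_code: str) -> str:
    """Hash the generated kernel source so identical kernels map to the same key."""
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()


def lookup_config(table: dict, source_code: str, default: dict) -> dict:
    """Return the override config for this kernel, or the default if none is registered."""
    return table.get(kernel_key(source_code), default)


if __name__ == "__main__":
    fused_src = "def triton_fused_sin_add(...): ..."  # placeholder for generated source
    table = {kernel_key(fused_src): {"XBLOCK": 256, "num_warps": 4}}
    print(lookup_config(table, fused_src, default={"XBLOCK": 128, "num_warps": 8}))
    print(lookup_config(table, "other kernel", default={"XBLOCK": 128, "num_warps": 8}))
```

Hashing the post-fusion source means the override applies only when the generated kernel is byte-for-byte the one that was tuned.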

Rollback Plan:

Differential Revision: D77866885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157924
Approved by: https://github.com/jansel
2025-07-10 03:23:50 +00:00
3232b57cd8 Updates to safetensors checkpoint consolidation script to be faster (#157936)
Summary:
- adding mmap-ing
- more efficient writing in larger chunks

Latency drops from ~150s to ~6s for a simple row-wise consolidation of a 7 GB model sharded across 4 ranks.
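
The two techniques listed above (memory-mapping the input and writing in larger chunks) can be sketched with the standard library alone. This is a simplified illustration, not the consolidation script itself: `append_shard` and `CHUNK_SIZE` are hypothetical names, and real safetensors consolidation also has to rewrite headers and tensor offsets.

```python
import mmap
import os

CHUNK_SIZE = 8 * 1024 * 1024  # hypothetical chunk size: 8 MiB per write call


def append_shard(shard_path, out_file, chunk_size=CHUNK_SIZE):
    """Copy one shard into out_file via a read-only mmap, writing large chunks.

    Returns the number of bytes copied.
    """
    with open(shard_path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for start in range(0, size, chunk_size):
                out_file.write(view[start:start + chunk_size])
    return size
```

A caller would open the consolidated file once in `"wb"` mode and call `append_shard` for each rank's file in order.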

Test Plan:
ran consolidation with the following code:

```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time

start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```

With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors

Rollback Plan:

Differential Revision: D77960054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
2025-07-10 02:50:20 +00:00
3404c1f0cf [HF][DCP] Upload local consolidated files to remote storage if needed (#157371)
If the final output file is in remote storage, then create a local temp directory to write the files and upload the files to the remote storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.
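
The fallback rule described above can be written as a small decision function. The parameter names `enable_consolidation` and `consolidation_output_path` come from the description; the function itself and `sharded_path` are hypothetical and only sketch the intended behavior.

```python
from typing import Optional


def resolve_consolidated_dir(
    sharded_path: str,
    enable_consolidation: bool,
    consolidation_output_path: Optional[str] = None,
) -> Optional[str]:
    """Return where consolidated safetensors should be written, or None if disabled."""
    if not enable_consolidation:
        return None
    # No explicit output path: write next to the sharded files.
    return consolidation_output_path or sharded_path


if __name__ == "__main__":
    print(resolve_consolidated_dir("/ckpt/sharded", True))
    print(resolve_consolidated_dir("/ckpt/sharded", True, "/ckpt/merged"))
    print(resolve_consolidated_dir("/ckpt/sharded", False, "/ckpt/merged"))
```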

Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
2025-07-10 02:40:25 +00:00
aab949aa96 Deprecated pkg_resources and use distributions instead (#151915)
As the title stated.
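
For context, a minimal sketch of the replacement pattern, assuming "distributions" refers to `importlib.metadata.distributions` from the standard library. The helper names are hypothetical and are not the code changed by this PR.

```python
from importlib.metadata import PackageNotFoundError, distributions, version


def installed_packages():
    """Map distribution name -> version for everything visible on sys.path."""
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            found[name] = dist.version
    return found


def version_or_none(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(len(installed_packages()), "distributions found")
    print(version_or_none("pip"))
```

Unlike `pkg_resources`, this needs no setuptools import at runtime.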
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151915
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD
2025-07-10 01:51:26 +00:00
6442ae9256 Make the name assert actually do something, and reserve some more names (#157342)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157342
Approved by: https://github.com/albanD
2025-07-10 01:39:40 +00:00
db188503cb [BE] Remove stale pyre-fixme (#157816)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157816
Approved by: https://github.com/Skylion007, https://github.com/jingsh, https://github.com/albanD
2025-07-10 01:33:32 +00:00
693116f765 [doc] DeviceMesh invariant on DTensorSpec (#157806)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157806
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #157805
2025-07-10 01:27:40 +00:00
9a4ac71b58 [doc] Document an invariant in OpSpec (#157805)
I am not sure if this is actually true though, please reject this PR if it is not.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157805
Approved by: https://github.com/wanchaol, https://github.com/zpcore
2025-07-10 01:27:40 +00:00
8387984257 Improve error message for torch.binomial enforcing float inputs (#157658)
Fixes #157195
### Summary:
 Fixed Issue 157195 by adding a new error message for torch.binomial in **aten/src/ATen/native/Distributions.cpp**

### Explanation
 According to the issue,
```
import torch
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
```
`RuntimeError: Found dtype Float but expected Long`

 It looks like we are getting a Tensor error rather than a binomial function error. Since the error is coming from **pytorch/aten/src/ATen/TensorIterator.cpp**,  it seems like it is trying to align the tensor data to the same datatype for smooth tensor computations instead of giving a binomial function error.

I tried using both arguments as longs and both as ints and got the right binomial function error
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```

```
torch.binomial(torch.tensor([10.0]).int(), torch.tensor([0.5]).int())
NotImplementedError: "binomial_cpu" not implemented for 'Int'
```

But when I have both as different datatypes, the TensorIterator.cpp error comes back trying to align the datatypes.
`RuntimeError: Found dtype Float but expected Long`

I then tried finding where the NotImplementedError comes from and found it in **pytorch/aten/src/ATen/Dispatch.h** in lines 193 - 211

```
#define AT_DISPATCH_SWITCH(TYPE, NAME, ...)                                 \
  [&] {                                                                     \
    const auto& the_type = TYPE;                                            \
    constexpr const char* at_dispatch_name = NAME;                          \
    /* don't use TYPE again in case it is an expensive or side-effect op */ \
    at::ScalarType _st = ::detail::scalar_type(the_type);                   \
    RECORD_KERNEL_FUNCTION_DTYPE(at_dispatch_name, _st);                    \
    switch (_st) {                                                          \
      __VA_ARGS__                                                           \
      default:                                                              \
        TORCH_CHECK_NOT_IMPLEMENTED(                                        \
            false,                                                          \
            '"',                                                            \
            at_dispatch_name,                                               \
            "\" not implemented for '",                                     \
            toString(_st),                                                  \
            "'");                                                           \
    }                                                                       \
  }()
```
 The **AT_DISPATCH_SWITCH** macro takes a tensor's datatype and checks whether it matches one of the supported datatypes. If not, we get the NotImplementedError. Unfortunately, I think the **AT_DISPATCH_SWITCH** macro uses the `common_dtype` from TensorIterator in order to run, so the TensorIterator.cpp check has to happen before the AT_DISPATCH_SWITCH macro.

###  Summary: We are getting the wrong error message because **TensorIterator.cpp** gets called and errors out due to Tensor datatype mismatch before we can get the right error message in **Dispatch.h**  for torch.binomial not supporting that datatype.

### Options for the Fix
**Option 1**: Make the error message in TensorIterator.cpp more general so it applies to torch.binomial. An error message along the lines
`RunTime Error : "Tensor Datatypes", op.target_dtype," and ", common_dtype_, "are different "`

**Option 2**: Add an error message for the binomial function datatype mismatch before the TensorIterator.cpp error message gets called.

Although Option 1 seemed easier, I think Option 2 might be better, as it is more specific to the binomial function, while Option 1 would affect all Tensors with a datatype mismatch.

 **This PR applies the fix for Option 2**

After Fix :
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
RuntimeError: Binomial function arguments count and prob must have same datatype of type Float, got: count = Long, prob = Float
```
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157658
Approved by: https://github.com/soulitzer
2025-07-10 00:58:56 +00:00
54a7e5b598 _aot_export_function: allow keeping input mutations in the graph (#157730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157730
Approved by: https://github.com/ezyang
2025-07-10 00:47:51 +00:00
ed03492238 Add check nested_tensor_from_jagged param jagged_dim >= 1 (#157770)
Fixes #157404

## Test Result

```bash
pytest test/test_nestedtensor.py

...............................................s..........ssssss.................................................................................................s.s..sssss..s...ss............................................................. [ 44%]
...........................................................sssss....sss...s.........ss....s....sss.........s.sss...s..s......s............s.sss.ss...............s.....................s....s......................s.s.....s....s..s..ssssssssss [ 59%]
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss..ssssss.ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.ssssssss...............................s........................................... [ 74%]
.......sss...................................................................................................................................................................................................................................... [ 89%]
....sss..........................................................................................................................................................                                                                                [100%]

==================================================================================================== 1317 passed, 258 skipped in 2504.27s (0:41:44) ====================================================================================================
```

![image](https://github.com/user-attachments/assets/dcc8e46d-b88f-4580-b4ad-0999bad33ec9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157770
Approved by: https://github.com/soulitzer

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
2025-07-10 00:34:39 +00:00
752f202ef3 [PGO] include module int attributes in PGO state (#157518)
Dynamo specializes on int module attributes by default. This includes them in PGO state despite specialization, if they're involved in guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157518
Approved by: https://github.com/bobrenjc93
2025-07-09 23:57:54 +00:00
ed051c3084 torch.distributed: add initial _dist2 prototype API (#157841)
This adds the initial dist2 API as proposed in https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

This is a WIP experimental API and is a sandbox for a number of new features and quality of life improvements/changes to c10d.

Test plan:

```
pytest test/distributed/test_dist2.py
```

Docs

```
cd docs
make html
```

![Screenshot 2025-07-08 at 13-39-23 Object Oriented Distributed API - torch distributed _dist2 — PyTorch main documentation](https://github.com/user-attachments/assets/9c03a7ec-09e5-42b9-8478-1ec28bc2b6bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157841
Approved by: https://github.com/fduwjj
2025-07-09 23:40:43 +00:00
39456edbba [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-09 22:14:20 +00:00
a1dad2f2d2 [BE][Ez]: Autotype torch/profiler with ruff ANN (#157923)
Apply ruff autotyping fixes to add annotations to torch profiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157923
Approved by: https://github.com/albanD, https://github.com/sraikund16
2025-07-09 22:07:50 +00:00
53ab73090e [inductor] support unbacked symint in sdpfa (#157739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157739
Approved by: https://github.com/laithsakka
2025-07-09 22:01:29 +00:00
08e9dd280f [ONNX] Support symbolic arguments in onnx exporter (#157734)
Prior to this PR, torch.onnx.export(..., dynamo=True, verify=True, report=True) did not support symbolic arguments. An example is the following:

```python
class M(torch.nn.Module):
    def forward(self, a, x):
        return a + torch.tensor(1) + x

op = torch.onnx.export(M(), (1, torch.ones(2)),
                       dynamic_shapes=(torch.export.Dim.DYNAMIC, {0: torch.export.Dim.DYNAMIC}),
                       dynamo=True, report=True)
```

Symbolic arguments are like constant arguments in that they don't have tensor_meta either. Besides, torch.export.export supports model inputs having constants, which is different from the legacy issue: https://github.com/pytorch/pytorch/issues/99534 where we tried to get the FX directly from dynamo export. Thus, `_remove_non_tensor` is deleted from args processing.

NOTE: If the ConstantArgument shows up in exported_program, it was kept to align the length of inputs to nn.Module, but it's irrelevant to the model graph, which is why the input is omitted in the ONNX model.

The test `test_constant_argument_user_input_is_omitted_in_onnx_graph` needs #157719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157734
Approved by: https://github.com/justinchuby
2025-07-09 21:15:45 +00:00
163f0d8f2a [BE][Ez]: Auto add return type annotations for methods in torch/nn/module (#157925)
Automatically type a bunch of methods in nn.Module using ruff's type inference rules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157925
Approved by: https://github.com/albanD
2025-07-09 21:12:25 +00:00
f742b32a2f [dynamo] Avoid recompiling over unused objects (#156891)
Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in
`STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This caused
regional `torch.compile`, when optimizing ComfyUI GGUF + LoRA, to either
(1) exceed the recompilation limit of 8, which results in suboptimal
performance, or (2) even if the recompilation limit is increased, incur
unnecessarily high compilation time (180s vs. 20s for Flux).

This patch fixes the recompilation issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2025-07-09 20:14:34 +00:00
317520bf6e Add an ovrsource target for torch/headeronly (#157912)
Summary: no idea how this works

Test Plan:
will things just pass?

Rollback Plan:

Differential Revision: D77965219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157912
Approved by: https://github.com/albanD
2025-07-09 19:32:03 +00:00
dfa2649434 Revert "[Inductor] Fix epilogue fusion decision with 1 Triton caller as choice (#156500)"
This reverts commit c48d0f4643b7a69ebe24069e932ce1465a31cdbe.

Reverted https://github.com/pytorch/pytorch/pull/156500 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156500#issuecomment-3053680762))
2025-07-09 18:56:10 +00:00
52772765e0 Change AOTI_RUNTIME_DEVICE_CHECK to be device specific (#157818)
Summary:
Change AOTI_RUNTIME_DEVICE_CHECK to the following depending on device:

AOTI_RUNTIME_CUDA_CHECK
AOTI_RUNTIME_XPU_CHECK
AOTI_RUNTIME_CPU_CHECK

Currently in the codebase, only `AOTI_RUNTIME_CUDA_CHECK` is used.

This shouldn't change anything as of now, but we do this to prepare for simultaneously loading multiple backends (e.g., CPU and CUDA) in AOTI standalone.

We don't want people writing `AOTI_RUNTIME_DEVICE_CHECK` for both CPU and CUDA checks. This could cause compilation problems when we statically link both CPU and CUDA models.

Test Plan:
CI

Rollback Plan:

Reviewed By: muchulee8

Differential Revision: D77742977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157818
Approved by: https://github.com/jingsh
2025-07-09 18:34:56 +00:00
c54778625e Update is_sparse doc to mention that it is sparse_coo specific (#157378)
## Issue being addressed
`is_sparse` presents itself as determining whether a tensor is sparse. However, it only checks the tensor for `sparse_coo`. This has led to confusion among developers, because when non-COO sparse tensors are provided it returns False, despite those tensors being sparse.

## Considered Remedy
Fixing this is doable, but it would add complexity: existing systems may depend on the current behavior, and even inside PyTorch `is_sparse` is used by `bform`, which states that it supports only `sparse_csr and sparse_coo`, meaning additional work and thought would have to go into handling `sparse_csc` and `sparse_bsr`.

## Remedy provided in this PR
In light of these complications, the lowest-risk, highest-gain action was to add a clear warning to the function's documentation for now, to avoid confusing developers who use it. The rest of the function's behavior remains identical.

## Issue content
Addresses issue number: #101385
Original issue: https://github.com/pytorch/pytorch/issues/101385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157378
Approved by: https://github.com/soulitzer
2025-07-09 18:22:14 +00:00
81c7445eb9 [FSDP2] Use reduceOpSum for world size 1 (#157529)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157529
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/weifengpy
2025-07-09 18:08:48 +00:00
28aae93f24 [Memory Snapshot] Fix Linter for Global Annotations flag in Snapshot (#157858)
Summary: We added the ability to make annotations global or local based on an input flag in PyTorch but didn't add the args to the linter

Reviewed By: mzzchy

Differential Revision: D77959409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157858
Approved by: https://github.com/mzzchy
2025-07-09 17:28:22 +00:00
b354328ecd [AOTI] add flag AOT_INDUCTOR_ENABLE_LTO (#157773)
Add the env var AOT_INDUCTOR_ENABLE_LTO; setting AOT_INDUCTOR_ENABLE_LTO=1 enables clang's ThinLTO. LTO is disabled by default because it may increase the build time.
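
An opt-in environment flag of this kind is typically read as in the sketch below. The helper names are hypothetical, and the mapping to `-flto=thin` (clang's ThinLTO flag) is an assumption about how the flag might be consumed, not the PR's actual build logic.

```python
import os


def lto_enabled(environ=None):
    """Return True only when AOT_INDUCTOR_ENABLE_LTO is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get("AOT_INDUCTOR_ENABLE_LTO", "0") == "1"


def extra_compile_flags(environ=None):
    """Hypothetical mapping from the env var to clang's ThinLTO flag."""
    return ["-flto=thin"] if lto_enabled(environ) else []


if __name__ == "__main__":
    print(extra_compile_flags({}))                                # disabled by default
    print(extra_compile_flags({"AOT_INDUCTOR_ENABLE_LTO": "1"}))  # opt in
```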

Rollback Plan:

Differential Revision: D77899195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
2025-07-09 16:54:19 +00:00
d75d30eeb6 [DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.

It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names`, which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` that it belongs to entirely, as the other check, `_deprecate_warnings`, added two years ago, is no longer used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2025-07-09 16:49:34 +00:00
cb711c8fa0 Revert "[BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)"
This reverts commit 754699610b0abec2fe3f5a73269b1dd09a330445.

Reverted https://github.com/pytorch/pytorch/pull/157199 on behalf of https://github.com/malfet due to It breaks `lintrunner init` for default environments, see https://github.com/pytorch/pytorch/issues/152999 ([comment](https://github.com/pytorch/pytorch/pull/157199#issuecomment-3053279711))
2025-07-09 16:26:47 +00:00
981c99fdff Uninstall brew miniconda while running MacOS testing (#156898)
Having it installed results in torch.compile being unable to produce working artifacts.
It is reinstalled later, when testing is done.

Should fix https://github.com/pytorch/pytorch/issues/156833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156898
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-07-09 16:02:55 +00:00
054cd4ca28 [CPU Generator] Remove the unused CPUGeneratorImplStateLegacy in set_state (#153934)
As the title stated.

The old state named CPUGeneratorImplStateLegacy in set_state is no longer used,
so just remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153934
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/atalman
2025-07-09 15:45:19 +00:00
f4d60a68dd Adding a change to kick off the theme pull (#157732)
Adding a small change so that the Docker container is rebuilt and reflects the latest changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157732
Approved by: https://github.com/malfet
2025-07-09 15:43:00 +00:00
6defd5084e Revert "[PT2][memory] mutation size correctness (#157562)"
This reverts commit 86670b39fa3df63a652a9a06b59b73f92d70c392.

Reverted https://github.com/pytorch/pytorch/pull/157562 on behalf of https://github.com/xuanzhang816 due to internal_test_failure ([comment](https://github.com/pytorch/pytorch/pull/157562#issuecomment-3053115025))
2025-07-09 15:38:29 +00:00
b4e3c9ea34 [ez][CI][testing] Set upload artifacts while running to default true if in CI (#157868)
I was confused about why the distributed tests weren't showing up quickly on HUD. It's because the call of run_tests.py for distributed didn't include the upload-artifacts-while-running flag, so set it to default to IS_CI so I don't need to put the flag everywhere.
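
Defaulting a boolean flag to the CI state can be sketched with `argparse`. The flag spelling and the use of the `CI` environment variable to derive `IS_CI` are assumptions for illustration; the test runner's real argument parser is not reproduced here.

```python
import argparse
import os

# Assumption: CI is detected through the conventional CI environment variable.
IS_CI = bool(os.getenv("CI"))


def build_parser(is_ci=IS_CI):
    """Build a parser whose upload flag defaults to True when running in CI."""
    parser = argparse.ArgumentParser(description="test runner sketch")
    parser.add_argument(
        "--upload-artifacts-while-running",  # hypothetical spelling of the flag
        action="store_true",
        default=is_ci,
        help="upload artifacts as tests finish instead of at the end",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args([]).upload_artifacts_while_running)
```

With this default, CI invocations get early uploads without every call site passing the flag, while local runs keep the previous behavior unless the flag is given.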
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157868
Approved by: https://github.com/huydhn
2025-07-09 15:21:25 +00:00
fcc682be4b [BE][Ez]: Fully type nn.utils.clip_grad (#154801)
Fully types clip_grad and exposes typing annotations that were hidden by a bad decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154801
Approved by: https://github.com/jansel
2025-07-09 14:27:51 +00:00
ed6ae20cf0 [BE][Ez]: Update mimalloc submodule to 2.2.4 (#157794)
Includes a few minor bugfixes over the previous release and better compiler support. Should be a NOOP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157794
Approved by: https://github.com/atalman
2025-07-09 14:03:07 +00:00
02a9d9095f [BE] remove commented out code in c10/ovrsource_defs.bzl (#157856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157856
Approved by: https://github.com/swolchok, https://github.com/albanD
2025-07-09 13:28:56 +00:00
86eaf452c3 [Easy][Profiler] Fix pattern matcher of profiler (#157711)
Per title, as it fails with the following error if "+PTX" was used in `TORCH_CUDA_ARCH_LIST`:
```
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in skip
    has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in <genexpr>
    has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
                   ^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'pute_120'
```
This happens because slicing `arch[3:]` does not leave only digits for the `compute_120` element of `torch.cuda.get_arch_list()`:
```python
>>> torch.cuda.get_arch_list()
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
```
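
One way to parse such entries robustly is to extract the digits after the underscore instead of slicing at a fixed offset. This is a sketch of the idea only, not necessarily the change made in this PR; `arch_number` is a hypothetical helper, and the list is copied from the output above so the snippet runs without CUDA.

```python
import re


def arch_number(arch: str) -> int:
    """Extract the numeric capability from strings like 'sm_90' or 'compute_120'."""
    match = re.search(r"_(\d+)", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    return int(match.group(1))


arch_list = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120", "compute_120"]
has_tf32 = all(arch_number(arch) >= 80 for arch in arch_list)

if __name__ == "__main__":
    print([arch_number(arch) for arch in arch_list])
    print(has_tf32)
```

The regular expression also tolerates suffixed entries such as `sm_90a`, which a fixed slice would reject.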

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157711
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2025-07-09 12:09:46 +00:00
297daa1d30 [aarch64] Add sm_80 to CUDA SBSA build (#157843)
related to https://github.com/pytorch/pytorch/issues/152690

This adds sm_80 to CUDA SBSA builds (12.9), so that we will be able to support Ampere family (e.g: sm_86) and Ada family (e.g: sm_89) on CUDA SBSA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157843
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-07-09 11:46:34 +00:00
a355158fcb [Easy] Fix the compilation warning (#157889)
**Background:**

```Shell
[1376/2332] Building CUDA object caffe2/CMakeFiles/torch_...h/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu.o
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
      size_t numelIn_ = -1;
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
      size_t numelOut_ = -1;
                         ^

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
      size_t numelIn_ = -1;
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
      size_t numelOut_ = -1;
```
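
The warning arises because `size_t` is unsigned, so initializing it with `-1` wraps around to the largest representable value. The Python sketch below uses `ctypes` purely to illustrate that wraparound; it is not part of the fix.

```python
import ctypes

# size_t is unsigned, so storing -1 wraps around to the maximum value.
bits = ctypes.sizeof(ctypes.c_size_t) * 8
wrapped = ctypes.c_size_t(-1).value

print(wrapped == 2**bits - 1)
```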

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157889
Approved by: https://github.com/mlazos
2025-07-09 11:41:02 +00:00
4dce5b71a0 [build] modernize build-frontend: python setup.py develop/install -> [uv ]pip install --no-build-isolation [-e ]. (#156027)
Modernize the development installation:

```bash
# python setup.py develop
python -m pip install --no-build-isolation -e .

# python setup.py install
python -m pip install --no-build-isolation .
```

Since `setuptools>=80.0`, `python setup.py develop` is a wrapper around `python -m pip install -e .`:

- pypa/setuptools#4955

`python setup.py install` is deprecated and emits a warning when run. The warning will become an error on October 31, 2025.

- 9c4d383631/setuptools/command/install.py (L58-L67)

> ```python
> SetuptoolsDeprecationWarning.emit(
>     "setup.py install is deprecated.",
>     """
>     Please avoid running ``setup.py`` directly.
>     Instead, use pypa/build, pypa/installer or other
>     standards-based tools.
>     """,
>     see_url="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html",
>     due_date=(2025, 10, 31),
> )
> ```

- pypa/setuptools#3849

Additional Resource:

- [Why you shouldn't invoke setup.py directly](https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156027
Approved by: https://github.com/ezyang
2025-07-09 11:24:27 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
ffe11b2bf2 [BE] fix typo in torch/distributed/tensor/: childs -> children (#156609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156609
Approved by: https://github.com/wanchaol, https://github.com/cyyever
ghstack dependencies: #156311
2025-07-09 11:02:23 +00:00
4cc8b60d1b [BE][1/16] fix typos in torch/ (#156311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311
Approved by: https://github.com/albanD
2025-07-09 11:02:22 +00:00
f5bbaa2253 Fixes typo in nccl_window_registration test (#157293)
As mentioned here: https://github.com/pytorch/pytorch/pull/155134#discussion_r2175605192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157293
Approved by: https://github.com/Skylion007
2025-07-09 11:01:18 +00:00
924fc52e18 [BE] add a linter to check consistency for cmake minimum version in requirements (#156961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156961
Approved by: https://github.com/ezyang, https://github.com/malfet
2025-07-09 10:44:17 +00:00
b83d8827bc Revert "Deprecate DataLoader pin_memory_device param (#146821)"
This reverts commit ab655816b8f76f511fb2262d45276d8d1b13d59c.

Reverted https://github.com/pytorch/pytorch/pull/146821 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146821#issuecomment-3052093902))
2025-07-09 10:29:31 +00:00
6f23f53599 [inductor] fix tensor.to(uint8) error when tensor src type is float (#157267)
The CPU Inductor backend processes `.to(torch.uint8)` incorrectly, leading to numerical inconsistencies. The convert_float_to_int8 function may return incorrect results for negative inputs, such as -2.xx, when the data type is uint8_t, producing 0 instead of 255. This issue stems from the clamping logic: we should avoid converting min_val to uint8_t too early.
Fixes https://github.com/pytorch/pytorch/issues/156788
@leslie-fang-intel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157267
Approved by: https://github.com/leslie-fang-intel
2025-07-09 07:03:38 +00:00
e3f2597b45 [Optimus] Fix normalization pass in the aten IR (#157857)
Summary: We found a special case in a recent APS model where the input tensor is smaller than the split size. It is automatically truncated in split.Tensor, so we add an extra condition check for split_with_sizes when doing the normalization.
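
The truncation described above can be modeled in plain Python. This is a minimal sketch, assuming the usual behavior where the last chunk is smaller when the dimension is not evenly divisible; `split_chunk_sizes` is a hypothetical helper, not a function from the normalization pass.

```python
def split_chunk_sizes(dim_size: int, split_size: int) -> list[int]:
    """Chunk sizes along one dimension when splitting by a fixed split_size."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    full_chunks, remainder = divmod(dim_size, split_size)
    sizes = [split_size] * full_chunks
    if remainder or not sizes:
        # The last (or only) chunk is truncated to whatever is left.
        sizes.append(remainder)
    return sizes
```

When the dimension is smaller than the split size, the result is a single truncated chunk, which is the case a fixed `split_with_sizes` list would otherwise get wrong.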

Test Plan:
### unit
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_aten_normalization
```

Buck UI: https://www.internalfb.com/buck2/2ecd1ef8-8efe-4245-b4c8-282c23645b3c
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7599824648585787
Network: Up: 3.9GiB  Down: 9.2GiB  (reSessionID-1396c91e-0dd2-457b-a49b-a6ab1f2a7d8f)
Loading targets.   Remaining      0/5344                                                                                                              99617 dirs read, 1074949 targets declared
Analyzing targets. Remaining      0/123279                                                                                                            4988547 actions, 5966764 artifacts declared
Executing actions. Remaining      0/728058                                                                                                            209:52:59.9s exec time total
Command: test.     Finished 12466 local, 209448 remote, 1226 cache (1% hit)                                                                           42:10.5s exec time cached (0%)
Time elapsed: 26:07.6s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

before fix:
aps-afoc_apop_pt2_v0-db2fe0449a

after fix:
aps-afoc_apop_pt2_v0-755ad0cdc6

Rollback Plan:

Differential Revision: D77961394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157857
Approved by: https://github.com/anijain2305
2025-07-09 05:38:15 +00:00
effe376db0 Adding aoti_standalone config (#157731)
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if  `package_cpp_only` is explicitly set to False in config.
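
The rule above could be expressed roughly as follows. This is only a sketch: `resolve_package_cpp_only` is a hypothetical helper, and using `None` to mean "not explicitly set" is an assumption rather than the way the real config tracks it.

```python
from typing import Optional


def resolve_package_cpp_only(
    compile_standalone: bool, package_cpp_only: Optional[bool]
) -> bool:
    """Hypothetical helper; None means the user left package_cpp_only unset."""
    if compile_standalone:
        if package_cpp_only is False:
            # An explicit False conflicts with standalone compilation.
            raise ValueError(
                "compile_standalone=True requires package_cpp_only=True"
            )
        return True
    return bool(package_cpp_only)
```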

Test Plan:
```
buck2 run  mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r  TestAOTInductorConfig
```

Rollback Plan:

Differential Revision: D77889754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
2025-07-09 04:30:04 +00:00
fcbf7c749a [Windows][Inductor] normalize_path_separator compiler path (#157835)
Fixes #157673

For the call trace:
```
......

  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\common.py", line 2569, in reduction
    return self.kernel.reduction(dtype, src_dtype, reduction_type, value)
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 2155, in reduction
    self._gen_parallel_reduction_buffers(acc, acc_type, reduction_type, init_dtype)
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 1942, in _gen_parallel_reduction_buffers
    reduction_prefix_array(
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 335, in reduction_prefix_array
    if cpp_builder.is_msvc_cl()
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 317, in is_msvc_cl
    return _is_msvc_cl(get_cpp_compiler())
  File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 240, in _is_msvc_cl
    subprocess.check_output([cpp_compiler, "/help"], stderr=subprocess.STDOUT)
torch._inductor.exc.InductorError: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
```
In an MSVC environment with a non-English language pack, the compiler path raises a `utf-8` decoding error. This PR applies `normalize_path_separator` to the compiler path to avoid the issue.
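
As a rough idea of what normalizing path separators can look like, here is a minimal sketch. `to_forward_slashes` is a hypothetical stand-in; the real `normalize_path_separator` may behave differently (for example, it may only act on Windows).

```python
def to_forward_slashes(path: str) -> str:
    """Hypothetical helper: rewrite Windows-style separators as '/'."""
    return path.replace("\\", "/")
```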

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157835
Approved by: https://github.com/jansel
2025-07-09 04:02:20 +00:00
8bda95228f [autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.

2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.

We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.

Removing this unnecessary create + record event should save a few us for each instance avoided.
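
The conditions above can be summarized as a small decision function. This is a hypothetical model that treats streams as plain integer ids; it is not the engine's actual code, and `events_needed` is an illustrative name.

```python
def events_needed(producer: int, consumer: int, accumulation: int) -> tuple[bool, bool]:
    """Return (record_producer_event, record_accumulation_event)."""
    # Event 1 is waited on by the consumer and accumulation streams, so it is
    # redundant only when both are the producer stream itself.
    record_producer_event = not (producer == consumer and producer == accumulation)
    # Event 2 is waited on by the consumer stream only.
    record_accumulation_event = accumulation != consumer
    return record_producer_event, record_accumulation_event
```

In the common single-stream case, neither event is recorded, which is where the saving comes from.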

Fixes https://github.com/pytorch/pytorch/issues/157407

----

Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
2025-07-09 03:36:14 +00:00
8d070187e3 fix type hints for interpolation functions (#157202)
Fixes #129053

Previously, interpolate had a bad signature and incorrect type hints.
This PR fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157202
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-07-09 03:11:37 +00:00
c515385b0a Add Intel GPU info collection to the collect env script (#157351)
https://github.com/pytorch/pytorch/pull/137846 was mistakenly closed. This PR reopens it so the change can land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157351
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-09 03:01:41 +00:00
d6237721c0 [Build] Make PyTorch compilable with gcc-14 on ARM (#157867)
Fixes numerous ICEs in vreg allocations for SVE+BF16
```
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: error: unrecognizable insn:
   25 | #pragma omp parallel
      |         ^~~
(insn 257 256 258 30 (set (reg:VNx8BF 449 [ bf16_vec1_217 ])
        (unspec:VNx8BF [
                (reg:VNx8BF 455)
                (reg:VNx8BF 456)
            ] UNSPEC_IORF)) "/pytorch/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h":228:31 discrim 1 -1
     (nil))
during RTL pass: vregs
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: internal compiler error: in extract_insn, at recog.cc:2812
0xd73c33 internal_error(char const*, ...)
	???:0
0xd73d1f fancy_abort(char const*, int, char const*)
	???:0
0x890053 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	???:0
0x890087 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	???:0
0x1379093 extract_insn(rtx_insn*)
	???:0

```
And one in RTL-expand pass while compiling Activation.cpp
```
during RTL pass: expand
In file included from /pytorch/aten/src/ATen/native/cpu/Activation.cpp:12,
                 from /pytorch/build/aten/src/ATen/native/cpu/Activation.cpp.DEFAULT.cpp:1:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp: In lambda function:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:94:7: internal compiler error: Segmentation fault
   94 |       });
      |       ^
/pytorch/aten/src/ATen/Dispatch.h:201:7: note: in definition of macro 'AT_DISPATCH_SWITCH'
  201 |       __VA_ARGS__                                                           \
      |       ^~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:72:3: note: in expansion of macro 'AT_PRIVATE_CASE_TYPE_USING_HINT'
   72 |   AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, scalar_t, __VA_ARGS__)
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:214:3: note: in expansion of macro 'AT_DISPATCH_CASE'
  214 |   AT_DISPATCH_CASE(at::ScalarType::Double, __VA_ARGS__) \
      |   ^~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:218:34: note: in expansion of macro 'AT_DISPATCH_CASE_FLOATING_TYPES'
  218 |   AT_DISPATCH_SWITCH(TYPE, NAME, AT_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))
      |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:70:5: note: in expansion of macro 'AT_DISPATCH_FLOATING_TYPES'
   70 |     AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&] {
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~
0xd73c33 internal_error(char const*, ...)
	???:0
0x134f987 rebuild_jump_labels(rtx_insn*)
	???:0
```

Interestingly enough, an attempt to compile `Unfold2d.cpp` for `-march=armv8-a+sve` (i.e. without sve+bf16 support) also causes an ICE
```
/pytorch/aten/src/ATen/native/cpu/Unfold2d.cpp:221:1: error: unrecognizable insn:
  221 | }
      | ^
(insn 2918 2917 2919 296 (set (reg:VNx8BI 5917)
        (unspec:VNx16BI [
                (reg:VNx8BI 5920)
                (reg:VNx8BI 5922)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/usr/include/aarch64-linux-gnu/bits/string_fortified.h":29:33 discrim 1 -1
     (expr_list:REG_EQUAL (const_vector:VNx8BI [
                (const_int 1 [0x1]) repeated x9
                (const_int 0 [0])
                (const_int 1 [0x1]) repeated x2
                (const_int 0 [0]) repeated x4
            ])
        (nil)))
during RTL pass: vregs
```

Which could be worked around by adding
```patch
diff --git a/aten/src/ATen/native/cpu/Unfold2d.cpp b/aten/src/ATen/native/cpu/Unfold2d.cpp
index 8ef0741e77af0a..59c76505dd6246 100644
--- a/aten/src/ATen/native/cpu/Unfold2d.cpp
+++ b/aten/src/ATen/native/cpu/Unfold2d.cpp
@@ -169,6 +169,10 @@ static void unfolded2d_acc_channels_last(

 /* note: due to write issues, this one cannot be parallelized as well as
  * unfolded2d_copy */
+#if defined(__GNUC__) && __GNUC__ == 14 && defined(__ARM_FEATURE_SVE)
+// Workaround for gcc-14.2.0 ICE during RTL pass: vregs when compiling for SVE
+__attribute__((optimize("no-tree-vectorize")))
+#endif
 void unfolded2d_acc_kernel(
     ScalarType dtype,
     void *finput_data,
```

Fixes https://github.com/pytorch/pytorch/issues/157842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157867
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-09 02:59:08 +00:00
ab8874bd26 Suppress warning when using native arch for jit loading cuda extensions. (#156923)
Previously, users who wanted PyTorch to determine the CUDA arch when JIT-loading CUDA extensions had to leave the environment variable `TORCH_CUDA_ARCH_LIST` empty, which raises a warning. This commit adds the option to set `TORCH_CUDA_ARCH_LIST=native` to tell PyTorch that the user intentionally wants the native CUDA arch.
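
The three cases described above might be distinguished roughly like this. The function name and the returned labels are hypothetical and only illustrate the intent; the actual handling inside PyTorch is not shown here.

```python
import os


def arch_list_mode(env=None) -> str:
    """Hypothetical classification of TORCH_CUDA_ARCH_LIST values."""
    env = os.environ if env is None else env
    value = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if not value:
        return "detect-with-warning"  # unset or empty: auto-detect, but warn
    if value == "native":
        return "detect-silently"  # explicit opt-in: auto-detect, no warning
    return "explicit-list"  # e.g. "8.0;8.6+PTX"
```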
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923
Approved by: https://github.com/ezyang
2025-07-09 02:51:20 +00:00
bc6e0661a6 Fix more H100 CI (#157829)
Follows @d4l3k's fix in https://github.com/pytorch/pytorch/pull/157826/files. Two more fixes might be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157829
Approved by: https://github.com/davidberard98, https://github.com/d4l3k
2025-07-09 01:28:05 +00:00
e5edd013ab [AOTI] Skip test_simple_multi_arch_embed_kernel_binary_True_cuda (#157301)
Summary: For https://github.com/pytorch/pytorch/issues/156930, there is still no clue about what went wrong, as it is not reproducible locally, but the problem seems to exist only when embed_kernel_binary is True. Let's skip it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157301
Approved by: https://github.com/yushangdi
2025-07-09 01:18:36 +00:00
75f489d37f [Break XPU][Inductor UT] Align tolerance of newly added case with cuda. (#157702)
Align tolerance with cuda for the newly added case `test_comprehensive_logcumsumexp_xpu_float16` in #157512.

Fixes #157697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157702
Approved by: https://github.com/jansel
2025-07-09 00:55:01 +00:00
3eb7084f7a [ci] fix h100-distributed (#157826)
This was broken by https://github.com/pytorch/pytorch/pull/157341

This should resolve the permission issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157826
Approved by: https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/huydhn
2025-07-09 00:27:55 +00:00
86251eff40 Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601)"
This reverts commit 55108074c0795be3b617d3b13b06794f63e1f8ca.

Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3050628047))
2025-07-09 00:07:31 +00:00
1b3d69b59f Work: block_current_stream API (#156883)
This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo.

The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.

There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780

This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set.

Test plan:

```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-07-08 23:55:46 +00:00
92f41ccc26 [Inductor] Support precomputed size args in the FX backend. (#157758)
# Feature
If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes.

This PR adds support for this feature to Inductor's FX IR backend.

We generate FX IR for precomputed size args in 3 steps:
1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.)
2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`.
3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions.
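
Steps 2 and 3 above can be sketched in plain Python. The class and function below are hypothetical stand-ins (expressions are kept as strings rather than sympy objects), not the actual `WrapperFxCodegen` code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolicCallArg:
    """Hypothetical stand-in for a SymbolicCallArgLine: a (symbol, expr) pair."""

    symbol: str
    expr: str


def resolve_kernel_args(arg_lines, call_args):
    """Replace symbolic call args with their recorded definitions."""
    arg_defs = {}
    for line in arg_lines:
        # Step 2: update the {symbol: expr} dict for each definition seen.
        arg_defs[line.symbol] = line.expr
    # Step 3: substitute at the kernel call; other args pass through unchanged.
    return [arg_defs.get(arg, arg) for arg in call_args]
```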

In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes.

# Test plan
Added a new CI test exercising this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758
Approved by: https://github.com/jansel
2025-07-08 23:22:17 +00:00
95bc3da9f8 [c10d] support dynamic shapes for all_to_all_single_autograd (#157521)
`all_to_all_single_autograd` is not an op, so all the code executed until the `all_to_all_single` dispatch is visible to the compiler. This means the `all_to_all_single_autograd` wrapper code must support symints in order to be traceable with dynamic shapes.

FIXES https://github.com/pytorch/pytorch/issues/157479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157521
Approved by: https://github.com/wconstab
2025-07-08 23:19:59 +00:00
9f18482d41 [dynamo] removing string literals for weblink generation (#157820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157820
Approved by: https://github.com/williamwen42
2025-07-08 23:08:06 +00:00
c5b46b5408 [BE] Standardize CPU capabilities name (#157809)
It's weird to call the default x86 CPU capability `NO AVX` when in reality it's something different. It's also a bit strange to have it assigned different names on different platforms.

Fixes https://github.com/pytorch/pytorch/issues/157538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157809
Approved by: https://github.com/Skylion007
2025-07-08 23:06:09 +00:00
179dcc10e4 Add sm_70 arch for linux cuda 12.8 and 12.9 builds (#157558)
Please see: https://github.com/pytorch/pytorch/issues/157517
We would like to keep Volta architectures by default for release 2.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157558
Approved by: https://github.com/Skylion007, https://github.com/Camyll, https://github.com/seemethere, https://github.com/malfet
2025-07-08 23:02:10 +00:00
7a41f20794 [inductor] Quiesce Triton compile worker pool after each dynamo compile (#156187)
For internal usages, keeping the Triton compile worker pool active for the lifetime of the process has caused some challenges, e.g., it slows down and muddies profiling due to the huge number of threads on a box: N threads = 8 ranks * 32 subprocs * M threads started by torch. Also, each subproc can use more than 1GB. This PR adds the functionality to shut down worker subprocs after each dynamo compile when using the SubprocPool implementation. The idea is to leave the main sidecar process running, but signal it to tear down its internal ProcessPoolExecutor when compile is finished. Restarting the ProcessPoolExecutor is relatively fast, e.g., 500ms, because the ProcessPoolExecutor forks from the sidecar. Changes:
* Do not start the ProcessPoolExecutor automatically when compile_fx is imported. Instead, start the sidecar process only. The sidecar process imports torch, so is still slow to start.
* Introduce wakeup() and quiesce() calls to the implementation to start and stop the ProcessPoolExecutor.
* Add a context manager to automatically quiesce() at the end of dynamo compilation.
* Signal a wakeup() in compile_fx only when we have cuda devices.
* Add a killswitch so we can turn off quiescing.
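
The wakeup/quiesce lifecycle listed above can be sketched as follows. This is a simplified illustration, not the SubprocPool implementation: `QuiescablePool` and `quiesce_after` are hypothetical names, and a `ThreadPoolExecutor` stands in for the `ProcessPoolExecutor` owned by the sidecar process.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class QuiescablePool:
    """Hypothetical pool whose executor can be stopped and restarted on demand."""

    def __init__(self, workers: int = 2):
        self._workers = workers
        self._executor = None  # Not started at construction time.

    @property
    def running(self) -> bool:
        return self._executor is not None

    def wakeup(self) -> None:
        if self._executor is None:
            self._executor = ThreadPoolExecutor(max_workers=self._workers)

    def quiesce(self) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None

    def submit(self, fn, *args):
        self.wakeup()  # Start lazily on first use.
        return self._executor.submit(fn, *args)


@contextmanager
def quiesce_after(pool: QuiescablePool):
    """Tear the workers down when the enclosed compile finishes."""
    try:
        yield pool
    finally:
        pool.quiesce()
```

Wrapping a compile in `with quiesce_after(pool):` keeps workers alive only for the duration of that compile; the next `submit` starts them again.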

Testing:
For correctness, the stacked change at https://github.com/pytorch/pytorch/pull/156534 enables the feature for OSS so it's exercised in CI.

For performance, because of recent compile-time variance (see https://github.com/pytorch/pytorch/issues/152566), it's pretty hard to glean whether there's a regression....

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

The wins (mostly for inference) don't make sense, but I'm also skeptical of the losses (mostly for training). I can't repro any of the slowdowns locally. Furthermore, check out the benchmarking results for the stacked diff, which actually enables the quiescing functionality for OSS. That should only slow down compile since there can only be overhead to stop and start the workers. But the results are somehow better:

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156187
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-07-08 22:53:13 +00:00
178fe7aa98 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-07-08 22:11:33 +00:00
2e14069081 Revert "[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)"
This reverts commit 777eca9f16aeecd7c362a235cf25e6b8e6eda57f.

Reverted https://github.com/pytorch/pytorch/pull/157216 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/157216#issuecomment-3050258896))
2025-07-08 20:48:51 +00:00
391473cca0 [export] Fix lift constants bug (#157719)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157719
Approved by: https://github.com/yushangdi
2025-07-08 20:33:53 +00:00
b9dc2fa4f7 Add legacy note to autograd.profiler doc. (#157459)
Via a Google search, I got to `torch.autograd.profiler` and implemented my code with it, only to be taken by surprise on finding `torch.profiler`, which has a note saying the autograd one is legacy.

This just adds such a note to `autograd.profiler` to spare future readers in my situation the same confusion and wasted time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157459
Approved by: https://github.com/sraikund16
2025-07-08 20:33:23 +00:00
a73d9e0aec Fix einsum strategy shard dim > ndim (#157593)
Previously we didn't constrain Shard dim to be <= the tensor's ndim. This caused an invalid strategy like `(RR, RS(2)) -> RS(2)` for einsum `bmk,kn->bmn` on the 2d mesh.
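
The kind of bound check involved can be sketched as below. This is an assumption-laden illustration: `shard_dim_is_valid` is a hypothetical helper, and it assumes zero-based dims, so a valid shard dim is strictly below the tensor's ndim.

```python
def shard_dim_is_valid(shard_dim: int, ndim: int) -> bool:
    """Hypothetical check: a Shard(dim) placement needs 0 <= dim < ndim."""
    return 0 <= shard_dim < ndim
```

For `bmk,kn->bmn`, sharding on dim 2 is fine for the 3-D `bmk` operand but out of range for the 2-D `kn` operand.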

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157593
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2025-07-08 20:27:17 +00:00
06b3265cb1 Increase nightly C++ docs build timeout to 6h (#157759)
This job has been timing out since May 261897734a/1, maybe it's time to figure out if this makes sense.

Issues https://github.com/pytorch/pytorch/issues/157763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157759
Approved by: https://github.com/malfet
2025-07-08 19:28:48 +00:00
dea4864ce0 HF loads dcp - don't do a full deserialize on every file (#157715)
Summary: These changes in D76442012 got reverted after the PR landed due to aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failing with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`, but that test failing is definitely transient and unrelated to my changes, so re-creating the diff

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D77871099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18
2025-07-08 18:13:27 +00:00
4f5be56612 [Pyrefly][Refactor] Replace dict() calls with literal dict syntax for improved readability (#157735)
There are 31 places that I spotted which construct literal dictionaries.

This PR refactors dictionary construction by replacing `dict(...)` calls with literal `{...}` syntax where applicable.
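
A small illustration of the two spellings (the keys and values here are made up for the example):

```python
# Keyword-argument form and literal form build equal dictionaries.
via_call = dict(mode="training", dtype="bfloat16")
via_literal = {"mode": "training", "dtype": "bfloat16"}

print(via_call == via_literal)
```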

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157735
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-08 18:10:33 +00:00
0f31445139 Add stack trace of exception to MultiProcContinousTest (#157589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589
Approved by: https://github.com/Skylion007
2025-07-08 17:54:35 +00:00
5b4e0255d7 Check FakeScriptObject in _resolve_name_collision (#157736)
Summary:
Fix https://github.com/pytorch/pytorch/issues/157401

torch.equal cannot handle FakeScriptObject inputs.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r  test_aoti_torchbind_name_collision
```

Rollback Plan:

Differential Revision: D77894081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157736
Approved by: https://github.com/angelayi
2025-07-08 17:51:46 +00:00
44d0800d60 [Intel GPU] Set higher tolerance for squeezenet1_1 with bf16 (#156920)
We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920
Approved by: https://github.com/soulitzer
2025-07-08 17:49:54 +00:00
a5c61eb78d [MPS][BE] Delete as_strided_tensorimpl_mps (#157772)
Because it's just copy-n-paste of `as_strided_tensorimpl` with call to `updateTensorBaseShape`, which is not called/used anywhere else.

Fixes https://github.com/pytorch/pytorch/issues/152701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157772
Approved by: https://github.com/Skylion007
2025-07-08 17:02:36 +00:00
bbe681ed51 [cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)
Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/)

Motivation:
* This is the case we care the most.
* We are caching the kernels for this row x column layout. So testing on them can potentially make ci run faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156656
Approved by: https://github.com/ColinPeppler
2025-07-08 16:57:33 +00:00
ed911747c2 [dtensor] add support for fused optimizer with parameters across multiple meshes (#157682)
We are seeing more and more use cases where parameters in a model (under the same optimizer group) are put on different meshes. E.g.
- when FSDP and TP are both applied, some parameters are sharded only on the FSDP mesh but not TP mesh (see https://github.com/pytorch/pytorch/pull/153268).
- in [dp2ep Expert Parallel](https://github.com/pytorch/torchtitan/pull/1324), the routed experts are sharded on the (global FSDP \ EP) mesh for smaller FSDP and on the EP mesh for EP, whereas other params are sharded on the global FSDP mesh for FSDP.

This PR is, in some sense, a continuation of https://github.com/pytorch/pytorch/pull/147869 to tackle the problem when fused optimizers are used. In such cases, the [`fused_adam`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L15786) / `fused_adamw` has a scalar tensor arg `state_steps` which gets automatically cast to DTensor on the default [`compute_mesh`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_dispatch.py#L350) (one of the multiple meshes), even though it could correspond to different meshes.

To avoid hitting the cross-mesh propagation exception in `common_pointwise_strategy` and followup redistribute problems, we manually set the target mesh and placements to be the same as input mesh and placements, so that no redistribute will be triggered. This also helps bypass the situation where [`generate_redistribute_costs`](https://github.com/pytorch/pytorch/pull/157682/files#diff-eea32a36dd2d4e58307bc5229402e48048b2ecaef64a7c085495fba1ee10ac89R597) returns infinite cost due to cross mesh redistribute.

Moreover, this PR has minimal scope (restricted to the `fused_ops`) and doesn't need to modify other files such as `_sharding_prop.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157682
Approved by: https://github.com/wanchaol
2025-07-08 15:58:30 +00:00
777eca9f16 [DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216)
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.

It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names`, which is not always true. I'm also removing the file it belongs to, `torch/distributed/tensor/parallel/_utils.py`, entirely, as the other check in it, `_deprecate_warnings`, added two years ago, is no longer used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2025-07-08 15:57:37 +00:00
476874b37f [BE]: Update NCCL to 2.27.5 (#157108)
Update NCCL to 2.27.5. Minor version: improves Blackwell, Symmem FP8 support, and fixes a bug with MNNVL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157108
Approved by: https://github.com/atalman
2025-07-08 15:40:54 +00:00
5dc75f72d4 Simplify the base classes of _PyFutureMeta (#157757)
Summary:

I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures.

At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance.
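
The metaclass resolution described above can be checked without PyTorch. The following is a minimal sketch, not the code from this PR: `BaseMeta`, `Base`, `DerivedMeta`, `Future`, and `PlainFuture` are hypothetical stand-ins (with `Base` playing the role of `torch._C.Future`), and it assumes Python 3.7 or newer.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class BaseMeta(type):
    """Hypothetical stand-in for type(torch._C.Future)."""


class Base(metaclass=BaseMeta):
    """Hypothetical stand-in for torch._C.Future."""


class DerivedMeta(BaseMeta):
    """Hypothetical stand-in for _PyFutureMeta.

    It lists only BaseMeta as a base; type(Generic) is plain `type`,
    so there is nothing extra to inherit from.
    """


class Future(Base, Generic[T], metaclass=DerivedMeta):
    """Explicit metaclass: no conflict with Generic."""


class PlainFuture(Base, Generic[T]):
    """No explicit metaclass: Python picks the most derived one (BaseMeta)."""


# Generic no longer carries a custom metaclass on Python 3.7+.
generic_meta_is_type = type(Generic) is type
```

Here `type(Future)` is `DerivedMeta` and `type(PlainFuture)` is `BaseMeta`, so neither class needs a hand-written multiple-inheritance metaclass.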

Test Plan:

I'm fairly confident from local testing that this should be a no-op.

But also, PyTorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157757
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-08 15:39:56 +00:00
f88d7a7a34 [BE] Do not add . after troubleshooting_url (#157753)
Otherwise the dot becomes part of auto-linked URLs (in GitHub logs, say), which then point to a nonexistent location.

For example from https://github.com/pytorch/pytorch/actions/runs/16130448756/job/45517004735?pr=157749#step:18:27
> W0708 00:23:20.150000 67082 torch/_dynamo/convert_frame.py:1047] [0/8] To diagnose recompilation issues, see [https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.](https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.)
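
A rough sketch of why the trailing dot matters. The pattern below is a deliberately simplified auto-linker (an assumption for illustration; real linkifiers such as GitHub's use different rules), and `linkify_targets` is a hypothetical helper.

```python
import re

# Simplified auto-linker: everything up to the next whitespace counts as the URL.
URL_RE = re.compile(r"https?://\S+")


def linkify_targets(text):
    """Return the URLs that this naive linkifier would turn into links."""
    return URL_RE.findall(text)


base = "https://pytorch.org/docs/main/torch.compiler_troubleshooting.html"
with_dot = f"To diagnose recompilation issues, see {base}."
without_dot = f"To diagnose recompilation issues, see {base}"

# With the sentence-ending dot, the link target ends in ".html." and is broken.
broken = linkify_targets(with_dot)
# Without it, the link target is exactly the intended URL.
working = linkify_targets(without_dot)
```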

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157753
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-07-08 15:38:24 +00:00
98bb0c0e78 [CI][MacOS] Add VENV_PATH to search path (#157749)
When building/testing PyTorch on MacOS

Should prevent some flakiness when the conda environment overtakes CI/CD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157749
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-08 15:37:45 +00:00
76fe88fa56 Revert "Cleanup leftover miniconda brew installation (#156898)"
This reverts commit 214e2959dcdbf91a999d5c0a5d40c91e4442e8c5.

Reverted https://github.com/pytorch/pytorch/pull/156898 on behalf of https://github.com/malfet due to Breaks TorchVision builds ([comment](https://github.com/pytorch/pytorch/pull/156898#issuecomment-3049281232))
2025-07-08 14:54:42 +00:00
86670b39fa [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-08 14:02:20 +00:00
c78bbdf410 [BE] Update xpu driver repo for CD used almalinux 8.10 (#157356)
The XPU CD docker image is built on `quay.io/pypa/manylinux_2_28_x86_64`, which is based on AlmaLinux 8.10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157356
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-07-08 13:59:46 +00:00
b9afdd9bcc Add flag to fx.passes.split_module to normalize input names (#157733)
This is useful for vLLM, which runs AOTAutograd directly on graphs after
they have been split.

I created a new flag for this instead of reusing
`keep_original_node_name` (please let me know if you think I should reuse this).
The reasoning is:
- The names of the placeholder nodes are different from the targets of
  the placeholder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it
  looks public, and it has extensive public use. For example, this change
  would actually be BC-breaking to vLLM (they rely on the subgraph input
  names being different at the moment).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157733
Approved by: https://github.com/ezyang
2025-07-08 13:47:24 +00:00
cyy
7381c77724 Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-07-08 12:20:29 +00:00
ab655816b8 Deprecate DataLoader pin_memory_device param (#146821)
Following [ #131858 suggestion](https://github.com/pytorch/pytorch/pull/131858#pullrequestreview-2517760602) to optimize DataLoader code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146821
Approved by: https://github.com/divyanshk

Co-authored-by: Divyansh Khanna <divyanshkhanna09@gmail.com>
2025-07-08 09:24:53 +00:00
41e8b826d0 S390x update test marks (#157541)
Update s390x test marks

test_logs_out from test/dynamo/test_logging.py is updated
and no longer fails on s390x.

test_qengine from test/test_torch.py doesn't work on s390x:
no QEngine is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157541
Approved by: https://github.com/huydhn
2025-07-08 09:08:33 +00:00
5430990bd7 Added philox based RNG context for HPU device in Dtensor scenarios (#156581)
In this PR, we are enabling `HPU` device-specific function calls for random operations. These calls will manage the setting and unsetting of the `context of Random Number Generator`.
While HPU devices typically utilize a `Mersenne-based RNG`, DTensor-specific random operations employ an `offset-based (Philox) RNG tracker`, which is currently integrated only with `CUDA`.
To integrate a similar offset-based RNG tracker within the `HPU backend`, a backend-specific device handle function is necessary to identify the execution context of these random operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156581
Approved by: https://github.com/jeromean, https://github.com/wanchaol
2025-07-08 08:50:24 +00:00
55108074c0 Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The other name, `AllocatorConfig`, is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the key-value pairs in the env variable config.

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
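
To make the format and the lookup order concrete, here is a small Python sketch. It is not the C++ `ConfigTokenizer` from this PR: `parse_alloc_conf` and `read_alloc_conf` are hypothetical helpers, values are kept as strings, and the relative order of `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` is an assumption (the description only says both rank below `PYTORCH_ALLOC_CONF`).

```python
import os


def parse_alloc_conf(conf):
    """Parse 'key:value' pairs separated by commas; '[a,b]' is a list value."""
    result = {}
    i, n = 0, len(conf)
    while i < n:
        # Skip separators and whitespace between pairs.
        while i < n and conf[i] in ", ":
            i += 1
        if i >= n:
            break
        colon = conf.index(":", i)  # raises ValueError on a malformed pair
        key = conf[i:colon].strip()
        i = colon + 1
        while i < n and conf[i] == " ":
            i += 1
        if i < n and conf[i] == "[":
            # List value: commas inside the brackets do not end the pair.
            close = conf.index("]", i)
            items = conf[i + 1:close].split(",")
            result[key] = [v.strip() for v in items if v.strip()]
            i = close + 1
        else:
            end = conf.find(",", i)
            end = n if end == -1 else end
            result[key] = conf[i:end].strip()
            i = end
    return result


def read_alloc_conf(env=os.environ):
    """Return the parsed config from the highest-priority variable that is set."""
    names = (
        "PYTORCH_ALLOC_CONF",       # generic name, highest priority
        "PYTORCH_CUDA_ALLOC_CONF",  # legacy; order vs. HIP is an assumption
        "PYTORCH_HIP_ALLOC_CONF",   # legacy
    )
    for name in names:
        value = env.get(name)
        if value is not None:
            return parse_alloc_conf(value)
    return {}


example = parse_alloc_conf("key1:123, key2:[val1,val2]")
```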

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-08 08:40:47 +00:00
84b77ec128 [BE] add a minimal linter to check pyproject.toml consistency (#156017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156017
Approved by: https://github.com/ezyang
2025-07-08 08:17:36 +00:00
8134684d44 [inductor collectives] sink waits iterative (#157708)
Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157708
Approved by: https://github.com/wconstab
ghstack dependencies: #157706
2025-07-08 07:17:10 +00:00
2af7c67e48 Mitigate some flaky tests in trunk (#157756)
(This does not really fix these issues, but we should be able to close them. It also allows CI from the PR to test them.)

Fixes https://github.com/pytorch/pytorch/issues/156579
Fixes https://github.com/pytorch/pytorch/issues/156580
Fixes https://github.com/pytorch/pytorch/issues/126867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157756
Approved by: https://github.com/clee2000
2025-07-08 07:07:11 +00:00
38757d94f1 Enable target-determination (TD) for ROCm CI (#156545)
Target determination sorts the tests in a PR CI run based on heuristics about which tests are more relevant to the PR's changes. This can help provide faster CI signal as well as help alleviate capacity concerns as job durations should decrease due to catching failures earlier.
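
As an illustration only, the reordering idea can be sketched as below. The token-overlap heuristic and the helpers `tokens`, `relevance`, and `order_tests` are hypothetical; they are not the heuristics that PyTorch's TD tooling uses.

```python
import re


def tokens(path):
    """Split a file path into lowercase name fragments, ignoring 'test'."""
    stem = path.rsplit(".", 1)[0].lower()
    return [t for t in re.split(r"[/_]", stem) if t and t != "test"]


def relevance(test_file, changed_files):
    """Score a test by its best token overlap with any changed file."""
    test_tokens = set(tokens(test_file))
    return max(
        (len(test_tokens & set(tokens(f))) for f in changed_files),
        default=0,
    )


def order_tests(tests, changed_files):
    """Run the most relevant tests first; ties keep their original order."""
    return sorted(tests, key=lambda t: -relevance(t, changed_files))


changed = ["torch/distributed/fsdp/api.py"]
tests = [
    "test/test_nn.py",
    "test/distributed/test_fsdp.py",
    "test/test_torch.py",
]
ordered = order_tests(tests, changed)
```

Every test still runs; only the order changes, so a failure caused by the PR tends to surface earlier in the job.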

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156545
Approved by: https://github.com/jeffdaily, https://github.com/clee2000
2025-07-08 06:27:40 +00:00
1b58e7adab fix storage use_count (#157694)
# Motivation
https://github.com/pytorch/pytorch/pull/155451 decoupled `torch._C._storage_Use_Count` from CUDA and introduced a corresponding unit test:
815545f2dd/test/test_torch.py (L257-L262)
However, this test fails when PyTorch is built with debug assertions enabled. @clee2000 disabled this UT in https://github.com/pytorch/pytorch/pull/156731. The root cause is that `_cdata` is obtained from an `intrusive_ptr`, not a `weak_intrusive_ptr`. As a result, calling `c10::weak_intrusive_ptr::use_count` on it triggers the internal assertion:
815545f2dd/c10/util/intrusive_ptr.h (L912-L917)
For example:
```python
a = torch.randn(10, device=device) # refcount=1, weakcount=1
prev_cf = torch._C._storage_Use_Count(a.untyped_storage()._cdata) # violates the assertion
```
This violates the expected invariant inside `weak_intrusive_ptr::use_count`, which assumes the pointer was originally constructed from a valid `weak_intrusive_ptr`. Actually, `storage_impl` is obtained from an `intrusive_ptr`.
815545f2dd/torch/csrc/Module.cpp (L2105-L2109)

# Solution
Use `c10::intrusive_ptr::use_count` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157694
Approved by: https://github.com/albanD
2025-07-08 05:53:12 +00:00
8186af5a26 [BE][Easy] set end-of-line for .bat file to CRLF in .editorconfig (#156032)
See also:

54976bca10/.gitattributes (L1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156032
Approved by: https://github.com/seemethere, https://github.com/ezyang
2025-07-08 05:40:57 +00:00
bdacf08b86 [BE][Easy] add .editorconfig setting for C/C++/CUDA/ObjC (#157692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157692
Approved by: https://github.com/ezyang
2025-07-08 05:37:15 +00:00
987314aa96 Split batch-num-heads grid dim between y and z (#157745)
for #157018

doesn't totally fix the problem but should help a lot
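
A sketch of the general idea, assuming the problem is CUDA's 65535 cap on `gridDim.y` and `gridDim.z` (the linked issue is not quoted here, so that is an assumption). `split_grid` and `flat_index` are hypothetical helpers and not the code in this PR.

```python
MAX_GRID_YZ = 65535  # CUDA limit for gridDim.y and gridDim.z


def split_grid(batch_heads):
    """Return (y, z), both <= MAX_GRID_YZ, with y * z >= batch_heads."""
    if batch_heads <= 0:
        raise ValueError("batch_heads must be positive")
    z = -(-batch_heads // MAX_GRID_YZ)  # ceiling division
    if z > MAX_GRID_YZ:
        # Beyond 65535 * 65535 blocks, two dimensions are not enough.
        raise ValueError("batch_heads too large for a y/z split")
    y = -(-batch_heads // z)
    return y, z


def flat_index(y_idx, z_idx, y_dim):
    """Recover the flat batch*heads index from the two block indices."""
    return z_idx * y_dim + y_idx


small = split_grid(100)
large = split_grid(200000)
```

Because `y * z` can exceed `batch_heads`, a kernel using this scheme would have to skip blocks whose `flat_index` is out of range. The hard upper bound in `split_grid` is also one reason a split like this only helps rather than removing the limit.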

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157745
Approved by: https://github.com/Chillee
2025-07-08 05:17:43 +00:00
39a8f66d59 [BE] Use simdgroup_size constexpr (#157751)
Instead of every shader defining it separately, move it to `c10/metal/common.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157751
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157746
2025-07-08 03:46:20 +00:00
0b73f7c871 [EZ][BE] Move array def to c10/metal/common.h (#157746)
And use proper type aliasing instead of weird _ARRAY_NS

Also use `uint64_t` instead of `ulong`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157746
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-07-08 03:46:20 +00:00
a4c7e7f983 [PowerPC]: Fixed build issue that occur because of datatype f8 enablement for onednn in qlinear and prepack (#157469)
The build breaks because data type fp8 was enabled for oneDNN in the qlinear and qlinear_prepack files by commit c2185dc4a5626848df37cad214b73d5ae7dd4f17.

Currently cpuinfo is disabled for Power systems, which causes the error below.

**Error:**
 ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope

Made the required changes, and the build issue is now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157469
Approved by: https://github.com/malfet
2025-07-08 03:45:06 +00:00
cyy
3ee8828c87 [1/N] Don't use CUDA.cmake module (#157188)
Small changes before removing CUDA.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157188
Approved by: https://github.com/ezyang
2025-07-08 03:05:35 +00:00
f56bfb3030 [CPU] Fix memory access for sbgemm bf16 (#156585)
Fixes #156022.

1. The original dtype conversion overwrites the whole `n_*ldc_` instead of `n_*m_` with stride `ldc_`, causing a potential memory issue.
2. Fix the None value issue in attention backward UT, as the sbgemm bf16 could be used.
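
Point 1 can be pictured with plain Python lists. This is a sketch under assumptions, not the ATen code: it assumes a column-major output with leading dimension `ldc`, a densely packed `m * n` source buffer, and hypothetical helper names.

```python
def convert_strided(src, dst, m, n, ldc, convert):
    """Write only the m valid rows of each of the n columns.

    Columns in dst are ldc apart, so dst[j*ldc + m : (j+1)*ldc] is padding
    (or memory owned by someone else) and must stay untouched.
    """
    for j in range(n):
        for i in range(m):
            dst[j * ldc + i] = convert(src[j * m + i])


def convert_whole_buffer(dst, n, ldc, convert):
    """The buggy shape of the loop: touches all n * ldc entries of dst."""
    for k in range(n * ldc):
        dst[k] = convert(0.0)


m, n, ldc = 2, 2, 3
src = [1.0, 2.0, 3.0, 4.0]

good = [None] * (n * ldc)   # None marks entries that must not be written
convert_strided(src, good, m, n, ldc, int)

bad = [None] * (n * ldc)
convert_whole_buffer(bad, n, ldc, int)
```

After the strided version, `good` still has `None` in the padding slots at indices 2 and 5, while the whole-buffer version overwrites them.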

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156585
Approved by: https://github.com/mingfeima, https://github.com/aditew01, https://github.com/ezyang
2025-07-08 02:36:28 +00:00
12f9942b10 Fix slice op redistribute_cost compute (#157178)
For slice op backward, my understanding is that the `redistribute_cost` attribute is incorrectly assigned to previous placement strategy: 0decd966af/torch/distributed/tensor/_ops/_tensor_ops.py (L399-L400)

The mistake is hard to test since we didn't enforce the `redistribute_cost` for `strategy.strategies` with size one: 2815ade9a8/torch/distributed/tensor/_sharding_prop.py (L491-L499)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157178
Approved by: https://github.com/XilunWu
2025-07-08 02:28:59 +00:00
c5589074e6 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-08 01:21:59 +00:00
30a1cc11a4 Revert "[CI][MacOS] Add VENV_PATH to search path (#157749)"
This reverts commit 85111cd165f108ffabb4a90083d59d7a867ebd9f.

Reverted https://github.com/pytorch/pytorch/pull/157749 on behalf of https://github.com/huydhn due to It looks like lint was not green, so revert and reland I guess ([comment](https://github.com/pytorch/pytorch/pull/157749#issuecomment-3047032909))
2025-07-08 01:18:16 +00:00
19a01382bc Revert "[SymmMem] find_path does not search /usr/local/lib (#157695)"
This reverts commit 3effe0c293219b00a0eae7e139fe2d9aed84bc03.

Reverted https://github.com/pytorch/pytorch/pull/157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](https://github.com/pytorch/pytorch/pull/157695#issuecomment-3047020152))
2025-07-08 01:12:01 +00:00
df72078fe1 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/torch.py (#157344)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157344
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-07-08 00:46:56 +00:00
85111cd165 [CI][MacOS] Add VENV_PATH to search path (#157749)
When building/testing PyTorch on MacOS

Should prevent some flakiness when the conda environment overtakes CI/CD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157749
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-08 00:38:37 +00:00
edf7bb4f51 Fix unbound local when an error occurs before pool is initialized (#156750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156750
Approved by: https://github.com/jamesjwu
2025-07-08 00:28:21 +00:00
bbb930aba2 Bump urllib3 from 2.2.2 to 2.5.0 in /tools/build/bazel (#156390)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-07 17:13:21 -07:00
60b41de0ca remove allow-untyped-defs from torch/ao/nn/quantized/modules/rnn.py (#157234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157234
Approved by: https://github.com/jingsh
ghstack dependencies: #157231, #157232
2025-07-08 00:11:52 +00:00
e38a335d7f remove allow-untyped-defs from torch/backends/cusparselt/__init__.py (#157232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157232
Approved by: https://github.com/jingsh
ghstack dependencies: #157231
2025-07-08 00:11:52 +00:00
9d8cf24b3b remove allow-untyped-defs from torch/_classes.py (#157231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157231
Approved by: https://github.com/jingsh
2025-07-08 00:11:52 +00:00
be56a8d7ac Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 23:57:17 +00:00
3effe0c293 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-07 23:16:45 +00:00
2fde2090d0 [inductor_collectives] Make reorder_collectives_preserve_peak pass grouping nodes (#157706)
Differential Revision: [D77861765](https://our.internmc.facebook.com/intern/diff/D77861765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157706
Approved by: https://github.com/wconstab
2025-07-07 23:13:58 +00:00
5d8d126249 Fix einops x torch.compile interaction (#157600)
Fixes https://github.com/pytorch/pytorch/issues/157451

If/when einops releases a version greater than 0.8.1, it will just break
(without this patch).

The history is:
- Between 2.6 and 2.7, we tried to delete the einops import (#142847)
- That didn't work so well, so we applied a hotfix in 2.7.1. (#153925)
- The hotfix wasn't completely correct (0.8.1 is the latest version of
  einops, so the condition in the hotfix just always evaluates to True!)
- It turns out we didn't need to delete the einops import. We already
  do not eagerly import einops.
- I reverted the code back to the state it was in in 2.6.
  https://github.com/pytorch/pytorch/blob/release/2.6/torch/_dynamo/decorators.py

Test Plan:
- We have testing in CI for einops 0.6.1, 0.7.0, and 0.8.1. Wait for CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157600
Approved by: https://github.com/guilhermeleobas, https://github.com/anijain2305
ghstack dependencies: #157416
2025-07-07 23:04:02 +00:00
378c121d5e Remove unnecessary warnings during the ATen compilation process. (#157703)
Comparing uint32_t(num_threads()) with int(kCUDABlockReduceMaxThreads) always results in a compilation warning. Just change the return type of kCUDABlockReduceMaxThreads to uint32_t to avoid it.
Fixes https://github.com/pytorch/pytorch/issues/157701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157703
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-07-07 22:49:38 +00:00
7e83d50845 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-07 22:13:34 +00:00
6f05d58f2b [AOTI] Split aoti_runtime/model.h to prepare for model static linking (#157592)
Summary:
Prepare for https://github.com/pytorch/pytorch/pull/157129.

We split the file so that we can reuse the `model.h` part when generating a separate header for each model in static linkage.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77761249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157592
Approved by: https://github.com/desertfire
2025-07-07 22:13:22 +00:00
a7eb153bba [MemoryViz] Add file selector button (#157647)
In some Linux desktop environments like mine, files cannot be dragged and dropped, which made MemoryViz impossible for me to use. So this adds a file selector button as an alternative. Tested that it works locally, and that it also works with multiple files.

![image](https://github.com/user-attachments/assets/dcb61d68-6c6f-42f6-a075-1783d747d1b0)

And the button remains when something is loaded, to allow loading something else, but it moves out of the way to save vertical space:

![image](https://github.com/user-attachments/assets/4239d13c-3d80-4790-9696-0906c75e14e6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157647
Approved by: https://github.com/sraikund16
2025-07-07 22:03:51 +00:00
ed6df0e324 correctly import torch.version (#157584)
The structure is

```
torch/
  __init__.py
  version.py
```

When we import torch, only `torch/__init__.py` is executed by default.

The submodules like `version.py` are not automatically imported or attached to the torch module.

So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
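
The behaviour described above can be reproduced without torch. The following is a minimal sketch using a throwaway package; the package name `demo_pkg` and the helper `submodule_visibility` are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def submodule_visibility(package_name="demo_pkg"):
    """Return (visible_before, visible_after) for the `version` submodule."""
    with tempfile.TemporaryDirectory() as tmp:
        # Build a tiny package: an empty __init__.py plus a version.py submodule.
        pkg_dir = Path(tmp) / package_name
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "version.py").write_text("__version__ = '0.0.0'\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module(package_name)
            # Only __init__.py has run, so the submodule is not an attribute yet.
            before = hasattr(pkg, "version")
            # An explicit import binds the submodule onto the parent package.
            importlib.import_module(f"{package_name}.version")
            after = hasattr(pkg, "version")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(package_name, None)
            sys.modules.pop(f"{package_name}.version", None)
    return before, after


print(submodule_visibility())
```

With an empty `__init__.py`, the submodule only becomes reachable as an attribute after it has been imported explicitly somewhere, which is why making the import explicit in `__init__.py` removes the dependence on import order.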
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
2025-07-07 21:43:35 +00:00
5c79a55e7e [oss] Add version to metadata (#155343)
Summary: We want to add a version to the DCP metadata so that whenever planner logic changes, we can use the version recorded on save to determine how to load the data.

Test Plan:
added a test

Rollback Plan:

Differential Revision: D76135887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
2025-07-07 20:57:30 +00:00
3d06ff82a8 [release] Triton pin update to 3.4 (#156664)
Triton pin update issue: https://github.com/pytorch/pytorch/issues/154206
Please see post: https://dev-discuss.pytorch.org/t/2-8-final-rc-release-postponed-by-a-week/3101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156664
Approved by: https://github.com/davidberard98
2025-07-07 20:52:25 +00:00
2efa5eaa65 swa avoid stream sync (#157705)
Summary:
When AveragedModel runs update_parameters, it evaluates self.n_averaged == 0 for each parameter, where n_averaged is a buffer on GPU. Move the check before the loop so that the sync happens only once.

This improves update_parameters from 74 ms to 57 ms, a ~22% improvement.
{F1980011097}
{F1980011111}

Test Plan:
CI

Rollback Plan:

Differential Revision: D77723025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157705
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2025-07-07 20:47:35 +00:00
c2510fcd86 Fix index_put propagate strategy arg unpack error (#157671)
Fix the `index_put` propagate strategy, which did not consider the optional arg `accumulate`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157671
Approved by: https://github.com/fmassa, https://github.com/wconstab
2025-07-07 20:18:18 +00:00
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is 0 (a positive number), then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \cdot 2^{e}$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.125 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \cdot 2^{-3} \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \cdot 2^{-2} \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 \times 0.002$.

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.
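
The truncation bound used in this note can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic above; the helper name `fp16_truncation_error` is hypothetical and is not part of PyTorch or of this PR.

```python
def fp16_truncation_error(e_a: int, e_b: int, mantissa_bits: int = 10) -> float:
    """Rough bound on the error from aligning b's mantissa to a's exponent.

    Assumes e_a >= e_b and that the sum keeps the exponent e_a, as in the note.
    """
    # Number of low mantissa bits of b that fall off when aligning to e_a.
    dropped_bits = e_a - e_b
    return (2 ** dropped_bits / 2 ** mantissa_bits) * 2.0 ** e_b


for e_b in (-3, -2, -1):
    print(e_b, fp16_truncation_error(1, e_b))
```

Each of the three cases evaluates to $2^{-9} \approx 0.00195$, which is one unit in the last place of a float16 value with exponent 1. That is why the bound does not depend on which of the three exponents $b$ has.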

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
63a96eaeb8 [DeviceMesh] Add error when users try to slice non contiguous flattened dim submesh (#157523)
With https://github.com/pytorch/pytorch/issues/157393, we want to first throw a clearer error for users and then fix the issue in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157523
Approved by: https://github.com/fegin
ghstack dependencies: #157501
2025-07-07 19:43:51 +00:00
2b8d3b1b2b [DeviceMesh] Use user set backend and pg option even for the global mesh (#157501)
Short term solution to https://github.com/pytorch/pytorch/issues/156593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157501
Approved by: https://github.com/fegin, https://github.com/lw
2025-07-07 19:43:51 +00:00
bf1ebe0531 Fix typo: 'paramter' → 'parameter' in dynamo variable comment (#157651)
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.

These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.

Thanks for your time and review!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
2025-07-07 19:42:44 +00:00
433a247102 [logging] [redo] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156840)
Summary: This is a redo of https://github.com/pytorch/pytorch/pull/156517, but with pt2_compile_events logging disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156840
Approved by: https://github.com/jamesjwu
2025-07-07 19:09:48 +00:00
8a47f9d03b [CI] Fix xpu ci test sccache issue (#157693)
After PR #157341 landed, it broke the XPU CI test on sccache, which had been disabled by #143851. Re-disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157693
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-07 18:29:38 +00:00
9e5f4a844c [FSDP2] Fix issue with set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
fix https://github.com/pytorch/pytorch/issues/155223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155964
Approved by: https://github.com/weifengpy
2025-07-07 17:09:29 +00:00
cyy
7c1f627828 Fix 'dllimport attribute ignored on inline function' (#157670)
There are lots of warnings in builds:
```
 2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z  5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z       |                             ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z  5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR has fixed them and turned the warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
b3b4d28f4c [submodule][cutlass] Update pin to b995f93 v4.0.0 (#157376)
@Skylion007 seems afk. https://github.com/pytorch/pytorch/pull/153541

https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157376
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 16:55:47 +00:00
ae1094b72b Revert "[WIP] Automatically load and save dynamo entries via caching_precompile (#155913)"
This reverts commit e466dab164d9236bfe5817ec8e4d24c7b9d3e392.

Reverted https://github.com/pytorch/pytorch/pull/155913 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/155913#issuecomment-3045914878))
2025-07-07 16:53:35 +00:00
eda0a9cc90 [list] Add list.__delitem__ (#156339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156339
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270, #156271
2025-07-07 14:51:32 +00:00
d74ccf4ffe [list] Add list.__mul__ and list.__imul__ (#156271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156271
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270
2025-07-07 14:51:32 +00:00
689fba032d Implement list.__add__ and list.__iadd__ (#156270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156270
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242
2025-07-07 14:51:25 +00:00
c1d69d5dd5 [list] Implement list.remove (#156242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156242
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148
2025-07-07 14:51:17 +00:00
e49acfc5c5 [list] Raise exception in invalid list method call (#156148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156148
Approved by: https://github.com/zou3519
ghstack dependencies: #153969
2025-07-07 14:51:10 +00:00
034e996d37 [list] Implement list.count (#153969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153969
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-07 14:51:03 +00:00
16c3b4143b [gtest][listing] Enable gtest json listing for the fbcode/caffe2 project (#156816)
***SUMMARY***

The main function in this test overrides that of the GTest framework, which contains its `RUN_ALL_TESTS()` call. The main function in this test is called conditionally, in this case only when the C10_MOBILE directive is provided. This is wrong, as we always want to call `RUN_ALL_TESTS()`.

In this PR, we only make the test suite available for cases that apply, i.e. if the C10_MOBILE directive exists, which represents the caching allocator and is only exposed on mobile.

***TEST PLAN***

These tests should run in modes where they apply, which should be covered by the CI run.

Below is a sample run in dev-nosan mode, which does not have the caching allocator.

BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```

AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining      0/46242                                                                                1871690 actions, 2251668 artifacts declared
Executing actions. Remaining      0/257870                                                                               83:28:24.4s exec time total
Command: test.     Finished 10 remote, 112314 cache (99% hit)                                                            83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```

Rollback Plan:
steps:
  - manual.note:
      content: Revert this diff

Reviewed By: patskovn

Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
2025-07-07 14:16:43 +00:00
54a4d34d10 [fbcode] switch to cutlass-4 (#157579)
Summary: Update the cutlass version to 4 for most use cases.

Test Plan:
testing in progress

Rollback Plan:

Differential Revision: D77605011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157579
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 14:12:33 +00:00
78684e27ac [xla hash update] update the pinned xla hash (#156584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156584
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:20 +00:00
40e39ae21f Update slow tests (#157696)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157696
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:06 +00:00
e466dab164 [WIP] Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 11:56:30 +00:00
d27d36136c Don't try installing missing cuda dependencies on s390x (#157540)
Don't try installing missing cuda dependencies on s390x

Fixes #157409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157540
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-07 09:16:38 +00:00
815545f2dd [inductor] enable bf32 for mkldnn linear pointwise/binary in inductor (#127294)
When `torch.backends.mkldnn.matmul.fp32_precision=='bf16'`, we also enable mkldnn linear in the inductor path and allow it to run with the bf16 computation data type.

Testplan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-07 06:03:41 +00:00
d26ca5de05 Support transpose and pack for bit8 (#156065)
To be used by CPU INT8 SDPA in torchao. https://github.com/pytorch/ao/pull/2380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156065
Approved by: https://github.com/mingfeima, https://github.com/ezyang
2025-07-07 01:40:47 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded to maintain the expected 1‑D shape for writeback operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
02715d0876 [BE][5/6] fix typos in test/ (test/dynamo/) (#157639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157639
Approved by: https://github.com/yewentao256, https://github.com/jansel
ghstack dependencies: #157638
2025-07-06 06:34:25 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
7cda4017dd Fix torch.utils.cpp_extension parser for clang version 20.1.7+libcxx (#157666)
When the CC and CXX compilers are set to clang, and clang was compiled with libc++, compilation of torchvision fails with:

```
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
    compiler_name, compiler_version = self._check_abi()
                                      ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
    _, version = get_compiler_abi_compatibility_and_version(compiler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
    if tuple(map(int, version)) >= minimum_required_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```

Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```

After adjusting the version parser, clang is able to compile extensions successfully.
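
One way to make such a parser tolerant of suffixes like `+libcxx` is sketched below. This is an illustration under assumptions, not the change made in `torch/utils/cpp_extension.py`; the function name `parse_compiler_version` is hypothetical.

```python
import re
from typing import Tuple


def parse_compiler_version(raw: str) -> Tuple[int, ...]:
    """Parse 'MAJOR.MINOR.PATCH' and ignore trailing non-numeric suffixes."""
    parts = []
    for component in raw.strip().split(".")[:3]:
        # Keep only the leading digits, so '7+libcxx' becomes 7.
        match = re.match(r"\d+", component)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


print(parse_compiler_version("20.1.7+libcxx"))
print(parse_compiler_version("20.1.7+libcxx") >= (5, 0, 0))
```

Because the result is a tuple of integers, it can be compared directly against a minimum required version tuple, which is the comparison that raised the `ValueError` in the traceback above.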

Fixes #157665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
2025-07-06 01:35:00 +00:00
3e56a9cdfb More testing of Python arithmetic operators between tensors and scalars (see 157266) (#157632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157632
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-05 17:48:27 +00:00
ee9ac36c23 Fixing misspelling in documentation (#157565)
Fixes #157564

Fixes misspelling of the word parameter in documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157565
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-07-05 17:04:13 +00:00
9be5860bc3 [dynamo] Fix dynamic shapes handling in after_aot repro generation (#157136)
Summary:
- Extract symbolic variables directly from graph placeholders and arguments
- Add symbolic variable definitions to generated repro code
- Add unit tests with ToyModel for testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157136
Approved by: https://github.com/xmfan
ghstack dependencies: #157021
2025-07-05 15:38:41 +00:00
548c9d8281 Fix typo: 'paramter' → 'parameter' in quantization model report test (#157646)
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:

- Corrected the word "paramter" to "parameter" for better readability and accuracy.

While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.

Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-07-05 12:28:36 +00:00
71a650ad56 Fix typo: 'Intializing' → 'Initializing' in test_parametrization.py (#157362)
This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.

- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.

This is a non-functional change and does not impact behavior or test outcomes.

Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157362
Approved by: https://github.com/ezyang
2025-07-05 12:21:15 +00:00
2471cc3355 [pc] verify max autotune is in generated source code (#157650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157650
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614, #157619
2025-07-05 07:55:11 +00:00
db00e1699a [pc] introduce ProgressiveCompilationState and clear callback (#157619)
followup from https://github.com/pytorch/pytorch/pull/157305 where
@aorenste correctly suggested clearing callback. this refactor
introduces a new dataclass so we don't need to check nullability for
each field

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157619
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614
2025-07-05 07:55:11 +00:00
5ea832e5f6 [pc] migrate progression futures from list to deque (#157614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157614
Approved by: https://github.com/aorenste
ghstack dependencies: #157305
2025-07-05 07:55:03 +00:00
a952956d05 Add isnan exit condition to special ops (#157464)
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which was tested by `test_compare_cpu` using the following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)

Luckily, Chebyshev polynomials for arguments with absolute value greater than 1 reach infinity pretty quickly, see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
This is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7
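
The effect of an isnan exit condition can be sketched in plain Python with the standard three-term recurrence for the Chebyshev polynomial of the first kind. This is not the ATen implementation; the function name `chebyshev_t_steps` and the step counter are assumptions added purely for illustration.

```python
import math
from typing import Tuple


def chebyshev_t_steps(x: float, n: int) -> Tuple[float, int]:
    """Evaluate T_n(x) by recurrence; return (value, loop iterations used)."""
    if n == 0:
        return 1.0, 0
    if n == 1:
        return x, 0
    prev, curr = 1.0, x
    steps = 0
    for _ in range(2, n + 1):
        # T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)
        prev, curr = curr, 2.0 * x * curr - prev
        steps += 1
        # Once the value is NaN it stays NaN, so further iterations are wasted.
        if math.isnan(curr):
            break
    return curr, steps


value, steps = chebyshev_t_steps(1.5, 10**6)
print(math.isnan(value), steps < 10**6)
```

For `x = 1.5` the terms overflow to infinity, the next subtraction of two infinities yields NaN, and the loop stops long before the requested order instead of running all one million iterations.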

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
2025-07-05 04:19:50 +00:00
63e87d6d05 [Refactor] Add maybe unused flag to remove warning (#157655)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157655
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-05 03:23:39 +00:00
f7127b9b94 [Refactor] Remove unused variables (#157654)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-05 02:12:15 +00:00
44f5b93122 fix: correct sentence punctuation in cuDNN note (#157623)
This PR fixes a small punctuation issue in the PyTorch README.

Specifically:

Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."

Added comma for clarity between "CUDA driver" and "NVIDIA hardware".

These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157623
Approved by: https://github.com/Skylion007
2025-07-05 01:37:33 +00:00
e0fd48be7d Fix typo: 'occurances' → 'occurrences' in mobile model test (#157629)
This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.

### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity

This is a non-functional, cleanup-only PR to improve code readability.

Thanks to the PyTorch team for maintaining such a high-quality codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157629
Approved by: https://github.com/Skylion007
2025-07-05 01:36:42 +00:00
43f7216327 Fix typo: 'paramters' → 'parameters' in ATen tunable README (#157575)
This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where paramters has been corrected to parameters for improved clarity and consistency.

Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.

Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157575
Approved by: https://github.com/eqy
2025-07-05 01:14:45 +00:00
8a8fac1131 [SymmMem] Move code to where it is used (#157611)
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157611
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-04 23:37:49 +00:00
bcc98bb2a4 Update _linux-test to support B200 runner (#157341)
This unblocks https://github.com/pytorch/test-infra/issues/6869.  The key changes to call out:

* B200 needs OIDC to access ECR and upload stats to S3, so we need to set `id-token: write` in `_linux-test`.  All workflows calling `_linux-test` also need to be updated accordingly
* Connecting sccache to S3 on B200 doesn't seem to work, so I disable it.  It still works locally though.

### Testing

https://github.com/pytorch/pytorch/actions/runs/16055549292/job/45312298376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157341
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-07-04 23:19:24 +00:00
524e827095 [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist from a fresh git clone if we plan to upload the output to PyPI, because all files under `third_party` are included in the SDist. The SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #157557
2025-07-04 19:25:14 +00:00
9968edd002 Fix #153942 (#153943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153943
Approved by: https://github.com/malfet
2025-07-04 18:25:18 +00:00
7275f28045 Fix cuda 12.9 aarch64 GPU builds. Update CUDA_STABLE variable. (#157630)
This contains 2 fixes that are required in main and will need to be cherry-picked to the Release 2.8 branch:
1. The PR https://github.com/pytorch/pytorch/pull/155819 did not include the triton change.
2. The CUDA_STABLE variable needs to be set to 12.8. Updating CUDA stable updates the full static build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157630
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-07-04 18:08:31 +00:00
7be862ab8f [dynamo] Relax DUPLICATED_INPUT to be serializable. (#157492)
Since we don't actually rely on any real data while building DUPLICATE_INPUT guard, we can safely serialize it with sources and it should be able to reconstruct the guard correctly in the new process. Therefore we don't really need to prevent serializing it.

Differential Revision: [D77683302](https://our.internmc.facebook.com/intern/diff/D77683302/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157492
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-07-04 15:19:34 +00:00
336f1e2d35 [AOTI] Fix AOT inductor CMake build dependency order (#157557)
compile_model.py -> aoti_custom_class -> torch

The custom command requires `torch` to be installed.

8408522976/test/cpp/aoti_inference/compile_model.py (L1-L7)

Fixes CI failure on trunk:

- https://github.com/pytorch/pytorch/actions/runs/16041370426/job/45275085572#step:22:18348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157557
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-04 14:33:36 +00:00
a46ea8a364 Fix typo: 'initalized' → 'initialized' in alias analysis test (#157628)
This PR corrects a small spelling error in `test/jit/test_alias_analysis.py`.

- "initalized" → "initialized"

This is a minor comment correction and does not affect functionality or logic.

Thank you for maintaining this amazing codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157628
Approved by: https://github.com/Skylion007
2025-07-04 13:41:53 +00:00
f41d017aa6 Add device check in mse_loss (#155089)
Fixes #154978

## Test Result

```python
>>> import torch
>>> import numpy as np
>>> import torch.nn as nn
>>> import torch.distributions.normal as norm
>>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu'))
>>> print('Using {}'.format(device))
Using cuda
>>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh())
>>> m.to(device, dtype=None, non_blocking=False)
Sequential(
  (0): Linear(in_features=1, out_features=128, bias=True)
  (1): Tanh()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): Tanh()
  (4): Linear(in_features=128, out_features=128, bias=True)
  (5): Tanh()
)
>>> opt = torch.optim.Adam(m.parameters(), lr=0.001)
>>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad)))
Number of trainable parameters:  33280
>>> input_tensor = torch.tensor(77.0, device=device)
>>> target = torch.tensor(66.0)
>>> loss_function = nn.MSELoss()
>>> print('Loss Function: ', loss_function)
Loss Function:  MSELoss()
>>> loss = loss_function(input_tensor, target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss
    return torch._C._nn.mse_loss(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-04 12:37:48 +00:00
52e4e41cbc [dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157598)
`lru_cache` usage warning was being raised for `torch.get_device_module()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157598
Approved by: https://github.com/Sidharth123-cpu
2025-07-04 08:17:50 +00:00
64f2ec77f8 [inductor] Fix fractional_max_pool2d 3D input causing assertion error (#156912)
Fixes #156682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156912
Approved by: https://github.com/angelayi
2025-07-04 06:09:28 +00:00
fdc5b42a8f _broadcast_shapes gso generalizations (#157008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157008
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-04 05:56:42 +00:00
d58ed04d89 [async-compile] add progressive compile mode (#157305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157305
Approved by: https://github.com/aorenste
2025-07-04 04:18:50 +00:00
386bc9e2e9 [audio hash update] update the pinned audio hash (#156905)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156905
Approved by: https://github.com/pytorchbot
2025-07-04 04:06:59 +00:00
f2e712ca14 Revert "Fix is_unaligned usage of statically_known_true (#157400)"
This reverts commit b359571c6043b40c4ae4fbb07135fd0f04902e21.

Reverted https://github.com/pytorch/pytorch/pull/157400 on behalf of https://github.com/malfet due to It break tests, see 99c1a6bdd9/1 ([comment](https://github.com/pytorch/pytorch/pull/157400#issuecomment-3034353539))
2025-07-04 03:57:08 +00:00
99c1a6bdd9 [SymmMem] Find NVSHMEM from system installation (#157513)
Previously we only searched for NVSHMEM in the pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157513
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-04 03:34:44 +00:00
4ed1b03f72 Add missing graph and memory related symbols to cuda_to_hip_mappings (#157435) (#157573)
Summary: This PR adds missing CUDA symbols in `cuda_to_hip_mappings`.

Test Plan: Tested in D77642700.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157573
Approved by: https://github.com/Skylion007

Co-authored-by: Geon-Woo Kim <gwkim@meta.com>
2025-07-04 03:03:04 +00:00
8f9a191db6 [SymmMem] Fix CI name mismatch; remove TORCH_SYMMMEM requirement (#157597)
Thanks @huydhn for spotting two name mismatches in the CI configs.
We were matching against "test_h100_symm_mem" instead of "h100-symm-mem".

Also, replaced `TORCH_SYMMMEM` env setting with programmatic method:
`symm_mem.set_backend(...)`

Further, skips a hanging test in `test_nvshmem_triton.py`. (#TODO @codingwithsurya )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157597
Approved by: https://github.com/fduwjj, https://github.com/huydhn
2025-07-04 01:43:08 +00:00
ef97bd4713 [torch] Add MTIA to the list of devices supporting foreach/fused kernels (#157583)
Summary: We currently have foreach kernel implementations for MTIA, and where we don't, we decompose the ops internally. Anyone using this list for compatibility checks should be sending through the foreach kernels.

Reviewed By: egienvalue, scottxu0730

Differential Revision: D77751248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
2025-07-04 01:15:24 +00:00
f0b388665e Add dynamo_timed to bytecode hook (#157587)
Test Plan:
- ran tlparse on vLLM and saw this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157587
Approved by: https://github.com/jingsh, https://github.com/BoyuanFeng
2025-07-04 01:11:03 +00:00
c9a5bf09ba [FP8] FP8 for SwishLayerNorm (#157574)
Summary: Add a pass use_triton_fp8_swish_replace_normal_swish to replace _triton_swish_rms_norm with its counterpart that supports fp8 triton_swish_rms_norm, and turn on fp8 during inference.

Test Plan:
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12.4 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=899072727_0 --node-replacement-dict="{}" --gpu-trace --add-passes=use_triton_fp8_swish_replace_normal_swish
```
The perf improvement on the 100x model with this pass is roughly 7%; details are recorded [here](https://docs.google.com/document/d/1eIV_OTQyQcf_DlEDxwycTwhyGxT5OJkLzs8cPL6EMYc/edit?tab=t.0)

Rollback Plan:

Reviewed By: frank-wei

Differential Revision: D76531303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157574
Approved by: https://github.com/frank-wei
2025-07-04 01:06:21 +00:00
dfcda613b6 Ensure Dynamo can trace through explicit dunder method call (#154366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154366
Approved by: https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066, #154263
2025-07-04 00:46:05 +00:00
0e7f02fe2e [Dynamo] [FrozensetSubclass] Add support for user defined frozensets (#154263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154263
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066
2025-07-04 00:46:05 +00:00
308b88bde9 [Dynamo] [Set] Add comparison for set subclass (#154066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154066
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065
2025-07-04 00:45:58 +00:00
c51da57b55 [Dynamo] [Set] Raise TypeError in set.union(...) and "__or__" (#154065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154065
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064
2025-07-04 00:45:50 +00:00
f9544f1f0c [Dynamo] [Set] Raise TypeError if object is unhashable (#154064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154064
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063
2025-07-04 00:45:42 +00:00
11c71053e0 [Dynamo] [Set] Implement some binop operators for dict/set/frozenset/dict_keys (#154063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154063
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553
2025-07-04 00:45:34 +00:00
22abe6ded4 [Dynamo] [SetSubclass] Add support for user defined sets (#153553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153553
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539
2025-07-04 00:45:25 +00:00
2b82c61f04 [Generator] Implement generator.__contains__ (#154539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154539
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991
2025-07-04 00:45:18 +00:00
f651e28f80 [FrozenSet] Fixes for FrozenSet (#152991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152991
Approved by: https://github.com/zou3519
ghstack dependencies: #153150
2025-07-04 00:45:11 +00:00
e7167dbacf [Set] Support sets in VariableBuilder (#153150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153150
Approved by: https://github.com/zou3519
2025-07-04 00:45:03 +00:00
6c42afe196 Introduce sync_cross_rank_decision (#156287)
Summary:
This is an improvement over `_broadcast_rank0_decision`, which broadcasts rank 0's decision to every rank. The issue with `_broadcast_rank0_decision` is that we observed large variance in peak memory usage. One cause is that different ranks receive different dynamically shaped tensors, and the hints of those tensors differ across ranks. If we rely only on rank 0's decision and it happens to get unrepresentative hints, the decision it makes may not be suitable for other ranks.

Here, we introduce `sync_cross_rank_decision`, which arrives at a decision after comparing all ranks' local decisions. It will:
1. all-gather decisions from all ranks;
2. test each decision on the current rank and get its estimated memory usage;
3. all-reduce the estimated memory usage with ReduceOp.MAX, so that we know the maximum memory usage of each decision across all ranks;
4. pick the decision that gives the minimum maximum memory usage.
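
The four steps above can be sketched as a single-process simulation. This is not the PR's implementation: it replaces the real collectives with plain lists, and the names `sync_cross_rank_decision_sketch` and `estimate_memory` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def sync_cross_rank_decision_sketch(
    local_decisions: Sequence[D],
    estimate_memory: Callable[[int, D], float],
) -> D:
    """Simulate the min-max selection; index i of local_decisions is rank i."""
    world_size = len(local_decisions)
    # Step 1 (stands in for all-gather): every rank sees every decision.
    gathered: List[D] = list(local_decisions)
    # Step 2: each rank estimates its own memory usage for each decision.
    per_rank = [
        [estimate_memory(rank, d) for d in gathered] for rank in range(world_size)
    ]
    # Step 3 (stands in for all-reduce with MAX): worst case per decision.
    worst = [
        max(per_rank[rank][i] for rank in range(world_size))
        for i in range(len(gathered))
    ]
    # Step 4: choose the decision with the smallest worst-case usage.
    best = min(range(len(gathered)), key=lambda i: worst[i])
    return gathered[best]
```

Because every rank would compute the same reduced vector, all ranks arrive at the same choice without a further broadcast; ties are broken here by the lowest rank index, which is an arbitrary choice of this sketch.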

A graph to show more details
https://internalfb.com/excalidraw/EX484509

After applying sync_cross_rank_decision, we observed that the variance is much smaller.

Rollback Plan:

Differential Revision: D76714005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156287
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-07-03 23:43:53 +00:00
f7130c097e [nativert] Move Executor to PyTorch core (#157514)
Test Plan:
CI

Rollback Plan:

Differential Revision: D77693984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157514
Approved by: https://github.com/zhxchen17
2025-07-03 23:31:54 +00:00
ad86c05b78 efficient zero_mask implementation for vec128_*_neon (#155766)
Differential Revision: [D76481039](https://our.internmc.facebook.com/intern/diff/D76481039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155766
Approved by: https://github.com/malfet
2025-07-03 23:27:03 +00:00
b359571c60 Fix is_unaligned usage of statically_known_true (#157400)
Summary:
- The usage of the symbolic-shapes `statically_known_true` is wrong: that API is meant to be used for SymNodes. What is needed is `V.graph.sizevars.statically_known_true`, `V.graph.sizevars.statically_known_equals`, or ideally `V.graph.sizevars.statically_known_multiple_of`.

- The construction using `== 0` is not symbolic; it used to always return false for symbolic inputs.

Differential Revision: D77619293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157400
Approved by: https://github.com/ColinPeppler
2025-07-03 23:26:36 +00:00
a6fab82b16 [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC (#157453)
Fixes our bad builds of NVSHMEM (we were not building or testing before) and also updates to the latest version. The newest version has critical support for things that actually make it useful, like bfloat16 and float16.

This is a proper fix for: https://github.com/pytorch/pytorch/pull/157411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157453
Approved by: https://github.com/kwen2501, https://github.com/atalman
2025-07-03 22:55:18 +00:00
dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing and a state_dict_stager with shared memory, pinned memory, and zero-overhead support.

2. Adds two convenient functions to create sync/async checkpointers.

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
7081b8233a [BE] Accelerator agnostic timer.py (#157131)
Farewell to a lot of if statements - benefit is this now also supports mps synchronization

Still need to think of a good test strategy for the privateUse1 removal. Granted, I'm not sure what the semantics of something like https://docs.pytorch.org/docs/stable/generated/torch.cpu.synchronize.html actually are, since CPU is probably synchronous?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157131
Approved by: https://github.com/albanD
2025-07-03 22:23:04 +00:00
7b392bac13 all_gather_bucketing fx pass (#157396)
Porting passes to bucket all_gathers

The main logic of the pass is done via
1. Searching for all all_gathers from the buckets

Copying tests from @wconstab PR to test compatibility with reordering.
The test checks only compatibility: because of (3), the joint all_gather is already scheduled as early as possible, leaving no room for reordering.

Pass changes:
Using mutation ops to match the performance of FSDP. In the future, the ideal scenario is a purely functional graph, with Inductor doing all memory optimizations on its own without mutable ops.

Inductor changes:
Adding foreach_copy_ lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
2025-07-03 22:07:42 +00:00
19ae5afdaa Fix typo: 'recieve' → 'receive' in comments (#157544)
This PR corrects minor typos in developer-facing comments:

- Replaces 'recieve' with 'receive' in:
  - `FunctionalTensorWrapper.cpp`
  - `make_boxed_from_unboxed_functor.h`

These changes improve code readability and maintain comment correctness.

Thank you for reviewing!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157544
Approved by: https://github.com/soulitzer
2025-07-03 19:11:15 +00:00
3fd84a8592 [BE][PYFMT] migrate PYFMT for torch/[a-c]*/ to ruff format (#144554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144554
Approved by: https://github.com/soulitzer
2025-07-03 18:56:07 +00:00
d56f11a1f2 [MPS] Implement logcumsumexp metal kernel (#156858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858
Approved by: https://github.com/malfet
ghstack dependencies: #157512
2025-07-03 18:16:25 +00:00
794b95d54b Enable Half dtype for logcumsumexp_backward (#157512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512
Approved by: https://github.com/malfet
2025-07-03 18:13:38 +00:00
e3fe001d9e Add einops x torch.compile testing in PyTorch CI (#157416)
Fixes #146782. This PR adds testing for multiple einops versions in
PyTorch CI. This occurs in a new "einops" CI job that runs for both
Python 3.9 and 3.13 (aka, what we test Dynamo over).

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157416
Approved by: https://github.com/guilhermeleobas, https://github.com/arogozhnikov, https://github.com/anijain2305
2025-07-03 17:36:39 +00:00
660dbea909 [cutlass backend] modify presets ahead of cutlass 4 upgrade (#157522)
Differential Revision: [D77707409](https://our.internmc.facebook.com/intern/diff/D77707409/)

Also asking in https://github.com/NVIDIA/cutlass/issues/2435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157522
Approved by: https://github.com/coconutruben
2025-07-03 17:13:24 +00:00
5cfe4377d6 [dtensor] Rework partial propagation in pointwise op and support mul (#157340)
I am trying to see if I can easily add the linearity support for aten.mul to allow Partial placement to propagate through. But it turns out that I have to completely rework the current linearity propagation.

In short, before this PR, linearity mainly supported aten.add and some trivial ops. It worked by allowing input Partial placements to propagate while redistributing Replicate inputs to Partial to preserve single-device semantics. For example, suppose we want to execute `aten.add(lhs, rhs)` on 2 ranks:
* `lhs` is partial, value on rank 0: `r0`, lhs value on rank 1: `r1`
* `rhs` is replicate, value: `a`

Then, in order to preserve single-device semantics (which should produce the value `a + r0 + r1`), we compute `rhs/world_size` first, then add the result to `lhs`. This means every operand first needs to be partial before we can add them together.

But this does not hold for multiplicative operations like `aten.mul`. Assuming the same `aten.mul(lhs, rhs)` and values, we don't need to divide `rhs` by world_size to preserve single-device semantics, because `a * (r0 + r1) = a * r0 + a * r1`.
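
The arithmetic behind the add/mul difference can be checked with a small simulation. This sketch models a Partial(sum) tensor as a list of per-rank scalars and a Replicate tensor as a single scalar; the helper names `partial_add` and `partial_mul` are hypothetical and do not correspond to DTensor APIs.

```python
from typing import List


def partial_add(lhs_shards: List[float], rhs_replicated: float) -> List[float]:
    """Per-rank results of lhs + rhs when lhs is Partial(sum) and rhs is Replicate."""
    world_size = len(lhs_shards)
    # The replicated operand must be scaled so it is counted once after the sum.
    return [shard + rhs_replicated / world_size for shard in lhs_shards]


def partial_mul(lhs_shards: List[float], rhs_replicated: float) -> List[float]:
    """Per-rank results of lhs * rhs; no scaling because a*(r0+r1) = a*r0 + a*r1."""
    return [shard * rhs_replicated for shard in lhs_shards]


r = [3.0, 5.0]  # r0 on rank 0, r1 on rank 1; the full lhs value is r0 + r1
a = 4.0         # replicated rhs value

# Reducing (summing) the per-rank results recovers single-device semantics.
assert sum(partial_add(r, a)) == a + sum(r)
assert sum(partial_mul(r, a)) == a * sum(r)
```

Skipping the division in the additive case would count `a` once per rank, which is why the two kinds of linearity need to be tracked separately.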

So to accommodate the difference between add and mul, in this PR I:
* change linearity to be an int to support different linearity types; additive and multiplicative linearity are separate
* add checks to ensure only a subset of partial types can support linearity (namely partial-sum/avg)
* handle the linearity type plumbing through the pointwise ops.
* add `mul.Tensor/Scalar` to be the multiplicative linearity
* added the tests to show that the partial placements can be propagated with `aten.mul`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157340
Approved by: https://github.com/zpcore
2025-07-03 17:04:08 +00:00
898179331e [cutlass backend] fix CutlassTensor post-renaming (#157408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157408
Approved by: https://github.com/mlazos
ghstack dependencies: #157402
2025-07-03 17:02:21 +00:00
2e64e45b0b Revert "[build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)"
This reverts commit 404008e3efdabeaf5b140a3aff77131461c33a0a.

Reverted https://github.com/pytorch/pytorch/pull/155998 on behalf of https://github.com/malfet due to Broke inductor_cpp, wrapper see e472daa809/1 ([comment](https://github.com/pytorch/pytorch/pull/155998#issuecomment-3032915058))
2025-07-03 16:47:07 +00:00
e472daa809 [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-03 16:42:06 +00:00
ec816d73b4 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
2025-07-03 15:48:37 +00:00
f17f658125 [profiler] add more CUDA API for kernel launcher (#156016)
Add more kernel detection options, resolving TODO
- References : [NVIDIA - docs](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156016
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-07-03 15:26:42 +00:00
c9174a20f7 Revert "[BE] Unskip special ops (#157464)"
This reverts commit e124a0d88ca2aa04bfaca2dcabf5de6244048e45.

Reverted https://github.com/pytorch/pytorch/pull/157464 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
b6276a425f Revert "[MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)"
This reverts commit 9620994067b18e846a097d1e99af85ec2426ef0a.

Reverted https://github.com/pytorch/pytorch/pull/157488 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
a0e0abd037 Fix typo: 'intialized' → 'initialized' in test_modules.py (#157226)
This PR fixes a minor typo in `test/jit/test_modules.py`:

- Before: `intialized`
- After:  `initialized`

There are no functional code changes — this is a comment-only fix to improve clarity and consistency.

Thank you to the PyTorch team for maintaining this outstanding project.
Please let me know if anything else is needed.

With appreciation,
Abhishek Nandy
[@abhitorch81](https://github.com/abhitorch81)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157226
Approved by: https://github.com/Skylion007
2025-07-03 14:56:02 +00:00
b221be9140 Fix typo: 'intial_query_grad' → 'initial_query_grad' in test_transformers.py (#157306)
This is a minor typo fix in `test/test_transformers.py`:

- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.

There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.

Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
2025-07-03 14:08:12 +00:00
8408522976 Remove +PTX from CUDA 12.8 builds (#157516)
Remove +PTX from CUDA 12.8 builds and small refactor in build_cuda.sh.
Removing +PTX reduces the binary size, which is required to be able to upload binaries to PyPI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157516
Approved by: https://github.com/malfet, https://github.com/ptrblck, https://github.com/tinglvv
2025-07-03 13:19:19 +00:00
c329a8f19c Fix CPU bitwise shifts for out-of-limit values in VSX-vec (#157463)
Similar to #96659 this implements the conditionals handling the out-of-limit values in the shift amounts (rhs) for the vectorized VSX code using the same logic as the scalar code.

Fixes #109777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157463
Approved by: https://github.com/jgong5
2025-07-03 10:41:33 +00:00
5dfd8a9c7a Remove is_jit_trace option (#157387)
Summary: Title

Test Plan:
CI

Rollback Plan:

Differential Revision: D77319249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157387
Approved by: https://github.com/pianpwk
2025-07-03 09:20:27 +00:00
8c2e450082 [PT][FSDP] fail set_allocate_memory_from_process_group if used together with custom comm hooks (#157487)
Summary:
This is a follow up after the PR to add comm override support: https://github.com/pytorch/pytorch/pull/155189

The previous PR only loosely checks the allocation mixin classes, which isn't really safe because the actual hook may still override the behavior.
This may lead to unnecessary confusion with no good use case. So for now we make the two sets of APIs largely incompatible:
1. Setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is OK.
2. Calling `set_allocate_memory_from_process_group_for_comm()` after setting custom comms is not OK (it fails).

Basically, `set_allocate_memory_from_process_group_for_comm` is like a drop-in hammer, while the `set_custom_all_gather/reduce_scatter()` APIs are like finer-grained scalpels that require more hand-crafted code.

We can revisit this if a use case in between emerges, but for now they can be viewed as largely independent of each other (even though they share some of the underlying pieces for now; that is subject to change and should not be exposed to end users).

Test Plan: added UT

Differential Revision: D77681620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157487
Approved by: https://github.com/weifengpy
2025-07-03 07:00:35 +00:00
2bb33e7a08 Fixed triton kernel in ET due to Triton version change. (#157484)
Summary: Fixed triton kernel in ET due to Triton version change.

Test Plan:
buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Rollback Plan:

Differential Revision: D77398841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157484
Approved by: https://github.com/davidberard98
2025-07-03 06:16:23 +00:00
4ce6e6ec88 XCCL changes for DDP (#155497)
Add XCCL documentation for DDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155497
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-03 05:18:08 +00:00
382598ef87 Fix unsafe collective reorder past wait (#157489)
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP - all_gather(tp+dp sharded param on TP dim) -> allgather dp_sharded buffer on DP dim

Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note:  this exemption was incorrect, so it should be removed. But it was also put there for a reason, to help move collectives past wait nodes that are not related to that collective.  After this fix, reordering performance may be worse and we need to find a smarter way to decide if a particular wait node is a blocker for a given collective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
2025-07-03 05:04:19 +00:00
dc524efb4d Move logging into inner method for reorder pass (#156879)
The reason for inner/outer method is to keep the outer method conforming
to the typedef for a comms graph pass which returns one obj, while
allowing unit tests to call the inner method that returns more metadata
useful for testing the pass.  The logs should be in the inner part, so
they are functional also during unit testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
2025-07-03 05:04:19 +00:00
5d5a5b3501 Fix GITHUB_OUTPUT syntax in create_release.yml workflow (#157466)
#149919 fixed a number of linting issues, however, the conversion of the deprecated `::set-output` command to the new `>> $GITHUB_OUTPUT` redirect syntax went wrong, resulting in [failing uploads of the 2.8.0 rc1-rc3 pre-release tarballs](https://github.com/pytorch/pytorch/actions/runs/15892205745/job/44816789782).

This PR fixes that.
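
For context, the replacement for `::set-output` appends `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable. A minimal Python sketch of that pattern, assuming single-line values; the helper name `set_output` is illustrative and is not taken from `create_release.yml`:

```python
import os


def set_output(name: str, value: str) -> None:
    """Append a step output in the `name=value` form GitHub Actions reads.

    `set_output` is a hypothetical helper, not code from the workflow.
    """
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        # Each output is one `name=value` line appended to the file.
        with open(output_path, "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    else:
        # Outside of GitHub Actions there is no output file; just show the value.
        print(f"{name}={value}")
```

Multi-line values need the delimiter form described in the GitHub Actions documentation; this sketch does not cover it.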

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157466
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-07-03 04:57:52 +00:00
404008e3ef [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist from a fresh git clone if we will upload the output to PyPI, because all files under `third_party` are included in the SDist. The SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
2025-07-03 04:10:44 +00:00
b642a5c118 [cutlass backend] Add dynamo timed (#157410)
Differential Revision: [D77631592](https://our.internmc.facebook.com/intern/diff/D77631592/)

Before:
![Screenshot 2025-07-01 at 4 08 06 PM](https://github.com/user-attachments/assets/8f6445aa-50c7-456f-b5ac-b2749eb9bf40)

After (different run):
![Screenshot 2025-07-01 at 5 11 09 PM](https://github.com/user-attachments/assets/7513d312-c4dc-4e39-9718-c63eb641bc30)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157410
Approved by: https://github.com/jingsh
2025-07-03 04:03:20 +00:00
493f42a541 [symm_mem] Create a one side get api for symm mem (#157294)
This does something similar to https://github.com/pytorch/pytorch/pull/156443, so that we also have a one-sided get API for symmetric memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157294
Approved by: https://github.com/kwen2501
2025-07-03 03:52:05 +00:00
662c1cfed2 [c10d][PGNCCL] Add waitcounter for watchdog and heartbeat monitoring thread (#157480)
We want a wait counter for both side threads (the watchdog and the heartbeat monitor) so that we can monitor their lifecycles.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157480
Approved by: https://github.com/d4l3k
2025-07-03 02:47:06 +00:00
5cc4e856fd Add device_id to XPU device properties (#156481)
# Motivation

Some older Intel iGPUs may share the same device name across different hardware products.
(See [device name example](aaa01c06f9/shared/source/dll/devices/devices_base.inl (L190-L199)))
To help disambiguate which specific iGPU product is being used, we introduce the use of a
[device id](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-id). This device id corresponds to the Device ID in [official Intel product specification](https://www.intel.com/content/www/us/en/products/sku/232155/intel-core-i71360p-processor-18m-cache-up-to-5-00-ghz/specifications.html) and enables more accurate identification and troubleshooting for user issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156481
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-07-03 01:22:11 +00:00
7597988f1b [fake tensor] fix issue of no attribute tags (#156689)
Fixes #156688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156689
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2025-07-03 01:16:01 +00:00
9620994067 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
2025-07-02 23:29:35 +00:00
e124a0d88c [BE] Unskip special ops (#157464)
They were slow on CUDA-11.3, which is long gone; let's see if they work now.

Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s

OK (skipped=80)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
2025-07-02 23:16:52 +00:00
7cfd054075 [attempt 2] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#157472)
Summary:
When we compute contiguity for a tensor with dynamic shapes, we:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.

sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors (DDEs).

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity; I will follow up with changes for other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.

Test Plan:
contbuild & OSS CI,

Rollback Plan:

Reviewed By: malfet

Differential Revision: D77639021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
2025-07-02 23:12:29 +00:00
d40aaa42ee [BE][16/16] fix typos in torch/ (torch/utils/) (#156606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156606
Approved by: https://github.com/albanD
ghstack dependencies: #156318, #156320, #156602, #156604
2025-07-02 22:55:29 +00:00
11c07c848c [BE][14/16] fix typos in torch/ (torch/fx/) (#156604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156604
Approved by: https://github.com/jingsh
ghstack dependencies: #156318, #156320, #156602
2025-07-02 22:55:29 +00:00
db259bd6b8 [BE][12/16] fix typos in torch/ (#156602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156602
Approved by: https://github.com/justinchuby, https://github.com/albanD
ghstack dependencies: #156318, #156320
2025-07-02 22:55:29 +00:00
d5cdc36943 [BE][10/16] fix typos in torch/ (torch/csrc/jit/) (#156320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156320
Approved by: https://github.com/albanD
ghstack dependencies: #156318
2025-07-02 22:55:29 +00:00
541584d22e [BE][8/16] fix typos in torch/ (torch/csrc/jit/) (#156318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156318
Approved by: https://github.com/albanD
2025-07-02 22:55:29 +00:00
c0e155a8d2 [cutlass backend] Use alignment of D for EVT / Float8 (#157402)
I encountered a C++ compile error when running the cutlass backend tests after upgrading the cutlass version. It seems Nvidia added
"static_assert(detail::is_aligned<ElementC_, AlignmentC, ElementD_, AlignmentD>(),"

b995f93317/include/cutlass/epilogue/collective/builders/sm90_builder.inl (L297)

However, it seems codegen has the wrong alignment for D. For C, 1 is okay since it is void. But for D, this is probably wrong.
```
    void, cutlass::layout::ColumnMajor, 1,
    cutlass::bfloat16_t, cutlass::layout::RowMajor, 1,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157402
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-07-02 22:55:00 +00:00
48560eef80 [dynamo] Fix bug in dict(mapping_proxy) (#157467)
Fixes https://github.com/pytorch/pytorch/issues/157284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157467
Approved by: https://github.com/jansel, https://github.com/StrongerXi

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-02 22:13:02 +00:00
fd4f704905 [ez][CI] Print set output in CI (#157477)
Print the output that is being set, for better debugging

It's probably bad that there are 4 copies of this, but I'm also not sure whether imports will behave correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157477
Approved by: https://github.com/huydhn
2025-07-02 21:47:19 +00:00
60e66d11ab [CI] Keep-going on main (#157470)
Run an experiment where we turn on keep going on main.  Revert this PR to cancel the experiment

There have been a couple of changes that make it so that HUD will show the failure early even while the job is in progress, so triaging for reverts should still be able to happen quickly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157470
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-02 21:42:46 +00:00
4b4c2a7b1d Support complex numbers in DTensor redistribute (#157329)
Add complex number unwrapping in functional collectives used by DTensor.

Complex tensors are not directly supported by underlying comm kernels
(e.g. nccl) but complex tensors can be viewed as real tensors of a
higher rank (the added size-2 tensor dim holds the real and imaginary components).
Collective output is then viewed as complex to restore the
original/expected shape and dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329
Approved by: https://github.com/XilunWu
2025-07-02 21:37:16 +00:00
af9c92b4cb [CI] Remove redundant accuracy benchmarks for cpp_wrapper (#155966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155966
Approved by: https://github.com/desertfire
2025-07-02 20:58:08 +00:00
c09cf29d7d [ez][BE] Tag deletion script to delete any old ciflow + autorevert tags (#157468)
Change the branch/tag deletion script that runs once per day to delete more tags

Previous: only delete ciflow tags that didn't correspond to an open PR
New: delete ciflow tags attached to commits that are > 7 days old.  Also delete `trunk/<sha>` (I think they are for autorevert) tags that are attached to commits that are > 7 days old

It's hard to figure out when the actual tag was pushed or created, so instead it looks at the commit date, which might lead to unexpected behavior if the tag was pushed much later than the commit (ex triggering periodic later to bisect).  I think it's ok though since you don't really need the tag after the workflow runs
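
The age rule above can be sketched as follows. This is a minimal illustration, assuming the tag-to-commit-date mapping has already been collected; the function name `tags_to_delete` and the input shape are hypothetical and do not come from the actual deletion script:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # threshold stated in the description above


def tags_to_delete(tag_commit_dates, now):
    """Return ciflow/ and trunk/ tags whose commit is older than MAX_AGE.

    `tag_commit_dates` maps tag name -> commit datetime (hypothetical input).
    The commit date is used as a proxy, since the push time of a tag is
    hard to recover.
    """
    return sorted(
        tag
        for tag, commit_date in tag_commit_dates.items()
        if tag.startswith(("ciflow/", "trunk/")) and now - commit_date > MAX_AGE
    )
```

Because the commit date is used rather than the tag's push date, a tag pushed recently on an old commit is selected as well, which is the caveat the description mentions.
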
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157468
Approved by: https://github.com/izaitsevfb
2025-07-02 20:42:32 +00:00
6f60cfe9b1 [ez] Add super().setUp() in test_ops::TestFakeTensor (#157475)
Noticed some disable issues getting a bunch of comments, so I took a look

One day I'll write a better check for this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157475
Approved by: https://github.com/huydhn
2025-07-02 20:34:00 +00:00
e20784f228 [dynamo] Support BUILTIN_MATCH serialization. (#157016)
Serialize BUILTIN_MATCH since they are all stored in __builtin__ dict.

Also fixed an issue where the wrong global scope was passed to CheckFunctionManager while loading guards. Previously we could always reuse the compile-time global scope for evaluating guards because the compile-time and runtime global scopes were always the same.

For precompile, we need to serialize the compile-time global scope for loading only. We need to point the CheckFunctionManager to the new global scope after loading is finished for evaluating guards.

Differential Revision: [D77159313](https://our.internmc.facebook.com/intern/diff/D77159313/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157016
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-07-02 20:24:24 +00:00
172853547a [inductor] more size_hint_or_throw usage (#157394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157394
Approved by: https://github.com/jingsh
2025-07-02 20:20:59 +00:00
e0ab1b538a [ez][BE] Remove max jobs override for CI build jobs (#157473)
Basically reverts #147487 since it's not needed anymore

Not an exact revert because some things have already been removed in a different PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157473
Approved by: https://github.com/huydhn
2025-07-02 20:12:28 +00:00
3f569f9af7 [BE] Remove extra semicolon (#157486)
Fixes
```
/Users/nshulga/git/pytorch/pytorch/torch/nativert/executor/GraphExecutorBase.cpp:16:58: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
   16 |       execPlan_(ExecutionPlanner{graph_}.createPlan()) {};
      |                                                          ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157486
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/Skylion007
2025-07-02 19:56:21 +00:00
94716db222 [BE][DCE] eliminate remnants of global gemm cache (#157327)
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. Meaning, this is now dead code that we can remove.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77520979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
2025-07-02 19:52:35 +00:00
06f39a71b6 Add Release 2.8 CUDA matrix. Update Release schedule for 2.7.1 and 2.9 (#157482)
This PR:
- Adds Release 2.8 CUDA matrix
- Update Release 2.9 schedule, to make it more similar to 2.5 release schedule. Mid Oct release
- Update 2.7.1 release day
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157482
Approved by: https://github.com/Camyll
2025-07-02 19:52:24 +00:00
36dd598bda layernorm tests: Tweak test thresholds for comparing tensors (#156699)
After I landed this PR: https://github.com/pytorch/pytorch/pull/156600, this test was failing internally on large tensors because the differences were greater than tolerances on some cuda devices.

We now raise the tolerances for larger tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156699
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-07-02 19:33:38 +00:00
32983ea698 [nativert] continue to move generated static dispatch kernels (#157460)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77623080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157460
Approved by: https://github.com/zhxchen17
2025-07-02 19:28:13 +00:00
5e636d664a [BE] @serialTest decorator must be called (#157388)
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```

That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.013s

```

Added assert that arg is boolean in the decorator to prevent such silent skips in the future
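
Why the uncalled form silently passes can be seen with a plain-Python stand-in for a decorator factory. This is a sketch under assumptions: `serial_test_unchecked` and `serial_test` are hypothetical names, and setting a `serial` attribute stands in for attaching the pytest mark; it is not the code in `common_utils`.

```python
def serial_test_unchecked(condition=True):
    """Decorator factory without argument validation (hypothetical stand-in)."""
    def decorator(fn):
        if condition:
            fn.serial = True  # stands in for attaching a pytest mark
        return fn
    return decorator


def serial_test(condition=True):
    """Same factory, but it rejects a non-boolean argument up front."""
    assert isinstance(condition, bool), "call the decorator: @serial_test()"
    return serial_test_unchecked(condition)


class Demo:
    @serial_test_unchecked  # missing parentheses: the method becomes `decorator`
    def test_never_runs(self):
        raise AssertionError("test body executed")

    @serial_test()  # called correctly: the original method is kept and marked
    def test_runs(self):
        return "ran"
```

With the unchecked factory, the test function is passed in as `condition`, so the name `test_never_runs` ends up bound to the inner `decorator` and the original body is never executed. With the boolean check, the same mistake fails at class definition time instead.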

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
2025-07-02 19:15:19 +00:00
eaf32fffb7 fixed a tiny typo in torch.compiler.md (#157462)
Fixes #157444

there was a typo in [docs/source/torch.compiler.md](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler.md) : see -> seen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157462
Approved by: https://github.com/Skylion007, https://github.com/svekars
2025-07-02 19:15:15 +00:00
0e9d8032a3 [build] remove cmake cache and reconfigure again if it is invalid (#156958)
See also:

- astral-sh/uv#14269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156958
Approved by: https://github.com/Skylion007
ghstack dependencies: #156742
2025-07-02 18:46:32 +00:00
0105cd89ab [ONNX] Fix conversion of attention - 4D (#157130)
Fixes a wrong conversion to ONNX found while investigating #149662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157130
Approved by: https://github.com/gramalingam, https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-07-02 18:05:10 +00:00
d5d14ee823 [nativert] create persistent value helper (#157286)
Summary: att

Test Plan: CI

Reviewed By: georgiaphillips

Differential Revision: D74300519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157286
Approved by: https://github.com/SherlockNoMad
2025-07-02 17:15:52 +00:00
156bc243f0 Back out "Include c++ stack traces when we hit constraint violation (#155603)" (#157406)
Summary:
Original commit changeset: 4b3fdaa8f2c6

Original Phabricator Diff: D76434787

Meta:
https://fb.workplace.com/groups/1286739428954016/permalink/1535462614081695/

Test Plan:
Meta:
Revert D76434787 for S536719

Rollback Plan:

Differential Revision: D77626334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157406
Approved by: https://github.com/bobrenjc93
2025-07-02 16:51:07 +00:00
bd6b5fddbf [Precompile] [easy] Serialize requires_grad for tensors when serializing guards (#157372)
Need to keep requires_grad on the tensor when serializing/deserializing guards. This matters when there's a TENSOR_MATCH guard on a tensor that requires_grad. Added a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157372
Approved by: https://github.com/jansel, https://github.com/zhxchen17
ghstack dependencies: #156433
2025-07-02 16:34:37 +00:00
54701a0c94 Add is_hidden_event method to KinetoEvent Python interface (#155214)
Fixes #155213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155214
Approved by: https://github.com/sraikund16
2025-07-02 16:29:21 +00:00
0edc1b91f7 [Inductor] Disable decompose_k for AMD (#157283)
Differential Revision: D77544250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157283
Approved by: https://github.com/bdhirsh
2025-07-02 15:21:46 +00:00
9f5276dc07 Fix typo: 'Intializes' → 'Initializes' in _distributed_c10d.pyi docst… (#157455)
Description:

This PR fixes a small documentation typo in torch/_C/_distributed_c10d.pyi, correcting:

Intializes → Initializes

This helps improve clarity in internal docstrings for maintainers and contributors.
Let me know if further changes are needed. Thanks for your time and the amazing work on PyTorch!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157455
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-02 15:19:05 +00:00
9d175bc7e6 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-07-02 15:04:00 +00:00
b096341963 [BE] use pathlib.Path instead of os.path.* in setup.py (#156742)
Resolves:

- https://github.com/pytorch/pytorch/pull/155998#discussion_r2164376634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156742
Approved by: https://github.com/malfet
2025-07-02 14:57:58 +00:00
82eefaedd9 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
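
The failure mode can be reproduced in plain Python. The sketch below is illustrative only: `embed_naive` and `embed_sanitized` are hypothetical helpers, and escaping the quotes is just one possible sanitization, not necessarily what this PR does.

```python
USER_KERNEL = """\
def add_one(x):
    '''Add one to x.'''
    return x + 1"""


def embed_naive(kernel_src: str) -> str:
    """Wrap source in a triple-quoted block with no sanitization."""
    return f"kernel_src = '''\n{kernel_src}\n'''"


def embed_sanitized(kernel_src: str) -> str:
    """Escape backslashes and triple quotes before wrapping."""
    escaped = kernel_src.replace("\\", "\\\\").replace("'''", "\\'\\'\\'")
    return f"kernel_src = '''\n{escaped}\n'''"


# The naive form is not valid Python: the docstring closes the block early.
try:
    compile(embed_naive(USER_KERNEL), "<generated>", "exec")
    naive_ok = True
except SyntaxError:
    naive_ok = False

# The sanitized form compiles and round-trips the original source.
namespace = {}
exec(embed_sanitized(USER_KERNEL), namespace)
round_trip = namespace["kernel_src"]
```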

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-02 14:02:01 +00:00
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931d47bd2716398a0c335b2807dc3074d.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
d5a89178b0 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 77676753ecabf6a6645bdd3abfe01939e5751e76.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on [D77652793](https://www.internalfb.com/diff/D77652793). Maybe codev could be a better strategy to merge this PR faster... ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3027952946))
2025-07-02 13:48:41 +00:00
bdb7819166 [dynamo, nested graph breaks] remove recursive cell/freevar in instruction tx (#154078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154078
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 13:36:14 +00:00
34c8033fd3 Fix a div_mod bug in generic_math.h (#157383)
Summary: Integer div_mod has a bug: when the remainder is 0 and the divisor is negative, the mod operation produces a negative number. This PR fixes it.
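
To illustrate this class of bug, here is a Python sketch of floor-style div/mod built on top of truncating division (the behavior of C++ integer `/` and `%`). The buggy variant is a hypothetical reconstruction that reproduces the symptom described; it is not the code from `generic_math.h`.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, like C++ `/`."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def trunc_mod(a: int, b: int) -> int:
    """Remainder matching trunc_div, like C++ `%`."""
    return a - trunc_div(a, b) * b


def div_mod_buggy(a: int, b: int):
    """Hypothetical bug: adjusts whenever the signs differ, even if r == 0."""
    q, r = trunc_div(a, b), trunc_mod(a, b)
    if (r < 0) != (b < 0):
        q -= 1
        r += b
    return q, r


def div_mod(a: int, b: int):
    """Floor div/mod: only adjust when there is a nonzero remainder."""
    q, r = trunc_div(a, b), trunc_mod(a, b)
    if r != 0 and (r < 0) != (b < 0):
        q -= 1
        r += b
    return q, r
```

For `a = 6, b = -3`, the buggy variant returns a remainder of `-3` instead of `0`, while `div_mod` agrees with Python's built-in `divmod` for nonzero divisors.
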
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157383
Approved by: https://github.com/angelayi, https://github.com/jingsh
2025-07-02 12:22:57 +00:00
ab2294d828 [dynamo] fix _torchdynamo_orig_callable naming issues (#156901)
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks

We rename ~the first usage to `_torchdynamo_orig_fn`~ and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.

UPDATE: it seems both internal and OSS users depend on `_torchdynamo_orig_callable`, but only in the first context. We should thus keep the original name for the first case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 09:53:55 +00:00
3173616532 [nativert] start to move generated static dispatch kernels (#157403)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77622952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157403
Approved by: https://github.com/georgiaphillips
2025-07-02 08:42:01 +00:00
8c0df6fe17 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 42b48ee67229286127390000f103a11dfc8901f5.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red in internal runs, check D77593713 ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3026944993))
2025-07-02 08:30:39 +00:00
0364db7cd1 [PT] support custom all_gather and reduce_scatter comms (#155189)
Summary:
This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom behavior for all-gather and reduce-scatter respectively.

This allows users to control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see the docstring of `Comm` in `_fsdp_api.py`

Related PR:
https://github.com/pytorch/pytorch/pull/150564

Test Plan: CI

Differential Revision: D75714362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
2025-07-02 06:58:45 +00:00
f8c0a4bd28 [inductor] enable bf32 test for mkldnn conv (#127293)
Enable more tests for inductor conv + bf32
Test plan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv_transpose2d_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_binary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127293
Approved by: https://github.com/jgong5
ghstack dependencies: #126050, #126054

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:49:01 +00:00
4c8eb65efb allow to use bf16 as fp32 internal precision for mkldnn conv backward (#126054)
Used for CI, since this depends on an ideep update.

Allow using `BF16` as the internal computation data type via `torch.backends.mkldnn.conv.fp32_precision="bf16"`

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d backward vs. BF16 internal computation conv backward on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 461.6734| 358.3779| 1.48
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 358.3779 | 247.8631| 1.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 4.3783| 3.8513| 1.14

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 16.6119 | 12.2047 | 1.38
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 12.0016 | 8.6711 | 1.38
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 20.5947 | 15.9366 | 1.29
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 40.0952 | 32.2222 | 1.24
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 162.7449 | 142.3054 | 1.14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126054
Approved by: https://github.com/jgong5
ghstack dependencies: #126050

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:40:13 +00:00
5a2db5152d allow to use bf16 as fp32 internal precision for mkldnn conv (#126050)
Allow using `BF16` as the internal computation data type via `torch.backends.mkldnn.conv.fp32_precision="bf16"`.

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d vs. BF16 internal computation conv2d on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 185.5071 | 83.4749 | 2.22
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 194.7558 | 79.1683| 2.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 1.9213 | 1.3690 | 1.40

56 cores:

Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 6.5804  | 7.4349 | 0.89
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 4.9940  | 3.8093 | 1.31
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 8.8359 | 5.5802 | 1.58
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 16.5800 | 9.2367 | 1.80
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 79.5436 | 38.3861  | 2.07
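
The "Speed up" column in the tables above is the fp32 time divided by the bf16-internal time. A minimal sketch that recomputes it from the timings listed for this commit (the variable and function names are illustrative, not part of the PR):

```python
# Timings in ms copied from the tables above, as (fp32, bf16 internal) pairs.
single_core = [(185.5071, 83.4749), (194.7558, 79.1683), (1.9213, 1.3690)]
cores_56 = [
    (6.5804, 7.4349),
    (4.9940, 3.8093),
    (8.8359, 5.5802),
    (16.5800, 9.2367),
    (79.5436, 38.3861),
]


def speedups(rows):
    """Return fp32 / bf16 for each row, rounded to two decimals."""
    return [round(fp32 / bf16, 2) for fp32, bf16 in rows]


print(speedups(single_core))
print(speedups(cores_56))
```

A value below 1 means the bf16-internal path was slower than fp32 for that shape, as in the first 56-core row.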

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126050
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:31:23 +00:00
0a63053fe9 Don't store flamegraph to tmp folder (#157374)
Where it's accessible (and mutable) by multiple users. Use the
`~/.cache` folder instead
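
A minimal sketch of the idea, assuming a per-user directory under `~/.cache` restricted to its owner. The function name, the `app_name` argument, and the `0o700` mode are assumptions for illustration, not taken from the PR:

```python
import os
from pathlib import Path
from typing import Optional


def user_cache_dir(app_name: str, home: Optional[Path] = None) -> Path:
    """Create and return <home>/.cache/<app_name>, readable only by its owner."""
    base = (home if home is not None else Path.home()) / ".cache" / app_name
    base.mkdir(parents=True, exist_ok=True)
    # Unlike a shared tmp folder, other users cannot read or replace files here.
    os.chmod(base, 0o700)
    return base
```

Passing `home` explicitly is only there to make the sketch easy to try in a scratch directory.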

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157374
Approved by: https://github.com/eqy
ghstack dependencies: #157373
2025-07-02 00:46:51 +00:00
bb476310a4 [dynamo][guards] Stash root guard manager pointer in the LeafGuard (#157325)
Preparing to simplify the recompilation reason codebase. This PR was 95% done by using AI tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157325
Approved by: https://github.com/jansel
2025-07-02 00:42:43 +00:00
fa1c20ae92 Fix test consolidate hf safetensors (#157386)
Update the test to use an argument name that was changed, so that it doesn't throw.

Differential Revision: [D77604210](https://our.internmc.facebook.com/intern/diff/D77604210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157386
Approved by: https://github.com/meetv18
ghstack dependencies: #154743, #156705
2025-07-02 00:16:21 +00:00
77676753ec [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-02 00:10:01 +00:00
617e3f69f8 [FP8] Fix Benchmarking for certain Priors (#155722)
Summary: For priors like layer norm, the order of the weight quantization kernel might be different and therefore have a different suffix, so we match with a regular expression instead.
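
A minimal sketch of matching a kernel name whose numeric suffix varies, rather than comparing against one exact name. The kernel name below is hypothetical and only illustrates the pattern; it is not the name used in the PR:

```python
import re

# Hypothetical kernel name; the trailing index depends on codegen order.
QUANT_KERNEL = re.compile(r"triton_poi_fused_quantize_rowwise_\d+")


def is_quant_kernel(name: str) -> bool:
    """True if the name is the quantization kernel, whatever its suffix."""
    return QUANT_KERNEL.fullmatch(name) is not None


print(is_quant_kernel("triton_poi_fused_quantize_rowwise_0"))
print(is_quant_kernel("triton_poi_fused_quantize_rowwise_17"))
print(is_quant_kernel("triton_poi_fused_layer_norm_3"))
```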

Test Plan:
Trying this on model id 737772166 with
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}}"
```
will allow more linears to be correctly replaced with fp8.
An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.

Rollback Plan:

Differential Revision: D76092551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722
Approved by: https://github.com/Skylion007
2025-07-02 00:01:23 +00:00
ab6cb34480 Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)"
This reverts commit 563fd95563c5edd732ae260b3bd3d0c38822ab57.

Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))
2025-07-01 23:21:37 +00:00
c6a27bae36 Revert "[do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit d0a9629435aaceb5acbf31aad70f2109cb8a3ea2.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/laithsakka due to was asked by to land this internally  ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3025796794))
2025-07-01 22:58:14 +00:00
563fd95563 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
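
A minimal sketch of the failure mode and one way around it, assuming the kernel source is embedded as a `'''` string literal. This is an illustration only, not the approach taken in the PR; the function name is made up:

```python
import ast


def as_triple_quoted_literal(source: str) -> str:
    """Wrap source in ''' ... ''' so that it parses back to the same text."""
    # Escape backslashes first, then every single quote, so that neither an
    # embedded ''' nor a trailing ' can terminate the literal early.
    escaped = source.replace("\\", "\\\\").replace("'", "\\'")
    return "'''" + escaped + "'''"


kernel_src = "def kernel(x):\n    '''Doubles x.'''\n    return x * 2\n"
literal = as_triple_quoted_literal(kernel_src)
round_trip = ast.literal_eval(literal)
print(round_trip == kernel_src)
```

Wrapping `kernel_src` in `'''` without escaping would end the literal at the start of the docstring.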

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-01 22:51:11 +00:00
6ef70edd9a Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 47f10d0ad0dda281c886ff08ac2f938207027316.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908))
2025-07-01 22:11:53 +00:00
3df6360e8c [BE][Easy][setup] use super().method(...) in command subclasses in setup.py (#156044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156044
Approved by: https://github.com/albanD
ghstack dependencies: #156741
2025-07-01 22:09:10 +00:00
d0a9629435 [do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)
When we compute contiguity for a tensor with dynamic shapes, we:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.

`sym_is_contiguous` returns a SymBool that is then either evaluated, or `guard_or_false` can be called
on it to avoid data-dependent errors.

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
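
A minimal Python sketch of the "or false" idea for default (row-major) contiguity. Unknown sizes or strides are modeled as `None`, which is an assumption of this sketch; the real implementation works on symbolic expressions rather than `None`:

```python
from typing import Optional, Sequence


def is_contiguous_or_false(
    sizes: Sequence[Optional[int]], strides: Sequence[Optional[int]]
) -> bool:
    """Return True only when row-major contiguity can be proven."""
    # An empty tensor is contiguous regardless of its strides.
    if any(size == 0 for size in sizes):
        return True
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size is None:
            return False  # cannot decide: answer False instead of raising
        if size == 1:
            continue  # the stride of a size-1 dimension does not matter
        if stride is None or stride != expected:
            return False
        expected *= size
    return True


print(is_contiguous_or_false([2, 3], [3, 1]))
print(is_contiguous_or_false([None, 3], [3, 1]))
```

Returning `False` for the undecidable case is safe when the caller only uses `True` to take a fast path.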

In this PR I only handle default contiguity; I will follow up with changes for other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-07-01 21:39:38 +00:00
22edb457c9 [invoke_subgraph][partitioner] Add meta val on run_and_save_rng ops (#157319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157319
Approved by: https://github.com/zou3519
2025-07-01 21:02:08 +00:00
e5f6ffd810 [BE] Replace checkcall("chmod") with os.chmod (#157373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157373
Approved by: https://github.com/clee2000, https://github.com/eqy, https://github.com/Skylion007
2025-07-01 20:46:25 +00:00
019e30e3b8 [BE] Decorate LargeTensorTest with serialTests (#157382)
Maybe it will help make M2-15 jobs more stable, as that was the last test run before the OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157382
Approved by: https://github.com/clee2000
2025-07-01 20:35:42 +00:00
4500a4aa50 remove allow-untyped-defs from torch/backends/mps/__init__.py (#157227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157227
Approved by: https://github.com/Skylion007
2025-07-01 20:00:19 +00:00
6bc263809d [SymmMem] Add NVSHMEM_CHECK macro (#157174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157174
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-07-01 19:50:28 +00:00
ffac0de07e [export] Remove stack trace from input/output (#157302)
Fixes https://github.com/pytorch/pytorch/issues/157183

https://github.com/pytorch/pytorch/pull/156257 consolidated the path for saving stack traces, but missed the part where stacktraces are not added to placeholder/output nodes in proxy_tensor tracing [(code)](https://github.com/pytorch/pytorch/pull/156257/files#diff-6960ce90e7162c0953b1ca07e92e7f0f2f6ba63b427b42df593e20cc6a096bb7L1107).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157302
Approved by: https://github.com/yushangdi
2025-07-01 19:16:28 +00:00
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
6401d1d53d Revert "Fused RMSNorm implementation (#153666)"
This reverts commit e1aee86646aa6d1b9cb9d34351e43936401c5efc.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001) [HUD commit link](e1aee86646) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176))
2025-07-01 18:46:45 +00:00
3a5677a380 Revert "ci: Add ability to test images for build-triton-wheel (#156894)"
This reverts commit 0e47312ae5a687f0aed61db753d03180118cddc4.

Reverted https://github.com/pytorch/pytorch/pull/156894 on behalf of https://github.com/seemethere due to causing issues in downstream builds see https://github.com/pytorch/pytorch/pull/156664 for more info ([comment](https://github.com/pytorch/pytorch/pull/156894#issuecomment-3025137790))
2025-07-01 18:43:34 +00:00
02608e560a [ROCm] Add more shards for inductor dashboard, more frequent runs (#157288)
Also increases regularity of dashboard runs on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157288
Approved by: https://github.com/jeffdaily
2025-07-01 18:27:30 +00:00
e1aee86646 Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked against the unfused torch implementation and the torch.compile implementation: around a 9x speedup over the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
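
For reference, the arithmetic done by the `RMSNorm` module in the script above can be written out without torch. This is a sketch of that reference module only (it adds `eps` to the RMS itself), not of the fused kernel; `rms_norm` is an illustrative name:

```python
import math


def rms_norm(x, scale, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by `scale`."""
    # ||x||_2 / sqrt(n) is the same as sqrt(mean(x_i ** 2)).
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [s * v / (rms + eps) for s, v in zip(scale, x)]


y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
# With eps = 0 and unit scale, the output has a mean square of 1.
print(sum(v * v for v in y) / len(y))
```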

Here are the triton compile tests run on a 5090 (comparing this branch vs main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
1c8844d9e7 [MPS] Switch Cholesky decomp to column wise (#157014)
Everything should go through generalized kernels, and Metal kernels should work with the same sizes and strides as the CPU or CUDA backends, to avoid problems with `torch.compile`, which relies on the meta kernels to tell what its output is going to look like.

To avoid returning tensors with a different layout depending on whether the upper parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as a template argument and call the appropriate templates from the host.
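
As a point of reference for what a column-wise factorization order looks like, here is an unblocked, scalar sketch in plain Python. It is an illustration only: the Metal kernels in this PR are blocked (diagonal block, TRSM, SYRK), and `cholesky_columnwise` is a made-up name:

```python
import math


def cholesky_columnwise(a):
    """Return lower-triangular L with L @ L^T == a, filling L one column at a time."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry of column j.
        diag = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(diag)
        # Entries below the diagonal in the same column.
        for i in range(j + 1, n):
            off = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = off / l[j][j]
    return l


a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
print(cholesky_columnwise(a))
```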

TODOs:
 - Rename upper parameter to something more sensible and add comments
 - Use simd_groupsize instead of hardcoded 32 everywhere

Fixes https://github.com/pytorch/pytorch/issues/156658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
2025-07-01 18:00:59 +00:00
720c2c46b1 [Inductor UT][XPU] Reduce the runtime of the test case test_comprehensive_nn_functional_max_pool2d_xpu. (#157357)
This test case has over a thousand input samples, causing it to run for more than 30 minutes, which triggers the timeout mechanism and breaks the XPU CI. This PR limits the number of samples to one for this XPU case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157357
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2025-07-01 17:47:49 +00:00
3bc6bdc866 [BE] add type annotations and run mypy on setup.py (#156741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156741
Approved by: https://github.com/aorenste
2025-07-01 17:09:05 +00:00
47f10d0ad0 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-01 16:51:03 +00:00
0f9c1b374f [dynamo] Ensure global state guard is preserved across serialization. (#157285)
Currently, every time we construct a GLOBAL_STATE guard, we create a fresh guard based on the current global state. For precompile, we want to always create the GLOBAL_STATE guard from an external source, e.g. serialized global state. The same mechanism also applies to the normal case, where we just pass in the global state guard from Python.

Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
2025-07-01 15:46:34 +00:00
b146e1a264 [BE] remove duplicates in generated torch._VF.__all__ (#157365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157365
Approved by: https://github.com/Skylion007
2025-07-01 15:43:20 +00:00
c78fce9e79 [dynamo] show frame information when recompilation is triggered on fail_on_recompile (#156433)
Adds more information to the error message for debugging.

Example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:

TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```

Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
2025-07-01 15:15:58 +00:00
023887fc5a Revert "Switch to standard pep517 sdist generation (#152098)"
This reverts commit f16053f0c9a09fa337fbf85aaf64f88712b8dcdb.

Reverted https://github.com/pytorch/pytorch/pull/152098 on behalf of https://github.com/malfet due to IMO this PR needs to be split into few helper ones, with better test plan ([comment](https://github.com/pytorch/pytorch/pull/152098#issuecomment-3024223880))
2025-07-01 14:14:52 +00:00
1586521461 Revert "Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit 2c76f31221e117b217b8a6a96a5405f626d2218a.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/jeanschmidt due to Breaking 1000s of internal builds, it cant be properly landed internally, there are no options except revert and codev. ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3023503929))
2025-07-01 11:23:00 +00:00
534c454e77 Revert "[xla hash update] update the pinned xla hash (#156584)"
This reverts commit b1a54fab9bcb0cc167773f9a885d4170447e1c68.

Reverted https://github.com/pytorch/pytorch/pull/156584 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/155590 ([comment](https://github.com/pytorch/pytorch/pull/156584#issuecomment-3023492421))
2025-07-01 11:20:05 +00:00
13bf2655c1 Revert "HF loads dcp - don't do a full deserialize on every file (#155942)"
This reverts commit 117db5601d78cbc746b35eef71fc815e042e903f.

Reverted https://github.com/pytorch/pytorch/pull/155942 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red internally, more details on D76442012 ([comment](https://github.com/pytorch/pytorch/pull/155942#issuecomment-3023473036))
2025-07-01 11:15:08 +00:00
0bce390269 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 20e40492b046b9287726d3ec656117e4dc38f0e2.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on D77471538 ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3023455082))
2025-07-01 11:10:45 +00:00
a767e50adc remove allow-untyped-defs from torch/fx/experimental/migrate_gradual_types/util.py (#157236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157236
Approved by: https://github.com/ezyang
2025-07-01 10:36:48 +00:00
210632fae1 [ROCm] support experimental CU carveout (#149466)
Fixes #149280.  Follow up to #147966, but now available for ROCm.

Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream that has a CU mask applied.  We pass this masked stream to hipblaslt instead of pytorch's current stream.  We ensure stream ordering between streams using hipEvents and stream synchronization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
2025-07-01 08:54:52 +00:00
0596323c35 Better fix for __index__ SymInt issue (#157201)
This improves on #156928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157201
Approved by: https://github.com/ezyang
2025-07-01 07:06:46 +00:00
c202a7329a Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit 23491519d288dedb2a54cfad5fef7fcb2ad8eade.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/XuehaiPan due to sys.get_int_max_str_digits is not always available ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3021990027))
2025-07-01 06:16:49 +00:00
754699610b [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang
2025-07-01 06:07:29 +00:00
8f0998aafe Check F2C BLAS for OpenBLAS and other vendors (#143846)
This issue came from https://github.com/conda-forge/pytorch-cpu-feedstock/issues/180. MKL follows the F2C convention for returning single precision floats as doubles and uses the G77 convention for returning complex valued scalars. OpenBLAS does the opposite. A check for this already exists, but it runs only when the Generic BLAS vendor code path is used. This PR moves that code to `Dependencies.cmake` so that it also runs when the BLAS vendor is OpenBLAS or another vendor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143846
Approved by: https://github.com/rgommers, https://github.com/atalman
2025-07-01 05:56:24 +00:00
04bd7e6850 [ROCm] Remove use of warpsize on host-side compilation (#156979)
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-01 04:55:31 +00:00
c811f41cf5 [BE] Remove unused variable from Pooling.metal (#157332)
Fixes the following compilation warning:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:101:21: warning: unused variable 'indices_sizes' [-Wunused-variable]
  constant int64_t* indices_sizes = params.indices_sizes.data();
                    ^

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157332
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/dcci
2025-07-01 04:28:04 +00:00
4d5d627e5f Remove super spammy log (#157157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157157
Approved by: https://github.com/davidberard98
2025-07-01 03:51:58 +00:00
b40981c630 Fix incorrect stride handling in adaptive_avg_pool3d (#157326)
Fixes #157248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157326
Approved by: https://github.com/eqy
ghstack dependencies: #157242
2025-07-01 03:03:48 +00:00
b5ce77c1f5 [ROCm] Initial AITER Integration for mha_bwd asm kernels (#152630)
Generates AITER plumbing via cmake. Calls into fav3 asm bwd CK kernels.

Update submodule composable kernel for this change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152630
Approved by: https://github.com/xw285cornell, https://github.com/yoyoyocmu
2025-07-01 02:53:27 +00:00
f40efde2a4 [CI] Add prebuild command option, set prebuild command option for CI to build flash attention (#156236)
Build flash attention separately, as a prebuild step using 2 jobs because it OOMs with more; the rest of the build then uses 6 jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156236
Approved by: https://github.com/malfet
2025-07-01 02:53:22 +00:00
3ed4384f5b [dynamo] temporarily disabling generation of weblinks for torch v2.8 release (#157299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157299
Approved by: https://github.com/williamwen42
2025-07-01 02:31:17 +00:00
c174f3a6a5 [ONNX] Delete deprecated tutorial page link (#157310)
Related to https://github.com/pytorch/tutorials/issues/3420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157310
Approved by: https://github.com/justinchuby
2025-07-01 01:18:26 +00:00
6dc2b22269 [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce (#156746)
The biggest bottleneck that we found with two-shot allreduce was that the compiler was serializing all the load operations for some reason. To avoid these load delays, we've added de-serialization of loads. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:
![image](https://github.com/user-attachments/assets/57005856-4cb5-43cd-8e9c-46869f75ab0b)

The green rows are the tensor sizes that we are interested in, because two-shot is only used for bigger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers were consistently underperforming relative to the fbgemm numbers. However, with this deserialize change, most of the green tensor sizes show a performance boost (positive %). There's one tensor with negative performance, but that's within the error margin.

co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-07-01 00:37:30 +00:00
f860992db5 Add a custom profiler configuration option (#151656)
We aim to pass some configuration options to our custom Kineto backend via ExperimentalConfig, so we added a `custom_profiler_config` parameter.

Requires https://github.com/pytorch/kineto/pull/1077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151656
Approved by: https://github.com/sraikund16
2025-07-01 00:36:09 +00:00
b60569ed94 HF - consolidate shards of safetensors files to full tensors in finish step (#156705)
As the title says, we can consolidate the shards into full tensors, optionally behind a flag, in the finish step of DCP.save.
This also adds a user-configurable thread count argument; previously we just used the default of 1.
Re-creates https://github.com/pytorch/pytorch/pull/155940 because it got into a bad detached state.

Differential Revision: [D77231774](https://our.internmc.facebook.com/intern/diff/D77231774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156705
Approved by: https://github.com/saumishr
ghstack dependencies: #154743
2025-07-01 00:30:48 +00:00
4ebd269065 [Testing] Remove duplicate MPSInductor tests (#157328)
They were added before test_torchinductor was running in CI, but
the same cases are now covered by `GPUTests.test_pointwise_*_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157328
Approved by: https://github.com/huydhn
2025-07-01 00:21:22 +00:00
7709ff5512 [remove untyped defs] batch 1 (#157011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157011
Approved by: https://github.com/Skylion007
2025-06-30 23:54:40 +00:00
fee2377f9e Reapply D77381084 / #156964: Rename torch::standalone to headeronly (#157251)
Was reverted due to internal failure which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week.

Original summary:

headeronly is more clear, let's change the name before anyone depends on standalone

Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire
2025-06-30 23:25:30 +00:00
3dda80e990 Overload mul_overflows for size_t (#155736)
Partially fixes https://github.com/pytorch/executorch/pull/11537.

We want to extend `mul_overflows` to support `size_t` in ExecuTorch. The current workflow in ET checks that its `c10` copy exactly mirrors the one in PT, so the tests are failing.
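
For readers unfamiliar with this kind of helper, the check can be illustrated in Python. This is only a sketch of the general technique (deciding overflow with a division instead of a wider multiply); it is not the `c10` implementation, and the 64-bit width of `size_t` and the tuple return shape are assumptions made for the example.

```python
SIZE_BITS = 64                    # assumption: size_t is 64 bits wide
SIZE_MAX = (1 << SIZE_BITS) - 1   # largest value the unsigned type can hold


def mul_overflows(a: int, b: int) -> tuple[bool, int]:
    """Return (overflowed, wrapped_product) for unsigned operands.

    Hypothetical Python stand-in for an unsigned overflow check; the
    wrapped product mimics what fixed-width unsigned arithmetic yields.
    """
    if not (0 <= a <= SIZE_MAX and 0 <= b <= SIZE_MAX):
        raise ValueError("operands must fit in the unsigned type")
    # a * b > SIZE_MAX  <=>  a != 0 and b > SIZE_MAX // a
    overflowed = a != 0 and b > SIZE_MAX // a
    wrapped = (a * b) & SIZE_MAX  # keep only the low SIZE_BITS bits
    return overflowed, wrapped
```

The division form matters in fixed-width languages, where the product itself cannot be inspected after it has already wrapped.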

See comment: https://github.com/pytorch/executorch/pull/11537#issuecomment-2963821312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155736
Approved by: https://github.com/swolchok
2025-06-30 22:57:28 +00:00
42b48ee672 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-06-30 22:32:52 +00:00
a9352bd25e Script for consolidation of sharded safetensor files (#154743)
Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.
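
The byte-copy idea can be sketched as follows. This is a simplified illustration, not the script from this PR: it assumes each shard maps to one contiguous byte range in the output, and the names `copy_range` and `consolidate` are hypothetical. The real safetensors layout (header plus per-tensor offsets) and non-contiguous shards need more bookkeeping.

```python
import io


def copy_range(src, dst, dst_offset, length, chunk_size=1 << 20):
    """Copy `length` bytes from `src` into `dst` at `dst_offset`.

    Data moves in bounded chunks, so peak memory is about `chunk_size`
    regardless of how large the shard is.
    """
    dst.seek(dst_offset)
    remaining = length
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("shard ended before the expected number of bytes")
        dst.write(chunk)
        remaining -= len(chunk)


def consolidate(shards, out, chunk_size=1 << 20):
    """`shards` is an iterable of (binary stream, output offset, byte length)."""
    for src, dst_offset, length in shards:
        copy_range(src, out, dst_offset, length, chunk_size)


# Usage with in-memory streams; real use would pass files opened in binary mode.
out = io.BytesIO()
consolidate(
    [(io.BytesIO(b"world"), 5, 5), (io.BytesIO(b"hello"), 0, 5)],
    out,
    chunk_size=2,
)
```

Because every shard is written at its own offset, shards can be processed in any order, which is also what makes a per-shard thread pool possible.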

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154743
Approved by: https://github.com/saumishr
2025-06-30 22:25:58 +00:00
f096820d0f [precompile] Detect source code changes for save/load. (#156432)
Go through all dynamo traced functions and compute a checksum for each of them. When loading a precompilation back into memory, we always check the checksums and refuse to load when source code changes are detected.
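
A minimal sketch of this save/load check, assuming the source text of each traced function is available as a string. The helper names (`snapshot`, `changed_functions`, `load_precompiled`) and the choice of SHA-256 are illustrative assumptions, not the API or hash used by this PR.

```python
import hashlib


def source_checksum(source: str) -> str:
    """Checksum of one function's source text (SHA-256 is an assumption)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def snapshot(sources: dict[str, str]) -> dict[str, str]:
    """Record a checksum per traced function at save time."""
    return {name: source_checksum(src) for name, src in sources.items()}


def changed_functions(saved: dict[str, str], current: dict[str, str]) -> list[str]:
    """Names whose current source no longer matches the saved checksum."""
    now = snapshot(current)
    return sorted(name for name, digest in saved.items() if now.get(name) != digest)


def load_precompiled(saved: dict[str, str], current: dict[str, str], artifact):
    """Return the artifact only if no traced function changed since saving."""
    changed = changed_functions(saved, current)
    if changed:
        raise RuntimeError(f"refusing to load: source changed for {changed}")
    return artifact
```

A function that was traced at save time but is missing at load time is treated as changed here, which errs on the side of refusing to load.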

Differential Revision: [D76987123](https://our.internmc.facebook.com/intern/diff/D76987123/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156432
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-06-30 21:16:15 +00:00
d3efd73234 Revert "[cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)"
This reverts commit 84c588e5eada9e7921608065edc444a15c22cb1c.

Reverted https://github.com/pytorch/pytorch/pull/156656 on behalf of https://github.com/henrylhtsang due to breaking fbcode A100 tests ([comment](https://github.com/pytorch/pytorch/pull/156656#issuecomment-3020769914))
2025-06-30 21:16:04 +00:00
3684be056d [dynamo] Fix source for lru_cache method (#157292)
Fixes - https://github.com/pytorch/pytorch/issues/157273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157292
Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/jansel
2025-06-30 20:53:57 +00:00
23491519d2 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-30 19:42:11 +00:00
f16053f0c9 Switch to standard pep517 sdist generation (#152098)
Generate the source tarball with PEP 517-conformant build tools instead of the custom routine currently in place.

Closes #150461.

The current procedure for generating the source tarball consists of creating a source tree by manually copying and pruning source files.

This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.

The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.

## Issues

### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels is required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.

### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.

PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.
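
The two conditions as stated above can be checked mechanically. The following sketch classifies a symlink target using only the rules described in this message; `symlink_problem` is a hypothetical helper and not part of any packaging tool.

```python
from pathlib import PurePosixPath
from typing import Optional


def symlink_problem(target: str) -> Optional[str]:
    """Return a short reason if `target` is problematic for an sdist, else None.

    Encodes only the rules described above: no absolute targets (they
    would point outside the destination directory) and no `..` components.
    """
    path = PurePosixPath(target)
    if path.is_absolute():
        return "absolute target"
    if ".." in path.parts:
        return "`..` in target"
    return None


# A few targets taken from the table in this message.
for target in [".gitignore", "./pre-commit", "../.ci/docker/requirements-docs.txt"]:
    print(f"{target}: {symlink_problem(target) or 'ok'}")
```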

The list of symbolic links currently is as follows:

<details>

|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` |  ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` |  ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` | ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` |  ok (individual file)||

</details>

The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.

[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.

[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.

[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.

[^4]: We don't need to ship the rust bindings for ittapi.

[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.

### Nccl
Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download.
However, it *is* possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
2025-06-30 19:07:34 +00:00
c7b6c98d10 [tp] improve parallelize_module API to support more cases (#157182)
This PR improves the parallelize_module API to support more corner cases:
1. if the plan entry is specified as "", the style is applied to the current module
2. if the plan entry has no corresponding submodule to apply to, a warning is raised and the plan entry is ignored

While working on this PR, I also found that the inner while-loop is actually not necessary and could produce nasty modify-while-iterating behavior, so I removed it.
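
The two rules can be illustrated with a toy module tree. This is a sketch of the described behavior only: the `Module` class, `resolve`, and `apply_plan` are hypothetical stand-ins, it uses plain dotted paths, and it does not reproduce the real `parallelize_module` implementation.

```python
import warnings


class Module:
    """Toy stand-in for a module tree; `style` records what was applied."""

    def __init__(self, **children):
        self.children = dict(children)
        self.style = None


def resolve(root, path):
    """Follow a dotted path such as 'attn.wq'; return None if it is missing."""
    module = root
    for name in path.split("."):
        module = module.children.get(name)
        if module is None:
            return None
    return module


def apply_plan(root, plan):
    # Single pass over the plan: nothing is mutated while iterating.
    for path, style in plan.items():
        target = root if path == "" else resolve(root, path)  # rule 1
        if target is None:  # rule 2: warn and ignore the entry
            warnings.warn(f"plan entry {path!r} matches no submodule; ignoring it")
            continue
        target.style = style
    return root
```

For example, applying `{"": "replicate", "attn.wq": "colwise", "missing.layer": "rowwise"}` to a tree that has `attn.wq` but no `missing` child styles the root and `attn.wq`, and emits exactly one warning.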

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157182
Approved by: https://github.com/tianyu-l
2025-06-30 18:10:44 +00:00
d5e6f42094 Revert "Use std::string_view in torchgen (#157050)"
This reverts commit 064288cbab94c9931ca2296a2b9723e864f9050a.

Reverted https://github.com/pytorch/pytorch/pull/157050 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, more details on D77449943. @ezyang may I count on your help to get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/157050#issuecomment-3020222668))
2025-06-30 18:08:54 +00:00
efbf07e7ea Revert "[dynamo] Fix issue with tensors passed as view() shapes (#156928)"
This reverts commit 75f3e5a88df60caef27fd9c9df3fd51161378fcc.

Reverted https://github.com/pytorch/pytorch/pull/156928 on behalf of https://github.com/jeanschmidt due to Breaks a internal test, more details can be found on D77449971 ([comment](https://github.com/pytorch/pytorch/pull/156928#issuecomment-3020186268))
2025-06-30 17:56:01 +00:00
5e18bc3331 [PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
The PyTorch build has been failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0.

***Build Failure Logs***

**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
  302 |     if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
  303 |         (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~

```

**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
   19 | class Vectorized<ComplexFlt> {
      |       ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’

aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  633 |   auto abs_a = a.abs_2_();
      |                  ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  634 |   auto abs_b = b.abs_2_();
      |                  ^~~~~~

/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  666 |       vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  673 |       vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
      |                 ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  680 |       vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```

***Build logs with these changes***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
  -- USE_TENSORPIPE=True
  -- USE_GLOO=True
  -- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc

```

This patch fixes the PyTorch build issue on Power; with it, I am able to build successfully.

Hi @malfet  @albanD

Please review this PR for the PyTorch build issue that we are observing on Power.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-30 17:54:37 +00:00
2815eea0d0 [dtensor] relax device_mesh argument constraint in local_map (#157049)
This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict: all input arguments that are DTensors must have the same device mesh. But users often want to pass in DTensors that live on different device meshes, e.g. weights and activations could live on different device meshes.

When using local_map, we extract the local tensors from the DTensors, and as long as the placements the user specified match the actual DTensor placements, the user clearly knows that the inputs are intended to live on different meshes. So this PR removes the same-mesh check and updates the doc to clearly document the behavior.

The `device_mesh` argument now serves one main purpose: allowing the user to specify the device_mesh for the output DTensor reconstruction.
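
A toy model of the relaxed check, for illustration only. It does not use the real `local_map` or DTensor APIs; `ToyDTensor` and `toy_local_map` are hypothetical stand-ins, and meshes and placements are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ToyDTensor:
    mesh: str                    # stand-in for a device mesh
    placements: Tuple[str, ...]  # stand-in for placements, e.g. ("Shard(0)",)
    local: list                  # stand-in for the local shard


def toy_local_map(
    func: Callable,
    out_mesh: str,
    out_placements: Tuple[str, ...],
    in_placements: Sequence[Tuple[str, ...]],
) -> Callable:
    def wrapped(*args: ToyDTensor) -> ToyDTensor:
        local_args = []
        for arg, expected in zip(args, in_placements):
            # Only the placements are validated. There is deliberately no
            # check that all inputs share one mesh.
            if arg.placements != expected:
                raise ValueError(f"expected {expected}, got {arg.placements}")
            local_args.append(arg.local)
        # The mesh passed by the caller is used only to rebuild the output.
        return ToyDTensor(out_mesh, out_placements, func(*local_args))

    return wrapped


weight = ToyDTensor("tp_mesh", ("Shard(0)",), [1, 2])
activation = ToyDTensor("dp_mesh", ("Replicate()",), [10, 20])
add = toy_local_map(
    lambda w, a: [x + y for x, y in zip(w, a)],
    out_mesh="tp_mesh",
    out_placements=("Shard(0)",),
    in_placements=[("Shard(0)",), ("Replicate()",)],
)
result = add(weight, activation)
```

Here the two inputs live on different toy meshes, and the output is rebuilt on the mesh given by the caller.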

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157049
Approved by: https://github.com/Chillee, https://github.com/zpcore
2025-06-30 17:51:48 +00:00
f8cc4c0af8 [inductor] Update triton_key import to support latest Triton (#157242)
With Triton main things were failing with:
```py
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
    from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```
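
A general pattern for tolerating a symbol that moves between releases is to try several import locations in order. The sketch below is not the change made in this PR; `import_first` is a hypothetical helper, and the new location of `triton_key` is not stated in this message, so it is left as a placeholder in the comment.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it."""
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{path}: {exc}")
    raise ImportError(f"cannot import {name!r}; tried: " + "; ".join(errors))


# Hypothetical usage; only the second path appears in the traceback above:
# triton_key = import_first(
#     "triton_key",
#     ["<new location in Triton main>", "triton.compiler.compiler"],
# )
```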

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
2025-06-30 17:51:43 +00:00
117db5601d HF loads dcp - don't do a full deserialize on every file (#155942)
Differential Revision: [D76442012](https://our.internmc.facebook.com/intern/diff/D76442012/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155942
Approved by: https://github.com/saumishr
ghstack dependencies: #155707
2025-06-30 17:45:10 +00:00
ed5d6d2a20 python definitely_contiguous-> is_contiguous_or_false (#156515)
We can probably avoid having those in Python as well and just depend on the C++ impl after we land https://github.com/pytorch/pytorch/pull/155590, but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-30 17:31:51 +00:00
c038719731 Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 347ace4c7ac2dbb14799089c30bd01a9ac312791.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail on ROCm ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3020006655))
2025-06-30 16:58:54 +00:00
b54eac2a5e Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
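
The version checks listed above can be sketched in plain Python. This is an illustration of the decision logic only, not the PyTorch implementation; the helper names `pick_capsule_name` and `can_consume` are hypothetical, while the capsule names follow the DLPack convention.

```python
from typing import Optional, Tuple

LEGACY_CAPSULE = "dltensor"               # holds a DLManagedTensor
VERSIONED_CAPSULE = "dltensor_versioned"  # holds a DLManagedTensorVersioned


def pick_capsule_name(max_version: Optional[Tuple[int, int]]) -> str:
    """Choose which capsule kind a producer should export."""
    # Fall back to the old format if the consumer gave no max_version
    # or asked for something older than 1.0.
    if max_version is None or max_version[0] < 1:
        return LEGACY_CAPSULE
    return VERSIONED_CAPSULE


def can_consume(capsule_version: Tuple[int, int], supported_major: int = 1) -> bool:
    """Accept capsules up to version 1.X; the minor version is not checked."""
    return capsule_version[0] <= supported_major
```

For example, `pick_capsule_name(None)` selects the legacy capsule, while `pick_capsule_name((1, 0))` selects the versioned one.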

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-30 16:58:06 +00:00
39b71d11fc [Inductor] add pedantic to limit inductor code follow standard. (#156914)
### Background:

During my development work, I found that Windows MSVC doesn't support compiling zero-size arrays; for reference: https://github.com/pytorch/pytorch/issues/153180

As discussed with an MSFT engineer, zero-size arrays don't conform to the C++ standard, though gcc/clang support them. When we add the `-pedantic` option to gcc, it checks the code strictly against the C++ standard and reports violations. Reference: https://github.com/pytorch/pytorch/issues/153180#issuecomment-2986676878

So this PR adds `-pedantic` to the torch inductor build option list to constrain codegen to generate standard-conforming C++ code.
Additionally, it fixes a Halide zero-size array code.
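
One way a code generator can avoid emitting a zero-size array is to special-case the empty input. The sketch below is an assumption for illustration, not the fix applied in this PR; `declare_array` is a hypothetical helper, and replacing the empty array with a null pointer is only one possible choice.

```python
from typing import Sequence


def declare_array(ctype: str, name: str, values: Sequence[str]) -> str:
    """Return a C++ declaration string, never emitting `T name[0]`."""
    if not values:
        # ISO C++ does not allow zero-size arrays, so emit a null pointer
        # instead (assumed replacement, chosen for illustration).
        return f"{ctype}* {name} = nullptr;"
    body = ", ".join(values)
    return f"{ctype} {name}[{len(values)}] = {{{body}}};"


print(declare_array("int64_t", "sizes", ["1", "2"]))
print(declare_array("int64_t", "sizes", []))
```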

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156914
Approved by: https://github.com/jansel
2025-06-30 16:29:08 +00:00
e3afbb0362 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-30 15:56:35 +00:00
eqy
3b4b5f8d47 [SDPA] Fix alloc_with_matching_layout stride sorting (#157145)
Otherwise dims with "zero" stride get moved before contiguous dims (stride 1).

Need to move the fix from #149282 to here as #154340 moved the original definition from `MHA.cpp`.
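
The ordering problem described above can be reproduced with plain lists. This sketch only illustrates the issue; `zero_last_order` is a hypothetical remedy and is not necessarily what the fix does.

```python
def naive_order(strides):
    """Order dims by ascending stride (innermost first)."""
    return sorted(range(len(strides)), key=lambda d: strides[d])


def zero_last_order(strides):
    """Same ordering, but dims with stride 0 (broadcast) are placed last.

    Assumption for illustration: a zero stride is treated as larger than
    any real stride.
    """
    return sorted(range(len(strides)), key=lambda d: (strides[d] == 0, strides[d]))


strides = [0, 128, 16, 1]  # dim 0 is an expanded (broadcast) dim
print(naive_order(strides))      # dim 0 lands before the stride-1 dim 3
print(zero_last_order(strides))  # the stride-1 dim 3 comes first
```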

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157145
Approved by: https://github.com/Skylion007
2025-06-30 15:43:29 +00:00
da1f337bc4 Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit fab53dfdf1d89cecd5e82b12cced9b6dd217e87c.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/guilhermeleobas due to failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3019457531))
2025-06-30 14:49:44 +00:00
fab53dfdf1 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-30 14:15:47 +00:00
ffaed8c569 Update slow tests (#155448)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155448
Approved by: https://github.com/pytorchbot
2025-06-30 12:08:52 +00:00
b1a54fab9b [xla hash update] update the pinned xla hash (#156584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156584
Approved by: https://github.com/pytorchbot
2025-06-30 11:23:06 +00:00
ccb67f39b4 Enable the AMP precision with freezing for CPU nightly test (#152298)
Hi, @desertfire. Since we recommend users to use AMP precision and run with `--freezing` for CPU x86 Inductor inference, we suggest adding the AMP freezing test to the CPU nightly tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152298
Approved by: https://github.com/desertfire, https://github.com/huydhn

Co-authored-by: zengxian <xiangdong.zeng@intel.com>
2025-06-30 09:17:17 +00:00
f79689bd3d updated matplotlib version in docs requirements (#155931)
Fixes #155199

The issue on main is due to an outdated version of matplotlib. I have bumped the version so that it is compatible with NumPy 2.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155931
Approved by: https://github.com/malfet
2025-06-30 02:05:53 +00:00
a1282b1823 [MPS] Add boilerplate sparse code support (#157238)
This PR makes minimal changes to support sparse tensors on MPS. In follow-up PRs I'll gradually add different operations so we can fix the issue
https://github.com/pytorch/pytorch/issues/129842
which is highly requested (I assume because of Whisper using sparse tensors).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157238
Approved by: https://github.com/malfet
2025-06-30 01:53:45 +00:00
771be85704 [AOTI] Print out error msg when nvcc compiler fails (#157203)
Summary: To debug https://github.com/pytorch/pytorch/issues/156930. Not able to reproduce the problem locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157203
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-06-30 01:30:55 +00:00
86ced14453 increment pending_callbacks_counter before initiating the pt2 compile callbacks (#157185)
Summary: Since we increment the counter after performing the callback, an assertion error occurs when the callback raises an error and the increment never happens. Let's increment first to avoid it.
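
A minimal sketch of the ordering issue, assuming the counter is decremented and checked in a `finally` block. `CallbackCounter` is a hypothetical class for illustration and is not the code touched by this change.

```python
class CallbackCounter:
    def __init__(self):
        self.pending = 0

    def run(self, callback):
        try:
            # Increment before invoking the callback, so the decrement in
            # `finally` always has a matching increment even if it raises.
            self.pending += 1
            callback()
        finally:
            assert self.pending > 0, "counter underflow"
            self.pending -= 1


def failing_callback():
    raise RuntimeError("callback failed")


counter = CallbackCounter()
try:
    counter.run(failing_callback)
except RuntimeError as exc:
    print("original error preserved:", exc)
print("pending after failure:", counter.pending)
```

If the increment came after `callback()`, a raising callback would reach the `finally` block with `pending == 0`, and the assertion would fire instead of the original error propagating.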

Test Plan:
tba

Rollback Plan:

Differential Revision: D77475650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
2025-06-30 01:23:59 +00:00
12cb06e574 [inductor] Increase tolerance for test_comprehensive_nn_functional_linear_cuda_float16 (#156962)
Fixes #156514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156962
Approved by: https://github.com/jamesjwu
2025-06-30 00:54:20 +00:00
cyy
c27f83dd91 Remove old ASAN Docker images (#157197)
The old ASAN jobs have been replaced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157197
Approved by: https://github.com/Skylion007
2025-06-30 00:30:56 +00:00
11f7e2f145 [caffe][executorch] rename to avoid shadow in irange (#157107)
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a shadow, which is treated as an error for certain embedded compile flows.

Simple rename to avoid.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77446104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
2025-06-30 00:17:09 +00:00
018e9826a2 [nativert] hook up memory planning to execution frame (#157053)
Summary: Pretty simple: if a planner exists, which implies that planning is enabled, create a manager for each frame. The associated serial executor will use the withMemoryPlannner fn to ensure the deallocation is done after execution completes.

Test Plan: CI

Differential Revision: D73635809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
2025-06-30 00:06:37 +00:00
41f6acef83 Update pr_time_benchmarks expected results (#157214)
The job has been unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157214
Approved by: https://github.com/laithsakka
2025-06-29 19:12:13 +00:00
29f76ec0f3 Revert "[BE] use pathlib.Path instead of os.path.* in setup.py (#156742)"
This reverts commit 2380115f9738f97cf706affefd647d2cb6dfbb3f.

Reverted https://github.com/pytorch/pytorch/pull/156742 on behalf of https://github.com/malfet due to Looks like it broke all ROCM tests, see 721d2580db/1 ([comment](https://github.com/pytorch/pytorch/pull/156742#issuecomment-3016937704))
2025-06-29 18:10:03 +00:00
721d2580db [dynamo][callbacks] temporarily disable TRITON_AUTOTUNING (#157186)
Differential Revision: D77476551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157186
Approved by: https://github.com/burak-turk
2025-06-29 17:20:55 +00:00
aec569da23 [Triton] [Inductor] Add tt.descriptor_store to get_tma_stores (#157212)
Summary: Fixes a gap in the Triton update where the traverse would break because `get_tma_stores` didn't handle both TMA APIs.

Test Plan:
`buck test -m ovr_config//triton:beta  'fbcode//mode/dev-nosan' fbcode//ads_mkl/ops/tests:gdpa_dcpp_test -- --exact 'ads_mkl/ops/tests:gdpa_dcpp_test - test_gdpa_dcpp (ads_mkl.ops.tests.gdpa_dcpp_test.GdpaDCPPTest)'`

Rollback Plan:

Differential Revision: D77501582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157212
Approved by: https://github.com/davidberard98
2025-06-29 16:44:52 +00:00
b147b6c0e3 Increase tolerance for test_corrcoef_cuda_int32 (#157206)
Fixes #156988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157206
Approved by: https://github.com/Skylion007
2025-06-29 16:30:54 +00:00
e959dd017d [TSAN][live speech translation] Fix A data race in caffe2 (#156378)
Summary: Noticed that the context's quantized_engine is read and written from multiple threads.

Test Plan:
➜  fbsource buck test --flagfile fbcode/mode/dev-tsan //xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests -- --exact 'fbsource//xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests - Translate/LiveSpeechTranslationTests.LiveSpeechTranslationEnFr/silence___fr_en'

Rollback Plan:

Differential Revision: D76921416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156378
Approved by: https://github.com/jerryzh168, https://github.com/cyyever
2025-06-29 07:23:20 +00:00
9d677389cb [async compile] make it more obvious that we support backwards (#157204)
Currently failing with:

```
(/home/bobren/local/a/pytorch-env) [13:02] devgpu009:/home/bobren/local/a/pytorch python test/inductor/test_compile_subprocess.py -k GPUTests.test_async
/home/bobren/local/a/pytorch/torch/backends/cudnn/__init__.py:115: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Unable to pickle input graph or example inputs
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Traceback (most recent call last):
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 484, in serialize_compile
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     ).serialize()
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 210, in serialize
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     return _WireProtocolPickledInput(GraphPickler.dumps(self))
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/fx/_graph_pickler.py", line 124, in dumps
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     pickler.dump(obj)
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] AttributeError: Can't pickle local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'
```
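
The `Can't pickle local object` failure in the log above is standard `pickle` behavior for classes defined inside a function. A small stand-alone reproduction, with hypothetical names (`make_local_fn`, `LocalFn`, `is_picklable`) that are not taken from the PyTorch code:

```python
import pickle


def make_local_fn():
    # A class defined inside a function has a qualified name containing
    # "<locals>", so pickle cannot look it up again by name.
    class LocalFn:
        def __call__(self, x):
            return x & 1

    return LocalFn()


def is_picklable(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # The exact exception type differs between Python versions.
        return False


print(is_picklable(make_local_fn()))  # instance of a local class
print(is_picklable(len))              # importable by name
```

Defining the class at module scope, or giving it a custom reduction, are the usual ways to make such objects picklable.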

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157204
Approved by: https://github.com/aorenste
2025-06-29 05:38:54 +00:00
347ace4c7a Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-29 05:00:47 +00:00
f8293116f5 [BE][13/16] fix typos in torch/ (torch/ao/) (#156603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156603
Approved by: https://github.com/msaroufim
2025-06-29 04:34:04 +00:00
1913c915e0 Fixes issue #156414: Fixes bug in implementation of _combine_histograms. (#156457)
Fixes #156414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156457
Approved by: https://github.com/jerryzh168
2025-06-29 04:30:28 +00:00
2796f31b5e [DCP] OSS Zero Overhead Checkpointing Implementation (#156207)
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287

Differential Revision: D72391401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207
Approved by: https://github.com/teja-rao
2025-06-29 03:19:48 +00:00
bccb8473fe [ROCm] Allow use of rocSOLVER for Cholesky inversion. (#157154)
Fixes https://github.com/pytorch/pytorch/issues/155046

This change allows Cholesky inversion to use rocSOLVER. This is now also the default on ROCm for Cholesky inversion which aligns with the behavior on NVIDIA (which defaults to cuSOLVER for this linear algebra operation). This fix also gets around a memory access fault encountered in MAGMA for large matrices.

MAGMA can still be forced on ROCm by doing:
```
torch.backends.cuda.preferred_linalg_library(backend='magma')
```

Ran all Cholesky UT on ROCm and there were no regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157154
Approved by: https://github.com/jeffdaily
2025-06-29 01:53:02 +00:00
6cc490d40b simplify max(1,x) to x when x known >=1 (#157189)
Creating contiguous strides produces an expression of the form max(1, x). Often we know that x >= 1, in
 which case we should simplify max(1, x) to x.

This came up in two situations:
1) An internal user reported that statically_known_true(x == max(1, x)) was failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290).
With this change, https://github.com/pytorch/pytorch/pull/155938 is no longer needed.

2) Not simplifying the expression could result in spurious ConstraintViolationErrors,
because we assume that non-trivial single-argument guards evaporate; see the logic in the function
issue_guard in symbolic_shapes.py.

With this change we no longer throw ConstraintViolationErrors for the program below.
These errors are blocking this [PR](https://github.com/pytorch/pytorch/pull/155590) from landing
internally, because internal export tests throw ConstraintViolationErrors
such as:
```
Constraints violated (width)!
  - Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
```

```
x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)

@torch.compile(fullgraph=True, dynamic=True)
def func(x):
    if max(1, (-1 + x.size()[0]//2)) == (-1+x.size()[0]//2):
        return x*400
    else:
        return (x*10)*100

func(x)

```
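
As a rough illustration of why the simplification is safe, the following plain-Python sketch (not the symbolic-shapes implementation; both function names are hypothetical) computes row-major strides with and without the `max(1, size)` clamp, and checks the guard from the error message above over the stated range 224 <= width <= 455.

```python
def contiguous_strides(sizes):
    """Row-major strides, clamping each size to at least 1 via max(1, size)."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= max(1, size)
    return list(reversed(strides))


def contiguous_strides_simplified(sizes):
    """Same computation with max(1, size) replaced by size.

    Only valid when every size is known to be >= 1.
    """
    assert all(size >= 1 for size in sizes)
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= size
    return list(reversed(strides))


# When all sizes are >= 1 the clamp is a no-op, so both versions agree.
same = contiguous_strides([2, 3, 4]) == contiguous_strides_simplified([2, 3, 4])

# The guard from the error message holds for every width in the declared range.
guard_holds = all(
    max(1, 1 + (w - 1) // 2) == 1 + (w - 1) // 2 for w in range(224, 456)
)
```

The clamp only matters when a dimension can be 0, which is exactly the case that a known lower bound of 1 rules out.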

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157189
Approved by: https://github.com/pianpwk
2025-06-29 01:16:30 +00:00
836bb1941b [hop] support torch.func.functional_call in hop subgraph (#155886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155886
Approved by: https://github.com/zou3519
2025-06-28 23:47:46 +00:00
2380115f97 [BE] use pathlib.Path instead of os.path.* in setup.py (#156742)
Resolves:

- https://github.com/pytorch/pytorch/pull/155998#discussion_r2164376634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156742
Approved by: https://github.com/malfet
2025-06-28 23:31:15 +00:00
90b973a2e2 [BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)
`cmake -E capabilities` produces a JSON format that is more machine-friendly.

```console
$ cmake --version
cmake version 4.0.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ cmake -E capabilities | jq '.version.string'
"4.0.3"
$ cmake -E capabilities | jq
{
  "debugger": true,
  "fileApi": {
    "requests": [
      {
        "kind": "codemodel",
        "version": [
          {
            "major": 2,
            "minor": 8
          }
        ]
      },
      {
        "kind": "configureLog",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cache",
        "version": [
          {
            "major": 2,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cmakeFiles",
        "version": [
          {
            "major": 1,
            "minor": 1
          }
        ]
      },
      {
        "kind": "toolchains",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      }
    ]
  },
  "generators": [
    {
      "extraGenerators": [],
      "name": "Watcom WMake",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "Kate"
      ],
      "name": "Ninja Multi-Config",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Ninja",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [],
      "name": "Xcode",
      "platformSupport": false,
      "toolsetSupport": true
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Unix Makefiles",
      "platformSupport": false,
      "toolsetSupport": false
    }
  ],
  "serverMode": false,
  "tls": true,
  "version": {
    "isDirty": false,
    "major": 4,
    "minor": 0,
    "patch": 3,
    "string": "4.0.3",
    "suffix": ""
  }
}
```
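
A minimal sketch of how the JSON above could be consumed from Python. The helper name `parse_cmake_version` is hypothetical and is not taken from setup.py; the sample string only reproduces the `version` object shown above.

```python
import json


def parse_cmake_version(capabilities_json: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from `cmake -E capabilities` output."""
    version = json.loads(capabilities_json)["version"]
    return (version["major"], version["minor"], version["patch"])


# In practice the text would come from something like:
#   subprocess.check_output(["cmake", "-E", "capabilities"], text=True)
sample = (
    '{"version": {"isDirty": false, "major": 4, "minor": 0, '
    '"patch": 3, "string": "4.0.3", "suffix": ""}}'
)
cmake_version = parse_cmake_version(sample)
```

Reading the numeric fields avoids having to pattern-match the human-readable `cmake --version` banner.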

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157073
Approved by: https://github.com/Skylion007
2025-06-28 23:20:10 +00:00
772d590415 [CUTLASS] [CUDA] SM100 GroupMM (#156203)
Closes https://github.com/pytorch/pytorch/issues/156202

This PR adds Blackwell support for GroupMM.

Most of the code that is used for SM90 can be reused; the kernel schedule has to be changed in accordance with https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html

Some preliminary benchmarking of H200 vs. B200:

Script
```py
import torch
print(torch.__file__)
device = torch.device("cuda")
dtype = torch.bfloat16

shapes = [
    (16, 128000, 7168, 7168),
    (128, 1, 2048, 7168)
]

for batch, M, N, K in shapes:
    a = torch.randn(batch, M, K, device=device, dtype=dtype)
    b = torch.randn(batch, N, K, device=device, dtype=dtype)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for i in range(5): c = torch._grouped_mm(a, b)

    num_iter = 50
    start_event.record()

    for i in range(num_iter): c = torch._grouped_mm(a, b)
    end_event.record()

    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iter
    print(f"batch: {batch}\tM: {M}\tN: {N}\tK: {K}")
    print(f"Time per Iteration:\t {avg_time_ms:.4f} ms")
```

On H200
```
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 298.6668 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 4.1462 ms
```

B200
```
batch: 16       M: 128000       N: 7168 K: 7168
Time per Iteration:      190.7458 ms
batch: 128      M: 1    N: 2048 K: 7168
Time per Iteration:      3.0680 ms
```
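
For a quick comparison, the speedup implied by the two sets of timings above can be computed as follows. This is only arithmetic on the numbers reported in this message; the dictionary keys are labels chosen here for readability.

```python
# Time per iteration in milliseconds, copied from the H200 and B200 runs above.
h200_ms = {"batch=16, M=128000": 298.6668, "batch=128, M=1": 4.1462}
b200_ms = {"batch=16, M=128000": 190.7458, "batch=128, M=1": 3.0680}

# Speedup of B200 over H200 for each shape (ratio of times).
speedups = {shape: h200_ms[shape] / b200_ms[shape] for shape in h200_ms}

for shape, speedup in speedups.items():
    print(f"{shape}: B200 is {speedup:.2f}x faster than H200")
```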
nsys nvprof
```
root@16930b42ffc6:/workspace/pytorch# nsys nvprof python gemm_test.py
WARNING: python and any of its children processes will be profiled.

Collecting data...
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 192.6420 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 1.2255 ms
Generating '/tmp/nsys-report-6a53.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /workspace/pytorch/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)                 Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ---------------------------------
     98.9      10586895744          2  5293447872.0  5293447872.0  73786464  10513109280  7381715954.2  cudaDeviceSynchronize
      1.0        104084608          5    20816921.6    33552480.0    100800     34786208    18048125.3  cudaMalloc
      0.1          5694304          4     1423576.0     1416656.0   1258560      1602432      181668.1  cudaGetDeviceProperties_v2_v12000
      0.1          5430496        130       41773.0        4560.0      2496      3854368      345761.8  cudaLaunchKernel
      0.0           587584        110        5341.7        4992.0      4224        16992        1482.0  cudaLaunchKernelExC_v11060
      0.0           119200        660         180.6         128.0        96         4128         206.7  cudaGetDriverEntryPoint_v11030
      0.0            68352        660         103.6          64.0        32         4928         224.6  cuTensorMapEncodeTiled
      0.0            34976         49         713.8         224.0       160         6720        1343.4  cudaStreamIsCapturing_v10000
      0.0            32992          4        8248.0        7456.0      4128        13952        4804.4  cudaEventRecord
      0.0            16928          4        4232.0        3600.0      1728         8000        2764.7  cudaEventQuery
      0.0            16288          4        4072.0        3568.0      1952         7200        2396.1  cudaEventCreateWithFlags
      0.0            13632          4        3408.0        2672.0       544         7744        3408.7  cudaEventDestroy
      0.0             1056          1        1056.0        1056.0      1056         1056           0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     99.0      10549232845         55  191804233.5  192944479.0  165746368  203645313    5353204.3  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.6         67327135         55    1224129.7    1330656.0     924320    1364928     182180.4  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.3         34854783         20    1742739.1    1597856.0      10080    3899616     818421.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.0           354880        110       3226.2       3296.0       1920       4160        554.4  void at::cuda::detail::prepare_grouped_gemm_data<cutlass::bfloat16_t, cutlass::bfloat16_t, cutlass:…
```

The kernel names are too long to be shown in full by nvprof, so the following is pasted from Nsight Systems:
```
small kernel 1SM
100.0%	1.286 ms	1	1.286 ms	1.286 ms	1.286 ms	1.286 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)3, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)128, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, 
cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)

large kernel 2SM
100.0%	194.178 ms	1	194.178 ms	194.178 ms	194.178 ms	194.178 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)5, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)256>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_2x1SM_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)256, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, 
cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156203
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg
2025-06-28 23:02:00 +00:00
996206e66f cublaslt/hipblaslt persistent workspace (#156495)
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()
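
The "one workspace per handle+stream combo" idea can be sketched in plain Python as a cache keyed by the (handle, stream) pair. This is only a conceptual model: `WorkspacePool` and its methods are hypothetical names, a `bytearray` stands in for device memory, and none of this reflects the actual code in CublasHandlePool.cpp.

```python
class WorkspacePool:
    """Hands out one persistent workspace per (handle, stream) pair."""

    def __init__(self, workspace_size):
        self.workspace_size = workspace_size
        self._workspaces = {}

    def get(self, handle, stream):
        # Allocate lazily on first use, then keep returning the same buffer,
        # so repeated calls on the same pair never allocate again.
        key = (handle, stream)
        if key not in self._workspaces:
            self._workspaces[key] = bytearray(self.workspace_size)
        return self._workspaces[key]


pool = WorkspacePool(workspace_size=1024)
first = pool.get(handle=0, stream=0)
again = pool.get(handle=0, stream=0)   # same pair: reuses the buffer
other = pool.get(handle=0, stream=1)   # different stream: separate buffer
```

Reusing the buffer for a given pair is what keeps memory use from growing when the same work is issued repeatedly, for example during graph capture.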

Fixes https://github.com/ROCm/pytorch/issues/2286.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156495
Approved by: https://github.com/eqy
2025-06-28 22:38:43 +00:00
0629dfb860 Fix FSDP offload pin_memory bug (#157147)
Fixes #157146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157147
Approved by: https://github.com/weifengpy
2025-06-28 21:09:11 +00:00
67f8270516 [ROCm] test_hip_device_count safely runs on 1 GPU systems (#156398)
Fixes test_cuda.py::TestCuda::test_hip_device_count in the single-GPU scenario

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156398
Approved by: https://github.com/jeffdaily
2025-06-28 20:17:26 +00:00
aeffb68d34 [schema_upgrader] add C++ upgrader for json based upgrading (#156761)
Differential Revision: [D77459912](https://our.internmc.facebook.com/intern/diff/D77459912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156761
Approved by: https://github.com/angelayi
2025-06-28 18:15:06 +00:00
064a7db7fc [invoke_subgraph] turn on supports_input_mutation by default (#157177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157177
Approved by: https://github.com/anijain2305
2025-06-28 18:14:47 +00:00
2eb744c08d Revert "[BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)"
This reverts commit 0c58bdd8fb5f269aef100af8e2c43cfcf5f1f9dd.

Reverted https://github.com/pytorch/pytorch/pull/157073 on behalf of https://github.com/XuehaiPan due to break libtorch build on Windows ([comment](https://github.com/pytorch/pytorch/pull/157073#issuecomment-3015273679))
2025-06-28 13:40:19 +00:00
0c58bdd8fb [BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)
`cmake -E capabilities` produces a JSON format that is more machine-friendly.

```console
$ cmake --version
cmake version 4.0.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ cmake -E capabilities | jq '.version.string'
"4.0.3"
$ cmake -E capabilities | jq
{
  "debugger": true,
  "fileApi": {
    "requests": [
      {
        "kind": "codemodel",
        "version": [
          {
            "major": 2,
            "minor": 8
          }
        ]
      },
      {
        "kind": "configureLog",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cache",
        "version": [
          {
            "major": 2,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cmakeFiles",
        "version": [
          {
            "major": 1,
            "minor": 1
          }
        ]
      },
      {
        "kind": "toolchains",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      }
    ]
  },
  "generators": [
    {
      "extraGenerators": [],
      "name": "Watcom WMake",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "Kate"
      ],
      "name": "Ninja Multi-Config",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Ninja",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [],
      "name": "Xcode",
      "platformSupport": false,
      "toolsetSupport": true
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Unix Makefiles",
      "platformSupport": false,
      "toolsetSupport": false
    }
  ],
  "serverMode": false,
  "tls": true,
  "version": {
    "isDirty": false,
    "major": 4,
    "minor": 0,
    "patch": 3,
    "string": "4.0.3",
    "suffix": ""
  }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157073
Approved by: https://github.com/Skylion007
2025-06-28 13:35:30 +00:00
cdb144fcf0 Display a warning when overwriting CMAKE_CUDA_ARCHITECTURES (#156123)
Really, PyTorch shouldn't be messing with basic _global_ CMake configuration like this, but without a careful analysis of everything that depends on this behaviour, I'm not confident enough to propose a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-06-28 11:22:09 +00:00
8147c4a904 [symm_mem] Create a dedicated ci flow for symmetric memory and only use 4 GPUs (#157181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157181
Approved by: https://github.com/kwen2501, https://github.com/huydhn
2025-06-28 08:33:50 +00:00
88c6199db0 [nativert] Move KernelFactory to PyTorch core (#156913)
Summary: The kernel factory handles kernel node initialization and the execution of different types of kernels.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77346836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156913
Approved by: https://github.com/zhxchen17
2025-06-28 06:34:24 +00:00
51eb8e8f84 [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen (#153373)
The major release CCCL 3.0.0 will introduce some BC-breaking changes. Namely, iterators like TransformInputIterator and ConstantInputIterator were moved from CUB to Thrust, and some operators like Max and Sum were moved to LibCUDACXX.

For more information on the changes, please visit: https://nvidia.github.io/cccl/cccl/3.0_migration_guide.html

This is a follow-up to PR #147493. A description from the original PR:
> Several cub iterators have been deprecated and removed in the latest CCCL (cub) development https://github.com/NVIDIA/cccl/pull/3831. This PR replaced the usage of those cub iterators with thrust iterators.
>
> Some cub thread operators were also deprecated and removed in https://github.com/NVIDIA/cccl/pull/3918. This PR replaced those operators with libcudacxx ops.
>
> This might also affect ROCM usability a bit.
>
> This patch is tested to work with CCCL commit at 82befb0894
>
> Tracking of CCCL/CUB deprecations in the most recent development https://github.com/NVIDIA/cccl/issues/101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153373
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-28 05:44:52 +00:00
a92b24cd83 Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask (#156384)
`index_put` with a boolean mask (`target[mask] = src`) causes a `cudaStreamSynchronize`. When both `mask` and `target` tensors are on GPU this is expected.

However, the sync can be prevented if the `mask` is a CPU tensor.
Internally a new index tensor is created with `mask.nonzero()` so we can use a non-blocking copy to transfer it to the GPU since it cannot be accidentally mutated by the user between its creation and the device copy. @ngimel Let me know if I'm missing something.

I think this is useful since users can't prevent a sync simply by making sure all tensors are on the same device, as with other ops. Instead, one would need to do something like this, which is much less readable:
```python
indices = mask.nonzero().squeeze(1).to("cuda", non_blocking=True)
target[indices] = src
```
Fixes #12461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156384
Approved by: https://github.com/ngimel
2025-06-28 05:41:16 +00:00
5692cbb818 [ONNX] Delete symbolic caffe2 (#157102)
Caffe2 is removed from pytorch. This is a clean up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157102
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-06-28 05:22:02 +00:00
cyy
30d2648a4a Install nvperf_host together with cupti (#156668)
Because cupti depends on nvperf_host, as discussed in https://github.com/pytorch/pytorch/pull/154595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156668
Approved by: https://github.com/Skylion007
2025-06-28 04:26:36 +00:00
adf6dd1e44 Fix aten::index_put args Dtensor type mismatch and add a propagation strategy (#156240)
We noticed that model code contains indexing syntax like [nanogpt model code](f144fe9095/torchbenchmark/models/nanogpt/model.py (L240)), which causes training to fail in the backward pass when using DTensor.

In the code, `x = x[:, [-1], :]` calls the index op, and in the backward pass it triggers `aten.index_put.default` with a second argument of type `torch::List<std::optional<Tensor>>`, e.g., `[None, tensor([-1], device='cuda:0')]`. We are unable to unwrap the op info into DTensor with the current logic [here](2625c70aec/torch/distributed/tensor/_dispatch.py (L339-L358)). We need to set runtime_schema_info for the op and enable needs_pytree to support conversion of the tensor list argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156240
Approved by: https://github.com/wanchaol
2025-06-28 04:09:41 +00:00
f810480dbe Revert "[schema_upgrader] add C++ upgrader for json based upgrading (#156761)"
This reverts commit 61712e6f2ba58cce354a742d918934ec7293ee43.

Reverted https://github.com/pytorch/pytorch/pull/156761 on behalf of https://github.com/ydwu4 due to break linter test, which doesn't show up in the pr ([comment](https://github.com/pytorch/pytorch/pull/156761#issuecomment-3014918800))
2025-06-28 03:58:25 +00:00
0e47312ae5 ci: Add ability to test images for build-triton-wheel (#156894)
This wasn't available previously, making it difficult to test whether manywheel
image changes would affect triton wheel builds.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156894
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/malfet
ghstack dependencies: #156893
2025-06-28 03:41:18 +00:00
ef6dfa06a9 Create a base Checkpointer and SyncCheckpointer and add dist barrier impl and (#156926)
In preparation for adding async checkpointing, this diff:
1. Changes Checkpointer to an abstract base class and adds a sync checkpointer implementation.
2. Adds torch.distributed.barrier() as one of the barrier choices.

Differential Revision: [D77341314](https://our.internmc.facebook.com/intern/diff/D77341314/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156926
Approved by: https://github.com/pradeepfn
2025-06-28 02:48:29 +00:00
e8217ad8be [inductor][static launcher] Skip correctness test for test_floats (#157023)
https://github.com/triton-lang/triton/issues/6176 causes kernels that take fp64 scalar inputs to generate wrong results. Until we get around to fixing this, just skip the accuracy check (it'll fail on Triton's launcher anyway).

Differential Revision: [D77407307](https://our.internmc.facebook.com/intern/diff/D77407307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157023
Approved by: https://github.com/jamesjwu
2025-06-28 02:19:10 +00:00
e3320965b4 [sym_mem] Further Fix NCCL symm mem unit test (#157156)
We still see CI failures with the error "RuntimeError: CUDA driver error: invalid device ordinal". After discussion, we might also need a GPU-count skip macro for the test itself.

Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157156
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-06-28 02:17:13 +00:00
a1e4f1f98a [MPS] Reimplement tri[ul] as Metal shaders (#157179)
And add in-place flavor, as it is currently broken for non-contig tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157179
Approved by: https://github.com/dcci
2025-06-28 01:33:18 +00:00
c14110056f [caffe2] Allow the elimination of implicit calls to strlen when using the RECORD_FUNCTION macros (#153567)
Summary:
With the way these were written, any string literals that were being passed in, like `__func__`, were only ever passed down as a `const char*`, so this switches it over to take a `std::string_view` at the deepest part.

This also has the side effect of allowing `std::string_view` to be passed to the `RECORD_FUNCTION` macros as well.

Test Plan:
contbuilds

Rollback Plan:

Differential Revision: D74681042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153567
Approved by: https://github.com/Skylion007, https://github.com/swolchok
2025-06-28 01:11:00 +00:00
1e4c5b666a Revert "[dynamo] fix _torchdynamo_orig_callable naming issues (#156901)"
This reverts commit eb9efb37c8f315f1d30e86d5797490c6a8666889.

Reverted https://github.com/pytorch/pytorch/pull/156901 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal tests D77411594 ([comment](https://github.com/pytorch/pytorch/pull/156901#issuecomment-3014734151))
2025-06-28 00:37:01 +00:00
61712e6f2b [schema_upgrader] add C++ upgrader for json based upgrading (#156761)
Differential Revision: [D77459912](https://our.internmc.facebook.com/intern/diff/D77459912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156761
Approved by: https://github.com/angelayi
2025-06-27 23:50:19 +00:00
2815ade9a8 updated adafactor doc #154862 (#155248)
updated adafactor doc to reflect difference in implementation vs original paper

Fixes #154862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155248
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-06-27 23:23:19 +00:00
feea575082 [MTIA ATen Backend] Add dispatch keys for add.out (#156952)
Migrate add.out

Differential Revision: [D77352482](https://our.internmc.facebook.com/intern/diff/D77352482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156952
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949, #156950, #156951
2025-06-27 22:49:00 +00:00
253cbadade [MTIA ATen Backend] Add dispatch keys for rsub.Tensor / rsub.Scalar / sub.out (#156951)
Migrate rsub.Tensor / rsub.Scalar / sub.out

Differential Revision: [D77015033](https://our.internmc.facebook.com/intern/diff/D77015033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156951
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949, #156950
2025-06-27 22:49:00 +00:00
b6b2871555 [MTIA ATen Backend] Add dispatch keys for fmod / abs.out / logical_not.out (#156950)
Migrate fmod / abs.out / logical_not.out

Differential Revision: [D77220217](https://our.internmc.facebook.com/intern/diff/D77220217/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156950
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949
2025-06-27 22:48:48 +00:00
a95bee9ed6 [MTIA ATen Backend] Add dispatch key for div.out (#156949)
Migrate div.out

Differential Revision: [D77063371](https://our.internmc.facebook.com/intern/diff/D77063371/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156949
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948
2025-06-27 22:48:39 +00:00
f30e072cb4 [MTIA ATen Backend] Add dispatch keys for mul.Scalar_out / mul.out (#156948)
Migrate mul.Scalar_out / mul.out

Differential Revision: [D77011801](https://our.internmc.facebook.com/intern/diff/D77011801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156948
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947
2025-06-27 22:48:32 +00:00
66ad843583 [MTIA ATen Backend] Add dispatch keys for gt.Tensor_out / gt.Scalar_out (#156947)
Migrate gt.Tensor_out / gt.Scalar_out

Differential Revision: [D77009468](https://our.internmc.facebook.com/intern/diff/D77009468/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156947
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946
2025-06-27 22:48:25 +00:00
f0a5a3b453 [MTIA ATen Backend] Add dispatch keys for ne.Tensor_out / ne.Scalar_out (#156946)
Migrate ne.Tensor_out / ne.Scalar_out

Differential Revision: [D77008139](https://our.internmc.facebook.com/intern/diff/D77008139/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156946
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945
2025-06-27 22:48:18 +00:00
cd1a924dba [nativert] get rid of sigmoid naming (#157134)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157134
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-27 22:41:52 +00:00
d283fc79b1 chunk_size should always be int64_t for Foreach functors (#156872)
See https://github.com/pytorch/pytorch/issues/156261#issuecomment-3002394773

Testing is a valid q--it is pretty expensive to test such large tensors for all these ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156872
Approved by: https://github.com/Skylion007, https://github.com/eqy
ghstack dependencies: #156876, #156871
2025-06-27 22:35:34 +00:00
5a0926a26e Stop skipping entire foreach tests, just skip the profiler portion (#156871)
Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high pri to add it back to ensure foreach ops are actually performant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871
Approved by: https://github.com/albanD
ghstack dependencies: #156876
2025-06-27 22:35:34 +00:00
20e40492b0 [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan
2025-06-27 21:35:56 +00:00
130d4973bd Documentation update torch.clone #156644 (#157007)
updated torch clone docs to reflect implemented memory behavior

Fixes #156644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157007
Approved by: https://github.com/malfet, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-27 21:10:09 +00:00
3ee75b7eac [MTIA ATen Backend] Add dispatch keys for le.Tensor_out / le.Scalar_out (#156945)
Migrate le.Tensor_out / le.Scalar_out

Differential Revision: [D77002317](https://our.internmc.facebook.com/intern/diff/D77002317/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156945
Approved by: https://github.com/malfet
ghstack dependencies: #156944
2025-06-27 21:03:19 +00:00
6b7767fc8d [MTIA ATen Backend] Add dispatch keys for ge.Tensor_out / ge.Scalar_out (#156944)
Migrate ge.Tensor_out / ge.Scalar_out

Differential Revision: [D77002145](https://our.internmc.facebook.com/intern/diff/D77002145/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156944
Approved by: https://github.com/malfet
2025-06-27 21:02:27 +00:00
0decd966af Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit 216bd6091ec52865052282eced7e6d5d2a4b4fb4.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/huydhn due to Some tests are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3014185210))
2025-06-27 19:39:41 +00:00
7c51619e7f Fix Float16 CooperativeReduction Test Failure (#154516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154516
Approved by: https://github.com/jansel, https://github.com/jeffdaily
2025-06-27 19:31:49 +00:00
4048a144ab Address richard's comments on libtorch_stable_abi note (#156324)
Followups from #155984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156324
Approved by: https://github.com/zou3519
2025-06-27 19:19:12 +00:00
dcb97cd519 Remove unneccesary code to check autograd state (#156855)
Summary: Title

Test Plan:
CI

Rollback Plan:

Differential Revision: D77317627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156855
Approved by: https://github.com/zhxchen17

Co-authored-by: Camyll Harajli <camyllh@meta.com>
2025-06-27 19:18:06 +00:00
8a88c6e85a [nit] fix xavier init doc (#157100)
Remove part of the documentation that is irrelevant and confusing at best, probably a copy-paste mistake:

<img src="https://github.com/user-attachments/assets/77fa5734-5a5a-4f8d-80a5-bc3269668e07" width="500">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157100
Approved by: https://github.com/mikaylagawarecki
2025-06-27 19:13:40 +00:00
75a7d9e868 Revert "python definitely_contiguous-> is_contiguous_or_false (#156515)"
This reverts commit 4c0091fda65b714fa73671a15e379f814af153e0.

Reverted https://github.com/pytorch/pytorch/pull/156515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some torch.export failures internally ([comment](https://github.com/pytorch/pytorch/pull/156515#issuecomment-3014104570))
2025-06-27 19:07:06 +00:00
2860f5c4f5 Remove mentioning of TorchScript in Export doc (#156969)
Remove mentioning of TorchScript

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156969
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <yiangela7@gmail.com>
2025-06-27 17:59:15 +00:00
456b7451c7 Minor error message fix in device_mesh.py (#157096)
Fixed error message:
On main:
```
KeyError: ("Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. ", 'Found mesh dim indices to slice: [(1,), (1,)]. ', 'Mesh dim indices should be in ascending order.')
```
On PR:
```
KeyError: Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. Found mesh dim indices to slice: [(1,), (1,)]. Mesh dim indices should be in ascending order.'
```
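
The tuple-shaped message on main is what Python produces when an exception receives several string arguments instead of one. The sketch below reproduces that behavior with a plain `KeyError`. The message text is shortened and hypothetical, and it is an assumption that device_mesh.py hit exactly this pattern.

```python
# Hypothetical, shortened message pieces; only the construction pattern matters.
parts = (
    "Invalid mesh_dim_names specified. ",
    "Mesh dim indices should be in ascending order.",
)

# Passing several strings as separate arguments stores them as a tuple in
# `args`, so the rendered message is the string form of that tuple.
as_tuple = KeyError(*parts)

# Joining the pieces first gives the exception a single string argument.
as_string = KeyError("".join(parts))

print(str(as_tuple))
print(str(as_string))
```

With a single argument, `KeyError` renders the repr of that argument, which is why the fixed message is still wrapped in quotes.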

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157096
Approved by: https://github.com/Skylion007
2025-06-27 17:42:29 +00:00
36fd1ac932 [ONNX] Bump onnxscript api for torch 2.8 (#157017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157017
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-06-27 17:39:17 +00:00
84c588e5ea [cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)
Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/)

Motivation:
* This is the case we care about the most.
* We are caching the kernels for this row x column layout, so testing on them can potentially make CI run faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156656
Approved by: https://github.com/ColinPeppler
2025-06-27 17:15:45 +00:00
b22b93a6ba [2/n] rewrite load balancing and sharding in context parallel (#155442)
This PR rewrites how load balancing and sharding work in the current
context parallel implementation.

Why the changes? We should NOT expose another layer of "sharding"
concept, as it would confuse users about how it differs from DTensor
sharding. The current CP performs sharding in an odd way simply because it
mixes the concepts of load balancing and sharding.

I think load balancing and sharding need to be decoupled to separate
layers:

* The load balancing layer is responsible for reordering the input sequence
so that the attention computation is evenly balanced across rows/ranks.
* Sharding is a separate layer after it; it simply takes the input reordered by
the load balancer and shards it sequentially, exactly as DTensor shards a tensor.

In this PR:
* I removed the mixed usage of "Sharder" and "LoadBalancer", and
simply generate round-robin indices when the mask is a causal mask
* use `distribute_tensor` to perform the sharding. We still keep the local
shard instead of the DTensor objects to allow maximum compatibility with
arbitrary model architecture given DTensor op coverage is not high
enough.
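
The two layers described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and the chunk pairing used here (rank `r` takes chunk `r` and chunk `2 * world_size - 1 - r`) is an assumed balancing scheme for a causal mask.

```python
def load_balance_indices(seq_len: int, world_size: int) -> list[int]:
    """Reorder token positions so each rank's contiguous shard has equal causal work."""
    assert seq_len % (2 * world_size) == 0
    chunk = seq_len // (2 * world_size)
    order = []
    for rank in range(world_size):
        # Pair an early (cheap) chunk with a late (expensive) chunk.
        for c in (rank, 2 * world_size - 1 - rank):
            order.extend(range(c * chunk, (c + 1) * chunk))
    return order


def shard_sequentially(items: list[int], world_size: int) -> list[list[int]]:
    """Plain sequential sharding, applied after the reorder."""
    per_rank = len(items) // world_size
    return [items[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def restore_indices(order: list[int]) -> list[int]:
    """Inverse permutation: where each original position ended up."""
    restore = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        restore[old_pos] = new_pos
    return restore


order = load_balance_indices(seq_len=8, world_size=2)
shards = shard_sequentially(order, world_size=2)
# Under a causal mask, token i attends to i + 1 positions.
work = [sum(i + 1 for i in shard) for shard in shards]
restore = restore_indices(order)
original = [order[restore[i]] for i in range(len(order))]
```

Keeping the reorder separate from the sharding means the sharding step stays an ordinary sequential split, and the restore indices undo the reorder on the gathered output.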

One alternative design is to keep the LoadBalancer and make index
generation and restoration part of the LoadBalancer protocol. I thought it
through and think we might want to directly expose the load-balancing indices as
an argument instead of a dedicated class interface, so I removed it here. More
discussion on this is welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155442
Approved by: https://github.com/XilunWu
ghstack dependencies: #155441
2025-06-27 17:06:42 +00:00
f7c730107e [1/n] refactor the ring attention implementation (#155441)
As titled. I'm working on a series of changes to make the ring attention
impl and DTensor work better together. This PR specifically refactors the
current implementation to:

* remove dead/unused code
* restructure the functions to make them stay organized
* refactor to remove/make error message better

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155441
Approved by: https://github.com/fegin
2025-06-27 17:06:42 +00:00
eeaefa1336 Fix UnbackedSymint rebinding - check unbacked before renaming (#156911)
Differential Revision: D77249427

Due to memoization and graph order updates, a backed symbol can be passed into compute_unbacked_bindings and lead to a failure. An example follows:

- There are 2 boolean indexing operators (e.g. op1 and op2) with the same mask.
- An unbacked symint is generated from op1, and then op2 reuses the unbacked symint due to the nonzero_memo in nonzero's fake implementation, so no rebinding is needed for op2.
- Since op1 generated the unbacked symint, its meta has the "unbacked_bindings" field filled, and op2's meta doesn't have it.
- Outputs from op1 and op2 are later concatenated with others that have backed symints, so the unbacked symint can be replaced by a backed symint.
- In Inductor, during fake tensor prop, there is no memoization because a new fake tensor is always generated (for the same node). op1 generates an unbacked symint, and the unbacked symint can be rebound successfully to the backed symint. Since there is no memoization, op2 also generates a new unbacked symint, but no rebinding can happen because op2's meta doesn't have "unbacked_bindings". "compute_unbacked_bindings/_rename_unbacked_to" then fails when asserting that op2's old symbol is unbacked.
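
The sequence above can be mimicked with a toy model. This sketch does not use any PyTorch internals: `FakeNonzero`, `propagate`, and the `u0`/`u1` symbol names are hypothetical stand-ins that only show how memoization changes which op owns a symbol.

```python
import itertools


class FakeNonzero:
    """Toy stand-in for a fake `nonzero` that hands out unbacked symbols."""

    def __init__(self, memoize: bool):
        self.memoize = memoize
        self._ids = itertools.count()
        self._memo = {}

    def __call__(self, mask_name: str) -> str:
        if self.memoize and mask_name in self._memo:
            return self._memo[mask_name]  # reuse the earlier symbol
        symbol = f"u{next(self._ids)}"
        if self.memoize:
            self._memo[mask_name] = symbol
        return symbol


def propagate(memoize: bool) -> dict:
    nonzero = FakeNonzero(memoize)
    return {"op1": nonzero("mask"), "op2": nonzero("mask")}


# Tracing with memoization: both ops share one symbol, only op1 records a binding.
traced = propagate(memoize=True)
unbacked_bindings = {"op1": traced["op1"]}

# Later propagation without memoization: op2 receives a fresh symbol.
reprop = propagate(memoize=False)

# Only nodes that recorded a binding can be renamed back to the old symbol.
renamable = {op: s for op, s in reprop.items() if op in unbacked_bindings}
stranded = {op: s for op, s in reprop.items() if op not in unbacked_bindings}
```

In this toy model `op2` ends up in `stranded`: it holds a new symbol but has no recorded binding to rename it from, which mirrors the failing case described in the list.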

From discussion with [@ezyang](https://www.internalfb.com/intern/profile/?id=503862770), there is no easy way to fix this issue.

- We can try to enable memoization for fake tensor prop in Inductor; however, we need to ensure that op1 is visited before op2 during Inductor fake tensor prop for this to work (op2's meta doesn't have "unbacked_bindings", so no rebinding can happen and we need to do the rebinding from op1). But there are passes such as reorder_for_locality that can change the graph order, so this doesn't work.
- A simple hack is to just replace the unbacked symbol in op2 by the backed symbol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156911
Approved by: https://github.com/ezyang
2025-06-27 16:57:04 +00:00
216bd6091e Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-27 16:41:00 +00:00
d0cfa3e5bf [c10d] Move the include of header file of TraceUtils.h into NCCLUtil.cpp instead of keeping in hpp (#156909)
We have seen complaints about compilation failures of `NCCLSymmetricMemory.cu`. The reason is that we include <torch/csrc/distributed/c10d/TraceUtils.h> inside NCCLUtil.hpp. This is not necessary, so we want to move the include to the cpp file.

Differential Revision: [D77346675](https://our.internmc.facebook.com/intern/diff/D77346675)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156909
Approved by: https://github.com/kwen2501
2025-06-27 16:30:49 +00:00
21b5dc7a6a [CD] Add python-3.14.0b3 to docker image (#156889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156889
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #157033
2025-06-27 16:24:39 +00:00
d158e9ea82 Update nightly PyTorch version to 2.8.0->2.9.0 (#156965)
Same as https://github.com/pytorch/pytorch/pull/149038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156965
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-06-27 16:22:08 +00:00
60abb0d327 [dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)
Fixes #156716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156924
Approved by: https://github.com/williamwen42
2025-06-27 15:50:36 +00:00
ff8b53c056 [Kineto] Add MTIA_INSIGHT to kineto_shim (#156853)
Summary:
Add MTIA_INSIGHT to kMtiaTypes in kineto_shim.cpp

For insight, users can set MTIA_INSIGHT_VERBOSE_TRACES=0 to disable the profiler, so we can enable it by default.

Test Plan:
{F1979756361}
When the environment var isn't set, it uses 0.

Rollback Plan:

Differential Revision: D77315882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156853
Approved by: https://github.com/sraikund16
2025-06-27 15:30:14 +00:00
5118a8f8a5 Rename mm_scaled_grouped.py to mm_grouped.py (#156849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156849
Approved by: https://github.com/amjames, https://github.com/Skylion007
2025-06-27 15:02:22 +00:00
aa2d54148d Add AOTDispatcher config to set backward autocast behavior (#156356)
This PR adds a new config `backward_pass_autocast`, to set the backward autocast
behavior. It does not change the existing behavior.

The reason why we need this is that torch.compile acquires a forward and
backward graph at the time of the forward pass. This means that
implemented naively, if there are any context managers active outside
the call to torch.compile, the backward graph will also get the
behaviors from those context managers. This PR gives users a way to
tweak the autocast behavior of the backward pass.

Please see torch._functorch.config for the options to the
`backward_pass_autocast` config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356
Approved by: https://github.com/bdhirsh
ghstack dependencies: #155354
2025-06-27 14:58:58 +00:00
adf9644440 Add pg transport and tests (#154653)
Add PG transport and tests under `torch/distributed/checkpoint/`

### API:
```python
def send_checkpoint(self, dst_ranks: list[int], state_dict: object) -> None:
def recv_checkpoint(self, src_rank: int) -> object:
```

### Tests:
```
python test/distributed/checkpoint/test_pg_transport.py
```

### Example:
Under `_pg_transport_example.py` (in https://github.com/pytorch/pytorch/pull/155810)
```
torchrun --nproc_per_node=2 -m torch.distributed.checkpoint._pg_transport_example -- --device cuda
```

Differential Revision: [D76044919](https://our.internmc.facebook.com/intern/diff/D76044919)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154653
Approved by: https://github.com/meetv18
2025-06-27 14:53:34 +00:00
414ad47045 revamp dtype documentation for 2025 (#156087)
The dtype documentation has not been updated in a while, let's do a revamp.

1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst`
2. split the dtype table into floating point and integer dtypes
3. add the definition of shell dtype
4. add the float8 and MX dtypes as shell dtypes to the dtype table
5. remove legacy quantized dtypes from the table
6. add the definition of various dtype suffixes ("fn", etc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087
Approved by: https://github.com/albanD
2025-06-27 13:10:23 +00:00
43523bf168 Fix silent incorrectness arising from incorrect alias information (#152011)
Fixes #136662

There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
   These new nodes cause the alias info on the graph to be wrong. To fix
   this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
   nodes that it thinks have "equivalent" FakeTensor metadata.
   It should not be allowed to do this if any users of the node can
   alias the node. The example
   is if we have `x = foo(...); y = x.view(...)`. If the user replaces
   `foo` with a new `bar` node and sets bar.meta["val"] correctly, then
   FakeTensorUpdater still needs to update y's meta["val"] to be a view
   of the new bar node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
2025-06-27 12:45:03 +00:00
75f3e5a88d [dynamo] Fix issue with tensors passed as view() shapes (#156928)
Fixes #156720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156928
Approved by: https://github.com/ezyang
2025-06-27 08:52:31 +00:00
588b5fb94b Optimize TorchHigherOrderOperatorVariable.make() with lookup table (#157022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157022
Approved by: https://github.com/zou3519
2025-06-27 07:36:12 +00:00
968f90ce73 [ROCm][Windows] Fixing undefined symbol linker error after exposing MIOpen symbols (#156479)
Fixing undefined symbol linker error after [exposing MIOpen symbols](https://github.com/pytorch/pytorch/pull/154545).
This fix:

- Hipifies `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` files
- Adds `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` hipified source files to `all_hip_cpp` list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156479
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-27 07:23:32 +00:00
4a80ddfbe7 Revert "Fix reinplace pass handling of view input + mutable custom op (#156729)"
This reverts commit b754b1fa43d20f5b31e17c396487ab56991912da.

Reverted https://github.com/pytorch/pytorch/pull/156729 on behalf of https://github.com/davidberard98 due to breaks lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15918483073/job/44900430950) [HUD commit link](b754b1fa43) ([comment](https://github.com/pytorch/pytorch/pull/156729#issuecomment-3011867746))
2025-06-27 06:38:58 +00:00
cyy
064288cbab Use std::string_view in torchgen (#157050)
Let the generated code use std::string_view

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157050
Approved by: https://github.com/ezyang
2025-06-27 06:36:10 +00:00
cc3ea2d840 remove gso from Linear.cpp (#156899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156899
Approved by: https://github.com/ColinPeppler
2025-06-27 06:30:50 +00:00
cf0749c92f Use expecttest in test_compiled_optimizers.py (#155308)
Fixes #141262

## Test Result

```bash
pytest test/inductor/test_compiled_optimizers.py -vv
```

![image](https://github.com/user-attachments/assets/1886fb71-ff05-46e7-988c-82d36358a834)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155308
Approved by: https://github.com/mlazos, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
2025-06-27 06:29:51 +00:00
cbcffce48a address remaining straight forward gso in meta_registrations (#156902)
These are all straightforward generalizations of existing checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156902
Approved by: https://github.com/ColinPeppler
2025-06-27 06:19:54 +00:00
640703d95f add torch.concat to normalization pass (#156574)
Summary: In the normalization pass, we also add torch.concat to it to normalize it as torch.cat

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes -- test_cat_normalization
```

Buck UI: https://www.internalfb.com/buck2/597fd4f1-0aa7-4372-8a66-5a690d9b63a4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850152284203
Network: Up: 84KiB  Down: 34KiB  (reSessionID-3916e009-7117-41ce-b6f9-089873aa50dd)
Executing actions. Remaining     0/3                                                                                              1.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:47.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D77125331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156574
Approved by: https://github.com/Mingming-Ding
2025-06-27 06:07:26 +00:00
1155c53e7d Port three dynamo test to Intel GPU (#156575)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Two dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056). In this PR we will port 3 more dynamo test files.
We enable Intel GPU with the following methods while trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- added XPU support in decorators like @requires_gpu
- enabled XPU for some test path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156575
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-27 05:56:22 +00:00
51853b358e [dynamo] Improve error message for cond aliasing (#156963)
See #156724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156963
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-06-27 05:31:46 +00:00
6b05842e47 [test][inductor] fix test_conv_cat failure (#155852)
This test is currently failing because triton_poi_fused_cat_2 has changed to triton_poi_fused_cat_3. I have not investigated why the extra kernel is generated, but this test has been failing on trunk for a while (and I verified locally that it is failing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155852
Approved by: https://github.com/FindHao, https://github.com/Skylion007
2025-06-27 05:11:11 +00:00
2c76f31221 Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)
When we compute contiguity for a tensor with dynamic shapes, we:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.

sym_is_contiguous returns a SymBool that can then either be evaluated, or guard_or_false can be called
on it to avoid data-dependent errors.

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity; I will follow up with changes for other formats like channel_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-06-27 04:59:52 +00:00
b754b1fa43 Fix reinplace pass handling of view input + mutable custom op (#156729)
Fixes #153389.

Using approach https://github.com/pytorch/pytorch/issues/153389#issuecomment-3006049928 suggested by Richard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156729
Approved by: https://github.com/zou3519
2025-06-27 04:54:17 +00:00
e6d8ed02cb PyTorch Data Sampler benchmark (#156974)
## Motivation
Many PRs optimizing samplers (e.g., https://github.com/pytorch/pytorch/pull/147706, https://github.com/pytorch/pytorch/pull/137423) leverage an ad hoc script for benchmarking samplers. The script and outputs are often copied over in PRs. We want to begin centralizing benchmarks for torch.utils.data components.

## What ?
* This PR adds a new sub-folder in `benchmarks` for `data`. It is intended to cover benchmarking scripts for torch.utils.data components like dataloader and sampler.
* Specifically, this PR includes a simple script to time samplers. This is often "copy-pasted" in PRs optimizing samplers. Having it in a centralized location should prevent that, and allow a common standard.

## Output
```
Benchmark Results:
+--------------+-------------+----------------+-----------+-----------+
|   Batch Size | Drop Last   |   Original (s) |   New (s) | Speedup   |
+==============+=============+================+===========+===========+
|            4 | True        |         0.004  |    0.0088 | -119.62%  |
+--------------+-------------+----------------+-----------+-----------+
|            4 | False       |         0.0083 |    0.009  | -9.23%    |
+--------------+-------------+----------------+-----------+-----------+
|            8 | True        |         0.003  |    0.0074 | -147.64%  |
+--------------+-------------+----------------+-----------+-----------+
|            8 | False       |         0.0054 |    0.0075 | -38.72%   |
+--------------+-------------+----------------+-----------+-----------+
|           64 | True        |         0.0021 |    0.0056 | -161.92%  |
+--------------+-------------+----------------+-----------+-----------+
|           64 | False       |         0.0029 |    0.0055 | -92.50%   |
+--------------+-------------+----------------+-----------+-----------+
|          640 | True        |         0.002  |    0.0055 | -168.75%  |
+--------------+-------------+----------------+-----------+-----------+
|          640 | False       |         0.0024 |    0.0062 | -161.35%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | True        |         0.0021 |    0.0055 | -160.13%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | False       |         0.0021 |    0.0068 | -215.46%  |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | True        |         0.0042 |    0.0065 | -55.29%   |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | False       |         0.0029 |    0.0077 | -169.56%  |
+--------------+-------------+----------------+-----------+-----------+
```
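A timing script of this kind can be sketched in a few lines. The following is a minimal, torch-free illustration and not the script added by this PR: `batch_indices` is a hypothetical stand-in for a batch sampler, and the formula in `speedup_percent` is an assumption that appears consistent with the table above, where a negative value means the new variant is slower.

```python
import timeit


def batch_indices(n, batch_size, drop_last):
    """Hypothetical stand-in for a batch sampler: yields lists of indices."""
    batch = []
    for idx in range(n):
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final partial batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


def time_sampler(make_iter, repeats=5):
    """Return the best wall-clock time (seconds) to exhaust the iterator."""
    return min(timeit.repeat(lambda: list(make_iter()), number=1, repeat=repeats))


def speedup_percent(original_s, new_s):
    """Positive means the new variant is faster; negative means slower."""
    return (original_s - new_s) / original_s * 100.0


elapsed = time_sampler(lambda: batch_indices(10_000, 64, drop_last=True))
print(f"batch_size=64 drop_last=True: {elapsed:.4f}s")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the measured times are only a few milliseconds.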
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156974
Approved by: https://github.com/ramanishsingh
2025-06-27 04:49:43 +00:00
195ef1bce8 [SymmMem] Refactor NVSHMEM tests: separate Triton tests into dedicated file (#156685)
## Summary

Moved the Triton-specific NVSHMEM tests in `test_nvshmem.py` into a dedicated `test_nvshmem_triton.py` file. Also put the shared Triton JIT kernels at the top level of the new file for reusability.

## Testing

```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem_triton.py
```

All 16 original tests pass with no functionality changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156685
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156684
2025-06-27 04:38:37 +00:00
b6c00dfe24 [user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs

This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)

Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
2025-06-27 04:28:04 +00:00
710b92cf3b [BE][BugFix] Install Python-3.13 correctly (#157033)
Fixes the temporary workaround introduced by https://github.com/pytorch/builder/pull/1827

I.e., it has been downloading the latest 3.13 branch rather than the 3.13.0 release

Simplify nogil version handling
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157033
Approved by: https://github.com/wdvr, https://github.com/huydhn
2025-06-27 04:19:59 +00:00
1eea2c4fe3 [Inductor][CPP] Fix perf regression of functorch_maml_omniglot (#156526)
**Summary**
Fix the performance regression of `functorch_maml_omniglot` in TorchBench. The issue reported in [#151523](https://github.com/pytorch/pytorch/issues/151523) occurs only when a parallel reduction is performed under the vectorized loop and a scalar kernel is used for the tail loop. Previously, we addressed this regression in [#151887](https://github.com/pytorch/pytorch/pull/151887) by disabling all cases where a parallel reduction occurs under the vectorized loop. However, for `functorch_maml_omniglot`, we found that a masked vector kernel is used in the tail loop instead of the scalar kernel in the `inductor_torchbench_cpu_smoketest_perf` job. In this PR, we refine the fix by excluding the cases where a masked vector kernel is used in the tail loop, rather than disabling all such scenarios.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156526
Approved by: https://github.com/CaoE
2025-06-27 03:09:24 +00:00
7392470da4 [nativert] alias analyzer + layout planner/manager to pytorch core (#156897)
Summary: att

Test Plan:
ci - unit tests still have some unresolved deps but will move them later.

Rollback Plan:

Differential Revision: D77320950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156897
Approved by: https://github.com/zhxchen17
2025-06-27 03:01:22 +00:00
382c6190c1 complex.pow(2) on GPU by replacing with complex * complex to avoid numerical instability (#152373)
Fixes #150951
Summary:
For complex.pow(2) on GPU:

Uses complex * complex directly.
Produces results consistent with CPU implementation.
Eliminates spurious imaginary components for real inputs.

🧪 Tests
Added unit tests to verify correctness of the new kernel path.
Verified numerical consistency with CPU results.

This change is backward-compatible and only affects the specific case of pow(2) on complex tensors on GPU.
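
The effect can be illustrated on the CPU with Python's `cmath`. This is only a sketch of the numerical issue: it assumes a generic power routine that evaluates `exp(2 * log(z))`, which is an illustrative stand-in and not the actual GPU kernel touched by this change.

```python
import cmath

z = complex(-3.0, 0.0)  # a real-valued input stored as a complex number

# Generic power via exp(2 * log(z)); an illustrative stand-in for a
# general-purpose pow path, not the real GPU implementation.
via_exp_log = cmath.exp(2 * cmath.log(z))

# Direct multiplication, the approach described above for pow(2).
direct = z * z

print(via_exp_log)  # real part close to 9, plus a tiny nonzero imaginary part
print(direct)       # imaginary part is exactly zero
```

The spurious imaginary part comes from evaluating `sin` at a floating-point approximation of `2 * pi`; multiplying directly never goes through that step.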
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152373
Approved by: https://github.com/ezyang
2025-06-27 02:21:59 +00:00
e290a4c645 Revert "Rename torch::standalone to headeronly (#156964)"
This reverts commit 7e54c02a35b905e758497b856a1953eb009ba836.

Reverted https://github.com/pytorch/pytorch/pull/156964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156964#issuecomment-3011136947))
2025-06-27 02:20:33 +00:00
4ab4d29cbe [BE] Remove SymmMem allocator destruct log (#157020)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157020
Approved by: https://github.com/fduwjj
2025-06-27 02:10:54 +00:00
56c69bedcc Revert "[dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)"
This reverts commit 863327ae496471654344e1e04ccaa713a44a135d.

Reverted https://github.com/pytorch/pytorch/pull/156924 on behalf of https://github.com/jansel due to Likely same issue as #156963 ([comment](https://github.com/pytorch/pytorch/pull/156924#issuecomment-3011087802))
2025-06-27 01:57:05 +00:00
8e8bbfc803 Remove ts to export retracer (#156857)
Summary: This is probably not used anymore

Test Plan:
CI

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D77318582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156857
Approved by: https://github.com/SherlockNoMad
2025-06-27 01:54:24 +00:00
a4b59498c5 Fix fake kernel for the out=... variant of unbind_copy (#156643)
`unbind_copy(..., out=...)` returns None rather than the `out` argument
(see https://github.com/pytorch/pytorch/issues/130829#issuecomment-2283936222),
but the old fake kernel didn't account for that and caused an assertion
failure in `pushPyOutToStack`. This patch fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156643
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #156642
2025-06-27 01:34:07 +00:00
89aa708b39 [core] Dispatch to at::nansum_out rather than at::native::nansum_out (#156642)
Calling `at::native::nansum_out` causes the fake kernel to dispatch to a
`make_reduction` call, which then segfaults later due to the
`mutable_data_ptr` call in `TensorIteratorBase::build`. It also causes a
fake tensor propagation issue in Dynamo. The added tests demonstrate both
issues.

This patch fixes it by dispatching to `at::nansum_out` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156642
Approved by: https://github.com/zou3519
2025-06-27 01:34:07 +00:00
863327ae49 [dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)
Fixes #156716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156924
Approved by: https://github.com/williamwen42
2025-06-27 01:02:01 +00:00
7e54c02a35 Rename torch::standalone to headeronly (#156964)
Summary: headeronly is more clear, let's change the name before anyone depends on standalone

Test Plan:
CI should pass!

Rollback Plan:

Differential Revision: D77381084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156964
Approved by: https://github.com/swolchok, https://github.com/albanD, https://github.com/desertfire
2025-06-27 01:00:14 +00:00
3bdd5ae334 [PT2] deprecate force_same_precision, guarded by JK (#156789)
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users. This caused a numeric SEV in the past, when Triton would use TF32 even though cuBLAS could not because the alignment checks failed.

Based on some testing in D77265581, we believe that cuBLAS no longer has alignment requirements for TF32 usage. We'd like to deprecate `force_same_precision` since it no longer functions as expected.

This changes the default to False in fbcode, guarded by a JK so that we can quickly revert to the original behavior if needed.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77265930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
2025-06-27 00:43:06 +00:00
6215e90b7b Revert "[dynamo] Improve error message for cond aliasing (#156963)"
This reverts commit 9c39bc24807a5843f8affdf56bd71836760dc554.

Reverted https://github.com/pytorch/pytorch/pull/156963 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failures are legit ([comment](https://github.com/pytorch/pytorch/pull/156963#issuecomment-3010870664))
2025-06-27 00:31:00 +00:00
e3977e843d Revert "Fix silent incorrectness arising from incorrect alias information (#152011)"
This reverts commit 2d39a48d524021995269411bd49fe792e59d9f94.

Reverted https://github.com/pytorch/pytorch/pull/152011 on behalf of https://github.com/Camyll due to cannot land internally. owner will update and reland to fix ([comment](https://github.com/pytorch/pytorch/pull/152011#issuecomment-3010723960))
2025-06-26 23:54:13 +00:00
eb9efb37c8 [dynamo] fix _torchdynamo_orig_callable naming issues (#156901)
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks

We rename the first usage to `_torchdynamo_orig_fn` and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #156527
2025-06-26 23:51:08 +00:00
6089ebcf6d [dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)
Fixes https://github.com/pytorch/pytorch/issues/155057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156527
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-06-26 23:51:08 +00:00
e0447bb5f8 Add max_pool3d for MPS (#156467)
Fixes #100674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156467
Approved by: https://github.com/malfet
2025-06-26 23:33:50 +00:00
1fff6356d9 [MPS] Optimize cummin/cummax metal kernels (#156794)
Performance improvement (M4 Max 64GB, macOS 15.5; times in microseconds, with the Previous column matching the eager baseline from #156860):
```
                                              | Current | Previous
      cummin-dim0-32x32 (torch.float16)       |  103.4  |   102.5
      cummin-dim0-128x128 (torch.float16)     |  112.2  |   133.6
      cummin-dim0-512x512 (torch.float16)     |  146.9  |   233.1
      cummin-dim0-1024x1024 (torch.float16)   |  193.6  |   364.2
      cummin-dim1-32x32 (torch.float16)       |  102.0  |    94.4
      cummin-dim1-128x128 (torch.float16)     |  103.0  |   109.9
      cummin-dim1-512x512 (torch.float16)     |  109.1  |   227.0
      cummin-dim1-1024x1024 (torch.float16)   |  140.5  |   985.1
      cummin-1d-100 (torch.float16)           |  101.8  |   100.7
      cummin-1d-10000 (torch.float16)         |  112.8  |   805.0
      cummin-1d-1000000 (torch.float16)       | 1343.8  | 70545.6
      cummin-dim0-32x32 (torch.float32)       |  104.6  |   102.7
      cummin-dim0-128x128 (torch.float32)     |  112.3  |   137.2
      cummin-dim0-512x512 (torch.float32)     |  146.6  |   209.7
      cummin-dim0-1024x1024 (torch.float32)   |  194.0  |   340.1
      cummin-dim1-32x32 (torch.float32)       |  100.1  |    99.2
      cummin-dim1-128x128 (torch.float32)     |  101.4  |   111.9
      cummin-dim1-512x512 (torch.float32)     |  110.3  |   250.7
      cummin-dim1-1024x1024 (torch.float32)   |  141.4  |   987.9
      cummin-1d-100 (torch.float32)           |  101.0  |   100.6
      cummin-1d-10000 (torch.float32)         |  112.9  |   794.7
      cummin-1d-1000000 (torch.float32)       | 1311.7  | 71995.3
      cummin-dim0-32x32 (torch.bfloat16)      |  105.8  |   105.9
      cummin-dim0-128x128 (torch.bfloat16)    |  111.9  |   135.7
      cummin-dim0-512x512 (torch.bfloat16)    |  147.1  |   231.9
      cummin-dim0-1024x1024 (torch.bfloat16)  |  191.2  |   327.7
      cummin-dim1-32x32 (torch.bfloat16)      |  101.8  |    91.3
      cummin-dim1-128x128 (torch.bfloat16)    |  100.2  |   108.5
      cummin-dim1-512x512 (torch.bfloat16)    |  108.9  |   222.0
      cummin-dim1-1024x1024 (torch.bfloat16)  |  140.1  |   936.9
      cummin-1d-100 (torch.bfloat16)          |  103.0  |   106.6
      cummin-1d-10000 (torch.bfloat16)        |  113.1  |   795.8
      cummin-1d-1000000 (torch.bfloat16)      | 1296.8  | 68667.4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156794
Approved by: https://github.com/malfet
ghstack dependencies: #156860
2025-06-26 23:30:20 +00:00
9c39bc2480 [dynamo] Improve error message for cond aliasing (#156963)
See #156724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156963
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-06-26 23:12:00 +00:00
e6ed4074e8 update expected results (#157010)
<img width="1490" alt="Screenshot 2025-06-26 at 12 30 46 PM" src="https://github.com/user-attachments/assets/4df626d4-3010-4362-974c-fb96fa68b29f" />

<img width="904" alt="Screenshot 2025-06-26 at 12 28 29 PM" src="https://github.com/user-attachments/assets/42626892-27e1-4e69-9efc-c9baf80c5384" />

<img width="752" alt="Screenshot 2025-06-26 at 12 29 05 PM" src="https://github.com/user-attachments/assets/0b1afb30-5868-4ba6-9985-2cc7994a4227" />
PR https://github.com/pytorch/pytorch/pull/152011 added a slight regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157010
Approved by: https://github.com/zou3519
2025-06-26 21:56:57 +00:00
80d89974c1 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we wrap the exception and raise a hard error.
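
The marker scheme above can be sketched with a toy instruction stream. This is a simplified model and not Dynamo's InstructionTranslator: instructions are plain `(opname, arg)` tuples, and `prologue_flags` is a hypothetical helper that only tracks the stack effect of `LOAD_CONST` and of `STORE_FAST` to the marker variable.

```python
MARKER = "__is_tracing_resume_prologue"


def prologue_flags(instructions):
    """For each (opname, arg) pair, report whether it lies inside the prologue.

    Simplified model: LOAD_CONST pushes its arg, and STORE_FAST to MARKER
    pops the top of stack and uses it as the new "in prologue" state.
    """
    stack, in_prologue, flags = [], False, []
    for opname, arg in instructions:
        if opname == "LOAD_CONST":
            stack.append(arg)
        elif opname == "STORE_FAST" and arg == MARKER:
            in_prologue = bool(stack.pop())
        flags.append(in_prologue)
    return flags


program = [
    ("LOAD_CONST", True),
    ("STORE_FAST", MARKER),  # prologue starts
    ("LOAD_FAST", "x"),      # traced as part of the prologue
    ("LOAD_CONST", False),
    ("STORE_FAST", MARKER),  # prologue ends
    ("LOAD_FAST", "y"),      # ordinary resume-function body
]
print(prologue_flags(program))
```

In this model, an error raised while the flag is True would be the case that gets wrapped and turned into a hard error.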

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762, #155166
2025-06-26 21:40:38 +00:00
6df6eacce8 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See the added test for the case that this PR handles. In particular, the semantics of nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrode the existing fullgraph setting, while `@torch.compile(fullgraph=False)` did not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762
2025-06-26 21:40:38 +00:00
dcb8982969 [dynamo] move error_on_graph_break out of config (#156762)
error_on_graph_break doesn't need to be in config, so we move it out. It should make the functorch_maml_omniglot regression less severe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156762
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-26 21:40:38 +00:00
36666033ab [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-26 21:40:38 +00:00
7b7eafe7ba [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
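
The decorator/context-manager pattern described in the first note can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the Dynamo implementation: `config` is a hypothetical stand-in namespace, and this `patch_config` is a simplified re-creation that only saves and restores attributes.

```python
import contextlib
import types

# Hypothetical stand-in for a config module.
config = types.SimpleNamespace(error_on_graph_break=False)


@contextlib.contextmanager
def patch_config(**overrides):
    """Temporarily override config attributes, restoring them on exit."""
    saved = {name: getattr(config, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(config, name, value)
        yield
    finally:
        for name, value in saved.items():
            setattr(config, name, value)


def set_fullgraph(fullgraph):
    """Usable as `with set_fullgraph(True):` or as `@set_fullgraph(True)`."""
    return patch_config(error_on_graph_break=fullgraph)


observed = []


@set_fullgraph(False)
def inner():
    observed.append(config.error_on_graph_break)


with set_fullgraph(True):
    inner()                                       # sees False inside inner
    observed.append(config.error_on_graph_break)  # back to True afterwards
observed.append(config.error_on_graph_break)      # restored to the default
print(observed)
```

Objects returned by `contextlib.contextmanager` functions also work as decorators, which is why one helper can serve both uses here.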

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-26 21:40:38 +00:00
1c3f5e902d [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-26 21:40:38 +00:00
fc10d4b1d6 [SymmMem] Allow selection of allocation backend (#156661)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Today the only way to choose the allocation backend is via the env variable `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on a per-test-file basis. (The env variable has to be set before the program is loaded.)

This PR adds a programmatic way -- a `set_backend` API.

Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself as **the** allocator directly. Later when `set_backend` is called, the allocator would actually fill in the device-to-allocation `map_`.

Though added, `set_backend` is **not** an API that users are required to call -- one backend is still registered as the default at static time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-06-26 21:37:44 +00:00
262654ee51 [nativert] move constantfolder to libtorch (#156918)
Summary: att -- unit tests will be migrated later, since they still have unresolved deps.

Test Plan:
ci

Rollback Plan:

Differential Revision: D77159278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156918
Approved by: https://github.com/henryoier, https://github.com/zhxchen17
2025-06-26 21:26:37 +00:00
7f6e7103a3 Convert to markdown: jit_python_reference.rst, jit_unsupported.rst, jit_utils.rst, library.rst (#155404)
Fixes #155024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155404
Approved by: https://github.com/svekars
2025-06-26 21:09:46 +00:00
aff9c1eec5 [aoti][mps] Add fused_rms and sdpa_mps fallback ops (#156844)
Needed for llama3.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156844
Approved by: https://github.com/desertfire
ghstack dependencies: #156843
2025-06-26 21:03:05 +00:00
17dab018e3 [aoti][mps] Fix deduplication of kernels (#156843)
Previously I was not correctly deduplicating kernels generated by mps, so it would generate multiple copies of the same kernel.
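
One common way to deduplicate generated kernels is to key them by their source text. The sketch below shows that idea only; it assumes kernels are `(name, source)` pairs and is not the actual AOTI MPS codegen, and `dedupe_kernels` is a hypothetical helper.

```python
def dedupe_kernels(kernels):
    """Keep the first (name, source) pair for each distinct source string.

    Returns the unique kernels plus a mapping from every original name to
    the name of the kernel that should be called instead.
    """
    first_name_for_source = {}
    unique = []
    rename = {}
    for name, source in kernels:
        if source not in first_name_for_source:
            first_name_for_source[source] = name
            unique.append((name, source))
        rename[name] = first_name_for_source[source]
    return unique, rename


generated = [("k0", "kernel A"), ("k1", "kernel B"), ("k2", "kernel A")]
unique, rename = dedupe_kernels(generated)
print(unique)
print(rename)
```

The rename map matters because call sites that referred to a dropped duplicate must be redirected to the kernel that was kept.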

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156843
Approved by: https://github.com/desertfire
2025-06-26 21:03:05 +00:00
977abe786d fix 'register_foward_pre_hook not supported on ScriptModule' error (#156904)
Summary:
Encountered 'register_foward_pre_hook not supported on ScriptModule' error when trying to publish CFR MTML with the remote_ro module placed in remote. The issue may come from the fact that the local net from torchArrow is already a ScriptModule before the gen_app_graph pass.
{F1979770267}

Test Plan:
hg checkout 1ff14dfaade4ac1f3cbbf38fbd72f7fdd5cdcd16
bash hstu_blocker.sh

Rollback Plan:

Reviewed By: RenfeiChen-FB

Differential Revision: D77341370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156904
Approved by: https://github.com/jingsh
2025-06-26 20:59:24 +00:00
81759afed4 [nativert] clean up some migration side-effects (#156919)
Summary: explicit torch::nativert namespace usage + // manual declarations

Test Plan:
ci

Rollback Plan:

Differential Revision: D77328855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156919
Approved by: https://github.com/zhxchen17
2025-06-26 20:28:32 +00:00
b6e625e34f [SymmMem] Remove redundant dist.barrier in Triton NVSHMEM tests & add device‐side signal_op support (#156684)
## Summary

This PR removes unnecessary `dist.barrier` calls in our Triton NVSHMEM test suite and adds signal_op support, a lightweight device-side signaling mechanism. A test for this was added to our `wait_until` kernel test, along with the corresponding `core.extern` wrapper.

**Why did we drop the `dist.barrier()` calls?**
We dropped the host‐side dist.barrier() in all Triton NVSHMEM tests (except the raw put/get cases) because every other test already uses NVSHMEM collectives or device‐side sync primitives (fence/quiet/signal/wait), making the extra barrier redundant. This keeps synchronization entirely on the GPU and leverages NVSHMEM’s native ordering guarantees for clearer, more efficient tests.

**`test_triton_wait_until` update**
- **Rank 1**: after `put_kernel` writes the data, launches `signal_op_kernel` to atomically set Rank 0's flag via `nvshmemx_signal_op`
- **Rank 0**: drops its old `dist.barrier()` and simply calls `wait_until_kernel` to spin-wait on the device flag, then asserts data correctness
- Changes made per [this comment](https://github.com/pytorch/pytorch/pull/156472#discussion_r2159734046)

## Testing

```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156684
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-26 20:16:06 +00:00
43a09189c6 [MPS] Add benchmark for scan with indices (#156860)
Baseline performance on M4 Max 64GB (macOS 15.5):
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cummin-dim0-32x32 (torch.float16)       |    102.5  |    115.0
      cummin-dim0-128x128 (torch.float16)     |    133.6  |    147.8
      cummin-dim0-512x512 (torch.float16)     |    233.1  |    243.1
      cummin-dim0-1024x1024 (torch.float16)   |    364.2  |    385.2
      cummin-dim1-32x32 (torch.float16)       |     94.4  |    109.8
      cummin-dim1-128x128 (torch.float16)     |    109.9  |    122.5
      cummin-dim1-512x512 (torch.float16)     |    227.0  |    233.8
      cummin-dim1-1024x1024 (torch.float16)   |    985.1  |   1010.5
      cummin-1d-100 (torch.float16)           |    100.7  |    114.3
      cummin-1d-10000 (torch.float16)         |    805.0  |    879.1
      cummin-1d-1000000 (torch.float16)       |  70545.6  |  71310.3
      cummin-dim0-32x32 (torch.float32)       |    102.7  |    115.5
      cummin-dim0-128x128 (torch.float32)     |    137.2  |    143.8
      cummin-dim0-512x512 (torch.float32)     |    209.7  |    222.0
      cummin-dim0-1024x1024 (torch.float32)   |    340.1  |    389.9
      cummin-dim1-32x32 (torch.float32)       |     99.2  |    107.8
      cummin-dim1-128x128 (torch.float32)     |    111.9  |    119.3
      cummin-dim1-512x512 (torch.float32)     |    250.7  |    255.1
      cummin-dim1-1024x1024 (torch.float32)   |    987.9  |   1013.2
      cummin-1d-100 (torch.float32)           |    100.6  |    114.6
      cummin-1d-10000 (torch.float32)         |    794.7  |    862.2
      cummin-1d-1000000 (torch.float32)       |  71995.3  |  71963.5
      cummin-dim0-32x32 (torch.bfloat16)      |    105.9  |    113.9
      cummin-dim0-128x128 (torch.bfloat16)    |    135.7  |    147.9
      cummin-dim0-512x512 (torch.bfloat16)    |    231.9  |    240.7
      cummin-dim0-1024x1024 (torch.bfloat16)  |    327.7  |    366.9
      cummin-dim1-32x32 (torch.bfloat16)      |     91.3  |    103.3
      cummin-dim1-128x128 (torch.bfloat16)    |    108.5  |    117.4
      cummin-dim1-512x512 (torch.bfloat16)    |    222.0  |    233.6
      cummin-dim1-1024x1024 (torch.bfloat16)  |    936.9  |    982.5
      cummin-1d-100 (torch.bfloat16)          |    106.6  |    112.4
      cummin-1d-10000 (torch.bfloat16)        |    795.8  |    819.6
      cummin-1d-1000000 (torch.bfloat16)      |  68667.4  |  68557.9

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156860
Approved by: https://github.com/malfet
2025-06-26 18:44:16 +00:00
9fe2d156a9 Revert "[dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)"
This reverts commit 5ad2bee2c8a7defd2580bb138145a49c37146fcc.

Reverted https://github.com/pytorch/pytorch/pull/156527 on behalf of https://github.com/Camyll due to failing test assertions ([comment](https://github.com/pytorch/pytorch/pull/156527#issuecomment-3009231797))
2025-06-26 17:32:34 +00:00
13efb2c858 [BE] Deprecate search_autotune_cache (#155302)
We haven't had the offline cache populated in over a year, so this *should* be safe; if this passes, we can finally go through and rip out the offline cache logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155302
Approved by: https://github.com/masnesral
2025-06-26 17:30:08 +00:00
039a1ce0eb [BE] Remove CXX11_ABI references from cpp_builder.py (#156896)
As all Linux builds are CXX11_ABI compatible at this point

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156896
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-06-26 17:28:01 +00:00
e15ea965a1 remove guard_size_oblivious from unbind. (#148815)
unbind will always specialize on dim, because it determines the number of output tensors.
guard_size_oblivious is not useful there and is probably more confusing for code readers.
Added a comment and a test that verifies the specialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148815
Approved by: https://github.com/pianpwk
2025-06-26 17:16:32 +00:00
61eaaa21a4 Better error message when no .so/cpp files are found (#156863)
Summary:
Sample error message:

```
RuntimeError: Failed to find a generated cpp file or so file for model 'forward' in the zip archive.

Available models in the archive:
model

To load a specific model, please provide its name using the `model_name` parameter when calling AOTIModelPackageLoader()  or torch._inductor.package.load_package.

The following files were loaded from the archive:
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/cqdxv6zki2oiiytjeqrg774uxlxgqdemhdxn5dycn4nnc3rmcd7w.cubin
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.so
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_format
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/byteorder
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/serialization_id

```

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_loading_wrong_model"
```

Rollback Plan:

Differential Revision: D77320485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156863
Approved by: https://github.com/tugsbayasgalan
2025-06-26 17:13:29 +00:00
21990fbad9 Revert "[cond] support gen_schema for cond (#154193)"
This reverts commit 6de41ce0f899604c3f8b33e1f8d37eb89b3a963e.

Reverted https://github.com/pytorch/pytorch/pull/154193 on behalf of https://github.com/Camyll due to issue landing internally, discussed with Yidi offline ([comment](https://github.com/pytorch/pytorch/pull/154193#issuecomment-3009160081))
2025-06-26 17:10:00 +00:00
c808af514d Support deterministic upsample trilinear backward (#154239)
Fixes https://github.com/pytorch/pytorch/issues/154183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154239
Approved by: https://github.com/eellison, https://github.com/albanD
2025-06-26 15:02:27 +00:00
2f94f69b7c [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation for the same input in forward AND in backward.

AOTD emitted copy_ after joint_function tracing, so this single fx node corresponded to the side effects of both mutations (in forward and in backward).
The partitioner could then place it either in forward or in backward.

The fix:

1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so that we can check the state of the inputs after forward.

We do not want to apply the mutation after joint if we already applied it in forward. For that we need a "mutation_counter", and we memorize the mutation version that we applied for the forward mutation.

2/ Expose mutation_counter to Python.

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ We memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Post_forward mutations are emitted after the joint graph is fully traced.

Post_forward mutations get a "must_be_in_forward" tag (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
To prevent this, set MUST_SAVE for the source of the mutation in forward.

proxy_tensor changes:

By default proxy tensor updates tensor_tracker, in which case applied mutations are chained.
But we want this copy_ to be independent and applied just to primals.
For this we introduce a context manager that disables the tensor_tracker update while adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-26 14:05:54 +00:00
197c1869f5 [Inductor][CLN] Remove unused default configs in flex_attention.py (#156700)
They probably became unusable after 03023f178c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156700
Approved by: https://github.com/jataylo, https://github.com/drisspg
2025-06-26 13:24:09 +00:00
2d39a48d52 Fix silent incorrectness arising from incorrect alias information (#152011)
Fixes #136662

There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
   These new nodes cause the alias info on the graph to be wrong. To fix
   this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
   nodes that it thinks have "equivalent" FakeTensor metadata.
   It should not be allowed to do this if any users of the node can
   alias the node. The example
   is if we have `x = foo(...); y = x.view(...)`. If the user replaces
   `foo` with a new `bar` node and sets bar.meta["val"] correctly, then
   FakeTensorUpdater still needs to update y's meta["val"] to be a view
   of the new bar node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
2025-06-26 13:05:08 +00:00
53e0b9c393 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" names for fp32 internal computation data types. Instead, we will directly use the algorithm name to represent them.

### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend with new algorithms like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can add more documentation to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 4 fp32 compute precision settings:
 - **"ieee"**: Not allowed to use any other internal computation data types.
 - **"tf32"**: Allowed to use tf32 as the internal computation data type.
 - **"bf16"**: Allowed to use bf16 as the internal computation data type.
 - **"none"**: Precision is not set. Can be overridden by its parent node.

### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
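
The override rule can be sketched with a small toy tree. This is not the real `torch.backends` implementation: the `PrecisionNode` class and its `effective()` method are hypothetical names, and the sketch assumes that only a node still set to "none" defers to its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrecisionNode:
    """Hypothetical (backend, op) entry in the layered fp32_precision tree."""
    name: str
    setting: str = "none"  # one of "ieee", "tf32", "bf16", "none"
    parent: Optional["PrecisionNode"] = None

    def effective(self) -> str:
        # Assumption: an explicit setting on the node wins; a node that is
        # still "none" walks up until some ancestor has a setting.
        node = self
        while node is not None:
            if node.setting != "none":
                return node.setting
            node = node.parent
        return "none"


# Build a small part of the default tree described above.
generic = PrecisionNode("generic")
mkldnn = PrecisionNode("mkldnn", parent=generic)
mkldnn_conv = PrecisionNode("mkldnn.conv", parent=mkldnn)
cuda = PrecisionNode("cuda", parent=generic)
cuda_conv = PrecisionNode("cuda.conv", setting="tf32", parent=cuda)

# Setting the generic node affects every descendant that is still "none".
generic.setting = "bf16"
print(mkldnn_conv.effective())
print(cuda_conv.effective())
```

In this toy model `mkldnn.conv` picks up "bf16" from the generic node, while `cuda.conv` keeps its explicit "tf32".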

### Backward Compatibility
Since the new API allows users to have more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the state `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they will work as previously expected.
 - If the user uses the **new** API to change the status to one that is **un-representable** by the old API, and then tries to access the status through the **old** API, we will raise a RuntimeError and point the user to the documentation.
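
A minimal sketch of the second rule, using a toy class rather than the real `torch.backends.cudnn` module. The class name `ToyCudnnBackend`, the mapping of `allow_tf32=False` to "ieee", and the error text are all assumptions made for illustration.

```python
class ToyCudnnBackend:
    """Hypothetical stand-in for a backend with per-op fp32 precision."""

    def __init__(self):
        self.conv_fp32_precision = "tf32"
        self.rnn_fp32_precision = "tf32"

    @property
    def allow_tf32(self) -> bool:
        # The legacy flag is one bool, so it can only describe states in
        # which every op agrees.
        settings = {self.conv_fp32_precision, self.rnn_fp32_precision}
        if settings == {"tf32"}:
            return True
        if settings == {"ieee"}:
            return False
        raise RuntimeError(
            "per-op fp32_precision settings cannot be represented by "
            "allow_tf32; read the per-op settings instead"
        )

    @allow_tf32.setter
    def allow_tf32(self, value: bool) -> None:
        # Assumption: the legacy setter fans out to every op.
        new_setting = "tf32" if value else "ieee"
        self.conv_fp32_precision = new_setting
        self.rnn_fp32_precision = new_setting


backend = ToyCudnnBackend()
backend.allow_tf32 = False           # old API only: behaves as before
print(backend.allow_tf32)

backend.rnn_fp32_precision = "ieee"  # new API creates a mixed state
backend.conv_fp32_precision = "tf32"
try:
    backend.allow_tf32
except RuntimeError as err:
    print("legacy getter refused:", err)
```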

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-06-26 10:32:20 +00:00
de45c5f673 [aarch64] Add back NCCL lib to cuda arm wheel (#156888)
We discovered that the latest 12.9 arm nightly wheel is missing the NCCL lib when imported. With the use of USE_SYSTEM_NCCL=1, we need to copy the libnccl.so lib into our big wheel environment, so that it can be dynamically linked at runtime.

https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which would use the system NCCL by default, and it would no longer use the one built from libtorch_cuda.so. With this PR, we add back the libnccl.so to be used at runtime. In this way, we also provide the flexibility to use different versions of NCCL from what came with the original pytorch build.

related - https://github.com/pytorch/pytorch/issues/144768

```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
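
A rough way to check ahead of `import torch` whether a system NCCL is visible on the default library search path is sketched below. It uses only the standard-library `ctypes.util.find_library`; the helper name `find_system_nccl` is made up for this example, and the check does not account for libraries bundled inside the wheel itself.

```python
import ctypes.util


def find_system_nccl():
    """Return the NCCL library name found on the default search path, or None."""
    return ctypes.util.find_library("nccl")


name = find_system_nccl()
if name is None:
    print("no system libnccl found on the default search path")
else:
    print(f"found {name}")
```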

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
2025-06-26 10:24:18 +00:00
18b01afa9e load inline user overridable gencode (#156850)
Fixes https://github.com/pytorch/pytorch/issues/156815

As far as testing goes
* I tried to use cuobjdump but that was kinda goofy (bccd9393a5); the problem was that the name of the cubin will always have a single gencode
* Another idea was to read stderr and check that the right number of gencodes is there (0beadc01b3). This helped a lot to convince me locally that this test works; the test passed on my dev gpu but was failing in CI, and I suspect it's because of a bad interaction with subprocesses
* The last approach was to have a simpler unit test that checks which flags get added by default. This is not as comprehensive as the previous ideas, but it works and is fast, so I will opt for it since I'm convinced testing is working per my own experiments and customers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850
Approved by: https://github.com/malfet
2025-06-26 10:15:08 +00:00
bbf1a6feac Add dist_info to non-building setup.py commands (#156709)
This adds the `dist_info` command to the list of non-building commands of `setup.py`, which avoids the current situation where simple metadata generation with any packaging tool already triggers a build.
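
The idea can be sketched as a simple check of the command line against a set of commands that never need a build. The set contents and the names `NON_BUILDING_COMMANDS` and `needs_build` are hypothetical and are not taken from the real `setup.py`.

```python
import sys

# Hypothetical list for illustration; the real setup.py keeps its own.
NON_BUILDING_COMMANDS = {"clean", "egg_info", "dist_info", "sdist", "--help", "--version"}


def needs_build(argv):
    """Return False when any argument after the script name is a non-building command."""
    return not any(arg in NON_BUILDING_COMMANDS for arg in argv[1:])


if __name__ == "__main__":
    print("build required:", needs_build(sys.argv))
```

With this shape, `needs_build(["setup.py", "dist_info"])` is false, so metadata generation can skip the expensive build steps.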

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156709
Approved by: https://github.com/Skylion007
2025-06-26 08:38:39 +00:00
455dfd2589 Fix macOS build with USE_MPS=OFF (#156847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156847
Approved by: https://github.com/angelayi
2025-06-26 07:15:41 +00:00
50b2069b61 Move out super large one off foreach_copy test (#156876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156876
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2025-06-26 06:02:38 +00:00
dfc31b3345 [BE] comments + try to get rid of secondary make_autotune_fn (#156358)
Not sure this will work, but let's try it on the unit tests. The only thing I am worried about is the counters drifting off from their true values, so let the unit tests check that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156358
Approved by: https://github.com/masnesral
2025-06-26 05:54:01 +00:00
0d01bafc34 remove gso from set_storage_meta__symint (#156525)
We already check that inputs are hinted? I don't see the value of it here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156525
Approved by: https://github.com/pianpwk
2025-06-26 05:42:05 +00:00
127695eb5c ci: Add ciflow trigger for build-triton-wheel (#156893)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156893
Approved by: https://github.com/malfet
2025-06-26 04:38:38 +00:00
0a16818d5b [OpenReg] Remove the unit.skip for test_serialization (#156804)
This bug was fixed by this [PR](https://github.com/pytorch/pytorch/pull/147095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156804
Approved by: https://github.com/albanD
ghstack dependencies: #156588, #156589
2025-06-26 03:59:50 +00:00
98e594b565 [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156589)
----

- serialization
- dlpack

**Next Steps**:

- The rest of `test/test_cpp_extensions_open_device_registration.py` is about the fallback mechanism. In order to keep it consistent with other accelerator usage (C++ registration), the implementation of OpenReg needs to be refactored:

    * Simulate multiple device memory in a single process (a brief RFC will be submitted this week)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156589
Approved by: https://github.com/albanD
ghstack dependencies: #156588
2025-06-26 03:59:50 +00:00
a730c65fe3 [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156588)
----

- fake tensor
- named tensor
- custom autograd function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156588
Approved by: https://github.com/albanD
2025-06-26 03:59:50 +00:00
4585c33e74 [symm_mem] Fix nccl test for symm mem (#156752)
Try not to call set_device. Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752
Approved by: https://github.com/kwen2501
2025-06-26 02:59:38 +00:00
7521cd9111 [BE] Typo fix (#156836)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156836
Approved by: https://github.com/albanD, https://github.com/jingsh, https://github.com/Skylion007
ghstack dependencies: #156830, #156831
2025-06-26 02:48:55 +00:00
68e023cbbb [BE] Add missing type for storage dict (#156831)
For some reason, this one always bleats when I run mypy on OSX, so shut it up.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156831
Approved by: https://github.com/mikaylagawarecki, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #156830
2025-06-26 02:48:55 +00:00
df9e5a276b [BE] Add type and docs for _process_export_inputs (#156830)
Done using claude code and manual review.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156830
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-06-26 02:48:55 +00:00
81bf278537 [cutlass] rename cutlass python lib to python-cutlass (#156655)
Differential Revision: [D77173366](https://our.internmc.facebook.com/intern/diff/D77173366/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156655
Approved by: https://github.com/Skylion007
2025-06-26 02:47:14 +00:00
8da774d81f [ez] Add docblock for SchedulerNode.codegen (#156718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156718
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156466, #156445, #156625, #156717
2025-06-26 02:43:50 +00:00
78ee2ee90e Fix environment and push env var for docker image builds for binary builds (#156910)
Changes WITH_PUSH and the environment check so that credentials to push to docker io are given if it's on the main branch, a tag starting with v, or the release branch

Credentials for pushing to docker io are in the environment, so without the environment, you can't push to docker io.  You also don't do the push unless WITH_PUSH is true
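
The two conditions can be sketched as a small predicate. The function name and the GitHub-style ref strings (`refs/heads/main`, `refs/tags/v...`, `refs/heads/release/...`) are assumptions for illustration, not the workflow's actual expressions.

```python
def may_push_to_docker_io(ref: str, with_push: bool) -> bool:
    """Toy model: a push needs both the credentials environment and WITH_PUSH."""
    on_main = ref == "refs/heads/main"
    on_version_tag = ref.startswith("refs/tags/v")
    on_release_branch = ref.startswith("refs/heads/release/")
    # Credentials live in the environment, which is only granted on these refs.
    has_credentials = on_main or on_version_tag or on_release_branch
    return with_push and has_credentials


print(may_push_to_docker_io("refs/heads/release/2.8", with_push=True))
print(may_push_to_docker_io("refs/tags/v2.8.0", with_push=False))
```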

binary builds on release branch were failing because they pull from docker io, but the docker build wasn't pushing to docker io because it was either on the release branch (didn't have credentials https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or it was on the tag (doesn't have WITH_PUSH)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
2025-06-26 02:06:57 +00:00
5db9a2b54a [BE] Install Helion without dependencies (#156706)
After: https://github.com/pytorch/pytorch/pull/155513
Please see comment: https://github.com/pytorch/pytorch/pull/155513#issuecomment-2998085740

Here are the logs: https://github.com/pytorch/pytorch/actions/runs/15838529400/job/44646874281?pr=156664#step:6:16372

Looks like current workflow is :
Build triton - triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl
Install Helion - Overwrite triton with production 3.3.1 and install production torch
Reinstall triton as final docker build step - triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl

This makes it somewhat messy since we install both torch and triton from prod. This is something we want to avoid when building the underlying docker images for CI

Log:
```
#55 311.4 + pip_install helion
#55 311.4 + as_jenkins conda run -n py_3.10 pip install --progress-bar off helion
#55 311.4 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= conda run -n py_3.10 pip install --progress-bar off helion
#55 393.6 Collecting helion
#55 393.6   Downloading helion-0.0.7-py3-none-any.whl.metadata (14 kB)
#55 393.6 Collecting filecheck (from helion)
#55 393.6   Downloading filecheck-1.0.2-py3-none-any.whl.metadata (5.8 kB)
#55 393.6 Collecting torch>=2.7.0 (from helion)
#55 393.6   Downloading torch-2.7.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (29 kB)
#55 393.6 Requirement already satisfied: typing-extensions>=4.0.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from helion) (4.14.0)
#55 393.6 Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (3.18.0)
#55 393.6 Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (1.13.3)
#55 393.6 Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (2.8.8)
#55 393.6 Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (3.1.6)
#55 393.6 Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (2025.5.1)
#55 393.6 Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cublas-cu12==12.6.4.1 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cufft-cu12==11.3.0.4 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-curand-cu12==10.3.7.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cusparselt-cu12==0.6.3 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
#55 393.6 Collecting nvidia-nccl-cu12==2.26.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
#55 393.6 Collecting nvidia-nvtx-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cufile-cu12==1.11.1.6 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting triton==3.3.1 (from torch>=2.7.0->helion)
#55 393.6   Downloading triton-3.3.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.5 kB)
#55 393.6 Requirement already satisfied: setuptools>=40.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from triton==3.3.1->torch>=2.7.0->helion) (80.9.0)
#55 393.6 Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch>=2.7.0->helion) (1.3.0)
#55 393.6 Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch>=2.7.0->helion) (3.0.2)
#55 393.6 Downloading helion-0.0.7-py3-none-any.whl (149 kB)
#55 393.6 Downloading torch-2.7.1-cp310-cp310-manylinux_2_28_x86_64.whl (821.2 MB)
#55 393.6 Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
#55 393.6 Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.9 MB)
#55 393.6 Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
#55 393.6 Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB)
#55 393.6 Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
#55 393.6 Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB)
#55 393.6 Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
#55 393.6 Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (56.3 MB)
#55 393.6 Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (158.2 MB)
#55 393.6 Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (216.6 MB)
#55 393.6 Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
#55 393.6 Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
#55 393.6 Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
#55 393.6 Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
#55 393.6 Downloading triton-3.3.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.6 MB)
#55 393.6 Downloading filecheck-1.0.2-py3-none-any.whl (23 kB)
#55 393.6 Installing collected packages: nvidia-cusparselt-cu12, triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, filecheck, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, torch, helion
#55 393.6   Attempting uninstall: triton
#55 393.6     Found existing installation: triton 3.4.0+git5389ed79
#55 393.6     Uninstalling triton-3.4.0+git5389ed79:
#55 393.6       Successfully uninstalled triton-3.4.0+git5389ed79
#55 393.6 Successfully installed filecheck-1.0.2 helion-0.0.7 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 torch-2.7.1 triton-3.3.1
#55 393.6
#55 DONE 428.8s

#56 [final  1/30] COPY --from=triton-builder /opt/triton /opt/triton
#56 DONE 0.0s

#57 [final  2/30] RUN if [ -n "yes" ] || [ -n "" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
#57 0.823 Processing /opt/triton/triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl
#57 2.263 Requirement already satisfied: setuptools>=40.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from triton==3.4.0+git5389ed79) (80.9.0)
#57 2.589 Installing collected packages: triton
#57 6.405 Successfully installed triton-3.4.0+git5389ed79
#57 6.405 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
#57 DONE 86.5s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156706
Approved by: https://github.com/oulgen, https://github.com/malfet
2025-06-26 02:05:47 +00:00
b50075343a [distributed] Enable H100 test for all distributed related changes (#156721)
We want to run H100 CI for distributed related changes. We already have a labeling of oncall:distributed when touching distributed related code: 4491326fb0/.github/labeler.yml (L94). So we want to leverage that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156721
Approved by: https://github.com/huydhn
2025-06-26 01:51:41 +00:00
e581f015ee Bump STATIC_CUDA_LAUNCHER_VERSION to 2 (#156726)
Differential Revision: [D77241813](https://our.internmc.facebook.com/intern/diff/D77241813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156726
Approved by: https://github.com/oulgen
2025-06-26 01:50:51 +00:00
b5bfbba184 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On X86, it is converted to the min value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
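
The ordering of the two steps can be illustrated in plain Python. This is a scalar sketch of the idea, not the CPU kernel: the function name and the rounding via Python's built-in `round` are assumptions, and NaN inputs are not handled here.

```python
import math


def fake_quantize_scalar(x, scale, zero_point, quant_min, quant_max):
    """Scalar sketch: clamp in floating point first, then convert to an integer."""
    inv_scale = 1.0 / scale
    shifted = x * inv_scale + zero_point
    # Clamp before the integer conversion so that +/-inf never reaches it.
    clamped = min(max(shifted, quant_min), quant_max)
    q = int(round(clamped))
    # Dequantize back to a float value.
    return (q - zero_point) * scale


print(fake_quantize_scalar(math.inf, 0.1, 0, 0, 255))
print(fake_quantize_scalar(-math.inf, 0.1, 0, 0, 255))
```

With the clamp in place, positive infinity saturates at `quant_max` and negative infinity at `quant_min` instead of depending on how the platform converts an out-of-range float.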

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-26 01:24:36 +00:00
214e2959dc Cleanup leftover miniconda brew installation (#156898)
That results in torch.compile being unable to produce working artifacts

Should fix https://github.com/pytorch/pytorch/issues/156833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156898
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-06-26 01:02:04 +00:00
4c0091fda6 python definitely_contiguous-> is_contiguous_or_false (#156515)
We probably can avoid having those in python as well and  just depend on c++ impl after we land https://github.com/pytorch/pytorch/pull/155590 but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-26 00:47:14 +00:00
85df746892 refresh expected numbers (#156877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156877
Approved by: https://github.com/huydhn
2025-06-26 00:03:09 +00:00
2c6324a1eb Delete sections referencing torchscript in serialization docs (#156648)
Address [T228333890](https://www.internalfb.com/intern/tasks/?t=228333890)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156648
Approved by: https://github.com/svekars
2025-06-25 23:41:24 +00:00
a25d1443fa Mark TorchServe as all emeritus (#156865)
As per title and to follow the broader tutorial cleanup work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156865
Approved by: https://github.com/svekars, https://github.com/malfet, https://github.com/seemethere
2025-06-25 23:34:57 +00:00
451b525bf0 [ez] add docblock and comments to simd.split_and_set_ranges (#156717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156717
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156445
2025-06-25 23:07:28 +00:00
204db27a0c Consolidate stack trace in Tracer (#156257)
Summary:
- Consolidate the stack trace recording code in TracerBase and PythonKeyTracer
- Change `make_fx`'s arg name to be consistent with TracerBase member name `record_stack_traces`

We move the stack trace logic from `create_proxy` to `create_node` so that all classes inheriting from TracerBase can re-use the same stack trace logic.

Test Plan:
```
buck run caffe2/test:test_export -- -r  test_stack_trace
```

Rollback Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156257
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-25 23:07:10 +00:00
653c52fe52 [MPS] Fix batch norm incorrect gradient (#156867)
Fixes #156555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156867
Approved by: https://github.com/malfet
2025-06-25 23:05:49 +00:00
acaf6ba3c6 Organize BUCK for torch/standalone (#156503)
Summary: Undo highlevel BUCKification in favor of something more organized by moving it to the dir itself

Test Plan:
CI

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D76920013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
2025-06-25 22:56:15 +00:00
d98fa4a103 implement SR's storage group planning algorithm (#156715)
Summary: att

Test Plan:
Tested on a localnet. The plan is ~15% larger than greedy-by-size (sizes below), but the algorithm is more performant.

local:
gbs: 110656b
dsg: 131584b

local_ro:
gbs: 38208
dsg: 44544
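
The relative overhead can be computed directly from the figures above. This sketch assumes the numbers are planned sizes in bytes, with "gbs" standing for greedy-by-size and "dsg" for the new storage-group planner; the helper name `overhead` is made up for the example.

```python
def overhead(gbs_bytes, dsg_bytes):
    """Fraction by which the dsg plan is larger than the gbs plan."""
    return dsg_bytes / gbs_bytes - 1.0


for label, gbs, dsg in [("local", 110656, 131584), ("local_ro", 38208, 44544)]:
    print(f"{label}: {overhead(gbs, dsg):.1%} more than greedy-by-size")
```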

Differential Revision: D75653840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156715
Approved by: https://github.com/zhxchen17
2025-06-25 22:43:40 +00:00
1e7e21ec5d unify dynamic shapes API namings 3 (guard_int, guard_int_seq) (#155973)
evaluate_static_shape -> guard_int
evaluate_static_shapes -> guard_int_seq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155973
Approved by: https://github.com/bobrenjc93
2025-06-25 22:40:28 +00:00
61f6aa36b9 [resubmit][export] add _union_dataclass to support comparing dataclasses that inherits from union. (#156765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156765
Approved by: https://github.com/zhxchen17
2025-06-25 22:32:12 +00:00
53057fc16a [dynamo] update base variable call_method hint with note on comprehensions (#156769)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1696822194289318/

List/dict comprehensions in Python <= 3.11 result in potentially weird graph breaking behavior because comprehensions result in implicit function calls, which Dynamo may end up tracing as top-level frames, resulting in iterators being passed as arguments to the compiled region.
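
The implicit function can be observed without Dynamo by looking for a nested code object in the compiled function. The sketch below uses only the standard library; the function name `add_one` is illustrative.

```python
import sys
import types


def add_one(xs):
    return [x + 1 for x in xs]


# On Python <= 3.11 the comprehension body is compiled into its own code
# object (named "<listcomp>") that is called implicitly; Python 3.12 and
# later inline the comprehension into the enclosing function.
nested_code = [
    const.co_name
    for const in add_one.__code__.co_consts
    if isinstance(const, types.CodeType)
]
print(sys.version_info[:2], nested_code)
```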

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156769
Approved by: https://github.com/StrongerXi
2025-06-25 21:55:55 +00:00
95a7d1912a [sigmoid] add layout planner to executor (#156852)
Summary: if memory planning is enabled in the runtime config, we will create a copy in the executor here.

Test Plan: ci

Differential Revision: D73635622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156852
Approved by: https://github.com/zhxchen17
2025-06-25 21:41:09 +00:00
6de41ce0f8 [cond] support gen_schema for cond (#154193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154193
Approved by: https://github.com/zou3519
ghstack dependencies: #155644
2025-06-25 21:19:58 +00:00
3257c8f74c [cond] preserve merged phs meta for subgraph (#155644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155644
Approved by: https://github.com/zou3519
2025-06-25 21:19:58 +00:00
e7a66166ce [precompile] When using BundledAOTAutogradCache, disable FXGraphCache (#156611)
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.

If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately calls make_launchers() on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.

One way to resolve the bug is just to add whether precompile is enabled to the FxGraph cache key. The better fix, however, is higher level/philosophical:

When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAutogradCache is turned on.

BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.

Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
2025-06-25 21:01:42 +00:00
fe1f1a38df add test_batchnorn_2D and 3D tests (#156498)
New set of batchnorm tests to verify NCHW 2D/3D BatchNorm
This test also makes it possible to add and configure different BatchNorm tests (dtypes, NCHW/NHWC, Mixed) in the future
based on:
- Train [test_batchnorm_cudnn_nhwc](1051b93192/test/test_nn.py (L4985))
- Inference [test_batchnorm_nhwc_cuda](1051b93192/test/test_nn.py (L5130))

```
test_batchnorm_3D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_float32) ... ok (0.113s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.057s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.063s)
test_batchnorm_3D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_float32) ... ok (0.059s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.006s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16) ... ok (0.006s)
test_batchnorm_3D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16) ... skip: 3D float16 NCHW train failed on ROCm<7.0 (0.001s)

test_batchnorm_2D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_float32) ... ok (0.016s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_float32) ... ok (0.054s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.002s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16) ... ok (0.001s)
test_batchnorm_2D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_float16) ... ok (0.002s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156498
Approved by: https://github.com/jeffdaily
2025-06-25 20:38:02 +00:00
48e7b62d3a [dynamo] Add immutable pytree to trace_rules (#156772)
Fixes https://github.com/pytorch/pytorch/issues/155426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156772
Approved by: https://github.com/williamwen42
2025-06-25 20:08:47 +00:00
e99a2a2dba [PG/nccl] Simplify uniqueHash management (#156790)
Summary:

ncclUniqueID is only relevant when a comm is created using ncclCommCreate or ncclCommCreateConfig.  If a comm is created with ncclCommSplit, this field is unset, causing its usage to create unexpected behavior.

This patch creates a unique hash key for each comm, irrespective of how the comm is created.

Test Plan:

CI

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156790
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2025-06-25 20:06:08 +00:00
070aa59e49 Refactor DynamoStore into disk and in memory implementations (#155818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155818
Approved by: https://github.com/zhxchen17
2025-06-25 18:24:28 +00:00
6c24c6633a [torch][test] skip test_transformer_backend_inductor_fullgraph_True (#156763)
Summary: "Traceable FSDP2" is not being maintained anymore.

Test Plan:
```
buck test @//mode/opt caffe2/test/distributed/_composable:fully_shard_compile -- test_transformer_backend_inductor_fullgraph_True
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16044073764394232/

Rollback Plan:

Differential Revision: D77264408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156763
Approved by: https://github.com/xunnanxu, https://github.com/yf225
2025-06-25 18:15:23 +00:00
09ffba3cf7 [docs] Decorator to create a deprecation warning (#155127)
This PR adds the `@deprecate` decorator for internal functions which we are prepping for deprecation.  Add it on top of an internal function to emit a deprecation warning + allow bc with the non internal version of the function.

Tested with `python test/test_utils.py TestDeprecate.test_deprecated `

Furthermore, testing with a modified version of the test in the PR gives something like this, which is what we want

```
/home/sahanp/repos/pytorch/test/test_utils.py:1239: UserWarning: deprecated_api is DEPRECATED, please consider using an alternative API(s).
  deprecated_api(1, 2)
```
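
As a rough illustration of the idea, a warning-emitting decorator can be sketched as below. This is not the implementation added by the PR: the name `deprecate`, the way the public name is derived by stripping the leading underscore, and the function `_deprecated_api` are all assumptions made for the example. Only the warning text and the `UserWarning` category are taken from the output shown above.

```python
import functools
import warnings


def deprecate(func):
    """Hypothetical stand-in for a deprecation decorator (not the PR's code)."""
    # Assumption: the public name is the internal name without the leading underscore.
    public_name = func.__name__.lstrip("_")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{public_name} is DEPRECATED, please consider using an alternative API(s).",
            UserWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return func(*args, **kwargs)

    return wrapper


@deprecate
def _deprecated_api(a, b):
    # Illustrative internal function being prepared for deprecation.
    return a + b


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = _deprecated_api(1, 2)
```

The wrapped function still returns its normal result, so existing callers keep working while seeing the warning.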
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155127
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-06-25 18:09:04 +00:00
4bc3e4b497 [cutlass backend] Move cutlass key to cutlass_library (#156654)
Differential Revision: [D77188311](https://our.internmc.facebook.com/intern/diff/D77188311/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156654
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
ghstack dependencies: #156651
2025-06-25 17:55:57 +00:00
c1a629f76d Update device for perf dashboard on AMD runners (#156809)
Uses arch_device naming convention for storing perf dashboard logs on AMD runners based on the following PR
https://github.com/pytorch/test-infra/pull/6793

Updated from zen_cpu_x86 to cpu_x86_zen

Fixes https://github.com/pytorch/test-infra/issues/6823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156809
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-25 17:34:49 +00:00
e071837594 [cutlass backend] compile and link for .so files (#155876)
Differential Revision: [D76482736](https://our.internmc.facebook.com/intern/diff/D76482736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155876
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-06-25 17:01:56 +00:00
1051b93192 [export] Implement _compile_and_package for ExportPackage. (#156638)
add a method to implement weight sharing.

Differential Revision: [D76132005](https://our.internmc.facebook.com/intern/diff/D76132005/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156638
Approved by: https://github.com/tugsbayasgalan
2025-06-25 16:00:40 +00:00
8eb3c5b7a1 [release] delete tag-docker-images.sh as not required anymore (#156737)
Thanks to @clee2000. This is no longer required since the docker images use the hash as the tag: https://github.com/pytorch/pytorch/actions/runs/15844298044/job/44662813176#step:15:92

```
Login Succeeded
++ docker manifest inspect docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
++ jq '[.layers[].size, .config.size] | add / 1024 / 1024'
+ IMAGE_SIZE=9322.26076889038
+ echo 'Compressed size of image in MB: 9322.26076889038'
+ set -e
+ docker inspect --type=image docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
Compressed size of image in MB: 9322.26076889038
+ retry docker pull docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
+ docker pull docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09: Pulling from pytorch/manylinux2_28-builder
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156737
Approved by: https://github.com/clee2000
2025-06-25 15:17:06 +00:00
029e2b05c2 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit 19ffb5e6f7606436249742b0f3efc0bab244dc55.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/albanD due to The corresponding test still breaks on rocm ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-3004698438))
2025-06-25 13:05:40 +00:00
c2185dc4a5 [Quant][CPU] Enable fp8 qlinear (#155678)
**Summary**
Enable fp8 qlinear on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qlinear op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qlinear op is not changed either.

So, the FP8 qlinear shares the same op as INT8 qlinear and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.

The differences of qlinear from `_scaled_mm` are that
- Qlinear supports post op fusion while `_scaled_mm` does not
- Weights are prepacked for qlinear

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qlinear and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155678
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-25 10:01:08 +00:00
19ffb5e6f7 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the min value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
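
The order of operations can be sketched in pure Python as follows. This is a scalar illustration under stated assumptions, not the kernel from the PR: the function name `fake_quantize`, the use of Python's `round`, and the example scale and range are chosen for the example, and NaN inputs are not handled.

```python
def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    """Scalar sketch: clamp in floating point first, convert to int second."""
    scaled = x * (1.0 / scale) + zero_point
    # Clamping here keeps +/-inf inside [quant_min, quant_max] before the
    # integer conversion, which is the step that is unsafe for infinities.
    clamped = min(max(scaled, quant_min), quant_max)
    q = int(round(clamped))  # rounding mode is an assumption of this sketch
    return (q - zero_point) * scale


pos = fake_quantize(float("inf"), 0.5, 0, 0, 255)
neg = fake_quantize(float("-inf"), 0.5, 0, 0, 255)
mid = fake_quantize(3.2, 0.5, 0, 0, 255)
print(pos, neg, mid)
```

With the clamp applied first, positive infinity saturates at `quant_max` and negative infinity at `quant_min`, instead of depending on platform-specific conversion behavior.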

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-25 09:28:54 +00:00
0ab075a69e Fix docker image build for s390x (#156687)
Add upstream patch for onnxruntime
updating eigen dependency URL and hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156687
Approved by: https://github.com/seemethere
2025-06-25 09:09:22 +00:00
4918502d2e bug fix for losing shape on wrapper tensor for DTensor (#156774)
Summary: Wrapper tensor for DTensor is losing shape in offload_tensor. This PR fixes this bug.

Test Plan:
updated the test. Test fails with old code and passes with the fix.

Rollback Plan:

Differential Revision: D77269733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156774
Approved by: https://github.com/mikaylagawarecki
2025-06-25 08:14:16 +00:00
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
62272d5b24 [BE][Easy][setup] wrap over long error messages and redirect them to stderr in setup.py (#156043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156043
Approved by: https://github.com/jingsh
2025-06-25 06:57:59 +00:00
6c008e2fb5 [nativert] Move ParallelGraphExecutor to PyTorch core (#156751)
Summary: `ParallelGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a parallel manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D77088996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156751
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-06-25 06:54:45 +00:00
44a5f93462 [dynamo] allow symints in list.__setitem__ (#156197)
Fixes https://github.com/pytorch/pytorch/issues/155174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156197
Approved by: https://github.com/StrongerXi
2025-06-25 06:20:35 +00:00
162ca185ff [BE][PYFMT] migrate PYFMT for torch/_[a-h]*/ to ruff format (#144551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144551
Approved by: https://github.com/ezyang
ghstack dependencies: #148186
2025-06-25 06:16:06 +00:00
9642c75689 added stubs for jit tree views (#156504)
Fixes #156488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156504
Approved by: https://github.com/ezyang
2025-06-25 06:15:17 +00:00
c60327ba74 avoid to declare an unknown bound array without any element (#156543)
Fixes #153180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156543
Approved by: https://github.com/jansel

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-06-25 06:14:57 +00:00
4237ee3c33 [XPU] Add periodic run for xpu worklfow (#156698)
Enable XPU periodic testing in xpu.yml workflow directly. It works for https://github.com/pytorch/pytorch/issues/114850.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156698
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-06-25 05:57:52 +00:00
194c221e0a Update the UT of test_decompose_mm_cpu (#154100)
**Summary**
Fixes #153616
Based on the latest decomposed heuristic in daca611465/torch/_inductor/fx_passes/decompose_mem_bound_mm.py (L79-L82), for the shape in this test case `[m=1, k=64, n=32]`, the result should be decomposed. The previous CI didn't capture this failure due to the UT skip described in https://github.com/pytorch/pytorch/pull/153245. So this PR should be verified in CI after https://github.com/pytorch/pytorch/pull/153245 landed.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_decompose_mem_bound_mm.py -k test_decompose_mm_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154100
Approved by: https://github.com/jansel
2025-06-25 05:45:58 +00:00
f5f4beaf56 [invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156347
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #156260
2025-06-25 04:29:22 +00:00
558d7f7db0 [invoke_subgraph] make same subgraph share get_attr target (#156260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156260
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-06-25 04:29:22 +00:00
568ca89bac Add a crash handler to async compile subprocesses (#155068)
When the async compile subprocesses crash in C++ they tend to just silently die instead of leaving any kind of trace.  This installs a crash handler so that if they SEGV, ILL, or ABRT they'll attempt to output a backtrace instead.
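
For comparison, a similar effect is available at the Python level through the standard library's `faulthandler` module. The sketch below is an analogue only and is not the C++ handler this commit installs; the temporary log file is an assumption made so the example is self-contained.

```python
import faulthandler
import tempfile

# The file object must stay open for as long as the handler is enabled.
crash_log = tempfile.TemporaryFile(mode="w+")

# Dump Python tracebacks for all threads on fatal signals such as
# SIGSEGV, SIGABRT, and SIGILL.
faulthandler.enable(file=crash_log, all_threads=True)

print("fault handler enabled:", faulthandler.is_enabled())
```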

While in there I also cleaned up the CLANGTIDY warnings coming from Module.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155068
Approved by: https://github.com/masnesral
2025-06-25 03:27:28 +00:00
beb52f5c0a use more efficient implementation for broadcasted indexing in determi… (#156744)
…nistic scatter_add

per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156744
Approved by: https://github.com/suo
2025-06-25 02:59:50 +00:00
9b498d3bb2 Update docs for torch.device (#156686)
# Motivation
Update the doc, to make `torch.device`'s constructor officially support the following methods:
- A device string, which is a string representation of the device type and optionally the device ordinal.
- A device type and a device ordinal.
- A device ordinal, which is treated as the current accelerator type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686
Approved by: https://github.com/albanD
2025-06-25 02:12:36 +00:00
3608737347 [ez] fix typo in comment (#156402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156402
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156397
2025-06-25 02:07:36 +00:00
d06a406656 [dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)
Fixes #152162. Discussed with @bdhirsh and decided this is the easiest
workaround for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156623
Approved by: https://github.com/bdhirsh
2025-06-25 02:03:04 +00:00
e8cf5ff564 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more information

- Remove unused header files
- Move common functionality to separate files to reduce dependencies between picklers and unpicklers
- Move the inline function that defines the static variable to .cc

Differential Revision: [D76266755](https://our.internmc.facebook.com/intern/diff/D76266755)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD

Co-authored-by: Edward Yang <ezyang@meta.com>
2025-06-25 01:59:10 +00:00
cyy
41910d7a94 Move use of c10::string_view to std::string_view (#152509)
Eliminate use of c10::string_view in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152509
Approved by: https://github.com/ezyang
2025-06-25 01:57:49 +00:00
02c7ab2f9b [cpp wrapper] add AOTI shim for collective ops (#154492)
Implementations:
1. Move collective ops to c10d namespace, so that we can call them externally.
2. Add AOTI shims for collective ops.

Testing
1. Add c10d functional UT for cpu.
2. Include the above one in cpp wrapper UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492
Approved by: https://github.com/desertfire
2025-06-25 01:20:05 +00:00
d797038ea9 [dcp_poc] Introduce a new simple rank local checkpointer (#156142)
Summary:
Adds an experimental implementation for a rank local checkpointer with save and load with partial load, blind load and in-place load.

This uses a new API and a simpler format.

Async checkpointing, an IO layer, pluggable storage backends, layout customization, resharding, deduplication, etc. are planned but not yet implemented.

Test Plan: unit tests

Reviewed By: saumishr

Differential Revision: D75426560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156142
Approved by: https://github.com/saumishr
2025-06-25 01:19:40 +00:00
0d8e4e2327 [PG/nccl] improvements to eager init (#156748)
Summary:

Clean up eager init management to detect and throw a warning when multiple p2p operations are issued on the same PG in eager init mode.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156748
Approved by: https://github.com/wconstab, https://github.com/kwen2501, https://github.com/Skylion007
2025-06-25 01:04:37 +00:00
06930706a1 Improve documentation for torch.lobpcg (#156139)
The changes are documentation changes to the function lobpcg. There are three changes to the doc.
1. Match doc arg description to be in the same order as the parameters to the function.
2. Update documentation for arg `n` to indicate that, when arg `x` is specified, any value set for `n` is ignored.
3. Add warning that `m` must be bigger than 3 x the number of requested eigenpairs.

Fixes #152107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156139
Approved by: https://github.com/soulitzer
2025-06-25 00:39:34 +00:00
3dd872e6d5 Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))
2025-06-25 00:11:35 +00:00
6459a5c7a9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))
2025-06-25 00:11:35 +00:00
fd4bb29410 Revert "[logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)"
This reverts commit fb75dea2c1b93c78dccf08d5fd5e20b362ecd405.

Reverted https://github.com/pytorch/pytorch/pull/156517 on behalf of https://github.com/Camyll due to internal reverted ([comment](https://github.com/pytorch/pytorch/pull/156517#issuecomment-3002172049))
2025-06-24 23:45:13 +00:00
313a6a8ef9 [pt2][pr_time_benchmarks] Refresh instructions count after disabled test (#156738)
https://github.com/pytorch/pytorch/issues/153987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156738
Approved by: https://github.com/laithsakka
2025-06-24 23:45:02 +00:00
4bd18e31e5 Revert "Add fx_graph_runnable tests boilerplate (#156552)"
This reverts commit 0a2ec7681d2af973d8daaf7905431a088739dc90.

Reverted https://github.com/pytorch/pytorch/pull/156552 on behalf of https://github.com/Camyll due to breaking internal ([comment](https://github.com/pytorch/pytorch/pull/156552#issuecomment-3002159473))
2025-06-24 23:34:21 +00:00
2ff3280c77 [ez] Disable some failing periodic tests (#156731)
test_torch.py::TestTorchDeviceTypeCUDA::test_storage_use_count_cuda:
Added in https://github.com/pytorch/pytorch/pull/150059
Fails in debug mode [GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44706020831) [HUD commit link](4491326fb0)

inductor/test_inductor_freezing.py::FreezingGpuTests::test_cpp_wrapper_cuda:
[GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44707119967) [HUD commit link](4491326fb0)
started failing after moving to new cuda version https://github.com/pytorch/pytorch/pull/155234

I'll ping people if this gets merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156731
Approved by: https://github.com/huydhn
2025-06-24 23:02:21 +00:00
d8bb5ac260 [ez] fix typo in select_algorithm.py (#156625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156625
Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng
ghstack dependencies: #156445
2025-06-24 23:01:58 +00:00
ce97a5dcfa [Inductor] Restrict block analysis to only match integer dims and strides (#149615)
Restrict block analysis to only match dimension sizes and strides that are integers. E.g. `sympy` can match index expressions like `(ModularIndexing(xindex, 4, 4)) + 4*(ModularIndexing(xindex, 32, 2))` against the candidate below, which is invalid because `dim_mod2_` is matched to the non-integer value `1/16`.
  ```python
match_expr = stride_mod0_*((xindex//(dim_mod1_*dim_mod2_*dim_mod3_*dim_mod4_))) + stride_mod1_*(ModularIndexing(xindex, dim_mod2_*dim_mod3_*dim_mod4_, dim_mod1_)) + stride_mod2_*(ModularIndexing(xindex, dim_mod3_*dim_mod4_, dim_mod2_)) + stride_mod3_*(ModularIndexing(xindex, dim_mod4_, dim_mod3_)) + stride_mod4_*(ModularIndexing(xindex, 1, dim_mod4_))
match={
      dim_mod4_: 32, dim_mod3_: 2, stride_mod3_: 4, dim_mod2_: 1/16,
       dim_mod1_: 4, stride_mod1_: 1, stride_mod4_: 0, stride_mod2_: 0, stride_mod0_: 0
     }
```
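
The restriction amounts to rejecting any match in which a dimension or stride is not an integer. A minimal sketch of such a filter follows; it uses `fractions.Fraction` in place of `sympy` values so that it is self-contained, and the helper name `is_integer_match` is hypothetical, not the function used in Inductor.

```python
from fractions import Fraction


def is_integer_match(match):
    """Return True only if every matched dim and stride is an integer."""
    return all(Fraction(value).denominator == 1 for value in match.values())


# The invalid candidate from the example above, with 1/16 kept exact.
candidate = {
    "dim_mod4_": 32, "dim_mod3_": 2, "stride_mod3_": 4,
    "dim_mod2_": Fraction(1, 16),
    "dim_mod1_": 4, "stride_mod1_": 1,
    "stride_mod4_": 0, "stride_mod2_": 0, "stride_mod0_": 0,
}

# Same candidate with an integer dim, for contrast (illustrative only).
all_integer = dict(candidate, dim_mod2_=1)

print(is_integer_match(candidate), is_integer_match(all_integer))
```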

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149615
Approved by: https://github.com/blaine-rister
2025-06-24 22:43:12 +00:00
c48d0f4643 [Inductor] Fix epilogue fusion decision with 1 Triton caller as choice (#156500)
Differential Revision: D76904773

In the current scheduler logic, if a template buffer is only a Triton template, which can result from only 1 Triton choice in the autotuning, the fusion won't be benchmarked.

This can lead to an edge case in which a Triton GEMM template from the autotune lookup table can have a problematic fusion, leading to shared memory requirements above the hardware limit. `(256, 128, 64, 4, 8, 8)` is such a config, where we have seen fusion with a `.to(torch.float32)` can lead to this issue, `out of resource: shared memory, Required: 264224, Hardware limit: 232448`. We benchmark the fusion for this case to ensure it's safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156500
Approved by: https://github.com/jansel
2025-06-24 22:33:47 +00:00
e96f530af5 Remove unnecessary use of c10::SmallVector from moments_utils (#156714)
It's just making arrays of a particular size. (If it was resizing the vectors, we'd see compile errors.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156714
Approved by: https://github.com/Skylion007
2025-06-24 22:30:10 +00:00
4ee4863232 Fix #156261 _foreach_copy indexing (#156719)
Fixes #156261

Thanks to @ngimel's fast eyes

For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.
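
The commit message does not spell out the root cause. Assuming the problem was an index held in a 32-bit signed integer, the sketch below shows why a tensor of 2**31+1 elements is the natural test size: its last valid index no longer fits in 32 bits. The variable names are illustrative only.

```python
import ctypes

numel = 2**31 + 1          # element count used by the test described above
last_index = numel - 1     # largest valid flat index: 2**31

# Reinterpret the index as 32-bit and 64-bit signed integers.
as_int32 = ctypes.c_int32(last_index).value
as_int64 = ctypes.c_int64(last_index).value

print(as_int32, as_int64)
```

Under that assumption, a 32-bit index would wrap to a negative value, while a 64-bit index represents the position correctly.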

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
2025-06-24 21:58:44 +00:00
310e8361c5 [nativert] Move PrimKernelRegistry to PyTorch core (#156506)
Summary:
Torch Native Runtime RFC: pytorch/rfcs#72
PrimKernelRegistry manages a small subset of kernel registry in NativeRT.
Including ListPack, ListUnpack, Input, Output, VarConcat, VarStack

Test Plan: Internal unittests

Differential Revision: D77034945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156506
Approved by: https://github.com/zhxchen17
2025-06-24 21:42:41 +00:00
fa0ea57f5e [ROCm][CD] upgrade to 6.4.1 patch release (#156636)
During https://github.com/pytorch/pytorch/pull/156112, we missed upgrading the manylinux and libtorch docker images.

Fixes #155292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156636
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-24 21:41:42 +00:00
3efb22e091 Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305, https://github.com/laithsakka
2025-06-24 21:10:17 +00:00
26f7ca3972 Unify dynamic shapes APIs naming 2 (expect_true and check) attempt2 (#156518)
Summary:
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
-  it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages that I will fix in the next PR.

Test Plan:
OSS and cont

Rollback Plan:

Differential Revision: D77054177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156518
Approved by: https://github.com/bobrenjc93
2025-06-24 21:01:38 +00:00
dfef1e4408 Optimize dim description in torch.max (#156153)
Fixes #156071

## Test Result

### Before

![image](https://github.com/user-attachments/assets/8dd0d952-277a-4197-b323-d68ae1438171)

### After

![image](https://github.com/user-attachments/assets/4af5388e-ca9e-4268-a7c4-cf16b09b899f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156153
Approved by: https://github.com/albanD
2025-06-24 20:50:40 +00:00
1dc1eedd43 Revert "[dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)"
This reverts commit c1ad4b8e7a16f54c35a3908b56ed7d9f95eef586.

Reverted https://github.com/pytorch/pytorch/pull/156623 on behalf of https://github.com/albanD due to Breaks Dynamo tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/156623#issuecomment-3001806841))
2025-06-24 20:44:42 +00:00
aa280ea19f Revert "Remove remaining CUDA 12.4 CI code (#155412)"
This reverts commit 9fed2addedb42da86b657165fe14eadc911232cf.

Reverted https://github.com/pytorch/pytorch/pull/155412 on behalf of https://github.com/Camyll due to cuda 12.4 still needed ([comment](https://github.com/pytorch/pytorch/pull/155412#issuecomment-3001711830))
2025-06-24 20:05:39 +00:00
19f851ce10 Revert "Simplify nvtx3 CMake handling, always use nvtx3 (#153784)"
This reverts commit 099d0d6121125062ebc05771c8330cb7cd8d053a.

Reverted https://github.com/pytorch/pytorch/pull/153784 on behalf of https://github.com/Camyll due to breaking internal tests and cuda 12.4 builds still used in CI ([comment](https://github.com/pytorch/pytorch/pull/153784#issuecomment-3001702310))
2025-06-24 20:02:07 +00:00
376c16703c Document each of the private member variables on ExportedProgram (#156704)
Authored with claude code and then reviewed by hand. If you don't like it, tell me.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156704
Approved by: https://github.com/albanD, https://github.com/zhxchen17, https://github.com/jingsh
2025-06-24 19:56:40 +00:00
c1ad4b8e7a [dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)
Fixes #152162. Discussed with @bdhirsh and decided this is the easiest
workaround for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156623
Approved by: https://github.com/bdhirsh
2025-06-24 19:33:11 +00:00
f97f03c7ef [cutlass backend] delete pip cutlass path since nvidia stops supporting nvidia-cutlass (#156651)
Differential Revision: [D77186982](https://our.internmc.facebook.com/intern/diff/D77186982/)

source: https://pypi.org/project/nvidia-cutlass/

If users want to use it, they can install pytorch through wheel, git clone cutlass, and specify cutlass path via TORCHINDUCTOR_CUTLASS_DIR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156651
Approved by: https://github.com/mlazos
2025-06-24 18:32:15 +00:00
a00a697c17 [dynamo] updated version of detecting any differences between PRs unimplemented_v2() callsites and graph_break_registry json file (#156237)
This PR runs an automatic check as part of dynamo_wrapped to make sure that all unimplemented_v2() callsites are mapped to the JSON file. If they are not, the dev gets a message with instructions on how to update the JSON file. It also fixes the issue of the CI not being able to expand the hints, which was the root cause of the previous workflow failure. I also updated a dynamic gb_type to static and updated its test_error_message to include the GBID link for the graph break (previously the link would not be produced).
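
A check of this kind can be sketched roughly as follows. This is a minimal illustration, not the script added by this PR: the registry layout (a mapping from GB ID to a list of entries with a `Gb_type` field) and the helper names `collect_gb_types` and `find_unregistered` are assumptions made for the example.

```python
import ast


def collect_gb_types(source: str) -> set[str]:
    """Collect literal gb_type values passed to unimplemented_v2() calls."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `unimplemented_v2(...)` and `module.unimplemented_v2(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "unimplemented_v2":
            continue
        for kw in node.keywords:
            is_literal = isinstance(kw.value, ast.Constant) and isinstance(kw.value.value, str)
            if kw.arg == "gb_type" and is_literal:
                found.add(kw.value.value)
    return found


def find_unregistered(source: str, registry: dict) -> set[str]:
    # Assumed registry layout: {"GB0001": [{"Gb_type": "..."}], ...}
    registered = {entry["Gb_type"] for entries in registry.values() for entry in entries}
    return collect_gb_types(source) - registered
```

Only literal string arguments are collected here; a dynamically built gb_type would be invisible to such a static scan, which is one reason to prefer static values.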

Testing:
I ran the file with the argument to ensure all cases were covered, and also tested the test in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156237
Approved by: https://github.com/williamwen42
2025-06-24 18:12:23 +00:00
2d7e6c6241 [MPS] Revert cumsum/cumprod to MPSGraph implementation (#156708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156708
Approved by: https://github.com/malfet
2025-06-24 18:12:18 +00:00
af284b45d5 [sigmoid] layout planner alias analyzer (#156676)
Summary: We need a mechanism that, given the function schemas for each kernel, can trace aliasing behaviour so that we have correct value lifetimes when we plan.

Test Plan: ci + unit tests

Reviewed By: SherlockNoMad

Differential Revision: D73635213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156676
Approved by: https://github.com/zhxchen17
2025-06-24 18:11:03 +00:00
644cc58dff Add CPython exception tests (#150789)
----

* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
2025-06-24 18:06:42 +00:00
5ad2bee2c8 [dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)
Fixes https://github.com/pytorch/pytorch/issues/155057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156527
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-06-24 17:57:14 +00:00
4491326fb0 [inductor] select_algorithm: add preprocessing fns (#156464)
Summary:
# Why

- keep code cleaner
- modular way to hook up preprocessing steps
- expand testability of flows that change which choices are provided e.g. to test performance models and lookup tables by running torch.compile

# What

- similar to feedback_saver_fns, now there are preprocessing_fns
- the existing regex logic is exported into those as a proof of concept
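
The hook mechanism described above can be sketched as a list of callables applied in order to the candidate choices. This is a simplified illustration, not Inductor's code: `add_preprocessing_fn`, `run_preprocessing`, `regex_filter`, and the use of plain strings for choices are hypothetical.

```python
import re
from typing import Callable

Choice = str  # stand-in for a real choice object; only its name matters here
PreprocessingFn = Callable[[list[Choice]], list[Choice]]

preprocessing_fns: list[PreprocessingFn] = []


def add_preprocessing_fn(fn: PreprocessingFn) -> None:
    """Register a hook that may filter or reorder the choices."""
    preprocessing_fns.append(fn)


def regex_filter(pattern: str) -> PreprocessingFn:
    """Build a hook that keeps only choices whose name matches `pattern`."""
    compiled = re.compile(pattern)
    return lambda choices: [c for c in choices if compiled.search(c)]


def run_preprocessing(choices: list[Choice]) -> list[Choice]:
    # Each hook sees the output of the previous one.
    for fn in preprocessing_fns:
        choices = fn(choices)
    return choices
```

Keeping the hooks as plain callables means a test can register a filter, run compilation, and observe which choices survive without touching the selection code itself.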

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx038
```

This does not exercise the logic, it just shows that it's safe right now

Rollback Plan:

Differential Revision: D76946993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156464
Approved by: https://github.com/masnesral
2025-06-24 16:44:40 +00:00
6e17315cd3 Skip FSDP tests if device count is less then requested world_size value (#155836)
Usually `world_size=torch.cuda.device_count()` for FSDPTest-based tests
But distributed test class `TestFullyShardAllGatherExtensionsMultiProcess` [forces to use `world_size=2`](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L170)) even for 1 GPU.

Then NCCL fails with errors:
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```
The test method [has `@skip_if_lt_x_gpu(2)` decorator](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L209)), but the test fails during test class initialization, before the decorator takes effect.

This PR will skip FSDPTest-based tests if `world_size > torch.cuda.device_count()`
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```
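
The skip logic can be illustrated with a small stand-alone sketch. It is not the code from this PR: `available_devices` is a hypothetical stand-in for `torch.cuda.device_count()`, and the class names are made up for the example.

```python
import unittest


def available_devices() -> int:
    # Hypothetical stand-in for torch.cuda.device_count(); pretend one GPU.
    return 1


class MultiProcessTestBase(unittest.TestCase):
    world_size = 1

    def setUp(self):
        # Skip before any process group is created, so a test that forces a
        # larger world size never tries to share one device between ranks.
        if self.world_size > available_devices():
            self.skipTest(f"Need at least {self.world_size} CUDA devices")


class ForcedTwoRankTest(MultiProcessTestBase):
    world_size = 2

    def test_train_parity(self):
        self.fail("should be skipped on a single-device machine")
```

Doing the check in the base class's setup, rather than relying on a per-method decorator, covers the case where initialization itself would fail.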

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155836
Approved by: https://github.com/jeffdaily
2025-06-24 16:38:23 +00:00
e2c9d8d641 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-06-24 15:41:34 +00:00
cb853945a7 [ez][CI] Update viable strict: change concurrency group to cancel in progress (#156619)
Should help with https://github.com/pytorch/pytorch/issues/156425

The one I saw today was because the job was waiting for an environment deployment approval for the mergebot environment. I think this comes from something like a temporary GitHub outage or a dropped webhook, since the job should have permissions (it was on the main branch) and other runs are fine.
The run is https://github.com/pytorch/pytorch/actions/runs/15820977440 but you can't see anything about waiting for deployment anymore

My solution is to change the concurrency group so that it will cancel in-progress jobs if there is one. My hope is that if one gets stuck, the next one will cancel it and redo the environment check. I don't know how to replicate this because apparently you're just supposed to fail if you don't match the protection rules https://github.com/pytorch/pytorch/actions/runs/15830920815

The job runs every 30 minutes, so there might be an issue if this job needs to run for >30 minutes to find a green sha, but it usually takes <5 minutes to run, so I think it's ok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156619
Approved by: https://github.com/atalman
2025-06-24 15:37:43 +00:00
4c59edf0c5 [nativert] Move call_torchbind_kernel (#156571)
Summary: Move call_torchbind_kernel target from internal sigmoid to pytorch

Test Plan:
Test Internally:

buck2 test mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
buck build //sigmoid/core/kernels:kernel_factory
and all  sandcastle tests

Rollback Plan:

Differential Revision: D77118592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156571
Approved by: https://github.com/zhxchen17
2025-06-24 15:24:06 +00:00
795a6a0aff Update github first merge rule (#156583)
**Summary**
Update the merge rules for `CPU Frontend` and `Autocast`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156583
Approved by: https://github.com/atalman
2025-06-24 14:04:22 +00:00
dd78d6e7ea Add CPython generator/contextlib tests (#150796)
Tests:
* test_generator.py
* test_generator_stop.py
* test_contextlib.py

Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.

The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:

```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-24 13:15:04 +00:00
3a7ff829c5 Fix MacOS MP hang in Python-3.12+ (#155698)
By leaking the resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887 ) at exit, because at this point the handle to the child process might no longer be valid

Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix as all data loader tests will hang otherwise
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the PATH environment variable)

Fixes https://github.com/pytorch/pytorch/issues/153050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
2025-06-24 12:13:35 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
4e8dd11be1 simplify nvrtc discovery login in compile_kernel (#156674)
Followup from https://github.com/pytorch/pytorch/pull/156332

Tested a bunch while I was working on https://github.com/pytorch/pytorch/pull/156380

Works just fine on dev gpus
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156674
Approved by: https://github.com/malfet
2025-06-24 08:55:40 +00:00
ce73b0c53f Validate custom op support for compile_kernel (#156332)
Follow-up work from #151484 - just makes sure that compile_kernel composes nicely with custom ops by writing some new tests, no new code functionality is added

benchmark failure in CI is unrelated to this change, CI is green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156332
Approved by: https://github.com/zou3519, https://github.com/malfet
2025-06-24 08:21:21 +00:00
35e44067c4 Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-06-24 07:57:48 +00:00
cyy
ce1a07570d Fix TORCH_CUDA_ARCH_LIST (#156667)
Before the fix, `TORCH_CUDA_ARCH_LIST` variable contains string `TORCH_CUDA_ARCH_LIST`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156667
Approved by: https://github.com/ngimel
2025-06-24 07:27:53 +00:00
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always allocated with contiguous strides, while CPU/CUDA flash_attention and cudnn_attention allocate the output tensor with the same strides as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
a7b29c88b1 [ONNX] Preserve all legacy exporter params in fallback (#156659)
Fixes #151693

Prior to this PR, the fallback did not take care of all user parameters. This PR preserves them to ensure a smooth transition for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156659
Approved by: https://github.com/justinchuby
2025-06-24 05:28:55 +00:00
a6a8641c8a Fix UT failure on non-cuda backend (#156577)
# Motivation
`HAS_TRITON` is a generic API that can return `True` on the xpu backend, which causes these cases to fail on xpu. So we should use `HAS_CUDA` (equivalent to `torch.cuda.is_available() and HAS_TRITON`) to avoid these failures.

Please refer to https://github.com/pytorch/pytorch/actions/runs/15813693789/job/44569593370#step:15:2129

# Additional Context
This PR aims to fix the CI failure soon. We will have a dedicated PR to generalize these UT to be generic. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @daisyden
Fix https://github.com/pytorch/pytorch/issues/156576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156577
Approved by: https://github.com/jansel
2025-06-24 05:24:24 +00:00
495c317005 Replace deprecated is_compiling method (#154476)
Replace deprecated `is_compiling` in `torch._dynamo` with `torch.compiler`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154476
Approved by: https://github.com/eellison
2025-06-24 05:16:40 +00:00
1044934878 [CUDAGraph] add config cudagraph_capture_sizes (#156551)
Users may want CUDAGraph for certain sizes and fallback for other sizes.

As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.

Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.

This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.

Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
2025-06-24 05:14:49 +00:00
899d3d3e9e Don't call sum() on a tensor that is not summable in layer_norm (#156600)
Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-06-24 05:00:42 +00:00
17eb649d55 Implement guard collectives (optimized version) (#156562)
This is a remix of https://github.com/pytorch/pytorch/pull/155558

Instead of mediating guard collectives via a config option, in this one it's done via a `set_stance`-like API. The motivation is that checking for the config value on entry to torch.compile is apparently quite expensive, according to functorch_maml_omniglot. So this makes it a bit cheaper.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156562
Approved by: https://github.com/Microve
2025-06-24 04:59:49 +00:00
73772919d2 remove deprecated numpy.typing.mypy_plugin in mypy.ini (#156601)
Fixes #156489
removed deprecated numpy plugin in mypy.ini
 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156601
Approved by: https://github.com/ezyang
2025-06-24 04:56:08 +00:00
6d5c789ad5 [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551, #144554
2025-06-24 04:53:54 +00:00
e600e044a7 Revert "[aotd] Support mutations of the same input in fw and bw (#155354)"
This reverts commit 3f920f3d8f5bd15d2222758f21f9a5d36e4dad1f.

Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see 930b575389/1 ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))
2025-06-24 04:42:14 +00:00
930b575389 [symm_mem] Add sym mem test into ptd h100 ci (#156634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156634
Approved by: https://github.com/ngimel, https://github.com/mori360
2025-06-24 03:43:22 +00:00
b2d473c8f8 [ROCm][Windows] Fix rocsolver undefined symbol error (#156591)
Fix undefined symbol error while using `rocsolver_ssyevd_strided_batched` call in `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156591
Approved by: https://github.com/jeffdaily
2025-06-24 03:28:45 +00:00
87d615efab [fr] Use a vector to temporarily keep the reference to future object to avoid block (#156653)
At the end of the scope in which std::async is launched, the destructor of the returned future waits for the task, which can make the code block; this is not expected for the monitoring thread. Instead, let's use a vector to hold the references to the futures, so that no blocking happens. At the end of the loop, wait will still be called, but that is ok since all the checks or dumps have already finished.

Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
2025-06-24 03:25:04 +00:00
cyy
b09bd414a6 Deprecate c10::string (#155084)
Now there is no mention of c10::string in OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155084
Approved by: https://github.com/ezyang
2025-06-24 03:03:06 +00:00
0a2ec7681d Add fx_graph_runnable tests boilerplate (#156552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156552
Approved by: https://github.com/StrongerXi
2025-06-24 02:41:38 +00:00
9665702c64 [nativert] reland D76832891 remove designated initializer cpp20 (#156565)
Summary: fix the Windows build broken in https://github.com/pytorch/pytorch/pull/156508

Test Plan:
ci

Rollback Plan:

Differential Revision: D77080420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156565
Approved by: https://github.com/zhxchen17
2025-06-24 02:38:08 +00:00
6a3d00aa3b Add Windows cuda 12.9.1 build (#156630)
Without support for SegmentReduce.cu.
A test PR confirmed that, with SegmentReduce.cu removed, the Windows build for CUDA 12.9 can succeed.

Related to: https://github.com/pytorch/pytorch/issues/156181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156630
Approved by: https://github.com/malfet

Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-24 02:15:49 +00:00
a9ef7c4d04 [dynamo] update to lru_cache message and updated user stack trace in debug mode (#156639)
I had to create a new PR for this because @atalman requested temporarily reverting the previous PR to restore diff train sync. Nothing has changed between this PR and the original one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
2025-06-24 01:52:13 +00:00
86996c15dc [Inductor] Allow exhaustive autotuning across all GEMM options (#156610)
Differential Revision: D76843916

Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially with configs of the following nature:
- Excessive register spillage
- Using much larger amounts of shared memory than available on the hardware

This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for the persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:

```
  AUTOTUNE mm(1152x21504, 21504x1024)
  strides: [21504, 1], [1, 21504]
  dtypes: torch.bfloat16, torch.bfloat16
  mm 0.1167 ms 100.0%
  triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
  INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```
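
The shared-memory part of the pruning described above can be sketched as a filter over candidate configs. This is a rough illustration, not the logic from this diff: the shared-memory estimate (the A and B tiles, times the element size, times the number of pipeline stages) and the names `GemmConfig`, `estimated_shared_memory`, and `prune_configs` are assumptions, and the register-spill check is omitted because it needs compiler output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmConfig:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_shared_memory(cfg: GemmConfig, dtype_bytes: int) -> int:
    # One A tile (block_m x block_k) and one B tile (block_k x block_n) per stage.
    tile_elems = cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n
    return tile_elems * dtype_bytes * cfg.num_stages


def prune_configs(configs, dtype_bytes: int, smem_limit_bytes: int):
    """Drop configs whose estimated shared-memory use exceeds the limit."""
    return [
        c for c in configs
        if estimated_shared_memory(c, dtype_bytes) <= smem_limit_bytes
    ]
```

Filtering before compilation is the point of the design: a config that cannot fit is rejected from its block sizes alone, without paying the compile time that made the unpruned search hang.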

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
2025-06-24 01:42:05 +00:00
40a785103c [dynamo] fix debugging code_parts for relational guards (#154753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154753
Approved by: https://github.com/anijain2305
ghstack dependencies: #154772
2025-06-24 01:38:29 +00:00
849468034d [dynamo] fix selecting shape guards (#154772)
Not all LAMBDA_GUARDs are shape guards. Only the epilogue guards
are lambda guards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154772
Approved by: https://github.com/anijain2305
2025-06-24 01:38:29 +00:00
5dd9652389 Clean up HF components (#155707)
Differential Revision: [D76427358](https://our.internmc.facebook.com/intern/diff/D76427358/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155707
Approved by: https://github.com/saumishr
2025-06-24 00:07:37 +00:00
ca5a40395d [partitioner] Fix _broadcast_on_rank0 to use deterministic hash function (#153734)
Summary:
I was using python's hash, which is not deterministic across different interpreter runs.

Use hashlib instead.
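
The difference can be shown with a short sketch. This is an illustration rather than the code in `_broadcast_on_rank0`; `stable_hash` is a hypothetical helper. Python's built-in `hash()` of a `str` is salted per interpreter process unless `PYTHONHASHSEED` is fixed, while a `hashlib` digest depends only on the input bytes.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Return a 64-bit integer hash that is identical in every process."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big")
```

Because every rank computes the same value for the same string, the result can safely be used as a key that must agree across processes.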

Test Plan:
Run using it

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_cc-8e17be61ce?job_attempt=1&version=0&tab=summary&env=prod

Differential Revision: D74882405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153734
Approved by: https://github.com/Microve
2025-06-24 00:06:23 +00:00
24063ad109 Fix native static dispatch kernels (#156331)
Summary: Fix for native static dispatch kernels not taking effect

Test Plan:
```
buck2 test //sigmoid/backend/test:static_kernels_ops_test

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --pytorch_predictor_sigmoid_static_dispatch_enable=true --pytorch_predictor_sigmoid_graph_passes_enable=true --benchmarkEnableProfiling=true --load_lowered_merge=3 --using_aoti_lowering_allowlist=false --requestFilePath=/data/users/georgiaphillips/replayer/inputs/742055223/0/mix/742055223_0_mix.inputs.recordio --benchmarkNumIterations=2
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76559764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156331
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-06-24 00:05:49 +00:00
380e30a723 [EZ/Profiler] Change 'b' to 'B' in FunctionEvent Frontend (#156250)
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311

Test Plan:
Just changes string output

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      60.993us         0.97%      60.993us       1.848us           0 B           0 B            33
...
```

Rollback Plan:

Differential Revision: D76857251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
2025-06-23 23:25:04 +00:00
07bb097698 Fix clang-tidy bugprone* warnings (#148529)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148529
Approved by: https://github.com/ezyang
2025-06-23 23:09:56 +00:00
3f920f3d8f [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation for the same input in forward AND in backward.

AOTD emitted copy_ after joint_function tracing, so this fx node corresponded to the side effects of both mutations (in forward and in backward).
After that, the partitioner could put it either in forward or in backward.

The fix:

1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so that the state of the inputs can be checked after forward.

We do not want to apply the mutation after joint if we already applied it in forward. For that we need a "mutation_counter", and we memorize the version of the mutation that we applied for the forward mutation.

2/ Exposing mutation_counter to python

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ We memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.

Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.

proxy_tensor changes:

By default, proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want this copy_ to be independent and applied just to primals.
For this, we introduce a context manager that can disable the update of tensor_tracker when adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-23 22:25:45 +00:00
c82a174cea Extract CPU log_softmax kernels to header (#156243)
This allows sharing them with ExecuTorch.

Differential Revision: [D76830114](https://our.internmc.facebook.com/intern/diff/D76830114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156243
Approved by: https://github.com/janeyx99
2025-06-23 21:31:16 +00:00
96e4c95cd8 [Inductor] Subgraph as a choice symbolic expression as input (#156185)
Differential Revision: D76514984

Fix subgraph as a choice for when a symbolic shape is inputted as an expression, i.e. 256 * s0, which typically happens in the backwards pass. The current logic assumes that all symbolic shapes are single inputs, i.e. standalone s0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
2025-06-23 21:29:17 +00:00
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae380d80efbaf140a95abfc5d96e5100.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
31e1274597 [MTIA Aten Backend] Migrate max.dim_max / min.dim_min (#156568)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate max.dim_max / min.dim_min to in-tree.

Differential Revision: [D77095185](https://our.internmc.facebook.com/intern/diff/D77095185/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156568
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539, #156554
2025-06-23 20:43:39 +00:00
dfdd636cfa [aoti] Check longlong upperbound for codegening input size check (#156522)
Summary:
Fixes
```
error: integer literal is too large to be represented in any integer type
 38979 |     if (arg410_1_size[0] > 1171368248680556527362) {
```
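
A rough sketch of the kind of guard the title describes, not the actual Inductor code: when the upper bound of a size range does not fit in a signed 64-bit integer, emitting it as a C++ literal cannot compile, so the check is skipped. The function name `emit_upper_bound_check` and the error-handling placeholder in the generated string are hypothetical.

```python
from typing import Optional

# Largest value representable by a signed 64-bit integer (C++ long long).
INT64_MAX = 2**63 - 1


def emit_upper_bound_check(arg_name: str, dim: int, upper: int) -> Optional[str]:
    """Return a C++ size check, or None when the bound cannot be written as a literal.

    Hypothetical helper for illustration only.
    """
    if upper > INT64_MAX:
        # The literal would be too large for any integer type; skip the check.
        return None
    return f"if ({arg_name}_size[{dim}] > {upper}) {{ /* report error */ }}"


if __name__ == "__main__":
    print(emit_upper_bound_check("arg410_1", 0, 1171368248680556527362))
    print(emit_upper_bound_check("arg410_1", 0, 4096))
```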

Test Plan: ci

Differential Revision: D77057898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156522
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-23 20:38:34 +00:00
edd9c09e73 [MTIA Aten Backend] Migrate isnan (#156554)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate isnan to in-tree.

Differential Revision: [D77094811](https://our.internmc.facebook.com/intern/diff/D77094811/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156554
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539
2025-06-23 20:22:32 +00:00
070e580d30 [MTIA Aten Backend] Migrate _log_softmax.out / _log_softmax_backward_data.out (#156539)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate _log_softmax.out / _log_softmax_backward_data.out to in-tree.

Differential Revision: [D77044380](https://our.internmc.facebook.com/intern/diff/D77044380/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156539
Approved by: https://github.com/malfet
ghstack dependencies: #156502
2025-06-23 19:56:01 +00:00
93cd16512f [MTIA Aten Backend] Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out (#156502)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out to in-tree.

Differential Revision: [D76917384](https://our.internmc.facebook.com/intern/diff/D76917384/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156502
Approved by: https://github.com/malfet
2025-06-23 19:56:01 +00:00
ee4d343499 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)" (#156624)
This reverts changes to [test/dynamo/test_repros.py](https://github.com/pytorch/pytorch/compare/main...atalman:revert_only_portion_of_file?expand=1#diff-4c82a5798a61d4cceb176b2700ba6fdd7c3e72d575b8e7e22458589139459caa)

Missed by: ee3d9969cc (diff-036cb21341ff8e390cc250e74fe9e3f0f15f259ea4bec4abcce49d95febf1553)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156624
Approved by: https://github.com/Camyll
2025-06-23 19:30:08 +00:00
56b3bf0c74 [nativert] Move HigherOrderKernel (#156507)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan: CI

Differential Revision: D77032074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156507
Approved by: https://github.com/zhxchen17
2025-06-23 19:29:27 +00:00
d061a02e6e Revert "[invoke_subgraph] make same subgraph share get_attr target (#156260)"
This reverts commit 39dd2f4d7defc63164a7969bfac0d0c62ffac900.

Reverted https://github.com/pytorch/pytorch/pull/156260 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156260#issuecomment-2997478798))
2025-06-23 18:24:10 +00:00
35d03398e5 Revert "[invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)"
This reverts commit f179b7198522e6d93bd103efba1a1ebd5a2cf891.

Reverted https://github.com/pytorch/pytorch/pull/156347 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156347#issuecomment-2997453729))
2025-06-23 18:19:29 +00:00
98a34e8d4b Move code out of individual token linters (#152256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152256
Approved by: https://github.com/Skylion007
2025-06-23 18:16:33 +00:00
da910e603a [ROCm] update state check for test_trace_while_active* (#153545)
When timing is enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why I updated the test state check to be either "started" or "scheduled"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-23 17:58:14 +00:00
55ef7b15e0 Revert "[dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)"
This reverts commit afbf5420b8745099bf7d871f5a4fb6dec338f825.

Reverted https://github.com/pytorch/pytorch/pull/156463 on behalf of https://github.com/atalman due to This is a temporary revert, to restore diff train sync. We should be good to reland this change ([comment](https://github.com/pytorch/pytorch/pull/156463#issuecomment-2997335541))
2025-06-23 17:44:36 +00:00
a95504b10f [torchbench] update environment setup script (#156465)
The existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only wants to run 1 model.

This PR adds a config to only install torchbench models we want to run.

Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"

# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"

# Install all models
make build-deps

# Install all models
make build-deps TORCHBENCH_MODELS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
2025-06-23 17:41:29 +00:00
e583b88819 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit ac86ec0e60370c037e018137f2048cafd47c5c28.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to internal breakage ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2997314638))
2025-06-23 17:36:44 +00:00
f179b71985 [invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156347
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #156260
2025-06-23 17:10:07 +00:00
39dd2f4d7d [invoke_subgraph] make same subgraph share get_attr target (#156260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156260
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-06-23 17:10:07 +00:00
276c790010 [ROCm][SymmetricMemory] Avoid bf16 to float conversion during reduce (#155587)
This PR helps improve the performance of one-shot and two-shot allreduce as reported here: https://github.com/pytorch/FBGEMM/issues/4072

One-Shot:
![image](https://github.com/user-attachments/assets/69fe0d53-6636-42e1-90e0-e5efb989f59f)
As shown in the numbers presented above, symmetric memory performance prior to the PR (baseline) was on average about 26% less than fbgemm's number reported in the issue above. After this PR, we are seeing 16% improvement on average as compared to fbgemm and 59% as compared to our baseline numbers.

Two-Shot:
![image](https://github.com/user-attachments/assets/e5c8a288-303e-4d50-814b-4348e589e1fc)
Similarly, in two-shot, we were originally underperforming by 12%. We have improved by 22% after this PR as compared to symmetric memory performance prior to this PR. However, two-shot performance is still about 23% lower than fbgemm. This work is still in progress and we will be pushing those changes through a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155587
Approved by: https://github.com/jeffdaily
2025-06-23 16:14:01 +00:00
5a533f74a1 Checkout optional submodules when publishing a release tarball (#156615)
This includes Eigen and nccl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156615
Approved by: https://github.com/huydhn
2025-06-23 16:08:22 +00:00
6835ba1b34 Register hpu device to fake backend (#156076)
## MOTIVATION

This PR adds hpu (Intel Gaudi) to the list of devices supported by the "fake" distributed backend and the process group that will be created.

## CHANGES
- Add "hpu" to the list of devices

@ankurneog, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156076
Approved by: https://github.com/d4l3k, https://github.com/EikanWang, https://github.com/albanD
2025-06-23 16:08:08 +00:00
cc410d3761 [SymmMem] Rename all_to_all_vdev ops (#156582)
The `all_to_all_vdev` ops are not bindings of NVSHMEM APIs, so remove the `nvshmem_` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156582
Approved by: https://github.com/fduwjj
ghstack dependencies: #155134
2025-06-23 15:57:36 +00:00
640f5a7090 [dynamo] Support builtin bool on non-constant VTs (#155863)
In practice `bool(...)` is either constant folded by Dynamo or used for
branching (so most of its emulation logic lived in
`InstructionTranslator.generic_jump`).

This patch adds a dedicated `bool` handler (only for symbolic
bool/int/float for now), and fixes #136075.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
2025-06-23 15:53:15 +00:00
6b45af38a5 [easy] better copy_misaligned_inputs assertion failure message (#154472)
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/688540560729579/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154472
Approved by: https://github.com/williamwen42
2025-06-23 15:39:15 +00:00
2e9bd03f60 Implemented Size.__radd__ (#152554)
Fixes #144334
Builds on top of #146834 by @khushi-411

The needed trick was to add `PyNumberMethods`, because the Number Protocol appears to be responsible for `__radd__` (see https://stackoverflow.com/q/18794169)
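
As a pure-Python analogue (a sketch only; the real change is in the C implementation of `torch.Size`), a tuple subclass that defines `__radd__` is consulted when a plain tuple appears on the left of `+`, because the built-in tuple has no number-protocol `__add__` that handles the subclass. The class name `MySize` is hypothetical.

```python
class MySize(tuple):
    """Hypothetical stand-in for a Size-like tuple subclass."""

    def __radd__(self, other):
        # Called for `other + self` when `other` is a plain tuple.
        if not isinstance(other, tuple):
            return NotImplemented
        return MySize(tuple(other) + tuple(self))


if __name__ == "__main__":
    result = (1, 2) + MySize((3,))
    # The reflected add keeps the subclass type instead of returning a plain tuple.
    print(type(result).__name__, tuple(result))
```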

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152554
Approved by: https://github.com/albanD

Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-06-23 15:38:37 +00:00
3cbae6dde8 [MPSInductor][BE] Fix multistage reduction check (#156567)
Change the check from "less than the max threadgroup size" to "less than or equal to it", which eliminates redundant trivial loops.

I.e. it changes shader code generated for
```python
import torch

def f(x):
    var, mean = torch.var_mean(x, dim=2, keepdim = True)
    return x / var, var

torch.compile(f)(torch.rand(1, 16, 1024, dtype=torch.float32, device='mps'))

```

from
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float3 tmp_acc_0[1024];
    tmp_acc_0[r0_index * 1] = 0.0;
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp0 = in_ptr0[r0_1 + 1024*x0];
        tmp_acc_0[r0_index * 1] = ::c10::metal::welford_combine(tmp_acc_0[r0_index * 1], float3(tmp0, 0.0, 1.0));
    }
    auto tmp1 = c10::metal::threadgroup_welford_combine(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp4 = in_ptr0[r0_1 + 1024*x0];
        auto tmp5 = tmp4 / tmp3;
        out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp5);
    }
}
```
to
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int r0_1 = r0_index;
    int x0 = xindex;
    threadgroup float tmp_acc_0[1024];
    auto tmp0 = in_ptr0[r0_1 + 1024*x0];
    tmp_acc_0[r0_index * 1] = tmp0;
    auto tmp1 = c10::metal::threadgroup_welford_reduce(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    auto tmp4 = tmp0 / tmp3;
    out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp4);
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156567
Approved by: https://github.com/dcci
ghstack dependencies: #156566
2025-06-23 14:49:26 +00:00
e28925aa75 [MPS] Activation kernels: do compute at float precision (#155735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155735
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571, #155586
2025-06-23 14:48:57 +00:00
f5e1b24945 Revert "Enable Leak Sanitizer (#154584)"
This reverts commit c79c7bbe615265b6b3d7df39d6d5a68afd7d6b2a.

Reverted https://github.com/pytorch/pytorch/pull/154584 on behalf of https://github.com/cyyever due to Need to suppress more output ([comment](https://github.com/pytorch/pytorch/pull/154584#issuecomment-2995792265))
2025-06-23 10:08:40 +00:00
4f70fbbd16 Revert "Use CMake wholearchive group (#156393)"
This reverts commit d1b4e0fa9a5feb22fc6de1d36dc4c9dac685caed.

Reverted https://github.com/pytorch/pytorch/pull/156393 on behalf of https://github.com/etaf due to This PR is breaking XPU windows build. ([comment](https://github.com/pytorch/pytorch/pull/156393#issuecomment-2995576362))
2025-06-23 09:03:19 +00:00
92409b6c89 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace, which leads to [many if-else conditional code paths](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
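
The call path above can be pictured with a small pure-Python model. This is only a sketch of the dispatch pattern under stated assumptions: it does not use torch, and `FakeCudaAllocator`, `register_allocator`, `get_device_allocator`, and the module-level `empty_cache` are hypothetical names standing in for the C++ classes and the `torch.accelerator` entry point.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DeviceAllocator(ABC):
    """Base interface that every device-specific allocator implements."""

    @abstractmethod
    def empty_cache(self) -> None:
        ...

    @abstractmethod
    def memory_reserved(self) -> int:
        ...


class FakeCudaAllocator(DeviceAllocator):
    """Hypothetical allocator that only tracks a cached byte count."""

    def __init__(self) -> None:
        self.cached_bytes = 1024

    def empty_cache(self) -> None:
        self.cached_bytes = 0

    def memory_reserved(self) -> int:
        return self.cached_bytes


# Registry from device type to its allocator (illustrative only).
_ALLOCATORS: Dict[str, DeviceAllocator] = {}


def register_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _ALLOCATORS[device_type]


def empty_cache(device_type: str) -> None:
    """Device-agnostic entry point: look up the allocator, then delegate."""
    get_device_allocator(device_type).empty_cache()


if __name__ == "__main__":
    register_allocator("cuda", FakeCudaAllocator())
    empty_cache("cuda")
    print(get_device_allocator("cuda").memory_reserved())
```

Callers only name the device type; the per-device if-else branching moves into the registry lookup.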

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD
2025-06-23 08:49:30 +00:00
d5781c8d21 remove allow-untyped-defs from torch/fx/passes/utils/fuser_utils.py (#156538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156538
Approved by: https://github.com/ezyang
2025-06-23 08:18:16 +00:00
e0ae4ecca8 Refactor cpp codegen to support overridable class attributes. (#155553)
- Refactored CppKernelProxy and CppScheduling to use class-level attributes (kernel_cls, kernel_proxy_cls) for backend-specific kernel customization.
 - Avoids method duplication (e.g., codegen_functions, codegen_node) for backend-specific overrides, thus reducing downstream maintenance when upgrading Torch.
 - Ensures type safety with annotations while keeping core logic centralized and extensible.
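
The pattern described above can be sketched as follows. These are simplified stand-ins, not the Inductor classes: the class and attribute names mirror the ones mentioned in the message, but the method bodies, the `codegen` method, and the `MyBackend*` subclasses are assumptions made for illustration.

```python
from typing import Type


class CppKernel:
    def codegen(self) -> str:
        return "generic cpp kernel"


class CppKernelProxy:
    # Class-level hook: a backend overrides this attribute instead of a method.
    kernel_cls: Type[CppKernel] = CppKernel

    def codegen_functions(self) -> str:
        return self.kernel_cls().codegen()


class CppScheduling:
    kernel_proxy_cls: Type[CppKernelProxy] = CppKernelProxy

    def codegen_node(self) -> str:
        return self.kernel_proxy_cls().codegen_functions()


# A hypothetical downstream backend only swaps the class attributes.
class MyBackendKernel(CppKernel):
    def codegen(self) -> str:
        return "my backend kernel"


class MyBackendKernelProxy(CppKernelProxy):
    kernel_cls = MyBackendKernel


class MyBackendScheduling(CppScheduling):
    kernel_proxy_cls = MyBackendKernelProxy


if __name__ == "__main__":
    print(CppScheduling().codegen_node())
    print(MyBackendScheduling().codegen_node())
```

The shared logic in `codegen_functions` and `codegen_node` stays in the base classes, so a backend does not need to copy those methods.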

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155553
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-06-23 07:36:30 +00:00
cyy
67ee0c6725 Remove outdated Android workarounds of nearbyintf (#151292)
This PR uses std::nearbyint on all supported platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151292
Approved by: https://github.com/ezyang
2025-06-23 06:28:15 +00:00
cyy
d1b4e0fa9a Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-06-23 06:22:34 +00:00
cyy
099d0d6121 Simplify nvtx3 CMake handling, always use nvtx3 (#153784)
Fall back to third-party NVTX3 if system NVTX3 doesn't exist. We also reuse the `CUDA::nvtx3` target for better interoperability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153784
Approved by: https://github.com/ezyang
2025-06-23 06:12:46 +00:00
31659964a5 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-23 05:58:39 +00:00
cyy
c79c7bbe61 Enable Leak Sanitizer (#154584)
It enables Leak Sanitizer and also provides a suppression file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154584
Approved by: https://github.com/ezyang
2025-06-23 05:20:27 +00:00
9fed2added Remove remaining CUDA 12.4 CI code (#155412)
Because there is no 12.4 job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155412
Approved by: https://github.com/ezyang
2025-06-23 05:16:38 +00:00
4cd6e96bf0 [MPSInductor] Fix nested loop var elimination (#156566)
As reduction results must be kept around
Add regression test that is specific for this issue

Fixes https://github.com/pytorch/pytorch/issues/156426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156566
Approved by: https://github.com/dcci
2025-06-23 04:35:16 +00:00
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
5b210bb3a6 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-23 02:57:50 +00:00
ced90016c1 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-23 02:57:41 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
1b2146fc6d [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-23 02:57:19 +00:00
6ff6630375 [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
c55eef79f8 [Inductor][CPP] Enable a config to use a small dequant buffer for woq int4 (#156395)
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
2025-06-23 02:00:42 +00:00
3c7079959c [Inductor][CPP] Enable WOQ int4 concat linear (#156387)
**Summary**
Enable the concat linear optimization pass in Inductor for woq int4 linear.

**Test Plan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_int4_concat_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156387
Approved by: https://github.com/CaoE, https://github.com/jansel
ghstack dependencies: #156407
2025-06-23 01:52:00 +00:00
03023f178c FlexAttn config refactor + ROCm optimisations (#156307)
This PR primarily unifies the flex attention config logic with the GEMM/Conv config approach https://github.com/pytorch/pytorch/pull/147452 this will make it much easier to handle optimisation pathways for particular triton backends.

This PR also:
1. Introduces an exhaustive tuning mode for flex attention via TORCHINDUCTOR_MAX_AUTOTUNE_FLEX_SEARCH_SPACE="EXHAUSTIVE" to allow for wide scale benchmarking for perf investigation use cases.
2. Updates configs for ROCm flex autotune path providing perf optimisations

AMD perf numbers on score mod benchmark (default inputs)
flex_attn | mode | Speedup (Avg) | Speedup (Max)
-- | -- | -- | --
fwd | autotune before PR | 2.608 | 20.56
fwd | autotune after PR | 2.862 | 22
fwd | exhaustive_autotune | 2.943 | 22.471
bwd | autotune before PR | 2.196 | 9.831
bwd | autotune after PR | 2.423 | 11.331
bwd | exhaustive_autotune | 2.566 | 13.87

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156307
Approved by: https://github.com/drisspg, https://github.com/jansel
2025-06-22 22:27:38 +00:00
a5cbb2bcb3 Improve All to All Perf for inter-node use-case (#156376) (#156389)
Summary:

For the 16-GPU use case, NVSHMEM can drive only up to 49GB/s with 8 thread blocks per peer for the all-to-all-v use case. Increasing that to 16 threads per block is able to max out the perf.

Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=0  comms.py --master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=1  comms.py --master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda

Rollback Plan:

Differential Revision: D76937048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
2025-06-22 20:45:46 +00:00
a28e6ae38f [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156401)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156401
Approved by: https://github.com/albanD
ghstack dependencies: #156400
2025-06-22 18:40:38 +00:00
1d522325b4 [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156400)
As the title stated.

**Changes:**

- add resize_ for OpenReg
- migrate related tests into test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156400
Approved by: https://github.com/albanD
2025-06-22 18:40:38 +00:00
54b8087f63 Improve torch.ops typing (#154555)
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.

Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
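
Point 2 relies on standard Python attribute lookup: `__getattr__` is a fallback that runs only after the instance dictionary and the class hierarchy have been searched. A minimal sketch with a hypothetical `Ops` class (not the real `torch._ops._Ops`) shows why constants can live as class variables without any special-casing in `__getattr__`:

```python
class Ops:
    """Hypothetical namespace object; names are illustrative only."""

    # Found by normal lookup, so __getattr__ never sees this name.
    loaded_libraries: set = set()

    def __init__(self) -> None:
        self.fallback_calls: list = []

    def __getattr__(self, name: str) -> str:
        # Reached only when normal attribute lookup has failed.
        self.fallback_calls.append(name)
        return f"namespace:{name}"


if __name__ == "__main__":
    ops = Ops()
    print(ops.loaded_libraries)  # resolved as a class variable
    print(ops.aten)              # resolved through __getattr__
    print(ops.fallback_calls)
```

Because the fallback now handles only the dynamic names, it can return a single type, which is the property the type checker needs.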

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.

Test Plan: CI

Differential Revision: D75497142

Co-authored-by: Benjamin Glass <bglass@quansight.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
2025-06-22 15:52:27 +00:00
10fb98a004 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-22 15:05:08 +00:00
b5c8b8d09f Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit b46eb1ccaff944cdcd43e9ce3958819226d2952f.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
5e56db59d4 Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 2c372a0502578e0136a84423c3f49c19c26d6bb7.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
c10eeb5bad Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 537b0877a87948bc221301a518fdbc1cf772bc7e.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
ee3d9969cc Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 24dc33b37b50ec92da08fc693dd83e7c87b74f8b.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
f1331f3f1b Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a.

Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
5b427c92a8 Revert "[BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)"
This reverts commit ead741c5fb0036e0fc95b79d4fe1af3a426e1306.

Reverted https://github.com/pytorch/pytorch/pull/156314 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5b4b3206f5b295e96f81cd6c178eb18.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
035a68d25a Revert "[BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)"
This reverts commit ee72815f1180fe2d8bcdb23493999256169ac2fa.

Reverted https://github.com/pytorch/pytorch/pull/156317 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
1d3bca40ed Revert "[BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)"
This reverts commit a23ccaa8479e038e79532759a64e9947c0fac43d.

Reverted https://github.com/pytorch/pytorch/pull/156319 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
afbf5420b8 [dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352

This PR uses `torch._dynamo.utils.warn_once` so that the warning is emitted only once, clarifies in the warning that silent incorrectness is a potential risk rather than something observed, and does not warn for functions that come from torch.*

As of right now with this code change the terminal outputs:

if the code came from torch.* :
Nothing, as we shouldn't warn for functions that come from torch.*

else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)

If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', the debug logs show the following (the log below is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]   File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]     torch.compile(foo, backend="eager")(torch.randn(4))
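
The warn-once behavior described above can be sketched with the standard library alone. This is a stand-in for illustration, not the body of `torch._dynamo.utils.warn_once`; the `_seen_messages` set and the `should_warn` module-name rule are assumptions.

```python
import warnings

_seen_messages = set()


def warn_once(msg, category=UserWarning):
    # Emit a given message at most once per process.
    if msg in _seen_messages:
        return
    _seen_messages.add(msg)
    warnings.warn(msg, category, stacklevel=2)


def should_warn(module_name):
    # Assumed rule for illustration: stay silent for functions from torch.*
    return not (module_name == "torch" or module_name.startswith("torch."))


for module_name in ["user_code", "user_code", "torch.nn.functional"]:
    if should_warn(module_name):
        warn_once("call to a functools.lru_cache-wrapped function")
```

Keying on the message text keeps the helper independent of Python's per-location warning filters, so the warning is suppressed even when it is raised from different call sites.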

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
2025-06-22 11:40:28 +00:00
aeaf6b59e2 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-22 11:39:31 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
a23ccaa847 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-22 08:43:49 +00:00
ee72815f11 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-22 08:43:41 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
ead741c5fb [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-22 08:43:18 +00:00
3627270bdf [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
1d993fa309 Don't change set_skip_guard_eval_unsafe for DisableContext, since compiler won't run (#156490)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156490
Approved by: https://github.com/anijain2305
2025-06-22 00:51:32 +00:00
333e0e6147 Make build-deps drop builds into current venv again (#156200)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200
Approved by: https://github.com/malfet
2025-06-22 00:45:02 +00:00
74ebd8d14e use guard_or_false for expand utils reduction (#155868)
This is a classic broadcast-like pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155868
Approved by: https://github.com/bobrenjc93
2025-06-21 23:42:19 +00:00
f70c80105e Enables NCCL symmetric memory kernels through mempool registration (#155134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155134
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 23:24:04 +00:00
9e132b770e [CUDA] Skip test on low vram machines (#156548)
I noticed some jobs error out after merging #155397 because the test requires >15GB of GPU memory to execute and some of the machines it runs on have 8GB GPUs. This PR adds the skip option on those machines.

CC: @eqy @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156548
Approved by: https://github.com/eqy, https://github.com/malfet
2025-06-21 22:32:57 +00:00
e4ae60a413 [SymmMem] Add NVSHMEM Quiet support to Triton (#156475)
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
  - Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
  - Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
2025-06-21 22:19:58 +00:00
c2d1b225e6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
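
The reordering can be sketched on a toy instruction list. This is an illustration under simplifying assumptions, not the partitioner pass from this PR: nodes are hypothetical `(name, op, args)` tuples in a valid execution order, and `raise_getitems` is a made-up helper name.

```python
def raise_getitems(nodes):
    """Move every getitem node to directly after the node it indexes.

    nodes: list of (name, op, args) tuples in execution order, where a
    getitem node has args == (source_name, index).
    """
    names = {name for name, _, _ in nodes}

    # Group getitem nodes by the node they read from.
    children = {}
    for node in nodes:
        name, op, args = node
        if op == "getitem" and args[0] in names:
            children.setdefault(args[0], []).append(node)

    out = []

    def emit(node):
        out.append(node)
        # Emit dependent getitems right away (handles nested getitems too).
        for child in children.get(node[0], []):
            emit(child)

    for node in nodes:
        name, op, args = node
        if op == "getitem" and args[0] in names:
            continue  # emitted together with its producer
        emit(node)
    return out


graph = [
    ("x", "some_op", ()),
    ("x0", "getitem", ("x", 0)),
    ("use_x0", "do_something", ("x0",)),
    ("other", "do_a_bunch_of_other_things", ()),
    ("x1", "getitem", ("x", 1)),
]
print([name for name, _, _ in raise_getitems(graph)])
```

After the pass, `x` has no pending `getitem` users once `x0` and `x1` are extracted, so the last use of `x0` is what determines when its buffer can be freed.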

**Results:**
For instance, for the `res2net101_26w_4s` model in the pytorch benchmark, when running with the `aot_eager` backend and `activation_memory_budget=0.4`, the peak memory is
* baseline: 7.73GiB
* with the change: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-06-21 19:57:21 +00:00
04b91a9e43 [SymmMem] Add NVSHMEM Fence support to Triton (#156474)
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
  - Final fence before flag update to ensure all data transfers complete before signaling
  - Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
2025-06-21 18:57:05 +00:00
c06c2569ee [ca] Support TorchDispatchMode via pass through (#156516)
The CA initial trace just proxies nodes without dispatching any ops, so we should hide it from ambient TorchDispatchModes.

In terms of differences with eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
  - accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
  - Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
2025-06-21 18:33:47 +00:00
5f2f343e1e [ca] suggest to disable compiled autograd for trace-time NotImplementedErrors (#156509)
Example:

```python
  File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
  You can disable compiled autograd for this operation by:
  1.  Relocating the unsupported autograd call outside the compiled region.
  2.  Wrapping the unsupported autograd call within a scope that disables compiled autograd.
  3.  Configuring the specific compilation unit to disable compiled autograd.
  4.  Globally disabling compiled autograd at the application's initialization.
```

No duplicate error messages for python side trace-time errors
```python
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
    raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
2025-06-21 18:33:46 +00:00
f1968a5e76 [ca] skip on some PYTORCH_TEST_WITH_DYNAMO=1 autograd tests (#156374)
These aren't supported. Not sure how they passed CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156374
Approved by: https://github.com/jansel
2025-06-21 18:33:38 +00:00
fab85fc5f9 [compile][hierarchical compilation] Release nested_compile_region API (#156449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156449
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-21 15:14:59 +00:00
fb75dea2c1 [logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)
Summary: Discussed internally at https://fburl.com/workplace/v3hllrs9. With coordinate descent tuning enabled, we're missing the dynamo_timed logging.

Test Plan:
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 1 --performance --cold-start-latency`
* tlparse: https://fburl.com/bh2hxw4z
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u88ogw39
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/yqljow6c

Rollback Plan:

Differential Revision: D77053918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156517
Approved by: https://github.com/mengluy0125
2025-06-21 14:17:19 +00:00
a47ca4fc74 Revert "[dynamo] Weblink generation when unimplemented_v2() is called (#156033)" (#156546)
Broke multiple CI jobs: dynamo/test_reorder_logs.py::ReorderLogsTests::test_constant_mutation [GH job link](https://github.com/pytorch/pytorch/actions/runs/15792695433/job/44521220864) [HUD commit link](9de23d0c29)

This reverts commit 9de23d0c29dfac8dc0f6f234bdbcd85a6375fa81.

PyTorch bot revert failed: https://github.com/pytorch/pytorch/pull/156033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156546
Approved by: https://github.com/jansel
2025-06-21 14:10:12 +00:00
d846e21355 Revert "[nativert] move layout planner algorithms to libtorch (#156508)"
This reverts commit eab45643f22e58ee12d95d8b0162d51ca0a50801.

Reverted https://github.com/pytorch/pytorch/pull/156508 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/15793524714/job/44524067679) [HUD commit link](eab45643f2) ([comment](https://github.com/pytorch/pytorch/pull/156508#issuecomment-2993589983))
2025-06-21 13:42:40 +00:00
1cfdcb975a [CUDA] fix illegal memory access in attention (#155397)
Fixes https://github.com/pytorch/pytorch/issues/150054

CI seemed to be messed up in the old one, old PR:
https://github.com/pytorch/pytorch/pull/155145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155397
Approved by: https://github.com/ngimel
2025-06-21 12:32:00 +00:00
cd75cf3cab [symm_mem] Add one side put API for nvshvem (#156443)
`nvshmem_put(Tensor tensor, int peer)`, where `tensor` must be a symmetric tensor, i.e. rendezvoused before this call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156443
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 12:16:36 +00:00
4ff0e033c1 [SymmMem] Add NVSHMEM signal_wait_until support to Triton (#156473)
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
  - Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
  - Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
  - End-to-end validation of both data transfer and signal synchronization

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
2025-06-21 10:55:40 +00:00
8485f19507 remove gso from vector_norm (#156530)
guard_or_false here does the same thing that guard_size_oblivious does. Note that
size is >=0, and this is size-like by definition since it's a tensor size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156530
Approved by: https://github.com/bobrenjc93
2025-06-21 08:42:36 +00:00
6ffa03ef9e [Inductor-CPU] int8 WoQ concat linear (#153004)
### Summary

Adds a concat linear optimization for int8 WoQ GEMM, covering the case where the same activation is applied to 3 sets of weights of the same shape.

### Perf data

GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.

| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373  |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel

Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
2025-06-21 08:40:09 +00:00
35321b2ad6 remove make_fast_binary_impl from make_fast_binary_impl (#156528)
This was added in https://github.com/pytorch/pytorch/pull/133584.
Take the slow path when we can't determine that the fast path is valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156528
Approved by: https://github.com/bobrenjc93
2025-06-21 08:27:54 +00:00
eab45643f2 [nativert] move layout planner algorithms to libtorch (#156508)
Summary: tt

Test Plan:
ci

Rollback Plan:

Differential Revision: D76832891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156508
Approved by: https://github.com/zhxchen17
2025-06-21 07:35:40 +00:00
bf50d71553 Add missing inline namespace CPU_CAPABILITY to Gelu/Elu.h (#156512)
As I recently learned the hard way (#156243), it is necessary to put kernel code that uses Vectorized in headers in this namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156512
Approved by: https://github.com/malfet
2025-06-21 06:26:23 +00:00
e3b44edfd8 [SymmMem] Add NVSHMEM wait_until support to Triton (#156472)
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
  - A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
  - Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`

```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
    nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)

...

if rank == 0:
    print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
    wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
    print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
    print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
    torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
    print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")

if rank == 1:
    print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")

...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉

...

----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
2025-06-21 06:18:31 +00:00
92c79f36db [PGO] frame-specific whitelist logging (#155959)
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).

This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).

The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.
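
The difference between the two logging schemes can be shown with a small sketch. The event format and both helper names are hypothetical; this is not the PGO logging code, only a model of the `L['x']` / `L['y']` example from the summary.

```python
def aggregated_whitelists(events):
    """Old behavior: each frame logs everything seen so far across frames."""
    seen = []
    logged = {}
    for frame, source in events:
        if source not in seen:
            seen.append(source)
        logged[frame] = ",".join(seen)
    return logged


def frame_specific_whitelists(events):
    """New behavior: each frame logs only the sources that turned dynamic in it."""
    per_frame = {}
    for frame, source in events:
        sources = per_frame.setdefault(frame, [])
        if source not in sources:
            sources.append(source)
    return {frame: ",".join(sources) for frame, sources in per_frame.items()}


# (frame id, source that turned dynamic), in the order they were observed.
events = [("0/1", "L['x']"), ("1/1", "L['y']")]
print(aggregated_whitelists(events))
print(frame_specific_whitelists(events))
```

With frame-specific logging, frame 1/1 reports only `L['y']`, which makes it easier to attribute a dynamism change to the frame that caused it.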

Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Rollback Plan:

Differential Revision: D76628834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
2025-06-21 06:15:51 +00:00
9de23d0c29 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-21 05:47:54 +00:00
b8ace6f951 Make dtensor tests device agnostic (#155687)
## MOTIVATION
This PR is a continuation of https://github.com/pytorch/pytorch/pull/154840 and we are trying to make the tests more device agnostic by removing hard coded references to any particular device.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

## CHANGES
1. test_convolution_ops.py:
    - Replace "cuda" with self.device_type
2. test_random_ops.py:
    - Remove setting and using TYPE_DEVICE variable since device_type is set as per the environment (device) in DTensorTestBase class.
    - Replace "cuda" with self.device_type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155687
Approved by: https://github.com/EikanWang, https://github.com/d4l3k
2025-06-21 04:51:59 +00:00
f3ec16c26a [MTIA Aten Backend][3/n] Migrate mm.out from out-of-tree to in-tree (#154393)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate mm.out from out-of-tree to in-tree.

We dispatch mm.out to MTIA separately from CPU/CUDA, so this diff adds the file `MTIAOps.cpp` under `ATen/native/mtia` to hold the dispatched functions. In the future, we can split `MTIAOps.cpp` into per-category ops files.

Differential Revision: [D74743849](https://our.internmc.facebook.com/intern/diff/D74743849/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154393
Approved by: https://github.com/albanD, https://github.com/egienvalue, https://github.com/nautsimon
2025-06-21 04:31:04 +00:00
88b9c285e0 Workaround for e4m2 dtype (#156461)
Found in: https://github.com/pytorch/ao/pull/2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156461
Approved by: https://github.com/vkuzo
2025-06-21 04:01:44 +00:00
554b568040 Add internal use only utility to allow externally visible side effects within HOPs (#155715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155715
Approved by: https://github.com/zou3519
2025-06-21 03:55:28 +00:00
c09b054878 Add runtime profiler info for AOTDispatcher prologue (#155785)
Fixes #155721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155785
Approved by: https://github.com/bdhirsh
2025-06-21 03:34:07 +00:00
fd8ea3c8a3 [symm_mem] Add nccl as a backend for symmetric memory (#155740)
Running unit test:

 TORCH_SYMMMEM=NCCL TORCH_DISTRIBUTED_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO pytest test/distributed/test_nccl.py -k test_nccl_symmem_alloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155740
Approved by: https://github.com/kwen2501
2025-06-21 03:22:23 +00:00
ee56e9f8a8 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we never really build with Eigen, and GitLab can be down while GitHub is up, which has caused spurious CI failures in the past.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
2025-06-21 03:02:02 +00:00
b4228a94d1 Split the exclude pattern for CODESPELL linter (#156229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156229
Approved by: https://github.com/albanD
ghstack dependencies: #156080, #156081
2025-06-21 02:47:40 +00:00
e3507c3777 [BE] fix typos in functorch/ and scripts/ (#156081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156081
Approved by: https://github.com/albanD
ghstack dependencies: #156080
2025-06-21 02:47:40 +00:00
2ccfd14e23 [BE] fix typos in docs/ (#156080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-06-21 02:47:32 +00:00
clr
9aaa184105 dynamo: Don't crash when someone tries to access a non existent list member (#156335)
dynamo: Don't crash when someone tries to access a non existent list member

Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
2025-06-21 02:26:31 +00:00
ac86ec0e60 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-21 01:34:41 +00:00
e98dd95446 [nativert] Move SerialGraphExecutor to PyTorch core (#156459)
Summary: `SerialGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a serial manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D76917966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156459
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-21 01:32:06 +00:00
a67eb1a0d6 [ez] remove unused functions (#156466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156466
Approved by: https://github.com/jingsh
2025-06-21 00:38:34 +00:00
2ee23175d9 [dynamo][guards] Catch exception and return false in the backend match (#156341)
It's difficult to write a test. I found this while debugging a segfault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156341
Approved by: https://github.com/williamwen42
2025-06-21 00:13:26 +00:00
0f0c010714 [c10d] init_process_group supports index-only device id (#156214)
Before:
```
acc = torch.accelerator.current_accelerator()
if acc:
  local_idx = ...
  dist.init_process_group(
    device_id=torch.device(acc.type, local_idx)
  )
```
After:
```
dist.init_process_group(device_id=local_idx)
```

That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-06-21 00:02:37 +00:00
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
0ad88a2224 Support environment var for autotune log (#156254)
Summary: Titled

Test Plan:
See the Sandcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
6098209bff [BE][5/X] Phase out usage of use_max_autotune() (#156269)
These look to be the last call sites using `use_max_autotune(...)`, so remove those and `use_max_autotune(...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269
Approved by: https://github.com/masnesral
2025-06-20 22:37:45 +00:00
5ab257c74c [invoke_subgraph] Make invoke_subgraph cacheable (#156448)
It's unclear to me what happens if the subgraph itself is not cacheable. IMO, there is nothing special about invoke_subgraph that should prevent caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-06-20 21:20:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check if trying to fix it breaks existing tests; if not, will look into why no test coverage caught it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
b8fc5e0c0d skip flaky test in CPython 3.13 tests (#155561)
Changed files:
* test_math.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155561
Approved by: https://github.com/zou3519
2025-06-20 20:25:35 +00:00
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
a69e27ca5a Remove unused MultiKernelCall import from inductor codegen (#156158)
Since it's now actually used within async_compile.multi_kernel

```
    def multi_kernel(self, *args, **kwargs) -> Any:
        from torch._inductor.codegen.multi_kernel import MultiKernelCall

        # no need to call this in parallel since the sub-kernels are already parallel tasks
        return MultiKernelCall(*args, **kwargs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-20 19:55:24 +00:00
e5ea24fb27 [nativert] Move auto_functionalize_kernel (#156454)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Copied from original auto_functionalize Diff Summary D53776805:

This is a non-functional kernel implementation for auto_functionalize

In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.

This would mutate the input tensors in place, which is unsafe in general.

However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.

In the proper functional implementation, it will

- make a clone of the mutating input tensor
- return these new tensor instances as the AutoFunctionalizeKernel output.

If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.

Test Plan: See internal for test plan

Differential Revision: D76926383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
2025-06-20 19:53:16 +00:00
eb331b59fe Add shim fallback for narrow (#156496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156496
Approved by: https://github.com/albanD
2025-06-20 19:47:00 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
ef6d2cee7a [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation for adding integer addmm, move the matmul computation part into the matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-20 18:54:38 +00:00
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update to index.md to add language for Stable and Unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
502486d946 [PT2]Add weight and constant config path template (#156359)
Summary: At title.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D76925510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156359
Approved by: https://github.com/SherlockNoMad
2025-06-20 18:46:01 +00:00
4b6cbf528b Add C shim fallback for fill_ (#156245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156245
Approved by: https://github.com/desertfire
2025-06-20 18:45:48 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
d309cd1d50 Revert "[BE][MPS] Refactor core matmul logic into matmul_core (#155969)"
This reverts commit 769d754ab2469813a3b790ec58c25c466099dd3d.

Reverted https://github.com/pytorch/pytorch/pull/155969 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155969#issuecomment-2992502683))
2025-06-20 18:40:38 +00:00
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by unrepresentative example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, there is a quick fix for this specific case. When we compile LayerNormBackward with small xnumel and large rnumel, we do a split reduction. With unrepresentative inputs, rnumel may be around 4K and we pick a small num-split (9 in this specific case). Later, when we get an input with a larger rnumel (in the 100K range, with no recompile because dynamic shapes are enabled), the small num-split does not introduce enough parallelism and causes sub-optimal performance.

The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnumel, rnumel] into two kernels in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduces more parallelism for kernel 1. It may result in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), each row in kernel 2 can still be reduced by a few warps or even a single warp, so kernel 2 may not slow down.
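
To make the two-kernel shape arithmetic concrete, here is a minimal pure-Python sketch of a split reduction over the reduction dimension. It is only an illustration of the reshaping described above, not Inductor's generated code; the function name `split_reduction_sum` and the requirement that `num_split` evenly divides rnumel are assumptions made for this example.

```python
def split_reduction_sum(rows, num_split):
    """Sum each row of a [xnumel, rnumel] input in two stages.

    Illustrative only: `rows` is a list of xnumel lists of length rnumel,
    and num_split is assumed to evenly divide rnumel.
    """
    xnumel = len(rows)
    rnumel = len(rows[0])
    assert rnumel % num_split == 0, "sketch assumes an even split"
    chunk = rnumel // num_split

    # Kernel 1: view the input as [xnumel * num_split, rnumel / num_split]
    # and reduce each short row. More splits means more independent rows.
    partial = [
        sum(row[s * chunk:(s + 1) * chunk])
        for row in rows
        for s in range(num_split)
    ]

    # Kernel 2: view the partial sums as [xnumel, num_split] and reduce again.
    return [
        sum(partial[x * num_split:(x + 1) * num_split])
        for x in range(xnumel)
    ]


rows = [[i + j for j in range(12)] for i in range(3)]
result = split_reduction_sum(rows, num_split=4)
```

With integer inputs the two-stage result matches a direct row sum exactly; with floating-point inputs the summation order differs, so small rounding differences are expected.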

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimics what we see in the original problem.

- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms
- if we compile with small input but benchmark for large inputs, we get more than 2x slowdown. latency is 8.32ms
- with the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case. So it's possible that the heuristic to decide num-split has room to improve

The minimum num-split restriction could be applied for dynamic shape case solely, but I found it can also help for static shape cases a little bit. So I plan to apply it without checking dynamic shape for now unless I see red signals in thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improves perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
from the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for static shape case. I think one reason is the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change, the bernoulli kernel would NOT be able to fuse with the first-layer reduction because 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick a num-split of at least 256 that evenly divides rnumel. But I'll simply apply the change for dynamic shapes for now since that's the original issue.
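
The divisibility point can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `pick_num_split`, that searches for the smallest value of at least 256 that evenly divides rnumel; it only illustrates the possible improvement mentioned above and is not what the PR implements.

```python
def pick_num_split(rnumel, min_num_split=256):
    """Hypothetical helper: smallest divisor of rnumel that is >= min_num_split.

    Falls back to rnumel itself (one element per split) if nothing smaller fits.
    """
    for candidate in range(min_num_split, rnumel + 1):
        if rnumel % candidate == 0:
            return candidate
    return rnumel


rnumel = 8000 * 3
remainder = rnumel % 256          # non-zero, so 256 does not split rnumel evenly
num_split = pick_num_split(rnumel)
```

A linear search is fine for a sketch; a real heuristic would also have to bound how large the chosen value may grow.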

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
55dae0bf7a Add a basic shim and stable::Tensor is_contiguous API (#156228)
Add a limited is_contiguous in shim, stable::Tensor API with a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156228
Approved by: https://github.com/desertfire
2025-06-20 17:59:52 +00:00
49ee1e7106 [CI] Reuse old whl: loosen check for deleted files, do not handle renames (#156138)
Make the check for deleted files apply only to files in the torch folder, since docs-only changes could not get through this check.
Use `--no-renames` to make both the old name and the new name show up in the diff. Without it, I think only the new name shows up in git diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156138
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/cyyever
2025-06-20 17:58:04 +00:00
e31f205292 [Inductor] Adjust boundary checking of dimensions using YBLOCK (#149504)
Apply the same logic introduced in https://github.com/pytorch/pytorch/pull/139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.
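
The launch arithmetic behind this can be sketched as follows. This is a simplified model written for illustration, assuming the overflow beyond the maximum y grid size is folded into a second grid dimension; the function name `y_launch_plan` and its return layout are hypothetical and do not mirror Inductor's actual grid code.

```python
def y_launch_plan(ynumel, yblock, max_y_grids=65535):
    """Simplified model of launching y blocks under a per-dimension grid cap.

    Returns (needed_blocks, grid_y, grid_z, redundant_programs).
    """
    needed = -(-ynumel // yblock)          # ceil(ynumel / YBLOCK)
    grid_y = min(needed, max_y_grids)
    grid_z = -(-needed // max_y_grids)     # extra dimension absorbs the overflow
    launched = grid_y * grid_z
    # Programs beyond `needed` have no valid data and must be masked out,
    # otherwise they would read or write out of bounds.
    return needed, grid_y, grid_z, launched - needed


plan_overflow = y_launch_plan(ynumel=10, yblock=1, max_y_grids=4)
plan_exact = y_launch_plan(ynumel=8, yblock=1, max_y_grids=4)
```

In this model, redundant programs appear exactly when the block count exceeds the cap and is not a multiple of it, which is the case where a boundary check is still needed.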

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149504
Approved by: https://github.com/blaine-rister

Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
2025-06-20 17:43:38 +00:00
d83ff89d3b Add toggle functionality for XPU profiler (#155135)
Fixes #154898 by adding the ability to toggle the XPU profiler on and off (which has already been added in pytorch/kineto#1088)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155135
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-20 17:27:48 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we never really build with Eigen, and GitLab can be down while GitHub is up, which has caused spurious CI failures in the past.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
63360e64da [BE][Easy] do not install yanked types-pkg-resources in lint environment (#156462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462
Approved by: https://github.com/ezyang
2025-06-20 16:00:43 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
b4442f42a9 Revert "Upgrade to DLPack 1.0. (#145000)"
This reverts commit 6e185c53124e1b5a0fe391959060c1249178bcb6.

Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400))
2025-06-20 15:32:47 +00:00
edd45f3a02 Revert "[Precompile] Hook up backend="inductor" (#155387)"
This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840.

Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](2c68c3e8d5) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))
2025-06-20 15:30:04 +00:00
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
24dc33b37b [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-20 07:03:29 +00:00
537b0877a8 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-20 07:03:16 +00:00
2c372a0502 [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-20 07:03:07 +00:00
b46eb1ccaf [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-20 07:02:57 +00:00
2c68c3e8d5 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-20 06:38:29 +00:00
d5b4a32960 [BE] fix PYPROJECT linting errors in test/ and tools/ (#156021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156021
Approved by: https://github.com/Skylion007
2025-06-20 06:19:05 +00:00
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
The backward pass simply iterates over all 8 points the current point contributed to and back-propagates to them with the respective weights
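
As a rough illustration of that loop, the pure-Python sketch below scatters one gradient value to the 8 surrounding grid points using trilinear weights. It is not the Metal kernel from this PR: the function name `trilinear_backward_point`, the dict-based gradient buffer, and the absence of boundary clamping and scale handling are all simplifying assumptions.

```python
import math
from itertools import product


def trilinear_backward_point(grad_out, x, y, z, grad_in):
    """Accumulate grad_out into the 8 corners around the sample (x, y, z).

    grad_in is a dict keyed by integer (x, y, z) coordinates; no clamping
    at the volume boundary is done in this sketch.
    """
    x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - x0, y - y0, z - z0
    for dx, dy, dz in product((0, 1), repeat=3):
        # Each corner's weight is the product of per-axis linear weights.
        w = ((fx if dx else 1.0 - fx)
             * (fy if dy else 1.0 - fy)
             * (fz if dz else 1.0 - fz))
        key = (x0 + dx, y0 + dy, z0 + dz)
        grad_in[key] = grad_in.get(key, 0.0) + w * grad_out


grad_in = {}
trilinear_backward_point(2.0, 0.25, 0.5, 0.75, grad_in)
```

The 8 weights sum to one, so the accumulated gradients add up to the incoming gradient value.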

TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point in unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
c37ddcaefb Fix torchgen update-aoti-shim (#156323)
will remove the fill changes before landing and let Jane merge her changes!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156323
Approved by: https://github.com/janeyx99
2025-06-20 05:23:06 +00:00
f7a5ad6c29 [Inductor][CPP] Fix WOQ int4 accuracy issue when NC larger than one (#156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
2025-06-20 03:08:02 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatch between `fft_r2c` on XPU and Inductor meta deducing.
Original `fft_r2c` Inductor meta deducing for XPU backend is aligned with CPU (fallback). This PR is to correct the Inductor meta deducing and update the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling entry and kernel separately has a non-negligible impact on the performance. This PR is to add an option for cpp_wrapper to control whether to compile entry and kernel separately, and turn it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
cyy
3c2324c64a [2/N] Fix cppcoreguidelines-init-variables suppression (#146237)
This PR removes all `cppcoreguidelines-init-variables` suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237
Approved by: https://github.com/ezyang
2025-06-19 23:26:42 +00:00
52f873adc2 Add logging for async compile worker statistics (#155820)
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued and dequeued (should be the same) as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).

Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
2025-06-19 23:10:15 +00:00
c60d8188d2 [nativert] Move GraphExecutorBase to PyTorch core (#156196)
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph or the weights. This is introduced to decouple the state management of the top-level runtime from the kernel executions so that subgraphs from higher-order ops can be supported.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
CI

Rollback Plan:

Differential Revision: D76830436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
2025-06-19 22:42:35 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
3644b41a7c [ONNX] Note on attention op symbolic function (#156441)
Follow-up to https://github.com/pytorch/pytorch/pull/156367
Explains why num_heads is still provided even though the ONNX Attention op does not need it in the torch case. See the thread: https://github.com/pytorch/pytorch/pull/156367#discussion_r2155727038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156441
Approved by: https://github.com/justinchuby
2025-06-19 21:00:05 +00:00
443b5b43c3 xpu: fix AOT compilation in sycl cpp extension (#156364)
This commit fixes AOT compilation in the sycl cpp extension, which was accidentally dropped in aca2c99a652 (causing a fallback to JIT compilation). It also fixes the override logic for default sycl targets, allowing targets to be specified externally. Further, it extends test coverage to cover this case and fixes an issue in the test where subsequent tests executed the same (first) compiled extension due to name conflicts.

Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")

CC: @pengxin99, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
2025-06-19 20:11:38 +00:00
d32deb664a [c10d] Disable NCCL NVLS when using deterministic mode (#156381)
via setting env `NCCL_ALGO=^NVLS`.

Note that this setting must be made before the first NCCL init. Otherwise, it won't take effect.
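
As a rough illustration of the ordering constraint, the variable can be set by hand before any NCCL communicator is created. This is a minimal sketch, not the change made by the PR; the helper name `disable_nvls` is hypothetical, and it simply overwrites any existing `NCCL_ALGO` value.

```python
import os


def disable_nvls(env=None):
    """Exclude the NVLS algorithm via NCCL_ALGO (hypothetical helper).

    Must run before the first NCCL init in the process; afterwards the
    setting is not picked up.
    """
    env = os.environ if env is None else env
    env["NCCL_ALGO"] = "^NVLS"  # "^" excludes the listed algorithm
    return env
```

In a training script this would be called before `torch.distributed.init_process_group(...)`.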

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156381
Approved by: https://github.com/ngimel
2025-06-19 20:09:24 +00:00
69f2e09cc2 Add more shards to H100 benchmark, and also run it more frequently (#156429)
There are 32 H100 `linux.aws.h100` runners, and more than half of them still sit idle, so we could add more shards to finish the whole suite within 4 hours.  I added 1 more shard for `TIMM` and 3 more for `TorchBench`, using the durations from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090

With this computing power, we could also run the whole suite every 4 hours now.  I could run this less frequently later if I see queueing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
2025-06-19 20:02:56 +00:00
aac0e8f0e9 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using `ninja flash_attention`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 20:02:38 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c0ee01c2fb tools/nightly.py: only download torch via pip and install dependencies via uv (#156409)
Setup time (cpu-only): 70s -> 27.6s -> 17.4s

The tool can set up the pinned NVIDIA dependencies correctly:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.13" && source venv/bin/activate
make setup-env PYTHON="/home/linuxbrew/.linuxbrew/bin/python3.13" NIGHTLY_TOOL_OPTS="pull --cuda"
make[1]: Entering directory '/home/PanXuehai/Projects/pytorch'
/home/linuxbrew/.linuxbrew/bin/python3.13 tools/nightly.py pull --cuda
log file: /home/PanXuehai/Projects/pytorch/nightly/log/2025-06-19_21h16m16s_94cd1471-4d0f-11f0-b120-b88584c06696/nightly.log
Creating virtual environment
Removing existing venv: /home/PanXuehai/Projects/pytorch/venv
Creating venv (Python 3.13.4): /home/PanXuehai/Projects/pytorch/venv
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - uv
  - pip
  - setuptools
  - packaging
  - wheel
  - build[uv]
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting uv
  Using cached f2e96cec5e/uv-0.7.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
Requirement already satisfied: pip in ./venv/lib/python3.13/site-packages (25.1.1)
Collecting setuptools
  Using cached 17031897da/setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Collecting packaging
  Using cached 38679034af/packaging-25.0-py3-none-any.whl (66 kB)
Collecting wheel
  Using cached 87f3254fd8/wheel-0.45.1-py3-none-any.whl (72 kB)
Collecting build[uv]
  Using cached 80633736cd/build-1.2.2.post1-py3-none-any.whl (22 kB)
Collecting pyproject_hooks (from build[uv])
  Using cached 12818598c3/pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: wheel, uv, setuptools, pyproject_hooks, packaging, build
Successfully installed build-1.2.2.post1 packaging-25.0 pyproject_hooks-1.2.0 setuptools-80.9.0 uv-0.7.13 wheel-0.45.1
Installing packages took 6.251 [s]
Creating virtual environment took 9.050 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cu128): torch
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1040.3 MB)
Saved /tmp/pip-download-xeqmhrww/torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Successfully downloaded torch
Downloaded 1 file(s) to /tmp/pip-download-xeqmhrww:
  - torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Downloading packages took 6.284 [s]
Unpacking wheel file
Unpacking to: /tmp/wheel-kugk2os0/torch-2.8.0.dev20250619+cu128...OK
Unpacking wheel file took 15.107 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - filelock
  - typing-extensions>=4.10.0
  - setuptools; python_version >= "3.12"
  - sympy>=1.13.3
  - networkx
  - jinja2
  - fsspec
  - nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-runtime-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-cupti-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cudnn-cu12==9.10.2.21; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cublas-cu12==12.8.4.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufft-cu12==11.3.3.83; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-curand-cu12==10.3.9.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusolver-cu12==11.7.3.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparse-cu12==12.5.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparselt-cu12==0.7.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvshmem-cu12==3.2.5; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvtx-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufile-cu12==1.13.1.3; platform_system == "Linux" and platform_machine == "x86_64"
  - pytorch-triton==3.3.1+gitc8757738; platform_system == "Linux"
  - numpy
  - cmake
  - ninja
  - packaging
  - ruff
  - mypy
  - pytest
  - hypothesis
  - ipython
  - rich
  - clang-format
  - clang-tidy
  - sphinx
Using Python 3.13.4 environment at: venv
Resolved 78 packages in 2.95s
Installed 76 packages in 93ms
 + alabaster==1.0.0
 + asttokens==3.0.0
 + attrs==24.2.0
 + babel==2.17.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + clang-format==20.1.6
 + clang-tidy==20.1.0
 + cmake==3.25.0
 + decorator==5.2.1
 + docutils==0.21.2
 + executing==2.2.0
 + filelock==3.18.0
 + fsspec==2025.5.1
 + hypothesis==6.135.11
 + idna==3.10
 + imagesize==1.4.1
 + iniconfig==2.1.0
 + ipython==9.3.0
 + ipython-pygments-lexers==1.1.1
 + jedi==0.19.2
 + jinja2==3.1.6
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib-inline==0.1.7
 + mdurl==0.1.2
 + mpmath==1.3.0
 + mypy==1.16.1
 + mypy-extensions==1.0.0
 + networkx==3.5
 + ninja==1.11.1.4
 + numpy==2.3.0
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.3
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.2.5
 + nvidia-nvtx-cu12==12.8.90
 + parso==0.8.4
 + pathspec==0.12.1
 + pexpect==4.9.0
 + pluggy==1.6.0
 + prompt-toolkit==3.0.51
 + ptyprocess==0.7.0
 + pure-eval==0.2.3
 + pygments==2.19.1
 + pytest==8.4.1
 + pytorch-triton==3.3.1+gitc8757738
 + requests==2.32.3
 + rich==14.0.0
 + roman-numerals-py==3.1.0
 + ruff==0.12.0
 + snowballstemmer==3.0.1
 + sortedcontainers==2.4.0
 + sphinx==8.2.3
 + sphinxcontrib-applehelp==2.0.0
 + sphinxcontrib-devhelp==2.0.0
 + sphinxcontrib-htmlhelp==2.1.0
 + sphinxcontrib-jsmath==1.0.1
 + sphinxcontrib-qthelp==2.0.0
 + sphinxcontrib-serializinghtml==2.0.0
 + stack-data==0.6.3
 + sympy==1.14.0
 + traitlets==5.14.3
 + typing-extensions==4.14.0
 + urllib3==2.2.3
 + wcwidth==0.2.13
Installing packages took 3.080 [s]
Installing dependencies took 3.080 [s]
Pulling nightly PyTorch
Found released git version 5622038e20ddb12b9a011c9a9128190d71a21cba
Found nightly release version 2625c70aecc6eced1dbe108279feab7509733bef
Already up to date.
Pulling nightly PyTorch took 0.017 [s]
Moving nightly files into repo
Moving nightly files into repo took 4.898 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.021 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /home/PanXuehai/Projects/pytorch/venv/bin/activate

make[1]: Leaving directory '/home/PanXuehai/Projects/pytorch'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156409
Approved by: https://github.com/ezyang
ghstack dependencies: #156408
2025-06-19 19:42:15 +00:00
71faa7e5b9 tools/nightly.py: use uv pip install instead of pip install (#156408)
Setup time: 70s -> 27.6s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156408
Approved by: https://github.com/ezyang
2025-06-19 19:42:15 +00:00
134dfb3fe6 [dynamo] Fix cycle reference problem caused by recursive collect_temp_source in codegen (#155791)
Recursive function collect_temp_source with closure in PyCodegen caused cycle reference issue when torch.compile is used.
This issue may prevent large tensors from being freed promptly even when there are no user references to them.

We saw OOM issues because of this problem in many cases including training and inference using torch.compile.
The fix is to use iterative function implementation to replace the recursive function implementation.
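
The effect can be reproduced outside of dynamo. The sketch below is not the `PyCodegen` code; `Payload`, `recursive_collect`, `iterative_collect` and `payload_outlives_call` are hypothetical names, and the behavior shown relies on CPython reference counting. A nested function that calls itself is stored in its own closure cell, so the function and everything else it captures stay alive until the cycle collector runs.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large tensor."""


def recursive_collect(payload, tree):
    seen = []

    def visit(node):
        # `visit` refers to itself through a closure cell, forming a cycle
        # (function -> cell -> function) that also pins `payload` and `seen`.
        seen.append((payload, node))
        for child in node:
            visit(child)

    visit(tree)
    return len(seen)


def iterative_collect(payload, tree):
    # Same traversal with an explicit stack: no closure, hence no cycle.
    seen = []
    stack = [tree]
    while stack:
        node = stack.pop()
        seen.append((payload, node))
        stack.extend(node)
    return len(seen)


def payload_outlives_call(fn):
    """Return True if the payload is still alive after the last user reference is dropped."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cycle collector from hiding the effect
    try:
        payload = Payload()
        ref = weakref.ref(payload)
        fn(payload, [[], [[]]])
        del payload
        return ref() is not None
    finally:
        gc.collect()
        if was_enabled:
            gc.enable()
```

On CPython, `payload_outlives_call(recursive_collect)` reports that the payload survives, while the iterative version releases it as soon as the caller drops its reference.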

Fixes #155778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
2025-06-19 19:37:44 +00:00
e4c9f6d9a2 [nativert] Move c10_kernel (#156208)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:c10_kernel_test
```

Differential Revision: D76825830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
2025-06-19 17:36:23 +00:00
f402eed4d9 [ROCm] Enable BF16 NCHW Mixed batchnorm on MIOpen if ROCm>=6.4 (#154611)
This PR enables MIOpen for BF16 NCHW Mixed batchnorm if MIOpen version >=3.4 (ROCm >= 6.4)

CUDAHooks::versionMIOpen() was added to detect MIOpen version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154611
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-19 17:22:37 +00:00
085f270a00 [ROCm] Enable more parallelism for multi-dimensional reductions (#155806)
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-06-19 17:19:40 +00:00
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
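
The dedup idea can be sketched without torch. This is an illustration under stated assumptions, not the packaging code from the PR: `memoryview` objects stand in for tensors, the identity of the underlying buffer stands in for `tensor.untyped_storage()`, and the `plan_weight_files` helper and `weight_N.bin` file names are hypothetical.

```python
def plan_weight_files(weights):
    """Map each weight name to a file name, sharing files between weights
    that are backed by the same underlying buffer (hypothetical helper)."""
    storage_to_file = {}
    config = {}
    for name, view in weights.items():
        key = id(view.obj)  # identity of the backing buffer
        if key not in storage_to_file:
            storage_to_file[key] = f"weight_{len(storage_to_file)}.bin"
        config[name] = storage_to_file[key]
    return config


shared = bytearray(16)
weights = {
    "encoder.embed": memoryview(shared),
    "decoder.embed": memoryview(shared),  # shares storage with the encoder
    "head.weight": memoryview(bytearray(16)),
}
config = plan_weight_files(weights)
```

The resulting mapping plays the role of `weights_config.json`: two names that share storage point at one file, so the data is written only once.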

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
6e185c5312 Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X
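
The version checks above can be summarized in a few lines. This is a simplified sketch of the decision logic, not the C++ implementation; the function names are hypothetical, and versions are modeled as `(major, minor)` tuples. The capsule names `dltensor` and `dltensor_versioned` are the ones used by the DLPack protocol.

```python
def select_capsule_name(max_version=None):
    """Producer side: pick which kind of capsule to export."""
    # Fall back to the legacy struct if the consumer did not pass
    # max_version or only understands versions below 1.0.
    if max_version is None or max_version[0] < 1:
        return "dltensor"  # DLManagedTensor
    return "dltensor_versioned"  # DLManagedTensorVersioned


def can_consume(capsule_version, supported_major=1):
    """Consumer side: accept capsules of version up to 1.X."""
    major, _minor = capsule_version
    return major <= supported_major
```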

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
6eb6f198e1 update codebase structure documentation to include mps (#156297)
📚 The doc update

adding description about mps folder in code structure guide

@albanD @malfet @svekars @sekyondaMeta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156297
Approved by: https://github.com/ezyang
2025-06-19 16:16:29 +00:00
7f0cddfb55 [dynamo] Add documentation for guard_filter_fn (#156114)
Summary: Adding a section of doc for guard_filter_fn.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76756743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156114
Approved by: https://github.com/jansel
2025-06-19 16:13:12 +00:00
c9afcffed0 [AOTInductor] Call most runtime fallback ops without calling into Python (#154142)
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python.  This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.

Fixes #150988
Fixes #153478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
2025-06-19 15:27:15 +00:00
317af4c87b Revert "[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)"
This reverts commit a5f59cc2eab3a5201712c52fe48c268357ba4f3c.

Reverted https://github.com/pytorch/pytorch/pull/156140 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/156140#issuecomment-2988441548))
2025-06-19 15:09:29 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
fec8af8b98 [bugfix] [build] guard cuda version for ipc with fabric handle (#156394)
https://github.com/pytorch/pytorch/pull/156074 adds support for IPC with fabric handles, but the code cannot compile for CUDA < 12.3 (in particular, e.g. CUDA 11.8).

This PR improves the support by adding compile-time checks against the CUDA version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156394
Approved by: https://github.com/ngimel
2025-06-19 13:54:01 +00:00
769d754ab2 [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation of adding integer addmm, move matmul computation part into matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-19 13:22:41 +00:00
8cb0c4a4da [Intel GPU][AOTI] Add xpu mkldnn ops support for AOTInductor. (#154586)
This PR is closely related to the previous one in the stack (https://github.com/pytorch/pytorch/pull/150287). The previous PR enabled MKLDNN ops for XPU, which caused several test cases to fail in test_aot_inductor.py. This PR addresses those failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154586
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #150287
2025-06-19 13:17:22 +00:00
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |
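
The `Speedup` column appears to be the ratio of the two Inductor speedup columns (baseline divided by compared), expressed as a percentage. That reading is an inference from the numbers, not something the PR states; the short check below reproduces it.

```python
# (baseline, compared) Inductor speedups copied from the table above.
rows = {
    "Huggingface": (2.134838, 2.125740314),
    "Torchbench": (1.808558, 1.675100479),
    "Timm": (2.343893, 2.070476653),
}


def speedup_percent(baseline, compared):
    # Assumed definition: baseline / compared, as a percentage.
    return 100.0 * baseline / compared


for suite, (baseline, compared) in rows.items():
    print(f"{suite}: {speedup_percent(baseline, compared):.2f}%")
```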

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

with libtorch docker added, we can add the build script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
ccb1f687d6 Port two dynamo test cases for Intel GPU (#156056)
For https://github.com/pytorch/pytorch/issues/114850, we will port more cases to Intel GPU. This PR covers 2 dynamo cases. We adopted `torch.accelerator.current_accelerator()` to determine the backend, added XPU support in decorators like @requires_gpu, and enabled XPU for some test paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156056
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-06-19 12:49:04 +00:00
a8fe982993 Revert "[build] Create target for flash attention (#156235)"
This reverts commit 6d02321472ee0761092166dd273eb3ec386cf0c0.

Reverted https://github.com/pytorch/pytorch/pull/156235 on behalf of https://github.com/ZainRizvi due to Weird, but seems to have broken trunk: test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/15748768079/job/44390494621) [HUD commit link](6d02321472) ([comment](https://github.com/pytorch/pytorch/pull/156235#issuecomment-2987784207))
2025-06-19 11:47:27 +00:00
4da98351b9 [SymmMem] Add NVSHMEM PUT with Signal support to Triton (#156211)
Adds NVSHMEM PUT with Signal operation support for Triton kernels:

- Added `putmem_signal_block` core.extern wrapper for nvshmemx_putmem_signal_block
- Added kernel for 2-rank PUT operation with atomic SET signaling (`test_triton_put_signal_set`)
- Added kernel for 2-rank PUT operation with atomic ADD signaling (`test_triton_put_signal_add`)

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_set`
`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_add`

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_set(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                         signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    val = 11
    inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(val)
    out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)  # destination buffer

    # Signal flag buffer - starts at 0, will be set to 1 upon completion
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_SET = 0  # atomic set operation
    SIGNAL_VAL = 1  # completion signal value

    if rank == 0:
        # Rank 0 atomically: (1) puts data to rank 1, (2) sets rank 1's flag to 1
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                    signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_SET,
                                    peer=peer, extern_libs=nvshmem_lib)

    dist.barrier()
    # Rank 1 can check flag to know data transfer completed!
    print(f"[Rank {rank}] inp buffer: {inp}")
    print(f"[Rank {rank}] out buffer: {out}")
    print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([1], device='cuda:1')

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

Working as expected! Data is received, and flag set to 1 for completion signal!

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_add(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                          signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    # Signal buffer (uint64 flag)
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_ADD = 5  # atomic add operation
    SIGNAL_VAL = 16  # Signal value to add

    if rank == 0:
        # Rank 0 puts into Rank 1 and adds to signal
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                     signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_ADD,
                                     peer=peer, extern_libs=nvshmem_lib)

    dist.barrier()
    print(f"[Rank {rank}] inp buffer: {inp}")
    print(f"[Rank {rank}] out buffer: {out}")
    print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([16], device='cuda:1')

----------------------------------------------------------------------
Ran 1 test in 17.145s

OK
```

The flag transition from [0] → [16] confirms both data delivery and atomic signal completion in a single operation!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156211
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-19 10:24:30 +00:00
348e2a76df s/defer_runtime_assert/guard_or_defer_runtime_assert (#156397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156397
Approved by: https://github.com/laithsakka
2025-06-19 10:18:28 +00:00
02080c2cd9 Fix num_heads inference in ONNX Attention-23 exporter (#156367)
Fixes issue in torch-onnx exporter for Attention: https://github.com/pytorch/pytorch/issues/156105

Previously, the number-of-heads attribute inferred by the exporter was incorrect. It should be read from input dimension -3, not dimension 3:

![image](https://github.com/user-attachments/assets/26f10e15-bc98-42ac-807a-2e089a7d996a)
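
The difference between the two indices is easy to see on a 4D query shape. This is a small illustration with made-up sizes, assuming the `(batch, num_heads, seq_len, head_size)` layout that torch sdpa expects; `infer_num_heads` is a hypothetical helper, not the exporter's function.

```python
def infer_num_heads(q_shape):
    # Count from the end so the result does not depend on leading dims.
    return q_shape[-3]


q_shape = (2, 8, 128, 64)  # (batch, num_heads, seq_len, head_size)
wrong = q_shape[3]  # picks head_size instead of num_heads
right = infer_num_heads(q_shape)
```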

But in fact, [torch sdpa](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) doesn't support combined num_heads and head_size dimensions like [ONNX](https://onnx.ai/onnx/operators/onnx__Attention.html) does, so this num_heads attribute is not needed.

Extending support to rank>4 can be left as future work if there is a use case for that. The translation logic would look like: Reshape(Q,K,V to 4d) -> Attention -> Reshape(Y to original rank).
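
The shape bookkeeping for such a translation might look as follows. Since the PR leaves this as future work, the sketch is speculative: both helper names are hypothetical, and it assumes that all dimensions before the last three are batch dimensions that can be folded together.

```python
import math


def flatten_to_4d(shape):
    """Fold all leading batch dims into one: (..., H, S, D) -> (B, H, S, D)."""
    if len(shape) < 4:
        raise ValueError("expected rank >= 4")
    *batch, heads, seq, dim = shape
    return (math.prod(batch), heads, seq, dim)


def restore_rank(original_shape, y_4d_shape):
    """Undo the fold for the attention output Y."""
    return tuple(original_shape[:-3]) + tuple(y_4d_shape[1:])
```

For example, a rank-5 input of shape `(2, 3, 8, 128, 64)` would be reshaped to `(6, 8, 128, 64)` before the Attention op and restored afterwards.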

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156367
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-06-19 09:40:01 +00:00
8fcda2c60d [SymmMem] Add runtime detection of NVSHMEM (#156291)
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`
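
One way such a check could feed into backend selection is sketched below. The precedence shown here is an assumption, not the logic from the PR: an explicit `TORCH_SYMMMEM` value wins, otherwise NVSHMEM is chosen when it is detected. The `pick_symm_mem_backend` helper is hypothetical, and the availability flag is passed in so the example runs without torch.

```python
def pick_symm_mem_backend(env, nvshmem_available):
    """Choose a SymmetricMemory backend name (hypothetical helper)."""
    requested = env.get("TORCH_SYMMMEM")
    if requested in ("CUDA", "NVSHMEM"):
        return requested  # explicit user choice takes priority (assumption)
    # No explicit choice: fall back to runtime detection.
    return "NVSHMEM" if nvshmem_available else "CUDA"
```

In real code the flag would come from `torch.distributed._symmetric_memory.is_nvshmem_available()`.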

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
2025-06-19 08:26:11 +00:00
eabf7cd3c5 [export] update docs for Dims (#156262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156262
Approved by: https://github.com/angelayi
2025-06-19 06:25:21 +00:00
ec0276103f [PGO] fix whitelist scalar bug (#156194)
Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D76830552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156194
Approved by: https://github.com/bobrenjc93
2025-06-19 05:51:21 +00:00
1c960c5638 [Makefile] lazily setup lintrunner on first make lint run (#156058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156058
Approved by: https://github.com/ezyang
2025-06-19 05:43:35 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
ce8180a61d [c10d] Disable stack trace call in logging (#156362)
Summary: We noticed `std::future_error: Broken promise` errors in logging, so we disable the stack trace call for now and will investigate further.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76929722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
2025-06-19 05:11:57 +00:00
a21806f038 [ez][export] Better error message for schema check in torch.export.load (#156361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156354

torch.export.load() only supports files generated by torch.export.save()

Test Plan:
CI

Rollback Plan:

Differential Revision: D76928725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156361
Approved by: https://github.com/zhxchen17
2025-06-19 04:50:56 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
3bec588bf5 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).

The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that tensor constructors present in the code will be lifted as constants, something we will address separately.

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
2025-06-19 03:47:41 +00:00
6d02321472 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using ninja flash_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 03:35:04 +00:00
77518d1a13 [CI] fix xpu-smi hang in XPU test container (#156171)
Apply the same fix as #155443 to the XPU test container; see https://github.com/pytorch/pytorch/actions/runs/15589866881/job/43907973867#step:15:911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156171
Approved by: https://github.com/huydhn
2025-06-19 02:48:11 +00:00
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to stay close to torch.save/load semantics, but today async checkpointing breaks shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict with a CUDA tensor in a datatype; it is deep-cloned, causing the staging tensor to be created on the GPU. This can cause OOMs that are hard to debug.

This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing storages created for staging to avoid recreating them on each checkpoint, while also being flexible enough to handle any changes: clean up old storages or create new ones as needed.

Lifetime of staging storages is tied to the original storage object. When the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify that the underlying storage object has the same shape/size, etc. to make the caching logic work. The current implementation is much simpler and cleaner.
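
The lifetime rule described above can be illustrated with a small pure-Python sketch. It is not the DCP implementation: `FakeStorage`, `StagingCache`, and the use of `weakref.finalize` are assumptions made for illustration, and `id()` stands in for the storage's `data_ptr`.

```python
import gc
import weakref


class FakeStorage:
    """Hypothetical stand-in for a tensor storage (not a PyTorch class)."""

    def __init__(self, data):
        self.data = list(data)


class StagingCache:
    """Keeps one staging copy per live source storage."""

    def __init__(self):
        self._cache = {}

    def stage(self, storage):
        key = id(storage)  # assumption: stands in for storage.data_ptr()
        if key not in self._cache:
            self._cache[key] = FakeStorage(storage.data)
            # Drop the staging copy once the original storage is collected.
            weakref.finalize(storage, self._cache.pop, key, None)
        staged = self._cache[key]
        staged.data[:] = storage.data  # refresh contents, reuse the buffer
        return staged

    def __len__(self):
        return len(self._cache)


cache = StagingCache()
src = FakeStorage([1, 2, 3])
first = cache.stage(src)
second = cache.stage(src)   # reuses the cached staging storage
assert first is second and len(cache) == 1

del src                     # the original storage goes away...
gc.collect()
assert len(cache) == 0      # ...and its staging entry is evicted
```

Keying on identity keeps the cache logic short, which matches the trade-off stated above; an FQN-keyed cache would additionally need shape and size checks.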

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
with staging_context(stager):
    cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternative here is to pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug it as needed, so we would need to unpickle in the background process. I think this is flexible, not performant, and not very different from the current solution, but needs more code. One idea, if we really want to address this, is to stick the original device in some variable on the storage and then use it to recover on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.

Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like _clear_non_serializable_cached_data(), or other logic. Would like thoughts on which logic should be copied, or whether everything should be.
3. If any object implements deepcopy, we will not be able to handle any tensors in its attrs with this logic, because it likely just calls copy.deepcopy on the attrs instead of this deepcopy logic. We take care of subclasses of torch.Tensor to work around this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
d4ad280429 Enable querying the build and runtime NCCL versions (#156305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305
Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin
2025-06-19 02:00:08 +00:00
bc9bd2a766 Use linux.2xlarge runner (#156351)
The cuda version of this job uses a linux.2xlarge here so matching that to see if this job really needs a 12xlarge system or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156351
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-06-19 01:50:56 +00:00
e5a1197191 Fix fx tracing for mark dynamic (#156346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156346
Approved by: https://github.com/tony-ivchenko
2025-06-19 01:03:09 +00:00
6959b5febe Context on torch.cuda.memory._record_memory_history max_entries (#155889)
Context on torch.cuda.memory._record_memory_history buffer behavior

## Description

Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?

Fixes #129674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
2025-06-19 00:44:43 +00:00
6303cc41b7 [ROCm] support CUDA_KERNEL_ASSERT using abort() (#155262)
We won't have the full message that __assert_fail would provide, but at least we won't silently do nothing.

Fixes #155045.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155262
Approved by: https://github.com/hongxiayang, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 23:52:35 +00:00
b8c2d4c259 add a corner test case of dynamic sizes for combo kernel (#156035)
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than one dimension has a dynamic size
2. no_x_dim persistent reduce op

Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```

Rollback Plan:

Differential Revision: D76699002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
2025-06-18 22:57:09 +00:00
76d07e919f Unbreak //c10/util:base (#156216)
Missing dep.

Differential Revision: [D76840057](https://our.internmc.facebook.com/intern/diff/D76840057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156216
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2025-06-18 22:44:20 +00:00
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.

2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
a5b4463d60 [nativert] session state (#156190)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76827309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156190
Approved by: https://github.com/zhxchen17
2025-06-18 22:40:44 +00:00
6918758f55 [export] Update documents for ExportGraphSignature (#156244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156184

The current document for ExportGraphSignature doesn't reflect that `torch.export.export()` returns a non-functional graph by default, which may confuse users.

Test Plan:
Document change only. CI

Rollback Plan:

Differential Revision: D76849097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156244
Approved by: https://github.com/yushangdi
2025-06-18 22:37:34 +00:00
1e474cc9c8 [ONNX] Fix how shapes are computed for float4 (#156353)
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.
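
The two shape rules can be compared side by side with a short sketch. The function names below are hypothetical and this is not the exporter's code; it only evaluates the expression quoted above against the previous "append [2]" behavior.

```python
def unpacked_float4_shape(packed_shape):
    """Shape after unpacking two 4-bit values from each stored element.

    Implements the rule quoted above: [*shape[:-1], shape[-1] * 2].
    """
    if len(packed_shape) == 0:
        raise ValueError("expected at least one dimension")
    return [*packed_shape[:-1], packed_shape[-1] * 2]


def old_unpacked_float4_shape(packed_shape):
    """Previous behavior described above: append a trailing dimension of 2."""
    return [*packed_shape, 2]


print(unpacked_float4_shape([3, 4]))      # [3, 8]
print(old_unpacked_float4_shape([3, 4]))  # [3, 4, 2]
```

With the new rule the rank is preserved and the last dimension is always even, which is the property the message calls out.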

Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
2025-06-18 22:28:02 +00:00
9afee0fa96 [inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif

Right now it doesn't have the divided by 2 logic yet. Not sure how to tell if we are on a dev machine.
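
As a rough illustration of the rule in the title, a sketch might look like the following. The function name, the use of integer division, the floor of one worker, and the CPU-only fallback are all assumptions for illustration, not Inductor's actual logic.

```python
import os


def compile_workers(cpu_count, gpu_count):
    """Split the available CPUs evenly across GPUs, never returning less than 1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be positive")
    if gpu_count < 1:
        return cpu_count  # assumption: CPU-only machines keep all CPUs
    return max(1, cpu_count // gpu_count)


cpus = os.cpu_count() or 1
print(compile_workers(cpus, gpu_count=8))
assert compile_workers(96, 8) == 12
assert compile_workers(4, 8) == 1
```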

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201
Approved by: https://github.com/masnesral
2025-06-18 22:15:32 +00:00
e5a0b73ce9 [MTIA Aten Backend] Migrate logical_and.out (#156286)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_and.out to in-tree

Differential Revision: [D76874551](https://our.internmc.facebook.com/intern/diff/D76874551/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156286
Approved by: https://github.com/nautsimon, https://github.com/jingsh
ghstack dependencies: #155634, #156046, #156047, #156283, #156284, #156285
2025-06-18 21:57:05 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
f5eb42e4c0 [nativert] move layoutplanneralgorithm to libtorch (#156205)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76831634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156205
Approved by: https://github.com/zhxchen17
2025-06-18 21:46:38 +00:00
d1c924c68a [MTIA Aten Backend] Migrate lt.Tensor_out / lt.Scalar_out (#156285)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate lt.Tensor_out / lt.Scalar_out to in-tree.

Differential Revision: [D76873997](https://our.internmc.facebook.com/intern/diff/D76873997/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156285
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283, #156284
2025-06-18 21:40:26 +00:00
5c7e1d39ab [MTIA Aten Backend] Migrate logit (#156284)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logit to in-tree.

Differential Revision: [D76871451](https://our.internmc.facebook.com/intern/diff/D76871451/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156284
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283
2025-06-18 21:36:27 +00:00
706e236b08 [MTIA Aten Backend] Migrate logical_or.out / log.out / log2.out (#156283)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_or.out / log.out / log2.out to in-tree.

Differential Revision: [D76857072](https://our.internmc.facebook.com/intern/diff/D76857072/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156283
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047
2025-06-18 21:27:58 +00:00
ab81fb846c [MTIA Aten Backend] Migrate remainder.Tensor_out / reciprocal.out / neg.out (#156047)
Migrate remainder.Tensor_out / reciprocal.out / neg.out

Differential Revision: [D76696710](https://our.internmc.facebook.com/intern/diff/D76696710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156047
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046
2025-06-18 21:17:34 +00:00
c26ce593d8 [MTIA Aten Backend] Migrate nan_to_num.out (#156046)
Migrate nan_to_num.out

Differential Revision: [D76696155](https://our.internmc.facebook.com/intern/diff/D76696155/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156046
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634
2025-06-18 21:14:13 +00:00
2f1c5c4131 [MTIA Aten Backend] Achieve CPU fallback by overriding registration (#155634)
# Context

MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.

# This diff

As suggested by Alban (PyTorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.

The benefits of this approach:
1. The previous solution has problems handling ops that have a default dispatch key (e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.

----------------

p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time

Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
2025-06-18 21:10:18 +00:00
e99cc126a4 [AOTInductor] Reuse input information instead of directly applying unbacked_symint_fallback (#156133)
Summary:
When we encounter unbacked symint during autotuning, we try to reuse existing
symbols from user provided inputs, then fallback.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid

Rollback Plan:

Differential Revision: D76769711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
2025-06-18 20:53:21 +00:00
728cf6721e Revert "[PT2]load dense delta by trimming prefixes (#155872)"
This reverts commit c74fd35050a7241f0c439501ef735aa6cdde751f.

Reverted https://github.com/pytorch/pytorch/pull/155872 on behalf of https://github.com/malfet due to Broke lint, internal has been backed out ([comment](https://github.com/pytorch/pytorch/pull/155872#issuecomment-2985542895))
2025-06-18 20:05:56 +00:00
c74fd35050 [PT2]load dense delta by trimming prefixes (#155872)
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.
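
The two steps above can be sketched in Python for illustration (the change itself lives in the model runner, and the helper names here are hypothetical). The regular expression is an assumption inferred from the single pair of example names in this message.

```python
import re

# Assumption: lowering inserts segments shaped like these two examples.
_LOWERING_SEGMENT = re.compile(r"^(submod_\d+|_run_on_acc_\d+)$")


def trim_lowered_name(name):
    """Drop lowering-only segments from a dotted weight name."""
    kept = [seg for seg in name.split(".") if not _LOWERING_SEGMENT.match(seg)]
    return ".".join(kept)


def build_index(lowered_names):
    """Map trimmed name -> weight index, rebuilt because ordering may differ."""
    return {trim_lowered_name(n): i for i, n in enumerate(lowered_names)}


lowered = [
    "merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb",
]
index = build_index(lowered)
delta_name = "merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb"
assert index[delta_name] == 0
```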

Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.

Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}  --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```

Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455

USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```

Rollback Plan:

Differential Revision: D76520301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
2025-06-18 19:13:22 +00:00
48de3da253 fix: avoid flamegraph script setup conflicts (#156310)
Fixes #156309

Instead of any kind of locking or busy waiting, this leaves room for multiple script downloads to happen concurrently: only one `rename` will succeed and the others will fail silently, and any temporary files created during the process are removed.
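
A minimal sketch of this download-then-rename pattern, assuming a caller-supplied `fetch` callable and a hypothetical `install_atomically` helper (neither is taken from the PR):

```python
import os
import tempfile


def install_atomically(fetch, dest):
    """Write fetch() output to a private temp file, then rename it into place.

    Returns True if this call installed the file, False if it lost the race.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".download-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
        if os.path.exists(dest):
            return False  # another process already finished
        os.rename(tmp_path, dest)  # atomic within one filesystem on POSIX
        tmp_path = None
        return True
    except OSError:
        return False  # losing the race is not an error
    finally:
        # Whatever happened, never leave a temporary file behind.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.remove(tmp_path)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "script")
    assert install_atomically(lambda: b"#!/usr/bin/perl\n", target) is True
    assert install_atomically(lambda: b"#!/usr/bin/perl\n", target) is False
    assert os.listdir(d) == ["script"]  # no temp files left behind
```

Because every writer uses its own temporary file, readers never observe a partially written script, and no lock file has to be cleaned up after a crash.
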
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet

Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
2025-06-18 19:06:22 +00:00
cbafba5794 Allow forcing FSDP2 to always use SUM reductions (#155915)
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.

Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.

This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).
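
The arithmetic behind "SUM plus separate pre-/post-scaling" can be shown with plain lists standing in for per-rank gradient shards. This is only a sketch of the equivalence; the function names are hypothetical and no FSDP2 or NCCL API is used.

```python
def reduce_sum(shards):
    """Element-wise SUM across ranks; stands in for a SUM reduce-scatter."""
    return [sum(vals) for vals in zip(*shards)]


def avg_via_sum(shards, post_scale=True):
    """Average gradients using only a SUM reduction plus one scaling step."""
    world = len(shards)
    if post_scale:
        return [v / world for v in reduce_sum(shards)]
    # Pre-scaling divides before reducing, so the summed values stay smaller.
    return reduce_sum([[v / world for v in shard] for shard in shards])


grads = [[2.0, 4.0], [6.0, 8.0]]  # one list per rank
assert avg_via_sum(grads, post_scale=True) == [4.0, 6.0]
assert avg_via_sum(grads, post_scale=False) == [4.0, 6.0]
```

Either placement of the scaling step yields the average, at the cost of an extra kernel compared with a fused AVG or PreMulSum reduction.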

Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
2025-06-18 18:57:47 +00:00
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
aee2bfc5ba [Intel GPU] Update xpu triton commit pin for PyTorch release 2.8. (#154194)
As title.
Thanks @anmyachev  for the work on compatibility adaptation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154194
Approved by: https://github.com/jansel
2025-06-18 18:17:07 +00:00
2620361d19 Add batching rule for torch.matrix_exp (#155202)
## Summary

Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.

Fixes #115992

## Details

`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.

## Testing

The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback

**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
2025-06-18 17:35:35 +00:00
a5f59cc2ea [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel
2025-06-18 17:32:36 +00:00
94f8679019 Revert "[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)"
This reverts commit 6d3a4356f61b28a14abd95f641e2615deb186365.

Reverted https://github.com/pytorch/pytorch/pull/155809 on behalf of https://github.com/laithsakka due to pr_time_benchmarks ([comment](https://github.com/pytorch/pytorch/pull/155809#issuecomment-2985022572))
2025-06-18 16:52:19 +00:00
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
ce3406817d Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit fe37db4f1270745d6c523623143332ddf263af55.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2984795214))
2025-06-18 15:53:32 +00:00
c5d3e7a4ff Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 920f6e681ec70b664ed952255b8c1f97962f5de0.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154289#issuecomment-2984774814))
2025-06-18 15:51:06 +00:00
408d9884b0 Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 3c8c48f79344356c58e91b9c8588f85ff806e1c8.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154782#issuecomment-2984764330))
2025-06-18 15:47:21 +00:00
6201981f48 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 614a41514545cbdd15757ef2586d433d7d34041c.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/155166#issuecomment-2984751600))
2025-06-18 15:43:22 +00:00
d290fe7690 Remove legacy export testing path (#156093)
Summary: After this diff stack lands, we are pretty much done with the training IR migration, so there is no need to run extensive legacy export tests.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76734378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
2025-06-18 15:36:44 +00:00
7531bd6491 [ROCm] upgrade to 6.4.1 patch release (#156112)
Fixes #155292.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 15:21:44 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
6d3a4356f6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
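
The reordering can be sketched on a toy node list. This is an illustration only: the `(name, op, args)` tuples and `hoist_getitems` are hypothetical and do not reflect the partitioner's actual FX-graph code.

```python
def hoist_getitems(nodes):
    """Move every ("getitem", (src, idx)) node to directly after its producer.

    Each node is (name, op, args); the order is otherwise preserved.
    """
    produced = {name for name, _, _ in nodes}
    getitems_by_src = {}
    for node in nodes:
        _, op, args = node
        if op == "getitem":
            getitems_by_src.setdefault(args[0], []).append(node)

    out = []

    def emit(node):
        out.append(node)
        # Emit this node's getitems right away (handles nested getitems too).
        for child in getitems_by_src.get(node[0], []):
            emit(child)

    for node in nodes:
        _, op, args = node
        if op == "getitem" and args[0] in produced:
            continue  # already emitted next to its producer
        emit(node)
    return out


graph = [
    ("x", "some_op", ()),
    ("x0", "getitem", ("x", 0)),
    ("use_x0", "call", ("x0",)),
    ("other", "call", ()),
    ("x1", "getitem", ("x", 1)),
]
print([n[0] for n in hoist_getitems(graph)])
# ['x', 'x0', 'x1', 'use_x0', 'other']
```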

**Results:**
For instance, for the `res2net101_26w_4s` model in the PyTorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory is
* baseline: 7.73GiB
* with the change: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
2025-06-18 14:38:55 +00:00
c177abd217 Disable pinning check when loading sparse tensors (#154638)
Disables pinning check as unnecessary and to fix https://github.com/pytorch/pytorch/issues/153143 when loading sparse tensor from external storage with sparse tensor invariants check enabled.

Fixes https://github.com/pytorch/pytorch/issues/153143 .

For FC, to be landed two weeks after https://github.com/pytorch/pytorch/pull/154617, see https://github.com/pytorch/pytorch/pull/154617#issuecomment-2919643612.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154638
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-18 14:33:36 +00:00
8f02161d10 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit a6a3a441442a96f38d0771c985f753223cea2ba0.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2984409088))
2025-06-18 14:19:39 +00:00
b30e04b3c8 Make the NCCL PG Options and Config copyable and safe to init standalone (#155700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155700
Approved by: https://github.com/kwen2501
2025-06-18 13:36:27 +00:00
1bb9b1858b [CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 (#156174)
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
2025-06-18 13:17:46 +00:00
d99cac2816 [Kineto][submodule] Update kineto pin for XPU toggle feature (#155488)
Part of #154898
Update kineto submodule

Summary: We add the toggleCollectionDynamic functionality to XPUPTI in Kineto, so profiler can be enabled/disabled dynamically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155488
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-18 12:39:58 +00:00
c11888e7a6 Skip more tests on s390x (#155210)
Make CI for s390x green before fixing and restoring tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210
Approved by: https://github.com/seemethere
2025-06-18 12:07:17 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
f45f483884 [user triton] AOT Inductor support for new host-side TMA api (#155879)
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.

Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.
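
As a rough illustration of the argument count described above (a sketch, not code from this PR; `cubin_arg_count` is a hypothetical helper name introduced only for this example):

```python
def cubin_arg_count(rank: int) -> int:
    """Number of cubin-signature arguments one TMA descriptor expands to.

    Applies the 1 + 2 * N rule stated above for a rank-N tensor.
    The helper name is hypothetical and exists only for illustration.
    """
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return 1 + 2 * rank


# Rank-1 and rank-2 descriptors are the cases exercised by the tests in this PR.
for rank in (1, 2):
    print(rank, cubin_arg_count(rank))
```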

What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel)
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.

Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2}  and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
2025-06-18 09:35:11 +00:00
577baa4116 [c10d] Add a logger for all nccl collectives with its time duration when completed (#156008)
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
2025-06-18 09:08:42 +00:00
c5a4fe9c17 [CI] fix the ci image name for public copy in ghcr (#156169)
After the PR #152209 landed, the name of ci image public copy in ghcr is not correct. For example, https://github.com/pytorch/pytorch/actions/runs/15698468716/job/44228133522#step:10:8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156169
Approved by: https://github.com/malfet
2025-06-18 08:16:56 +00:00
a6a3a44144 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
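
The marker scheme in the list above can be sketched with a toy instruction stream. This is only an illustration under stated assumptions: `ToyTranslator`, `ResumePrologueError`, the `GRAPH_BREAK` pseudo-opcode, and the `"recoverable"` return value are all hypothetical and are not Dynamo's real classes or behavior.

```python
MARKER = "__is_tracing_resume_prologue"


class ResumePrologueError(Exception):
    """Hypothetical hard error raised for failures inside the prologue."""


class ToyTranslator:
    """Toy stand-in for an instruction translator; not Dynamo's real class."""

    def __init__(self):
        self.stack = []
        self.in_prologue = False

    def step(self, opname, arg):
        if opname == "LOAD_CONST":
            self.stack.append(arg)
        elif opname == "STORE_FAST" and arg == MARKER:
            # Top of stack is True at the start of the prologue, False at the end.
            self.in_prologue = bool(self.stack.pop())
        elif opname == "GRAPH_BREAK":
            # Hypothetical opcode standing in for anything that fails to trace.
            raise RuntimeError("graph break")


def trace(instructions):
    """Run the toy translator and classify the outcome."""
    tx = ToyTranslator()
    try:
        for opname, arg in instructions:
            tx.step(opname, arg)
    except RuntimeError as exc:
        if tx.in_prologue:
            # Wrap the exception and hard error, as the last bullet describes.
            raise ResumePrologueError("error in resume prologue") from exc
        # Assumption: outside the prologue the failure stays recoverable.
        return "recoverable"
    return "ok"
```

In this toy, a failure between the two `STORE_FAST` markers raises `ResumePrologueError`, while the same failure after the closing marker is reported as recoverable.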

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-18 07:27:20 +00:00
614a415145 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-18 07:27:20 +00:00
3c8c48f793 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-18 07:27:09 +00:00
920f6e681e [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-18 07:27:00 +00:00
fe37db4f12 [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-18 07:26:52 +00:00
ccc6279b40 flex attention: fix dispatch order for tensor subclasses, avoid hardcoding call to faketensor impl in dynamo (#151719)
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198

There are two problems:

(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.

This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.

Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
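
The precedence described in (1) can be sketched with a toy dispatcher. This is a minimal illustration, not torch's dispatcher: `dispatch`, `fake_mode`, `subclass_handler`, and `ToySubclass` are hypothetical names.

```python
def dispatch(op, args, modes, subclass_handlers):
    """Toy dispatcher: modes run first; subclasses run if every mode declines.

    All names here are hypothetical; this is not torch's dispatcher.
    """
    for mode in modes:
        result = mode(op, args)
        if result is not NotImplemented:
            return result
    for handler in subclass_handlers:
        result = handler(op, args)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no implementation for {op}")


class ToySubclass:
    """Stand-in for a tensor subclass such as DTensor."""


def fake_mode(op, args):
    # Decline when any argument is a subclass the mode does not understand,
    # which gives the subclass a chance to run.
    if any(isinstance(a, ToySubclass) for a in args):
        return NotImplemented
    return f"fake:{op}"


def subclass_handler(op, args):
    # The subclass gets to interpose on the op once the mode has declined.
    return f"subclass:{op}"
```

With plain arguments the mode handles the call; once a `ToySubclass` argument appears, the mode returns `NotImplemented` and the subclass handler runs instead.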

(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.

**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.

This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-06-18 07:02:04 +00:00
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By:
henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
96df866410 [audio hash update] update the pinned audio hash (#156259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156259
Approved by: https://github.com/pytorchbot
2025-06-18 06:02:46 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improve upon https://github.com/pytorch/pytorch/pull/130890 , inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984 , we can automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
29867b211a [cutlass backend] Add __init__.py to cutlass_lib_extensions (#156234)
When using docker with cutlass backend, we can get
```
No module named 'torch._inductor.codegen.cuda.cutlass_lib_extensions'
```
First reported by @nWEIdia in https://github.com/pytorch/pytorch/issues/155888

Evidence that this fixes: https://github.com/pytorch/pytorch/pull/156136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156234
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-06-18 05:03:43 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
a82c171bb2 remove skipifrocm from composability tests (#156036)
Porting over DTensor training codebase to rocm atm and was reading through the 2D unit tests and noticed a couple of the unit tests already work on rocm even though they are being skipped. pipeline parallel tests pass too

tested locally
<img width="561" alt="image" src="https://github.com/user-attachments/assets/7c40c0f2-2de8-4cf1-8e36-0ba2bba46baa" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156036
Approved by: https://github.com/jeffdaily
2025-06-18 04:24:42 +00:00
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following sequence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this is a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
17b38b850e [ca] Allow using compiled autograd context managers during backward runtime (#156120)
Added an invariant that nested compiled autograd context managers must exit before their parent context manager. This allows us to defer the thread check.

FIXES https://github.com/pytorch/pytorch/issues/152219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156120
Approved by: https://github.com/jansel
ghstack dependencies: #155521, #155480
2025-06-18 03:01:15 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check of mask, and do contiguous for mask in flash attention when the stride of last dim != 1 & != 0. This allows more SDPAs with complex mask to be accelerated using flash attention, such as the T5 model, where the generated masks may not be contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
4851863e3f fix hack to check if register_buffer has been overridden (#155963)
Followup on https://github.com/pytorch/pytorch/pull/125971

`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overridden, but it does mean we do less work!

Example demonstration:

```python
class Base:
    def register_buffer(self, buffer):
        pass

class InheritedOk(Base):
    pass

class InheritedOverride(Base):
    def register_buffer(self, buffer):
        pass

b = Base()
ok = InheritedOk()
override = InheritedOverride()

print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False

print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```

(I can make an associated issue if needed, but didn't see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
2025-06-18 01:50:30 +00:00
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
68996dc183 [BE][2/X] Phase out usage of use_max_autotune() (#155848)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155848
Approved by: https://github.com/masnesral
2025-06-18 01:18:09 +00:00
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
3dabc351bb [Break XPU] Fix XPU UT failures introduced by community. (#156091)
Fixes #15089, Fixes #156063, Fixes #155689, Fixes #155692, Fixes #156146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156091
Approved by: https://github.com/jansel
2025-06-17 23:43:37 +00:00
38e1e5d54c Add get_pipeline_order() for Gpipe and 1F1B (#155935)
The [schedule visualizer](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/_schedule_visualizer.py) relies on `self.pipeline_order` to be populated. The `_PipelineScheduleRuntime` also depends on this to run the IR.

The single stage schedules do not implement this so this PR adds that. Also fixes a bug in the schedule visualizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155935
Approved by: https://github.com/wconstab
2025-06-17 23:39:17 +00:00
5435e75399 [ez] rename choice_timings -> choice_timings_fn (#156099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156099
Approved by: https://github.com/mlazos
ghstack dependencies: #155982, #155996, #156053
2025-06-17 23:30:27 +00:00
12b02137af [MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after Metal implementation:

Previous performance (using torch==2.7.1):
```
[-------------------------------  -------------------------------]
                                              |  eager  |  compile
1 threads: -------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |  131.0  |   136.9
      cumsum-dim0-128x128 (torch.float16)     |  116.9  |   121.2
      cumsum-dim0-512x512 (torch.float16)     |  132.5  |   151.9
      cumsum-dim0-1024x1024 (torch.float16)   |  150.0  |   163.0
      cumsum-dim1-32x32 (torch.float16)       |  125.9  |   140.9
      cumsum-dim1-128x128 (torch.float16)     |  116.4  |   129.4
      cumsum-dim1-512x512 (torch.float16)     |  135.9  |   150.1
      cumsum-dim1-1024x1024 (torch.float16)   |  139.5  |   154.2
      cumsum-1d-100 (torch.float16)           |  119.5  |   127.1
      cumsum-1d-10000 (torch.float16)         |  128.9  |   142.5
      cumsum-1d-1000000 (torch.float16)       |  140.6  |   145.6
      cumsum-dim0-32x32 (torch.float32)       |  115.7  |   132.5
      cumsum-dim0-128x128 (torch.float32)     |  118.0  |   131.5
      cumsum-dim0-512x512 (torch.float32)     |  138.8  |   151.6
      cumsum-dim0-1024x1024 (torch.float32)   |  155.5  |   164.2
      cumsum-dim1-32x32 (torch.float32)       |  127.2  |   141.7
      cumsum-dim1-128x128 (torch.float32)     |  117.7  |   130.5
      cumsum-dim1-512x512 (torch.float32)     |  138.2  |   152.3
      cumsum-dim1-1024x1024 (torch.float32)   |  144.4  |   158.6
      cumsum-1d-100 (torch.float32)           |  118.6  |   128.0
      cumsum-1d-10000 (torch.float32)         |  125.5  |   141.5
      cumsum-1d-1000000 (torch.float32)       |  143.9  |   158.4
      cumsum-dim0-32x32 (torch.bfloat16)      |  106.6  |   137.6
      cumsum-dim0-128x128 (torch.bfloat16)    |  118.1  |   131.0
      cumsum-dim0-512x512 (torch.bfloat16)    |  140.0  |   154.3
      cumsum-dim0-1024x1024 (torch.bfloat16)  |  153.2  |   164.4
      cumsum-dim1-32x32 (torch.bfloat16)      |  127.9  |   132.6
      cumsum-dim1-128x128 (torch.bfloat16)    |  116.5  |   129.6
      cumsum-dim1-512x512 (torch.bfloat16)    |  136.5  |   151.2
      cumsum-dim1-1024x1024 (torch.bfloat16)  |  139.8  |   144.8
      cumsum-1d-100 (torch.bfloat16)          |  115.7  |   129.4
      cumsum-1d-10000 (torch.bfloat16)        |  125.0  |   143.3
      cumsum-1d-1000000 (torch.bfloat16)      |  127.8  |   143.4

Times are in microseconds (us).
```

Current performance:
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |    107.4  |    123.8
      cumsum-dim0-128x128 (torch.float16)     |    134.2  |    145.8
      cumsum-dim0-512x512 (torch.float16)     |    207.3  |    231.6
      cumsum-dim0-1024x1024 (torch.float16)   |    318.9  |    355.3
      cumsum-dim1-32x32 (torch.float16)       |     98.0  |    114.3
      cumsum-dim1-128x128 (torch.float16)     |    110.8  |    121.6
      cumsum-dim1-512x512 (torch.float16)     |    193.0  |    209.1
      cumsum-dim1-1024x1024 (torch.float16)   |    844.7  |    870.8
      cumsum-1d-100 (torch.float16)           |    108.4  |    125.0
      cumsum-1d-10000 (torch.float16)         |    784.7  |    852.3
      cumsum-1d-1000000 (torch.float16)       |  65855.2  |  66725.9
      cumsum-dim0-32x32 (torch.float32)       |    114.7  |    115.7
      cumsum-dim0-128x128 (torch.float32)     |    139.0  |    151.6
      cumsum-dim0-512x512 (torch.float32)     |    197.3  |    208.0
      cumsum-dim0-1024x1024 (torch.float32)   |    312.7  |    332.9
      cumsum-dim1-32x32 (torch.float32)       |     92.0  |    110.8
      cumsum-dim1-128x128 (torch.float32)     |    114.2  |    125.0
      cumsum-dim1-512x512 (torch.float32)     |    186.2  |    196.1
      cumsum-dim1-1024x1024 (torch.float32)   |    752.0  |    825.0
      cumsum-1d-100 (torch.float32)           |    112.4  |    122.0
      cumsum-1d-10000 (torch.float32)         |    793.5  |    863.5
      cumsum-1d-1000000 (torch.float32)       |  66431.8  |  66040.0
      cumsum-dim0-32x32 (torch.bfloat16)      |    111.6  |    121.6
      cumsum-dim0-128x128 (torch.bfloat16)    |    139.0  |    138.4
      cumsum-dim0-512x512 (torch.bfloat16)    |    217.6  |    230.1
      cumsum-dim0-1024x1024 (torch.bfloat16)  |    305.2  |    325.6
      cumsum-dim1-32x32 (torch.bfloat16)      |    100.5  |    110.9
      cumsum-dim1-128x128 (torch.bfloat16)    |    112.8  |    125.0
      cumsum-dim1-512x512 (torch.bfloat16)    |    187.8  |    208.9
      cumsum-dim1-1024x1024 (torch.bfloat16)  |    790.9  |    864.7
      cumsum-1d-100 (torch.bfloat16)          |    111.6  |    124.6
      cumsum-1d-10000 (torch.bfloat16)        |    778.1  |    844.9
      cumsum-1d-1000000 (torch.bfloat16)      |  64654.3  |  64082.5

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff80307862ef8e75520054ecd19a5eff9f7e.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
54998c2daa Document padding size limitations in nn.modules.padding (#134840) (#155618)
Fixes #134840

Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:

- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive

These changes help prevent runtime errors when users attempt to use large padding values.
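
The three constraints listed above can be sketched as a one-dimension check. This is a minimal illustration under stated assumptions: `check_padding` is a hypothetical helper, not part of `torch.nn`, and the real modules may validate in a different way or at a different point.

```python
def check_padding(mode, input_size, pad_left, pad_right):
    """Return True if the padding meets the constraint listed above
    for a single dimension. Hypothetical helper, not part of torch.nn."""
    if mode == "circular":
        # Size must be less than or equal to the input dimension.
        return pad_left <= input_size and pad_right <= input_size
    if mode == "reflect":
        # Size must be strictly less than the input dimension.
        return pad_left < input_size and pad_right < input_size
    if mode == "replicate":
        # The output dimension must remain positive.
        return input_size + pad_left + pad_right > 0
    raise ValueError(f"unknown mode: {mode}")


# A pad equal to the input size is accepted for circular but not for reflect.
print(check_padding("circular", 4, 4, 4), check_padding("reflect", 4, 4, 4))
```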

## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
2025-06-17 22:16:48 +00:00
937529f0b3 Pass by const ref instead of by value in StableIValue from (#156126)
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.

I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).

Update: The LLM failed to mention primitives, which we should not pass by reference, so we are only changing the signatures for std::optional<T> and stable::Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
2025-06-17 22:11:30 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to express fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We ran into an interesting case where the script reported many mismatches, most of which turned out to be full matches. The reason is that we compared the dump ranks (0 to 79) against the local PG ranks (0 to 7), which leads to false-positive mismatches. We can just check whether the dump ranks contain all expected ranks; that should be sufficient.
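
A minimal sketch of the containment check described in the summary, assuming the ranks can be treated as plain sets of integers. The function name `has_all_expected_ranks` and the variable names are hypothetical; the real analysis script's data structures are not shown in this commit message.

```python
def has_all_expected_ranks(dump_ranks: set, expected_ranks: set) -> bool:
    """Return True when every expected rank appears in the dump ranks."""
    return expected_ranks.issubset(dump_ranks)


# Ranks taken from the example in the summary: a global dump of 80 ranks
# and a sub-process-group that only spans 8 of them.
dump_ranks = set(range(80))
expected_ranks = set(range(8))

# A direct equality comparison flags a mismatch even though nothing is missing.
naive_match = dump_ranks == expected_ranks

# The containment check accepts the smaller group.
contained = has_all_expected_ranks(dump_ranks, expected_ranks)
```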

Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.

Rollback Plan:

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
bb462a6237 [cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)

What do we expect to see when we run two identical matmuls back to back? We expect the second one to spend no time in precompilation, autotuning, or prescreening.

However, the introduction of prescreening brings some non-determinism. Basically, we have:
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.

With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
2025-06-17 20:39:06 +00:00
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
50940270ae [BE][3/X] Phase out usage of use_max_autotune() (#155849)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155849
Approved by: https://github.com/masnesral
2025-06-17 20:26:29 +00:00
b020971e78 [BE] fix typos in torchgen/ (#156083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156083
Approved by: https://github.com/jingsh
ghstack dependencies: #156079, #156082
2025-06-17 19:25:50 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
5eb5c3700b [ROCm] enable batched eigen decomposition (syevD_batched) on ROCm (#154525)
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.

`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2

Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size == 1, or rocSolver < 3.26)
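
The dispatch conditions listed above can be written as a single predicate. This is a sketch only: the actual selection happens in the C++ backend, and the name `use_syevd_batched`, the version tuple, and the string dtype labels are assumptions made for illustration.

```python
def use_syevd_batched(rocsolver_version: tuple, dtype: str, batch_size: int) -> bool:
    """Return True when all three conditions for the batched path hold.

    rocsolver_version is a hypothetical (major, minor) tuple and dtype a
    hypothetical string label; neither mirrors a real PyTorch API.
    """
    return (
        rocsolver_version >= (3, 26)       # rocSolver version >= 3.26
        and dtype in ("float", "double")   # real floating-point inputs only
        and batch_size >= 2                # a single matrix uses non-batched syevD
    )
```

Any input that fails one of the conditions (for example a complex dtype) falls back to the non-batched `syevD` path.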

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
2025-06-17 19:20:36 +00:00
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425379eb264f896390b8fcd8d3f2ce959.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
4a26bb8a12 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-17 18:59:44 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_ms = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    # 2*M*N*K FLOPs; dividing by 1e9 yields TFLOPS when the latency is in milliseconds
    tflops = (2 * M * N * K) / latency_ms / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
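The throughput arithmetic in the script can be checked in isolation. The sketch below assumes the measured latency is in milliseconds (that is the unit for which dividing by 1e9 gives TFLOPS) and uses the 419 TFLOPS rated figure quoted in this description as the peak; the function names are hypothetical.

```python
def matmul_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """TFLOPS of an (m x k) @ (k x n) matmul that took latency_ms milliseconds."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    # flops / (latency_ms * 1e-3 s) / 1e12 simplifies to flops / latency_ms / 1e9
    return flops / latency_ms / 1e9


def fraction_of_peak(tflops: float, peak_tflops: float = 419.0) -> float:
    """Fraction of the rated peak that a measured throughput reaches."""
    return tflops / peak_tflops
```

With these helpers, a measured 388.91 TFLOPS corresponds to roughly 93% of the rated 419 TFLOPS, and 121.24 TFLOPS to roughly 29%.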

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is rated at 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to the hardware limit. It also means this kernel is sufficient to get the most perf out of sm120; it only needs better-tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?

Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than default configs, but still far away from the config above for compute-bound

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)
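
A scalar sketch of the idea, assuming the standard definition ELU(x) = x for positive x and alpha * (exp(x) - 1) otherwise. This is plain Python for illustration, not the vectorized kernel the commit changes.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Scalar ELU where x == 0 takes the identity branch."""
    # Using >= instead of > routes x == 0 to the cheap branch, so expm1 is
    # only evaluated for strictly negative inputs. Both branches give 0 at 0.
    if x >= 0.0:
        return x
    return alpha * math.expm1(x)
```

Because `math.expm1(0.0)` is exactly 0, the result at 0 is the same either way; only the cost of reaching it changes.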

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
8b0e0e4f23 [dynamo] Support tracing of functools.lru_cached method (#156125)
Fixes https://github.com/pytorch/pytorch/issues/155841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156125
Approved by: https://github.com/williamwen42
2025-06-17 18:11:32 +00:00
fc5ae12293 Fix issue with right-nav (#156119)
Enable on-page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has all the objects listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
32c1611263 [CI][run_test] Fix rerun logic for failing at exit (#155853)
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.

The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file). This is usually correct: for example, if a test fails and pytest catches it, lastrun is the test that failed; if a test segfaults (which pytest doesn't catch), lastrun is the test that segfaulted. But sometimes pytest reports a success while the process exits with a nonzero exit code. The two cases I know of are hangs and double frees at exit. In this case, it's unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so that the next session starts from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's normally no difference between resetting and not resetting), so lastrun becomes the last test that ran, and it's not always true that lastrun is the test that caused the failure. Then on the next run, it starts from the last test and the process now exits cleanly.

Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds.  This is correct even in the normal path because initial value shouldn't change in that case

Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this

I think I have a PR that fixes the above, but it's old so I need to take another look

Testing:
This is from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
2025-06-17 17:51:40 +00:00
6629eaf0c6 [CMAKE] Fix torch_cpu relink logic if metal shaders are recompiled (#156193)
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp` but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`

Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
2025-06-17 17:49:33 +00:00
a4ea242edc [MPS] Implement scan metal kernels (#156100)
Implements metal kernels for scan operations:
- Migrates cumsum and cumprod from MPSGraph implementation to Metal.
- Fixes #154881
- Adds MPS backend support for cummin and cummax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100
Approved by: https://github.com/malfet
2025-06-17 17:44:22 +00:00
9a5c59368d Replace all RAIIATH with Tensor in libtorch_agnostic test, test some APIs (#155977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155977
Approved by: https://github.com/albanD
ghstack dependencies: #155367
2025-06-17 17:36:31 +00:00
b115a4c03a torch::stable::Tensor beginnings, mainly mem mgmt (#155367)
```
// The torch::stable::Tensor class is a high-level C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
//    C APIs with AtenTensorHandle. Under the hood we still call to these C shim
//    APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a unique_ptr that forces the user to pass
//    around ownership. This makes it difficult to pass one input into 2
//    different functions, e.g., doing something like c = a(t) + b(t) for
//    stable::Tensor t. Thus, we use a shared_ptr here.
```

This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
2025-06-17 17:36:31 +00:00
2625c70aec Update CODEOWNERS (#156182)
as title says. removing me as codeowner for cpp extensions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156182
Approved by: https://github.com/albanD
2025-06-17 17:15:41 +00:00
a24afbff3f Support torch.cuda.*Tensor in Dynamo (#156107)
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.

I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.

Fixes #130722

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
2025-06-17 16:31:10 +00:00
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
d66b4bcc3f [inductor][triton pin] Support triton builtins after #7054 (#156031)
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.

To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
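
The signature check and wrapper can be sketched with `inspect` alone. The stand-in functions below replace `tl.core.view` and the Inductor helpers so the example runs without Triton; all names are hypothetical, and whether the real wrapper forwards the `_semantic` object itself or something derived from it is not stated in this commit message.

```python
import functools
import inspect


def uses_semantic_kwarg(reference_builtin) -> bool:
    """Detect the newer calling convention by looking for a `_semantic` parameter."""
    return "_semantic" in inspect.signature(reference_builtin).parameters


def adapt_builder_style(fn, reference_builtin):
    """Wrap a helper written against `_builder` so it also accepts `_semantic`."""
    if not uses_semantic_kwarg(reference_builtin):
        return fn  # older convention: nothing to adapt

    @functools.wraps(fn)
    def wrapper(*args, _semantic=None, **kwargs):
        # Assumption for this sketch: forward the value unchanged under the old name.
        return fn(*args, _builder=_semantic, **kwargs)

    return wrapper


# Stand-ins for a builtin under each convention (hypothetical, not Triton code).
def old_style_view(x, _builder=None):
    return x


def new_style_view(x, _semantic=None):
    return x


# A helper written against the old `_builder` keyword.
def my_helper(x, _builder=None):
    return (x, _builder)


adapted = adapt_builder_style(my_helper, new_style_view)
```

Here `adapted(1, _semantic="ctx")` calls `my_helper(1, _builder="ctx")`, while passing `old_style_view` as the reference returns `my_helper` untouched.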

(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`

Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
2025-06-17 16:09:55 +00:00
d083841c0e Fix a small sphinx markup error (#156061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156061
Approved by: https://github.com/colesbury
2025-06-17 15:36:02 +00:00
0079c80b35 [CI] Do not constrain memory for ROCm testing in CI (#156115)
Fixes ROCm OOMs introduced by https://github.com/pytorch/pytorch/pull/155631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156115
Approved by: https://github.com/jeffdaily
2025-06-17 15:30:36 +00:00
7fcad0231c [Docs] Convert to markdown to fix 155025 (#155789)
Related to #155025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789
Approved by: https://github.com/svekars
2025-06-17 15:08:14 +00:00
4886ba64dc [BE] Refactor functions from optional_submodules (#155954)
And use `pathlib.Path` instead of `os.path`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155954
Approved by: https://github.com/Skylion007
ghstack dependencies: #155947
2025-06-17 14:41:52 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
0a0023d984 Enable NCCL zero-copy (user buffer registration) for FSDP2 (#150564)
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).

This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.

FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
2025-06-17 12:54:58 +00:00
11bb1ece50 [CI] Fix triton version split issue (#155670)
Fix a bug caused by #155313; see https://github.com/pytorch/pytorch/actions/runs/15576592378/job/43862613039?pr=154194#step:7:652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155670
Approved by: https://github.com/atalman, https://github.com/EikanWang
2025-06-17 12:42:40 +00:00
1cce73b5f4 [build] Change --cmake{,-only} arguments to envvars to support modern Python build frontend (#156045)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156045
Approved by: https://github.com/ezyang
ghstack dependencies: #156040, #156041
2025-06-17 11:40:24 +00:00
57084ca846 [BE][setup] allow passing pytorch-specific setup.py options from envvars (#156041)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156041
Approved by: https://github.com/ezyang
ghstack dependencies: #156040
2025-06-17 11:40:24 +00:00
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (aka GQA) and head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios. Enable them in this PR. SDPA UTs pass in local testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
4a8f5e752b [FSDP2] explain user contract for fully_shard (#156070)
<img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070
Approved by: https://github.com/mori360
2025-06-17 10:03:19 +00:00
8d7ee0f4fb [BE] fix typos in .ci/, .circleci/, .github/ (#156069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156069
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-17 09:43:14 +00:00
2e0e08588e [BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144553
Approved by: https://github.com/ezyang
ghstack dependencies: #144551
2025-06-17 08:18:47 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 3.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
cyy
d43c0bdf46 [CI] Move ASAN jobs to clang-18 (#149099)
Use clang-18 for ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149099
Approved by: https://github.com/ezyang
2025-06-17 04:51:07 +00:00
7b0118884e [invoke_subgraph][inductor] Dont fallback on complex dtype (#155885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155885
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #155828
2025-06-17 04:47:12 +00:00
ffcc6fea7b [invoke_subgraph] Ignore redundantly nested invoke_subgraph (#155828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155828
Approved by: https://github.com/zou3519
2025-06-17 04:47:12 +00:00
b1713c6655 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
ghstack dependencies: #156121
2025-06-17 04:46:26 +00:00
82672206b7 [SymmMem] Make get_rank_to_global_rank return const ref (#156117)
Avoiding a copy, not expecting a caller to change its value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
2025-06-17 04:13:18 +00:00
eea3bcb3d1 [SymmMem] Cache rank_to_global_rank exchange (#156116)
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
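The caching idea can be illustrated with a per-group dictionary. This is a sketch under stated assumptions: the real cache lives in the C++ `NVSHMEMSymmetricMemory` code, and the class name, the `exchange_fn` callback, and the group-name key used here are hypothetical.

```python
class RankMapCache:
    """Cache the rank-to-global-rank mapping once per group."""

    def __init__(self, exchange_fn):
        self._exchange_fn = exchange_fn  # the expensive collective exchange
        self._cache = {}                 # group name -> cached mapping
        self.exchanged_n_times = 0       # counts real exchanges, for inspection

    def get(self, group_name):
        # Only run the exchange the first time a group is seen.
        if group_name not in self._cache:
            self._cache[group_name] = self._exchange_fn(group_name)
            self.exchanged_n_times += 1
        return self._cache[group_name]


# Simulate 18 symmetric-memory creations on the same group.
cache = RankMapCache(lambda group_name: [0, 1, 2, 3])
for _ in range(18):
    mapping = cache.get("group0")
```

After the loop the counter is 1 rather than 18, which mirrors the before/after numbers reported by the test run.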

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
2025-06-17 04:12:37 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
45382b284d [cutlass backend] changes how gpu_kernels_o are handled for cutlass (#155875)
Currently, the handling is a bit hacky: we look at all the .o files we have from this session and add them all to AOTI. This doesn't work, for example, if we do multiple AOTI compilations in one session without clearing the inductor cache.

Also, I want to change how cutlass .so files are compiled. Hence this change.

This change is broken down since @coconutruben is trying to make a change to the same files too.

Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
2025-06-17 02:06:54 +00:00
cyy
64bb6317a5 [Accelerator] Fix Python typing in accelerator (#152394)
There are some changes:
1. Use keywords for arguments if possible.
2. `__exit__ ` of `device_index` is changed to return None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152394
Approved by: https://github.com/XuehaiPan, https://github.com/guangyey, https://github.com/ezyang

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 01:27:40 +00:00
1f0eb79e3e [dynamo] fix KeyError in LOAD_FAST_CHECK (#155763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155763
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #155761
2025-06-17 00:54:16 +00:00
4e833c2005 [dynamo] support tracing weakref callback (#155761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155761
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-17 00:54:16 +00:00
e6252f62ef [ONNX] Implements converter for higher order ops scan (#154513)
Fixes #151327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154513
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-17 00:54:07 +00:00
b618817479 [PGO] include ints/floats in suggested whitelist (#155980)
Made the mistake of dropping these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155980
Approved by: https://github.com/bobrenjc93
2025-06-17 00:41:38 +00:00
4311aea5e7 [AOTInductor] Add class declarations to torch._C._aoti interface file (#155128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155128
Approved by: https://github.com/desertfire
ghstack dependencies: #155149
2025-06-17 00:10:57 +00:00
82fb904140 Add warning for incorrected grad results at world size 1 (#154928)
Add warning for the issue discussed at https://github.com/pytorch/pytorch/issues/144045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154928
Approved by: https://github.com/weifengpy
2025-06-17 00:08:04 +00:00
eb4cf59ecd Add FSDP2 logging (#155826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155826
Approved by: https://github.com/weifengpy
2025-06-16 23:49:58 +00:00
6e2992a998 Remove unused Azure pipeline trigger script (#156134)
## Summary
- delete `.circleci/scripts/trigger_azure_pipeline.py`

## Testing
- `python3 -m pip install flake8`
- `python3 -m flake8 .circleci/scripts`

------
https://chatgpt.com/codex/tasks/task_e_6850a55f530c83279036800308fb6871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156134
Approved by: https://github.com/izaitsevfb
2025-06-16 23:42:52 +00:00
4781b0ee60 [SymmMem] Add NVSHMEM GET support to Triton (#155890)
Adds NVSHMEM GET operation support for Triton kernels:

- Add `getmem_block` core.extern wrapper for nvshmemx_getmem_block
- Add basic `test_triton_get` for 2-rank GET operation
- Add `test_triton_get_ring` for ring topology GET across arbitrary ranks

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_get`

```python
@skipIfRocm
@requires_triton()
def test_triton_get(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   val = 7
   inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(
       val if rank == 0 else -1
   )
   out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)

   peer = 1 - rank
   if rank == 1:
       # Rank 1 gets data from rank 0
       get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")
```

```

[Rank 0] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 1] inp buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:1', dtype=torch.int8)
...
[Rank 1] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:1', dtype=torch.int8)
...
[Rank 1] got data from peer 0

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

```python
@skipIfRocm
@requires_triton()
def test_triton_get_ring(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   # Ring topology: each rank gets data from the rank to its left
   peer = (rank - 1) % world_size

   # All ranks execute the get operation
   get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")

```

```
Output (8 GPUs):

[Rank 0] inp buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.int8)
[Rank 2] inp buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:2', dtype=torch.int8)
[Rank 5] inp buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:5', dtype=torch.int8)
[Rank 6] inp buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:6', dtype=torch.int8)
[Rank 3] inp buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:3', dtype=torch.int8)
[Rank 1] inp buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:1', dtype=torch.int8)
[Rank 2] out buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2', dtype=torch.int8)
[Rank 5] out buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:5', dtype=torch.int8)
[Rank 0] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 2] got data from peer 1
[Rank 5] got data from peer 4
[Rank 0] got data from peer 7
[Rank 7] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:7', dtype=torch.int8)
[Rank 6] out buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:6', dtype=torch.int8)
[Rank 3] out buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:3', dtype=torch.int8)
[Rank 6] got data from peer 5
[Rank 3] got data from peer 2
[Rank 1] out buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 4] inp buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:4', dtype=torch.int8)
[Rank 7] out buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:7', dtype=torch.int8)
[Rank 7] got data from peer 6
[Rank 4] out buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:4', dtype=torch.int8)
[Rank 4] got data from peer 3

----------------------------------------------------------------------
Ran 1 test in 41.045s

OK
```
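
The ring pattern in `test_triton_get_ring` can be checked without GPUs. The sketch below is a plain-Python simulation, not NVSHMEM code: `ring_get` is a hypothetical helper, and it assumes (as the logged output suggests) that each rank's input buffer is filled with its own rank id.

```python
from typing import Dict, List


def ring_get(world_size: int, numel: int = 8) -> Dict[int, List[int]]:
    """Simulate every rank fetching the buffer of its left neighbour."""
    # Assumption: rank r fills its input buffer with the value r.
    inp = {rank: [rank] * numel for rank in range(world_size)}
    out: Dict[int, List[int]] = {}
    for rank in range(world_size):
        # Left neighbour, wrapping around so that rank 0 reads from the last rank.
        peer = (rank - 1) % world_size
        # A GET copies the peer's buffer into the local output buffer.
        out[rank] = list(inp[peer])
    return out


result = ring_get(8)
```

Python's `%` returns a non-negative result for a positive modulus, which is why `(0 - 1) % 8` selects rank 7 here.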

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155890
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-16 23:18:15 +00:00
bb1f3d1a55 [MPSInductor] Improve _default dtype inference (#156121)
By just adding 'mps' as one of the backend options and fixing the reduction op to actually return a tuple of CSEVariables rather than a tuple of strings.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121
Approved by: https://github.com/dcci
2025-06-16 23:11:53 +00:00
508cdc4fc9 [BE][4/X] Phase out usage of use_max_autotune() (#155850)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155850
Approved by: https://github.com/masnesral
2025-06-16 23:10:26 +00:00
f2d70898c6 [nativert] Move OpKernel to PyTorch core (#156011)
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test

Rollback Plan:

Differential Revision: D76525939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
2025-06-16 22:53:10 +00:00
35ecd7c2d4 Revert "[Cutlass] Fix buffer missing issues (#155897)"
This reverts commit 9bd42c15707a4b410ee005d5916e882a7db432bb.

Reverted https://github.com/pytorch/pytorch/pull/155897 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/155897#issuecomment-2978391416))
2025-06-16 22:44:39 +00:00
190f76fa31 Revert "Implement guard collectives (#155558)"
This reverts commit 5a5a05a6a3be376130848e235df73b752eef0230.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/malfet due to Hmm, may be I'm looking at the wrong metric, but c92f1075aa/1 shows that test started to pass after PR were reverted ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2978337152))
2025-06-16 22:26:52 +00:00
c92f1075aa Fix if condition for CUDA 12.9 Win build (#156108)
follow-up for https://github.com/pytorch/pytorch/pull/155799/files
Currently the last if condition will be executed for CUDA 12.9, overriding previous CUDA_ARCH_LIST. We should exclude 12.9 from the last if condition to fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156108
Approved by: https://github.com/atalman
2025-06-16 21:57:34 +00:00
cce4d213a6 Remove non-header-only dep from c10_headers target (#155858)
It depends on /c10/util:base which is not header-only.

Differential Revision: [D76552750](https://our.internmc.facebook.com/intern/diff/D76552750/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D76552750/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155858
Approved by: https://github.com/ezyang
2025-06-16 21:41:25 +00:00
a24ce67dee [ez] fix grammar error in comment (#156053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156053
Approved by: https://github.com/jingsh
ghstack dependencies: #155982, #155996
2025-06-16 20:53:07 +00:00
247113e03e Add size_hint_or_throw (#155615)
## Summary
`TypeError("Cannot convert symbols to int")` has been coming up more often recently, since more unbacked symints are making their way into Inductor. See https://github.com/pytorch/pytorch/issues/155484
- One way to deal with this is to add `size_hint_or_throw` to throw if we try to pull a hint from an unbacked expr.
- Then, repurpose `size_hint` to accommodate unbacked symints by setting a default fallback or adding an appropriate fallback for each callsite.

This PR adds `size_hint_or_throw`, which throws if unbacked symints exist.
- Use `size_hint_or_throw` when the call site can try/catch the exception or already guards against unbacked symints.
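
The split between the two helpers can be sketched as follows. This is a toy model, not Inductor's implementation: symbols are plain strings, a symbol is "backed" when it has an entry in `hints`, and the fallback value `8192` is an arbitrary assumption for illustration.

```python
from typing import Dict


def size_hint_or_throw(symbol: str, hints: Dict[str, int]) -> int:
    """Return the hint for a backed symbol; raise for an unbacked one."""
    if symbol not in hints:
        raise TypeError(f"Cannot convert symbols to int: {symbol}")
    return hints[symbol]


def size_hint(symbol: str, hints: Dict[str, int], fallback: int = 8192) -> int:
    """Return the hint if one exists, otherwise a caller-chosen fallback."""
    try:
        return size_hint_or_throw(symbol, hints)
    except TypeError:
        return fallback


hints = {"s0": 16}  # "s0" is backed; "u0" below is unbacked.
backed = size_hint_or_throw("s0", hints)
unbacked = size_hint("u0", hints)
```

Call sites that can recover from the exception would use the throwing variant, while the rest pass a fallback that makes sense for them.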

------
with Codex
https://chatgpt.com/codex/tasks/task_e_684869dfc740832882c88d05534cc8f9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155615
Approved by: https://github.com/ezyang, https://github.com/laithsakka, https://github.com/jingsh

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-16 20:46:51 +00:00
008345be9d Fix #155018 (convert distributed rst to markdown) (#155528)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155018

Docs comparison (check out the 'new' whenever docs build)

1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html))
2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html))
3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html))
4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html))
5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528
Approved by: https://github.com/wz337, https://github.com/svekars
2025-06-16 20:46:09 +00:00
eb2af14f8e [PT2][partitioners] Add aten.split to view_ops list [relanding #155424] (#155943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155943
Approved by: https://github.com/ShatianWang
2025-06-16 20:42:54 +00:00
03488d820c Revert "[MPS][Testing][BE] Fix samples for full_like (#156026)"
This reverts commit 2d832c9587fd99db295b62d0c9b459d509c19d06.

Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](2d832c9587) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))
2025-06-16 19:50:26 +00:00
6d2155db49 [PGO] no code state update on dynamic=False (#155961)
Summary:
When tensor size changes are detected with `dynamic=False`, this change overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.

A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.

Test Plan:
test/dynamo/test_pgo.py

Rollback Plan:

Differential Revision: D76630499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
2025-06-16 19:47:55 +00:00
5a5a05a6a3 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile
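
The three steps above can be sketched as a single-process simulation. This is only an illustration of the decision rule: `guard_collective` is a hypothetical function, and the real collective across ranks is modelled as a logical OR over a list.

```python
from typing import List


def guard_collective(cache_hits: List[bool]) -> List[bool]:
    """Map per-rank guard lookup results to per-rank 'must recompile' decisions.

    cache_hits[r] is True when rank r found compiled code whose guards pass (step 1).
    """
    # Step 2: the collective, modelled as an OR over every rank's "I missed" flag.
    any_miss = any(not hit for hit in cache_hits)
    # Step 3: every rank adopts the same global decision.
    return [any_miss for _ in cache_hits]


# Rank 1 misses, so all three ranks recompile together.
decisions = guard_collective([True, False, True])
```

Because every rank ends up with the same decision, either all ranks enter the compiler collective from the recompile or none do.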

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 19:46:16 +00:00
61b271e0f3 Revert "Implement guard collectives (#155558)"
This reverts commit 38e5e81e55fc5d85d6cf8a83c96c88578995e3fe.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/atalman due to Breaks CI, sorry: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683161593/job/44181274826) [HUD commit link](38e5e81e55) ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2977871178))
2025-06-16 19:40:46 +00:00
7cf38d2a05 Make benchmark by op for TS model work with sample inputs (#155988)
Summary: Add pickle input type to allow for running ptvsc2_predictor_bench to get individual node benchmarks for SR

Test Plan:
```
buck2 run mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/data/users/georgiaphillips/models/742055223/1/742055223_1.predictor.local --pt_inputs=/data/users/georgiaphillips/models/742055223/0/mix.pt --pt_enable_static_runtime=1 --compare_results=0 --iters=1000 --warmup_iters=100 --num_threads=1 --do_profile=1 --method_name=${MODULE_NAME}.forward --set_compatibility --do_benchmark=1 --pytorch_predictor_default_model_id=${MODEL_ENTITY_ID}_${SNAPSHOT_ID} --input_type=pickle
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76554920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155988
Approved by: https://github.com/dolpm
2025-06-16 19:15:07 +00:00
2dc1627451 [doc] Add documentation for division by zero behavior in autograd (#155987)
Fixes #128796

This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:

1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
   - Masking before division
   - Using MaskedTensor (experimental API)

The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.

This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
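
To see why masking after the division yields `nan` gradients, the chain rule can be walked through by hand. The sketch below is plain Python rather than autograd code; `ieee_div` is a hypothetical helper that mimics IEEE-754 division (Python's `/` raises instead), and the mask is modelled as multiplication by 0.0 or 1.0.

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Divide like IEEE-754 does instead of raising ZeroDivisionError."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def grad_y_mask_after(x: float, y: float) -> float:
    """d/dy of (x / y) * mask, where mask is 0 at y == 0."""
    mask = 0.0 if y == 0.0 else 1.0
    local = -ieee_div(x, y * y)  # d(x / y)/dy = -x / y**2, which is -inf at y == 0
    return mask * local          # 0 * (-inf) is nan


def grad_y_mask_before(x: float, y: float) -> float:
    """d/dy of (x / y_safe) * mask, where y_safe replaces zeros before dividing."""
    mask = 0.0 if y == 0.0 else 1.0
    y_safe = y if mask else 1.0
    local = -ieee_div(x, y_safe * y_safe)  # finite, because y_safe is never 0
    return mask * local                    # 0 * finite is 0


after = grad_y_mask_after(1.0, 0.0)
before = grad_y_mask_before(1.0, 0.0)
```

The mask zeroes the upstream gradient in both cases; only when the local derivative is finite does the product stay finite.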

Additional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
2025-06-16 19:02:12 +00:00
907d0931cc [ca] default on in CI, with fallback for tests in test/compiled_autograd_skips/ (#155480)
For every test that is run with PYTORCH_TEST_WITH_DYNAMO=1, turn on compiled autograd via config if it is not skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155480
Approved by: https://github.com/jansel
ghstack dependencies: #155521
2025-06-16 18:45:03 +00:00
9ff9c28fe8 [ca] Functionalize AccumulateGrad (#155521)
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.

Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
        new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1);  getitem_1 = None
        copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1);  copy_ = None
```
- AOTAutograd: functionalizes the copy_

After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.

While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check. I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that check for Error nodes in the compiled case.

In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
2025-06-16 18:45:02 +00:00
42ff6a4a5c [Inductor] Delay codegen for fallback arguments and improve typing (#154371)
Delays code generation for arguments to fallback ops.  This is inspired by #155642, and likely fixes similar memory leaks.

Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-06-16 18:00:04 +00:00
4162c0f702 [BE][setup] gracefully handle envvars representing a boolean in setup.py (#156040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156040
Approved by: https://github.com/malfet
2025-06-16 17:56:31 +00:00
f48a157660 [aoti] Add more to error message (#155974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155974
Approved by: https://github.com/yushangdi
2025-06-16 17:49:52 +00:00
fbd88ae2b5 Convert to markdown: checkpoint.rst (#156009)
Related to #155014

Use two commits to have a try.
```bash
git mv docs/source/checkpoint.rst docs/source/checkpoint.md
git commit -m "[Docs] Rename checkpoint.rst"
git push origin ckpoint

# update the markdown file
git add .
git commit -m "modify checkpoint.md"
git push origin ckpoint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009
Approved by: https://github.com/svekars
2025-06-16 17:48:23 +00:00
a10024d7de Convert complex_numbers.rst to markdown (#156039)
Related to #155014

Have a try by following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039
Approved by: https://github.com/svekars
2025-06-16 17:24:37 +00:00
e9fdaf8701 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit e375d21bb9b0ef6fefe7a8af5a054a17de8c63c9.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/malfet due to Looks like it broke ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-2977428354))
2025-06-16 17:22:55 +00:00
45596ec58f Delete tools/onnx/update_default_opset_version.py (#156055)
The tool is no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156055
Approved by: https://github.com/titaiwangms
2025-06-16 17:21:36 +00:00
365ce465f3 Revert "[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)"
This reverts commit 8142a0286016e63a0e91b5667e1fb1a5e868ffd7.

Reverted https://github.com/pytorch/pytorch/pull/155900 on behalf of https://github.com/clee2000 due to causing some sort of hang? in test_distributed_spawn [GH job link](https://github.com/pytorch/pytorch/actions/runs/15678895788/job/44168117193) [HUD commit link](8142a02860) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/155900#issuecomment-2977365699))
2025-06-16 16:59:25 +00:00
2a4e357192 Fix compilation warning with gcc14 (#155934)
Note that nccl still doesn't work, so you have to build with `USE_NCCL=0`. @eqy is that something being tracked there?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155934
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-06-16 16:43:15 +00:00
503362d019 Revert "Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)"
This reverts commit 603a54a9b33e1aabe1407721d7935b881a160968.

Reverted https://github.com/pytorch/pytorch/pull/155776 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/155776#issuecomment-2977041192))
2025-06-16 15:13:53 +00:00
b8d96c3f78 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit 47c8810b5275179833d6b33ca3d70922f485272c.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/atalman due to failing torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2976990461))
2025-06-16 14:59:01 +00:00
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
6c493e2b14 [BE] add codespell linter (#156066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156066
Approved by: https://github.com/malfet
2025-06-16 14:56:07 +00:00
2d832c9587 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
2025-06-16 14:27:42 +00:00
831c9010c7 [BE] Remove non-existing operator from unimplemented list (#156025)
Never heard of torch.login :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156025
Approved by: https://github.com/dcci
2025-06-16 14:14:58 +00:00
38e5e81e55 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 14:09:14 +00:00
05faba4028 Bump requests from 2.32.2 to 2.32.4 in /.github (#155491)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-16 06:48:08 -07:00
d6ee5144ca [xla hash update] update the pinned xla hash (#156064)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156064
Approved by: https://github.com/pytorchbot
2025-06-16 11:11:10 +00:00
8142a02860 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-16 10:55:47 +00:00
bf7e290854 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
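
A minimal sketch of what such a helper could look like. Only the name `raise_on_run_directly` comes from the description above; the signature, the message wording and the example path are assumptions for illustration.

```python
def raise_on_run_directly(file_to_call: str) -> None:
    """Fail loudly when a test module is executed as a script.

    file_to_call is the file the user should have run instead.
    """
    raise RuntimeError(
        "This test file is not meant to be run directly, "
        f"use:\n\n\tpython {file_to_call} TESTNAME\n\n"
        "instead."
    )


# Typical placement at the bottom of a test module (path is a placeholder):
#
# if __name__ == "__main__":
#     raise_on_run_directly("test/test_jit.py")
```

Putting the call under the `__main__` guard means importing the module from the real entry point is unaffected, while direct execution stops with a pointer to the right file.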

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
2025-06-16 10:28:45 +00:00
f810e98143 [ONNX] Update default opset to 18 (#156023)
Update default opset for the torchscript exporter to 18 to match the dynamo exporter, because support was actually added and tested in https://github.com/pytorch/pytorch/pull/118828. In the next version we should plan to update to opset 21 or higher. This change also removes the hard limit on the torchscript exporter for more flexibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156023
Approved by: https://github.com/Skylion007
2025-06-16 08:40:49 +00:00
39c605e8b3 remove allow-untyped-defs from context.py (#155622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155622
Approved by: https://github.com/Skylion007
2025-06-16 07:38:34 +00:00
d9799a2ee7 Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
Fixes #153310

As the title

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fused_obs_fake_quant_moving_avg
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153699
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168
2025-06-16 07:10:06 +00:00
156b28e62a [audio hash update] update the pinned audio hash (#155648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155648
Approved by: https://github.com/pytorchbot
2025-06-16 03:57:28 +00:00
c620d0b5c7 convert: rst to myst pr2/2 (#155911)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
This PR converts the following three .rst files, listed with the references mentioned in each file:

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the suggested edits by the maintainers as commented in the parent PR
(used git mv on all files, yet it still appeared as delete-create action)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-16 00:44:44 +00:00
c83041cac2 [test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827)
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```

These pass on Triton 3.3 but not yet on Triton 3.4. (Note: to support tests for both Triton versions, there are two Triton kernels, one for the old API and one for the new API, and a given version of the test only runs if that version of the API is available.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
2025-06-15 20:24:19 +00:00
bc9b8ea230 [user triton] JIT inductor support for new host-side TMA api (#155814)
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.

* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)

AOTI support is not implemented yet.

Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
2025-06-15 20:24:19 +00:00
b7c95acc6c [user triton] triton_kernel_wrap support for new host-side TMA API (#155777)
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.

Major changes:
* TMADescriptorMetadata has changed. Previously it was a dict[str, tuple[list[int], list[int], int]]. Now there are two metadata formats: one for the experimental API and one for the stable API. The metadata format is now dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for the experimental API and tuple[list[int],] for the stable API. Most handling of the metadata therefore has to branch on whether it represents a stable or an experimental TMA descriptor.
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: we now also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (i.e. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata).
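
The two metadata shapes and the parameter expansion described above can be sketched in plain Python. This is only an illustration: the tag strings `"experimental"` and `"stable"`, the payload field names, and the helpers `describe` and `ttir_param_count` are assumptions, not the actual `triton_kernel_wrap` code.

```python
from typing import Union

# Assumed payload layouts, following the description above:
#   experimental API: (list[int], list[int], int)
#   stable API:       (list[int],)
ExperimentalMeta = tuple[list[int], list[int], int]
StableMeta = tuple[list[int]]
TMADescriptorMetadata = dict[str, tuple[str, Union[ExperimentalMeta, StableMeta]]]


def describe(metadata: TMADescriptorMetadata) -> dict[str, str]:
    """Branch on the tag, as most metadata handling has to (hypothetical helper)."""
    out: dict[str, str] = {}
    for arg_name, (kind, payload) in metadata.items():
        if kind == "experimental":
            # Field names here are illustrative guesses.
            dims, block_dims, element_size = payload
            out[arg_name] = (
                f"experimental dims={dims} block_dims={block_dims} "
                f"element_size={element_size}"
            )
        elif kind == "stable":
            (block_shape,) = payload
            out[arg_name] = f"stable block_shape={block_shape}"
        else:
            raise ValueError(f"unknown TMA descriptor kind: {kind!r}")
    return out


def ttir_param_count(rank: int) -> int:
    """A rank-N stable descriptor becomes 1 + 2*N parameters in the TTIR."""
    return 1 + 2 * rank
```

For example, a rank-2 descriptor would contribute `ttir_param_count(2)` parameters to the patched argument list.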

Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
2025-06-15 20:24:19 +00:00
54976bca10 [dynamo] Provide helper functions for guard filter hook (#155083)
A collection of ready-made guard filters. One issue is that they are not composable, i.e. `filter1(filter2(guard))` does not work. On the other hand, they are easy to use.
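
A minimal sketch of one way to combine such filters anyway, assuming each filter maps a sequence of guard entries to one keep/drop boolean per entry. That shape is an assumption about the hook, and `Entry`, `keep_locals`, `keep_non_private`, and `all_of` are hypothetical names, not Dynamo APIs. Because a filter returns booleans rather than guards, nesting does not work, but an element-wise AND does:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Entry:
    """Hypothetical stand-in for a guard entry; not Dynamo's real type."""
    name: str
    is_global: bool


GuardFilter = Callable[[Sequence[Entry]], list[bool]]


def keep_locals(entries: Sequence[Entry]) -> list[bool]:
    return [not e.is_global for e in entries]


def keep_non_private(entries: Sequence[Entry]) -> list[bool]:
    return [not e.name.startswith("_") for e in entries]


def all_of(*filters: GuardFilter) -> GuardFilter:
    """Keep a guard only if every filter keeps it."""
    def combined(entries: Sequence[Entry]) -> list[bool]:
        results = [f(entries) for f in filters]
        return [all(r[i] for r in results) for i in range(len(entries))]
    return combined
```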

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083
Approved by: https://github.com/zhxchen17, https://github.com/jansel
2025-06-15 17:49:36 +00:00
0935a97d95 [Dynamo] Add torch.accelerator API to trace_rules (#155884)
# Motivation
- Add the binding and non-binding APIs in torch.accelerator to the trace rules.
- Add some functions in torch.accelerator to the const fold function list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155884
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787, #155788
2025-06-15 17:09:57 +00:00
b51d803785 [Dynamo] Add XPU API to trace_rules (#155788)
# Motivation
- Add the binding and non-binding APIs to the trace rules for XPU.
- Add some XPU APIs to the const fold function list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155788
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787
2025-06-15 17:09:57 +00:00
69acba2b19 [Dynamo] Add generic and XPU-specific Stream&Event in UserDefineClass (#155787)
# Motivation
- Add XPU-specific Stream and Event to the in-graph class list for Dynamo capture.
- Add generic Stream and Event to the in-graph class list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155787
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-06-15 17:09:57 +00:00
53cd18f6b3 Update gradient behavior note in torch.amin and torch.amax (#155071)
Fixes #155048

The behavior of `min` and `max` was changed in #43519. The note about gradient behavior in the torch.amin and torch.amax docs is updated to reflect this change:

New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
        when there are multiple input elements with the same minimum or maximum value.`
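
The even-split rule in the note can be illustrated with a small pure-Python sketch. `amax_grad` is a hypothetical helper that mimics the documented rule for a 1-D input; it is not how autograd implements it.

```python
def amax_grad(values: list[float], upstream: float = 1.0) -> list[float]:
    """Gradient of max(values) w.r.t. each element under the even-split rule."""
    top = max(values)
    ties = [v == top for v in values]
    # Every element equal to the maximum receives an equal share.
    share = upstream / sum(ties)
    return [share if is_tie else 0.0 for is_tie in ties]


print(amax_grad([1.0, 3.0, 3.0]))  # [0.0, 0.5, 0.5]
```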

cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
2025-06-15 16:09:31 +00:00
655b3b14ff [executorch hash update] update the pinned executorch hash (#156007)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156007
Approved by: https://github.com/pytorchbot
2025-06-15 04:51:37 +00:00
517d2995e0 Add __int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) whose inner value, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the `Identity` object to a Python `int`. This is where the `TypeError` from the issue was raised, because `Identity` does not know how to cast to an `int`.

Fix:
Define an `__int__` method for `torch.utils._sympy.functions.Identity`.
The same applies to `float`: define `__float__` as well.
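
A minimal sketch of the idea, using a toy wrapper class rather than the real `torch.utils._sympy.functions.Identity` (which is a SymPy function; the class below is only a stand-in that mirrors the `args[0]` layout described above):

```python
class Identity:
    """Toy stand-in: wraps a single inner value in `args`, like the real class."""

    def __init__(self, arg):
        self.args = (arg,)

    def __int__(self) -> int:
        # Delegate the cast to the wrapped value.
        return int(self.args[0])

    def __float__(self) -> float:
        return float(self.args[0])


print(int(Identity(7)), float(Identity(2)))  # 7 2.0
```

Without `__int__`, calling `int()` on such a wrapper raises `TypeError`, which matches the failure described in the root cause.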

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
6ebe9a4f47 [BE][Ez]: Optimize nvshmem alloc with missing move (#156000)
Saw this in another PR where there was a missing move on this potentially very hot path with

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156000
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-06-15 03:04:08 +00:00
32eee8ed22 [SymmMem] Add nvshmem_free (#155975)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Calling `nvshmem_free` when an `NVSHMEMAllocation` is being destructed.

Use `is_finalizing()` as a guard, as done in `CUDASymmetricMemory.cu`, to avoid the "driver shutting down" error (destruction fiasco).
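
A Python analogue of this guard, for illustration only. `sys.is_finalizing()` is the standard-library function; `Allocation`, `release`, and `freed` are hypothetical stand-ins and not the C++ `NVSHMEMAllocation` code.

```python
import sys

freed: list[int] = []  # records handles released by the hypothetical allocator


def release(handle: int) -> None:
    """Stand-in for the real free call."""
    freed.append(handle)


class Allocation:
    def __init__(self, handle: int) -> None:
        self.handle = handle

    def __del__(self) -> None:
        # During interpreter shutdown the backing runtime may already be gone,
        # so skip the release instead of calling into a dead runtime.
        if sys.is_finalizing():
            return
        release(self.handle)
```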

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
2025-06-15 01:23:49 +00:00
b8aee84fb9 [c10d][fr] Shrink the range of mutex lock to avoid deadlock (#155949)
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 minutes, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should grab the lock only when we are ready to write, so that we are less likely to hold it forever. I also audited the other locks within FR and found one more place where the locked range can be shrunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
2025-06-15 00:37:42 +00:00
3159ee2ad3 Update test_schedule_multiproc to use world_size=2 (#155921)
The multiproc schedule tests previously ran with world_size=2. PP tests became flakier due to the longer pipeline execution, so this restores the previous behavior. This will fix the tests (https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481).

In follow-up PRs I will refactor the tests and move some of them to larger world sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155921
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
ghstack dependencies: #155920
2025-06-15 00:24:18 +00:00
8e1471bdc9 Allow MultiProcContinuousTest to set world_size (#155920)
`MultiProcContinuousTest` automatically sets world_size to the number of devices. This change allows the derived test class to override this attribute.
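
A rough sketch of the override pattern, using a hypothetical base class. The sentinel value, `device_count`, and `resolved_world_size` are assumptions for illustration, not the real `MultiProcContinuousTest` internals.

```python
class MultiProcBase:
    """Hypothetical stand-in for a multi-process test base class."""

    world_size: int = -1  # assumed sentinel meaning "use every visible device"

    @classmethod
    def device_count(cls) -> int:
        return 8  # placeholder; a real class would query the accelerator

    @classmethod
    def resolved_world_size(cls) -> int:
        # Fall back to the device count unless a subclass set world_size.
        return cls.device_count() if cls.world_size == -1 else cls.world_size


class ScheduleTest(MultiProcBase):
    world_size = 2  # the derived class pins a smaller world size
```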

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920
Approved by: https://github.com/fduwjj
2025-06-15 00:24:17 +00:00
9bd42c1570 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-15 00:08:50 +00:00
a35b3a9b95 [cutlass backend][forward fix] use _cuda_compiler path to check if nvcc exists (#155939)
Differential Revision: D76571828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155939
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-06-15 00:01:57 +00:00
eqy
47c8810b52 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-14 23:34:31 +00:00
0fa361e429 [ez] fix typo in _inductor/scheduler.py (#155996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155996
Approved by: https://github.com/Skylion007
ghstack dependencies: #155982
2025-06-14 21:21:35 +00:00
77ac3a0965 [SymmMem] Remove wrappers around nvshmem APIs (#155971)
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore there is no need to build an extra layer to hide dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
2025-06-14 19:58:09 +00:00
2c0d94a7de [SymmMem] Remove unused ptr_to_symm_mem_ (#155968)
No code enqueues entries to `ptr_to_symm_mem_`, thus it is always empty.
This PR removes it and supports the functionality that relied on it via the `allocations_` map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
2025-06-14 19:57:06 +00:00
a317c63d1b [BE]: Update NCCL to 2.27.3 (#155233)
Fixes: https://github.com/pytorch/pytorch/issues/155052 and https://github.com/pytorch/pytorch/issues/153517

This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155233
Approved by: https://github.com/nWEIdia, https://github.com/kwen2501, https://github.com/atalman, https://github.com/eqy
2025-06-14 19:20:31 +00:00
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes the issue of not discovering breakage of ROCm wheel builds until the nightly job runs, e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
5285d10243 remove duplicated pybind flag in mps code (#155936)
gcc14 (at least) warns that this is already defined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155936
Approved by: https://github.com/Skylion007
2025-06-14 18:41:12 +00:00
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It extends Inductor's backend registration interface to allow a backend to register custom FX passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
603a54a9b3 Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages that I will fix in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
2025-06-14 17:13:53 +00:00
c219dbd2fc avoid gso in has_internal_overlap (#155870)
existing comment already explains it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155870
Approved by: https://github.com/bobrenjc93
2025-06-14 17:13:20 +00:00
279cae52e7 [BE][PYFMT] migrate PYFMT for torch/ao/ to ruff format (#148185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148185
Approved by: https://github.com/ezyang
2025-06-14 16:47:04 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was accidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as CUDA::nvperf_host, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
370fc49dde Handle aten.to at submodule boundaries (#153972)
Summary: #buildall

Test Plan: CI

Differential Revision: D74582970

When we decompose to inference IR, aten.to can sometimes disappear. As a result, the export module call graph tree starts to contain dead nodes because the previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to a submodule whose signature the user wants to preserve, because we always desugar the tensor subclass into its constituent tensors in inference IR, making it impossible to preserve the original calling convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
2025-06-14 16:13:29 +00:00
d42c11819f [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-06-14 16:09:41 +00:00
70b68caf58 Fix logging of failed tensorified ops (#155982)
Tested via

```
TORCH_LOGS="torch.fx.passes._tensorify_python_scalars" tlp python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unspecialized_float_fallback_symint_specialization
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function pow>
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function floor>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155982
Approved by: https://github.com/flaviotruzzi
2025-06-14 14:23:54 +00:00
e375d21bb9 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
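
The ordering of the fix can be sketched in pure Python. This is a simplified scalar model, not the CPU kernel: `fake_quantize` is a hypothetical helper, the rounding placement is an assumption, and NaN inputs are not handled.

```python
def fake_quantize(x: float, scale: float, zero_point: int,
                  quant_min: int, quant_max: int) -> float:
    """Scalar model: clamp in floating point, then convert to an integer."""
    inv_scale = 1.0 / scale
    value = x * inv_scale + zero_point
    # Clamp first so +/-inf never reaches the integer conversion.
    value = min(max(value, float(quant_min)), float(quant_max))
    q = int(round(value))
    # Dequantize back to floating point.
    return (q - zero_point) * scale
```

With this ordering, `float("inf")` saturates at `quant_max` and `float("-inf")` at `quant_min` instead of hitting an out-of-range conversion.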

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-14 14:12:38 +00:00
1a568f4e5d [BE][Easy] bump isort to 6.0.1 (#155919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155919
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909, #155914
2025-06-14 12:29:01 +00:00
5467765990 [BE][Easy] bump ruff to 0.11.13 (#155914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155914
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909
2025-06-14 12:29:01 +00:00
736a15a81a [torchgen] Fix ruff format for # fmt: skip comment for function signature (#155909)
See also:

- astral-sh/ruff#18658

This fix follows the suggestion from:

- https://github.com/astral-sh/ruff/issues/18658#issuecomment-2970130276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155909
Approved by: https://github.com/ezyang
2025-06-14 12:28:55 +00:00
d859e65826 [DCP][Ez]: Fix broadcast_object bug in DCP utils (#155912)
Fixes #152310. Broadcast_object is now symmetric with gather_object and scatter_object. It was likely a typo that wasn't fixed in https://github.com/pytorch/pytorch/pull/147675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155912
Approved by: https://github.com/ezyang
2025-06-14 12:14:14 +00:00
596b418391 [BE][PYFMT] migrate PYFMT for {torch,test}/{nn,optim}/** to ruff format (#144548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144548
Approved by: https://github.com/ezyang
2025-06-14 11:27:04 +00:00
3e38feb05f [inductor] Add configuration control for CUTLASS operation selection. (#155770)
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.

**Fixes #155718**

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
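
How such a setting might be interpreted can be sketched as follows. `cutlass_enabled_for` is a hypothetical helper written only from the examples above; the real parsing in Inductor may differ, for instance in case handling.

```python
def cutlass_enabled_for(op_name: str, setting: str) -> bool:
    """Interpret a cutlass_enabled_ops-style string (illustrative only)."""
    if setting.strip() == "ALL":
        return True
    # Comma-separated list of op names; an empty string enables nothing.
    enabled = {name.strip() for name in setting.split(",") if name.strip()}
    return op_name in enabled


print(cutlass_enabled_for("addmm", "mm,addmm"))  # True
print(cutlass_enabled_for("bmm", "mm,addmm"))    # False
```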

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
2025-06-14 08:19:54 +00:00
1982ec2d22 Add api info for torch._C._nn.pyi (#148405)
The APIs involved are as follows:

- adaptive_avg_pool2d
- adaptive_avg_pool3d
- binary_cross_entropy
- col2im

ISSUE Related:
https://github.com/pytorch/pytorch/issues/148404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148405
Approved by: https://github.com/ezyang
2025-06-14 07:57:07 +00:00
7070ab3180 use guard_or_false in checkInBoundsForStorage (#155874)
This was added in https://github.com/pytorch/pytorch/pull/147354; the comment already justifies guard_or_false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155874
Approved by: https://github.com/bobrenjc93
2025-06-14 07:21:26 +00:00
d79651571f assume sparse tensor not coalesced_ gsv -> guard_or_false. (#155869)
Preserves the current behavior. Generalizes it so that torch._check_is_size is no longer needed to opt into this,
and makes it work for more complex unbacked sizes with ranges [-inf, inf].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155869
Approved by: https://github.com/bobrenjc93
2025-06-14 07:19:56 +00:00
e7da21806f [Easy][BE] update recommended VS Code settings (#152760)
Changes:

- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
2025-06-14 07:11:10 +00:00
cyy
1393f71e07 Use CUDA language in generated CMakeLists.txt from cpp_builder.py (#155979)
The CMake CUDA module has been deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155979
Approved by: https://github.com/ezyang
2025-06-14 06:52:51 +00:00
c843909d9e [flex attention][triton pin] use new TMA API (#155771)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).

This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.

Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]

Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
2025-06-14 06:34:16 +00:00
92b7ed6d07 Add Helion softmax test (#155976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155976
Approved by: https://github.com/jansel
2025-06-14 05:53:21 +00:00
9338d85d45 [ProcessGroupNCCL] Added log when fr dump triggered from pipe (#155754)
Summary:
TSIA

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
eyes

Sandcastle run

Differential Revision: D76472617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155754
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
2025-06-14 04:34:29 +00:00
bf897b4cea [ONNX] Support 0/1 on dynamic dimension (#155717)
Prior to this PR, the exporter did not support dynamic dims when the traced inputs contained 0/1. After https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.

However, a known pitfall remains because of the difference between eager and export. The compiler needs to decide the exported shape ahead of time, and whether the "hidden broadcasting" is applied results in a different export.

For example,

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.cat((x, y), axis=1) + z

model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)

DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}

with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))

print(ep)
"""
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
             #
            sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
            sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
            sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
            sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
            sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
            sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1);  x = y = None

             #
            eq: "Sym(True)" = sym_size_int_2 == sym_size_int;  sym_size_int_2 = None
            _assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'");  eq = _assert_scalar_default = None
            add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3;  sym_size_int_1 = sym_size_int_3 = None
            eq_1: "Sym(True)" = add_1 == sym_size_int_5;  add_1 = sym_size_int_5 = None
            _assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'");  eq_1 = _assert_scalar_default_1 = None
            eq_2: "Sym(True)" = sym_size_int == sym_size_int_4;  sym_size_int = sym_size_int_4 = None
            _assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'");  eq_2 = _assert_scalar_default_2 = None

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z);  cat = z = None
            return (add,)

Graph signature:
    # inputs
    x: USER_INPUT
    y: USER_INPUT
    z: USER_INPUT

    # outputs
    add: USER_OUTPUT

Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
  File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
    ep.module()(x, y, z)
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
    raise e
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
    _check_input_constraints_for_graph(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
    _check_symint(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
    raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```

The explanation (from @pianpwk):

In the model we have `return torch.cat((x, y), axis=1) + z`.

Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while `z` has shape, say, `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler has to make a decision: does broadcasting apply or not? The choices are:

1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.

Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).

For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.
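
The difference can be sketched for a single dimension in plain Python. Both functions are hypothetical illustrations of the reasoning above, not export internals.

```python
def eager_dim(a: int, b: int) -> int:
    """Eager broadcasting for one dimension: all three branches stay open."""
    if a == b:
        return a
    if a == 1:
        return b
    if b == 1:
        return a
    raise ValueError(f"cannot broadcast {a} with {b}")


def export_dim_assuming_equal(a: int, b: int) -> int:
    """Case 2 above: no broadcasting is assumed, so the sizes must match."""
    if a != b:
        raise RuntimeError(f"expected size to be equal to {a}, but got {b}")
    return a


print(eager_dim(2, 1))  # 2
```

Calling `export_dim_assuming_equal(2, 1)` raises, which mirrors the runtime error in the example: the traced input had size 1, but the exported program committed to the sizes being equal.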

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
2025-06-14 04:04:47 +00:00
187828dcb4 [OpenReg][5/N] add set_.source_Storage for openreg (#155191)
**Changes**:
- add set_.source_Storage for openreg to support torch.load & torch.serialization
- uncomment some related tests in the test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155191
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181, #155101
2025-06-14 03:44:32 +00:00
e4fd0bf771 [OpenReg][4/N] Migrate cpp_extensions_open_device_registration to OpenReg (#155101)
As the title stated.

**Involved testcases**:
- test_open_device_storage_pin_memory
- test_open_device_serialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155101
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181
2025-06-14 03:44:32 +00:00
1e7989cad5 [OpenReg][3/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154181)
As the title stated.

**Involved testcases**:
- test_open_device_quantized
- test_open_device_random
- test_open_device_tensor
- test_open_device_packed_sequence
- test_open_device_storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154181
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106
2025-06-14 03:44:32 +00:00
7e5f29b2de [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154106)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154106
Approved by: https://github.com/nareshrajkumar866, https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019
2025-06-14 03:44:32 +00:00
676abded4b [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154019)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154019
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018
2025-06-14 03:44:32 +00:00
d3d469092f [Openreg] Split TestOpenReg into two parts (#154018)
----

- TestPrivateUse1: tests the third-party accelerator integration mechanism itself
- TestOpenReg: testing openreg itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154018
Approved by: https://github.com/albanD
ghstack dependencies: #153947
2025-06-14 03:44:31 +00:00
cafd2344d6 [OpenReg] add manual_seed related capabilities (#153947)
**Changes**:
- Add manual_seed, manual_seed_all, initial_seed, and so on
- Delay execution of self._lazy_init more deeply
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153947
Approved by: https://github.com/albanD
2025-06-14 03:44:31 +00:00
297805fd8f Typo fixes for "overridden" in comments and function names (#155944)
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
ca3cabd24a Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696)
Fixes #155028

This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-14 03:32:00 +00:00
cdfa33a328 [nativert] move execution frame to torch (#155830)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76369008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155830
Approved by: https://github.com/zhxchen17
2025-06-14 03:28:55 +00:00
a6084b71ed [BE][1/X] Phase out usage of use_max_autotune() (#155847)
`use_max_autotune()` is likely not what people expect it to be.

Originally, `use_max_autotune()` was set up to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back, `use_max_autotune()=True` also holds in the case that `search_autotune_cache=True`; in this case though, `search_autotune_cache=True` should never trigger autotuning.

Over time, people have used `use_max_autotune()` likely without realizing that this gives unexpected behavior if `search_autotune_cache=True`. We could rename the method to be more clear, but prefer to phase it out entirely for maximal clarity.
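The mismatch described above can be sketched in a few lines. This is a simplified model, not the Inductor source: the `Config` dataclass and the `should_autotune_gemm` helper are hypothetical names used only for illustration, and only the three flag names come from the text.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical stand-in for the Inductor config; only the flag names
    # are taken from the commit message.
    max_autotune: bool = False
    max_autotune_gemm: bool = False
    search_autotune_cache: bool = False


def use_max_autotune(cfg: Config) -> bool:
    # Behavior as described above: the cache-lookup flag also turns this on.
    return cfg.max_autotune or cfg.max_autotune_gemm or cfg.search_autotune_cache


def should_autotune_gemm(cfg: Config) -> bool:
    # Hypothetical explicit check: a cache lookup alone never triggers autotuning.
    return cfg.max_autotune or cfg.max_autotune_gemm


cache_only = Config(search_autotune_cache=True)
print(use_max_autotune(cache_only))      # True
print(should_autotune_gemm(cache_only))  # False
```

Spelling out the flags at each call site, as in the second helper, is one way to avoid the surprise once the combined method is gone.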

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
2025-06-14 03:16:20 +00:00
7982b8c703 [BE][AOTI] Remove duplicate schema for ExternKernelNode (#155867)
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76558294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
2025-06-14 02:03:27 +00:00
8f5f01bf19 [BE][AOTI] Combine DynamicArgType in Proxy Executors (#155871)
Summary:
As title.

Move the duplicate definition to the base class header `proxy_executor.h`

Test Plan:
CI

Rollback Plan:

Differential Revision: D76559180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155871
Approved by: https://github.com/yushangdi
2025-06-14 01:52:43 +00:00
4574b39aa4 Revert "[BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)"
This reverts commit bbbced94a43cf764ddfe719e7d4c161a3992830c.

Reverted https://github.com/pytorch/pytorch/pull/155709 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15645591737/job/44082402642) [HUD commit link](bbbced94a4) landrace with 155819? easy forward fix but its the end of the week so idk when id get a review ([comment](https://github.com/pytorch/pytorch/pull/155709#issuecomment-2972094849))
2025-06-14 01:43:16 +00:00
c10339559d [BE] Better uv detection in pip init (#155972)
If one has some UV and non-UV environments locally, one should call `uv pip install` only on the UV-enabled ones, which can be detected by
checking if the `uv/python` path is present in `sys.base_prefix`
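A minimal sketch of that detection, assuming the check is a plain substring test on `sys.base_prefix`. The helper names `is_uv_python` and `pip_install_command` are hypothetical, and the separator normalization is an added assumption rather than something the commit states.

```python
import sys


def is_uv_python(base_prefix: str = sys.base_prefix) -> bool:
    # Heuristic from the commit message: uv-managed interpreters live under
    # a path containing "uv/python". Separators are normalized so the check
    # also matches Windows-style paths (an assumption, not from the source).
    return "uv/python" in base_prefix.replace("\\", "/")


def pip_install_command(base_prefix: str = sys.base_prefix) -> list[str]:
    # Hypothetical helper: choose the installer front end for this interpreter.
    if is_uv_python(base_prefix):
        return ["uv", "pip", "install"]
    return [sys.executable, "-m", "pip", "install"]


print(pip_install_command())
```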

Fixes https://github.com/pytorch/pytorch/issues/152999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155972
Approved by: https://github.com/janeyx99
2025-06-14 01:35:50 +00:00
d7e3c9ce82 Revert "Enable manywheel build and smoke test on main branch for ROCm (#153287)"
This reverts commit 3b6569b1ef4b9ff25f5b75fe0a216d6d084d573f.

Reverted https://github.com/pytorch/pytorch/pull/153287 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15646152483/job/44083912145) [HUD commit link](3b6569b1ef) ([comment](https://github.com/pytorch/pytorch/pull/153287#issuecomment-2972088294))
2025-06-14 01:32:27 +00:00
c165b36a31 [MTIA Aten Backend] Migrate relu / relu_ (#155927)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate relu / relu_.

Note: The PyTorch in-tree implementation delegates relu to clamp_min, so there is no longer any need to launch a relu kernel.
https://www.internalfb.com/code/fbsource/[0c9eedb2fc8f99bcca00cb67a5738cfe07e39349]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

Let me know if any concern about this

Differential Revision: [D75803582](https://our.internmc.facebook.com/intern/diff/D75803582/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155927
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925, #155926
2025-06-14 01:24:48 +00:00
50f6431e0a [MTIA Aten Backend] Migrate sqrt.out / rsqrt.out / sin.out / silu.out (#155926)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate sqrt.out / rsqrt.out / sin.out / silu.out

Differential Revision: [D75801847](https://our.internmc.facebook.com/intern/diff/D75801847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155926
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925
2025-06-14 01:24:48 +00:00
7b11cb8c12 [MTIA Aten Backend] Migrate tanh.out and tanh_backward.grad_input (#155925)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate tanh.out and tanh_backward.grad_input

Differential Revision: [D75769242](https://our.internmc.facebook.com/intern/diff/D75769242/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155925
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659
2025-06-14 01:24:41 +00:00
0185d3a5ed [MTIA Aten Backend] Migrate bitwise_or.Tensor_out (#154659)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_or.Tensor_out from out-of-tree to in-tree.

Differential Revision: [D75629937](https://our.internmc.facebook.com/intern/diff/D75629937/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154659
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632
2025-06-14 01:24:34 +00:00
163cdaaa3a [MTIA Aten Backend] Migrate bitwise_not.out (#154632)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_not.out from out-of-tree to in-tree.

Differential Revision: [D75610643](https://our.internmc.facebook.com/intern/diff/D75610643/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154632
Approved by: https://github.com/egienvalue
2025-06-14 01:24:27 +00:00
04cf2c9d24 fix tensor print behavior for MAIA (#155609)
This pull request fixes the tensor print behavior for `MAIA` to account for the absence of double-precision support in its backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155609
Approved by: https://github.com/soulitzer
2025-06-14 01:04:12 +00:00
dabb55baff Add resolve in add decomp to enable view (#153945)
Fixes #148950.

During graph construction and while running the add node under the [interpreter](https://github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument of a conj complex tensor gets cloned. As a result, *.is_conj()* always evaluates to false in the decomposition function.

Propose a fix of calling resolve_conj() in the decomposition of complex tensor add.

Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
2025-06-14 00:41:50 +00:00
fec571cfd4 [BE][CI] Remove hardshrink integer exclusions (#155965)
As they are not called anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155965
Approved by: https://github.com/dcci
2025-06-14 00:32:57 +00:00
38410cf9b5 Fix DDPOptimizer issue on static tensor index (#155746)
We rely on `_try_get_metadata_from_dynamo()` to get static input indices. When the meta info is missing, it just returns an empty list of static input indices. This wrong list of static input indices leads to repeated cudagraph re-recording, which looks like a hang from the user's perspective. bc3972b80a/torch/_functorch/aot_autograd.py (L1025-L1031)

The root cause is that `split_module` in DDP Optimizer loses meta info and gm attributes. This PR fixes the issue by propagating this metadata from the original module to the submodules.
bc3972b80a/torch/_dynamo/backends/distributed.py (L515-L517)

Fixes #140395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155746
Approved by: https://github.com/xmfan, https://github.com/bdhirsh
2025-06-14 00:15:58 +00:00
3b6569b1ef Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 00:05:57 +00:00
bbbced94a4 [BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)
followup for https://github.com/pytorch/pytorch/pull/154980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155709
Approved by: https://github.com/tinglvv, https://github.com/atalman, https://github.com/nWEIdia, https://github.com/cyyever
2025-06-13 23:10:01 +00:00
d512584718 [BE] Refactor clamp dtypes check (#155930)
By introducing `check_for_unsupported_clamp_dtypes` similar to `check_for_unsupported_isin_dtypes`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155930
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/clee2000
ghstack dependencies: #155470
2025-06-13 23:05:02 +00:00
0cb85c188f [BE] Move optional submodules checkout to its own module (#155947)
To expand it to optional eigen checkout later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155947
Approved by: https://github.com/Skylion007
2025-06-13 23:02:38 +00:00
3003c681ef Converting .rst files to .md files (#155377)
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:54:27 +00:00
799443605b Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst (#155297)
Fixes #155019

## Description
Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst

## Checklist
- [X] dlpack.rst converted to dlpack.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/dlpack.html)
- [X] distributions.rst converted to distributions.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributions.html)
- [X] distributed.tensor.rst converted to distributed.tensor.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.html)
- [X] distributed.tensor.parallel.rst converted to distributed.tensor.parallel.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.parallel.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155297
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:08:37 +00:00
764c02b78b [BE] Raise NotImplementedError (#155470)
When op is unimplemented for a specific dtype

Which makes more sense, than a RuntimeError

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes it has not been implemented for
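For callers, one relevant fact is that Python's built-in `NotImplementedError` is a subclass of `RuntimeError`, so handlers written as `except RuntimeError` should keep catching the new exception, while code that compares exact exception types may not. The sketch below uses a hypothetical `fake_hardshrink` stand-in instead of a real ATen op so that it runs without torch.

```python
def fake_hardshrink(dtype: str) -> None:
    # Hypothetical stand-in for an op that has no kernel for `dtype`;
    # the message mirrors the example above.
    raise NotImplementedError(f'"hardshrink_cpu" not implemented for \'{dtype}\'')


def classify(dtype: str) -> str:
    try:
        fake_hardshrink(dtype)
    except NotImplementedError:
        # The more specific type can now be handled separately.
        return "unsupported dtype"
    except RuntimeError:
        return "other runtime error"
    return "ok"


def legacy_handler(dtype: str) -> bool:
    # Older code that only knows about RuntimeError still catches the error.
    try:
        fake_hardshrink(dtype)
    except RuntimeError:
        return True
    return False


print(issubclass(NotImplementedError, RuntimeError))  # True
print(classify("Long"))                               # unsupported dtype
print(legacy_handler("Long"))                         # True
```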

Mark a few more unary ops as unimplemented to satisfy foreach testing error reporting consistency between CPU and CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-13 22:07:03 +00:00
d59ed21d0f [CI] Reuse old whl: track why failed to use the old whl (#155860)
As in title
Any other things I should track?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155860
Approved by: https://github.com/malfet
2025-06-13 22:01:31 +00:00
3596c0c77f Fix test after revert (#155946)
ex
test_dynamic_shapes.py::TestUbackedOps::test_unbacked_reshape2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15642199583/job/44073674212) [HUD commit link](06408dae49)

started after 06408dae49d06b6146fdd9d7a37eb5dde4f5e78d

idk what the test does so maybe theres a better way to fix this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155946
Approved by: https://github.com/yangw-dev, https://github.com/huydhn, https://github.com/malfet
2025-06-13 21:52:07 +00:00
eef253d9f6 [CI] Keep going display on HUD: upload log when test fails (#155371)
I guess this is more of an RFC

Goal:
Enable keep going so that we can get information immediately for failures.  We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly.

Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`.  If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line.

I am getting the log by concatenating the test logs in test/test-reports, which is all the text output by pytest (unless someone runs with the `ci-verbose-test-logs` label).  There are obviously many things this won't catch, e.g. output outside of run_test.py and some output inside of run_test.py, but it should be enough.
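The concatenation step could look roughly like the sketch below. It is not the code from the PR: the function name `build_fake_log`, the `*.log` file pattern, and the per-file header format are all assumptions made for illustration.

```python
from pathlib import Path


def build_fake_log(report_dir: Path) -> str:
    # Concatenate every *.log file under report_dir in a stable order,
    # prefixing each with its relative path so a failure can be located.
    parts = []
    for log in sorted(report_dir.rglob("*.log")):
        parts.append(f"===== {log.relative_to(report_dir)} =====")
        parts.append(log.read_text(errors="replace"))
    return "\n".join(parts)


if __name__ == "__main__":
    # Directory name taken from the text above; it may not exist locally,
    # in which case the result is simply an empty string.
    print(build_fake_log(Path("test/test-reports")))
```

Keeping a header per file is a cheap way to let a log classifier, or a person, map a stack trace back to the test file that produced it.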

After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification.  This means we will have to change log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one

To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run

Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed

Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones

Pros:
Not many changes outside of HUD/UI

Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
2025-06-13 21:21:55 +00:00
e5ed267f83 Update h100-distributed image (#155861)
Move non inductor workflows cuda 12.6->cuda 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155861
Approved by: https://github.com/seemethere
2025-06-13 21:17:05 +00:00
20a74c370b Add error message with assert to topK if ndims() - dim > 4 (#155475)
Addressing #154890

Not really a proper fix but at least it's more informative than the current crash.

For a more long term solution I'm testing if we can use the TopK API released in MacOS14 as it does not have the same MPSScan op issue that the Sort and ArgSort are hitting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
2025-06-13 21:10:06 +00:00
049dc48d1e fix code chunk indentation for jit_language_reference_v2.md (#155937)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781

Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-06-13 21:05:23 +00:00
731351bb4a Convert rst to markdown - optim.rst #155031 (#155813)
Fixes #155031
![image](https://github.com/user-attachments/assets/36507ca1-eb1e-4358-9e66-ce25ec8a2be1)

@pytorchbot label "docathon-h1-2025" "module: docs" "topic: not user facing" "topic: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155813
Approved by: https://github.com/AlannaBurke
2025-06-13 21:03:39 +00:00
92388bb2ab [export] Remove broken check for multiple cpp files in PT2 package (#155149)
This check was recently added, but (when fixed to refer to CPP rather than library files) fails with the separate kernel and wrapper build of AOTInductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155149
Approved by: https://github.com/angelayi
2025-06-13 21:02:31 +00:00
7d1b3f599d [Docs] Convert to markdown cond.rst, config_mod.rst (#155653)
Related to #155014

Only included 2 files in this PR:

- cond.rst
- config_mod.rst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155653
Approved by: https://github.com/svekars
2025-06-13 20:58:57 +00:00
fdf5d97fa8 [cutlass backend][ez] Log timings from prescreening (#155757)
Differential Revision: [D76474669](https://our.internmc.facebook.com/intern/diff/D76474669/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155757
Approved by: https://github.com/ColinPeppler
2025-06-13 20:44:04 +00:00
f3e6c8e834 Fix #155016 for Docathon - convert rst to markdown (#155198)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

One note is that "Created On" and "Last Updated On" banner doesn't show in the markdown files... I'm not sure if that's just an artifact of my local build though.

Fixes #155016

Docs comparison (check out the 'new' whenever docs build)

1. cuda ([old](https://docs.pytorch.org/docs/main/cuda.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.html))
2. cuda.tunable ([old](https://docs.pytorch.org/docs/main/cuda.tunable.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.tunable.html))
3. leave cudnn_persistent_rnn.rst as is because it's reused in docstrings
4. cudnn_rnn_determinism.rst as is because it's reused in docstrings.
5. data ([old](https://docs.pytorch.org/docs/main/data.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/data.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155198
Approved by: https://github.com/albanD, https://github.com/svekars
2025-06-13 20:24:34 +00:00
bf798a2f01 Change _hfstorage to hfstorage (#155837)
Summary: Change HF classes to not have a leading underscore, thereby making them public. We will add documentation for them following this.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D76364024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
2025-06-13 20:19:51 +00:00
77f884c2ec Optimize Tensor.backward type hints (#155656)
Fixes #81963

## Test Result

![image](https://github.com/user-attachments/assets/67380fdc-73c4-43d8-b2a5-5e16d63f4fd3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155656
Approved by: https://github.com/soulitzer
2025-06-13 19:16:48 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
4628f1b7a9 [Hierarchical-Compile] Track mutations for setitem (#155880)
This fixes a bug in tensor variable where we would not do things like set the example value on setitem nodes (but these don't typically have users so it doesn't matter)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155880
Approved by: https://github.com/anijain2305
2025-06-13 18:59:31 +00:00
344731fb25 Add CUDA 12.9.1 sbsa nightly binaries (#155819)
https://github.com/pytorch/pytorch/issues/155196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155819
Approved by: https://github.com/atalman
2025-06-13 18:52:41 +00:00
ce44877961 [c10d][PGNCCL] Make watchdog thread a class (#155831)
By extracting both monitor thread and watchdog thread into a separate class this will help us learn what dependencies we have for each thread and it will kind of simplify the consolidation work for each thread (consolidating from thread per PG instance to per PG class)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-06-13 18:05:22 +00:00
c5d00e150a convert: rst to myst pr 1/2 (#155840)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following two .rst files
- [torch.compiler_dynamo_overview](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_dynamo_overview.rst)
- [torch.compiler_fake_tensor](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fake_tensor.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155840
Approved by: https://github.com/sekyondaMeta
2025-06-13 18:02:28 +00:00
36bf81e363 [BE] Fix minifier when one has multiple Python runtimes (#155918)
By using `sys.executable` instead of `"python"`

Otherwise, it fails on Ubuntu with `python not found` error
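A small illustration of the difference, independent of the minifier itself: spawning the child with `sys.executable` reuses the interpreter that is already running, whereas a bare `"python"` depends on what happens to be on `PATH`.

```python
import subprocess
import sys

# Launch a child interpreter that matches the current one. A bare "python"
# may be missing from PATH (for example when only "python3" is installed).
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info[:2])"],
    capture_output=True,
    text=True,
    check=True,
)
child_version = result.stdout.strip()
print(child_version)
```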

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155918
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/zou3519
2025-06-13 17:55:04 +00:00
093aaccae2 convert jit_language_reference_v2.rst to jit_language_reference_v2.md (#155781)
Fixes https://github.com/pytorch/pytorch/issues/155023

Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
2025-06-13 17:33:10 +00:00
f0bee87eea [xla hash update] update the pinned xla hash (#155779)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155779
Approved by: https://github.com/pytorchbot
2025-06-13 17:13:37 +00:00
1f3cc4875c [ATen][CUDA][cuSOLVER] Add cusolverDnXsyevBatched for torch.linalg.eigh (#155695)
This PR adds a new API for the SYEV operation of cuSOLVER, [`cusolverDnXsyevBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxsyevbatched), which is a new alternative to [`cusolverDn<t>syevjBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-syevjbatched). This API was introduced in cuSOLVER as part of the 64-bit API in CUDA Toolkit 12.6.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155695
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-06-13 17:12:26 +00:00
b6add8c8ba [MPSInductor] Fix remainder implementation for int types (#155891)
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with an integer specialization, which should make it much faster than before while remaining compliant with the Python way of rounding up negative numbers.

This allows one to remove complex type detection logic from mps codegen and rely on Metal(C++) type system to figure out input and output types.
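To make the "Python way" concrete, here is a plain-Python sketch of how a C-style truncating integer remainder can be adjusted to match Python's `%`. It only illustrates the semantics; it is not the Metal implementation, and the names `c_rem` and `python_rem` are hypothetical.

```python
def c_rem(a: int, b: int) -> int:
    # C-style remainder: the quotient truncates toward zero, so the
    # result takes the sign of the dividend.
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q


def python_rem(a: int, b: int) -> int:
    # Python-style remainder: the result takes the sign of the divisor.
    r = c_rem(a, b)
    if r != 0 and (r < 0) != (b < 0):
        r += b
    return r


print(c_rem(-7, 5), python_rem(-7, 5), -7 % 5)
```

Staying in integer arithmetic throughout also keeps the result usable as an array index, which is what the float-based expression in the failing kernel could not provide.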

This fixes compilation of something like
```python
@torch.compile
def f(x, y):
    return x[y % 5]
```

which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant long* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[x0];
        auto tmp1 = 12;
        auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
        auto tmp3 = 1024;
        auto tmp4 = static_cast<long>(tmp3);
        auto tmp5 = tmp2 + tmp4;
        auto tmp6 = tmp2 < 0;
        auto tmp7 = tmp6 ? tmp5 : tmp2;
        if ((tmp7 < 0) && (tmp7 > 1024)) return;
        auto tmp9 = in_ptr1[tmp7];
        out_ptr0[x0] = static_cast<float>(tmp9);
    }
 with program_source:372:28: error: array subscript is not an integer
        auto tmp9 = in_ptr1[tmp7];
                           ^~~~~
```

This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
2025-06-13 16:42:56 +00:00
9462106b7e [nativert] Move graph_passes to nativert (#155411)
Summary: Move graph_passes to nativert

Test Plan:
CI

Rollback Plan:

Differential Revision: D76205048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155411
Approved by: https://github.com/zhxchen17
2025-06-13 16:41:01 +00:00
338a8c7853 fix slice w/ dynamic shapes (#153131)
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes.
This causes an overflow error with a huge number, “9223372036854776048”.
Test Plan: CIs should pass.

Differential Revision: D74354663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
2025-06-13 15:53:17 +00:00
a5938ff431 [BE][c10d/Store]add check in pyi (#155855) (#155865)
Summary:

"check" is already bound https://fburl.com/code/9lx1zf9o
which is also documented in https://docs.pytorch.org/docs/stable/distributed.html
add it to pyi for type checking

Test Plan:
skip

Rollback Plan:

Differential Revision: D76547457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155865
Approved by: https://github.com/fduwjj
2025-06-13 15:39:27 +00:00
bee93f9f0d Move glslc to cas to enable remote execution (#155832)
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of the slowest targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734

This target had to run locally because it uses the manifold backend for dotslash. This diff moves `glslc` to the cas backend so that it can run on RE.

Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar

% ls
-rw-r--r--  1 navidq  staff   2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r--  1 navidq  staff   4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r--  1 navidq  staff   4.4M Jun 12 10:03 glslc-windows-v2024_3.tar

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```

Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
2025-06-13 14:38:51 +00:00
ce6e0523f9 Revert "[BE] Raise NotImplementedError (#155470)"
This reverts commit 5ab6a3fb6fd37c542060c606edd4b95c7e3cae82.

Reverted https://github.com/pytorch/pytorch/pull/155470 on behalf of https://github.com/malfet due to foreach tests are failing on ROCm because we are not running the same on CUDA ([comment](https://github.com/pytorch/pytorch/pull/155470#issuecomment-2970592124))
2025-06-13 14:32:50 +00:00
3819584f12 [precompile] Implement PrecompileContext for recording precompile artifacts, integrate with CompilePackage (#154415)
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling,  dump it all into bytes, and then stitch it back together into a cache of callables.

## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.

## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:

![image](https://github.com/user-attachments/assets/49a0a75b-1e7f-4d96-8d81-6769fe5a53ca)

Whereas we'd visualize PrecompileContext's result like so:

![image](https://github.com/user-attachments/assets/fcc0dd4e-dfbf-4b13-9c08-2e99b373180b)

For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.

After this PR, precompile consists of three main interfaces:

### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` load precompile artifacts into function's dynamo state with a dictionary of backends
- `cache_entry()` return a serializable cache entry to save

### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path)`: Save a CompilePackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)

### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key
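
To make the global and local serialization paths listed above concrete, here is a minimal sketch in plain Python. `ToyPrecompileContext`, `record_artifact`, and `deserialize` are hypothetical names chosen for illustration, and `pickle` stands in for whatever format the real code uses; this is not the PR's implementation.

```python
import pickle


class ToyPrecompileContext:
    """Hypothetical stand-in: records artifacts, then round-trips them through bytes."""

    def __init__(self):
        self._artifacts = {}  # key -> raw bytes recorded while "compiling"

    def record_artifact(self, key, content):
        self._artifacts[key] = content

    def serialize(self):
        # "Global" path: every recorded artifact goes into one bytes blob.
        return pickle.dumps(self._artifacts)

    def serialize_artifact_by_key(self, key):
        # "Local" path: only the artifact stored under `key`.
        return pickle.dumps({key: self._artifacts[key]})

    @staticmethod
    def deserialize(blob):
        return pickle.loads(blob)


ctx = ToyPrecompileContext()
ctx.record_artifact("fn_a", b"backend-data-a")
ctx.record_artifact("fn_b", b"backend-data-b")
restored_all = ToyPrecompileContext.deserialize(ctx.serialize())
restored_one = ToyPrecompileContext.deserialize(ctx.serialize_artifact_by_key("fn_a"))
```

The global blob restores both entries, while the local one restores only `fn_a`.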

<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
2025-06-13 14:11:24 +00:00
b2fc9cfea1 [precompile] Add CompilePackage to serialize dynamo states. (#155118)
Adds CompilePackage, a per-torch.compile() object that tracks dynamo artifacts. CompilePackage is considered a low-level component and should not be directly exposed to end users. It has the following interface:

1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
     a. when `dynamo` argument is None, it will construct a brand new CompilePackage object.
     b. when `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.

This diff focuses on building the low-level mechanism for precompile. It is left to upper-level interfaces to use these APIs to build a more user-facing frontend.

Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu

Co-authored-by: James Wu <jjwu@meta.com>
2025-06-13 13:54:10 +00:00
670dab6c63 [AOTI] Enable OP test__weight_int4pack_mm_with_scales_and_zeros in AOTI. (#155780)
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
2025-06-13 11:12:13 +00:00
463fe36532 fix error message on specialization with Dim.DYNAMIC (#155738)
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.

This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.

Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.
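
A rough sketch of the idea, assuming a key path is a tuple whose first element selects `args` (0) or `kwargs` (1). The `render_source` helper and its path encoding are hypothetical and only illustrate the renaming, not the code in this PR.

```python
def render_source(arg_names, key_path, suffix=""):
    """Render a key path into (args, kwargs) using the function's own argument names.

    key_path[0] selects args (0) or kwargs (1); key_path[1] is the positional
    index or keyword name; anything after that indexes into the argument.
    """
    kind, first, *rest = key_path
    # The prefix of the path maps to a root source named after the argument.
    root = arg_names[first] if kind == 0 else first
    # The rest of the path hangs off that root source.
    tail = "".join(f"[{key!r}]" for key in rest)
    return f"{root}{tail}{suffix}"


names = ["x", "y", "zs"]
old_style = "args[0][1].size()[0]"
new_style = render_source(names, (0, 1), ".size()[0]")
nested = render_source(names, (0, 2, 0), ".size()[1]")
```

Here `new_style` is `y.size()[0]` and `nested` is `zs[0].size()[1]`, in place of the `args[0][...]` rendering.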

Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
2025-06-13 10:33:46 +00:00
6abe450a6f [pytorch Aten] Delete unused duplicate clamp_stub, to avoid compile error (#154631)
I found the `clamp_stub` in `UnaryOps.h` is not used. And it's a duplicate of the `clamp_stub` in `TensorCompare.cpp`:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorCompare.cpp#L313-L314

This diff/PR deletes it as this duplicate caused build failure for me:
```
ATen/native/UnaryOps.h:109:1: error: redefinition of 'clamp_stub_DECLARE_DISPATCH_type'
```

Differential Revision: [D75612521](https://our.internmc.facebook.com/intern/diff/D75612521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154631
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/nautsimon
ghstack dependencies: #154589, #154591
2025-06-13 10:01:51 +00:00
1cc31b213d [MTIA Aten Backend] Migrate bitwise_and.Tensor_out (#154591)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75578498](https://our.internmc.facebook.com/intern/diff/D75578498/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154591
Approved by: https://github.com/malfet
ghstack dependencies: #154589
2025-06-13 10:01:51 +00:00
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When an entire row of the Q*K attention score is masked with `-inf`, `softmax(score)` outputs `NaN` for the whole row, which would cause model corruption.

With this new flag, it outputs `0` for the whole row, which aligns with PyTorch CPU/CUDA behavior.
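
The difference can be reproduced in plain Python. This is only an illustration of the numerical behavior on a single row, not the XPU kernel; `softmax` and `safe_softmax` are hypothetical helper names.

```python
import math


def softmax(row):
    m = max(row)
    # For a fully masked row, -inf - (-inf) is nan, and nan propagates.
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def safe_softmax(row):
    # Fully masked row: return zeros instead of NaN.
    if all(v == -math.inf for v in row):
        return [0.0] * len(row)
    return softmax(row)


masked = [-math.inf, -math.inf, -math.inf]
naive = softmax(masked)
safe = safe_softmax(masked)
partial = safe_softmax([0.0, -math.inf])
```

Rows with at least one unmasked entry are unaffected: `partial` is the ordinary softmax result.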

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
56b03df6ac [MTIA Aten Backend] Migrate where.self and where.self_out (#154589)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75577304](https://our.internmc.facebook.com/intern/diff/D75577304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154589
Approved by: https://github.com/malfet
2025-06-13 08:25:13 +00:00
3d595fd559 update get start xpu (#151886)
Update the link and product name.
Add a print of the `torch.xpu.is_available()` result to the code snippet, for users who are not running it in the interactive `python` prompt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151886
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 07:46:13 +00:00
53d06e18d9 [dynamo] add missing algorithm header (#154754)
Needed for `std::max(<initializer-list>)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154754
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-06-13 06:56:11 +00:00
6020440683 remove allow-untyped-defs from adaround_fake_quantize.py (#155621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155621
Approved by: https://github.com/Skylion007
2025-06-13 06:14:22 +00:00
99e99d5bfe [a2av] Test must allocate tensors symmetrically (#155835)
This is a requirement of most SHMEM backends. Otherwise, allocations may misalign across ranks.

In this PR, we make the (total) input size and output size a constant number, even though the split sizes are created randomly. (Previously we summed the splits up as the input size, which created misalignment in the SHMEM heap across ranks.)
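
A sketch of the allocation pattern using plain lists in place of symmetric-memory tensors. `WORLD_SIZE`, `MAX_ELEMS`, and `make_rank_buffers` are hypothetical names for illustration and are not taken from the test.

```python
import random

WORLD_SIZE = 4
MAX_ELEMS = 64  # hypothetical fixed buffer size, identical on every rank


def make_rank_buffers(rank):
    rng = random.Random(rank)  # per-rank seed, so splits differ across ranks
    splits = [rng.randint(0, MAX_ELEMS // WORLD_SIZE) for _ in range(WORLD_SIZE)]
    # Allocate the constant size, not sum(splits): only the first
    # sum(splits) elements carry data, the rest is padding.
    inp = [0] * MAX_ELEMS
    out = [0] * MAX_ELEMS
    return splits, inp, out


per_rank = [make_rank_buffers(r) for r in range(WORLD_SIZE)]
```

Every rank ends up with buffers of the same length regardless of its random splits, which is the property the symmetric heap relies on.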

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155835
Approved by: https://github.com/fduwjj, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #155506
2025-06-13 06:05:38 +00:00
0860606729 [export] Add meta[val] to getattr nodes (#154934)
Fixes [P1830293318](https://www.internalfb.com/intern/paste/P1830293318/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154934
Approved by: https://github.com/yushangdi, https://github.com/muchulee8
2025-06-13 05:48:21 +00:00
25717da8c8 [BE] Don't run the same tests on 2xlarge and 4xlarge (#155859)
Also, speedup builds by moving them to 4xlarge instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155859
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr
2025-06-13 05:40:20 +00:00
a87dfc7480 [symm_mem] Update CMakeList to reflect code moving a dedicated folder (#155823)
We moved all symm_mem code into a dedicated folder ([CudaDMAConnectivity](https://github.com/pytorch/pytorch/pull/155573)) but forgot to update the entry for CudaDMAConnectivity in the CMakeList.

Users see the error `RuntimeError: DMA connectivity detector for cuda over nvlink is not available` when calling `torch.distributed.init_process_group(backend=backend)`. This PR should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155823
Approved by: https://github.com/Skylion007
2025-06-13 05:27:59 +00:00
70bb34929a Convert to .md: draft_export.rst, export.ir_spec.rst, fft.rst (#155567)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check.

Docs comparison (check out the 'new' whenever docs build)

1. draft_export ([old](https://docs.pytorch.org/docs/main/draft_export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/draft_export.html))
2. export.ir_spec ([old](https://docs.pytorch.org/docs/main/export.ir_spec.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.ir_spec.html))
3. fft ([old](https://docs.pytorch.org/docs/main/fft.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/fft.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155567
Approved by: https://github.com/svekars
2025-06-13 05:19:43 +00:00
b878ca0c91 [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary:
Add fp8.

Right now FP8 only allows fast_accum.

Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         aten          | 967.1226739883423  | 1136.8895149998868 |  1.219131228979677   |         NA         |
|        triton         | 1764.6185159683228 |  623.08743664783   |  20.373826419003308  | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  |  20.48663099599071   | 82.91718297956578  |
|  cutlass_lvl_default  | 790.5075550079346  | 1390.8932568835019 |  13.788519630907103  | -18.26191482535096 |
|   cutlass_lvl_3332    | 803.7384748458862  | 1367.996757884245  |  226.81587297911756  | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```
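
The derived columns in the table above appear to follow from the forward time alone. The sketch below assumes the usual `2*M*N*K` FLOP count for a matmul and a relative-time definition of `perf_over_aten`; both formulas are inferred from the numbers, not taken from the benchmark script.

```python
def teraflops(m, n, k, time_us):
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / (time_us * 1e-6) / 1e12


def perf_over_aten(time_us, aten_time_us):
    # Positive means slower than aten, negative means faster.
    return (time_us - aten_time_us) / aten_time_us * 100


aten_us = 967.1226739883423
cutlass_us = 790.5075550079346
aten_tflops = teraflops(8192, 8192, 8192, aten_us)
cutlass_delta = perf_over_aten(cutlass_us, aten_us)
```

Under these assumptions a negative `perf_over_aten` means the backend is faster than aten, so `cutlass_lvl_default` takes roughly 18% less time here.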

Differential Revision: D76310809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2ba930d4ce Convert rst to markdown - profiler.rst #155031 (#155559)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [profiler.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/profiler.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155559
Approved by: https://github.com/svekars
2025-06-13 05:02:54 +00:00
e8b3dfa7c0 convert jit_language_reference.rst to jit_language_reference.md (#155633)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

- converted jit_language_reference.rst to jit_language_reference.md

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155633
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 04:58:28 +00:00
3f65e38b73 Convert hub.rst to hub.md (#155483)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155483
Approved by: https://github.com/svekars
2025-06-13 04:39:55 +00:00
0a6b66c881 Inductor comms reorder logs to tlparse (#155737)
Hacked test_inductor_collectives test to demonstrate this works:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/whc/de50ff33-f460-406b-bfa9-457e6e17395b/custom/-_0_0_0/reorder_communication_preserving_peak_memory_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Follow up: it would be nice to move the logging out of this pass and
into the broader comms pass loop, where the before/after each pass
visualization could be logged into the same tlparse file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155737
Approved by: https://github.com/bdhirsh
2025-06-13 02:59:42 +00:00
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option as generating fatbin with PTX only doesn't work on H100, so switch to always include one PTX and one SASS for fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
020da74437 [Easy] Remove empty file (#155796)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155796
Approved by: https://github.com/malfet
ghstack dependencies: #155772
2025-06-13 01:42:11 +00:00
905b194a2e Replace device check of TORCH_INTERNAL_ASSERT with TORCH_CHECK (#155318)
Fixes #136849

## Test Result

```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
    streamdata = torch._C._cuda_getCurrentStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.default_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
    streamdata = torch._C._cuda_getDefaultStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.set_per_process_memory_fraction(0.5, device)  #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
    torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
2025-06-13 01:20:19 +00:00
d7e657da35 pyfmt lint more torch/utils files (#155812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155812
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782, #155783
2025-06-12 23:51:42 +00:00
4d3ecefda5 [aoti][mps] Use cpp sym-expr printer (#155646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155646
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582, #155583
2025-06-12 23:33:28 +00:00
2e65d72e1e [aoti][mps] Fix int/symint kernel args (#155583)
Integer arguments to mps kernels need to go through a different function, since `aoti_torch_mps_set_arg` only takes a Tensor. So I added an `aoti_torch_mps_set_arg_int`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155583
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582
2025-06-12 23:33:28 +00:00
ffbda61fbe [aoti][mps] Fix dynamic dispatch size (#155582)
In the case where we pass in a symint to the `dispatch` call, the compiler errors, so we need to cast the input to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155582
Approved by: https://github.com/malfet
ghstack dependencies: #155752, #154287
2025-06-12 23:33:15 +00:00
a4ab392251 [aoti][mps] mps constants support (#154287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154287
Approved by: https://github.com/malfet
ghstack dependencies: #155752
2025-06-12 23:33:07 +00:00
8821a9dc4e [BE][aoti][mps] Fix tests to use common function (#155752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155752
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-12 23:32:59 +00:00
5ab6a3fb6f [BE] Raise NotImplementedError (#155470)
Raise it when an op is unimplemented for a specific dtype, which makes more sense than a RuntimeError.

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes it has not been implemented for
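
For callers, one relevant detail is that Python's built-in `NotImplementedError` is a subclass of `RuntimeError`. The sketch below uses a hypothetical pure-Python `hardshrink_like` in place of a real ATen op to show how handler order matters.

```python
def hardshrink_like(values):
    # Hypothetical op that only supports floats, mimicking a missing dtype kernel.
    if not all(isinstance(v, float) for v in values):
        raise NotImplementedError('"hardshrink_like" not implemented for non-float input')
    return [v if abs(v) > 0.5 else 0.0 for v in values]


def classify(values):
    try:
        hardshrink_like(values)
        return "ok"
    except NotImplementedError:
        # Must come first: NotImplementedError is also a RuntimeError.
        return "unsupported dtype"
    except RuntimeError:
        return "other runtime failure"


result = classify([1, 2, 3])
still_runtime_error = issubclass(NotImplementedError, RuntimeError)
```

Handlers written as `except RuntimeError` keep matching, while code that compares the exact exception type would observe the change.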

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-12 23:19:12 +00:00
d9b8369f39 fix warning spam for list indexing (#155815)
Per title, #154806 incorrectly placed a warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155815
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-12 23:07:24 +00:00
2903e5ad3c pyfmt lint more export files (#155783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155783
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782
2025-06-12 23:04:11 +00:00
86b1116f22 pyfmt lint torch/_custom_op/* (#155782)
file torch/_custom_op/functional.py does not exist
file torch/_custom_op/__init__.py is empty.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155782
Approved by: https://github.com/Skylion007
2025-06-12 23:04:11 +00:00
4cdbdcdbcf Switch to miniconda for ROCm CI (#155239)
Related to https://github.com/pytorch/pytorch/issues/148335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155239
Approved by: https://github.com/jeffdaily
2025-06-12 22:55:47 +00:00
f04fd4dc4e typing: allow integer in bitwise operations (#155704)
Fixes #155701 (false positives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155704
Approved by: https://github.com/Skylion007, https://github.com/aorenste
2025-06-12 22:40:17 +00:00
938515fa75 [aoti] Update cshim for all backends (#155604)
Fixes https://github.com/pytorch/pytorch/issues/155349
`python torchgen/gen.py --update-aoti-c-shim` will now update all cpu/cuda/mps/xpu shims -- I verified this using `aten._print.default`, but didn't commit the changes since I'm not sure if we actually want to add this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155604
Approved by: https://github.com/desertfire, https://github.com/janeyx99
2025-06-12 22:10:58 +00:00
38bfd462b8 Use swap_tensors path in nn.Module.to for FakeTensor (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Differential Revision: [D76458023](https://our.internmc.facebook.com/intern/diff/D76458023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-06-12 22:08:21 +00:00
db01f1032f Support XPU in memory tracker (#150703)
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.

Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
2025-06-12 21:33:52 +00:00
154a39bfbd basic compile support for grouped_mm (#153384)
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
2025-06-12 21:24:51 +00:00
f2b44424a1 [ROCm] Skip *_stress_cuda and test_ddp_apply_optim_in_backward* (#155724)
These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-06-12 21:18:04 +00:00
590fe4d2d7 Skip updating the default device distributed backend if already registered (#155320)
Motivation:

PyTorch maintains a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend if no backend name is specified in the user frontend (like `init_process_group`).

Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group name is registered as an XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.

Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.
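
The intended behavior can be sketched with a plain dict. The map contents and the `register_backend` helper below are illustrative assumptions, not the actual code in `distributed_c10d.py`.

```python
# Illustrative defaults; the real map lives in torch.distributed.
default_device_backend_map = {"cpu": "gloo", "cuda": "nccl", "xpu": "xccl"}


def register_backend(name, devices):
    # Hypothetical registration hook: keep an existing default, only fill gaps.
    for device in devices:
        default_device_backend_map.setdefault(device, name)


register_backend("my_xpu_backend", ["xpu"])  # xpu already has a default
register_backend("my_npu_backend", ["npu"])  # npu has none yet
```

After both calls, `xpu` still maps to `xccl`, and only the previously unmapped `npu` picks up the newly registered backend.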

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-06-12 21:17:06 +00:00
29391c7cf9 [ez] Mark linalg svd memory allocation test as serial b/c OOMing on cu128 (#155811)
9df2e8020f/1

8e8d4b13b0 (43980565863-box)

started OOMing after switching to cuda 12.8

Maybe because I made some changes to fix the per-process memory fraction, so each process has less memory
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-12 21:05:32 +00:00
093fd47dbe Add an Additional Example that showcases the usage of torch.autograd.functional.jacobian (#155683)
Fixes #132140

As described in the issue, I've added an example that showcases the use of higher-dimensional inputs and outputs, batched inputs, and the vectorize=True with `torch.autograd.functional.jacobian`.

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155683
Approved by: https://github.com/soulitzer
2025-06-12 19:46:55 +00:00
e6d71f3789 Support re-sharding for safetensors checkpoints (#154519)
This change adds support for re-sharding HF safetensors checkpoints.
This is done by adding more metadata when saving each file. The metadata captures the size and offset of each saved shard, and on load it is used to create the chunks of the TensorStorageMetadata class, which makes re-sharding possible.
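
As a 1-D illustration of why offsets and sizes are enough to re-shard, the sketch below computes which parts of the saved chunks a newly requested shard needs. The `overlaps` helper and its tuple layout are hypothetical; the real code works with N-dimensional chunk metadata.

```python
def overlaps(saved_chunks, want_offset, want_size):
    """Return (chunk_index, src_start, dst_start, length) for each saved chunk
    that intersects the requested 1-D slice [want_offset, want_offset + want_size)."""
    plan = []
    want_end = want_offset + want_size
    for idx, (offset, size) in enumerate(saved_chunks):
        start = max(offset, want_offset)
        end = min(offset + size, want_end)
        if start < end:
            # src_start is relative to the saved chunk, dst_start to the new shard.
            plan.append((idx, start - offset, start - want_offset, end - start))
    return plan


saved = [(0, 4), (4, 4)]  # two shards saved as (offset, size)
plan = overlaps(saved, 2, 4)  # new shard covering elements 2..5
```

The resulting plan reads two elements from the tail of the first saved chunk and two from the head of the second.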

Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
2025-06-12 19:38:29 +00:00
d3da03d6fa [2/n]passing event log handler to record function calls (#155457)
Summary: This diff modifies the elastic agent's API to pass the event log handler to the record function calls. This change enables the elastic agent to log events to a specific destination, improving the monitoring and debugging capabilities of the distributed training process.

Test Plan:
unit tests

ran an e2e training job.

Differential Revision: D75194115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155457
Approved by: https://github.com/d4l3k
2025-06-12 19:35:08 +00:00
e085012335 Fix #155020 - rst2markdown for export.rst (split PR) (#155753)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check. This is the 3rd one.

Docs comparison (check out the 'new' whenever docs build)

1. export ([old](https://docs.pytorch.org/docs/main/export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155753
Approved by: https://github.com/sekyondaMeta
2025-06-12 19:30:52 +00:00
4bb936d8b7 refresh expected results (#155817)
Some changes landed while the test was recently unstable, without updating the expected results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
7986c0dba6 rename distributed.rst to md (#155767)
Fixes #155019

For sanity checks, split PR to have this one only include distributed.rst -> distributed.md

Preview -> [distributed.md](https://docs-preview.pytorch.org/pytorch/pytorch/155767/distributed.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155767
Approved by: https://github.com/sekyondaMeta
2025-06-12 18:42:15 +00:00
bcad962550 [BE][Testing] Delete some unused code (#155760)
- Fix typo in class name `OpenRgistration`->`OpenRegistration`
- Use existing `common` alias of `torch.testing._internal.common_utils`, i.e. `s/torch.testing._internal.common_utils.markDynamoStrictTest/common.markDynamoStrictTest/`
- Remove `TEST_CUDA` and `TEST_ROCM`, which are unused in that file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155760
Approved by: https://github.com/albanD
2025-06-12 18:41:53 +00:00
fac0cc16ef [scan] fix doc of scan and list the restrictions. (#155577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155577
Approved by: https://github.com/zou3519
2025-06-12 18:22:28 +00:00
a1257446f8 [AOTInductor] Memory leak fix for Fallback Kernels (#155642)
Summary:
We generate AtenTensorHandles for Fallback kernels regardless of the arg
type. If we indeed "fallback", we regenerate the AtenTensorHandles, so
the first handle that was generated is never recycled, which leaks
memory.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak


Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-12 17:42:56 +00:00
0d3d84d866 [CD] Windows Magma build 12.9 and cuda scripts (#155799)
Scripts needed to build Magma and CUDA on Windows
Same as https://github.com/pytorch/pytorch/pull/146653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155799
Approved by: https://github.com/jeanschmidt
2025-06-12 17:41:24 +00:00
430cc1c636 Run tests on Amazon EC2 M8g Instances (#153940)
Requires machines configured here: https://github.com/pytorch/test-infra/pull/6642

This adds additional test runs against AWS Graviton4 processors, alongside existing runs against AWS Graviton3 and AWS Graviton2 processors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153940
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-06-12 17:33:08 +00:00
522a18bd6c Fix provenance unit test (#155747)
Summary: Fix the test to adapt to the provenance tracking added in D75837494

Test Plan:
```
 buck2 run @//mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_graph_provenance
```

Differential Revision: D76466778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155747
Approved by: https://github.com/YUNQIUGUO
2025-06-12 17:26:43 +00:00
50d8168c8b [DTensor] Support in gradient placement for local_map() (#155181)
Support the `in_grad_placements` argument in torch.distributed.tensor.experimental.local_map(). The argument helps enforce the placement of the gradients of the input DTensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155181
Approved by: https://github.com/wanchaol
2025-06-12 17:07:04 +00:00
6c0b42fd2f [inductor][cutlass backend] Log prescreening elapse (#155508)
Differential Revision: [D76311352](https://our.internmc.facebook.com/intern/diff/D76311352/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155508
Approved by: https://github.com/jingsh
2025-06-12 16:48:52 +00:00
c1ae768baa Basic MTIA ATen CMake (#155477)
Summary: Basic ATen CMake

Differential Revision: D75203592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155477
Approved by: https://github.com/andyanwang, https://github.com/cyyever
2025-06-12 16:29:32 +00:00
f4376cac54 unify symbolic_shapes and sizevars dynamic shapes APIs naming 1 (#154774)
Inductor has a set of APIs that perform symbolic evaluations similar to those of symbolic shapes,
but they operate on sympy expressions instead of SymNodes. The namings are not consistent; this stack
makes them consistent.

Step 1: unify the `statically_known_true` naming for a consistent experience.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154774
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93, https://github.com/eellison
2025-06-12 16:11:55 +00:00
9df2e8020f fix code indentation for fx.md (#155764)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155482

Description:
As discussed here https://github.com/pytorch/pytorch/pull/155482#pullrequestreview-2918032289, I removed indentation for python code blocks as a follow-up modification for fx.md

Checklist:

- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155764
Approved by: https://github.com/svekars
2025-06-12 16:02:33 +00:00
75824035d3 [dynamic shapes] skip fused linear path if not definitely contiguous (#155051)
Falls back to non-fused linear -> add bias path for non-contiguous tensors with unbacked sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155051
Approved by: https://github.com/laithsakka
2025-06-12 15:55:21 +00:00
51560797ce [CI] Reuse old whl: switch default to always (#155572)
Switch default to always reuse old whl

I have a few worries about API rate limits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155572
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 15:43:29 +00:00
62fa3f5aeb Support tuning of _grouped_mm (#153953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153953
Approved by: https://github.com/ngimel
2025-06-12 15:39:35 +00:00
6b3eef6d31 [cutlass backend] Only consider to use re worker if nvcc doesn't exist (#155745)
Differential Revision: D76463340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155745
Approved by: https://github.com/masnesral
2025-06-12 15:23:52 +00:00
851a6fa82d [MPS] Migrate softshrink (forward and backward) to Metal kernel (#155586)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155586
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571
2025-06-12 15:02:43 +00:00
2a3b41cbd0 Revert "[CI] Use setup-python from for Mac tests (#155698)"
This reverts commit 2b9d638e3333e6e9ae324e1486774e83292e1883.

Reverted https://github.com/pytorch/pytorch/pull/155698 on behalf of https://github.com/malfet due to It causes weird flaky failures in MPS and do not upload usage logs anymore ([comment](https://github.com/pytorch/pytorch/pull/155698#issuecomment-2967120676))
2025-06-12 14:42:32 +00:00
0fd711df19 [export] Allow user frame to be None when symbolic shape tries to get stacktrace. (#155744)
Summary: Fixing https://github.com/pytorch/pytorch/issues/155605

Test Plan:
CI

Rollback Plan:

Differential Revision: D76463358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155744
Approved by: https://github.com/angelayi
2025-06-12 14:36:29 +00:00
dd1b6621bc Remove C10_DEPRECATED references in c10 (#151058)
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406.  Only limit the scope to files in c10.

Summary from the original PR,
```
Looking in the code I see

// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].

Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```

Test Plan: CI

Differential Revision: D72762767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
2025-06-12 13:38:03 +00:00
d632cf2cc9 [Easy][Code Clean] Remove the unused and undefined function in pickler (#155772)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155772
Approved by: https://github.com/malfet
2025-06-12 13:03:36 +00:00
8e8d4b13b0 [XPU] Simplify XPU make triton by install from PyTorch source (#155675)
Remove install from source code build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155675
Approved by: https://github.com/atalman
2025-06-12 13:02:23 +00:00
132babe7e0 [user triton] dynamo support for new host-side TMA API (#155662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155662
Approved by: https://github.com/aakhundov
ghstack dependencies: #155510
2025-06-12 12:56:23 +00:00
9cced33c7c [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-12 12:50:36 +00:00
c199a4d0fd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-12 12:42:34 +00:00
eecaa0bbc6 [Multiprocessing] Fix _release_ipc_counter missing in rebuilding cuda ipc tensor with UntypedStorage (#155312)
Fixes https://github.com/pytorch/pytorch/issues/155311

To prevent `torch.multiprocessing.reductions::rebuild_cuda_tensor` from failing on untyped storage, this PR adds `_release_ipc_counter` to UntypedStorage, mirroring the legacy typed storage.

e2d141dbde/torch/storage.py (L1466-L1469)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155312
Approved by: https://github.com/mikaylagawarecki
2025-06-12 10:41:58 +00:00
0029259bdf Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-12 09:58:15 +00:00
d3d655ad14 [Hierarchical-Compile] Hash int args in addition to input shapes (#155655)
Fixes Swsl_resnext101_32x16d in TIMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155655
Approved by: https://github.com/anijain2305
2025-06-12 06:35:12 +00:00
c3ecabf059 [inductor][triton pin] add support for new TMA API for mm.py templates (#155723)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488

For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs).

For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API).

For mm_scaled_grouped.py, https://github.com/pytorch/pytorch/pull/150944 will remove TMA support for Triton 3.2.

Note: we attempted this earlier with https://github.com/pytorch/pytorch/pull/154858, but this broke TMA usage in Triton 3.2.

Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155723
Approved by: https://github.com/NikhilAPatel
2025-06-12 06:25:47 +00:00
2b9d638e33 [CI] Use setup-python from for Mac tests (#155698)
Instead of `setup-miniconda`
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure which other actions are messing with the PATH environment variable)
- Skip `TestMultiprocessing.test_fs_sharing` as even though it completes, it hangs on the shutdown both in CI and in all local setups I have
- Skip `TestCppExtensionOpenRgistration.test_base_device_registration` as it hangs on the shutdown as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515, #155697
2025-06-12 04:58:00 +00:00
57e4d7b5cc [nativert] Move DelegateExecutor to PyTorch core (#155581)
Summary:
Moves DelegateExecutor base class to PyTorch core. It provides the extension point of backend delegation for NativeRT.
Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
This is only a virtual base class. So relying on internal CI is sufficient.

Rollback Plan:

Differential Revision: D76351984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155581
Approved by: https://github.com/zhxchen17
2025-06-12 04:33:31 +00:00
a9d5157e25 [dynamo] Use BINARY_SUBSCR for pre-graph bytecode for regular dict accesses (#155727)
vLLM profiler sets with_stack=True that shows the dict_getitem on the profiler, both inflating the numbers and confusing compile users. This PR keeps BINARY_SUBSCR for regular dicts, while using `dict.__getitem__` only for dict subclasses.

Using BINARY_SUBSCR is a little bit faster, but not enough to yield any major latency improvement.
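
For illustration only (plain Python, not Dynamo code): the sketch below shows how subscripting and `dict.__getitem__` agree for a regular dict but can diverge for a subclass. The `LoudDict` class is hypothetical, and reading this as the motivation for treating subclasses differently is an assumption, not something the PR states.

```python
class LoudDict(dict):
    """Hypothetical dict subclass that overrides item access."""

    def __getitem__(self, key):
        return ("overridden", super().__getitem__(key))


plain = {"a": 1}
sub = LoudDict(a=1)

# For a regular dict, both spellings return the same value.
plain_subscript = plain["a"]
plain_direct = dict.__getitem__(plain, "a")

# For a subclass, subscripting dispatches to the override, while
# dict.__getitem__ reads the underlying dict storage directly.
sub_subscript = sub["a"]
sub_direct = dict.__getitem__(sub, "a")

print(plain_subscript, plain_direct, sub_subscript, sub_direct)
```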

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155727
Approved by: https://github.com/zou3519, https://github.com/StrongerXi, https://github.com/jansel
2025-06-12 04:02:29 +00:00
c9e9a0c823 [inductor][invoke_subgraph] Mark invoke_subgraph outputs as user_visible to constrain output strides (#155395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155395
Approved by: https://github.com/zou3519
2025-06-12 03:58:16 +00:00
9f5153b1a4 Preserve GraphModule node stack trace after torch package deserialization re-tracing (#155638)
Summary:
Currently the node.meta["stack_trace"] is not preserved when we torch package/load a GraphModule, which means the original stack trace is lost. When we re-trace the packaged graph module, we just get a stack trace like fx-generated._0......

This PR adds node.meta["stack_trace"] to the torch-packaged graph module.

Test Plan:
```
buck2 run @//mode/dev-nosan fbcode//caffe2/test:package -- -r  TestPackageFX
```

Rollback Plan:

Differential Revision: D76379692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155638
Approved by: https://github.com/angelayi
2025-06-12 03:48:27 +00:00
ce9ba071fd [BE] Fix warning in open_registration_extension.cpp (#155755)
Namely
```
/Users/nshulga/git/pytorch/pytorch/test/cpp_extensions/open_registration_extension.cpp:306:33: warning: left operand of comma operator has no effect [-Wunused-value]
  306 |   at::Tensor first = at::empty((2,3)).to(at::DeviceType::PrivateUse1);

```

Or: switching between Python and C++ is hard.
In Python `(2, 3)` creates a tuple; in C/C++ it is a comma expression that evaluates to the integer 3.

P.S. I could have vibe-coded the fix with Claude: https://claude.ai/share/82479e88-84cb-4299-aa2f-dafd28ee2d55

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155755
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-06-12 03:01:30 +00:00
d96dec8415 [export] Fix serialization for call_torchbind hop with as_none argument (#155647)
Summary:
As title.

D75251816 broke one internal test. This diff fixes it.

Test Plan: Internal CI

Differential Revision: D76383202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155647
Approved by: https://github.com/ydwu4
2025-06-12 02:59:03 +00:00
b00b641ff1 [Docs] Convert to markdown: accelerator.rst, amp.rst, autograd.rst, backends.rst, benchmark_utils.rst (#155762)
Fixes #155013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155762
Approved by: https://github.com/svekars
2025-06-12 02:55:06 +00:00
b6f84b3b0f [Inductor][CPU] Use AMX-based microkernels when M > 4 for GEMM template for INT4 weight (#155444)
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if input tensor has shape [M, K]. However, we find that AMX kernel brings performance benefit when 4 < M < 16. For example, on a 6th gen of Intel(R) Xeon(R) platform, E2E latency can be improved by up to > 20% when running Llama-3.1-8B on 32 cores for M = 8. So, this PR changes the threshold so that AMX is used when M > 4.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2025-06-12 02:28:48 +00:00
212575f994 [ca] Annotate AccumulateGrad branching and add polyfill tests (#155289)
Annotates AccumulateGrad and tracks the semantics for AccumulateGrad's polyfill, except for Scenario 1.4: Cloning MKLDNN new_grad and Scenario 2.2: Vmap-incompatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155289
Approved by: https://github.com/jansel, https://github.com/albanD
2025-06-12 02:10:52 +00:00
d84efde3f0 Move _storage_Use_Count to be generic (#155451)
# Motivation
`torch._C._storage_Use_Count` should be a generic API that is not aware of device type. It is also used in 337cd7c53d/torchtune/training/_activation_offloading.py (L323) to do some memory optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155451
Approved by: https://github.com/albanD
2025-06-12 01:39:04 +00:00
8372d0986a Revert "[PT2][partitioners] Add aten.split to view_ops list (#155424)"
This reverts commit e1db10e05aa720aef1989773adcf48f311bcf920.

Reverted https://github.com/pytorch/pytorch/pull/155424 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_repro.py::CPUReproTests::test_transpose_with_norm [GH job link](https://github.com/pytorch/pytorch/actions/runs/15596830833/job/43931044625) [HUD commit link](e1db10e05a) but idk how, reverting to see if it fixes the problem ([comment](https://github.com/pytorch/pytorch/pull/155424#issuecomment-2964717706))
2025-06-12 01:38:34 +00:00
9b122aab5d Fix set per proc memory fraction when running tests (#155631)
env setting needs to happen before pool creation for it to take effect

In theory this should fix some OOMs and also cause some OOMs, but this PR is green so idk

alt options: use initializer?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155631
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 01:28:08 +00:00
8ad6197b46 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

By not storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-12 01:18:57 +00:00
4e19477196 [nativert] Move Pytree (#155136)
Summary: fbcode/sigmoid/core/common -> fbcode/caffe2/torch/nativert/common

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:pytree_test
```

OSS CI

Rollback Plan:

Differential Revision: D75965059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155136
Approved by: https://github.com/zhxchen17, https://github.com/XuehaiPan, https://github.com/zou3519
2025-06-12 01:10:34 +00:00
ee5c2908cb [dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592)
as titled. It's sometimes confusing to use PlacementStrategy as a name,
as we also have OpStrategy and TupleStrategy, the latter two contain
the former, so it is better to make the naming clearer.

Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.

Also found some utils can be merged to OpSchema so included together in
this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
2025-06-12 00:51:36 +00:00
7485ef078f Run torch.compile benchmark more frequently on H100 (#155719)
We have more capacity now with 20+ `linux.aws.h100` runners, half of which are idle.  Running the benchmark more frequently would utilize these runners better and provide early signals multiple times per day.  Running every 8 hours to start with.  The workflow usually finishes within 5 hours https://github.com/pytorch/pytorch/actions/runs/15578331612/job/43878878434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155719
Approved by: https://github.com/atalman
2025-06-12 00:24:21 +00:00
9e9484d022 [SymmMem] Enable NVSHMEM for Triton (#155506)
(This is an **Experimental** feature)
Allow Triton kernels to invoke NVSHMEM device functions.

### Example Triton program
Key parts:
- Call `nvshmem.enable_triton()` to initialize;
- Call `nvshmem.putmem_block` in Triton kernel;
- Add `extern_libs` kwarg at kernel invocation.

```
import torch.distributed as dist
import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem
import triton
import triton.language as tl

@triton.jit
def put_kernel(
    dst_ptr,
    src_ptr,
    numel: tl.constexpr,
    peer: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

if __name__ == "__main__":
    # Enable NVSHMEM for Triton
    nvshmem_lib = nvshmem.enable_triton()

    # Use torch Symmetric Memory to allocate Symmetric tensors
    ...

    peer = 1 - rank
    if rank == 0:
        kernel = put_kernel[(1, 1, 1)](
            dst_ptr,
            src_ptr,
            numel=numel,
            peer=peer,
            BLOCK_SIZE=BLOCK_SIZE,
            extern_libs=nvshmem_lib,
        )

    dist.barrier()
    if rank == 1:
        print(f"Rank {rank}: received {out=}")
```

### Test output:
```
$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put
Rank 0: writing value 5 to Peer 1
Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506
Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj
2025-06-12 00:22:49 +00:00
cf9878d7a2 Fix #155022 rst to markdown conversion (#155540)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155022

Docs comparison (check out the 'new' whenever docs build)

1. func.ux_limitations ([old](https://docs.pytorch.org/docs/main/func.ux_limitations.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.ux_limitations.html))
2. func.whirlwind_tour ([old](https://docs.pytorch.org/docs/main/func.whirlwind_tour.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.whirlwind_tour.html))
3. future_mod ([old](https://docs.pytorch.org/docs/main/future_mod.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/future_mod.html))
4. futures ([old](https://docs.pytorch.org/docs/main/futures.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/futures.html))
5. fx.experimental ([old](https://docs.pytorch.org/docs/main/fx.experimental.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/fx.experimental.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155540
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-12 00:21:22 +00:00
7918978653 [dynamo] uploaded full json file of all unimplemented_v2() calls currently in repository (#155758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155758
Approved by: https://github.com/williamwen42
2025-06-12 00:17:28 +00:00
a6210fd07b [c10d] Enhance get_process_group_ranks() to accept group=None (#154902)
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.

Test Plan:
contbuild & OSS CI

Rollback Plan:

Differential Revision: D75817800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
2025-06-11 23:41:03 +00:00
eqy
bd3c32916c [cuDNN] Enabled dilation for deterministic convolutions in cuDNN (#154292)
Provides order-of-magnitude speedup over fallback impl.

https://github.com/pytorch/pytorch/issues/28777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154292
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-11 23:35:52 +00:00
c13e725edd Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata (#154518)
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header because we were just loading the contents of the file directly into tensors with safetensor.load(), which handled the metadata and deserialization. Now, in preparation for re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects. Re-sharding won't work yet, as we need extra metadata to be stored on each save; that will be added in a subsequent PR.
This PR also adds an integration test alongside the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.
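
For illustration only (not the HFStorageReader code): a minimal sketch of reading a safetensors header, assuming the published file layout of an 8-byte little-endian header length followed by a UTF-8 JSON header that maps tensor names to `dtype`, `shape`, and `data_offsets`. The helper name `read_safetensors_header` and the hand-built sample file are hypothetical.

```python
import io
import json
import struct


def read_safetensors_header(stream):
    """Return the JSON header of a safetensors stream as a dict."""
    # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
    (header_len,) = struct.unpack("<Q", stream.read(8))
    return json.loads(stream.read(header_len).decode("utf-8"))


# Build a tiny in-memory file by hand: one float32 tensor of shape [2, 2],
# i.e. 16 bytes of tensor data following the header.
header = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
payload = struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(16)

parsed = read_safetensors_header(io.BytesIO(payload))
for name, info in parsed.items():
    if name == "__metadata__":  # optional free-form metadata entry
        continue
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

The per-tensor dtype, shape, and byte offsets are the kind of information a tensor-level metadata object needs, which a plain bytes-level view of the file does not expose.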

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
2025-06-11 23:35:05 +00:00
1b032384b1 Convert rst files to md (#155369)
Fixes #155021
Fixes #155158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155369
Approved by: https://github.com/svekars, https://github.com/malfet
2025-06-11 23:00:52 +00:00
48921721d8 [MPS] Fix binary builds (#155733)
Introduced by https://github.com/pytorch/pytorch/pull/155611

All functions in those headers must be static and inline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155733
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-06-11 22:55:33 +00:00
c1446e1e9d [easy] revert unintended changes from #152579 (#155614)
Summary:
I accidentally removed a test and a small change in my pr:
https://github.com/pytorch/pytorch/pull/152579
- `test_load_package_multiple_gpus` from https://github.com/pytorch/pytorch/pull/152093

Rollback Plan:

Differential Revision: D76370555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155614
Approved by: https://github.com/jingsh
2025-06-11 22:54:58 +00:00
4a954fc185 [refactor] make do_auto_functionalize_v2 take HopInstance (#154192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154192
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072, #154191
2025-06-11 22:52:37 +00:00
d6be87648f [hop schema] add schema.tree_spec to support pytree inputs (#154191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154191
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072
2025-06-11 22:52:37 +00:00
6ded656aee [hop] auto functionalize invoke_subgraph (#154072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154072
Approved by: https://github.com/zou3519
ghstack dependencies: #155261
2025-06-11 22:52:28 +00:00
20fb8f5d1f [refactor] make check input alias and mutation easier to use (#155261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155261
Approved by: https://github.com/zou3519
2025-06-11 22:52:21 +00:00
61e13782dd [inductor] handle -1 for pointless view pairs (#155295)
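For illustration only (not the Inductor pass itself): a rough sketch of why a `-1` in a view size has to be resolved before two shapes can be compared. It assumes that a "pointless" pair means the final view restores the input's original shape; both helper names are hypothetical.

```python
from math import prod


def resolve_view_size(size, numel):
    """Replace a single -1 in `size` with the dimension implied by `numel`."""
    if list(size).count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    if -1 not in size:
        return list(size)
    known = prod(d for d in size if d != -1)
    if known == 0 or numel % known != 0:
        raise ValueError(f"size {list(size)} is invalid for {numel} elements")
    return [numel // known if d == -1 else d for d in size]


def is_pointless_view_pair(input_shape, final_view_size):
    """True if the last view in a pair just restores the input shape."""
    numel = prod(input_shape)
    return resolve_view_size(final_view_size, numel) == list(input_shape)


print(is_pointless_view_pair([2, 3, 4], [2, -1, 4]))  # -1 resolves to 3
print(is_pointless_view_pair([2, 3, 4], [6, -1]))     # resolves to [6, 4]
```
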
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155295
Approved by: https://github.com/laithsakka, https://github.com/jansel
2025-06-11 22:20:36 +00:00
458cc7213b DOC: Convert to markdown: mobile_optimizer.rst, model_zoo.rst, module_tracker.rst, monitor.rst, mps_environment_variables.rst (#155702)
Fixes #155026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155702
Approved by: https://github.com/sekyondaMeta, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 22:16:04 +00:00
e1db10e05a [PT2][partitioners] Add aten.split to view_ops list (#155424)
Summary: Add `aten.split` to view_ops list in partitioners.py

Test Plan:
na

Rollback Plan:

Differential Revision: D76011951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155424
Approved by: https://github.com/xuanzhang816
2025-06-11 22:12:13 +00:00
f59c76b549 Revert "[BE]: Update cudnn to 9.10.2.21 (#155576)"
This reverts commit 2d3615f577894c7a117a55e85bb8371bb598ec50.

Reverted https://github.com/pytorch/pytorch/pull/155576 on behalf of https://github.com/malfet due to breaks the same test again (I remember there were a version that adjusted tolerances), see bc3972b80a/1 ([comment](https://github.com/pytorch/pytorch/pull/155576#issuecomment-2964404710))
2025-06-11 22:03:45 +00:00
bc3972b80a [reland] Add stack_trace on make_fx (#155486)
Summary:
Previously, we only added the stack trace in class _ModuleStackTracer(PythonKeyTracer) for non-strict export. I moved this stack trace logic to the parent class PythonKeyTracer, so that the graph traced from a Module using make_fx will have stack_trace as well.

Motivation: we've observed some use cases where users first use make_fx on the Module, and then run export on the resulting graph. If the result of make_fx doesn't have a stack trace, the stack trace information is lost.

**User needs to turn this on by passing in `stack_trace=True` to make_fx. We don't make this the default option since this might increase inductor compilation time (`make_fx` is used in inductor to trace graph patterns for pattern matching). It's also turned on if `_inductor.config.trace.enabled` is True.**

**preserving stack trace is on by default for ModuleStackTracer, which is used for non-strict export.**

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
buck run fbcode//caffe2/test/dynamo:test_dynamo -- -k test_autocast_ordering
```

Rollback Plan:

Differential Revision: D76298692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155486
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-11 21:27:43 +00:00
9bd0830ed8 [dynamic shapes] guard_or_false for cat, repeat (#155290)
Summary:
assumes:
- specified repeats are non-negative
- 1d cat arguments like [u0] aren't non-zero sized (replaces existing size-oblivious)

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76092011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155290
Approved by: https://github.com/laithsakka
2025-06-11 21:03:32 +00:00
4609699bfd [MPS] Migrate leaky_relu (forward and backward) to Metal kernel (#155571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155571
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479
2025-06-11 20:58:46 +00:00
f8d93b3783 [MPS] Migrate hardswish (forward and backward) to Metal kernel (#155479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155479
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462
2025-06-11 20:58:46 +00:00
db5970c1a6 [coreml-backend-tool] fix pytorch-backended issue on new coremltools (#155543)
Summary:
The new coremltools exports mlpackage instead of mlmodel by default. When we use the new coremltools 8.0 to convert to the backend, the error is

```
Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Please provide that as well, using the 'weights_dir' argument.
```

Test Plan:
tested with internal workflow

Rollback Plan:

Differential Revision: D76325462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155543
Approved by: https://github.com/shoumikhin
2025-06-11 20:52:26 +00:00
cec264c8c6 remove single remaining gso from compute_stride (#155635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155635
Approved by: https://github.com/ColinPeppler
2025-06-11 20:36:21 +00:00
cc09d3a5ba remove float args benchmark (#155674)
This benchmark is very sensitive. Removing it for now until we make it better.

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
2d3615f577 [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 20:32:07 +00:00
94ae615337 [trymerge] Error on ghstack commit with multiple PRs (#154941)
see https://github.com/pytorch/pytorch/issues/154427#issuecomment-2932941343 for context

Errors if we do not find exactly 1 match in the ghstack commit
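
For illustration only (not the trymerge implementation): a minimal sketch of an "exactly one match" check, assuming the match in question is the `Pull Request resolved:` trailer that landed commits carry. The regex and the `single_pr_number` helper are hypothetical.

```python
import re

# Hypothetical pattern for the trailer that landed commits carry.
PR_TRAILER = re.compile(
    r"^Pull Request resolved: https://github\.com/pytorch/pytorch/pull/(\d+)$",
    re.MULTILINE,
)


def single_pr_number(commit_message: str) -> int:
    """Return the PR number, raising unless exactly one trailer is present."""
    matches = PR_TRAILER.findall(commit_message)
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 PR trailer, found {len(matches)}")
    return int(matches[0])


message = (
    "[trymerge] Error on ghstack commit with multiple PRs\n\n"
    "Pull Request resolved: https://github.com/pytorch/pytorch/pull/154941\n"
)
print(single_pr_number(message))
```

Raising on zero or multiple matches, rather than silently picking the first one, keeps an ambiguous commit from being attributed to the wrong PR.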

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154941
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-11 20:26:50 +00:00
b7a73a2cdb Convert to markdown: export.programming_model.rst (#155659)
Converts only export.programming_model.rst to markdown

Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020, but split into a second PR to pass sanity check

Docs comparison (check out the 'new' whenever docs build)

1. export.programming_model ([old](https://docs.pytorch.org/docs/main/export.programming_model.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155659/export.programming_model.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155659
Approved by: https://github.com/sekyondaMeta
2025-06-11 20:23:46 +00:00
1b6772a90f A small fix in do_bench_using_profiling (#155500)
Summary: Results: https://docs.google.com/document/d/1B_4rtiDFPH_jV3VpnqLPnInwDMpF7yX29G82UoJTcu8/edit?tab=t.0

Test Plan:
```
buck2 run mode/opt  -c fbcode.enable_gpu_sections=true ai_acceleration/float8/benchmarks/bench:bench_fp8_shapes_eval 2>&1 | tee output44.txt
```

Rollback Plan:

Differential Revision: D76298690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155500
Approved by: https://github.com/yoyoyocmu, https://github.com/nmacchioni
2025-06-11 20:06:19 +00:00
1dd0b1d12b Unbreak torch.is_vulkan_available() on Mac (re-send of #154675, please stamp) (#155595)
This is a new PR duplicating #154675 due to merge issues with that PR coming from my old (now updated) version of ghstack.

I am a Vulkan noob, but this extension and flag seem to be necessary. See "Encountered VK_ERROR_INCOMPATIBLE_DRIVER" at https://vulkan-tutorial.com/Drawing_a_triangle/Setup/Instance .

(For anyone trying to repro at home, I have the following homebrew packages installed, not all of which may be necessary: molten-vk, vulkan-headers, vulkan-loader, vulkan-tools, vulkan-utility-libraries. I also have VK_ICD_FILENAMES set to /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json, and I built PyTorch with USE_VULKAN=1. Making sure vkcube works helped me debug this setup.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155595
Approved by: https://github.com/malfet
2025-06-11 19:51:35 +00:00
d1947a8707 Migrate from lru_cache to cache (#155613)
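For context, a minimal sketch of what such a migration usually looks like, assuming it swaps `functools.lru_cache(None)` for `functools.cache` (available since Python 3.9). The `square_old`/`square_new` functions and the `calls` counter are hypothetical and only illustrate that the two spellings memoize the same way:

```python
import functools

calls = {"old": 0, "new": 0}


@functools.lru_cache(None)  # old spelling: LRU cache with no size bound
def square_old(x: int) -> int:
    calls["old"] += 1
    return x * x


@functools.cache  # new spelling: the same unbounded memoization
def square_new(x: int) -> int:
    calls["new"] += 1
    return x * x


# Repeated calls with the same argument hit the cache after the first call.
for _ in range(3):
    square_old(4)
    square_new(4)
```
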
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
f80a61adf5 Revert "[dynamo] added github_cli to detect unimplemented_v2 calls (#155610)"
This reverts commit 5dd07c70e53a86b73f49711b8186d86dc4f1b32a.

Reverted https://github.com/pytorch/pytorch/pull/155610 on behalf of https://github.com/malfet due to Looks like it fails on every pull request, based on https://github.com/pytorch/pytorch/actions/workflows/check-unimplemented-calls.yml, but it does not run on trunk ([comment](https://github.com/pytorch/pytorch/pull/155610#issuecomment-2963929765))
2025-06-11 19:31:55 +00:00
1e373d02d5 [ONNX] Change deprecation message from 2.8 to 2.9 (#155580)
~~The PR: https://github.com/pytorch/pytorch/pull/152478 did not respect the release policy that the deprecation should happen after the deprecation message has been set for 2 releases. This PR postpone 2.8 to the rightful version 2.10.~~

~~NOTE: "as early as" 2.10 shall give ONNX users more time to adapt and provide feedback.~~

To follow the upcoming TorchScript deprecation, `torch.onnx.export` is expected to switch to `dynamo=True` (with `fallback=True` also turned on for backward compatibility) in torch 2.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155580
Approved by: https://github.com/justinchuby, https://github.com/tugsbayasgalan
2025-06-11 19:31:29 +00:00
3f29642ecf Update XLA pin (#155471)
Update pin after XLA PR https://github.com/pytorch/xla/pull/9312 landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155471
Approved by: https://github.com/laithsakka
2025-06-11 19:16:52 +00:00
f8baec8984 Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-06-11 19:12:52 +00:00
6dfada220e [ca] better error message for subclasses not supported by FakeTensor (#155481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155481
Approved by: https://github.com/jansel
ghstack dependencies: #155473, #155570
2025-06-11 19:09:29 +00:00
5dcc718a77 [dynamo][ci] update PYTORCH_TEST_WITH_DYNAMO xfail/skips script for 3.13 (#155570)
No more 311 runners, tested by generating the files for the next PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155570
Approved by: https://github.com/zou3519
ghstack dependencies: #155473
2025-06-11 19:09:29 +00:00
87b002b6fb [ca] make torch.compile API respect ambient disable contexts (#155473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155473
Approved by: https://github.com/jansel
2025-06-11 19:09:29 +00:00
be124a61a4 [MPS] Migrate hardsigmoid (forward and backward) to Metal kernel (#155462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155462
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316
2025-06-11 19:09:23 +00:00
c04a4e7094 Add types to torch/utils/_triton.py (#155612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155612
Approved by: https://github.com/jamesjwu
2025-06-11 19:04:10 +00:00
2002e3a311 [Docs] Convert to markdown: torch.compiler_transformations.rst, torch.compiler.config.rst (#155347)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155347
Approved by: https://github.com/svekars
2025-06-11 18:55:30 +00:00
925fbfca27 Convert fx.rst to fx.md (#155482)
Part of changes #155023 (parent PR #155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155482
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 18:46:35 +00:00
4d9d884c3f [NCCL] Expose new ncclConfig_t flags in 2.27 (#155379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155379
Approved by: https://github.com/Skylion007
2025-06-11 18:26:55 +00:00
247f83e0a4 [dynamic shapes] guard individual terms in sym_and; user-code-friendly sym_and/sym_or (#154737)
Previously, when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead to data-dependent errors on individual checks, e.g. we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10`.

This handles that, and also makes `sym_and/or` user-code friendly, for testing.
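
A toy sketch of the idea in plain Python, not the symbolic-shapes implementation: the hypothetical `FactStore` below records a conjunction either as a single opaque fact (the old behavior described above) or together with each of its terms (the new behavior). Facts are plain strings purely for illustration:

```python
class FactStore:
    """Hypothetical store of assumed facts, keyed by their string form."""

    def __init__(self, split_terms: bool) -> None:
        self.split_terms = split_terms
        self.known: set[str] = set()

    def assume_and(self, *terms: str) -> None:
        # Always remember the conjunction as a whole.
        self.known.add(" & ".join(terms))
        # Optionally remember each term so it can be looked up on its own.
        if self.split_terms:
            self.known.update(terms)

    def is_known(self, term: str) -> bool:
        return term in self.known


before = FactStore(split_terms=False)
before.assume_and("u0 >= 0", "u0 <= 10")

after = FactStore(split_terms=True)
after.assume_and("u0 >= 0", "u0 <= 10")
```

With `split_terms=False` a later query for `u0 <= 10` finds nothing, which mirrors the data-dependent error; with `split_terms=True` the individual term is available.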

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
2025-06-11 18:08:06 +00:00
c1cbaca7fd [CI] Move setuptools requirements from conda to pip (#155697)
Needed for `import z5` to work without warning, otherwise
`LoggingTests.test_logs_out` will fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155697
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515
2025-06-11 18:03:18 +00:00
3a43dba21f Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit dc5e8f7999cccb51efcf0f5fe197a740a918c73d.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/malfet due to It broke ROCM, see c75c732481/1 ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2963708109))
2025-06-11 18:01:08 +00:00
c75c732481 [CI] Disable ET tests (#155708)
I'm tired of seeing red on PRs and it has been consistently broken since May 30th per 59eb61b2d1/10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155708
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-06-11 17:56:52 +00:00
59eb61b2d1 [inductor] Improve GEMM logging to display batch size for batched operations (#155544)
Improves the GEMM overview logging in PyTorch Inductor to properly display batch size information for batched matrix operations like `torch.bmm` and `torch.baddbmm`.

**Fixes #155307**

## Problem

The current GEMM logging for `torch.bmm` shows:
```python
# Repro
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch

M, N, K = 1024, 1024, 1024
dtype = torch.bfloat16
A = torch.randn(10, M, K, device="cuda", dtype=dtype)
B = torch.randn(10, K, N, device="cuda", dtype=dtype)

compiled_model = torch.compile(torch.bmm, fullgraph=True)
_ = compiled_model(A, B)
```

**Before:**
```
Name                 | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------
aten.bmm             | 1024                 | 1024                 | 1024                 | 1
----------------------------------------------------------------------------------------------------
```

The batch size (10) is missing from the logs, making it unclear what the actual operation dimensions were.

## Solution

**After:**
```
Name                           | B                    | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------------------------------------
aten.bmm                      | 10                   | 1024                 | 1024                 | 1024                 | 1
aten.mm                       | -                    | 1024                 | 1024                 | 1024                 | 2
----------------------------------------------------------------------------------------------------------------------------------
```

## Changes Made

### 1. Enhanced Parsing Logic in compile_fx.py
- Detects batched operations by checking if operation name ends with `'bmm'` or `'baddbmm'`
- For batched operations: takes last 4 parts as `batch, m, n, k`
- For non-batched operations: takes last 3 parts as `m, n, k`
- **Dedicated "B" column**: Added separate column for batch size instead of embedding in operation name
- Shows batch size for batched operations, shows "-" for non-batched operations

### 2. Updated All MM Operations for Consistency
- **bmm.py**:
  - Extract batch size from `mat1.get_size()[0]` for both `tuned_bmm` and `tuned_baddbmm`
  - Use positional counter keys: `aten.bmm_{batch_size}_{m}_{n}_{k}`
  - Enhanced log messages to include batch size information

- **mm.py**: Updated counter keys for consistency:
  - `aten.mm_{m}_{n}_{k}` (no batch dimension)
  - `aten.addmm_{m}_{n}_{k}` (no batch dimension)
  - `aten._int_mm_{m}_{n}_{k}` (no batch dimension)
  - `aten._scaled_mm.default_{m}_{n}_{k}` (no batch dimension)
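
A minimal sketch of the parsing rule described above, assuming counter keys have the form `name_m_n_k` or `name_batch_m_n_k`. `parse_gemm_key` is a hypothetical helper written for illustration, not the code in compile_fx.py:

```python
def parse_gemm_key(key: str) -> tuple[str, str, str, str, str]:
    """Split a counter key into (name, batch, m, n, k); batch is "-" if absent."""
    parts = key.split("_")
    # Candidate op name if the last four parts are batch, m, n, k.
    batched_name = "_".join(parts[:-4])
    if len(parts) >= 5 and batched_name.endswith(("bmm", "baddbmm")):
        batch, m, n, k = parts[-4:]
        return batched_name, batch, m, n, k
    # Otherwise only the last three parts are dimensions.
    name = "_".join(parts[:-3])
    m, n, k = parts[-3:]
    return name, "-", m, n, k


rows = [
    parse_gemm_key("aten.bmm_10_1024_1024_1024"),
    parse_gemm_key("aten.mm_1024_1024_1024"),
    parse_gemm_key("aten._int_mm_1024_1024_1024"),
]
```

Taking the dimensions from the end of the key keeps op names that themselves contain underscores, such as `aten._int_mm`, intact.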

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155544
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-06-11 16:57:40 +00:00
7b7cd56f5e [export] support linear & layer_norm unbacked (#155260)
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` where taking the non-contiguous path is acceptable if we encounter a data-dependent error (DDE).
- use ref's contiguous over Aten's contiguous because Aten's version will hit a DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.
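
A toy sketch of the "definitely" pattern in plain Python, with no PyTorch involved. Unknown (unbacked) sizes are modelled as `None`, and every name below is hypothetical. The strict check raises when it cannot decide, while the "definitely" variant answers `False` so the caller falls back to cloning. Zero-sized dimensions are ignored for brevity:

```python
class DataDependentError(Exception):
    """Stands in for a guard on a value that is unknown at trace time."""


def is_contiguous_strict(sizes, strides) -> bool:
    expected_stride = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size is None or stride is None:
            raise DataDependentError("cannot decide contiguity")
        if size != 1 and stride != expected_stride:
            return False
        expected_stride *= size
    return True


def definitely_contiguous(sizes, strides) -> bool:
    # True only when contiguity is provable; undecidable counts as False.
    try:
        return is_contiguous_strict(sizes, strides)
    except DataDependentError:
        return False


def contiguous_action(sizes, strides) -> str:
    # Cloning is always safe, so an undecidable case just pays for a copy.
    return "reuse" if definitely_contiguous(sizes, strides) else "clone"


known = contiguous_action((2, 3), (3, 1))
unknown = contiguous_action((None, 3), (3, 1))
```

The design choice is that a spurious `False` only costs an extra copy, whereas raising would stop tracing.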

## Example DDEs

- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)).  (Size-like symbols: u0)

Caused by: layer_norm = self.layer_norm(linear)  # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```

- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
  File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
  File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
  File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const

torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)).  (Size-like symbols: u0)

Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```

- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2).  (Size-like symbols: u0)

Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
2025-06-11 16:47:34 +00:00
b49edc0d6c [Export] Fix some typos in docstring (#155485)
Summary: nit change, fix the doc string

Test Plan:
CI

Rollback Plan:

Differential Revision: D76297740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155485
Approved by: https://github.com/ColinPeppler
2025-06-11 16:44:38 +00:00
18bf6addc4 set_grad_enabled add str and repr for prints (#155681)
Fixes #86718

## Test Result

```python
>>> import torch
>>> torch.set_grad_enabled(False)
torch.autograd.grad_mode.set_grad_enabled(mode=False)
>>> print(torch.set_grad_enabled(False))
torch.autograd.grad_mode.set_grad_enabled(mode=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155681
Approved by: https://github.com/soulitzer
2025-06-11 16:01:03 +00:00
dc5e8f7999 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-11 15:20:48 +00:00
45c5a23237 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit 5264f8cd8d08272003298cdefe6bd60b1b8c80b4.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Just testing if it will fix PR time benchmarks signal ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2963232606))
2025-06-11 15:18:47 +00:00
359e8f5d69 [CI] Use setup-python from test-infra to do MacOS builds (#155515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155515
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601
2025-06-11 15:11:38 +00:00
9328a7fb58 [triton pin][tests] refactor test_triton_kernel.py tests to test new & old API (#155510)
This splits out the tests so we can independently test both the new and old API.

Note: the new API doesn't work yet - we still need to fix those tests.

Differential Revision: [D76318840](https://our.internmc.facebook.com/intern/diff/D76318840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155510
Approved by: https://github.com/oulgen
2025-06-11 13:52:15 +00:00
4c3da611c2 Add CUDA 12.9.1 x86 nightly binaries (#154980)
Adding CUDA 12.9.1 to nightly binaries matrix for linux (x86) builds.
Also adds the sbsa and libtorch build Docker images; the corresponding builds will be added in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154980
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 13:43:17 +00:00
013cf1e330 [MPS] Move expm1 op to Metal (#155611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155611
Approved by: https://github.com/malfet
2025-06-11 13:06:14 +00:00
44df7cf28d [AOTI] Fix embed_kernel_binary error when max_autotune is ON (#155569)
Summary: Stop removing cubin files so that they are not missing when max_autotune is ON.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155569
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-11 12:27:36 +00:00
f34ab1628b [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-06-11 10:22:45 +00:00
eaceb243df [BE] Update the XPU support package to 2025.1.3 (#154346)
Fixes #153632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154346
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-06-11 09:46:18 +00:00
2585960b47 remove redundant type_id (#155539)
Those were added in https://github.com/pytorch/pytorch/pull/92229 to prevent confusion between overloads,
but the variants that accept SymBool were all removed in https://github.com/pytorch/pytorch/pull/112890
with the introduction of SymbolicShapeMeta.
Hence that dummy arg is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155539
Approved by: https://github.com/ezyang
2025-06-11 08:46:56 +00:00
717a099d42 Revert "[flex attention][triton pin] triton_helpers shim for TMA apis (#154858)" (#155640)
This reverts commit ea7b233015ff00098df687884be4e2efbf7a55fa.

It fails internal tests in fbcode, but they weren't running so we didn't get signal

Reverting w/ a PR/diff because the PR has been landed for ~1 week - too old to revert directly from internal.

Differential Revision: [D76380887](https://our.internmc.facebook.com/intern/diff/D76380887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155640
Approved by: https://github.com/nmacchioni, https://github.com/danzimm
2025-06-11 07:37:47 +00:00
0e2013a12d Add helion x pt2 test (#155513)
This kinda just worked out of the box, shocking. PT2 traced into helion and emitted it as a user defined triton kernel: P1836496774

In the long run, we do not actually want this, but rather to create a helion HOP so we can do fusions etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155513
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-11 07:08:06 +00:00
5b9db4335e Include c++ stack traces when we hit constraint violation (#155603)
Example new error message

```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

Framework stack:
  File "??", line 0, in _start
  File "", line 0, in __libc_start_main_alias_2
  File "??", line 0, in __libc_start_call_main
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
    foo(x)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
    return fn(*args, **kwargs)
  File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
    return self._torchdynamo_orig_callable(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
    result = self._inner_convert(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
    return _compile(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
    tracer.run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
    super().run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
    while self.step():
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
    self.push(fn_var.call_function(self, self.popn(nargs), {}))
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
    return wrap_fx_proxy(tx, proxy)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
    return _wrap_fx_proxy(
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
    return fn()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
    return node.target(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
  File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
  File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
  File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
  File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
    return self._dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
    return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
    final_shape = infer_size(final_shape, shape)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
    torch._check(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
    _check_with(RuntimeError, cond, message)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
    if expect_true(cond):
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
    return a.node.expect_true(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
    return self.guard_bool(file, line)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
    r = self.evaluate()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
    return self.shape_env.evaluate_sym_node(self, size_oblivious)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
    return self.evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
    return self._inner_evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
    return self._evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
    self._maybe_guard_rel(g)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
    self._refine_ranges(expr)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
    self._set_replacement(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
    self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
    torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
  File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
  File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
  File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
  File "??", line 0, in torch::unwind::unwind()

User stack:
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
    return torch.randn(5) * x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
2025-06-11 05:00:36 +00:00
84c14361c2 [ez][AOTI] Add test for std::nullopt return in custom op (#155636)
Summary: As title. Follow up of https://github.com/pytorch/pytorch/pull/154286

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_nullopt_output

Rollback Plan:

Differential Revision: D76378892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155636
Approved by: https://github.com/zou3519, https://github.com/cyyever
2025-06-11 03:52:31 +00:00
1e690b6c41 Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK in set_history (#155453)
Fixes #154357

## Test Result

```bash
>>> import torch
>>>
>>> x = torch.tensor(1, device=torch.device('cpu'))
>>> y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
>>> z0 = (x.abs() * y).prod(dtype=torch.int16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Autograd not support dtype: Short
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155453
Approved by: https://github.com/albanD, https://github.com/soulitzer
2025-06-11 03:46:48 +00:00
110ae0f433 Custom Op handle 1-element tuples (#155447)
Fixes #150472

Modification of [PR 151408](https://github.com/pytorch/pytorch/pull/151408). This PR modifies the return parsing in `infer_schema` to handle the case of a Tuple with a single element.
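
A minimal sketch of the single-element tuple case described above, not the actual `infer_schema` code. The `Tensor` placeholder class, the `SCHEMA_NAMES` table, the `return_schema` helper, and the exact output strings are all assumptions made for illustration, so that the example runs without torch installed.

```python
import typing
from typing import Tuple


class Tensor:
    """Placeholder standing in for torch.Tensor (assumption for this sketch)."""


# Hypothetical mapping from Python annotations to schema type names.
SCHEMA_NAMES = {int: "int", float: "float", bool: "bool", str: "str", Tensor: "Tensor"}


def return_schema(annotation) -> str:
    """Render a return annotation as a schema-style return string."""
    if annotation is None or annotation is type(None):
        return "()"
    if typing.get_origin(annotation) is tuple:
        parts = [SCHEMA_NAMES[arg] for arg in typing.get_args(annotation)]
        # Always parenthesize tuples, including the one-element case, so that
        # Tuple[Tensor] is not confused with a bare Tensor return.
        return "(" + ", ".join(parts) + ")"
    return SCHEMA_NAMES[annotation]


single = return_schema(Tuple[Tensor])       # one-element tuple
pair = return_schema(Tuple[Tensor, int])    # ordinary multi-element tuple
bare = return_schema(Tensor)                # non-tuple return
```

The point of the sketch is only that the tuple branch must not special-case length one away; how the real parser spells the result may differ.
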
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155447
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-11 03:43:40 +00:00
a2b0b2698d inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be about the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
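
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not inductor's implementation: `config_cache_key` and the config names in the sample dictionaries are hypothetical, and only the notion of an explicit ignore list (like `_save_config_ignore`) comes from the description above.

```python
import hashlib
import json


def config_cache_key(config: dict, ignore: frozenset = frozenset()) -> str:
    """Hash every config entry except those explicitly listed in `ignore`.

    Keys with a leading underscore are deliberately NOT skipped, so a private
    config that changes behavior also changes the key.
    """
    relevant = {k: v for k, v in config.items() if k not in ignore}
    payload = json.dumps(relevant, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical config names, chosen only for the example.
ignore = frozenset({"verbose_logging"})
base = {"max_autotune": False, "_hypothetical_private_pass": False, "verbose_logging": False}
private_flipped = dict(base, _hypothetical_private_pass=True)
harmless_flipped = dict(base, verbose_logging=True)

key_base = config_cache_key(base, ignore)
key_private = config_cache_key(private_flipped, ignore)
key_harmless = config_cache_key(harmless_flipped, ignore)
```

Flipping the private config yields a different key, while flipping an ignored config does not, which is the trade-off points (1) and (2) describe.
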

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
2025-06-11 01:33:24 +00:00
5264f8cd8d Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

4. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-11 01:22:06 +00:00
3040ca6d0f [Cutlass] Include fp8 headers in aoti cpp wrapper (#155173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155173
Approved by: https://github.com/desertfire
ghstack dependencies: #154829, #154835, #155195
2025-06-11 01:21:16 +00:00
1ed243f01c Add missing attr access check for legacy autograd.Function (#155055)
Fixes https://github.com/pytorch/pytorch/issues/154981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155055
Approved by: https://github.com/albanD
ghstack dependencies: #154509, #154852
2025-06-11 01:03:49 +00:00
5dd07c70e5 [dynamo] added github_cli to detect unimplemented_v2 calls (#155610)
This PR adds a workflow that runs whenever a PR touches files under torch/_dynamo. It checks for unimplemented_v2() callsites; if any of them have been modified, the workflow fails, lists exactly which callsites changed, and tells the developer which commands to run to update the registry.
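
One way such a check could work is sketched below using Python's `ast` module. This is an assumption about the approach, not the workflow's actual code: `find_calls`, `changed_callsites`, and the sample sources are hypothetical names made up for the example.

```python
import ast


def find_calls(source: str, func_name: str = "unimplemented_v2") -> list:
    """Return (line number, normalized call text) for each call to `func_name`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        target = node.func
        # Handle both `unimplemented_v2(...)` and `module.unimplemented_v2(...)`.
        name = target.id if isinstance(target, ast.Name) else getattr(target, "attr", None)
        if name == func_name:
            found.append((node.lineno, ast.unparse(node)))
    return sorted(found)


def changed_callsites(old_source: str, new_source: str) -> list:
    """Call texts present in the new source but absent from the old one."""
    old = {text for _, text in find_calls(old_source)}
    return [text for _, text in find_calls(new_source) if text not in old]


# Hypothetical before/after versions of a file under torch/_dynamo.
OLD = 'def f():\n    unimplemented_v2(gb_type="A", context="c")\n'
NEW = 'def f():\n    unimplemented_v2(gb_type="B", context="c")\n'

changed = changed_callsites(OLD, NEW)
unchanged = changed_callsites(OLD, OLD)
```

Comparing the unparsed call text rather than raw lines means that pure whitespace or quoting changes would not be reported as modifications.
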

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155610
Approved by: https://github.com/williamwen42
2025-06-11 00:40:56 +00:00
3580b8dde4 [BE] Mention debug=True in AC error messages (#155593)
See https://github.com/pytorch/pytorch/issues/155171#issuecomment-2949415407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155593
Approved by: https://github.com/janeyx99
2025-06-11 00:32:41 +00:00
dbec08bc1c Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566)
Summary:

As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
    -  as a result we can probably remove the HFSavePlanner, but keeping it as a placeholder for now

- the current file naming doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names they will overwrite each other, so this adds a shard-00001 prefix so that the rank files don't collide
- don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1 to 1 ratio between tensor and filename, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file
- make the "fqn_to_file_index" map optional. The map describes which file each tensor is saved in; if users don't want to provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to fit within the 5GB soft limit on HF remote storage file sizes.
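
The naming and the optional map described above can be sketched as follows. This is only an illustration: the exact position of the shard prefix in the file name, the rank numbering, and the helper names `shard_filename` and `plan_files` are assumptions, not the HFStorageWriter implementation:

```python
def shard_filename(rank: int, file_index: int, num_files: int) -> str:
    # Assumed layout: a per-rank "shard-XXXXX" prefix in front of the usual
    # "model-XXXXX-of-XXXXX.safetensors" name, so ranks never share a file name.
    return f"shard-{rank:05d}-model-{file_index:05d}-of-{num_files:05d}.safetensors"


def plan_files(rank, tensor_names, fqn_to_file_index=None):
    """Group tensor names by output file for one rank."""
    # With no map provided, every tensor goes into a single file.
    mapping = fqn_to_file_index or {name: 1 for name in tensor_names}
    num_files = max(mapping.values(), default=1)
    plan = {}
    for name in tensor_names:
        plan.setdefault(shard_filename(rank, mapping[name], num_files), []).append(name)
    return plan


rank1 = plan_files(1, ["layer.weight", "layer.bias"])
rank2 = plan_files(2, ["layer.weight", "layer.bias"])
split = plan_files(1, ["layer.weight", "layer.bias"], {"layer.weight": 1, "layer.bias": 2})
```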

Test Plan: test_hf_storage.py

Reviewed By: saumishr

Differential Revision: D75099862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
2025-06-10 23:37:47 +00:00
2161be8497 Move unary trig ops to metal kernels (#154465)
Move inverse trig unary ops, sinh, & cosh to metal kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154465
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-10 22:56:59 +00:00
c4b93e6579 Replace frame_traced_fn hook with get_traced_code() util (#155249)
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.

This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
2025-06-10 22:40:58 +00:00
8892b782a8 [nativert] move execution planner to torch (#155374)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76167093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155374
Approved by: https://github.com/zhxchen17
2025-06-10 22:36:06 +00:00
ffc6cbfaf7 [symm_mem] Move all symm mem code into a dedicated folder (#155573)
We have arrived at a point where many files are related to symmetric memory and they are scattered around on the cpp side. Let's first put all symmetric-memory-related code into a separate folder. We can do further refactoring later if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
2025-06-10 22:30:11 +00:00
3e131f7779 [CI] Move tlparse to requirements files (#155601)
Not sure why we had it that way to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155601
Approved by: https://github.com/seemethere
ghstack dependencies: #155476, #155493
2025-06-10 22:25:47 +00:00
94da4523ec Disable foreach tests that depend on profiler for CUDA 12.6 (#155596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155596
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-06-10 22:21:06 +00:00
672ac2ec86 Reapply "Cleanup VS 2019 refs in pytorch (#145863)" (#152613) (#155478)
This reverts commit e4f2282.
I believe the fix PR for the issue that triggered the revert, https://github.com/pytorch/pytorch/pull/153480, has landed.
Hence this reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155478
Approved by: https://github.com/malfet
2025-06-10 22:20:14 +00:00
3b7c5e6fa5 Revert "[inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)"
This reverts commit b07725a9516028a485153c4b5356b3e33b990f81.

Reverted https://github.com/pytorch/pytorch/pull/155182 on behalf of https://github.com/davidberard98 due to fails on triton 3.2 (internally) ([comment](https://github.com/pytorch/pytorch/pull/155182#issuecomment-2960664845))
2025-06-10 21:53:01 +00:00
d2f06d2b06 [logs] Change autotune data into separate items (#155525)
Summary: Split the autotune data into multiple keys and items; this is better for storing the data and makes querying easier.

Test Plan:
```
 TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run (sample)
```

Rollback Plan:

Differential Revision: D76303514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155525
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-06-10 21:47:07 +00:00
14f3639e09 Convert to .md: onnx_verification.rst, onnx.rst, package.rst, (#155556)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [onnx_verification.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_verification.rst)
* [onnx.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx.rst)

* [package.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/package.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155556
Approved by: https://github.com/AlannaBurke, https://github.com/sekyondaMeta
2025-06-10 21:40:40 +00:00
ae0f1f8984 Convert to markdown onnx rst (#155228)
Fixes #155030

Converts the following files to MyST markdown and ensure that the doc tests are green:

- [x] [onnx_dynamo_onnxruntime_backend.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo_onnxruntime_backend.rst)
- [x] [onnx_dynamo.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo.rst)
- [x] [onnx_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_ops.rst)
- [onnx_torchscript_supported_aten_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript_supported_aten_ops.rst) - not changed as it is autogenerated
- [onnx_torchscript.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript.rst) - fixed in #155390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155228
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 21:33:07 +00:00
7a03b0d2ca [BE] Remove CUDA 11 artifacts. Fix Check Binary workflow (#155555)
Please see: https://github.com/pytorch/pytorch/issues/147383

1. Remove CUDA 11 build and test artifacts. One place CUDA 12.4
2. Fix Check Binary Workflow to use the stable CUDA version variable rather than a hardcoded one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155555
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-06-10 21:32:08 +00:00
40fefe2871 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit 73220d52fd67b5f4f5b15e0e0433e09733c93f31.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/atalman due to wrong pr description ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2960592004))
2025-06-10 21:13:18 +00:00
8a396c5635 DOC: Convert to markdown: torch.compiler_best_practices_for_backends.rst, torch.compiler_cudagraph_trees.rst, torch.compiler_custom_backends.rst, torch.compiler_dynamic_shapes.rst, torch.compiler_dynamo_deepdive.rst (#155137)
Fixes #155037

[torch.compiler_best_practices_for_backends.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/torch.compiler_best_practices_for_backends.rst) shows error 404

cc  @svekars @sekyondaMeta @AlannaBurke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155137
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 20:51:05 +00:00
01b8f5e685 Convert to markdown: testing.rst, threading_environment_variables.rst, torch_cuda_memory.rst, torch_environment_variables.rst, torch_nccl_environment_variables.rst (#155523)
Fixes #155035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155523
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-10 20:38:36 +00:00
545fbd58dc [export] inline jit.scripted function in export (#155180)
When we export a scripted function, we inline the original callable stored in "_torchdynamo_inline"; this is the same strategy as the torch.compile path.

We do the same thing for script methods, where a "\_\_wrapped\_\_" attribute points to the original callable in most cases. We identified some corner cases: a top-level jit.scripted module's method doesn't have a \_\_wrapped\_\_. In this case, we fall back to the original scripted approach. There may be more such cases, but they need verification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155180
Approved by: https://github.com/zou3519
2025-06-10 20:34:12 +00:00
a666cf3b38 [xla hash update] update the pinned xla hash (#154348)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154348
Approved by: https://github.com/pytorchbot
2025-06-10 20:33:31 +00:00
c9404faacb [refactor] is_known_channels_last_contiguous* -> definitely_channels_last_contiguous* (#155499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155499
Approved by: https://github.com/laithsakka
2025-06-10 20:29:46 +00:00
94763f5ca7 [ROCm][Inductor][CK] add kBatch as runtime parameter to CK-tile gemms (#155389)
Similar to old-CK gemms

### Testing

Rely on existing coverage in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155389
Approved by: https://github.com/chenyang78
2025-06-10 20:25:02 +00:00
ab51a93737 [CI] Set PATH during build to include location of sccache wrapped nvcc (#155464)
Sccache wasn't working for nvcc on jammy, so manually set the path to include where nvcc is

I had problems with always making nvcc a wrapper in some inductor tests where I got
```
sccache: encountered fatal error
sccache: error: PCH not supported by nvcc
sccache: caused by: PCH not supported by nvcc
```
and I also got an error (only on clang) when trying to set CMAKE_CUDA_COMPILER_LAUNCHER to /opt/cache/bin/sccache or sccache
```
ccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: "nvcc warning : Support for offline compilation for architectures prior to \'<compute/sm/lto>_75\' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\nnvcc fatal   : Failed to preprocess host compiler properties.\n"
```

Non jammy cuda jobs' docker images used a different dockerfile, which set CMAKE_CUDA_COMPILER_LAUNCHER e895e9689c/.ci/docker/ubuntu-cuda/Dockerfile (L110)

Alt solution:
Given that I only get the error on clang, I could set CMAKE_CUDA_COMPILER_LAUNCHER=sccache only when not using clang

Setting CUDA_NVCC_EXECUTABLE doesn't fail but also doesn't result in cache hits/misses

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155464
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 20:23:33 +00:00
35e8f2593c [CUDA] Fix missing bounds check in Softmax.cu (#154778)
Uncovered by @ngimel, same as change in #144009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154778
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/malfet
2025-06-10 20:03:54 +00:00
0ca2a79f5b [TEST] Modernize test_sort_large (#155546)
Since its introduction ~4 years ago, the test `test_sort_large` has always been deselected because it requires 200GB of CUDA memory. Now that we do have GPUs this big, it gets selected, but fails because `var_mean` is not a member of `torch.Tensor` and `var_mean` accepts only floating point tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155546
Approved by: https://github.com/eqy
2025-06-10 19:59:12 +00:00
ea23eb4b98 [ez][CI] Reuse old whl: turn off on releases, add docs files to ok list (#155346)
Add docs/**/*.md and docs/**/*.rst to files that are ok to reuse old whls

Prevent using old whls on release branches

Move check for changed files earlier to reduce api usage?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155346
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 19:57:40 +00:00
8a22551300 Fixes OpInfo gradient checks for ctc_loss (#154590)
Fixes #67462

Re-enables `OpInfo` gradient checks for the restricted scenarios where the current `ctc_loss` implementation is accurate and consistent.

The desired `ctc_loss` gradient behavior appears to be an ongoing discussion, see
https://github.com/pytorch/pytorch/issues/52241. The `OpInfo` gradient checks can be updated if/as the underlying implementation advances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154590
Approved by: https://github.com/soulitzer
2025-06-10 19:56:39 +00:00
954ce94950 Add __main__ guards to quantization tests (#154728)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In quantization tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
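
A rough sketch of what such a guard helper might look like. The function name comes from the description above, but the body, the message wording, and the example path are assumptions made for illustration:

```python
def raise_on_run_directly(file_to_call: str) -> None:
    """Fail loudly and tell the user which file they should have run instead."""
    raise RuntimeError(
        "This test file is not meant to be run directly, "
        f"use:\n\n\tpython {file_to_call} TESTNAME\n\ninstead."
    )


# At the bottom of a test file that should not be executed on its own:
#
# if __name__ == "__main__":
#     raise_on_run_directly("test/test_quantization.py")
```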

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154728
Approved by: https://github.com/ezyang
2025-06-10 19:46:07 +00:00
07eb374e7e [dynamo] Avoid unnecessary caching source codegen (#155376)
We only need to cache a source (e.g., `x.y.z`) into a temporary local if
it's used multiple times in the codegen, otherwise we'd just be creating
redundant `DUP` and `STORE_FAST tmp_...` instructions, which might
degrade perf and definitely makes generated bytecode harder to read.

Example:
```python
import torch

@torch.compile(backend="eager")
def fn(x, y):
    return x + y

fn(torch.ones(2), torch.ones(1))
```

Original bytecode:
```verbatim
[0/0] [__bytecode]   3           0 RESUME                   0
[0/0] [__bytecode]
[0/0] [__bytecode]   5           2 LOAD_FAST                0 (x)
[0/0] [__bytecode]               4 LOAD_FAST                1 (y)
[0/0] [__bytecode]               6 BINARY_OP                0 (+)
[0/0] [__bytecode]              10 RETURN_VALUE
```

Modified bytecode (before this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_578c8d9a_2a9b_4d15_bac7_267591cdee32)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 COPY                     1
[__bytecode]              18 STORE_FAST               3 (tmp_1)
[__bytecode]              20 LOAD_FAST                1 (y)
[__bytecode]              22 COPY                     1
[__bytecode]              24 STORE_FAST               4 (tmp_2)
[__bytecode]              26 PRECALL                  2
[__bytecode]              30 CALL                     2
[__bytecode]              40 STORE_FAST               2 (graph_out_0)
[__bytecode]              42 LOAD_FAST                2 (graph_out_0)
[__bytecode]              44 LOAD_CONST               1 (0)
[__bytecode]              46 BINARY_SUBSCR
[__bytecode]              56 DELETE_FAST              2 (graph_out_0)
[__bytecode]              58 RETURN_VALUE
```

Modified bytecode (after this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_2c498af2_ce5c_49cb_abba_a0c7489b09ce)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 LOAD_FAST                1 (y)
[__bytecode]              18 PRECALL                  2
[__bytecode]              22 CALL                     2
[__bytecode]              32 STORE_FAST               2 (graph_out_0)
[__bytecode]              34 LOAD_FAST                2 (graph_out_0)
[__bytecode]              36 LOAD_CONST               1 (0)
[__bytecode]              38 BINARY_SUBSCR
[__bytecode]              48 DELETE_FAST              2 (graph_out_0)
[__bytecode]              50 RETURN_VALUE
```
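
The rule behind this change (only spill a source to a temporary when it is used more than once) can be sketched with a toy code generator. The function `codegen_sources` and its pseudo-instruction strings are hypothetical and much simpler than dynamo's real codegen:

```python
from collections import Counter


def codegen_sources(uses: list[str]) -> list[str]:
    """Emit pseudo-instructions that load each source in `uses`, in order."""
    counts = Counter(uses)
    temps: dict[str, str] = {}
    out = []
    for src in uses:
        if src in temps:
            # Second and later uses read the cached temporary.
            out.append(f"LOAD_FAST {temps[src]}")
            continue
        out.append(f"LOAD {src}")
        if counts[src] > 1:
            # Cache only sources that will be needed again.
            temps[src] = f"tmp_{len(temps) + 1}"
            out += ["COPY 1", f"STORE_FAST {temps[src]}"]
    return out


single_use = codegen_sources(["x", "y"])
repeated_use = codegen_sources(["x.y.z", "x.y.z"])
```

With single-use sources the output contains no `COPY`/`STORE_FAST` pairs, which mirrors the shorter bytecode shown above.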

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155376
Approved by: https://github.com/williamwen42
2025-06-10 19:38:15 +00:00
91ee9ee82d [MPS][BE] Refactor round_decimals shader code to leverage new macro (#155316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155316
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #155304
2025-06-10 19:29:57 +00:00
b1b8e57cda Add __main__ guards to ao tests (#154612)
This is the first PR of a series in an attempt to get the content of #134592 merged as smaller PRs (Given that the original one was closed due to a lack of reviewers).

This specific PR contains:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Update ao tests.

There will be follow-up PRs to update the other test suites, but I don't have permissions to create branches directly on pytorch/pytorch, so I can't create a stack and will have to create them one at a time.

Cc @jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154612
Approved by: https://github.com/jcaip
2025-06-10 18:33:09 +00:00
0b677560e6 [inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid invalid indices. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31.
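
To illustrate why a 32-bit index breaks here, the following sketch emulates int32 wrap-around with `ctypes`. It is not the Inductor code; the helper `as_int32` and the chosen `numel` are made up for the example:

```python
import ctypes


def as_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (wraps on overflow)."""
    return ctypes.c_int32(value).value


def as_int64(value: int) -> int:
    """Reinterpret an integer as a signed 64-bit value."""
    return ctypes.c_int64(value).value


numel = 2**31 + 5          # tensor with slightly more than 2^31 elements
last_index = numel - 1     # largest valid flat index

wrapped = as_int32(last_index)   # overflows and becomes negative
correct = as_int64(last_index)   # stays representable
```

A mask such as `index < numel` computed on the wrapped value would compare a negative number, so the bounds check no longer reflects the real index.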

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-10 18:30:43 +00:00
0f47e76937 [MPS] Implement hardshrink metal kernel (#155304)
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel`  and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.
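
For reference, the element-wise math of hardshrink and its gradient is small enough to write out. This is a scalar Python sketch of the operator's definition, not the Metal kernel; the default `lambd=0.5` follows the documented default of `torch.nn.functional.hardshrink`:

```python
def hardshrink(x: float, lambd: float = 0.5) -> float:
    """Forward: keep x when |x| > lambd, otherwise output zero."""
    return x if abs(x) > lambd else 0.0


def hardshrink_backward(grad_out: float, x: float, lambd: float = 0.5) -> float:
    """Backward: pass the gradient through only where the input was kept."""
    return grad_out if abs(x) > lambd else 0.0


outputs = [hardshrink(v) for v in (-1.0, -0.5, 0.2, 0.75)]
grads = [hardshrink_backward(2.0, v) for v in (-1.0, -0.5, 0.2, 0.75)]
```

The `lambd` scalar is the extra parameter that the extended kernel helpers need to carry.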

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
2025-06-10 18:20:27 +00:00
8347268edc Revert "Make open device registration tests standalone (#153855)"
This reverts commit 8823138e47a3200c313f6bf2d21eb689d8150f39.

Reverted https://github.com/pytorch/pytorch/pull/153855 on behalf of https://github.com/clee2000 due to causing some linux aarch64 tests to fail [GH job link](https://github.com/pytorch/pytorch/actions/runs/15566289293/job/43832373302) [HUD commit link](8823138e47), should be easy fix, rename in places where its mentioned, there might be more than just aarch64 though ([comment](https://github.com/pytorch/pytorch/pull/153855#issuecomment-2960191503))
2025-06-10 18:11:24 +00:00
cb9b479f4f XPU enable XCCL by default (#154963)
Enable USE_XCCL=ON by default when building PyTorch XPU binary, which is on par with NCCL for PyTorch CUDA binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154963
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-10 17:56:13 +00:00
0b6c0898e6 [dynamo] added additional_info functionality (#155526)
Developers can now pass an --additional-info arg to the add and update terminal commands to record any additional remarks they want to make about the graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155526
Approved by: https://github.com/williamwen42
2025-06-10 17:40:50 +00:00
8823138e47 Make open device registration tests standalone (#153855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153855
Approved by: https://github.com/janeyx99
2025-06-10 17:33:26 +00:00
c88e87f355 [Inductor] Set Triton Allocator Function For Use with New TMA API in Inductor (#155373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155373
Approved by: https://github.com/davidberard98
2025-06-10 17:09:04 +00:00
73220d52fd [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-10 16:59:00 +00:00
38c4d05535 [precompile] Ensure @disable()-ed function won't trigger recompile from precompile bytecode. (#155363)
In a precompiled bytecode, it looks like the following:
```
pre-graph bytecode
...
compiled graph code
...
post-graph bytecode
```

In pre-graph bytecode we have calls into helper functions like torch._dynamo.utils.call_size which will invoke @disable inside the bytecode.

Normally torch.compile() will handle these frames fine, but for precompile we load bytecode from a clean state of dynamo and want a way to assert that a recompile never happens, so the current way to ensure this is by doing set_stance("fail_on_recompile") (open to any other ideas to test this, but IMO this is the closest thing we have today).

This approach doesn't work when util functions like call_size() are involved, and this PR fixes a bunch of places to make sure "fail_on_recompile" can skip through the functions meant to be skipped during compilation.

Differential Revision: [D76156867](https://our.internmc.facebook.com/intern/diff/D76156867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155363
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #155329
2025-06-10 16:13:38 +00:00
ddee927f31 [precompile] Add low level C API to load precompiled dynamo code on functions. (#155329)
While loading deserialized dynamo states back from disk, precompile will need a direct way to access ExtraState and populate guarded bytecode as cache entries.

This diff adds two APIs at the code level to load precompiled guard + bytecode entries.
1. _load_precompile_entry() will append an entry to a per-code-object list of precompile entries. These precompile entries will be looked up before normal compiled entries.
2. _reset_precompile_entries() will clean up all the existing installed entries. This is useful to prevent a case where a user calls loading multiple times and explodes the number of entries on the list.
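
The lookup order described above can be modelled with a small pure-Python toy. The class `CodeCache` and its attributes are hypothetical stand-ins for the per-code-object state; the real APIs are implemented in C and operate on ExtraState:

```python
class CodeCache:
    """Toy model of the cache attached to one code object."""

    def __init__(self):
        self.precompile_entries = []  # (guard, bytecode) pairs loaded from disk
        self.compiled_entries = []    # (guard, bytecode) pairs from normal compilation

    def load_precompile_entry(self, guard, bytecode):
        self.precompile_entries.append((guard, bytecode))

    def reset_precompile_entries(self):
        # Drop everything loaded so far, so repeated loads do not pile up.
        self.precompile_entries.clear()

    def lookup(self, frame_state):
        # Precompile entries are consulted before normal compiled entries.
        for guard, bytecode in self.precompile_entries + self.compiled_entries:
            if guard(frame_state):
                return bytecode
        return None


cache = CodeCache()
cache.compiled_entries.append((lambda state: True, "jit"))
cache.load_precompile_entry(lambda state: state == "ok", "pre")
hit_precompiled = cache.lookup("ok")
hit_fallback = cache.lookup("other")
cache.reset_precompile_entries()
after_reset = cache.lookup("ok")
```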

Differential Revision: [D76083247](https://our.internmc.facebook.com/intern/diff/D76083247/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155329
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-06-10 16:13:38 +00:00
e8d29c45e0 [ROCm][TunableOp] Unit test to verify that there is only one kernel launch per PyTorch API invocation. (#155077)
TunableOp UT covers breakage that was fixed in this PR: https://github.com/pytorch/pytorch/pull/153764

After tuning is complete, verify that there is only one kernel launch for each PyTorch API invocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155077
Approved by: https://github.com/jeffdaily
2025-06-10 16:11:43 +00:00
08d15d3ec1 [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155514)
Part of changes #155040 (parent PR #155120)

Follow-up of #155351. I split the changes of `torch.compiler_troubleshooting.rst ` into #155351 and this PR due to the 2000-line limit in one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155514
Approved by: https://github.com/svekars
2025-06-10 15:41:31 +00:00
eb152ab1dd Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 060838c2312ad207c7afe2c86f8a484afea5f328.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/clee2000 due to broke a bunch of tests internally D76299454, probably also broke rocm inductor/test_analysis.py::TestAnalysisCUDA::test_augment_trace_against_flop_counter_maxat0_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545277599/job/43766911025) [HUD commit link](060838c231) ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2959747153))
2025-06-10 15:38:40 +00:00
b44306d368 Add dont constant fold flag (#154945)
To support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a "don't constant fold" flag.
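
The effect of such a flag can be sketched on a toy graph. The `Node` class, its `dont_constant_fold` field, and the `constant_fold` function below are illustrative assumptions and do not reflect Inductor's actual constant folder:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)
    dont_constant_fold: bool = False  # hypothetical per-node opt-out flag


def constant_fold(nodes):
    """Return names of nodes that would be folded (nodes are in topological order)."""
    constants = {n.name for n in nodes if n.op == "constant"}
    folded = []
    for n in nodes:
        if n.op == "constant" or not n.inputs or n.dont_constant_fold:
            continue
        if all(i in constants for i in n.inputs):
            constants.add(n.name)
            folded.append(n.name)
    return folded


def build_graph(protect_dequant: bool):
    return [
        Node("fp8_weight", "constant"),
        Node("dequant", "dequantize", ["fp8_weight"], dont_constant_fold=protect_dequant),
        Node("x", "placeholder"),
        Node("linear", "fp32_op", ["x", "dequant"]),
    ]


folded_without_flag = constant_fold(build_graph(False))
folded_with_flag = constant_fold(build_graph(True))
```

Without the flag the dequantize node is folded into an fp32 constant and the fp8 pattern disappears; with the flag it survives for the later pattern-matching pass.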

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-10 14:52:26 +00:00
68f36683f0 [Testing] Add more models to MPSInductor tests (#155494)
Enable all 46 HuggingFace models, only GPT2ForSequenceClassification fails to compile with a rather strange `array subscript is not an integer` error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 14:40:59 +00:00
c8d39a1045 [docs] Add docstring indicating UB for converting inf to int (#154781)
Fixes #154724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154781
Approved by: https://github.com/malfet
2025-06-10 14:04:50 +00:00
805297981a Revert "[Testing] Add more models to MPSInductor tests (#155494)"
This reverts commit f154f9b3040369a7979d5de7acb6fe21433eda83.

Reverted https://github.com/pytorch/pytorch/pull/155494 on behalf of https://github.com/malfet due to I'm blind ([comment](https://github.com/pytorch/pytorch/pull/155494#issuecomment-2959319787))
2025-06-10 13:45:32 +00:00
e53ddaf1f6 Adapt dtensor tests to be device agnostic (#154840)
## MOTIVATION
This PR includes minor changes to skip some unsupported tests on Intel Gaudi devices as well as to make some of the tests more device agnostic.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

## CHANGES
- test_dtensor_compile.py: Make some of the tests device agnostic (replace hard-coded "cuda" with self.device_type).
- test_dtensor.py and test_comm_mode_features.py: Skip some tests which are unsupported on Intel Gaudi devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154840
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-06-10 12:43:16 +00:00
f154f9b304 [Testing] Add more models to MPSInductor tests (#155494)
Enable all hugging face models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 12:30:38 +00:00
75f258dd1f Fix spelling mistake (#155495)
Summary: Change "primtivies" to "primitives".

Test Plan:
n/a

Rollback Plan:

Differential Revision: D76229938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155495
Approved by: https://github.com/angelayi, https://github.com/cyyever
2025-06-10 09:06:58 +00:00
a205e8fd73 Apply all replacements on backward graph args during inductor codegen. (#155469)
Summary: temporary mitigation for https://github.com/pytorch/pytorch/issues/155468

Test Plan:
NA

Rollback Plan:

Differential Revision: D76096355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155469
Approved by: https://github.com/bobrenjc93
2025-06-10 08:56:18 +00:00
5116293f7e [XPU] Split triton version as 2 files to decouple triton version bump (#155313)
Triton XPU shares its version file with the community one. When the community updates Triton version, it will temporarily break the XPU CI/CD because they use different repositories and commits. To decouple Triton version bumps between the community and XPU, we propose splitting the version into two separate files.

Refer to the latest community Triton version bump PR #153117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155313
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/atalman
2025-06-10 08:49:03 +00:00
2cdcd16e83 [Easy] update pip sources for CUDA in nightly pull tool (#149143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149143
Approved by: https://github.com/ezyang, https://github.com/cyyever
ghstack dependencies: #145685
2025-06-10 08:07:30 +00:00
0319044e92 [Easy] update pip sources for ROCm in nightly pull tool (#145685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145685
Approved by: https://github.com/ezyang
2025-06-10 08:07:30 +00:00
9d2d227003 [CI] Fix XPU runner setup status issue (#155443)
Follow-up to PR #155194: fix the timeout exit code issue; refer to https://github.com/pytorch/pytorch/actions/runs/15526078422/job/43706927778?pr=154962#step:3:74
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155443
Approved by: https://github.com/etaf, https://github.com/atalman, https://github.com/EikanWang
2025-06-10 08:06:37 +00:00
5dfe1787b5 [Inductor] Limit fusions to a node distance of 64 (#154688)
fix for https://github.com/pytorch/pytorch/issues/154652 and https://fb.workplace.com/groups/1075192433118967/permalink/1484799079148049/

[window 128 dashboard run here w/ no regressions](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sun%2C%2001%20Jun%202025%2006%3A38%3A41%20GMT&stopTime=Sun%2C%2008%20Jun%202025%2006%3A38%3A41%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=mlazos/fuse-window&lCommit=8576f00ebfa53567d7bddc89d9882df9eb990561&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)
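
As a rough illustration of what a distance cap means for fusion candidates, the toy sketch below enumerates node pairs by position in a linear order and keeps only pairs inside the window. It is a simplified model under stated assumptions, not Inductor's scheduler code: `candidate_pairs` and `MAX_FUSION_DISTANCE` are hypothetical names, and only the value 64 comes from the commit title.

```python
from typing import List, Tuple

# 64 is the distance named in the commit title; the constant name is hypothetical.
MAX_FUSION_DISTANCE = 64


def candidate_pairs(
    num_nodes: int, max_distance: int = MAX_FUSION_DISTANCE
) -> List[Tuple[int, int]]:
    """Return (i, j) position pairs with i < j and j - i <= max_distance."""
    pairs = []
    for i in range(num_nodes):
        # Only look ahead as far as the window allows.
        last = min(num_nodes - 1, i + max_distance)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs


windowed = candidate_pairs(1000)
unbounded = candidate_pairs(1000, max_distance=1000)
print(len(windowed), len(unbounded))
```

With a cap, the number of candidate pairs in this model grows linearly with the node count instead of quadratically.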

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154688
Approved by: https://github.com/eellison, https://github.com/Raymo111
2025-06-10 07:32:23 +00:00
8b8684466a Add a stub AGENTS.md for Codex (#155459)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155459
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-10 07:20:21 +00:00
b07725a951 [inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)
Follow-up to #154858.

Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API.

First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr).

Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim.
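
The "infer contiguous strides" step mentioned above can be illustrated in plain Python. This is a minimal sketch assuming row-major layout; `contiguous_strides` is a hypothetical helper name and is not taken from `triton_helpers`.

```python
from typing import List, Sequence


def contiguous_strides(shape: Sequence[int]) -> List[int]:
    """Return row-major (C-contiguous) strides, in elements, for `shape`.

    Hypothetical helper for illustration only.
    """
    strides = [1] * len(shape)
    # Walk from the second-to-last dimension backwards: each stride is the
    # product of all sizes to its right.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


print(contiguous_strides([4, 8, 16]))  # [128, 16, 1]
```

Because the strides are fully determined by the shape, a caller that only ever passes contiguous inputs does not need to supply them separately.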

Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155182
Approved by: https://github.com/drisspg
2025-06-10 06:48:42 +00:00
8153340d10 [CI/CD] Remove CUDA 11.8 builds (#155509)
This removes CUDA 11.8 from CI/CD
Please see: https://github.com/pytorch/pytorch/issues/147383

TODO: Follow up by cleaning the CUDA 11.8 config from scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155509
Approved by: https://github.com/cyyever, https://github.com/huydhn, https://github.com/malfet
2025-06-10 05:16:41 +00:00
671a9d175b Add warning for module full backward hook when no input requires gradient (#155339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155339
Approved by: https://github.com/Skylion007
2025-06-10 04:42:06 +00:00
e25ce0f928 [invoke_subgraph] Use eager input vals to constrain input strides (#155291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155291
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-10 04:06:09 +00:00
660695f11d Revert "Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)"
This reverts commit ede6ead8cd8e925cb093f2b3016342e645bd728d.

Reverted https://github.com/pytorch/pytorch/pull/155234 on behalf of https://github.com/clee2000 due to causing a bunch of tests to fail?  ex test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545607752/job/43773157441) [HUD commit link](ede6ead8cd), some of the failures attributed to broken trunk on friday seem real? ([comment](https://github.com/pytorch/pytorch/pull/155234#issuecomment-2957578769))
2025-06-10 03:34:36 +00:00
76644c9ff5 Make require_contiguous require exact strides instead of stride order (#148424)
Make `require_contiguous` require exact strides instead of stride order.
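
The difference between matching a stride order and matching exact strides can be sketched in plain Python. The helpers below are hypothetical and only model the two checks; they are not Inductor's implementation.

```python
from typing import List, Sequence


def contiguous_strides(shape: Sequence[int]) -> List[int]:
    """Row-major strides (in elements) for `shape`."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def stride_order(strides: Sequence[int]) -> List[int]:
    """Dimensions sorted from smallest stride to largest."""
    return sorted(range(len(strides)), key=lambda d: strides[d])


def matches_stride_order(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Weaker check: dimensions are ranked the same way as in a contiguous layout."""
    return stride_order(strides) == stride_order(contiguous_strides(shape))


def matches_exact_strides(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Stronger check: strides equal the contiguous strides element for element."""
    return list(strides) == contiguous_strides(shape)


# A 2x3 layout whose rows are padded to 4 elements.
shape, padded = (2, 3), (4, 1)
print(matches_stride_order(shape, padded))   # True
print(matches_exact_strides(shape, padded))  # False
```

A padded layout like this one passes the stride-order check yet is not contiguous, which is the case the exact-stride requirement rules out.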

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148424
Approved by: https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-06-10 02:36:18 +00:00
b916d8a583 [Testing] Shard MacOS inductor perf tests (#155493)
One dtype per shard, as current job takes 2+ hours to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155493
Approved by: https://github.com/dcci
ghstack dependencies: #155476
2025-06-10 02:26:22 +00:00
52edfb2cbc Updates to README about CUDA install dir and conda not required (#155458)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155458
Approved by: https://github.com/malfet
2025-06-10 01:30:34 +00:00
f34335bf33 Convert compiler rst files to markdown (#155335)
Convert the following compiler rst files to md files:
torch.compiler_inductor_profiling.rst
torch.compiler_ir.rst
torch.compiler_nn_module.rst
torch.compiler_performance_dashboard.rst
torch.compiler_profiling_torch_compile.rst

Fixes #155039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155335
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 01:12:11 +00:00
1851f50866 [AOTI] Add int return type support for custom op in proxy executor (#155465)
Summary:
When a custom op has an int return type in its schema, the returned value is specialized, which differs from the behaviour of a symint return type. This diff **only adds support for the int return type**.

As the returned int will be specialized and fused into downstream kernels (if being used), we can simply skip the int return type in the proxy executor.

Note that in the eager run, the returned int will be specialized to the value defined in the real impl of the custom op. In exported program or in AOTI, the returned int will be specialized to the value defined in the fake impl of the custom op. So the definitions of the return value should be consistent across real and fake impl of the custom op. Otherwise the eager run and AOTI run will have different results.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_int_output
```

Rollback Plan:

Differential Revision: D76159406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155465
Approved by: https://github.com/angelayi
2025-06-10 01:07:15 +00:00
da50835bde [aoti] Support c10 calls (#155256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155256
Approved by: https://github.com/malfet
2025-06-10 00:45:59 +00:00
07e340e29c Build magma-cuda 129 (#155496)
followup for https://github.com/pytorch/pytorch/pull/155340
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155496
Approved by: https://github.com/atalman
2025-06-10 00:32:24 +00:00
e7698ff5cf [MPS] Move abs op to Metal (#155474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155474
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-10 00:23:59 +00:00
7a48cc6990 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit b8bc2c2660e84034ff15232e2161e3ef9a6656d0.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2957346976))
2025-06-10 00:18:23 +00:00
a9a0501ec4 [user triton] mutation analysis for on-device TMA (#155380)
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.

On-device TMA works like this:

```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
    tl._experimental_descriptor_store(workspace_ptr, data, ...)
```

The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain the descriptor data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the descriptor in workspace_ptr: A_ptr, in this case.

The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); then, during mutation analysis, any TMA creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `A_ptr` in the above example).

Consider this example, which I'll use to describe the pros/cons of this approach.

```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)

@triton.jit
def kernel(A, B, workspace_ptr):
    create_tma(A, workspace_ptr)
    workspace_B = workspace_ptr + 128
    create_tma(B, workspace_B)
    data = tl._experimental_descriptor_load(workspace_ptr, ...)
    tl._experimental_descriptor_store(workspace_B, data, ...)
```

An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).

A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.
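
The two-pass idea and its limitation can be modelled with a toy sketch. It assumes a flattened list of operations in which plain strings stand in for IR values; the op names and `mutated_data_pointers` are hypothetical and this is not the actual analysis code.

```python
from typing import List, Set, Tuple

# Toy op forms (hypothetical):
#   ("create_tma", descriptor, data_ptr)
#   ("load", descriptor)
#   ("store", descriptor)
Op = Tuple[str, ...]


def mutated_data_pointers(ops: List[Op]) -> Set[str]:
    # Pass 1: collect every descriptor value that is the target of a TMA store.
    stored = {op[1] for op in ops if op[0] == "store"}
    # Pass 2: a TMA creation on one of those descriptors counts as a
    # mutation of the data pointer the descriptor was created for.
    return {op[2] for op in ops if op[0] == "create_tma" and op[1] in stored}


# The second kernel above, with the create_tma calls flattened.
kernel = [
    ("create_tma", "workspace_ptr", "A"),
    ("create_tma", "workspace_B", "B"),
    ("load", "workspace_ptr"),
    ("store", "workspace_B"),
]
print(mutated_data_pointers(kernel))  # only B is reported

# Limitation: if the store goes through a recomputed value (a different
# name for the same workspace_ptr + 128 offset), the link to B is lost.
recomputed = kernel[:3] + [("store", "workspace_B_recomputed")]
print(mutated_data_pointers(recomputed))  # nothing is reported
```

Matching descriptors by value identity keeps `A` from being marked as mutated, at the cost of missing stores that reach the same descriptor through a separately computed offset.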

Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
2025-06-10 00:07:18 +00:00
2578796e23 Fix sqlite3 in x86 Docker container. (#155211)
Some core modules for versions of python installed in /opt depend on libraries in /usr/local but those libraries are not copied over from the base container.

For example: /opt/python/cp312-cp312/bin/python3 -c "import sqlite3"
ImportError: libsqlite3.so: cannot open shared object file: No such file or directory
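
A small smoke test along these lines can confirm whether core modules import under a given interpreter. It is only a sketch: `failed_imports` is a hypothetical helper, and every module in the list other than `sqlite3` is an assumption rather than something named in this change.

```python
import importlib
from typing import Dict, Iterable


def failed_imports(module_names: Iterable[str]) -> Dict[str, str]:
    """Try importing each module; map each failing name to its error message."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            failures[name] = str(exc)
    return failures


# sqlite3 is the module from the report above; ssl and ctypes are
# assumed examples of other modules backed by shared libraries.
print(failed_imports(["sqlite3", "ssl", "ctypes"]))
```

Running the script with each interpreter under /opt/python should print an empty dict once the required shared libraries are present.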

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155211
Approved by: https://github.com/huydhn
2025-06-09 23:42:02 +00:00
5df3bf13ec [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155351)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155351
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-09 23:18:31 +00:00
e12597090c Revert "Update auto-tuning support for _scaled_grouped_mm (#150944)"
This reverts commit 09328eb02f5412d2211b5fd638ce82d0e03b9c1f.

Reverted https://github.com/pytorch/pytorch/pull/150944 on behalf of https://github.com/davidberard98 due to breaks internal usage & complicates triton pin update - more details in https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957246463 ([comment](https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957248841))
2025-06-09 23:12:56 +00:00
40d02eb481 [Cutlass] Allow filtering by fast_accum for scaled_mm (#155195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155195
Approved by: https://github.com/drisspg
ghstack dependencies: #154829, #154835
2025-06-09 22:46:18 +00:00
2c1a93a0ae Revert "[Graph Partition] move cpu scalar tensor to gpu (#154464)"
This reverts commit c1f531f0b0e6faf443d90f8de2936e866c8c27c2.

Reverted https://github.com/pytorch/pytorch/pull/154464 on behalf of https://github.com/clee2000 due to some of the newly added tests are failing internally, along with some other tests, D75913292 ([comment](https://github.com/pytorch/pytorch/pull/154464#issuecomment-2957201054))
2025-06-09 22:43:20 +00:00
82e6475d92 Add doc for missing functions for torch.special module (#155074)
Fixes #132178

Added all the missing functions that had a docstring but were not present in the documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155074
Approved by: https://github.com/albanD
2025-06-09 22:28:26 +00:00
bdbf2792a8 Fix docs build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

~Also pinning numpy version even though it is technically optional to ensure users have the right version as most users have numpy in their environment anyways.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-09 22:25:20 +00:00
034a7f6437 [BE] Raise better exception in torch.[con]cat[enate] (#155460)
By replacing `TORCH_CHECK` with `TORCH_CHECK_VALUE`

Also make redispatching from aliases even simpler by just calling the
respective original class

Addresses feedback raised in https://github.com/pytorch/pytorch/pull/155383/files#r2133952368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155460
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-09 22:18:00 +00:00
398fca9dcf Add almalinux CUDA 12.9 docker build, required for magma build (#155340)
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155340
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-09 22:10:24 +00:00
ede6ead8cd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-09 22:04:19 +00:00
060838c231 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-09 21:43:21 +00:00
b95dadd717 [MPS] Enable RProp test for non-contiguous (#155439)
I believe this issue has already been fixed, but I don't know the hero PR. I'm relying on CI signals to verify it's fixed across macOS versions.

Fixes #118117

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155439
Approved by: https://github.com/Skylion007
2025-06-09 21:29:09 +00:00
3490a4f906 [MPS] Enable optimizer tests affected by addcdiv (#155437)
Tracked in #118115. Fixed in #124442. This PR unskips the tests.

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155437
Approved by: https://github.com/Skylion007
2025-06-09 21:27:37 +00:00
b8bc2c2660 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-09 21:23:32 +00:00
eba5fc91ac [nativert] Move serialization to PyTorch core (#155229)
Summary:
Serialization contains utilities to deserialize a graph saved on disk in json format as defined in `torch/csrc/utils/generated_serialization_types.h` to the in-memory representation as defined in `torch/nativert/graph/Graph.h`

Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test

Rollback Plan:

Differential Revision: D76012641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
2025-06-09 21:12:30 +00:00
1e6a653234 [ROCm][Inductor][CK] Split ck and ck-tile inductor backend(s) (#155294)
... and fix ck-tile instances not being generated due to incorrect caching

### Testing

Added test cases for CKTILE instances

```
pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294
Approved by: https://github.com/coconutruben
2025-06-09 20:40:26 +00:00
620415e018 Revert "Add stack_trace on make_fx (#155155)"
This reverts commit d4d0ede6bacb4b3b33c0e4aa4cb0e79d34e697ec.

Reverted https://github.com/pytorch/pytorch/pull/155155 on behalf of https://github.com/malfet due to Not sure why it was merged, it indeed breaks those tests in CI ([comment](https://github.com/pytorch/pytorch/pull/155155#issuecomment-2956973633))
2025-06-09 20:40:13 +00:00
abbdf9f363 [BE][Testing] Unskip ones_like/zeros_like testing on MPS (#155476)
But skip `double` dtype from OpInfo variants for this test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155476
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-09 20:37:44 +00:00
ea37f72099 enable test (#155342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155342
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #154768
2025-06-09 19:26:05 +00:00
d4d0ede6ba Add stack_trace on make_fx (#155155)
Summary:
Previously, we only added the stack trace in `class _ModuleStackTracer(PythonKeyTracer)` for non-strict export. I moved this stack trace logic to the parent class `PythonKeyTracer`, so that the graph traced from a Module using make_fx will have stack_trace as well.

Motivation: we've observed some use cases where users first use `make_fx` on the Module, and then run `export` on the resulting graph. If the result of `make_fx` doesn't have a stack trace, the stack trace information is lost.

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
```

Rollback Plan:

Differential Revision: D75985427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155155
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-09 18:31:57 +00:00
2aade5ee9f Fix weight tensor documentation #134896 (#155093)
Fixes #134896

## Description

Remove line about 'weight' tensor needing to be of floating point type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155093
Approved by: https://github.com/AlannaBurke
2025-06-09 18:07:21 +00:00
3863bbb55b [BE]: Update cusparselt to 0.7.1 (#155232)
Needed to support sparse operations on Blackwell, and implements new features for the library. Also optimizes library sizes vs 0.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155232
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-06-09 18:01:23 +00:00
79bdafe5b6 Revert "Custom FX pass for inductor's backend registration (#154841)"
This reverts commit e694280d1215caf70f41575f2611bfa26c69ebdb.

Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))
2025-06-09 16:56:45 +00:00
0083032e75 [aotd] Support mutations in reordering_to_mimic_autograd_engine (#155353)
Original issue: https://github.com/pytorch/pytorch/issues/154820

Dedicated sub-issue: https://github.com/pytorch/pytorch/issues/155242

The backward graph is reordered by partitioners.py: reordering_to_mimic_autograd_engine,

which only records in the backward graph the compute that starts from tangents.

A mutation of primals (inputs) in backward can be disconnected from the rest of the backward graph.

This copy_ is handled specifically, because the framework adds this mutation and it is the only mutation that exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155353
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-09 16:39:47 +00:00
6c05f2fca0 [test] use JK to force graph break on slow aliasing/mutation/dynamic_shape behavior (#155257)
Summary: test to unblock shampoo, needs cleanup

Test Plan:
CI

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:aliased_inputs_with_mutation_and_dyn_shapes_killswitch
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: Set it to false.

Reviewed By: c00w

Differential Revision: D76051868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155257
Approved by: https://github.com/c00w
2025-06-09 16:21:59 +00:00
4a4cac0cef Update torch-xpu-ops commit pin (#154962)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](a3a196ccdb), which includes:

- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL as USE_ONEMKL_XPU and set it as default ON
- Refactor USE_XCCL & USE_C10D_XCCL option
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154962
Approved by: https://github.com/EikanWang
2025-06-09 15:54:13 +00:00
b9b84d8011 Generate unique id for tensor storage object by observing the weak pointer of tensor storage object (#154859)
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes for each element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of the tensor storage object to a unique id. If a new memory address is found, it is put into that hash table and associated with a new id.
This implementation does not guarantee that the storage id is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
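
The address-reuse problem described above can be illustrated with a small Python sketch. This is not the C++ code in ExecutionTraceObserver; `Storage`, `StorageIdTable`, and the explicit `address` argument are hypothetical names used only to model the idea of pairing each address with a weak reference, so that a dead or replaced object at the same address gets a fresh id.

```python
import weakref


class Storage:
    """Hypothetical stand-in for a tensor storage object."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes


class StorageIdTable:
    """Illustrative lookup table: address -> (weak reference, id)."""

    def __init__(self) -> None:
        self._entries = {}
        self._next_id = 0

    def get_id(self, address: int, storage: Storage) -> int:
        entry = self._entries.get(address)
        if entry is not None:
            ref, storage_id = entry
            # The address alone is not enough: the weak reference must
            # still resolve to this exact object.
            if ref() is storage:
                return storage_id
        # Either a new address, or the old object died and the address
        # was handed to a different storage object.
        self._next_id += 1
        self._entries[address] = (weakref.ref(storage), self._next_id)
        return self._next_id


table = StorageIdTable()
first = Storage(16)
first_id = table.get_id(0x1000, first)
same_id = table.get_id(0x1000, first)      # same live object, same id
del first
second = Storage(16)
second_id = table.get_id(0x1000, second)   # simulated address reuse
```

The address is passed in explicitly here so that reuse can be simulated deterministically; a table keyed on the address alone would have returned `first_id` for `second`.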

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D75749065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154859
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-06-09 15:46:27 +00:00
79aef14169 [ONNX] Set the name of the producing node using the value name (#155413)
When comparing two graphs exported using different opset versions, even though the value names are the same in both graphs, the node names did not match, causing model-explorer to not be able to sync the two graphs. This change updates the names of the nodes that directly produce the output values, for better correspondence across exported graphs.

![image](https://github.com/user-attachments/assets/3c00ca18-221f-4add-8429-4bcf12069036)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155413
Approved by: https://github.com/cyyever, https://github.com/xadupre
2025-06-09 13:03:58 +00:00
e15848669f [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268)
Summary:

**Problem Statement**
Currently, torch distributed elastic does not support an option to specify the destination for event logging from torch.distributed.run.
*recording events to default destination:* https://fburl.com/code/7f9b0993
The default destination is "null".

***Solution***
Add an option in torch.distributed.run to specify event_logging_destination. The default value will be "null", which is the current default, so it won't affect users unless they specify it via the command line.

Test Plan:

https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Rollback Plan:

Reviewed By: kiukchung

Differential Revision: D75183591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268
Approved by: https://github.com/d4l3k
2025-06-09 10:43:52 +00:00
9968c854b6 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/tensor.py (#153146)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/tensor.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153146
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-06-09 06:27:50 +00:00
9b4a748e29 [nativert] Move Weights to PyTorch core (#155156)
Summary:
Moves Weights class to PyTorch core
Torch Native Runtime RFC: pytorch/rfcs#72
 README: https://github.com/pytorch/pytorch/blob/main/torch/nativert/OVERVIEW.md

Test Plan: buck2 run mode/dev-nosan caffe2/test/cpp/nativert:weights_test

Differential Revision: D75973156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155156
Approved by: https://github.com/zhxchen17
2025-06-09 05:49:32 +00:00
6fb6293159 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit c6b4f98625bb6b22bb9a60112a6d58e684a97e1b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/etaf due to This is breaking tests on xpu, detail log: https://hud.pytorch.org/pr/pytorch/pytorch/154962#43700962849 ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2954517883))
2025-06-09 03:13:27 +00:00
be2ad70cfa Fix dynamo tracing into AOTAutogradCache results in cpu tensors (#155251)
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.

On an AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect in certain cases of cpu tensors' backwards, where the backward is being done in python land, and dynamo unnecessarily tries to trace through the inductor generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already.

```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]

```

This PR fixes the issue and adds a unit test showing that, with or without a cache hit, the frames dynamo traces are identical.
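
The cache-hit behavior described above can be modeled with a toy Python sketch. Everything here is hypothetical (`disable`, `compile_backward`, `BackwardCache`, and the `rewrap_on_hit` flag are illustrative names, not PyTorch APIs); it only shows why a wrapper applied inside the compile path is lost when a cached artifact is returned directly, and one way it could be re-applied.

```python
def disable(fn):
    """Hypothetical stand-in for marking a callable as 'do not trace'."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper.tracing_disabled = True
    return wrapper


def compile_backward(graph):
    # Pretend compilation: the "compiled" artifact just calls the graph.
    return lambda *args: graph(*args)


class BackwardCache:
    """Toy cache that stores the raw compiled artifact per key."""

    def __init__(self, rewrap_on_hit: bool) -> None:
        self._artifacts = {}
        self._rewrap_on_hit = rewrap_on_hit

    def get(self, key, graph):
        if key in self._artifacts:
            artifact = self._artifacts[key]
            # On a hit the compile path is skipped, so the wrapper has to
            # be re-applied here or it is silently lost.
            return disable(artifact) if self._rewrap_on_hit else artifact
        artifact = compile_backward(graph)
        self._artifacts[key] = artifact
        return disable(artifact)


def double(x):
    return 2 * x


buggy = BackwardCache(rewrap_on_hit=False)
buggy_miss = buggy.get("k", double)
buggy_hit = buggy.get("k", double)

fixed = BackwardCache(rewrap_on_hit=True)
fixed_miss = fixed.get("k", double)
fixed_hit = fixed.get("k", double)
```

In this model only `buggy_hit` lacks the `tracing_disabled` marker, which mirrors the mismatch between the cache-miss and cache-hit paths that the commit message describes.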

Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-06-09 02:06:16 +00:00
2908c10259 Document the default garbage_collection_threshold value and improve the organization of cuda docs (#155341)
Fixes #150917

As mentioned in the issue, I've updated the documentation of `garbage_collection_threshold` and improved the organization.

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155341
Approved by: https://github.com/AlannaBurke, https://github.com/ngimel
2025-06-08 22:09:35 +00:00
d41f62b7a0 Fix/issue #155027 (#155252)
Fixes #155027
Converted RST files to Markdown

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155252
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-08 21:17:31 +00:00
3d82a1dfb5 Add checks for empty tensor list (#155383)
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might happen while resolving the named dimension to a position by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```
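
The ordering issue in the backtrace above (touching `tensors[0]` before the emptiness check runs) can be sketched in plain Python. `resolve_dim_by_name` is a hypothetical helper, not a PyTorch function, and tensors are modeled as dictionaries with a `names` entry; the only point is that the guard has to come before the first element is accessed.

```python
def resolve_dim_by_name(tensors, dim_name):
    """Return the position of `dim_name` in the first tensor's names."""
    # Guard first: indexing an empty list is the Python analogue of
    # dereferencing the null tensor seen in the backtrace.
    if len(tensors) == 0:
        raise ValueError("expected a non-empty list of tensors")
    names = tensors[0]["names"]
    if dim_name not in names:
        raise ValueError(f"Name {dim_name!r} not found in {names}.")
    return names.index(dim_name)


position = resolve_dim_by_name([{"names": ("N", "C")}], "C")

try:
    resolve_dim_by_name([], "C")
    empty_error = None
except ValueError as exc:
    empty_error = str(exc)
```

`ValueError` is chosen here as an assumption that matches the TODO about `TORCH_CHECK_VALUE`; the actual exception type in the PR may differ.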

TODOs:
 - Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace  `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes https://github.com/pytorch/pytorch/issues/155306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: #155382
2025-06-08 18:53:19 +00:00
95448b2ce6 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 65b1aedd09e98fcafcdd893ca4924f4fa598fd18.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/clee2000 due to see henry's comment above.  This was reverted internally because it causes a memory leak and OOMs on AMD? ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2954192879))
2025-06-08 17:37:29 +00:00
30293b8b5e Preserve Enum types during torch.export serialization and deserialization (#154821)
Fixes #154674

Addresses an issue where `torch.export` does not correctly preserve Python `Enum` types during the save/load round-trip. Previously, Enum inputs were serialized by value only, causing their type to be lost after deserialization.
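
The difference between serializing an Enum by value and preserving its type can be shown with a standard-library sketch. This is not the `torch.export` serializer; `Color`, `ENUM_REGISTRY`, and the helper functions are hypothetical and only illustrate why the type is lost when the value alone is stored.

```python
import enum
import json


class Color(enum.Enum):
    RED = 1
    GREEN = 2


# Hypothetical registry used to look the Enum class up again on load.
ENUM_REGISTRY = {Color.__qualname__: Color}


def dump_by_value(member: enum.Enum) -> str:
    # Only the raw value survives; the Enum type is gone after loading.
    return json.dumps(member.value)


def dump_with_type(member: enum.Enum) -> str:
    # Record which Enum class and which member were used.
    return json.dumps(
        {"enum": type(member).__qualname__, "name": member.name}
    )


def load_with_type(payload: str) -> enum.Enum:
    data = json.loads(payload)
    return ENUM_REGISTRY[data["enum"]][data["name"]]


lossy = json.loads(dump_by_value(Color.GREEN))
restored = load_with_type(dump_with_type(Color.GREEN))
```

After the round trip, `lossy` is a plain integer while `restored` is the original `Color.GREEN` member.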

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154821
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007, https://github.com/yushangdi, https://github.com/angelayi
2025-06-08 17:30:31 +00:00
27df0c56b7 Revert "[inductor] use int64 for large index (#154575)"
This reverts commit 2596e3d0617852469241be8777cf46db5c83928c.

Reverted https://github.com/pytorch/pytorch/pull/154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](2596e3d061), note for self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154575#issuecomment-2954175761))
2025-06-08 16:58:59 +00:00
49888e6be0 [BE] Polish Makefile (#155425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155425
Approved by: https://github.com/ezyang
2025-06-08 16:37:12 +00:00
b981fb6744 Add docblock to torch/_dynamo/variables/builtin.py (#155402)
Add comprehensive module docstring explaining built-in function and type
variable tracking, including handling of Python built-ins, type constructors,
operators, and special constructs during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155402
Approved by: https://github.com/Skylion007
ghstack dependencies: #155403
2025-06-08 15:24:29 +00:00
09328eb02f Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
2025-06-08 10:18:13 +00:00
1339e88105 Add docblock to torch/_dynamo/side_effects.py (#155403)
Add comprehensive module docstring explaining side effect tracking and
management, including mutation tracking, context changes, aliasing,
and state preservation during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155403
Approved by: https://github.com/williamwen42
2025-06-08 07:02:30 +00:00
0756ebcd48 Add docblock to torch/_dynamo/trace_rules.py (#155401)
Add comprehensive module docstring explaining the tracing rules and policies
that govern TorchDynamo's compilation decisions, including skip rules,
inlining policies, and library-specific handling.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155401
Approved by: https://github.com/williamwen42
2025-06-08 04:30:03 +00:00
abf4da0d24 [Profiler] Induce Inductor Import before Profiling (#155243)
Fixes #151829
Summary:
Currently, inductor has a lazy init which causes certain aten ops to run during a profiling run. This ends up cluttering the function events, especially for smaller traces. One of the attempts to fix this was to simply remove that import from the profiler entirely, but it looks like the import happens somewhere downstream anyway and the events still flood our profile.

To fix this, we induce the inductor import during prepare trace if the inductor is present. This way regardless of how the workload imports the inductor the actual init process will be done before tracing starts, resulting in more accurate tracing.
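
The "import it up front if it is present" idea can be sketched with the standard library alone. `preload_if_available` is a hypothetical helper, not the profiler's actual code, and the module names in the usage lines are placeholders chosen so the sketch runs without PyTorch installed.

```python
import importlib
import importlib.util
import sys


def preload_if_available(module_name: str) -> bool:
    """Import `module_name` now if it can be found; report whether it was."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        spec = None
    if spec is None:
        return False
    # Pay the one-time initialization cost before measurement starts.
    importlib.import_module(module_name)
    return True


loaded = preload_if_available("json")
missing = preload_if_available("no_such_pkg_for_sketch.submodule")
```

Calling such a helper before tracing begins means any lazy initialization triggered by the import happens outside the measured region, regardless of where the workload imports the module later.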

Test Plan:
Added test, also ran N7316820 manually and went from getting many events on the first run to the following output (only difference is Runtime Triggered Module Loading which is CUPTI overhead event):

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                             aten::mul_         1.40%     340.638us        99.92%      24.390ms      24.390ms       1.535us       100.00%       4.605us       4.605us             1
                                       cudaLaunchKernel         0.60%     146.533us        98.52%      24.049ms      24.049ms       0.000us         0.00%       3.070us       3.070us             1
                       Runtime Triggered Module Loading         6.14%       1.500ms         6.14%       1.500ms       1.500ms       1.535us       100.00%       1.535us       1.535us             1
                       Runtime Triggered Module Loading        91.78%      22.403ms        91.78%      22.403ms      22.403ms       1.535us       100.00%       1.535us       1.535us             1
                       void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.535us       100.00%       1.535us       1.535us             1
                        cudaDeviceSynchronize         0.08%      20.031us         0.08%      20.031us      20.031us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                   aten::mul_        82.81%     484.396us        94.26%     551.378us     551.378us       1.440us       100.00%       1.440us       1.440us             1
                                   cudaLaunchKernel        11.45%      66.982us        11.45%      66.982us      66.982us       0.000us         0.00%       0.000us       0.000us             1
                                  void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.440us       100.00%       1.440us       1.440us             1
                                  cudaDeviceSynchronize         5.74%      33.581us         5.74%      33.581us      33.581us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Rollback Plan:

Differential Revision: D76056511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155243
Approved by: https://github.com/ngimel
2025-06-07 23:58:50 +00:00
f1f49e56b0 [CI] remove xfail sm89 job (#155244)
No need to collect more data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155244
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/Skylion007
2025-06-07 21:04:57 +00:00
11bc29856d Fix some incorrect reST markups in the document (#154831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154831
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-07 19:09:46 +00:00
2596e3d061 [inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid invalid indices. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31.
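
The 2^31 limit can be made concrete with a short sketch. `index_dtype_for` is a hypothetical helper, not Inductor's implementation, and the dtype names are returned as plain strings; `ctypes.c_int32` is used only to show how a value just past the int32 range wraps around.

```python
import ctypes

INT32_MAX = 2**31 - 1


def index_dtype_for(numel: int) -> str:
    """Pick an index type wide enough to hold `numel` itself."""
    # Conservative rule: the element count used in the mask comparison
    # must be representable, not just the largest index.
    return "int32" if numel <= INT32_MAX else "int64"


# ctypes performs no overflow checking, so 2**31 wraps to a negative value.
wrapped = ctypes.c_int32(2**31).value

small = index_dtype_for(1024)
large = index_dtype_for(2**31 + 1)
```

A mask such as `index < numel` computed with a wrapped value like this would compare against a negative number, which is the kind of invalid result a wider index type avoids.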

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-07 18:41:46 +00:00
cyy
f6e18bc105 Fix CUDA 12.8 docker tag (#155087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155087
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-06-07 16:39:42 +00:00
783a4c1f50 [ROCm] fix nightly wheel, second attempt (#155388)
Fixes #155207. The hipsparselt logic was still broken, but the smoke test didn't catch it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155388
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 15:57:55 +00:00
ab56e5add9 [CUDA][BUILD] Add back the capability to use env TORCH_CUDA_ARCH_LIST (#155314)
Add back the capability to use the env variable TORCH_CUDA_ARCH_LIST to control how downstream projects (which use find_package (torch)) build.

Follow up to: https://github.com/pytorch/pytorch/pull/152715

Before this PR,
On a CPU only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case there would be no GPU detected and an excessive list of cuda architectures may be used. This also means that there is no way to build a binary that would be targeting a different SM on the current machine a developer is using.

After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.
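
The before/after behavior can be summarized as a small decision function. This is a Python model of the selection order only, not the CMake logic touched by the PR; `select_cuda_archs`, `detected`, and `fallback` are hypothetical names, and the accepted separators (semicolons or whitespace) are an assumption about the variable's format.

```python
import re
from typing import List, Mapping


def select_cuda_archs(
    env: Mapping[str, str], detected: List[str], fallback: List[str]
) -> List[str]:
    """Prefer an explicit TORCH_CUDA_ARCH_LIST over auto-detection."""
    requested = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if requested:
        # An explicit list wins, even on a machine without a GPU.
        return [arch for arch in re.split(r"[;\s]+", requested) if arch]
    # Otherwise use what was detected, or a broad fallback list.
    return detected if detected else fallback


explicit = select_cuda_archs(
    {"TORCH_CUDA_ARCH_LIST": "8.0;9.0+PTX"}, [], ["7.0", "8.0", "9.0"]
)
cpu_only = select_cuda_archs({}, [], ["7.0", "8.0", "9.0"])
```

With the variable set, the CPU-only case no longer falls through to the broad fallback list.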

p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
2025-06-07 15:52:39 +00:00
456f40cb09 Add docblock for autotune_cache.py (#155133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155133
Approved by: https://github.com/aorenste
2025-06-07 14:50:09 +00:00
29e6033ff3 [Break XPU] Fix failed test cases which are introduced by community for XPU. (#155317)
Fixes #155186, Fixes #154701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155317
Approved by: https://github.com/jansel
2025-06-07 14:46:30 +00:00
694028f502 update get_default_device to also respect torch.device ctx manager (#148621)
Fixes https://github.com/pytorch/pytorch/issues/131328
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148621
Approved by: https://github.com/ezyang
2025-06-07 14:26:17 +00:00
db491825e0 [invoke_subgraph] Add logging (#155284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155284
Approved by: https://github.com/zou3519
ghstack dependencies: #155270
2025-06-07 11:31:53 +00:00
0f3f59784d [invoke_subgraph] Throw assertion on uncaptured speculate_subgraph (#155270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155270
Approved by: https://github.com/zou3519
2025-06-07 11:31:53 +00:00
c1f531f0b0 [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison
2025-06-07 06:59:39 +00:00
386aa72003 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already been moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Differential Revision: [D75985423](https://our.internmc.facebook.com/intern/diff/D75985423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-07 06:54:12 +00:00
da1f8980df [nativert] move function schema to torch (#154948)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D75826905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154948
Approved by: https://github.com/zhxchen17
2025-06-07 05:45:30 +00:00
5fbaa041e7 SDPA support gfx950 (#155103)
Summary: It seems to run, just not with optimal performance; e.g., ck_tile doesn't seem to have those gfx942 optimizations: https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```

Rollback Plan:

Differential Revision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2025-06-07 03:47:29 +00:00
30387ab2e4 [ROCm] Adds initialization support for PyTorch when built from ROCm wheels. (#155285)
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.

See the draft developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md

This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:

* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:59:03 +00:00
f140fac8dc [MPS] Implement erfc (#155382)
And migrate `erf` to Metal kernel

Use `erf` approximations from https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/erf.h as previous approximation did not match the CPU implementation

After that, `erfc(x) := 1.0 - erf(x)`
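
The identity above can be checked on the CPU with a minimal sketch. It uses Python's standard `math` module rather than the Metal kernel, and the helper name `erfc_via_erf` is hypothetical, chosen only for illustration:

```python
import math


def erfc_via_erf(x: float) -> float:
    """Complementary error function computed as 1 - erf(x)."""
    return 1.0 - math.erf(x)


# Compare against the standard library's dedicated erfc for moderate inputs.
for x in (-1.0, 0.0, 0.5, 2.0):
    assert math.isclose(erfc_via_erf(x), math.erfc(x), rel_tol=1e-9, abs_tol=1e-12)
```

Note that for large positive `x` the subtraction loses relative precision, since `erf(x)` rounds to 1.0; the sketch only exercises moderate values.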

Fixes https://github.com/pytorch/pytorch/issues/155337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155382
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-06-07 02:35:12 +00:00
400f439670 [pt][easy] Rename metadata column (#155365)
Summary: Fixing typo: our logging requires autotuning_data instead of autotune_data, making it consistent

Test Plan:
Run benchmark, observe in perfetto trace proper name

Rollback Plan:

Differential Revision: D76159393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155365
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-06-07 02:25:55 +00:00
81b0b308ca [dynamo] constant fold torch.cuda.is_initialized (#155300)
Fixes https://github.com/pytorch/pytorch/issues/129659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155300
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-07 02:21:11 +00:00
10cd1de518 [ROCm] Make optional features in LoadHIP better conditioned. (#155305)
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:20:55 +00:00
5596cefba6 Fix segfault during NumPy string tensor conversion (#155364)
By checking dtype first, but add an element_size check as well

Fixes https://github.com/pytorch/pytorch/issues/155328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155364
Approved by: https://github.com/Skylion007
2025-06-07 01:55:00 +00:00
be2e43264d [CI]Update windows runner to windows-2022 (#154368)
As per the info in actions/runner-images#12045,
we need to change the Windows runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154368
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-07 01:39:19 +00:00
83d22256f8 [BE][Ez]: Improve typing in torch._logging (#155345)
Add a few missing returns in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
2025-06-07 00:04:39 +00:00
9b4db093cb Add C shim for at::pad and fix some typos (#155226)
As stated, we would like a pad shim to support custom ops wanting to build in an ABI stable manner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155226
Approved by: https://github.com/desertfire
2025-06-06 23:08:39 +00:00
cd82096973 DOC: Convert to markdown: ddp_comm_hooks.rst, debugging_environment_variables.rst, deploy.rst, deterministic.rst, distributed.algorithms.join.rst (#155298)
Fixes #155017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155298
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 22:44:50 +00:00
457dd79927 [BE][Ez]: Remove unnecessary accesses of dim vector (#155334)
It's better because you return less data, encapsulate more, and no longer need special handling of symvec vs nonsym vec dim(). Also removes a few casts and fixes a few potential edge cases relating to unsigned comparisons

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155334
Approved by: https://github.com/ezyang
2025-06-06 21:28:25 +00:00
c95705dac2 [Docs] Convert to markdown: torch.compiler_troubleshooting_old.rst, torch.compiler.rst (#155348)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155348
Approved by: https://github.com/svekars
2025-06-06 21:26:24 +00:00
d2a2bfcb58 Turn on new tiling by default (#154768)
Turning on in fbcode to come. Also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but we do in the new tiling logic, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.
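
The defaulting rule described above can be sketched as follows. This is only an illustration of the stated behavior, not Inductor's actual code: the function name `resolve_max_tiles` and the `new_tiling` flag are hypothetical.

```python
from typing import Optional


def resolve_max_tiles(max_tiles: Optional[int], new_tiling: bool) -> int:
    """Pick the effective tile limit (hypothetical helper, not Inductor API)."""
    # An explicitly configured value always wins.
    if max_tiles is not None:
        return max_tiles
    # Otherwise default to 3 in the new tiling logic and 2 elsewhere.
    return 3 if new_tiling else 2
```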

TB runners have been very unstable recently (do we need to bump batch size?), but e.g. for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run we had 15 models with a lower execution time (i.e. green) and 2 models with a higher one (i.e. red).

I am doing another run and will update here.

Dynamic shapes is not yet turned on because a lot of fixes still need to be made in splitting. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

and also - unbacked shape is not multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-06 21:19:35 +00:00
bc5a11b581 [easy][invoke_subgraph] Remove skip from already fixed test (#155286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155286
Approved by: https://github.com/zou3519
2025-06-06 21:16:22 +00:00
0d8c029584 [FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319)
For `fully_shard(model)` without explicitly setting `reshard_after_forward=True/False`, we keep the root unsharded. When the user explicitly sets `reshard_after_forward`, we respect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155319
Approved by: https://github.com/mori360
2025-06-06 20:29:31 +00:00
4f5b34427b DOC: Convert to markdown: torch.overrides.rst, type_info.rst, utils.rst, xpu.rst (#155088)
Fixes #155041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155088
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 20:16:13 +00:00
067fd0b3ab [dynamo][cleanup] Simplify disabling of the helper functions on tensor properties (#155259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155259
Approved by: https://github.com/zhxchen17
2025-06-06 19:44:40 +00:00
749757ac1b [a2av] Align length of major dimension in output of 2D a2av (#155172)
The downstream consumer of the 2D all-to-all-v is often a group GEMM.
Today the GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.

This PR adds that alignment capability, when the user passes in a `major_align` argument, so that no extra padding step is needed.

The key to supporting that is aligning the output offsets to that value. (Output offsets are returned to the users in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, is unaffected by this alignment value, i.e. it reflects the true number of tokens for an expert.)

The algorithm is as follows.

![502413288_678786854922438_530852083153996358_n](https://github.com/user-attachments/assets/557624a3-150e-4ab6-ba8b-1dbaa5ac01ac)

In the detailed implementation, we use a warp scan to calculate the prefix sum on the "block" illustrated above. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
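
As a rough CPU-side illustration of aligned output offsets, the sketch below computes an exclusive prefix sum over chunk lengths rounded up to a multiple of the alignment. It is an assumption about the general idea, not the kernel's algorithm (which is only shown in the image above); the function name `aligned_offsets` is hypothetical, and the handling of empty chunks in the real implementation may differ.

```python
from typing import List


def aligned_offsets(splits: List[int], major_align: int) -> List[int]:
    """Exclusive prefix sum over split sizes padded up to `major_align`."""
    offsets = []
    cursor = 0
    for n in splits:
        offsets.append(cursor)
        # Round the chunk length up to the next multiple of major_align.
        padded = -(-n // major_align) * major_align
        cursor += padded
    return offsets


# The true split sizes stay unchanged; only the offsets are aligned.
splits = [3, 8, 10]
print(aligned_offsets(splits, 8))
```

Each chunk then starts at a multiple of `major_align`, so a consumer with that alignment requirement can read the chunks without a separate padding pass.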

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
2025-06-06 19:39:44 +00:00
1ccc57e428 Log backward no-op to tlparse and pt2 compile events. (#154544)
Summary: Log backward no-op to tlparse and pt2 compile events.

Test Plan:
$ rm -rf /tmp/r && TORCH_TRACE=/tmp/r buck2 run //scripts/jovian:backward_noop_repro_compile

Used print statements to verify we enter the logging code region.

Differential Revision: D75231665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154544
Approved by: https://github.com/c00w
2025-06-06 18:08:19 +00:00
2e2ea7290a [Inductor] Support autotuning in the FX backend. (#155049)
# Feature
If `config.triton.autotune_at_compile_time` is set to `True`, autotune Triton kernels during FX conversion. Else, stick with the existing behavior of using the first precompiled config.

# Test plan
Added CI tests verifying that the tuner is called iff this flag is set, with and without dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155049
Approved by: https://github.com/jansel
2025-06-06 17:44:14 +00:00
453bc9fbdf [a2av] 2D all-to-all-vdev (#155058)
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
        Source: |       Rank 0      |       Rank 1      |
                | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |

        Dest  : |       Rank 0      |       Rank 1      |
                | c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` is a slice of the `input` tensor, targeting expert `i`, with length indicated by the input splits (in `in_out_splits[0]`).

That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.
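
The rank-major to expert-major reordering can be mimicked on plain Python lists. This is a minimal single-process sketch under the assumption that every source rank holds one chunk per global expert, in expert order; the function name `shuffle_2d` is hypothetical, and no real communication takes place.

```python
from typing import List


def shuffle_2d(source: List[List[str]], ne: int) -> List[List[str]]:
    """source[r][i] is the chunk on rank r targeting global expert i."""
    world_size = len(source)
    dest = []
    for rank in range(world_size):
        received = []
        # Each destination rank hosts `ne` consecutive experts.
        for local_expert in range(ne):
            expert = rank * ne + local_expert
            # For one expert, gather its chunk from every source rank.
            for src_rank in range(world_size):
                received.append(source[src_rank][expert])
        dest.append(received)
    return dest


source = [["c0", "c1", "c2", "c3"], ["d0", "d1", "d2", "d3"]]
print(shuffle_2d(source, ne=2))
```

With `world_size` = 2 and `ne` = 2 this reproduces the destination layout from the diagram above.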

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
2025-06-06 17:35:39 +00:00
64436c38c9 [logs] Add autotuning data (#154771)
Summary: Add autotuning logging data to scuba/chrome trace.

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run //scripts/sashko:compilation_sample
```

Open https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/viewer?local_cache_key=00000000-0000-0000-92db-f23383ebf5b5, search for template_autotuning, see in metadata strides (see screenshot)

Differential Revision: D75457770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154771
Approved by: https://github.com/masnesral, https://github.com/PaulZhang12
2025-06-06 17:12:55 +00:00
706bc41c4c pass mempool arg through emptyCache (#155315)
Fixing typo in a previous PR #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155315
Approved by: https://github.com/Skylion007
2025-06-06 16:14:26 +00:00
7ae7c14143 Reduce scope of s390x CI (#155208)
The purpose of this change is to reduce the scope of s390x CI so that it no longer potentially blocks the usual workflows of other users,
while still keeping nightly builds and tests for me to look at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155208
Approved by: https://github.com/malfet
2025-06-06 16:07:34 +00:00
fc77269262 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
2025-06-06 15:48:00 +00:00
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6c4e74f8cd32c29198c5f1c9f6e329d.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
b0fbbef136 Revert "Turn on new tiling by default (#154768)"
This reverts commit 7dcc77e422dcf97ce35991a138ab635a5cb88731.

Reverted https://github.com/pytorch/pytorch/pull/154768 on behalf of https://github.com/malfet due to Looks like it broke inductor CPU, see 231eb9902b/1 ([comment](https://github.com/pytorch/pytorch/pull/154768#issuecomment-2949468396))
2025-06-06 14:40:03 +00:00
231eb9902b [MPS][BE] Extend ndim_and_dtypes to 4 elements (#155272)
Metal arguments must be 8-byte aligned (or maybe 16-byte), so running
any strided (or typecast) binary op with MTL_DEBUG_LAYER leads to
an exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort      MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```

Extend it to 4 elements and pass output dtype, which will be used by
binary_op later on anyway
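
The size arithmetic behind the assertion can be reproduced with Python's `struct` module. This is a sketch that assumes the elements are 32-bit integers, which matches the 12-byte buffer length in the log above; the variable names are illustrative only.

```python
import struct

# Three 32-bit elements occupy 12 bytes, which the validation layer rejected.
three_elements = struct.calcsize("<3I")
# Four 32-bit elements occupy 16 bytes, the length the argument expects.
four_elements = struct.calcsize("<4I")

print(three_elements, four_elements)
```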

Test plan: Run the above-mentioned command with `MTL_DEBUG_LAYER=1` and make
sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
2025-06-06 14:20:21 +00:00
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
348fd45065 Support detached checkout in tools/nightly.py (#154314)
Prompt for Sonnet 3.7 in Claude Code: Only inspect tools/nightly.py, all
other files are irrelevant to your task. Do not use any shell commands.
Task: Add a --detach argument to this script which instead of making a
new branch just directly checks out the correct commit in detached mode.

With two interventions:
- Branch and detach are mutually exclusive. So you should consolidate
  them into a single argument. Why don't we take over the 'None' option?
- Do you know that nightly_version is guaranteed to be a commit hash? It
  seems it would be safer to explicitly pass --detach

I tested by running `python tools/nightly.py checkout` and observing
that my worktree was detached at this point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154314
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2025-06-06 13:28:29 +00:00
907aea032d Add claude local md files (#155299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155299
Approved by: https://github.com/ezyang
2025-06-06 13:28:26 +00:00
6b1211df29 [BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130)
Backports some behavior changes and performance improvements with runtime_checkable in 3.12 to older versions of Python. This should be a free performance improvement when type-checking protocols, since everything works on Python 3.12.

The difference between the two versions of runtime_checkable is [these lines](40e22ebb2c/src/typing_extensions.py (L800-L823)).
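
For readers unfamiliar with the decorator, here is a minimal sketch of what `runtime_checkable` enables. It uses the standard library's `typing` module rather than the backported version, and the class names are hypothetical:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsClose(Protocol):
    """Structural type: anything with a close() method."""

    def close(self) -> None: ...


class Resource:
    def close(self) -> None:
        pass


# isinstance() works against the protocol without explicit inheritance.
assert isinstance(Resource(), SupportsClose)
assert not isinstance(object(), SupportsClose)
```

These `isinstance` checks against protocols are the code path whose cost the backport aims to reduce.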

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155130
Approved by: https://github.com/rec, https://github.com/aorenste
2025-06-06 13:28:05 +00:00
10cef1e25d Remove torch XPU ABI=0 build logic for old compiler (#150095)
# Motivation
Following https://github.com/pytorch/pytorch/pull/149888, this PR removes the ABI=0 build logic for PyTorch XPU builds with old compilers (< 2025.0). For newer compilers >= 2025.0, the ABI is neutral by default without requiring additional compilation options (`-fpreview-breaking-changes`).

# Additional Context
This PR depends on XPU CI passing, which will be fixed by https://github.com/pytorch/pytorch/pull/149843 and https://github.com/intel/torch-xpu-ops/pull/1515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150095
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 13:13:19 +00:00
58e5d20c57 [BE] Delete IS_SPMM_AVAILABLE() logic (#155296)
As it's been available on all currently supported platforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155296
Approved by: https://github.com/clee2000
2025-06-06 13:12:35 +00:00
271ca679a8 [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-06 13:11:03 +00:00
9656251bb1 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit a14f427db68e54500ef4cd9ed34cb9537263bb74.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/malfet due to Looks like it breaks a bunch of tests, see 36a722e20d/1 ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2949209801))
2025-06-06 13:03:49 +00:00
36a722e20d [typo] Fix 'intialize' -> 'initialize' in proxy_tensor.py (#155301)
## Description
Fixes a typo in the comment of `torch/fx/experimental/proxy_tensor.py`, changing "intialize" to "initialize".

## Issue
None

## Type of change
- [x] Typo fix

## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155301
Approved by: https://github.com/jingsh, https://github.com/ezyang, https://github.com/cyyever
2025-06-06 10:43:44 +00:00
9d59b516e9 Make device check throw specific error (#155085)
Fixes #122757

The fix was lost after reverting and rebasing the previous PR https://github.com/pytorch/pytorch/pull/150750 (only the test changes were merged).

## Test Result

```python
>>> import torch
>>>
>>> model_output = torch.randn(10, 5).cuda()
>>> labels = torch.randint(0, 5, (10,)).cuda()
>>> weights = torch.randn(5)
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
>>> loss = loss_fn(input=model_output, target=labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3476, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155085
Approved by: https://github.com/mikaylagawarecki
2025-06-06 07:00:04 +00:00
07da8a469b [CI] fix xpu-smi hang issue on some xpu runners (#155194)
Works around the xpu-smi hang issue on some XPU runners; see https://github.com/pytorch/pytorch/actions/runs/15431583674/job/43431289026?pr=154962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155194
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 06:51:26 +00:00
e694280d12 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It extends Inductor's backend registration interface to allow backends to register custom FX passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-06 06:49:44 +00:00
c6b4f98625 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

4. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```
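Reports like the ones above come from the collect env script (`python -m torch.utils.collect_env`). As a minimal sketch of how such a report could be consumed downstream, the snippet below pulls the top-level `key: value` fields, including the new `Is XPU available` field, out of the text. The helper name `parse_env_report` and the skipping rules are assumptions made for illustration; they are not part of the script.

```python
def parse_env_report(text: str) -> dict[str, str]:
    """Collect top-level ``key: value`` lines from a collect_env-style report."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        # Skip blank lines, bullet entries ("* ..."), package rows ("[pip3] ...")
        # and separator rules ("-----").
        if not line or line.startswith(("*", "[", "-")):
            continue
        key, sep, value = line.partition(":")
        if sep:
            # Keep the first occurrence; CPU blocks can repeat keys such as "Name".
            fields.setdefault(key.strip(), value.strip())
    return fields


# Excerpt copied from the "XPU on Linux" example output.
sample = """\
PyTorch version: 2.8.0.dev20250516+xpu
Is CUDA available: False
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
"""

report = parse_env_report(sample)
print(report["Is XPU available"])  # True
```

Section headers such as `Intel GPU driver version:` map to an empty string here; the bullet lines beneath them are deliberately ignored to keep the sketch short.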

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-06 05:53:24 +00:00
d3d64c6db0 Revert "Add pinned numpy and fix build (#155129)"
This reverts commit a3098a74d494020dbb906c05ef047013e1921662.

Reverted https://github.com/pytorch/pytorch/pull/155129 on behalf of https://github.com/malfet due to Broke test_spectral_op, looks like missing xfail, see 0db3e0cf29/1 ([comment](https://github.com/pytorch/pytorch/pull/155129#issuecomment-2947951632))
2025-06-06 03:14:47 +00:00
0db3e0cf29 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit e1180c7228ba8c8b16cabf78706d4a67ca189a6b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Breaks doc test, but should be easily fixable ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2947935940))
2025-06-06 03:08:48 +00:00
28796f71d0 Redo D75092426: [internal] Expose additional metadata to compilation callbacks (#155063)
Originally https://github.com/pytorch/pytorch/pull/153596
---------------

Summary:
via reverting D75708685

gate the ROCm failure

Test Plan:
Unit tests in OSS, sandcastle

Rollback Plan:

Differential Revision: D75894349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155063
Approved by: https://github.com/masnesral
2025-06-05 23:40:31 +00:00
72453a6676 [PT2][comms] put visualize_overlap in a try-except block (#155222)
Summary:
For simple FSDP, this `visualize_overlap` function is throwing errors.

This looks like an oversight: `visualize_overlap` is called in two places, but only one of the calls is wrapped in try-except. This change wraps the other call as well.
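
The pattern can be sketched in isolation as below. This is an illustration of wrapping an optional diagnostic in try-except so that its failure cannot break the main path; `run_best_effort` and `failing_visualizer` are hypothetical names, not the actual call sites touched by this PR.

```python
import logging

log = logging.getLogger(__name__)


def run_best_effort(diagnostic, *args, **kwargs):
    """Run an optional diagnostic; log and swallow any failure."""
    try:
        return diagnostic(*args, **kwargs)
    except Exception:
        # The diagnostic is purely informational, so record the failure and move on.
        name = getattr(diagnostic, "__name__", repr(diagnostic))
        log.debug("diagnostic %s failed", name, exc_info=True)
        return None


def failing_visualizer(order):
    # Hypothetical stand-in for a visualization helper that raises.
    raise RuntimeError("unsupported node in schedule")


before = run_best_effort(failing_visualizer, ["a", "b"])
after = run_best_effort(len, ["a", "b"])
print(before, after)  # None 2
```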

Test Plan:
:)

Rollback Plan:

Reviewed By: Microve

Differential Revision: D75985733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
2025-06-05 23:39:48 +00:00
9bae2fcf99 [profiler] Enable all configured activities in CUPTI Range profiler mode (#154749)
Summary: Updates the PyTorch range profiler mode (metrics mode) to support all trace activity types.

Reviewed By: sraikund16

Differential Revision: D75568693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154749
Approved by: https://github.com/sraikund16
2025-06-05 23:38:54 +00:00
26f066bb61 Add AOTI model name config (#154129)
Summary: If a model name is specified in the AOTI config, the generated files use that model name as the file stem.
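
A rough sketch of the naming rule, assuming a fallback stem is used when no model name is configured. The function `output_paths`, the fallback value, and the `cpp`/`so` extensions are illustrative assumptions, not the actual AOTI code or config option.

```python
from pathlib import Path
from typing import Optional


def output_paths(out_dir: str, default_stem: str, model_name: Optional[str] = None) -> dict:
    """Pick the file stem: the configured model name if set, else the default."""
    stem = model_name or default_stem
    base = Path(out_dir)
    # Hypothetical set of generated artifacts, keyed by extension.
    return {ext: base / f"{stem}.{ext}" for ext in ("cpp", "so")}


paths = output_paths("/tmp/aoti", "c1a2b3", model_name="my_model")
print(paths["so"].name)  # my_model.so
```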

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```

Differential Revision: D75102034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
2025-06-05 23:38:11 +00:00
fa705f7912 [BE] minor refactor + some comments on behavior (#154695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154695
Approved by: https://github.com/masnesral, https://github.com/eellison
2025-06-05 23:00:46 +00:00
9e88d6c857 [ROCm] manywheel missing hipsparselt deps (#155254)
Bundle libhipsparselt.so and auxiliary files into wheel.

Dependency added by hipsparselt integration #150578.

Fixes #155207.
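
One way to check that a library was bundled is to list the wheel's contents, since a wheel is a zip archive. The sketch below uses only the standard library; the helper `wheel_contains` and the member path `torch/lib/libhipsparselt.so` are assumptions for illustration, and the in-memory archive stands in for a real wheel file.

```python
import io
import zipfile


def wheel_contains(wheel_file, filename: str) -> bool:
    """Return True if any archive member's basename equals ``filename``."""
    with zipfile.ZipFile(wheel_file) as wheel:
        return any(name.rsplit("/", 1)[-1] == filename for name in wheel.namelist())


# Build a tiny in-memory archive standing in for a real wheel.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_wheel:
    fake_wheel.writestr("torch/lib/libhipsparselt.so", b"")

buffer.seek(0)
print(wheel_contains(buffer, "libhipsparselt.so"))  # True
```

With a real build, `wheel_file` would be the path to the produced `.whl`.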

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155254
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-05 22:45:36 +00:00
e1180c7228 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28

Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

4. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-05 22:35:04 +00:00
0a092c7de6 Enable CPP Extension Open Registration tests on Arm (#144774)
Enables most tests under CPP Extension Open Registration as they pass on Arm now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144774
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet
2025-06-05 22:32:28 +00:00
0827464002 Replace runtime type parameterization (#155221)
See:

```
>>> import timeit; print(f"OrderedSet[str](): {timeit.timeit('OrderedSet[str]()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s, OrderedSet(): {timeit.timeit('OrderedSet()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s")
```
> `OrderedSet[str]()`: 0.354622s, OrderedSet(): 0.095376s

Type parameterization belongs in type hints, not in runtime expressions.
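
A minimal sketch of the distinction, using a stand-in generic class rather than `OrderedSet` itself (`Box`, `make_parameterized`, and `make_annotated` are hypothetical names chosen for illustration). Subscripting a generic class at runtime builds an alias object and instantiates through it, while an annotation gives a type checker the same information without that per-call work:

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")


class Box(Generic[T]):
    """Hypothetical stand-in for a generic container such as OrderedSet."""

    def __init__(self) -> None:
        self.items: List[T] = []


def make_parameterized() -> "Box[str]":
    # Runtime parameterization: Box[str] creates a generic alias on each call,
    # and the instance is then constructed through that alias.
    return Box[str]()


def make_annotated() -> "Box[str]":
    # Parameterization only in the annotation: the call is a plain class
    # instantiation, and type checkers still treat the value as Box[str].
    box: Box[str] = Box()
    return box
```

Both helpers return an ordinary `Box` instance; only the first pays for the runtime subscript.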

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155221
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-06-05 21:43:54 +00:00
7dcc77e422 Turn on new tiling by default (#154768)
Turning on in fbcode to come. Also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but we do in the new tiling logic, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.
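
The defaulting rule described above can be sketched as a small resolver. This is an illustration under assumptions, not the Inductor code: `resolve_max_tiles` and `use_new_tiling` are hypothetical names.

```python
from typing import Optional


def resolve_max_tiles(max_tiles: Optional[int], use_new_tiling: bool) -> int:
    """Hypothetical helper: choose the effective tile limit.

    An explicit setting always wins; otherwise the new tiling logic
    defaults to 3 and every other path defaults to 2.
    """
    if max_tiles is not None:
        return max_tiles
    return 3 if use_new_tiling else 2
```

Using `None` as the default lets the resolver tell "not set" apart from a user who explicitly asked for 2 or 3.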

TB runners have been very unstable recently (do we need to bump the batch size?), but for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run, 15 models had a lower execution time (i.e., green) and 2 models had a higher one (i.e., red).

I am doing another run and will update here.

Dynamic shapes are not yet turned on because splitting still needs a number of fixes before it works. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

Also, an unbacked shape is not recognized as a multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-05 21:34:09 +00:00
a85ad55525 [ROCm][Windows] Fix offload gpu arch list in tests (#155212)
Added fix to get ROCM_PROPERTY_ARCH_LIST value in set_target_properties in c10/cuda and caffe2 tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155212
Approved by: https://github.com/malfet
2025-06-05 20:30:28 +00:00
9a42f01586 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154829
2025-06-05 20:17:01 +00:00
5911f870c0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-05 20:17:01 +00:00
606d73bde4 Adding from_node for nodes in gm.module() (#155053)
Summary:
Adding "from_node" information that indicates which nodes are unlifted in `.module()` call.
The lifted nodes will have "ExportedProgram.module().unlift()" passname in the last entry of from_node.

Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```

Rollback Plan:

Reviewed By: angelayi

Differential Revision: D75837494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155053
Approved by: https://github.com/angelayi
2025-06-05 20:11:56 +00:00
c8c892b4a5 [scan] disable functionalization key in backward tracing (#154343)
Previously, we didn't disable the functionalization key when materializing the backward graph. As a result, the torch.zeros_like call for the case where grad is None returned a functional tensor that was not tracked by the proxy tensor mode.

This PR fixes it by putting the tracing code under a context manager that disables functionalization.

Fixes https://github.com/pytorch/pytorch/issues/153437.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
2025-06-05 20:06:33 +00:00
5e93abe3c0 Address docs for clip_grad functions (#155125)
This PR takes the opinionated stance that `torch.nn.utils.<func>` should be the preferred API over `torch.nn.utils.clip_grad.<func>`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155125
Approved by: https://github.com/albanD, https://github.com/mikaylagawarecki, https://github.com/janeyx99
2025-06-05 19:22:09 +00:00
dd41a3907c [MPS] Fix unary/binary ops for 2**32+ elem tensors (#155183)
By using the `TensorIterator::with_32bit_indexing()` primitive.

Add a `bind_tensors` helper function that correctly sets up MPS tensors originating from TensorIterator.

TODO: Add comments to bind_tensors as well as a unit test, based on:
```
python  -c "import torch;print((torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps')).sin())"
```

Fixes https://github.com/pytorch/pytorch/issues/154828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155183
Approved by: https://github.com/cyyever, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #155150, #155178, #155184
2025-06-05 18:57:14 +00:00
05dd638ee9 Revert "Add dont constant fold flag (#154945)"
This reverts commit 196c95d463367f15999c0cddc9eb89031e9988ab.

Reverted https://github.com/pytorch/pytorch/pull/154945 on behalf of https://github.com/malfet due to This broke halide test sanity, see a3098a74d4/1 ([comment](https://github.com/pytorch/pytorch/pull/154945#issuecomment-2945598901))
2025-06-05 18:25:59 +00:00
a3098a74d4 Add pinned numpy and fix build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

Also pinning numpy version even though it is technically optional to ensure users have the right version as most users have numpy in their environment anyways.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-05 17:44:18 +00:00
2481c4b2ea [cutlass backend] add teraflops and increase rep for benchmark script (#154944)
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/)

I think I will continue to use do_bench for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos
2025-06-05 17:20:29 +00:00
be2ab96347 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8

Regarding the revert & reland:
* This PR causes the python 3.13 version to be bumped from 3.13.2 to 3.13.3. test_deopt_from_append_list starts unexpectedly passing on 3.13.3, so I originally modified the test in https://github.com/pytorch/pytorch/pull/155167 to xfail only for <=3.13.2
* However there was a land race with https://github.com/pytorch/pytorch/pull/150796, which introduced another test that passes only for >=3.13.3.

Resolution:
* @guilhermeleobas reverted https://github.com/pytorch/pytorch/pull/150796, so I will reland this (and I've merged the test_deopt_from_append_list change into this PR). Based on Guilherme's feedback, I'm just skipping the test instead of selectively failing/passing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2025-06-05 17:17:27 +00:00
cadcb5d368 [inductor] disable compiler on the compiled_module_main (#155169)
Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155169
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-06-05 16:37:45 +00:00
13ea0f2c0a [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For a program like this:

```
class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
2025-06-05 16:37:22 +00:00
a14f427db6 [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-05 16:07:25 +00:00
cd361fc247 [CI] Migrate focal (ubuntu 20.04) images to jammy (ubuntu 22.04) (#154437)
Fixes https://github.com/pytorch/pytorch/issues/154157

Inductor workflows were moved from focal to jammy here: https://github.com/pytorch/pytorch/pull/154153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154437
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/davidberard98, https://github.com/huydhn
2025-06-05 15:24:07 +00:00
e895e9689c Update docs build to specify <3.13 in CONTRIBUTING (#155140)
Python 3.13 removed the deprecated imghdr module, so our docs build does not work with 3.13+. Mention it in our contributing guide so people know before committing to the wrong version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155140
Approved by: https://github.com/drisspg, https://github.com/cyyever
ghstack dependencies: #155126
2025-06-05 15:16:48 +00:00
2f3f8339ec [BE] Document device memory apis in correct module (#155126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155126
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-06-05 15:16:48 +00:00
7999735d23 [CUDA][MPS] Fix torch.arange bound validation for large float inputs (#154320)
Fixes #153133

Fixes an inconsistency in torch.arange on CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in validation logic.

The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.

This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.
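
A minimal pure-Python sketch of why the validation precision matters. It is an illustration, not the ATen check: `to_float32` and `is_valid_range` are hypothetical helpers, and the float32 rounding is emulated with `struct`.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def is_valid_range(start: float, end: float, step: float) -> bool:
    """Check that the sign of `step` is consistent with the range direction."""
    if step == 0:
        return False
    if step > 0:
        return start <= end
    return start >= end


# start exceeds end by exactly 1, so a positive step is invalid.
start, end, step = 16777217.0, 16777216.0, 1.0

# float32 cannot represent 2**24 + 1, so both bounds round to 2**24 and the
# invalid range looks like an empty but valid one.
lossy_ok = is_valid_range(to_float32(start), to_float32(end), step)

# Validating in double precision still sees start > end.
exact_ok = is_valid_range(start, end, step)
```

Here `lossy_ok` is `True` and `exact_ok` is `False`, which mirrors the silent-empty-tensor behavior versus the error raised after the fix.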

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
2025-06-05 14:51:25 +00:00
ed661a5f11 [MPS] Fix complex scalar binding to Metal tensors (#155184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155184
Approved by: https://github.com/dcci
ghstack dependencies: #155150, #155178
2025-06-05 14:34:57 +00:00
9bf6593e96 Fix docstring for torch.UntypedStorage.from_file (#155067)
Fixes #130629

Happy to revert the second commit if we think it's making the test too fragile for the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155067
Approved by: https://github.com/malfet
2025-06-05 14:30:49 +00:00
a1057cda31 Revert "Add CPython generator/contextlib tests (#150796)"
This reverts commit d5f642211f14593c8c78af98a1fb7cfb63039ce5.

Reverted https://github.com/pytorch/pytorch/pull/150796 on behalf of https://github.com/guilhermeleobas due to This is breaking tests on trunk. https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=3.13&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/150796#issuecomment-2944469866))
2025-06-05 13:51:54 +00:00
196c95d463 Add dont constant fold flag (#154945)
To support https://github.com/pytorch/ao/issues/2228:
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a dont-constant-fold flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-05 13:42:44 +00:00
e01fde8213 Revert "[reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)"
This reverts commit bee9c70c5d4b681ec1f2adf92eca1205b372634a.

Reverted https://github.com/pytorch/pytorch/pull/154974 on behalf of https://github.com/malfet due to Broke inductor tests, see 3c72b9fd8f/1 ([comment](https://github.com/pytorch/pytorch/pull/154974#issuecomment-2944370617))
2025-06-05 13:36:21 +00:00
3c72b9fd8f Revert "SDPA support gfx950 (#155103)"
This reverts commit b9312c56bf5f277e341c0185da748e3475d0807f.

Reverted https://github.com/pytorch/pytorch/pull/155103 on behalf of https://github.com/malfet due to looks like it broke mi300 tests, see 9a4c08ddfc/1 ([comment](https://github.com/pytorch/pytorch/pull/155103#issuecomment-2944331460))
2025-06-05 13:33:17 +00:00
523b637cbe Revert "[test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)"
This reverts commit 1c828786c28b8cd2a6be2397cc2af65e3266c5fa.

Reverted https://github.com/pytorch/pytorch/pull/155167 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
f60b2712dd Revert "Inductor unit tests: cuda 12.6 -> 12.8 (#155056)"
This reverts commit bb43ced6e2c9e1cdc17923826aaf58466c2ffd4b.

Reverted https://github.com/pytorch/pytorch/pull/155056 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
9a4c08ddfc [MPS] Parametrize test_scaled_dot_product_attention_autocast (#155005)
Also moving comments inside the function scope for some of my previous regression tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155005
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-05 13:24:53 +00:00
fa3c38c7ae Add tensor overlap check for cross (#154999)
Fixes #132031

## Test Result

```python
In [1]: import torch
   ...: torch.manual_seed(0)
   ...: torch.cuda.manual_seed(0)
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(3, 4)
   ...: torch.cross(a, b, out=a)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
      4 a = torch.randn(3, 4)
      5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)

RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
2025-06-05 10:00:01 +00:00
5b65628906 Workflow to tag trunk commits with trunk/{commit-sha} tags (#155170)
This PR adds a workflow to automate tagging commits on the `main` branch. The workflow includes validation and retry with exponential backoff.

The rationale is to work around the GitHub limitation on workflow_dispatch, which requires a branch or tag. We want to use workflow_dispatch to rerun CI workflows with parameters (trunk, pull, etc).
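
For readers unfamiliar with the pattern, here is a minimal Python sketch of retry with exponential backoff. It is not the workflow's actual implementation: `retry_with_backoff` and its parameters are hypothetical, and the injectable `sleep` exists only so the behavior can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")
```

A tag push could be wrapped as `retry_with_backoff(push_tag)`, where `push_tag` is whatever callable performs the push and raises on failure.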

---

### Testing

Tested using an almost identical workflow in a personal repo (the differences are in the repository_owner check and the backoff settings).

* successful tag push:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454729765/job/43504630765

* validation: PR commit (fails)
   https://github.com/izaitsevfb/deleteme/actions/runs/15454743572/job/43504669720

* tagging of the old commit on main:
   https://github.com/izaitsevfb/deleteme/actions/runs/15453805748/job/43501885903

* tag already exists:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454756077/job/43504706980

* invalid sha on workflow dispatch:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454611077/job/43504286858

* retry with exponential backoff on failure (via tag rule blocklist):
   https://github.com/izaitsevfb/deleteme/actions/runs/15454768346/job/43504743486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155170
Approved by: https://github.com/huydhn
2025-06-05 09:50:58 +00:00
bee9c70c5d [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-05 07:25:04 +00:00
be16f21ca6 [Graph Partition] add symints to get_graph_inputs (#154679)
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)

Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)

In rare cases, this does not work. One example is saved tensors for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression rather than a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and `_verify_input_symbol_assignment` would later raise an error.

The fix is to add symints to `get_graph_inputs`. An alternative would be to update `codegen_input_symbol_assignment`, but I want to limit the change to graph partition only.
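
A toy sketch of the gap described above, assuming shapes are written as plain strings; Inductor itself works with sympy expressions, and the helper names here are hypothetical. Symbols that appear only inside an expression such as `2*s0` are missed when inputs are derived from dimensions that are bare symbols:

```python
import ast
from typing import List, Set


def free_symbols(expr: str) -> Set[str]:
    """Return every name referenced by a shape expression."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def implicit_symbol_inputs(shape: List[str]) -> Set[str]:
    """Symbols recovered from dimensions that are themselves a bare symbol."""
    return {dim for dim in shape if dim.isidentifier()}


def required_symbol_inputs(shape: List[str]) -> Set[str]:
    """Symbols needed to evaluate every dimension of the shape."""
    required: Set[str] = set()
    for dim in shape:
        required |= free_symbols(dim)
    return required


saved_tensor_shape = ["2*s0", "2"]
# Symbols that the implicit path never defines for this shape.
missing = required_symbol_inputs(saved_tensor_shape) - implicit_symbol_inputs(saved_tensor_shape)
```

For the shape `[s0, 2]` nothing is missing, while for `[2*s0, 2]` the set `missing` contains `s0`, which is why the symint has to be passed explicitly.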

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
2025-06-05 06:46:28 +00:00
d3c8f36ba0 Revert "[Intel GPU] Make SDPA output has the same stride as Query. (#154340)"
This reverts commit 0f10df71a66cb1b0c3659381b7db8e06d95f0d67.

Reverted https://github.com/pytorch/pytorch/pull/154340 on behalf of https://github.com/etaf due to This PR breaks hugging face E2E run on XPU. ([comment](https://github.com/pytorch/pytorch/pull/154340#issuecomment-2942954192))
2025-06-05 06:46:24 +00:00
bb43ced6e2 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
ghstack dependencies: #155167
2025-06-05 05:59:06 +00:00
1c828786c2 [test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)
For reasons that are not yet clear, this test starts passing on Python 3.13.3 (while it fails on Python <=3.13.2), which causes unexpected passes on xfail-ed tests when newer versions of Python are used, e.g. in #155056.

Verified locally in a python 3.13.1 vs. python 3.13.3 conda env.
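
A minimal sketch of gating a test on the patch version with the standard `unittest` decorators. The test class and body are placeholders, not the actual dynamo test:

```python
import sys
import unittest


class VersionSensitiveTests(unittest.TestCase):
    # Placeholder test: skipped on interpreters at or above 3.13.3, where the
    # behavior under test is assumed to differ.
    @unittest.skipIf(
        sys.version_info >= (3, 13, 3),
        "behavior changed in Python 3.13.3",
    )
    def test_version_sensitive(self) -> None:
        self.assertTrue(True)
```

Comparing `sys.version_info` against a three-element tuple is what lets the condition distinguish 3.13.2 from 3.13.3.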

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155167
Approved by: https://github.com/williamwen42
2025-06-05 05:59:06 +00:00
93012d2290 Revert "[forward fix] add support for MemoryFormat after type tightening (#154658)"
This reverts commit 0fdd568b785812da86e69d65632de77d2ee945c7.

Reverted https://github.com/pytorch/pytorch/pull/154658 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154658#issuecomment-2942752048))
2025-06-05 05:01:40 +00:00
5130ac64f4 Revert "Add randint_like tensor overload for high (#154899)"
This reverts commit 72fe1d5f42aa9bffa876932a3b4fcae052b99168.

Reverted https://github.com/pytorch/pytorch/pull/154899 on behalf of https://github.com/seemethere due to Failing internal tests see https://fburl.com/diff/bai044ob ([comment](https://github.com/pytorch/pytorch/pull/154899#issuecomment-2942740661))
2025-06-05 04:54:05 +00:00
80703ca332 [FlexAttention] Allow dispatch to SAC for flex (#150080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150080
Approved by: https://github.com/zou3519
2025-06-05 04:34:27 +00:00
fa63de0866 Handle empty linemaps in PyCodeCache (#155064)
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll end up triggering a "too many values to unpack" error: https://github.com/pytorch/pytorch/issues/154536

Specifically, if you populate PyCodeCache's linemap via caching and then request the stack frames of an Inductor-generated output file that has an empty linemap, this function will try to unpack too many arguments.

Test plan:
```
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"

import torch

@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad

x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```

Run this twice and see that everything works as expected.

It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:

- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time
- The inductor generated output needs to call *back* into dynamo via `assert_size_stride`
- In our test case, the CompiledBackward needs to not have linemaps, and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order the PyCodeCache.linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.

All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
2025-06-05 03:54:35 +00:00
450180fbcd [c10d][fr] Add the log of thread name and thread id into fr (#155142)
Internal head users have asked for the thread id and thread name to be recorded in fr. This is useful when collectives are launched from threads other than the main thread.

Differential Revision: [D75973919](https://our.internmc.facebook.com/intern/diff/D75973919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155142
Approved by: https://github.com/kwen2501
2025-06-05 03:33:01 +00:00
b9312c56bf SDPA support gfx950 (#155103)
Summary: Seems to run, just not with optimal performance; e.g., ck_tile doesn't seem to have the gfx942 optimizations: https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```
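
The TFlops and speedup columns in the tables above can be cross-checked against the timing columns. The sketch below assumes the commonly used attention FLOP count of `4 * batch * heads * seq_len^2 * head_dim` for the forward pass, and 2.5x that for the backward pass. These formulas are assumptions that happen to reproduce the first row of each table; the benchmark output does not state them, and the function names are illustrative only.

```python
def attention_forward_flops(batch, seq_len, heads, head_dim):
    # Two matmuls (Q @ K^T and P @ V), each 2 * seq_len^2 * head_dim FLOPs per head.
    return 4 * batch * heads * seq_len * seq_len * head_dim


def tflops(flops, time_us):
    # Convert a FLOP count and a time in microseconds into TFLOP/s.
    return flops / (time_us * 1e-6) / 1e12


# First row of the first table: batch=1, seq_len=4096, heads=16, head_dim=64.
fwd_flops = attention_forward_flops(1, 4096, 16, 64)
flash_tflops = tflops(fwd_flops, 179.737)   # table reports 382.334
flash_speedup = 3106.6 / 179.737            # Math time / Flash time, table reports 17.2841

# First row of the second table, assuming a backward pass costs 2.5x the forward FLOPs.
bwd_tflops = tflops(2.5 * fwd_flops, 692.323)  # table reports 248.148

print(f"{flash_tflops:.2f} {flash_speedup:.3f} {bwd_tflops:.2f}")
```

Note that the "Speedup (X/Math)" columns are the Math time divided by the X time, so larger values mean X is faster.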

Rollback Plan:

Differential Revision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/yoyoyocmu
2025-06-05 03:26:38 +00:00
a01bb9da14 [CI][CUDA] Re-enable the test-nan-assert on CUDA12 (#154448)
We need to re-enable this test because recent changes could be relevant to test_nan_assert.

I've already verified that the test hangs if we don't remove the "pg._allgather_base(output, nan_tensor)" call between the "backend._set_enable_nan_check" calls.
Why was it "working" previously? Because only cu118 distributed was running, so this "backend._set_enable_nan_check" change was not tested in the merge process (the skip logic is: skip if not CUDA 12 or above).

Workaround #153479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
2025-06-05 02:09:31 +00:00
5e03433443 Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit e5afbe31245287a92fe328c404b3557e5c5eca73.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Broke rocm, see 642687af29/1 ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2942415600))
2025-06-05 01:38:13 +00:00
642687af29 [MPS][BE] Some refactor in preparation for 64-bit iterators (#155178)
set input/output tensors only once

Get rid of `is_storage_dense` predicate, as `iter.is_contiguous` serves the same purpose
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155178
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #155150
2025-06-05 01:24:31 +00:00
3398d1d459 support bmm and mm_plus_mm in generated templates cache (#154904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154904
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #154891, #154892
2025-06-05 00:36:01 +00:00
21f45f7afb Add CPython int/float tests (#150795)
Tests:
* test_int.py
* test_int_literal.py
* test_float.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_int" "test_int_literal" "test_float"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150795
Approved by: https://github.com/williamwen42
2025-06-05 00:28:53 +00:00
d5f642211f Add CPython generator/contextlib tests (#150796)
Tests:
* test_generators.py
* test_generator_stop.py
* test_contextlib.py

Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.

The tests were downloaded from CPython 3.13 and the diffs were generated
with `git diff`. To reproduce, download the tests and apply the diffs:

```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-05 00:18:29 +00:00
fb5a787a8f [HOP] Added clone for outputs of create_bw_fn that are aliasing the inputs (#153932)
This PR fixes an issue with the new way of creating the bw graph introduced for cond. In particular, there is an issue if the bw function simply aliases the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153932
Approved by: https://github.com/ydwu4
2025-06-04 23:52:52 +00:00
b0a2ca65ef support more prologue functions in generated templates cache (#154892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154892
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #154891
2025-06-04 23:45:36 +00:00
51b4c51973 add missing check for caching triton template caching (#154891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154891
Approved by: https://github.com/eellison
2025-06-04 23:45:36 +00:00
1083bc749d [Memory Snapshot] Add Flag to Toggle Global and Local Callbacks for Annotations (#154932)
Summary:
In some cases we want only local annotations for the memory snapshot, for example when executing inside the CUDA stream callback, which cannot execute CUDA operators. Otherwise CUDA errors occur: Exception in RecordFunction callback: CUDA error: operation not permitted

However, we need an option to turn annotations on globally so that on-demand snapshots can get them. Additionally, there may be cases in which auto-trace also wants annotations from record functions, so we expose the flag to auto-trace as well.

Test Plan:
Run MVAI executable and see that the errors go away

Rollback Plan:

Differential Revision: D75831687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932
Approved by: https://github.com/mzzchy, https://github.com/sanrise
2025-06-04 23:15:19 +00:00
7cf5b36ec2 Release GIL in PG destructor (#154976)
Summary: Gloo PG doesn't release the GIL in its destructor, so Python code hangs until the destructor completes. The destructor waits for all work on the PG to complete, which can take a long time.

Test Plan: Ran

```
$ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py
```

with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976
Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj
2025-06-04 23:10:55 +00:00
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
992be94dab [MPS][BE] Better error messages (#155150)
"Can't be indexed using 32-bit iterator" is not really helpful error
This PR distinguishes between errors from the old indexing helper function and errors from binaryTensorIterator.
It also adds the same warning to unary ops, which otherwise just run and return an incorrect value.

Test plan (manual, as I don't have a machine with enough RAM to run it reliably in CI):
```
%  python  -c "import torch;print(torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps'))"
RuntimeError: add can't be indexed using 32-bit iterator for shape [1048576, 5000]
```
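
The shape in the error message explains why a 32-bit iterator is not enough. A minimal sketch of the arithmetic, assuming the limit in question is the range of an unsigned 32-bit index (the exact threshold MPS checks is not stated here, and `broadcast_shape` is an illustrative helper, not a PyTorch function):

```python
from itertools import zip_longest
from math import prod


def broadcast_shape(a, b):
    # Standard right-aligned broadcasting: dims must match or one of them must be 1.
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


result_shape = broadcast_shape((1, 1024, 1024), (5000, 1, 1))
numel = prod(result_shape)

# The error reports the same element count as a collapsed 2-D shape.
assert numel == 1048576 * 5000

max_u32 = 2**32 - 1
exceeds_32_bit = numel > max_u32
print(result_shape, numel, exceeds_32_bit)
```

With bfloat16 at 2 bytes per element, the result alone needs roughly 10.5 GB, which is consistent with the note that the test is impractical for CI machines.
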
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155150
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-04 22:53:51 +00:00
f5e2e4c4f1 [Inductor] Include math and torch in launcher scope (#154673)
Summary:
Grid computation may involve sympy expressions, in which case the generated grid code can reference math and torch.
We include the math and torch modules in the launcher scope to make sure those grids are computed correctly.
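
The failure mode can be illustrated with plain Python. This is a sketch only: `launcher_src` and `build_launcher` are hypothetical stand-ins for generated launcher code and are not Inductor APIs. It shows that source compiled with `exec` can only resolve `math` if the module is present in the scope it is compiled into.

```python
import math

# Hypothetical generated launcher whose grid expression references `math`.
launcher_src = """
def launcher(xnumel, XBLOCK):
    grid = (math.ceil(xnumel / XBLOCK), 1, 1)
    return grid
"""


def build_launcher(scope):
    # Compile the launcher inside a copy of the given scope and return the function.
    namespace = dict(scope)
    exec(launcher_src, namespace)
    return namespace["launcher"]


# Without `math` in scope, the grid expression fails when the launcher runs.
broken = build_launcher({})
try:
    broken(1000, 128)
    error = None
except NameError as exc:
    error = exc

# With `math` in scope, the grid is computed as expected.
working = build_launcher({"math": math})
grid = working(1000, 128)
print(error, grid)
```

The same reasoning would apply to `torch` when a grid expression refers to it.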

Test Plan: Check phabricator for internal cmd.

Differential Revision: D75642931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154673
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-06-04 22:32:19 +00:00
671553bd23 Update documentation wording for transformer-related layers (#155123)
<img width="947" alt="Screenshot 2025-06-04 at 1 33 53 PM" src="https://github.com/user-attachments/assets/4dbb66b3-43f4-4d04-afb5-dc80cec0f2cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155123
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-06-04 22:20:32 +00:00
6f23ca53bb [dynamo] sample gb_registry json file for website testing purposes (#155160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155160
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-06-04 22:14:48 +00:00
c8566a0b98 [export] Use patching in test (#155132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155132
Approved by: https://github.com/pianpwk
2025-06-04 21:41:26 +00:00
65a5eb8d27 Fix for ambiguity in linalg.norm()'s ord argument of +2 & -2 (#155148)
Fixes #136453

### Description
---
Resolved the ambiguity by linking to Wikipedia's SVD/Singular Values section, as per past discussion (by other contributors) on the above thread.

In the ord argument, for values `+2` and `-2`, the `singular value` now points to [this section of singular values on the wiki SVD page](https://en.wikipedia.org/wiki/Singular_value_decomposition#Singular_values,_singular_vectors,_and_their_relation_to_the_SVD).
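
As a concrete illustration of the quantities involved: for a matrix input, `ord=2` is documented as the largest singular value and `ord=-2` as the smallest. The sketch below computes both for a 2x2 matrix in plain Python, using the closed form for the eigenvalues of `A^T A`. The helper name is illustrative and this is not how `torch.linalg.norm` is implemented.

```python
import math


def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]], largest first."""
    t = a * a + b * b + c * c + d * d      # trace of A^T A
    det = a * d - b * c                    # det(A); det(A^T A) = det^2
    disc = math.sqrt(max(t * t - 4 * det * det, 0.0))
    largest = math.sqrt((t + disc) / 2)
    smallest = math.sqrt(max((t - disc) / 2, 0.0))
    return largest, smallest


# A = [[3, 0], [4, 5]] has A^T A = [[25, 20], [20, 25]] with eigenvalues 45 and 5.
largest, smallest = singular_values_2x2(3.0, 0.0, 4.0, 5.0)
print(largest, smallest)  # sqrt(45) and sqrt(5)
```

Here `largest` is the value that `ord=2` refers to and `smallest` is the value that `ord=-2` refers to.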

### Why not mention SVD
---
For conciseness: expanding 'largest singular value' to 'largest singular value of a SVD' is too much, I think, relative to the rest of the table.

---

I hope this is satisfactory. Please let me know if I have missed anything essential; cheers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155148
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2025-06-04 21:15:20 +00:00
b084e1b81c [HOP] Rework Autograd DispatchKey for scan and map (#153336)
This PR introduces the `py_autograd_impl` instead of the `DispatchKey.Autograd` for some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153336
Approved by: https://github.com/ydwu4
2025-06-04 20:54:02 +00:00
0404785f3b [dynamo] [3/3] added cmd_update_gb_type which supports updating an existing gb_type properties and optional arg to change gb_type name (#154985)
The user can now use the terminal to update the registry whenever they update an existing gb_type's properties. Additionally, if the user changes the gb_type description itself, they can update the registry as well.

Terminal command template for updating existing gb_type: python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite]

Terminal command template for updating existing gb_type name (can also be used if the user changed the other properties as well including the gb_type name): python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite] --new_gb_type "new_name_for_existing_gb_type"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154985
Approved by: https://github.com/williamwen42
2025-06-04 20:10:02 +00:00
e5afbe3124 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-04 20:03:46 +00:00
4d576442e9 Fix incorrect get_default_qat_qconfig in prepare_qat_fx docs. (#155100)
Fixes #144522

## Description

FX QAT docs for prepare_qat_fx incorrectly used get_default_qat_qconfig when it should use get_default_qat_qconfig_mapping for a qconfig_mapping.

Previous example code incorrectly used `get_default_qat_qconfig`, resulting in a qconfig being incorrectly
passed to `prepare_qat_fx`. `prepare_qat_fx` requires a `qconfig_mapping`, not a single `qconfig`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155100
Approved by: https://github.com/jerryzh168
2025-06-04 18:51:40 +00:00
6c8241c089 [dynamo] [2/3] added add_new_gb_type functionality (#154886)
The user can now use the terminal to update the registry whenever they create a new unimplemented_v2() callsite.
Terminal command template: python [path to gb_id_mapping.py] add "new_gb_type" [path to file where user added callsite]
Before the user added a new gb_type:
<img width="619" alt="Screenshot 2025-06-02 at 1 33 54 PM" src="https://github.com/user-attachments/assets/7258cab1-a184-4200-9d56-7b21d243d6d8" />
After the user added a new gb_type:
<img width="366" alt="Screenshot 2025-06-02 at 1 34 47 PM" src="https://github.com/user-attachments/assets/5c383e94-268c-4f6d-9111-7b18c856222e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154886
Approved by: https://github.com/williamwen42
ghstack dependencies: #154738
2025-06-04 18:44:37 +00:00
681a8189d7 [dynamo] [1/3] updated gbid mapping for initial registry creation (#154738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154738
Approved by: https://github.com/williamwen42
2025-06-04 18:44:37 +00:00
197080337b [AOTI] Extend torchgen to generate C shim with version number (#147745)
Summary: While it is OK to add a new arg with a default value to a fallback op in Python, doing so is BC-breaking for the C shim. This PR adds an automatic approach to update C shim files when specifying a version number with a list of new args for the modified op. See https://github.com/pytorch/pytorch/pull/154848 for an example of how to do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147745
Approved by: https://github.com/yushangdi
2025-06-04 18:40:34 +00:00
1d67849e43 [AOTInductor] Activate CPU test for package and update weights (#155078)
Summary:
looks like CPU is enabled for update_constant_buffer in D71177509

enable these tests as well.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_without_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_user_managed_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_update_weights" -v
```

Rollback Plan:

Differential Revision: D75908993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155078
Approved by: https://github.com/angelayi
2025-06-04 17:57:20 +00:00
956716880f [c10d][gloo] Enable using c10::Half for gloo (#153862)
Testing with https://github.com/pytorch/gloo/pull/446 shows that the numerical issues reported in https://github.com/pytorch/pytorch/issues/152300 are indeed resolved, and this PR adds a unit test for it. It also updates the gloo submodule to reflect the change on the gloo side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153862
Approved by: https://github.com/d4l3k, https://github.com/clee2000, https://github.com/malfet
2025-06-04 17:53:08 +00:00
9eb7e67727 [PT2][memory] correct wait tensor output size (#153569)
This PR correctly handles the output buffer size of wait tensor nodes.
![image](https://github.com/user-attachments/assets/fdcc5eb7-58cf-42a2-84b2-ce949cb9db92)

See [this doc](https://docs.google.com/document/d/1lkKulwIb-fYL_p8jn1SD6Lh1PoAKBgpBsU5sAH80leI/edit?tab=t.0#bookmark=id.w3n4k1y4rdz8) with testing details [internal only]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153569
Approved by: https://github.com/eellison
2025-06-04 17:49:25 +00:00
34c6371d24 Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS (#154568)
NVSHMEM 3.2.5 (released Mar 2025) has both cu11 and cu12 builds.
See:
https://pypi.nvidia.com/nvidia-nvshmem-cu12/
https://pypi.nvidia.com/nvidia-nvshmem-cu11/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154568
Approved by: https://github.com/atalman
ghstack dependencies: #154538
2025-06-04 17:43:24 +00:00
b3e666ae17 [easy] Bump STATIC_CUDA_LAUNCHER_VERSION=1 (#154861)
This turns on STATIC_CUDA_LAUNCHER internally for some low-risk entitlements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-06-04 17:38:06 +00:00
e9c31fb86d [torch.compile] handle a custom __delattr__ method correctly (#150899)
Fixes #150765
- handle a custom __delattr__ method correctly

Test:
```
import torch

class MyObject:
    def __init__(self, val):
        self.val = val
        # Flag to track deletion attempts instead of using print
        self.deletion_attempted = False

    def __delattr__(self, attr):
        if attr == "val":
            # Set flag instead of printing
            self.deletion_attempted = True
        else:
            super().__delattr__(attr)

@torch.compile(fullgraph=True, backend="eager")
def test(input_tensor):
    instance_a = MyObject(1)
    instance_b = MyObject(2)

    del instance_a.val
    del instance_b.val
    exists_a = hasattr(instance_a, 'val')
    exists_b = hasattr(instance_b, 'val')
    deletion_attempted_a = instance_a.deletion_attempted
    deletion_attempted_b = instance_b.deletion_attempted

    return input_tensor + 1, exists_a, exists_b, deletion_attempted_a, deletion_attempted_b

# Run the test
result = test(torch.ones(1))
print(f"Result tensor: {result[0]}")
print(f"val attribute still exists on instance_a: {result[1]}")
print(f"val attribute still exists on instance_b: {result[2]}")
print(f"Deletion was attempted on instance_a: {result[3]}")
print(f"Deletion was attempted on instance_b: {result[4]}")

```

output:
```
(base) sany@sandishs-Laptop pytorch % python3 test_delattr_fix.py
Result tensor: tensor([2.])
val attribute still exists on instance_a: True
val attribute still exists on instance_b: True
Deletion was attempted on instance_a: True
Deletion was attempted on instance_b: True
```

```
(pytorch-dev) sany@sandishs-Laptop pytorch % python3 -m pytest test/dynamo/test_repros.py::ReproTests::test_delattr_return -v
========================================================= test session starts =========================================================
platform darwin -- Python 3.12.5, pytest-8.3.5, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.12/bin/python3
cachedir: .pytest_cache
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/dynamo/test_repros.py::ReproTests::test_delattr_return PASSED [0.0659s]                                                    [100%]

========================================================== 1 passed in 1.71s ==========================================================
(pytorch-dev) sany@sandishs-Laptop pytorch %
```
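For reference, the eager-mode behaviour that the compiled function has to reproduce can be checked in plain Python, without `torch.compile`. The sketch below reuses the `MyObject` class from the test above; the variable names `still_there` and `attempted` are illustrative only.

```python
class MyObject:
    def __init__(self, val):
        self.val = val
        self.deletion_attempted = False

    def __delattr__(self, attr):
        if attr == "val":
            # Swallow the deletion and only record that it was attempted.
            self.deletion_attempted = True
        else:
            # Any other attribute is deleted the normal way.
            super().__delattr__(attr)


obj = MyObject(1)
del obj.val  # routed through the custom __delattr__

still_there = hasattr(obj, "val")   # the attribute survives the `del`
attempted = obj.deletion_attempted  # the attempt was recorded
```

Because the custom `__delattr__` intercepts `del obj.val`, the attribute is still present afterwards, which matches the `True` values printed in the output above.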

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150899
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-06-04 17:27:20 +00:00
4405dc1487 Revert "Always set CPU affinity for benchmark jobs (#154569)"
This reverts commit 629fca295e1257c2c54d1b6316ed4fa00e6044d6.

Reverted https://github.com/pytorch/pytorch/pull/154569 on behalf of https://github.com/anijain2305 due to potentially causing compile time regressions, unsure ([comment](https://github.com/pytorch/pytorch/pull/154569#issuecomment-2940737778))
2025-06-04 16:52:15 +00:00
8f08f90b61 Bump pillow from 10.0.1 to 10.3.0 in /.github/requirements (#154416)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.0.1 to 10.3.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/10.0.1...10.3.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 10.3.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-04 09:37:13 -07:00
aed938f3a8 Enable check_gomp for Ubuntu OSes (#155119)
And ARM platform
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155119
Approved by: https://github.com/atalman
2025-06-04 15:57:08 +00:00
20912673a6 Revert "Add __main__ guards to jit tests (#154725)"
This reverts commit 1a55fb0ee87eaa8b376aaa82d95d213fe0fbe64b.

Reverted https://github.com/pytorch/pytorch/pull/154725 on behalf of https://github.com/malfet due to This added 2nd copy of raise_on_run to common_utils.py which caused lint failures, see https://github.com/pytorch/pytorch/actions/runs/15445374980/job/43473457466 ([comment](https://github.com/pytorch/pytorch/pull/154725#issuecomment-2940503905))
2025-06-04 15:42:52 +00:00
6f93ce3c86 Revert "[Cutlass] fp8 dynamic shapes test (#154829)"
This reverts commit 36596ad2a009a0906848fa264954d4b200efc50e.

Reverted https://github.com/pytorch/pytorch/pull/154829 on behalf of https://github.com/seemethere due to This is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154829#issuecomment-2940494361))
2025-06-04 15:36:27 +00:00
3fa3dbdb1f Revert "[Cutlass] EVT dynamic shapes support (#154835)"
This reverts commit 4224a7df01a9607830da771fd4884c8eba150630.

Reverted https://github.com/pytorch/pytorch/pull/154835 on behalf of https://github.com/seemethere due to This is part of a stack that is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154835#issuecomment-2940463211))
2025-06-04 15:33:09 +00:00
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb979da16b0a0b758020b330a49c72b7e7.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
a0f2544502 Revert "[dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)"
This reverts commit 6c2f941e250ba34a920f476c8a9ee30e6153fc15.

Reverted https://github.com/pytorch/pytorch/pull/154867 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
1a55fb0ee8 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
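A minimal sketch of what such a helper could look like is shown below. It is an assumption-based stand-in, not the helper added by this PR: the signature, the message wording, and the example path `test/test_jit.py` are illustrative.

```python
def raise_on_run_directly(file_to_call: str) -> None:
    """Stand-in helper: fail loudly and name the file that should be run instead."""
    raise RuntimeError(
        "This test file is not meant to be run directly, "
        f"use:\n\n\tpython {file_to_call}\n\ninstead."
    )


# A test module that must be launched through a parent test file would end
# with a guard like this (kept as a comment so the sketch can be imported):
#
#     if __name__ == "__main__":
#         raise_on_run_directly("test/test_jit.py")
```

Raising instead of silently doing nothing means a user who runs the wrong file gets an immediate error that points at the right entry point.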

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/Skylion007
2025-06-04 14:44:08 +00:00
3f34d26040 Add __main__ guards to distributed tests (#154628)
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.

In distributed tests:

- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()"

Cc @wconstab @clee2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
2025-06-04 14:39:57 +00:00
c8d44a2296 Add __main__ guards to fx tests (#154715)
This PR is part of a series attempting to re-submit #134592 as smaller PRs.

In fx tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715
Approved by: https://github.com/Skylion007
2025-06-04 14:38:50 +00:00
cf9cad31df Add __main__ guards to tests (#154716)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

Add missing `if __name__ == "__main__":` guards to some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154716
Approved by: https://github.com/Skylion007
2025-06-04 14:38:13 +00:00
ca0c2985d3 [ONNX] Allow exporter to export SDPA to Attention onnx operator (#154596)
Fixes [#149662](https://github.com/pytorch/pytorch/issues/149662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154596
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-04 14:29:44 +00:00
31d12b3955 Fix avg_pool2d param kernel_size description (#154353)
Fixes part of #153149

## Test Result

![image](https://github.com/user-attachments/assets/216ffd2b-dd2b-4cf6-9fca-aeed075be5e7)

![image](https://github.com/user-attachments/assets/820cd184-1f8e-4a7a-b64e-15dfb9c7dad2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154353
Approved by: https://github.com/colesbury
2025-06-04 11:55:01 +00:00
2af78d368f Skip another test file that doesn't run gradcheck for slow gradcheck (#154852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154852
Approved by: https://github.com/albanD
2025-06-04 07:47:09 +00:00
0f10df71a6 [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA on XPU is always allocated with contiguous strides, while CPU/CUDA flash_attention and cudnn_attention allocate the output tensor with the same strides as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)
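The idea of "same stride order as Query" can be sketched in pure Python. This is not the `alloc_with_matching_layout` implementation; the helper names `contiguous_strides` and `strides_like` are hypothetical, and the sketch only computes strides rather than allocating a tensor.

```python
def contiguous_strides(shape):
    """Row-major strides for `shape` (what the XPU path used before)."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def strides_like(query_strides, out_shape):
    """Dense strides for `out_shape` that keep the dim ordering of the query."""
    # Visit dims from innermost (smallest query stride) to outermost.
    # Ties are broken so that the later dim is treated as more inner.
    order = sorted(range(len(out_shape)), key=lambda d: (query_strides[d], -d))
    strides = [0] * len(out_shape)
    running = 1
    for d in order:
        strides[d] = running
        running *= out_shape[d]
    return strides


# Query laid out as (B, S, H, D) in memory but viewed as (B, H, S, D).
q_shape = (2, 4, 8, 16)
q_strides = (512, 16, 64, 1)

old_out = contiguous_strides(q_shape)       # ignores the query layout
new_out = strides_like(q_strides, q_shape)  # follows the query layout
```

With a transposed query like the one above, the contiguous allocation and the layout-matching allocation differ, which is the mismatch between XPU and CPU/CUDA that this PR removes.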

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/guangyey
2025-06-04 07:16:56 +00:00
1e20745532 [ez][AOTI] Fix index offset for Optional Tensor Return (#155073)
Summary: As title. See added test for more context.

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output_2

Rollback Plan:

Differential Revision: D75900658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155073
Approved by: https://github.com/angelayi
2025-06-04 06:22:46 +00:00
d2bfd97d71 [export] Refactor pt2 save/load (#152495)
Refactor the pt2 archive saving to consolidate the format of torch.export.save and torch._inductor.package.package_aoti.

This PR adds the following functions, which torch.export.save and AOTI packaging call into:
```python
package_pt2(
    f: FileLike,
    *,
    exported_programs: Optional[Union[ExportedProgram, dict[str, ExportedProgram]]] = None,
    aoti_files: Optional[Union[list[str], dict[str, list[str]]]] = None,
    extra_files: Optional[dict[str, Any]] = None,
) -> FileLike

@dataclass
class PT2ArchiveContents:
    exported_programs: dict[str, ExportedProgram]
    aoti_runners: dict[str, AOTICompiledModel]
    extra_files: dict[str, Any]

load_pt2(f: FileLike) -> PT2ArchiveContents
```

Power users can call these APIs directly if they want to bundle multiple exported programs, AOTI files, or extra metadata.

This is what the pt2 archive looks like ([spec](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?tab=t.0)):
```
├── archive_format
├── version
├── .data
├── data
│   ├── aotinductor
│   │   └── model1
│   │       ├── model1.cpp
│   │       ├── model1.so  # currently AOTI automatically moves weights in here, TODO to move it out
│   │       ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│   │       └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│   ├── weights
│   │   ├── model1.pt  # TODO to dedup weights between model1/model2
│   │   └── model2.pt
│   ├── constants
│   │   ├── model1.pt  # TODO to dedup weights between model1/model2
│   │   └── model2.pt
│   └── sample_inputs
│       ├── model1.pt  # TODO to dedup weights between model1/model2
│       └── model2.pt
├── extra
│   └── user_metadata.txt
└── models
    ├── model1.json
    └── model2.json
```

Future todos:
- unbundle the weights -- instead of .pt, we can use bin files, which will also allow us to dedup weights if we store multiple models
- update aoti_compile_and_package to also save the exported program
- integrate TNR with this packaging flow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152495
Approved by: https://github.com/yushangdi
2025-06-04 06:04:29 +00:00
75b24c273b Export torch::utils::tensor_to_numpy (#154178)
Fixes #154105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154178
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/youkaichao
2025-06-04 05:48:27 +00:00
7b074346e0 [Intel GPU] Support f32 intermediate dtype, headdim size <=576 and f32 causal mask for SDPA (#152091)
In OneDNN v3.7, SDPA has the following defects:

1. The dtype of the intermediate value is the same as QKV, while PyTorch uses FP32 for the intermediate value to ensure better accuracy.
2. Only headdim sizes <= 256 are supported.
3. The implicit causal mask is not supported when QKV is FP32, so we need to build an attention mask explicitly with aten ops.

OneDNN v3.8 fixes these defects. Since these are tiny changes, I decided to put them in a single PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
2025-06-04 05:18:36 +00:00
4d93985d13 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, namely the watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instance, so if users create hundreds or thousands of PGs or subPGs, we end up with twice that many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Both threads currently carry so many responsibilities that it is hard to consolidate them in one shot, so we will split the work into at least two steps (PRs) to make it easier to test and review.

We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see whether we can make the monitoring thread a class. This PR takes the first step and makes the monitoring thread a class. The next step is to extract the watchdog into a separate class as well, so that we know its dependencies.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, where it is more relevant. This is fine because EventCache has been fully rolled out, so watchdog hangs are now rare.

The heartbeat monitoring thread currently has two major functions:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program with SIG_ABORT.
2. Check TCPStore every 30 seconds to see whether a watchdog timeout has happened on another rank; if so, initiate a dump signal on the current rank as well. (We do this only in the default PG.)
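The first of these responsibilities can be sketched in Python as follows. This is only an illustration under assumptions: the real `HeartbeatMonitor` is C++ code inside PGNCCL, the names `HeartbeatMonitorSketch`, `get_heartbeat`, and `on_stuck` are hypothetical, and the TCPStore check is left out.

```python
import threading


class HeartbeatMonitorSketch:
    """Watches a heartbeat counter that another thread is expected to bump."""

    def __init__(self, get_heartbeat, on_stuck, interval_s=8 * 60):
        self._get_heartbeat = get_heartbeat  # callable returning the counter
        self._on_stuck = on_stuck            # stand-in for aborting the process
        self._interval_s = interval_s
        self._last_seen = get_heartbeat()
        self._stop = threading.Event()

    def check_once(self) -> bool:
        """Return True if the heartbeat advanced since the previous check."""
        current = self._get_heartbeat()
        progressed = current != self._last_seen
        self._last_seen = current
        return progressed

    def run(self) -> None:
        # Event.wait returns False on timeout, i.e. the monitor was not
        # stopped, so it is safe to act on a missing heartbeat.
        while not self._stop.wait(self._interval_s):
            if not self.check_once():
                self._on_stuck()
                return

    def stop(self) -> None:
        self._stop.set()
```

Keeping the state and the loop in one class, rather than spread across the process group, is what makes it possible to reason about the monitor's dependencies separately from the watchdog.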

Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-06-04 04:07:07 +00:00
ec35a36820 [ROCm][Windows] Fix building tests for multiple architectures (#154979)
Fix building C10_CUDA_ALL_TEST_FILES and Caffe2_HIP_TEST_SRCS for multiple architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154979
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-04 03:53:21 +00:00
72fe1d5f42 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
ghstack dependencies: #154863
2025-06-04 03:37:09 +00:00
6b0c6f2856 [BE] Delete pre-CUDA-10.1 code from SparseCUDABlas (#155079)
The latest PyTorch is no longer buildable against CUDA 10, so this is essentially dead code.

Made a small change to the hipify script to rename `cusparseGetErrorString` to `hipsparseGetErrorString`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155079
Approved by: https://github.com/atalman, https://github.com/cyyever
2025-06-04 03:29:24 +00:00
9f39028629 [MPS][BE] Move sigmoid op to Metal (#155080)
Fixes https://github.com/pytorch/pytorch/issues/154895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155080
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #154936, #155002, #155081
2025-06-04 03:28:11 +00:00
437df54cc8 [Inductor] Fix a few FX conversion bugs. (#154958)
# Feature
This PR fixes two bugs with Inductor's FX backend.
1. When extracting offsets from `ReinterpretView`s, we accidentally took the offset of the parent layout instead of the view's layout. This case is triggered when multiple kernels write into the same buffer due to `torch.cat`.
2. In certain rare cases, `V.graph.graph_inputs` can contain a constant input value. In case this happens, create a new `sympy.Symbol` for the input, for compatibility with the existing `SymbolBuffer` abstraction mapping to an FX placeholder.  This case is triggered when calling `torch._inductor.compile` on  certain modules coming from `torch.export`.

# Test plan
Added a couple of tests exposing these bugs.
1. Concat with multiple kernels writing to the same buffer.
2. `Export` -> `torch._inductor.compile` with a constant input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154958
Approved by: https://github.com/jansel
2025-06-04 03:09:44 +00:00
3e57de1251 [ONNX] Create support for rotary embeddings (#154745)
This PR registers the RotaryEmbedding op in the `torch.ops.onnx` namespace and allows the exporter to recognize and export ONNX operators.

## Design

ONNX operators for their respective opset versions are implemented in torch/onnx/ops/_impl.py and are registered in the torch.ops.onnx namespace according to the following rule:

`OpType-version => torch.ops.onnx.OpType.opset{version}`

For example, `RotaryEmbedding-23` becomes `torch.ops.onnx.RotaryEmbedding.opset23`

This name is parsed by the exporter to create an onnx node in the graph without having to go through translation.
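The naming rule above is a plain string mapping, which the following sketch spells out. The function name `versioned_onnx_name_to_torch_op` is hypothetical and exists only for illustration; it is not part of the exporter.

```python
def versioned_onnx_name_to_torch_op(name: str) -> str:
    """Map 'OpType-version' to 'torch.ops.onnx.OpType.opset{version}'."""
    op_type, sep, version = name.rpartition("-")
    if not sep or not op_type or not version.isdigit():
        raise ValueError(f"expected 'OpType-version', got {name!r}")
    return f"torch.ops.onnx.{op_type}.opset{int(version)}"


qualified = versioned_onnx_name_to_torch_op("RotaryEmbedding-23")
```

Encoding the opset version in the overload name lets several versions of the same operator coexist under one op type.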

When users use the ops in a model, we provide more convenient, unversioned functions under `torch.onnx.ops` that dispatch to the implementations based on user input (type and provided attributes). For example, users can directly call `torch.onnx.ops.rotary_embedding()` to use the op natively in their PyTorch models. I chose snake case naming to make the functions more pythonic and aligned with other torch APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154745
Approved by: https://github.com/titaiwangms
2025-06-04 03:07:43 +00:00
37e6bf8adf Switch to _apply_to_tensors for dataclass input (#154897)
Fixes https://github.com/pytorch/pytorch/issues/153077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154897
Approved by: https://github.com/weifengpy
2025-06-04 02:19:52 +00:00
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
Will fix #119548 and linked issues once we switch from warning to the new behavior,
but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive.
We will change the behavior after 2.8 branch is cut.
Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
e2760544fa [PT] expose FlightRecord API for building (#154866)
Summary: as titled

Test Plan:
CI

Rollback Plan:

Differential Revision: D75803611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154866
Approved by: https://github.com/fduwjj, https://github.com/d4l3k
2025-06-04 01:25:52 +00:00
d8e4c1c363 [BE] Define REGISTER_UNARY_TI_DISPATCH (#155081)
This creates a _kernel_mps function that takes an iterator and calls the stub for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155081
Approved by: https://github.com/dcci
ghstack dependencies: #154936, #155002
2025-06-04 01:15:37 +00:00
50de6ae253 Revert "[BE][Ez]: Fully type nn.utils.clip_grad (#154801)"
This reverts commit 9ce2732b685da527308dc2dc4b2eeb4e252f57d1.

Reverted https://github.com/pytorch/pytorch/pull/154801 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154801#issuecomment-2937886337))
2025-06-04 00:41:27 +00:00
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This pr uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
6c2f941e25 [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For a program like this:

```
import torch


class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
ghstack dependencies: #154780
2025-06-04 00:05:53 +00:00
cbdacd32fe [AOTI][Intel GPU] Support multi_arch_kernel_binary option for XPU. (#154514)
Following the design of #154413, this PR adds XPU support for generating kernel binary files that support multiple archs.

Fixes #154682, Fixes #154683, Fixes #154689, Fixes #154685, Fixes #154690, Fixes #154681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154514
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-06-03 23:02:00 +00:00
8f0e3f446d [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #154091
2025-06-03 23:01:05 +00:00
6c40e6606f [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-06-03 23:01:05 +00:00
4224a7df01 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154775, #154761, #154829
2025-06-03 22:20:34 +00:00
36596ad2a0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #154775, #154761
2025-06-03 22:20:33 +00:00
1c2b9cecd2 [Cutlass] Support bias arg for fp8 GEMM (#154761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154761
Approved by: https://github.com/drisspg
ghstack dependencies: #154775
2025-06-03 22:20:27 +00:00
5735729597 [Cutlass] Cleanup gemm_template evt handling (#154775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154775
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-03 22:20:18 +00:00
71499fee6b [3/3] Add build rule and test for Graph in nativert (#154532)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. Add source file
3. **Add test and build rules**

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
2025-06-03 21:52:05 +00:00
b4c399d445 [2/3] Add source file for Graph in nativert (#154531)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. **Add source file**
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75492405](https://our.internmc.facebook.com/intern/diff/D75492405/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154531
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530
2025-06-03 21:51:52 +00:00
55873dcb0d [1/3] Add header file for Graph in nativert (#154530)
We split the large PR for adding Graph.h and Graph.cpp to `nativert` into 3 smaller PRs:
1. **Add header file**
2. Add source file
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
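
The relationship between the four classes can be sketched as follows. This is only an illustrative Python model of the description above, not the nativert C++ API: the enum members, the field names and the `add_node` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Type(Enum):
    """Kind of a Value; the members here are illustrative only."""
    TENSOR = auto()
    SYM_INT = auto()


@dataclass
class Value:
    """A single symbolic value, tagged with its kind."""
    name: str
    type: Type


@dataclass
class Node:
    """A single unit of execution with Value inputs and outputs."""
    target: str  # e.g. the name of a PyTorch op
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class Graph:
    """A computation graph: an ordered list of nodes."""
    nodes: list = field(default_factory=list)

    def add_node(self, target, inputs, output_name, output_type):
        # Hypothetical helper: create the output Value and record the Node.
        out = Value(output_name, output_type)
        self.nodes.append(Node(target, list(inputs), [out]))
        return out


g = Graph()
x = Value("x", Type.TENSOR)
y = g.add_node("aten.relu", [x], "y", Type.TENSOR)
```

Keeping values as separate objects that nodes point to is what makes transformation and analysis passes straightforward: a pass can follow a `Value` from the node that produces it to the nodes that consume it.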

Differential Revision: [D75491860](https://our.internmc.facebook.com/intern/diff/D75491860/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154530
Approved by: https://github.com/SherlockNoMad
2025-06-03 21:51:47 +00:00
69a57d9486 add JSON output support for operator benchmark (#154410)
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I added a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed `--output-dir` to `--output-csv` for a clearer and more intuitive expression.
An example of the JSON output of the operator benchmark:
```
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  }
]
```
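
For reference, a record with the same shape as the example above could be produced like this. This is a minimal sketch, not the code added by this PR: `make_dashboard_record` and its parameters are hypothetical names, and only the keys shown in the example output are taken from the source.

```python
import json


def make_dashboard_record(op_name, input_config, latency_us,
                          mode="inference", dtype="float32"):
    """Build one record shaped like the example JSON above (hypothetical helper)."""
    return {
        "benchmark": {
            "name": f"PyTorch operator benchmark - {op_name}",
            "mode": mode,
            "dtype": dtype,
            "extra_info": {"input_config": input_config},
        },
        "model": {
            "name": op_name,
            "type": "micro-benchmark",
            "origins": ["pytorch"],
        },
        "metric": {
            "name": "latency",
            "unit": "us",
            "benchmark_values": [latency_us],
            "target_value": None,  # serialized as JSON null
        },
    }


records = [
    make_dashboard_record("add_M1_N1_K1_cpu", "M: 1, N: 1, K: 1, device: cpu", 2.074),
]
text = json.dumps(records, indent=2)
```

Serializing with `json.dumps` avoids hand-written separators, so the output never ends with a trailing comma.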

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
2025-06-03 21:29:24 +00:00
8e1474d3c6 [inductor] small cleanups in torch/_inductor/codegen/mps.py (#154921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154921
Approved by: https://github.com/jansel, https://github.com/Skylion007
2025-06-03 20:57:25 +00:00
cyy
debd095149 Avoid index integer overflow in gemm_notrans_ (#154809)
Use uint64_t index types to avoid the following signed integer overflow:
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```
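
The overflow in the sanitizer report can be checked arithmetically. The snippet below is a plain Python sanity check, not the ATen code; it takes the two operands from the message above and compares their sum against the signed and unsigned 64-bit ranges. The constant names are illustrative.

```python
# Operands taken from the UBSan message above.
a = 9223365439786057728
b = 13194139533312

INT64_MAX = 2**63 - 1   # largest value of a signed 64-bit `long`
UINT64_MAX = 2**64 - 1  # largest value of `uint64_t`

# Python integers are arbitrary precision, so this sum never wraps.
total = a + b

overflows_int64 = total > INT64_MAX
fits_uint64 = total <= UINT64_MAX
print(overflows_int64, fits_uint64)
```

For these operands the sum exceeds the signed range but still fits in 64 unsigned bits. In C++, unsigned arithmetic is also defined to wrap, so the sanitizer has nothing to report.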

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154809
Approved by: https://github.com/soulitzer
2025-06-03 19:28:34 +00:00
10c3e6ec43 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
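
The idea of threading an optional operator name into the assertion message can be sketched in pure Python. This is an assumption-laden illustration, not the C++ implementation in guards.cpp: `check_size_stride`, its argument order and the message wording are all hypothetical, and `aten.mm.default` is used only as an example name.

```python
def check_size_stride(actual_size, actual_stride,
                      expected_size, expected_stride, op_name=None):
    """Hypothetical pure-Python analogue of a size/stride assertion."""
    size_ok = tuple(actual_size) == tuple(expected_size)
    stride_ok = tuple(actual_stride) == tuple(expected_stride)
    if size_ok and stride_ok:
        return
    msg = (
        f"expected size {tuple(expected_size)} and stride {tuple(expected_stride)}, "
        f"got size {tuple(actual_size)} and stride {tuple(actual_stride)}"
    )
    if op_name is not None:
        # The optional name tells the reader which op produced the tensor.
        msg += f" (op: {op_name})"
    raise AssertionError(msg)


try:
    check_size_stride((2, 3), (3, 1), (2, 3), (1, 2), op_name="aten.mm.default")
    message = None
except AssertionError as exc:
    message = str(exc)
```

Making the argument optional keeps existing call sites valid while letting generated code pass the name where it is known.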

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-03 19:21:15 +00:00
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
ea7b233015 [flex attention][triton pin] triton_helpers shim for TMA apis (#154858)
Triton 3.4 will remove the experimental TMA apis: https://github.com/triton-lang/triton/pull/6488

To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API.
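
The general shape of such a shim is a feature check followed by a fallback. The sketch below uses stand-in objects rather than Triton itself: `make_descriptor` and `experimental_make_descriptor` are hypothetical attribute names chosen for illustration, not real Triton symbols.

```python
from types import SimpleNamespace


def pick_tma_api(backend):
    """Return the new entry point if `backend` has it, else the experimental one.

    Both attribute names are hypothetical placeholders.
    """
    stable = getattr(backend, "make_descriptor", None)
    if stable is not None:
        return stable
    return backend.experimental_make_descriptor


# Stand-ins for an older and a newer backend version.
old_backend = SimpleNamespace(experimental_make_descriptor=lambda: "experimental")
new_backend = SimpleNamespace(
    make_descriptor=lambda: "stable",
    experimental_make_descriptor=lambda: "experimental",
)

old_result = pick_tma_api(old_backend)()
new_result = pick_tma_api(new_backend)()
```

Checking for the attribute rather than comparing version strings keeps the shim working on development builds whose version number does not yet reflect the API change.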

Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda`, which previously failed with triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8 but now passes.

Note: we'll need to apply this to other parts of inductor; this PR only covers flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154858
Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
2025-06-03 19:15:48 +00:00
85fb13d0d1 [BE] Cleanup cuda 12.4 artifacts from scripts and workflows (#154893)
Remove artifacts. CUDA 12.4 was deprecated, hence there is no need to keep this code around.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154893
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/tinglvv
2025-06-03 18:43:40 +00:00
c014e9d7cd [inductor][test] test_padding.py: use inductor TestCase instead of dynamo TestCase (#154935)
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.

To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
2025-06-03 18:36:44 +00:00
e8183f8d3d add #pragma once to stable/library.h (#154920)
This shoulda been there and it was an oversight that it was not! We do not want the same translation unit to process this header multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154920
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-03 18:34:53 +00:00
6f7694f18f [dynamo] Reconstruct defaultdict properly (#154931)
`DefaultDictVariable` inherited `ConstDictVariable.reconstruct`, causing
dynamo to reconstruct a `DefaultDictVariable` into a dict rather than
defaultdict. This patch fixes that.
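
The observable difference between the two reconstructions can be shown with the standard library alone. This is a plain Python illustration of why the distinction matters, not dynamo's reconstruction code.

```python
from collections import defaultdict

original = defaultdict(list)
original["a"].append(1)

# Rebuilding as a plain dict keeps the items but drops the default factory.
as_dict = dict(original)
# Rebuilding as a defaultdict preserves both.
as_defaultdict = defaultdict(original.default_factory, original)

as_defaultdict["b"].append(2)  # the missing key is created on access

try:
    as_dict["b"].append(2)  # a plain dict has no default factory
    plain_dict_failed = False
except KeyError:
    plain_dict_failed = True
```

Code that relies on missing keys being created on access works with the `defaultdict` and raises `KeyError` with the plain `dict`.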

Fixes #138412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154931
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #154930
2025-06-03 18:18:40 +00:00
467235027c [AOTDispatch] Use the proper meta function for _amp_foreach_non_finite_check_and_unscale_ (#154930)
As title, this fixes part of #138412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154930
Approved by: https://github.com/zou3519
2025-06-03 18:18:40 +00:00
462579af11 Update merge_rules.yaml (#155008)
- add new docs reviewers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155008
Approved by: https://github.com/malfet
2025-06-03 18:09:23 +00:00
f714599c57 [MPS][BE] Extend torch.special. to integer dtypes (#155002)
By changing the functor to look as follows
```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10::metal::xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10::metal::xlog1py(float(a), float(b));
  }
};
```

Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #154936
2025-06-03 17:52:41 +00:00
31405a69fb [typing] Add missing type annotations to torch.nn.init module (#154504)
## Summary

Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.

## Changes

- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed `# mypy: allow-untyped-defs` comment
- Create Literal types for kaiming initialization mode and nonlinearity.
- Created `__all__`

## Why

Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements.

Tested with existing test suite and lintrunner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154504
Approved by: https://github.com/Skylion007
2025-06-03 17:33:32 +00:00
40142978d7 Add type annotation to orthogonal_ (#154927)
Trivial change, but I want pyright to stop yelling at me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154927
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 17:00:02 +00:00
1f131fe56b Update bug-report.yml (#154857)
Update issue template for binary data and numerical notes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154857
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-03 16:13:07 +00:00
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototype of FR integration for gloo. A few feature gaps remain:
- Input/Output numels for each collective
- Whether to use c10::Event or where to use it.
- Where to dump the FR traces. (The dump api is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
283f876ab6 [PP] Fix disabled flaky tests (#154856)
Fix https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481

Because MultiProcContinousTest [now executes the tests with 8 GPUs instead of 2](https://github.com/pytorch/pytorch/pull/153653), our PP tests comparing gradients have become flakier due to the longer pipeline. The gradients are still close but we need to relax the tolerance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154856
Approved by: https://github.com/Skylion007
2025-06-03 15:55:29 +00:00
250e9af4da Removing per torch.compile audit. (#154572)
Removing https://pytorch.org/docs/stable/torch.compiler_best_practices_for_backends.html per torch.compile audit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154572
Approved by: https://github.com/williamwen42, https://github.com/svekars
2025-06-03 15:41:52 +00:00
3685b10170 Turn on compile with NVSHMEM (#154538)
Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154538
Approved by: https://github.com/ngimel
2025-06-03 15:24:24 +00:00
a1a268aff5 [dtensor] fix simplefsdp mixed-precision training bugs (#154975)
This is a follow-up on the previous dtensor redistribute PR: https://github.com/pytorch/pytorch/pull/150740, which enables SimpleFSDP's mixed-precision training.

In the most recent integration in TorchTitan: https://github.com/pytorch/torchtitan/pull/1250, we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when MPT is enabled. After debugging, I found the problem is in dtensor redistribute: `local_tensor` is taken out again from the original `input`. Thus, the dtensor used for communication has its original precision instead of using `forward_dtype`.

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154975
Approved by: https://github.com/tianyu-l
2025-06-03 14:47:36 +00:00
2608927cfb Solve for tilings (#153748)
Find variables that coalesce the reads and writes and score the total size. If uncoalesced memory expressions are found, look for additional tiling of variables which will coalesce memory accesses.

For instance - for the following expression: `(32*p0) // 2048`, tiling p0 by 64 will make this expression coalesced.
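
The arithmetic behind this example can be checked numerically. The snippet below is only a brute-force check in plain Python, not the solver added in this PR; splitting `p0` into `outer` and `inner` variables is an assumed way of expressing "tiling by 64".

```python
def index_expr(p0):
    """The memory index expression from the example above."""
    return (32 * p0) // 2048


TILE = 64

# Split p0 into an outer and an inner variable: p0 = TILE * outer + inner.
# Since 32 * inner < 2048 for inner < 64, the floor division drops the inner
# term and the expression reduces to `outer`.
constant_within_tile = all(
    index_expr(TILE * outer + inner) == outer
    for outer in range(16)
    for inner in range(TILE)
)
```

After the split the index depends only on the outer variable and advances by one per step, which is one reading of why the tiled expression counts as coalesced.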

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153748
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730
2025-06-03 14:37:30 +00:00
812deecaab Add option to define OpenBLAS version for manylinux Dockerfile_2_28_aarch64 (#150106)
Adds optional variable OPENBLAS_VERSION to `.ci/docker/common/install_openblas.sh` used to define which version of OpenBLAS to install. Adds argument to `Dockerfile_2_28_aarch64` image.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150106
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
2025-06-03 14:35:54 +00:00
0adbde4d35 Analyze coalesced mem (#153730)
Analyze memory expressions to see if they contain a coalescing symbol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153730
Approved by: https://github.com/jansel
ghstack dependencies: #153723
2025-06-03 14:29:06 +00:00
e9266f807a [BE] Use vendored packaging for testing (#154946)
As the rest of the torch uses it, test should rely on it as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154946
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 14:22:53 +00:00
9cdce682a1 [MPS][BE] Reimplement log1p as Metal shader (#154936)
That should make it faster than the MPSGraph implementation, and it also
improves accuracy for small inputs by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $\log(1+x) = \frac{x \cdot \log(1+x)}{(1 + x) - 1}$ if $1 + x \neq 1$, else just $x$

Also tried using the first 3 elements of the Taylor series in Horner's form, which also seems to work fine, i.e. $\log(1+x) \approx x \cdot (1 - x (\frac{1}{2} - \frac{x}{3}))$
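
Both formulas are easy to try out on the CPU. The following is a double-precision Python sketch of the two expressions above, not the Metal shader from this PR; the function names are made up for the example, and inputs are assumed to satisfy `x > -1`.

```python
import math


def log1p_goldberg(x):
    """log(1 + x) via x * log(1 + x) / ((1 + x) - 1), falling back to x."""
    u = 1.0 + x
    if u == 1.0:  # x is too small to change 1.0, so log1p(x) is x to working precision
        return x
    return x * math.log(u) / (u - 1.0)


def log1p_taylor3(x):
    """First three Taylor terms in Horner form: x - x^2/2 + x^3/3."""
    return x * (1.0 - x * (0.5 - x / 3.0))


naive = math.log(1.0 + 1e-20)  # 1.0 + 1e-20 rounds to 1.0, so the input is lost
careful = log1p_goldberg(1e-20)
```

The naive expression loses the input entirely once `1 + x` rounds to `1`, while the corrected form returns `x` itself. The Taylor form is only appropriate for small `|x|`, since its truncation error grows like `x^4 / 4`.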

Replaced the less accurate log1p implementation in `c10/metal/special_math.h` with the generic one.

Parametrize and modify regression test to check for accuracy of small values

TODOs:
 - Do proper implementation for complex values as well, perhaps using 0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)
 - Maybe implement it using the Remez-like algorithm documented here 207f3b2b25/lib/msun/src/s_log1pf.c (L37)
 - Or use llvm's implementation from f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)
 - Benchmark which algorithm is faster and delivers better accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-06-03 14:10:13 +00:00
00dfd3891e [Tiling rewrite pt1] Normalize reads and writes to common iter space (#153723)
In order to take the globally best tiling, we need to normalize all the node reads and writes to a common iteration space. This first PR finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153723
Approved by: https://github.com/jansel
2025-06-03 14:04:34 +00:00
635b73e697 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
2025-06-03 11:50:57 +00:00
71a0af8a14 [TEST][Quantization] Skip test_learnable due to hypothesis (#152819)
As per comment in https://github.com/pytorch/pytorch/issues/111471#issuecomment-1866933243 the tests are failing due to hypothesis. This PR adds a skip to those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152819
Approved by: https://github.com/eqy
2025-06-03 11:23:15 +00:00
ea5b9eca74 Combine sticky pgo key with job id (#154863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154863
Approved by: https://github.com/Mingming-Ding
2025-06-03 07:58:38 +00:00
a4da1d4a47 [Graph Partition] support standalone_compile (#154698)
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
2025-06-03 07:40:42 +00:00
d91c85babb [c10d][fr] Split cuda and non-cuda fr logic into two cpp file (#154929)
While integrating FR with gloo, I found that putting all the logic inside one cpp file with both build macros does not work with the current linkage setup in the Bazel file. If we put the cpp file in libtorch_cpu, the CUDA-side build fails; if we put it in both, we get a complaint about `ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter`. To fix this, we move the common logic into a separate header file and use different cpp files for CPU and CUDA, so that FR can be used in both cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154929
Approved by: https://github.com/kwen2501
2025-06-03 07:00:14 +00:00
13044b2b04 Move c10/macros/Export.h to torch/standalone (#154850)
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.

Test Plan: CI

Differential Revision: D75756619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
2025-06-03 06:18:59 +00:00
a7e496a896 Revert "[dynamo] Record the pre-graph bytecode using fast record function event (#154769)"
This reverts commit 409c396a48584de1ab14e1be6957663d548ad89e.

Reverted https://github.com/pytorch/pytorch/pull/154769 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
b86aaaae0b Revert "[dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)"
This reverts commit 7dee89913072f1499c5265d8e92d23c30fc6a7f1.

Reverted https://github.com/pytorch/pytorch/pull/154764 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
d375e64279 [cutlass backend][forward fix] hex the cutlass key instead of decode (#154885)
This is mainly following how it is done for torch_key.

Error was:
```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
```
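
The difference between the two conversions is visible on any byte string that is not valid UTF-8. The snippet below is a standalone Python illustration, not the inductor code; `raw_key` is an arbitrary stand-in for a binary key.

```python
raw_key = b"\xc3\x28\xa0\xa1"  # arbitrary bytes that are not valid UTF-8

try:
    raw_key.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

hex_key = raw_key.hex()  # always succeeds and yields printable ASCII
```

Hex encoding doubles the length of the key, but it is defined for every byte value, which is what a cache key built from hash output needs.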

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154885
Approved by: https://github.com/jingsh, https://github.com/mlazos
2025-06-03 06:00:16 +00:00
8af447224e Improve error message for torch.fft.ihfft2 when input's dtype is complex (#149692)
Fixes #149625

For the case mentioned in the issue, will get:

```
RuntimeError: Only supports floating-point dtypes, but found: ComplexDouble
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149692
Approved by: https://github.com/malfet
2025-06-03 05:54:56 +00:00
295ea202f6 [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-06-03 04:01:49 +00:00
cyy
388912dd94 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 03:49:09 +00:00
ef92653022 Revert "Remove AttributeError constructor (#154808)"
This reverts commit 3239da0c732c4ad736df7081ea44c1cd79c01145.

Reverted https://github.com/pytorch/pytorch/pull/154808 on behalf of https://github.com/cyyever due to Need format code ([comment](https://github.com/pytorch/pytorch/pull/154808#issuecomment-2933286113))
2025-06-03 03:40:41 +00:00
b3cb0e83de [FSDP2] respect reshard_after_forward=True for root model (#154704)
Resolves https://github.com/pytorch/pytorch/issues/154655

`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model will be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, FLUX. We should reshard the parameters of CLIP and T5 immediately after their forward
* for recommendation models, we may have multiple roots for the dense part

Change the default behavior to always respect `reshard_after_forward=True`

Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
2025-06-03 03:12:45 +00:00
ff35c0cdfd [inductor] Change _constexpr_to_value -> _unwrap_if_constexpr (#154905)
To adapt to the changes from: f480e2f697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154905
Approved by: https://github.com/davidberard98
2025-06-03 03:10:56 +00:00
cyy
e3cf73ee49 Move remaining CI jobs to VS 2022 (#154811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154811
Approved by: https://github.com/huydhn
2025-06-03 02:21:24 +00:00
3239da0c73 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 02:18:51 +00:00
28cb3c0fe5 [test][inductor] attempt to fix duplicate registration issue (#154865)
Fixes #154216

In #154216, there's a duplicate registration error thrown from registering `test::foo` twice. I expect that this is caused by having two tests that both register a `test::foo` op in the same test file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154865
Approved by: https://github.com/NikhilAPatel, https://github.com/jingsh
2025-06-03 01:11:47 +00:00
6cb6da6ea2 [triton pin][test] relax codecache test checks for number of triton artifacts (#154879)
Triton has added another artifact that gets generated (triton-lang/triton#6992), so `test_cache_load_function` started failing as there are now 8 (instead of 7) artifacts.

Instead of figuring out a way to check exactly which set of artifacts will get generated, I modified the test to just check that there are _at least_ 6 artifacts, to account for different platforms (intel/amd/nvidia) and different triton versions (which may or may not have a `.source` artifact)
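
A lower-bound check of this kind might look like the sketch below. The directory layout and artifact file names are assumptions made for illustration; only the threshold of 6 comes from the description above.

```python
import tempfile
from pathlib import Path

MIN_ARTIFACTS = 6  # lower bound taken from the commit message


def count_artifacts(kernel_dir):
    """Count regular files in a kernel cache directory."""
    return sum(1 for p in Path(kernel_dir).iterdir() if p.is_file())


def has_enough_artifacts(kernel_dir, minimum=MIN_ARTIFACTS):
    # A lower bound tolerates backends or Triton versions that emit extra files.
    return count_artifacts(kernel_dir) >= minimum


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical artifact names; real sets vary by backend and Triton version.
    names = ["k.json", "k.ttir", "k.ttgir", "k.llir",
             "k.ptx", "k.cubin", "k.source", "k.extra"]
    for name in names:
        (Path(tmp) / name).write_text("")
    total = count_artifacts(tmp)
    enough = has_enough_artifacts(tmp)
```

The check keeps passing if a later Triton release adds yet another file, which an exact-count assertion would not.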

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154879
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-06-03 00:52:54 +00:00
7f44b589be [dynamo] fix pruning locals with ShapeEnvSource (#154752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154752
Approved by: https://github.com/zhxchen17
2025-06-03 00:35:11 +00:00
47a142c3c2 [triton pin][tests] update inductor/profiler launch_(enter|exit)_hooks tests (#154894)
Fixes #154223

Triton has updated launch_(enter|exit)_hooks so that they are now in `knobs`. @danzimm already fixed this in #152457 - this just updates the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154894
Approved by: https://github.com/jingsh, https://github.com/NikhilAPatel
2025-06-03 00:14:14 +00:00
731acbfb0b [CI] Reuse old whl on PRs (#154662)
Turn off main branch only gating for reusing old whls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154662
Approved by: https://github.com/huydhn
2025-06-03 00:10:39 +00:00
af9f18e87e [nativert] Free stale execution frames (#154636)
Summary:
This was implemented in SR because cached runtime instances built up and caused memory usage spikes after a large amount of traffic went through the model; once traffic went down, SR was still caching everything from the previous usage.

We need something similar on the Sigmoid side to make sure the static dispatch modules aren't hogging memory. Currently, all ExecutionFrame objects are being cached, and never freed if stale.
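
The idea can be sketched as a pool that frees frames which have sat idle longer than a clearing interval, while keeping a minimum number cached. `FramePool` and its parameters are hypothetical names for illustration and do not reflect the actual nativert implementation.

```python
import time


class FramePool:
    """Toy pool that caches frames and frees idle ones down to a minimum size."""

    def __init__(self, min_size=8, clear_interval_s=1.0, clock=time.monotonic):
        self.min_size = min_size
        self.clear_interval_s = clear_interval_s
        self._clock = clock
        self._idle = []  # (frame, time_returned) pairs, oldest first

    def acquire(self):
        # Reuse the most recently returned frame, or create a new one.
        if self._idle:
            frame, _ = self._idle.pop()
            return frame
        return object()

    def release(self, frame):
        self._idle.append((frame, self._clock()))

    def trim(self):
        """Drop frames idle for at least the interval, keeping min_size frames."""
        now = self._clock()
        while (len(self._idle) > self.min_size
               and now - self._idle[0][1] >= self.clear_interval_s):
            self._idle.pop(0)
        return len(self._idle)


# Simulate a traffic spike followed by a quiet period, using a fake clock.
fake_now = [0.0]
pool = FramePool(min_size=8, clear_interval_s=1.0, clock=lambda: fake_now[0])
frames = [pool.acquire() for _ in range(20)]
for frame in frames:
    pool.release(frame)
size_before = pool.trim()  # nothing is stale yet
fake_now[0] = 5.0          # traffic drops and frames sit idle
size_after = pool.trim()
```

Injecting the clock keeps the sketch deterministic; a real implementation would more likely run the trim step on a timer.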

Test Plan:
Added extra execution frames in tmp commit D75257998 and ran local replayer test to confirm extra execution frames get cleaned up down to min size, which is set at 8

Also tested by modifying load_net_predictor (modifications also in D75257998) to run benchmarkNumIterations twice - once with benchmarkNumThreads, and once with only one thread. Also set clearing interval at one second. Verified that execution frames get cleared when we drop down to one thread.

```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Differential Revision: D75257992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154636
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-06-02 23:44:12 +00:00
37eb909c94 Revert "[Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)"
This reverts commit 7b25ff7cf2e6096c103da0068e417216a41be7a9.

Reverted https://github.com/pytorch/pytorch/pull/154091 on behalf of https://github.com/seemethere due to I root caused this PR to some failures, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures with my fix ([comment](https://github.com/pytorch/pytorch/pull/154091#issuecomment-2932848880))
2025-06-02 23:22:43 +00:00
ac65e94f45 Revert "[Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)"
This reverts commit 2dfc0e33273fe50dcbb3d363da02c8cc485b4adc.

Reverted https://github.com/pytorch/pytorch/pull/154110 on behalf of https://github.com/seemethere due to This is part of a stack with failures internally, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures ([comment](https://github.com/pytorch/pytorch/pull/154110#issuecomment-2932845168))
2025-06-02 23:20:11 +00:00
e3af628b0d Revert "Add CPython exception tests (#150789)"
This reverts commit 67fb9b7cc3f7d2ebbb104296f2b11776f4adbb22.

Reverted https://github.com/pytorch/pytorch/pull/150789 on behalf of https://github.com/seemethere due to This is failing upstream in trunk, see 67fb9b7cc3 ([comment](https://github.com/pytorch/pytorch/pull/150789#issuecomment-2932823586))
2025-06-02 23:12:15 +00:00
7dee899130 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #154769
2025-06-02 23:01:58 +00:00
409c396a48 [dynamo] Record the pre-graph bytecode using fast record function event (#154769)
![image](https://github.com/user-attachments/assets/1d06618b-1c14-4ed5-ab7b-dcfecbb4d632)

Adds another event in the profiler traces. This can help us find models where pre-graph bytecode is very expensive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154769
Approved by: https://github.com/zou3519, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-06-02 22:33:27 +00:00
f6b83d4cc6 sort iteration over index vars (#154846)
Fix for https://github.com/pytorch/pytorch/issues/154741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154846
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-06-02 22:06:00 +00:00
d6420d4f85 [CI] Reuse old whl: replace the version (#154773)
Replace the git version, so whl name goes from `torch-something+git<old commit>` to `torch-something+git<new commit>`
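
As a rough illustration of the rename, the sketch below swaps the commit in the `+git<sha>` segment of a wheel file name. The helper name, the example file name, and the assumption that the segment is a plain hex string are all hypothetical; the actual CI script may work differently.

```python
import re


def replace_git_version(wheel_name, new_sha):
    """Swap the commit after '+git' in a wheel file name for new_sha."""
    # Assumes the local version segment looks like '+git<hex sha>'.
    new_name, count = re.subn(
        r"\+git[0-9a-f]+",
        lambda _match: "+git" + new_sha,
        wheel_name,
        count=1,
    )
    if count == 0:
        raise ValueError(f"no '+git<sha>' segment in {wheel_name!r}")
    return new_name


# Hypothetical wheel name, used only to show the substitution.
old = "torch-2.8.0.dev0+gitef210ad-cp312-cp312-linux_x86_64.whl"
new = replace_git_version(old, "d6420d4")
```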

Renamed a bunch of variables to hopefully be more clear

Tested on ef210ad54b
* Removed gating that prevents it from running on PRs (which is going to be merged soon)
* Removed gating that checks for which files can be changed (since this PR has stuff outside of the acceptable list)
* The above two allow the whl to be reused, and I added assert 1 == 2 in common_utils and checked that jobs failed (meaning they were using updated code despite not building)

Checked that the whl in the docker image has the right commit sha, didn't check torch.__version__ though
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154773
Approved by: https://github.com/malfet
2025-06-02 22:02:41 +00:00
e1644e40a7 [ez][TD] Fix TD indexer workflow (#154868)
Update docker image, and fix gpu flag env var

Example failure: https://github.com/pytorch/pytorch/actions/runs/15381170311/job/43272174443

Tested on 9cb28f03e5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154868
Approved by: https://github.com/Skylion007
2025-06-02 21:33:19 +00:00
104c31598f [cutlass backend][ez] Make load config from local more resilient (#154740)
Differential Revision: D75693211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154740
Approved by: https://github.com/ColinPeppler
2025-06-02 21:12:12 +00:00
731e635c95 Add CPython math/cmath tests (#150794)
Tests:
* test_math.py
* test_cmath.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_math" "test_cmath"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150794
Approved by: https://github.com/zou3519
2025-06-02 20:49:44 +00:00
67fb9b7cc3 Add CPython exception tests (#150789)
----

* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
2025-06-02 20:44:41 +00:00
48807d568e [CI][CUDA] Migrate remaining cu118 jobs to cu128 (#154169)
Contributing to the fix of #147383   and #154119

Additional steps required: 3218b1b684/.github/workflows/lint.yml cu118 needs to be updated.
Make install_cuda.sh accept both 12.8 and 12.8.* as CUDA_VERSION argument.
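
The matching rule can be sketched in Python as below. This is an illustration only: `install_cuda.sh` is a shell script, and the helper name here is hypothetical.

```python
import re


def cuda_major_minor(version):
    """Return 'MAJOR.MINOR' for inputs like '12.8' or '12.8.1'."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized CUDA version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


# Both spellings select the same 12.8 install path.
short_form = cuda_major_minor("12.8")
long_form = cuda_major_minor("12.8.1")
```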

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154169
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman, https://github.com/tinglvv
2025-06-02 20:22:14 +00:00
9d3ad82ca7 [dynamo] Remove all skipIfTorchDynamo in test_tensor_creation_ops.py (#154693)
Looks like they are no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154693
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-06-02 20:14:35 +00:00
984b1a80e3 [ez] add docs for *eager_then_compile stances (#154818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154818
Approved by: https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822, #154823, #154805
2025-06-02 19:04:35 +00:00
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
33f2d0ff45 add reference to stances from dynamic shapes doc (#154823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154823
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822
2025-06-02 18:47:19 +00:00
d99e9568ec Add docs for how to mark as unbacked (#154822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154822
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826
2025-06-02 18:30:57 +00:00
1258aac1c2 [dynamo] Upcast torch.Size + tuple to be of size torch.Size (#154830)
Fixes https://github.com/pytorch/pytorch/issues/154432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154830
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007, https://github.com/williamwen42
2025-06-02 17:57:23 +00:00
9fe1b40d17 [ez] add dynamic sources docs (#154826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154826
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802
2025-06-02 17:53:30 +00:00
69e22301da Revert "[inductor] Add kernel_hash_key to ChoiceCaller (#154470)"
This reverts commit 7a79de1c0f31200f95a48a9e69fbd2df2a3c735d.

Reverted https://github.com/pytorch/pytorch/pull/154470 on behalf of https://github.com/seemethere due to Failing internal inductor tests, author is aware and suggested revert. D75767762 ([comment](https://github.com/pytorch/pytorch/pull/154470#issuecomment-2931717432))
2025-06-02 17:43:23 +00:00
113224b530 Enable non blocking remote cache write (#154837)
Test Plan:
Ran
```
buck2 run mode/opt //scripts/oulgen:runner
```
twice
and got

https://fburl.com/scuba/pt2_remote_cache/u7u1uqh1

Differential Revision: D75770423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154837
Approved by: https://github.com/jamesjwu
2025-06-02 17:36:43 +00:00
67067512a1 Revert "[BE] Cleanup old ExecuTorch codegen and runtime code (#154165)"
This reverts commit 515c19a3856e953c0fe23a0ed4fa844f8eea34d8.

Reverted https://github.com/pytorch/pytorch/pull/154165 on behalf of https://github.com/seemethere due to This is failing when attempting to test against executorch main internally, author has acknowledged that this should be reverted ([comment](https://github.com/pytorch/pytorch/pull/154165#issuecomment-2931489616))
2025-06-02 16:28:46 +00:00
981bdb39ca Enable ConvTranspose3D for FP32 and Complex64 (#154696)
Fixes #154615

Enables using ConvTranspose3D, since support appears to exist on both macOS 14 and 15.

For the half dtypes the discrepancy of CPU and GPU implementations is too large to conclude whether there is a bug in the implementation or not without a more rigorous study on what bounds are there to the expected error. So they are left unsupported for now and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.

Tests for ConvTranspose3D were enabled for the supported data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
2025-06-02 16:24:03 +00:00
77d85a4629 Symintify baddbmm (#154656)
Previously we would specialize on the shape in this if-statement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154656
Approved by: https://github.com/pianpwk
2025-06-02 15:23:14 +00:00
e22be781b7 Symintify repeat_interleave (#154660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154660
Approved by: https://github.com/pianpwk
2025-06-02 15:19:39 +00:00
cyy
f6275bf0fe Bump pocketfft submodule to the latest (#154845)
Fixes #154843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154845
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-02 14:54:13 +00:00
dfd6849e77 Update lint_urls.sh (#154838)
Do not match empty urls pieces like "https://"
Add headers for better handling urls like "https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf"
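
The first point can be illustrated with a pattern that requires at least one character after the scheme. The regular expression below is an assumption written for this sketch, not the one used in `lint_urls.sh`.

```python
import re

# Require at least one host character after the scheme,
# so a bare "https://" is skipped.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def find_urls(text):
    """Return every non-empty http(s) URL found in text."""
    return URL_RE.findall(text)


sample = 'see "https://" and https://pytorch.org/docs for details'
urls = find_urls(sample)
```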

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154838
Approved by: https://github.com/Skylion007
2025-06-02 14:50:34 +00:00
c65e9ad77a Update slow tests (#154347)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154347
Approved by: https://github.com/pytorchbot
2025-06-02 11:30:56 +00:00
ff4515fde5 Add optional check_pinning argument to _validate_sparse_compressed_tensor/coo_args (#154759)
As in the title.

A prerequisite to https://github.com/pytorch/pytorch/pull/154638 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154759
Approved by: https://github.com/amjames, https://github.com/ngimel
ghstack dependencies: #154610
2025-06-02 10:17:07 +00:00
3f3c1f419f User-controlled sparse tensor validation when loading data from external storage (#154610)
This PR lets users control sparse tensor invariant validation (which can be expensive, especially for sparse tensors with many indices) when loading data from external sources.

By default, the validation of sparse tensor invariants is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154610
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-02 10:17:07 +00:00
9258cfc227 [audio hash update] update the pinned audio hash (#154776)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154776
Approved by: https://github.com/pytorchbot
2025-06-02 05:36:13 +00:00
16d05e130c [CI][CUDA][UCC] Update test_c10d_ucc.py - remove xfailIfLinux because it now succeeds (#150979)
pytest -v test/distributed/test_c10d_ucc.py  -k test_save_load
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/opt/pytorch/pytorch/.hypothesis/examples'))
rootdir: /opt/pytorch/pytorch
configfile: pytest.ini
plugins: anyio-4.9.0, hypothesis-6.130.13, flakefinder-1.1.0, rerunfailures-15.0, xdist-3.6.1, xdoctest-1.0.2, typeguard-4.3.0
collected 63 items / 62 deselected / 1 selected
Running 1 items in this shard

test/distributed/test_c10d_ucc.py::DistributedDataParallelTest::test_save_load_checkpoint PASSED [65.2581s]                                                                                               [100%]

================================================================================== 1 passed, 62 deselected in 68.78s (0:01:08)

@ptrblck @eqy @tinglvv @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150979
Approved by: https://github.com/eqy
2025-06-02 03:24:35 +00:00
cd3d2b75b3 Update README.md - James has the wrong github link. (#151473)
Unless I'm wrong, the James on the pytorch paper is not the account linked to in the README.md.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151473
Approved by: https://github.com/albanD
2025-06-02 01:53:44 +00:00
515c19a385 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-02 01:47:02 +00:00
0d0058d90d Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-06-02 01:45:28 +00:00
80af98c6c3 [BE]: Update nlohmann submodule to 3.12.0 (#154817)
This is mostly compiler fixes, C++20 fixes, and clang-tidy fixes. Should be entirely backwards compatible with our current version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154817
Approved by: https://github.com/jansel, https://github.com/malfet
2025-06-02 01:29:58 +00:00
2b2245d5db [BE]: Replace printf with fmtlib call (#154814)
Safer, faster, more concise, and better type checking. Also add a few misc changes in the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154814
Approved by: https://github.com/jansel
2025-06-01 22:27:08 +00:00
206e9d5160 [BE]: Update cpp-httplib submodule to 0.20.1 (#154825)
Updates cpp-httplib to 0.20.1. This mostly brings OSS a bunch of CMake fixes, CXX compiler error fixes, and bugfixes from upstream. It's a header-only library, so it should be pretty straightforward to upgrade
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154825
Approved by: https://github.com/malfet
2025-06-01 21:44:23 +00:00
064bb3cebc [BE]: Replace a couple of call sites with fmtlib printf (#154533)
This is a faster, memory-safe implementation of the printf functions, coming from fmtlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154533
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 21:16:34 +00:00
0350c7e72c [BE] Introduce torch.AcceleratorError (#152023)
Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError`

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's  [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
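
In pure Python the same construction pattern looks roughly like the sketch below. `AcceleratorErrorSketch` and `new_accelerator_error` are hypothetical stand-ins for illustration, not `torch.AcceleratorError` or the C++ helper; reference counting has no Python-level equivalent, so the decref step is omitted.

```python
class AcceleratorErrorSketch(RuntimeError):
    """Stand-in for the described exception type; not torch.AcceleratorError."""


def new_accelerator_error(message, error_code):
    # Build the exception object from a single message argument...
    exc = AcceleratorErrorSketch(message)
    # ...then attach the numeric code as an attribute.
    exc.error_code = error_code
    return exc


try:
    raise new_accelerator_error("CUDA error: device-side assert triggered", 710)
except RuntimeError as e:  # existing RuntimeError handlers still catch it
    caught = (type(e).__name__, e.args[0], e.error_code)
```

Because the type subclasses `RuntimeError`, code that already catches `RuntimeError` keeps working, and newer code can read `error_code`.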

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is  710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
2025-06-01 21:02:43 +00:00
f7c09f864a [Docs] Reformat sparse example (#154785)
Not sure why, but rst fails to colorize multiline inputs, though it works fine for single-line commands
Test plan:
| [Before](https://docs.pytorch.org/docs/main/sparse.html#construction)  | [After](https://docs-preview.pytorch.org/pytorch/pytorch/154785/sparse.html#construction) |
| ------------- | ------------- |
| <img width="466" alt="image" src="https://github.com/user-attachments/assets/96a5c52a-1804-4d05-a5cf-c10221aaddf6" />  | <img width="477" alt="image" src="https://github.com/user-attachments/assets/99565288-5c0b-4e8e-bd60-f016ebc207b5" />  |

Fixes https://github.com/pytorch/pytorch/issues/154779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154785
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-06-01 20:56:14 +00:00
c2e9115757 Fix typo in dcp module (#154815)
Fixed the  docstring in `validate_checkpoint_id`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154815
Approved by: https://github.com/Skylion007
2025-06-01 18:18:45 +00:00
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
0cd18ba1ca [BE][Ez] Update deprecated pybind11 functions (#154798)
* getType() is deprecated, replace it with new/proper static method. These are backwards compatible with old pybind11 versions we support. So break this off before we upgrade to pybind11 3.0 where these methods are dropped in #154115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154798
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-06-01 06:17:50 +00:00
bfae151269 [BE][Ez]: Remove unneeded mypy suppressions (#154800)
Improvements in typing have made this suppression unnecessary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154800
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 06:10:41 +00:00
9cbbc2593b test for 146431 (#154786)
Adds test for #146431 that was fixed by #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154786
Approved by: https://github.com/Skylion007, https://github.com/galv

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-06-01 04:17:54 +00:00
cyy
5616fa4a68 [Submodule] Bump flatbuffers to v24.12.23 (#143964)
This sub-module has not been updated for a long time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143964
Approved by: https://github.com/Skylion007
2025-06-01 02:25:57 +00:00
c33fc9dae3 [BE][Ez]: Update VulkanMemoryAllocator to 3.3.0 (#154796)
Last update to this submodule was 3 years ago, and the API is pretty stable and this is a minor version release update. Part of a bunch of PRs to eradicate low CMake required versions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154796
Approved by: https://github.com/jansel
2025-06-01 00:30:56 +00:00
9ce2732b68 [BE][Ez]: Fully type nn.utils.clip_grad (#154801)
Fully types clip_grad and exposes typing annotations that were hidden by a bad decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154801
Approved by: https://github.com/jansel
2025-05-31 23:06:45 +00:00
dbad6d71c7 [BE][Ez]: Unskip conv1d MPS test (#154795)
Fixes issue I noticed where conv1d test is skipped for complex types unconditionally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154795
Approved by: https://github.com/jansel
2025-05-31 23:01:19 +00:00
b85c460749 [BE][Ez]: Update NVTX submodule to 3.2.1 (#154797)
Update NVTX3 submodule to 3.2.1.
* Mostly improved compiler support, Python support, and better CMake and C++ support.
* Also has a few new APIs to support fancy new features.
* This is a header-only library, so it should be an easy, non-invasive change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154797
Approved by: https://github.com/jansel
2025-05-31 23:01:13 +00:00
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to work out a solution that will allow (1) disabling the pinned memory check to fix the original issue and (2) switching off the sparse tensor validation for maximal performance in loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
c99e91b1d7 [BE]Enhance _get_clean_triton.py to auto-generate launch_params if missing (#154666)
Previously, @Chillee wrote a script https://github.com/pytorch/pytorch/pull/125811 to remove inductor dependency for inductor compiled triton kernels. We'd like to automate the process of obtaining the launch parameters.

Added functionality to the torch/utils/_get_clean_triton.py to automatically generate the launch_params file if it does not exist and the auto_generate_params flag is set to True. This includes running the input file in a subprocess with the appropriate environment variable. Updated the get_clean_triton function and the main script to support this new feature, allowing users to disable auto-generation via a command-line argument.

# Test Plan
test embedding op in TritonBench
```
# generate inductor compiled triton kernels
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python run.py --op embedding  --mode fwd  --precision fp32 --metrics nsys_rep --only inductor_embedding  --num-inputs 1 --input-id 11
# run the script to get rid of inductor dependency. By default, triton_only_repro.py is the output file name.
python ~/pytorch/torch/utils/_get_clean_triton.py ~/tritonbench/torch_compile_debug/run_2025_05_29_14_47_50_497790-pid_849274/torchinductor/model__0_forward_1.0/output_code.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154666
Approved by: https://github.com/davidberard98
2025-05-31 19:27:56 +00:00
c014e4bcaa Fix typo in vec256 interleave2 (#154784)
Fix a typo where the elements in a vector are mislabeled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154784
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-05-31 14:17:10 +00:00
daff263062 [Functorch] Support Functorch for PrivateUse1 backend (#154700)
This PR enables functorch to be used in third-party backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154700
Approved by: https://github.com/zou3519
2025-05-31 07:28:45 +00:00
15e9119a69 [BE] install_triton_wheel.sh update for internal dev (#154637)
internal devgpu gets mad at `pip install ...` but `python3 -m pip install ...` is fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154637
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-31 06:57:56 +00:00
7368eeba5e [dynamo][guards] Prevent LENGTH guard on nn modules (#154763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154763
Approved by: https://github.com/williamwen42
2025-05-31 05:32:31 +00:00
7a79de1c0f [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-05-31 03:09:37 +00:00
bd10ea4e6c Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit ad26ec6abe51d528124bc5fbbacaa87aef077ab8.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923997777))
2025-05-31 02:14:24 +00:00
43390d8b13 ROCm Sparsity through HipSparseLT (#150578)
TLDR:

- This pull request introduces support for hipSPARSELt in ROCm; the current use case is semi-structured sparsity.
- Requires **ROCm 6.4** and **gfx942/gfx950**.
- The average performance uplift (compared to the dense operation) is ~20% in ROCm 6.4, with further performance gains expected over time.

### Dense vs. Sparse Performance Comparison

#### **NT (Row-major)**
**Average Uplift**: `1.20`

| M     | N      | K      | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8      | 4096   | 20.05                   | 25.3                          | 1.26   |
| 4096  | 8      | 14336  | 21.07                   | 25.28                         | 1.20   |
| 3072  | 3072   | 10240  | 299.05                  | 351.82                        | 1.18   |
| 3072  | 1536   | 768    | 18.56                   | 20.05                         | 1.08   |
| 3072  | 17664  | 768    | 163.13                  | 173.91                        | 1.07   |
| 3072  | 196608 | 768    | 1717.30                 | 1949.63                       | 1.14   |
| 3072  | 24576  | 768    | 206.84                  | 242.98                        | 1.17   |
| 3072  | 6144   | 768    | 53.90                   | 56.88                         | 1.06   |
| 3072  | 98304  | 768    | 833.77                  | 962.28                        | 1.15   |
| 768   | 1536   | 768    | 8.53                    | 19.65                         | 2.30   |
| 768   | 17664  | 768    | 46.02                   | 46.84                         | 1.02   |
| 768   | 196608 | 768    | 463.15                  | 540.46                        | 1.17   |
| 768   | 24576  | 768    | 54.32                   | 59.55                         | 1.10   |
| 768   | 6144   | 768    | 19.47                   | 20.15                         | 1.03   |
| 768   | 98304  | 768    | 231.88                  | 258.73                        | 1.12   |

---

#### **NN (Row-major)**
**Average Uplift**: `1.13`

| M   | N      | K     | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536   | 3072  | 27.50                   | 28.78                         | 1.05   |
| 768 | 17664  | 3072  | 125.06                  | 158.94                        | 1.27   |
| 768 | 196608 | 3072  | 1568.38                 | 1767.12                       | 1.13   |
| 768 | 24576  | 3072  | 171.05                  | 203.49                        | 1.19   |
| 768 | 6144   | 3072  | 58.72                   | 60.39                         | 1.03   |
| 768 | 98304  | 3072  | 787.15                  | 887.60                        | 1.13   |
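
The Uplift column appears to be the dense time divided by the sparse time for each row. A minimal sketch of that arithmetic for the NN table, assuming this reading of the column (the variable names are illustrative only):

```python
# (hipsparselt-bench us, hipblaslt-bench us) pairs copied from the NN table above.
nn_rows = [
    (27.50, 28.78),
    (125.06, 158.94),
    (1568.38, 1767.12),
    (171.05, 203.49),
    (58.72, 60.39),
    (787.15, 887.60),
]

# Assumption: uplift = dense time / sparse time, so values above 1 favor sparse.
uplifts = [dense_us / sparse_us for sparse_us, dense_us in nn_rows]
average_uplift = sum(uplifts) / len(uplifts)

print([round(u, 2) for u in uplifts])
print(round(average_uplift, 2))
```

Under that assumption, the per-row values round to the figures in the table and the mean rounds to the reported `1.13`.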

-------------------------

This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.

### ROCm and hipSPARSELt Support:

* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)

### Codebase Updates:

* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)

### Test Suite Updates:

* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
cyy
ad26ec6abe Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
3e71016459 Revert "Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)"
This reverts commit 489afa829a248ca64c4b2dffe2e6d601b8816cf9.

Reverted https://github.com/pytorch/pytorch/pull/154298 on behalf of https://github.com/izaitsevfb due to breaks linux-jammy-aarch64-py3.10 / build ([comment](https://github.com/pytorch/pytorch/pull/154298#issuecomment-2923966688))
2025-05-31 01:51:59 +00:00
489afa829a Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)
Test Plan: The only functional change is zero-initialization instead of undefined-initialization. If tests pass, I think it should be fine.

Differential Revision: D75345074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154298
Approved by: https://github.com/swolchok
2025-05-31 01:32:45 +00:00
472773c7f9 [nativert] move OpKernelKind enum to torch (#154756)
Summary: att

Test Plan: ci

Differential Revision: D75703996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154756
Approved by: https://github.com/zhxchen17, https://github.com/cyyever
2025-05-31 01:31:29 +00:00
f01e628e3b Resubmit Remove MemPoolContext (#154042) (#154746)
Summary: Per title

Test Plan: Added tests + existing tests

Differential Revision: D75695030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154746
Approved by: https://github.com/malfet
2025-05-31 01:21:54 +00:00
932733e0e6 Fix memory leaks in mps_linear_nograph (#154765)
Fixes some memory leaks which were identified as part of the investigation of https://github.com/pytorch/pytorch/issues/154329. This doesn't appear to be the whole solution but wanted to merge this anyway since it's a quick fix

In my tests I see roughly 3MB of unexpected memory growth before this change, and after this change I see 2.2MB of memory growth
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154765
Approved by: https://github.com/malfet
2025-05-31 00:46:12 +00:00
108422ac26 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 78624679a876a21acb14bf075ba6beccff21b9a0.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799))
2025-05-31 00:28:03 +00:00
da4aacabac Add h100_distributed label (#154562)
Add h100_distributed label, testing distributed 3D composability tests on 8*H100 GPU node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154562
Approved by: https://github.com/seemethere
2025-05-31 00:17:43 +00:00
9b5308cd58 [upstream triton] support build with setup.py in ./python/ or in ./ (#154635)
Upstream triton has moved setup.py from python/ to ./.  This PR allows versions to be buildable by checking the location of setup.py and choosing the cwd of the build commands based on the location.
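
A minimal sketch of the location check described above. The helper name `triton_build_cwd` is hypothetical, and preferring the root when both files exist is an assumption; the actual build script may be structured differently.

```python
from pathlib import Path


def triton_build_cwd(triton_root: Path) -> Path:
    """Pick the directory from which to run the build commands.

    Newer checkouts keep setup.py at the repository root; older ones keep it
    under python/. Raise if neither layout is found.
    """
    if (triton_root / "setup.py").is_file():
        return triton_root  # new layout (assumed to win if both exist)
    if (triton_root / "python" / "setup.py").is_file():
        return triton_root / "python"  # old layout
    raise FileNotFoundError(f"no setup.py found under {triton_root}")
```

The returned path would then be passed as the working directory of whatever build command is run.
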
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154635
Approved by: https://github.com/atalman
2025-05-31 00:15:43 +00:00
b019a33f8f [ez][CI] Reuse old whl: remove old zip/whl (#154770)
Forgot that unzip doesn't get rid of the zip so the old one is still there

Unrelated: figure out how to update the git version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154770
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-05-31 00:13:24 +00:00
0fab32290a Revert "[draft export] avoid storing intermediate real tensors in proxies (#154630)"
This reverts commit 5acb8d50801e6d110790993464611314dd1bd54b.

Reverted https://github.com/pytorch/pytorch/pull/154630 on behalf of https://github.com/malfet due to This still ooms, at least occasionally see 78624679a8/1 ([comment](https://github.com/pytorch/pytorch/pull/154630#issuecomment-2923759745))
2025-05-31 00:07:56 +00:00
faf973da5e [refactor] move materialize_as_graph to _higher_order_ops/utils.py (#154070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154070
Approved by: https://github.com/zou3519
2025-05-31 00:06:44 +00:00
cyy
78624679a8 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
5f1c3c67b2 [pgo] log dynamic whitelist in PT2 Compile Events (#154747)
Summary: logs the whitelist to PT2 Compile Events

Test Plan: loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Reviewed By: bobrenjc93

Differential Revision: D75617963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154747
Approved by: https://github.com/angelayi
2025-05-30 23:54:24 +00:00
bbda22e648 [BE][Ez]: Optimize unnecessary lambda with operator (#154722)
Automated edits performed by FURB118. Operator is implemented in C and way faster when passed to another C method like sorted, max etc as a `key=`
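
For illustration, the kind of rewrite FURB118 performs looks roughly like this; the data and names are made up for the example and are not taken from the PR.

```python
import operator

pairs = [("conv", 3), ("relu", 1), ("linear", 2)]

# Before: a lambda that only indexes into its argument.
by_count_lambda = sorted(pairs, key=lambda pair: pair[1])

# After: the equivalent callable from the operator module.
by_count_operator = sorted(pairs, key=operator.itemgetter(1))

# The same callable works as a key for max, min, and similar functions.
largest = max(pairs, key=operator.itemgetter(1))
```

Both sorts produce the same ordering; only the key callable changes.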

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154722
Approved by: https://github.com/jansel
2025-05-30 23:47:10 +00:00
0f3db20132 [ez][CI] Do not reuse old whl if deleting files (#154731)
Thankfully very few commits actually delete files, so I don't think this has affected anything
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154731
Approved by: https://github.com/Skylion007
2025-05-30 22:35:13 +00:00
eb93c0adb1 [inductor][AMD] support special kwargs in AMD triton configs (#154605)
**Context**:

AMD triton kernels can be launched with special kwargs, like `waves_per_eu`. Triton configs with these kwargs look like this:

```
triton.Config({
    "BLOCK_SIZE": 64,
    "waves_per_eu": 2,
})
```

In comparison, NVIDIA's special kwargs are explicit parameters on the config, e.g. num_warps:

```
triton.Config(
    {"BLOCK_SIZE": 64},
    num_warps=4,
)
```

**Problem**: this causes custom triton kernels w/ PT2 to error out, because there's a kwarg in the triton.Config that doesn't appear in the kernel signature.

**Solution**: When splicing in the constexpr values into the arg list, ignore any values in the config kwargs list if they don't appear in the function signature.
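
The filtering idea can be sketched in plain Python as follows. This is not the Inductor code: `kwargs_in_signature` and `example_kernel` are hypothetical names, and an ordinary function stands in for a Triton kernel.

```python
import inspect
from typing import Any, Callable


def kwargs_in_signature(
    fn: Callable[..., Any], config_kwargs: dict[str, Any]
) -> dict[str, Any]:
    """Keep only the config entries that name a parameter of fn."""
    params = inspect.signature(fn).parameters
    return {name: value for name, value in config_kwargs.items() if name in params}


def example_kernel(x_ptr, n_elements, BLOCK_SIZE):
    """Stand-in for a Triton kernel; only its signature matters here."""


config_kwargs = {"BLOCK_SIZE": 64, "waves_per_eu": 2}

# waves_per_eu is not a kernel parameter, so it is dropped before splicing.
constexpr_args = kwargs_in_signature(example_kernel, config_kwargs)
```

Here `constexpr_args` keeps `BLOCK_SIZE` and drops `waves_per_eu`, which remains a launch-level option rather than a kernel argument.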

Differential Revision: [D75599629](https://our.internmc.facebook.com/intern/diff/D75599629/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D75599629/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154605
Approved by: https://github.com/njriasan
2025-05-30 22:24:32 +00:00
1193bf0855 Revert "convert inductor codecache to use getArtifactLogger (#153766)"
This reverts commit 5b6fd277f954b789649501e21e9689a42d565e13.

Reverted https://github.com/pytorch/pytorch/pull/153766 on behalf of https://github.com/malfet due to I want to revert this change as I'm 90+% certain it somehow broke testing ([comment](https://github.com/pytorch/pytorch/pull/153766#issuecomment-2923620806))
2025-05-30 22:20:07 +00:00
26aa8dcf27 [ONNX] Simplify onnx test dependencies (#154732)
Simplify onnx test dependencies and bump onnxscript to 0.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154732
Approved by: https://github.com/Skylion007
2025-05-30 21:58:04 +00:00
5acb8d5080 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict mode there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?
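
The reference-holding argument can be illustrated with plain Python objects. This is only a sketch: `RealTensor`, `FakeTensor`, and `run_forward` are stand-ins invented for the example, and it assumes the stored clone shares the same real tensor as the original.

```python
import gc
import weakref


class RealTensor:
    """Stand-in for a large real tensor."""


class FakeTensor:
    """Stand-in for a fake tensor carrying a .real_tensor attribute."""

    def __init__(self) -> None:
        self.real_tensor = RealTensor()


def run_forward(meta: dict, store_intermediate: bool) -> weakref.ref:
    intermediate = FakeTensor()  # place 1: a local variable in forward
    if store_intermediate:
        # place 2: a clone kept in node metadata (assumed to share the real tensor)
        clone = FakeTensor()
        clone.real_tensor = intermediate.real_tensor
        meta["val"] = clone
    # The weak reference lets us observe collection without keeping the object alive.
    return weakref.ref(intermediate.real_tensor)


meta_kept, meta_dropped = {}, {}
kept = run_forward(meta_kept, store_intermediate=True)
dropped = run_forward(meta_dropped, store_intermediate=False)
gc.collect()
```

After the calls, `kept()` still returns the object because the metadata holds a reference, while `dropped()` returns `None` once the local variable has gone out of scope.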

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-30 21:06:55 +00:00
abc2264e8f remove another instance of mtia_workloadd from pytorch (#154739)
Summary: ^

Test Plan: CIs

Differential Revision: D75692171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154739
Approved by: https://github.com/sraikund16
2025-05-30 20:50:46 +00:00
22a4cabd19 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 20:32:56 +00:00
ed1ff7d0fb [BE][Ez]: Update mimalloc submodule to 2.2.3 (#154720)
Updating minor version of mimalloc. The old version is more than 2 years old, and the newer release has performance fixes and compiler fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154720
Approved by: https://github.com/jansel
2025-05-30 20:17:13 +00:00
2f03673ebf [BE][Ez]: Enable ClangFormat aten/src/core/Formatting.cpp (#154719)
Follow up to #152830. Noticed the file was excluded from formatting; opt in to clang-format since it's really close anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154719
Approved by: https://github.com/jansel
2025-05-30 19:52:43 +00:00
f57754e815 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#154618)
This is a follow-up PR to the reverted https://github.com/pytorch/pytorch/pull/148981, re-opened for visibility:

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618
Approved by: https://github.com/jansel
2025-05-30 19:30:25 +00:00
d6edefefbf [CUDA] Fixes for backwards in memefficient attn for large tensors (#154663)
followup to #154029.

@ngimel Backwards had the same problem as well so this PR fixes it and adds support for logsumexp computation in the forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154663
Approved by: https://github.com/ngimel
2025-05-30 19:30:07 +00:00
d89d213118 Fix test_tensorboard when started w/o tensorboard package (#154709)
If `TEST_TENSORBOARD == False` then `DataType` is not defined or imported. However, it is used unconditionally when defining the test with `parametrize`, which leads to a NameError that crashes the test execution on start.

Provide a dummy to make it syntactically correct. Tests will be skipped on start.

```
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 885, in <module>
    class TestTensorProtoSummary(BaseTestCase):
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 889, in TestTensorProtoSummary
    (torch.float16, DataType.DT_HALF),
                    ^^^^^^^^
NameError: name 'DataType' is not defined
Got exit code 1, retrying...
test_tensorboard 1/1 failed! [Errno 2] No such file or directory: '/dev/shm/build/pytorch-v2.2.1/.pytest_cache/v/cache/stepcurrent/test_tensorboard_0_0dba8bc00bbe233f'
```
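
A minimal sketch of the placeholder pattern, not the exact code from the PR. The import path is an assumption about where `DataType` comes from, and `CASES` is a stand-in for the arguments passed to `parametrize`.

```python
try:
    # Assumed import path for the real enum when tensorboard is installed.
    from tensorboard.compat.proto.types_pb2 import DataType

    TEST_TENSORBOARD = True
except ImportError:
    TEST_TENSORBOARD = False

    class DataType:  # placeholder so module-level references still resolve
        DT_HALF = None
        DT_FLOAT = None


# Evaluated at import time, like the arguments to a parametrize decorator.
CASES = [("float16", DataType.DT_HALF), ("float32", DataType.DT_FLOAT)]
```

With the placeholder in place the module imports cleanly, and the tests themselves can still be skipped based on `TEST_TENSORBOARD`.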

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154709
Approved by: https://github.com/Skylion007
2025-05-30 19:18:43 +00:00
22641f42b6 [Binary-builds]Use System NCCL by default in CI/CD. (#152835)
Use system NCCL by default. The correct NCCL version is already built into the Manylinux docker image.

Will follow up with a PR that detects whether the user has NCCL installed and enables USE_SYSTEM_NCCL by default in that case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
2025-05-30 18:51:48 +00:00
967937872f [dynamo] Remove dead code path for torch.Tensor.view(*shape) (#154646)
This was introduced in early days of Dynamo, and looks like it's been
fixed since -- the regression test `test_transpose_for_scores` passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154646
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #154645
2025-05-30 18:50:58 +00:00
f9dc20c7a3 [dynamo] Fix syntax error in aot graph from kwarg-less torch.Tensor.[random_|uniform_] calls (#154645)
As title, fixes #151432, see more context in the issue discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154645
Approved by: https://github.com/zou3519
2025-05-30 18:50:58 +00:00
fb67fa9968 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit aec3ef100844631cb7c4ce2725157984eb9cebfe.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/malfet due to Looks like it broke inductor/test_compile_subprocess.py::CpuTests::test_AllenaiLongformerBase, see 35fc5c49b4/1(default%2C%20&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2923154249))
2025-05-30 18:45:01 +00:00
35fc5c49b4 Revert "[internal] Expose additional metadata to compilation callbacks (#153596)"
This reverts commit f889dea97dad3cc506d43e379a469334417040c8.

Reverted https://github.com/pytorch/pytorch/pull/153596 on behalf of https://github.com/izaitsevfb due to introduces bunch of callback-related failures on rocm ([comment](https://github.com/pytorch/pytorch/pull/153596#issuecomment-2923139061))
2025-05-30 18:39:27 +00:00
b6b9311f4f [BE][Ez]: Fix typo in dynamo utils #154639 (#154748)
Fixes a typo in #154639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154748
Approved by: https://github.com/ngimel
2025-05-30 18:39:01 +00:00
bbdf469f0e Add CPython dict tests (#150791)
Tests:
* test_dict.py
* test_ordered_dict.py
* test_userdict.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_dict" "test_ordered_dict" "test_userdict"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150791
Approved by: https://github.com/zou3519
2025-05-30 18:17:09 +00:00
2120eeb8de [BE][Ez]: Improve dynamo utils typing with TypeIs and TypeGuard (#154639)
Adds some additional TypeIs and TypeGuard to some _dynamo utils for additional type narrowing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154639
Approved by: https://github.com/jansel
2025-05-30 18:09:50 +00:00
1b569e5490 Fix load_state_dict description (#154599)
Fixes #141364

Fix missing description in `assign` param

## Test Result

### Before
![image](https://github.com/user-attachments/assets/5928c691-4e31-463b-aa0a-86eb8bb452e5)

### After
![image](https://github.com/user-attachments/assets/036631a2-0f20-4a71-95c3-2c0fd732293e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154599
Approved by: https://github.com/colesbury, https://github.com/mikaylagawarecki
2025-05-30 18:08:59 +00:00
30ac7f4d4e [EZ/Memory Snapshot] Remove Handle even if compile_context not set (#154664)
Summary: When setting the memory snapshot callback we register and unregister callbacks for performance reasons. For ease of use, it makes sense to just remove all callbacks regardless of which flags are enabled. The enable stays behind a feature flag, this just changes the disable to ignore the flag itself.

Test Plan: Ran without any flags and saw all callbacks removed.

Differential Revision: D75636035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154664
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2025-05-30 18:08:37 +00:00
65d8dba735 [nativert] move layout planner settings to torch (#154668)
Summary: att

Test Plan: ci

Differential Revision: D75633031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154668
Approved by: https://github.com/zhxchen17
2025-05-30 17:33:27 +00:00
3bdceab124 [dynamo] fix: added star operator for graph_break_hints (#154713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154713
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-05-30 17:31:03 +00:00
802ffd06c8 [Export] Add math module for deserialization (#154643)
Summary: As title

Test Plan: ci

Differential Revision: D75580646

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154643
Approved by: https://github.com/yushangdi
2025-05-30 17:29:25 +00:00
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes this year and the tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
3027051590 [export] avoid float/bool specialization for scalar tensor construction (#154661)
Fixes #153411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154661
Approved by: https://github.com/angelayi
2025-05-30 17:18:21 +00:00
e7bf72c908 [multigraph] fix composabilty with aotautograd cache (#153526)
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
2025-05-30 16:56:17 +00:00
7183f52675 [dynamo] Support namedtuple subclass (#153982)
Fixes #133762. This involves
1. support tuple subclass constructed inside compile region.
2. handle the "fake" global scope associated with NamedTuple-generated
   `__new__`.
3. handle `namedtuple._tuplegetter` more faithfully.
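
For context, the plain-Python constructs involved look like this. The sketch runs without `torch.compile` and only shows the objects Dynamo has to model; `Base` and `Point` are example names, and the comments describe CPython behavior.

```python
from collections import namedtuple

Base = namedtuple("Base", ["x", "y"])


class Point(Base):
    """Tuple subclass that adds a method on top of the generated fields."""

    def norm1(self):
        return abs(self.x) + abs(self.y)


p = Point(3, -4)

# Field access goes through a descriptor object stored on the generated class.
getter_type_name = type(Base.__dict__["x"]).__name__

# The generated __new__ is built with eval() in its own small namespace,
# so its __globals__ is not the defining module's globals.
new_globals_name = Base.__new__.__globals__.get("__name__")
```

Inspecting `getter_type_name` and `new_globals_name` in an interpreter shows the field descriptor type and the synthetic scope that items 2 and 3 refer to.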

Differential Revision: [D75488091](https://our.internmc.facebook.com/intern/diff/D75488091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153982
Approved by: https://github.com/jansel
ghstack dependencies: #154176
2025-05-30 16:14:37 +00:00
8002d22ce3 [dynamo] Trace into descriptor with __set__ (#154176)
As title, this patch basically implements
https://github.com/python/cpython/blob/3.11/Objects/object.c#L1371-L1452,
and makes the `__get__` handling more robust.
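
A small example of the Python behavior being traced: a descriptor that defines `__set__` is a data descriptor, so assignment on an instance is routed through it instead of writing to the instance dictionary. The classes below are illustrative only.

```python
class Positive:
    """Data descriptor: defines both __get__ and __set__."""

    def __set_name__(self, owner, name):
        self.storage = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class itself
        return getattr(obj, self.storage)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("value must be positive")
        setattr(obj, self.storage, value)


class Layer:
    width = Positive()

    def __init__(self, width):
        self.width = width  # routed through Positive.__set__


layer = Layer(8)
layer.width = 16
```

After these assignments the value lives under `_width` in the instance dictionary, and `width` itself never appears there.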

I ran into this while fixing #133762.

Differential Revision: [D75488090](https://our.internmc.facebook.com/intern/diff/D75488090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154176
Approved by: https://github.com/jansel
2025-05-30 16:14:37 +00:00
31f95b5d2e Revert "inductor codecache: include private inductor configs in cache key (#153672)"
This reverts commit 2c1cb38d9516e10474b4f12a2e839046648a71a8.

Reverted https://github.com/pytorch/pytorch/pull/153672 on behalf of https://github.com/malfet due to Looks like it regressed pr_time_benchmarks, see ba3f91af97/1 ([comment](https://github.com/pytorch/pytorch/pull/153672#issuecomment-2922759739))
2025-05-30 15:54:14 +00:00
4b1f047a33 Add CPython list/tuple tests (#150790)
Tests:
* test_list.py
* test_tuple.py
* test_userlist.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_raise" "test_list" "test_tuple" "test_userlist"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150790
Approved by: https://github.com/williamwen42
2025-05-30 15:53:38 +00:00
ba3f91af97 Type hints for distributions/utils (#154712)
Fixes #144196
Part of #144219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154712
Approved by: https://github.com/Skylion007
2025-05-30 15:50:31 +00:00
0f81c7a28d [CI] Pin the torchao version used when testing torchbench (#154723)
Summary: To fix a recent CI breakage. As a follow-up, the torchao pin in .github/ci_commit_pins/torchao.txt is 6 months old. We should bump it once we verify this fix works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154723
Approved by: https://github.com/eellison
2025-05-30 15:04:26 +00:00
7e8532077f Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 1ece53b157db4425ad12cae31fb570c591dc19e7.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830))
2025-05-30 13:16:33 +00:00
cyy
1ece53b157 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
9d6f0d5991 avoid sym_max on nested int in is_contiguous. (#154633)
Calling is_contiguous fails because sym_max is not supported for nested ints. This PR addresses that in a way consistent with
make_contiguous_strides_for.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154633
Approved by: https://github.com/bobrenjc93
2025-05-30 09:59:33 +00:00
3c05167489 [Intel GPU] fix matmul accuracy when offset > 0 (#154495)
This PR makes matmul tensors contiguous if they are not 64-byte aligned. oneDNN requires a minimum alignment of 64 bytes: https://uxlfoundation.github.io/oneDNN/dev_guide_c_and_cpp_apis.html#intel-r-processor-graphics-and-xe-architecture-graphics
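
As a rough sketch of the condition involved (not the actual XPU implementation), the check amounts to asking whether the first element of a view starts on a 64-byte boundary. The helper name `is_aligned` and its parameters are hypothetical.

```python
ALIGNMENT = 64  # bytes, the oneDNN requirement cited above


def is_aligned(base_ptr: int, storage_offset: int, itemsize: int) -> bool:
    """Return True if the view's first element starts on a 64-byte boundary.

    base_ptr is the address of the underlying storage, storage_offset is
    counted in elements, and itemsize is the element size in bytes.
    """
    return (base_ptr + storage_offset * itemsize) % ALIGNMENT == 0
```

Under this assumption, a float32 view that starts one element into an aligned buffer is misaligned and would need a contiguous copy, while one that starts 16 elements in is still aligned.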

Fixes https://github.com/intel/torch-xpu-ops/issues/1656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154495
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang
2025-05-30 09:53:51 +00:00
aec3ef1008 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 08:53:24 +00:00
dc82e911e7 remove allow-untyped-defs from torch/utils/data/datapipes/iter/filelister.py (#154624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154624
Approved by: https://github.com/Skylion007
2025-05-30 08:38:05 +00:00
639f459cb6 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit c3de2c7c6bc865b9fabd2db8f2af6383936aa653.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to see if it helps fix the broken trunk below. If it does not help, I will reland the PR ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2921562089))
2025-05-30 08:11:22 +00:00
f889dea97d [internal] Expose additional metadata to compilation callbacks (#153596)
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward.

Here, I remove the inductor event, and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports, this will let internal UIs track each graph's contribution to the timeout.

```python
class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```

Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
2025-05-30 08:07:04 +00:00
208965a9d6 Fix unbacked symint error (#154672)
## Summary

@laithsakka and I spoke offline about this one. TL;DR: we wanted this
![image](https://github.com/user-attachments/assets/2e537612-3261-4fbe-a6b9-f8ff92ba3c37)

to also be true for Inductor. To that end we added two new APIs to size-vars, `guard_or_false` and `guard_or_true`,
with the following semantics:

guard_or_false, guard_or_true:

These APIs may add guards, but will never fail with data-dependent errors. They try to evaluate the expression, possibly adding guards; if that fails due to data dependency, they return False or True respectively instead of hard failing.

When to use this?

Performance optimizations that warrant a recompilation.

Take the general path and add a runtime check.
```
# Consider this branching.
if x == 0:
    return 1
else:
    return 10
# To make it data-dependent friendly, it can be written as follows:
if guard_or_false(x == 0):
    return 1
else:
    torch.check(x != 0)  # runtime check
    return 10
```
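
To make the fallback semantics concrete, here is a toy model in plain Python. It is not the size-vars implementation: the real APIs take symbolic expressions and may add guards, whereas this sketch takes a zero-argument callable. Every name in it (`DataDependentError`, `toy_guard_or_false`, `toy_guard_or_true`, `unknown`) is hypothetical.

```python
class DataDependentError(Exception):
    """Stand-in for the error raised when a value is unknown at compile time."""


def toy_guard_or_false(evaluate):
    # Use the real answer when it can be computed; otherwise fall back to False.
    try:
        return bool(evaluate())
    except DataDependentError:
        return False


def toy_guard_or_true(evaluate):
    # Same idea, but the fallback answer is True.
    try:
        return bool(evaluate())
    except DataDependentError:
        return True


def unknown():
    # Models an expression whose value depends on runtime data.
    raise DataDependentError("value depends on runtime data")
```

The caller picks the variant whose fallback answer selects the general (always correct) code path.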

However, one more API is still needed to make this example work: a `torch.check` that works with expressions. I will leave that to @laithsakka.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154672
Approved by: https://github.com/laithsakka
2025-05-30 07:45:01 +00:00
5a7442b91f remove allow-untyped-defs from torch/distributed/checkpoint/resharding.py (#154626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154626
Approved by: https://github.com/Skylion007
2025-05-30 07:43:04 +00:00
d66a55def0 remove allow-untyped-defs from torch/distributed/elastic/utils/logging.py (#154625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154625
Approved by: https://github.com/Skylion007
2025-05-30 07:37:56 +00:00
382b38ed1b remove allow-untyped-defs from torch/nn/utils/_expanded_weights/conv_expanded_weights.py (#154623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154623
Approved by: https://github.com/Skylion007
2025-05-30 07:32:57 +00:00
bcbd2a22b2 [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU (#147693)
* add onednn primitive cache for int4 gemm for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147693
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/guangyey, https://github.com/ZhiweiYan-96

Co-authored-by: Yan, Zhiwei <zhiwei.yan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-30 07:26:36 +00:00
0df96e3921 remove allow-untyped-defs from torch/ao/quantization/stubs.py (#154622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154622
Approved by: https://github.com/Skylion007
2025-05-30 07:26:09 +00:00
30f7079c93 [FSDP2] allow different dtypes for no grad model params (#154103)
Fixes #154082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154103
Approved by: https://github.com/weifengpy
2025-05-30 07:00:54 +00:00
d173ba5a75 Revert "Remove MemPoolContext (#154042)"
This reverts commit 3b38989b5f8f918cf1ad38bdade059608544af4b.

Reverted https://github.com/pytorch/pytorch/pull/154042 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154042#issuecomment-2921401100))
2025-05-30 06:53:37 +00:00
0fdd568b78 [forward fix] add support for MemoryFormat after type tightening (#154658)
Summary:
fixes error:
```
    raise AssertionError(f"Unexpected type in c_type_for_prim_type: {type_=}")
AssertionError: Unexpected type in c_type_for_prim_type: type_=MemoryFormat
```

after https://github.com/pytorch/pytorch/pull/154371 | D75568111

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/test:test_custom_ops -- --exact 'deeplearning/aot_inductor/test:test_custom_ops - test_export_extern_fallback_nodes (deeplearning.aot_inductor.test.test_custom_ops.TestAOTInductorProxyExecutor)'
```

Differential Revision: D75617432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154658
Approved by: https://github.com/Camyll, https://github.com/atalman, https://github.com/malfet
2025-05-30 06:53:25 +00:00
a4b0023f3b [cutlass backend] Cache config generation locally and remotely (#154686)
Summary:
Trying to cache the json list of configs.

There is probably some more work:
* preset
* filelock (?)
* for cases where we generate from scratch, save it to local as well (?)

Test Plan: tested offline

Reviewed By: coconutruben

Differential Revision: D75334439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154686
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-05-30 05:40:46 +00:00
ba51f4876d Revert "Enable C++ dynamic shape guards by default (#140756)"
This reverts commit dc0f09a4785349fc3b4e4d3dc3c02b018e5a0534.

Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/izaitsevfb due to seem to break dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2921151663))
2025-05-30 03:52:02 +00:00
852b99eba0 Revert "[c10d] Separate monitoring thread into a class in PGNCCL (#153977)"
This reverts commit 0db9c64d68dcdf25210357c4f7a41618441091d4.

Reverted https://github.com/pytorch/pytorch/pull/153977 on behalf of https://github.com/izaitsevfb due to breaks lots of jobs internally, safer to revert, see D75628917 ([comment](https://github.com/pytorch/pytorch/pull/153977#issuecomment-2921146129))
2025-05-30 03:46:43 +00:00
20ee5f9044 remove allow-untyped-defs from elastic_distributed_sampler.py (#154620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154620
Approved by: https://github.com/Skylion007
2025-05-30 03:29:45 +00:00
9c06dff1ce [multigraph] use specializations in compile_and_call_fx_graph (#153449)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There are really two parts to this work:

**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
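
The lazy dispatch described in step 3 can be sketched in plain Python as follows. This is an illustration of the idea only, not the Dynamo code: `make_lazy_dispatch`, `compile_for`, and the predicate-on-an-int model of a specialization are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Predicate = Callable[[int], bool]
Compiled = Callable[[int], Any]


def make_lazy_dispatch(
    specializations: List[Predicate],
    compile_for: Callable[[Predicate], Compiled],
    generic: Compiled,
) -> Callable[[int], Any]:
    """Build a dispatcher that compiles each specialization on first use."""
    cache: Dict[int, Compiled] = {}  # specialization index -> compiled callable

    def dispatch(size: int) -> Any:
        for i, matches in enumerate(specializations):
            if matches(size):
                if i not in cache:
                    cache[i] = compile_for(matches)  # compile on first hit only
                return cache[i](size)
        return generic(size)  # no specialization matched

    return dispatch


compiles = []  # records every compile, to show that compilation is lazy


def fake_compile(pred: Predicate) -> Compiled:
    compiles.append(pred)
    return lambda size: ("specialized", size)


dispatch = make_lazy_dispatch(
    [lambda s: s == 8, lambda s: s == 16],
    fake_compile,
    lambda size: ("generic", size),
)
```

Calling `dispatch(8)` twice triggers one compile, and a size that matches nothing goes to the generic callable without compiling anything.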

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433
2025-05-30 03:19:49 +00:00
c3de2c7c6b [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Differential Revision: D74691131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 03:09:37 +00:00
4a302b5731 NativeRT readme (#154581)
Summary: att

Test Plan: ci

Differential Revision: D75557667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154581
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/yiming0416
2025-05-30 02:50:53 +00:00
adfd5b293a Enhance UT on elapsed_time for XPUEvent (#154494)
# Motivation
UT enhancement to avoid the incorrect elapsed time returned by XPU's Event.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154494
Approved by: https://github.com/EikanWang
2025-05-30 02:00:02 +00:00
0289313551 [AOTI] Support OptionalTensor return type in AOTI proxy executor (#154286)
Summary:

When a C++ custom op returns an uninitialized tensor, it will be marked as None in Python. For this scenario, the user should mark the possibly uninitialized return as `Tensor?` in the custom op schema.
This diff adds `as_optional_tensor` type to export schema and the support for optional tensor in AOTI proxy executor.

Test Plan:

```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output
```

Differential Revision: D75262529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154286
Approved by: https://github.com/desertfire
2025-05-30 01:53:00 +00:00
58ead04ee9 [dynamic shapes] unbacked safe unsqueeze (#154087)
Also ran into this working on https://github.com/SWivid/F5-TTS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154087
Approved by: https://github.com/laithsakka
2025-05-30 01:41:57 +00:00
172015fc11 [multigraph] add specialize_on kwarg to mark_{dynamic,unbacked} (#153433)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There are really two parts to this work:

**The frontend changes (this PR):**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes:**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153433
Approved by: https://github.com/zou3519
2025-05-30 01:08:15 +00:00
9371491529 [Reland][pytorch] Patch the _is_conv_node function (#154473)
Summary: Add the conv padding ops in PyTorch; the corresponding PR in torchao is https://github.com/pytorch/ao/pull/2257

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Differential Revision: D75494468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
2025-05-30 00:41:03 +00:00
d6cb0fe576 [MPS] Extend index_copy support to complex dtypes (#154671)
Should have noticed it during the review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154671
Approved by: https://github.com/dcci
ghstack dependencies: #154670
2025-05-30 00:28:13 +00:00
0134150ebb [MPS][BE] Do not copy sizes/strides unnecessarily (#154670)
Just pass them as args to `mtl_setArgs`, metaprogramming should deal with the rest
Also use `mtl_dispatch1DJob` instead of computing max threadgroup size by hand

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154670
Approved by: https://github.com/dcci
2025-05-30 00:28:13 +00:00
61bfb3df9f [a2av] Improve tuning for 4 GPUs (#154580)
### Problem
Running `nvshmem_all_to_all_vdev` on 4 x H100s (fully connected with NVSwitch).
Before:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  32.29  16.23
1  33.01  31.76
2  33.01  63.54
4  33.83  123.97
8  49.83  168.34
16  80.82  207.59
32  178.66  187.82
64  335.79  199.86
128  646.72  207.54
256  1268.77  211.57
512  2511.14  213.80
1024  4998.31  214.82
2048  9964.49  215.51
4096  19892.34  215.91
```

215 GB/s does not reach the SOL of NV18 (350-400 GB/s).

### Change
If the number of peers decreases (say 8 to 4), we do not reduce the number of CTAs; instead, we shift more CTAs towards the data parallel dimension.

After:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  25.01  20.96
1  25.70  40.80
2  25.76  81.42
4  28.87  145.26
8  40.79  205.64
16  61.46  272.97
32  111.82  300.06
64  202.40  331.57
128  382.56  350.84
256  739.11  363.19
512  1450.79  370.05
1024  2873.13  373.72
2048  5719.50  375.47
4096  11395.65  376.90
```

If we look at the MoE-related region, say 32 MB, we can see a 187 -> 300 GB/s improvement.
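
As a quick sanity check on the tables, the 32 MiB rows appear consistent with bandwidth computed as bytes divided by elapsed time, in decimal GB/s. That formula is an assumption inferred from the numbers, not something stated in the PR, and `bus_bw_gb_per_s` is a hypothetical helper.

```python
MIB = 1024 * 1024


def bus_bw_gb_per_s(size_mib: float, time_us: float) -> float:
    """Bytes moved divided by elapsed time, in decimal GB/s."""
    return size_mib * MIB / (time_us * 1e-6) / 1e9


before = bus_bw_gb_per_s(32, 178.66)  # 32 MiB row of the "Before" table
after = bus_bw_gb_per_s(32, 111.82)   # 32 MiB row of the "After" table
speedup = after / before              # equals the ratio of the two times
```

Both values land within rounding distance of the reported 187.82 and 300.06 GB/s, a speedup of roughly 1.6x at this message size.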

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154580
Approved by: https://github.com/ngimel
2025-05-30 00:26:13 +00:00
2c1cb38d95 inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
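
The intended behavior can be sketched as below: hash every config entry, private or not, and skip only the names on an explicit ignore list. This is a simplified illustration, not the inductor codecache implementation; `config_cache_key` and the config names in the usage note are hypothetical.

```python
import hashlib
import json


def config_cache_key(config: dict, ignore: frozenset = frozenset()) -> str:
    """Hash all config entries except those explicitly listed in `ignore`.

    Names with a leading underscore are deliberately NOT skipped, so a
    private flag that changes codegen also changes the key.
    """
    relevant = {k: v for k, v in config.items() if k not in ignore}
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```

With a config such as `{"max_autotune": False, "_private_pass": False}`, flipping `_private_pass` yields a different key, while flipping an entry named in `ignore` does not.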

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
ghstack dependencies: #153766
2025-05-30 00:24:29 +00:00
5b6fd277f9 convert inductor codecache to use getArtifactLogger (#153766)
I'm not entirely sure of the background for why inductor codecache code uses default python logging instead of the new TORCH_LOGS-based artifact logging, but switching it over to artifact logging makes it easier to use nice testing utils in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153766
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2025-05-30 00:24:29 +00:00
eqy
818f76a745 [cuDNN] Allow cudnn attention or flash attention in test_export.py regex (#154458)
Analogous to #153272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154458
Approved by: https://github.com/drisspg
2025-05-29 23:51:09 +00:00
dc0f09a478 Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305, https://github.com/laithsakka
ghstack dependencies: #151225
2025-05-29 23:44:43 +00:00
0c6c7780d9 [Inductor] Add envvar to disable decomposeK (#154421)
Summary: Add envvar to Inductor config to disable decomposeK autotuning choice

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_max_autotune_decompose_k_dynamic_False_sizes2 (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Reviewed By: eellison

Differential Revision: D75174823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154421
Approved by: https://github.com/eellison
2025-05-29 23:34:41 +00:00
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
d5e0704247 [ROCm] Update maxpool launch config (#154619)
* Better perf on MI300 with updated launch configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154619
Approved by: https://github.com/jeffdaily
2025-05-29 23:28:07 +00:00
43b18d098b Forward fix for test_frame_traced_hook in internal testing (#154641)
Summary: Fixes the newly-added dynamo test test_frame_traced_hook so it can run internally

Test Plan: This is a test change

Differential Revision: D75616787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154641
Approved by: https://github.com/Skylion007
2025-05-29 23:02:01 +00:00
b040d63ce4 Prevent SAC cache from being kept alive by reference cycle (#154651)
Fixes https://github.com/pytorch/pytorch/issues/154642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154651
Approved by: https://github.com/xmfan
2025-05-29 22:27:35 +00:00
7d17253af8 [BE]: Improve aten formatter with fmtlib (#152830)
Replaces stateful ostream output with stateless fmtlib, which is significantly faster and more contained. It is especially faster for the type of complex double formatting found here since it uses the newer [DragonBox algorithm](https://github.com/jk-jeon/dragonbox) for faster floating point formatting (which is the main bottleneck here). This also enables some compile-time checking of the format strings.

test plan: all tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152830
Approved by: https://github.com/cyyever, https://github.com/malfet, https://github.com/atalman
2025-05-29 22:11:30 +00:00
fdbf314278 [Inductor] Cache subgraph autotuning choices properly (#154067)
Differential Revision: D75170507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154067
Approved by: https://github.com/eellison
2025-05-29 22:01:44 +00:00
c7e8e8ee19 Add torch.profile benchmarking function to feedback_fns (#153579)
Summary: Updates some benchmarking code to have the option to use torch.profile, and passes in a thunk to benchmark_fns to get this information (this will be a different result from `timings`, which are already passed into those functions).

Test Plan: Existing unit tests.

Differential Revision: D74444990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153579
Approved by: https://github.com/coconutruben, https://github.com/masnesral, https://github.com/nmacchioni
2025-05-29 21:43:45 +00:00
1237f271aa [ROCm] MIOpen: Get current device from Torch rather than HIP in handle creation (#154549)
Get current device from Torch rather than HIP in MIOpen handle creation. The device may have already been set from torch side, otherwise device is set to 0 for handle.  Additional audits of cudnn vs miopen Handle.cpp file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154549
Approved by: https://github.com/jeffdaily, https://github.com/cyyever

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:12:12 +00:00
08fdc64c86 [ROCm] Exposing Some MIOpen Symbols (#2176) (#154545)
This PR exposes some MIOpen symbols, namely:

1. `miopenDataType_t getMiopenDataType(const at::Tensor& tensor)`
2. `miopenHandle_t getMiopenHandle()`
3. `class TensorDescriptor`
4. `class Descriptor`
5. `class FilterDescriptor`
6. `struct ConvolutionDescriptor`
7. `struct DropoutDescriptor`
8. `struct RNNDescriptor`

to enable adding extensions that make use of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154545
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:10:45 +00:00
83a0e4e6f9 [Visualizer] Start at index with most events (#154571)
Summary: Oftentimes a single snapshot will contain multiple GPU traces, based on what the process can see. In this case, let's just start with the GPU trace with the highest amount of activity.
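
The selection rule amounts to picking the index of the longest per-device event list. The sketch below is in Python for illustration only and is not the visualizer's own code; `busiest_trace_index` and the list-of-lists input shape are assumptions.

```python
def busiest_trace_index(device_traces):
    """Return the index of the per-device trace with the most events.

    Falls back to 0 for an empty snapshot; ties go to the lowest index.
    """
    if not device_traces:
        return 0
    return max(range(len(device_traces)), key=lambda i: len(device_traces[i]))
```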

Test Plan:
Ran od with: https://www.35929.od.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-chujiechen-f701302011/1/rank-1_itrn-3.Mar_01_06_10_09.3747.snapshot.pickle
And it started at index 1 instead of 0

Differential Revision: D75555558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154571
Approved by: https://github.com/aaronenyeshi
2025-05-29 20:49:33 +00:00
2bc8fec744 deprecate MTIA_WORKLOADD from pytorch (#154627)
Differential Revision: D75612179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154627
Approved by: https://github.com/sraikund16
2025-05-29 20:30:40 +00:00
cb56df55dc [Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331)
Fixes #153298

This PR is the 3rd and final step of #147479
All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated.
All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary.

[henrylhtsang](https://github.com/henrylhtsang)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang
2025-05-29 20:29:58 +00:00
629fca295e Always set CPU affinity for benchmark jobs (#154569)
Because metrics like compilation time require CPU. I want to see if this helps fix https://github.com/pytorch/pytorch/issues/152566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154569
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-29 20:11:47 +00:00
3afbab66f7 [BE] Remove unused release scripts. Add clarifications for the branch cut process (#154649)
Scripts in ``scripts/release/promote/`` have not been used for a while.
We use the ones in test-infra [here](https://github.com/pytorch/test-infra/blob/main/release/) .
Hence this small cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154649
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-29 19:49:37 +00:00
e8f5c24d17 [rocm]add device guard when initialize single stream (#154433)
Summary: AMD streams are lazily initialized, and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while a stream is initializing, which would lead to an invalid configuration error.

Differential Revision: D75456460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
2025-05-29 19:42:12 +00:00
20ec61a02f [BE] fix lint errors caused by const SROpFunctor fn (#154552)
Summary: Remove const qualifier from SR as suggested by CLANGTIDY.

Test Plan: arc lint -a -e extra --take CLANGTIDY caffe2/torch/fb/sparsenn/cpu_operators/to_dense_representation_cpu.cpp

Reviewed By: henryoier

Differential Revision: D75534056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154552
Approved by: https://github.com/Skylion007
2025-05-29 19:40:08 +00:00
5a21d6f982 [AOTI][reland] Support multi-arch when using package_cpp_only (#154608)
Summary: Reland https://github.com/pytorch/pytorch/pull/154414

Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154608
Approved by: https://github.com/yushangdi
2025-05-29 19:32:33 +00:00
0db9c64d68 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, aka the watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PG or subPG instances, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads carry so many responsibilities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.

We made our first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR is the first step: making the monitoring thread a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, where it is more relevant. This is fine since we have fully rolled out EventCache, so watchdog hangs are rare now.

Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program with SIG_ABORT.
2. Check TCPStore every 30 sec to see if a watchdog timeout happened on any other rank; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)
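
As a rough illustration of the two duties above (not the actual C++ `HeartbeatMonitor` code), the loop can be sketched in Python. The class name `HeartbeatMonitorSketch`, the four callbacks, and the constructor parameters are hypothetical stand-ins; only the 8 minute and 30 sec defaults come from the description above.

```python
import threading
import time


class HeartbeatMonitorSketch:
    """Toy model of a monitor thread; all names here are hypothetical."""

    def __init__(self, read_heartbeat, peer_timed_out, on_hang, on_dump,
                 heartbeat_timeout_s=480.0, store_poll_s=30.0):
        self._read_heartbeat = read_heartbeat    # returns the watchdog's counter
        self._peer_timed_out = peer_timed_out    # stands in for the TCPStore check
        self._on_hang = on_hang                  # stands in for aborting the process
        self._on_dump = on_dump                  # stands in for the dump signal
        self._heartbeat_timeout_s = heartbeat_timeout_s
        self._store_poll_s = store_poll_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        last_beat = self._read_heartbeat()
        last_check = time.monotonic()
        # Event.wait returns True once stop() has been called, ending the loop.
        while not self._stop.wait(self._store_poll_s):
            # Duty 2: poll for a timeout reported by another rank.
            if self._peer_timed_out():
                self._on_dump()
            # Duty 1: at a coarser interval, check that the watchdog advanced.
            now = time.monotonic()
            if now - last_check >= self._heartbeat_timeout_s:
                beat = self._read_heartbeat()
                if beat == last_beat:
                    self._on_hang()
                last_beat, last_check = beat, now
```

Keeping the state and both checks inside one object is what makes it possible to reason about (and later share) the thread independently of the process group that owns it.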

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-29 17:45:04 +00:00
6f992e1b3f [BE][AT] cleanup my old todo (#154542)
Summary: this todo is very old, and probably not needed anymore. let's have CI figure out if removing this breaks anything

Test Plan: CI

Differential Revision: D75491068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154542
Approved by: https://github.com/Skylion007
2025-05-29 17:22:01 +00:00
634ce22601 [MPSInductor] Fix codegen for nested multistage reductions (#154578)
Yet to write a unittest for it, but this fixes codegen for
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5  --backend inductor --inference --devices mps --float16
```

By correctly closing triple nested loop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154578
Approved by: https://github.com/jansel, https://github.com/dcci
2025-05-29 17:09:25 +00:00
8883e494b3 [cutlass backend][ez] remove indent for cutlass config serialization (#154573)
Differential Revision: [D75566642](https://our.internmc.facebook.com/intern/diff/D75566642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154573
Approved by: https://github.com/ColinPeppler
2025-05-29 17:00:52 +00:00
41092cb86c [MPS] index copy impl (#154326)
Second most requested op according to #154052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326
Approved by: https://github.com/malfet
2025-05-29 16:57:43 +00:00
733e684b11 Skip test file that doesn't run gradcheck for slow gradcheck (#154509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154509
Approved by: https://github.com/malfet
2025-05-29 16:32:26 +00:00
2c6f24c62d [ROCm] Updated default workspace for gfx95 (#153988)
Fixes test_cuda.py::test_cublas_workspace_explicit_allocation on gfx95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153988
Approved by: https://github.com/jeffdaily
2025-05-29 16:22:17 +00:00
53b0f6f543 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 4613081b729273a9273185e9ef7470ce76e22da2.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/malfet due to It broke windows debug builds, see ef1d45b12d/1 ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2919897160))
2025-05-29 16:14:28 +00:00
ef1d45b12d Cleanup parent fallback logic (#154006)
The `parent` in fallback_node_due_to_unsupported_type is a duplication of `unsupported_output_tensor` logic. remove it. tested that the tests in test_add_complex give same codegen. this fixes an issue in mx that @drisspg was running into.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
2025-05-29 13:40:36 +00:00
d6e29bf875 Reflect back mutation if we clone misaligned tensors (#154442)
Fix for https://github.com/pytorch/pytorch/issues/152425

Inductor specializes on whether or not a tensor is 16-byte aligned on the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.

If we infer alignment, then run with an unaligned tensor and mutate the input, we need to reflect the mutation back to the input. This PR adds back that mutation.

We could also have been less aggressive about inferring alignment for mutated tensors, but that has a pretty big perf hit. See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.
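
For context on why the two slices in the benchmark behave differently, here is a small stdlib-only sketch of the alignment arithmetic. It assumes a 16-byte alignment requirement and a base allocation that is itself aligned; the helper name `is_aligned` and the base address are made up for illustration and are not Inductor APIs.

```python
ALIGNMENT_BYTES = 16   # assumed alignment requirement
FLOAT16_ITEMSIZE = 2   # bytes per torch.float16 element


def is_aligned(base_ptr: int, storage_offset: int, itemsize: int) -> bool:
    """Return True if the first element of a view starts on an aligned address."""
    return (base_ptr + storage_offset * itemsize) % ALIGNMENT_BYTES == 0


base = 0x7F00_0000_0000  # hypothetical, 16-byte-aligned base address

aligned_view = is_aligned(base, 0, FLOAT16_ITEMSIZE)    # like t[:-1]: offset 0
unaligned_view = is_aligned(base, 1, FLOAT16_ITEMSIZE)  # like t[1:]: offset 1 element
print(aligned_view, unaligned_view)
```

Under these assumptions `t[:-1]` keeps the base address while `t[1:]` starts 2 bytes in, which is why only the second call hits the unaligned path.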

In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions, with a prologue/epilogue for the unaligned part. But it's yet to be seen whether this is a huge issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-05-29 13:36:48 +00:00
3c74a72ea0 Keep XPU compatible with toolchain 2025.2 (#154359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154359
Approved by: https://github.com/EikanWang, https://github.com/cyyever
2025-05-29 11:12:07 +00:00
cd9ff41282 check fallback_value first. (#154493)
This is just a refactor, not a fix for any issue.
We now check fallback_value first and exit early, instead of repeatedly checking that it is not set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154493
Approved by: https://github.com/bobrenjc93
2025-05-29 09:06:43 +00:00
447b481c79 [AOTI] Save data sizes to constants_info (#154534)
Differential Revision: D75223179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154534
Approved by: https://github.com/muchulee8
2025-05-29 06:39:13 +00:00
9c7ed3e46e [debug_printer][BE] Fix float8 type printing for min/max value printing (#154466)
Summary:
ATT

GH Issue: https://github.com/pytorch/pytorch/issues/149008

**Previous:**
Failed to use debug printing for float8 types due to the limitation of "min_all_cuda" implementation from aten native:

 4b39832412/aten/src/ATen/native/cuda/ReduceMinValuesKernel.cu (L51)

Error:

Min value: Error: "min_all_cuda" not implemented for 'Float8_e4m3fn'

**Now:**
Example output paste: P1824621233
Unblocked float8 type tensor debug printing. Suggest to print the whole value if numel <= threshold.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_float8_dtype_cuda
```

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_fp8_cuda 2>&1 | tee fp8_example_printing.txt
```

Differential Revision: D74847967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154466
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang
2025-05-29 05:48:02 +00:00
07343efc15 [cutlass backend] small refactor to flatten the ops to avoid nested for loops (#154576)
Differential Revision: [D75565429](https://our.internmc.facebook.com/intern/diff/D75565429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154576
Approved by: https://github.com/ColinPeppler
2025-05-29 04:42:58 +00:00
b394c6e89c [Inductor][CPP] Add block sparse for FlexAttention CPU (#147196)
## Overview
This PR is to optimize FlexAttention CPP template with block sparse.
Block sparse is natively supported in FlexAttention block mask structures, so following the logic of the kv blocks from `kv_indice` and `full_kv_indice` is the straightforward way to add this optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147196
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-05-29 02:57:02 +00:00
c0864bb389 Add a (t * 0) pattern (#153161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153161
Approved by: https://github.com/danielvegamyhre
2025-05-29 02:19:36 +00:00
316e7a9293 [BE][Ez]: Denote common types as TypeAlias (#154527)
Denotes common_types as TypeAlias. This triggered a Ruff rule because our TypeAlias names do not follow the standard naming convention, so I added a file-wide Ruff suppression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154527
Approved by: https://github.com/benjaminglass1, https://github.com/aorenste
2025-05-29 02:00:13 +00:00
2d932a2e01 [ROCm] Fix 3D tensor perf degradation with NHWC format (#154522)
Co-author: @doru1004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154522
Approved by: https://github.com/jeffdaily
2025-05-29 01:33:49 +00:00
cyy
4613081b72 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-29 00:52:44 +00:00
946a4c2bdc BE: Type previously untyped decorators (#154515)
Summary: Cloned #153726 from Skylion007 and fixed internal typing issues.

Test Plan: Unit tests pass

Differential Revision: D75477355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154515
Approved by: https://github.com/Skylion007
2025-05-29 00:36:34 +00:00
ba0a91b3ea [4/n][Optimus][Auto-AC] Expose the config to skip the dynamo gaurds to avoid recompile (#154152)
Summary:
context: https://fb.workplace.com/groups/1075192433118967/permalink/1673720956599442/

Thanks Microve for raising the existing dynamo skip API in D75196435

Dynamic shapes trigger recompilation, which increases compilation time, so we expose a config that lets users skip the dynamo guards to avoid the recompile. Note that it may quantize nodes unnecessarily, which can impact NE, QPS, and memory savings, and needs verification.

Differential Revision: D75248430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154152
Approved by: https://github.com/bobrenjc93
2025-05-29 00:35:37 +00:00
22a1b3b5d0 use 4 elements per thread in no-cast elementwise kernel (#154558)
Reduce elems per thread to 4 in vectorized function also (only for unaligned inputs where there's no vectorization anyway). This slightly reduces binary size (by 4MB)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154558
Approved by: https://github.com/malfet
2025-05-29 00:32:44 +00:00
40abb2b403 Fix deprecated amp APIs in docs (#154553)
Update usage of deprecated amp APIs.

Fixes https://github.com/pytorch/tutorials/issues/3331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154553
Approved by: https://github.com/Skylion007
2025-05-29 00:05:59 +00:00
2b3ac17aa2 [Cutlass] Remove spammy log for gemm extensions (#154548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154548
Approved by: https://github.com/henrylhtsang
2025-05-28 23:55:36 +00:00
81b7c96697 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 23:29:37 +00:00
6cda280483 [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 23:29:37 +00:00
bbd45f1f1f [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 23:29:37 +00:00
0f0d5749a0 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 23:29:37 +00:00
65b1aedd09 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-28 23:25:17 +00:00
3e05a48927 Fix clamp type promotion in inductor decomposition (#154471)
Summary: as title, the clamp type promotion should take min/max arg into consideration as well.
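
To make the intent concrete, here is a deliberately simplified model of the rule, not PyTorch's real promotion logic. The three-entry dtype ranking and the helper names (`promote`, `scalar_dtype`, `clamp_result_dtype`) are assumptions chosen for illustration only.

```python
# Toy dtype ranking, ordered from narrowest to widest category (an assumption).
_RANK = {"bool": 0, "int64": 1, "float32": 2}


def promote(*dtypes: str) -> str:
    """Pick the widest dtype among the operands according to the toy ranking."""
    return max(dtypes, key=_RANK.__getitem__)


def scalar_dtype(value) -> str:
    """Map a Python scalar bound to a toy dtype name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    return "float32"


def clamp_result_dtype(input_dtype: str, min_val=None, max_val=None) -> str:
    """Result dtype when the bounds take part in promotion, not just the input."""
    operands = [input_dtype]
    operands += [scalar_dtype(b) for b in (min_val, max_val) if b is not None]
    return promote(*operands)
```

In this model an integer input clamped with a float bound promotes to a float result, while integer bounds leave the input dtype unchanged.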

Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```

Differential Revision: D75490124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2025-05-28 23:24:25 +00:00
d865b784e4 Support unbacked whitelist (#154295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154295
Approved by: https://github.com/angelayi
2025-05-28 23:01:22 +00:00
ef4d57329b [CAG] Support for call_module at copy paste aot bwd graph (#153827)
Support for `call_module` in `copy_paste_aot_backward_graph` added recently with PT2.7

Problem is being observed with HPU backend in example repro due to creating fused modules.

```
import torch

device = 'cpu' #'hpu'
backend = 'inductor' #'hpu_backend'

def fn(t1):
    t1 = t1 * 1
    t1_grad = torch.ones_like(t1, device=device)
    t1.backward(t1_grad, retain_graph=True)
    return t1

t1 = torch.ones(1, requires_grad=True, device=device) #.squeeze()
compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)

with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
    result_grad = torch.ones_like(result, device=device)
    result.backward(result_grad)

print(f'{result_grad=}')
print(f'{t1.grad=}')
```

With this change I get the same results as on CPU. However, I hit the problem below when running with a scalar (the t1 tensor after squeeze):
`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number')`

On CPU, the following warning is emitted and None is returned:
`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  print(f'{t1.grad=}')
t1.grad=None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153827
Approved by: https://github.com/xmfan
2025-05-28 22:52:40 +00:00
d62a33c002 [ez] add docblock for _expandsums (#154397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154397
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396, #154399
2025-05-28 22:43:26 +00:00
0c00e32632 [ez] add docblock for _eval_is_non_overlapping_and_dense (#154399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154399
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396
2025-05-28 22:40:03 +00:00
0f56318152 [precompile] Add Exception type PackageError for unsupported precompile features. (#154430)
Summary:
Today when guard serialization fails, dynamo will raise an internal error like:

```
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CLOSURE_MATCH guard cannot be serialized.
```

Adding a dedicated PackageError type to surface the error more clearly.

Test Plan: CI

Differential Revision: D75452124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154430
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-05-28 22:34:51 +00:00
11129d9317 Add new ops in fallback ops (#154251)
## Background

Task: [T222738229](https://www.internalfb.com/intern/tasks/?t=222738229)

It's the first starter task on the project **_Enabling TorchNative Standalone on Whisper_**.  We are using cshim to create a layer of abstraction between _**libtorch**_ and **_AOTInductor generated artifacts_**.

So we needed to add an entry in the cshim for every API surface in libtorch. And we only care about operators that AOTInductor does not handle. And for this task, we only wanted to add it for the following ops.

## What I've done?

4 new fallback ops are added that show up in the Whisper model. (torchgen/aoti/fallback_ops.py)

- aten.permute (default)
- aten.squeeze (dim)
- aten.abs (default)
- aten.hann_window (default)

Then I ran the command below to generate the new C shim header files, as described [here](7e86a7c015/torchgen/gen.py (L2424-L2436%20for%20details)):
`python torchgen/gen.py --update-aoti-c-shim`

Then, `python setup.py develop` to rebuild PyTorch

## Testing

4 new tests have also been added in test/inductor/test_aot_inductor.py

- test_proxy_executor_permute
- test_proxy_executor_abs
- test_proxy_executor_squeeze
- test_proxy_executor_hann

I ran these commands to test it (inside local pytorch root folder):

`python test/inductor/test_aot_inductor.py -k test_proxy_executor_permute`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_abs`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_squeeze`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_hann`

## NOTE:
I didn't see any particular ordering of the tests inside _test/inductor/test_aot_inductor.py_. That's why I added the new tests right after the test given in the example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154251
Approved by: https://github.com/angelayi
2025-05-28 22:11:07 +00:00
d2f506cae8 [ca] disable ca for functorch grad and run all HOO tests (#154147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154147
Approved by: https://github.com/zou3519
ghstack dependencies: #154133
2025-05-28 22:06:13 +00:00
857f21631d [ca] fix hop_db tests (#154133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154133
Approved by: https://github.com/zou3519
2025-05-28 22:06:13 +00:00
ed348e7026 Add docblock for TrackedFake (#154396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154396
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398
2025-05-28 21:19:49 +00:00
d311b79c12 add docblock for _fast_expand (#154398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154398
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400
2025-05-28 21:16:47 +00:00
e7318b863d [ez] add docblock to cast_symbool_to_symint_guardless (#154400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154400
Approved by: https://github.com/laithsakka
2025-05-28 21:11:53 +00:00
f6dcc45c44 [Kineto x Insight] Add device to activity type map in pytorch (#154253)
Summary: Update the device to ActivityType Map in pytorch. Need to be exported to github

Test Plan:
Run the ondemand e2e test and insight profiler is triggered during profiling
P1819539581: https://www.internalfb.com/intern/paste/P1819539581/
{F1978519960}

Insight profiler is not enabled when mtia_insight is not specified in the config
{F1978527200}

Reviewed By: fenypatel99

Differential Revision: D75246621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154253
Approved by: https://github.com/Skylion007
2025-05-28 20:36:19 +00:00
e25074d462 [c10d][CI] Change expected return code in Sandcastle for Nan tests (#154441)
Fixing internal error caused by #153167.

`skip_but_pass_in_sandcastle_if` returns exit code 0. But `test_nan_assert` expects exit code -6.
So we'd need to set expected return code conditional on `IS_SANDCASTLE`.
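
A minimal sketch of the conditional described above. The helper name `expected_nan_assert_returncode` is hypothetical, and `is_sandcastle` stands in for the `IS_SANDCASTLE` flag; the only assumption about the platform is that a child killed by SIGABRT is reported with the negated signal number.

```python
import signal


def expected_nan_assert_returncode(is_sandcastle: bool) -> int:
    """Return the exit code the NaN-assert test should expect."""
    if is_sandcastle:
        # The test is skipped-but-passed in Sandcastle, so the child exits cleanly.
        return 0
    # Otherwise the child aborts; a process killed by a signal is reported
    # with the negated signal number (SIGABRT is 6 on Linux, hence -6).
    return -int(signal.SIGABRT)
```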

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154441
Approved by: https://github.com/fduwjj, https://github.com/nWEIdia
ghstack dependencies: #153167
2025-05-28 20:35:52 +00:00
c381103fd7 Fix the logic of set_cpu_affinity (#154503)
While investigating https://github.com/pytorch/pytorch/issues/152566, I found two issues with how the cpu affinity is set in benchmark job:

* The current logic doesn't work with cgroups slice, the mechanism behind multi-tenant runner:
    * Using `lscpu` returns all CPUs and not the available ones from cgroups.  On the other hand, `nproc` works correctly.  For example, on H100, `lscpu` returns 192 CPUs while `nproc` returns 24 (192 / 8)
    * Setting `taskset -c 0-N` blindly is wrong because CPU 0 is only available to the first tenant, aka alice.  For example, running `taskset -c 0 ls` on any other tenant will fail. To fix this, the IDs of the available CPUs can be fetched by calling `os.sched_getaffinity(0)`.
* The last bug is that `taskset` works with logical CPUs (https://www.man7.org/linux/man-pages/man1/taskset.1.html), so using the result from `test_inductor_get_core_number` is also wrong because that function returns the number of physical CPUs.
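
The fix described in the bullets above can be sketched as follows. This is an illustration rather than the code in the PR: the helper names `format_cpu_list` and `taskset_prefix` are made up, and the fallback for platforms without `os.sched_getaffinity` is an assumption.

```python
import os


def format_cpu_list(cpu_ids) -> str:
    """Format CPU ids as the comma-separated list accepted by `taskset -c`."""
    return ",".join(str(cpu) for cpu in sorted(cpu_ids))


def taskset_prefix() -> list:
    """Build a `taskset` prefix pinned to the CPUs this process may actually use."""
    # os.sched_getaffinity is Linux-only; fall back to no pinning elsewhere.
    if not hasattr(os, "sched_getaffinity"):
        return []
    # Respects cgroup/cpuset restrictions, unlike counting CPUs with lscpu.
    return ["taskset", "-c", format_cpu_list(os.sched_getaffinity(0))]
```

Passing explicit logical CPU ids avoids both assuming that CPU 0 is available and mixing up physical and logical core counts.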

### Testing

CPU benchmark jobs look ok

* [aarch64 torch.compile benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2021%20May%202025%2016%3A40%3A28%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A40%3A28%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=fix-cpu-affinity-cgroups&lCommit=9a6288e083d650c470623f5fe136b1060824021c&rBranch=main&rCommit=dec5ab8d984b8a608140911351d877b9ddb141c2)
* [x86 micro benchmark](https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2021%20May%202025%2016%3A41%3A26%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A41%3A26%20GMT&granularity=day&lBranch=main&lCommit=c1b7dbc52aaa49f4cd147bbe5935110a4a10e3e3&rBranch=refs/tags/ciflow/inductor-micro-benchmark-cpu-x86/154503&rCommit=9a6288e083d650c470623f5fe136b1060824021c&repoName=pytorch%2Fpytorch&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=cpu%20(x86_64)&archName=All%20Platforms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-28 19:38:20 +00:00
66f53889d5 [nativert] port semaphore to c10 util (#153504)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a simple semaphore interface to c10, as a stopgap until C++20, where we get `std::counting_semaphore`.

gonna need a oss build export to take a look at this...

Test Plan: CI

Differential Revision: D73882656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
24980d2641 [ROCm][CI] Update build-environment for mi300 workflows (#153134)
so their test times are tracked separately in https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/test-times.json. Currently, both MI200 and MI300 test times get combined into the same key `linux-focal-rocm-py3.10`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153134
Approved by: https://github.com/huydhn
2025-05-28 19:04:53 +00:00
d4ab8e74f3 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit c6fc11af760d4ad1f01cc699a3c6488ab5f41770.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/izaitsevfb due to still fails to link internally at meta ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2917221575))
2025-05-28 18:22:39 +00:00
1c7a70b483 [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

Also saw an error saying .o not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-28 17:35:19 +00:00
66ac724b56 pyfmt lint torch/_export/passes/replace_view_ops_with_view_copy_ops_pass.py (#154488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154488
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485, #154487
2025-05-28 17:07:15 +00:00
dfe0f48123 pyfmt lint torch/_export/serde/schema.py (#154487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154487
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485
2025-05-28 17:07:15 +00:00
92cebed1bd pyfmt lint torch/_export/serde/serialize.py (#154485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154485
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484
2025-05-28 17:07:07 +00:00
b4fe5ca58a pymft lint torch/utils/weak.py (#154484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154484
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483
2025-05-28 17:06:58 +00:00
4de1b25df7 Remove empty files from execlude lint rule (#154483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154483
Approved by: https://github.com/Skylion007
2025-05-28 17:06:50 +00:00
70539308ac [dynamo] updating gb_type names for uniqueness (#154452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154452
Approved by: https://github.com/williamwen42
2025-05-28 16:54:10 +00:00
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
3b38989b5f Remove MemPoolContext (#154042)
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is the active pool in graph_pools, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and the active pool in graph_pools to go out of sync (see all the asserts in the code that try to make sure that doesn't happen; it still could happen in a multithreaded scenario, see my recent PR #153990).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
2025-05-28 16:35:48 +00:00
d23aa7e182 Add deprecation warning for torch.ao.quantization (#153892)
Summary:
att

Test Plan:
(ao) $ PYTHONWARNINGS='default' python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
printing warning
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/__init__.py:36: DeprecationWarning: torch.ao.quantization is deprecated. Plan is to
1. Remove eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead
2. Remove fx graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e)
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e)
see https://dev-discuss.pytorch.org/t/torch-ao-quantization-migration-plan/2810 for more details
  warnings.warn(
>>> a = XNNPACKQuantizer()
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/quantizer/xnnpack_quantizer.py:281: DeprecationWarning: XNNPACKQuantizer is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead
  warnings.warn(f"{self.__class__.__name__} is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead", DeprecationWarning)
>>>

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153892
Approved by: https://github.com/Skylion007
2025-05-28 16:25:30 +00:00
5bf74753f6 [precompile] Prune local scope variables for guard serialization. (#154431)
Summary: Prune unused local objects from serialized local scope if they are not used in guard reconstruction. This is helpful when a user program takes things like local callable functions or the function call is recursive.
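
A minimal sketch of the pruning idea, assuming the set of names needed for guard reconstruction is already known. `prune_local_scope` and `used_names` are hypothetical and are not the actual Dynamo helpers:

```python
import pickle


def prune_local_scope(local_scope: dict, used_names: set) -> dict:
    """Keep only the locals that guard reconstruction refers to."""
    return {name: value for name, value in local_scope.items() if name in used_names}


def make_scope() -> dict:
    def foo(x):  # a local function: pickling it fails
        return x + 1

    return {"x": [0.0461, 0.4024, -1.0115], "g": foo}


scope = make_scope()
pruned = prune_local_scope(scope, used_names={"x"})
payload = pickle.dumps(pruned)  # succeeds once the unused local function is dropped
```

Pickling the unpruned `scope` fails in the same way as the "Before pruning locals" trace below, because `g` is a local function.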

Test Plan:
test/dynamo/test_guard_serialization.py -k test_function_locals

Before pruning locals:
```
state = GuardsState(output_graph=OutputGraphGuardsState(local_scope={'x': tensor([ 0.0461,  0.4024, -1.0115]), 'g': <function ...aints=None, _guards=<torch._guards.GuardsSet object at 0x7fbccc7e9fc0>, _aotautograd_guards=[]), shape_code_parts=None)

    def pickle_guards_state(state: GuardsState) -> bytes:
        buf = io.BytesIO()
        pickler = GuardsStatePickler(buf)
        try:
            pickler.dump(state)
        except AttributeError as e:
>           raise torch._dynamo.exc.PackageError(str(e)) from e
E           torch._dynamo.exc.PackageError: Can't pickle local object 'TestGuardSerialization.test_function_locals.<locals>.foo'
```
After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D75452123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154431
Approved by: https://github.com/jansel
2025-05-28 16:03:02 +00:00
9db7bcb3fe [Dynamo] Introduce hook receiving list of traced code objects (#153622)
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
    * Whenever an `inline_call()` is encountered, the corresponding code object is added to this list
    * `OutputGraph`'s associated `f_code` is added to the list just before the hook is called

I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.
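
A rough sketch of what such a hook could do with the list it receives, assuming it is called with plain Python code objects. `hash_traced_code` is a hypothetical callback, and how it gets registered on `Hooks` is not shown here:

```python
import hashlib
import types


def hash_traced_code(traced_code: list) -> str:
    """Hypothetical `frame_traced_fn`: fingerprint every traced code object."""
    digest = hashlib.sha256()
    for code in traced_code:
        assert isinstance(code, types.CodeType)
        digest.update(code.co_filename.encode())
        digest.update(code.co_name.encode())
        digest.update(code.co_code)
    return digest.hexdigest()


def helper(x):
    return x * 2


def entry(x):
    return helper(x) + 1


# Stand-in for the list Dynamo would pass: inlined callees first, then the frame's own code.
fingerprint = hash_traced_code([helper.__code__, entry.__code__])
```

A real source-hashing scheme would more likely hash file contents than bytecode; bytecode is used here only to keep the sketch self-contained.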

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
2025-05-28 15:40:09 +00:00
476e0a643a [ez] add docblock for ShapeGuardPythonPrinter (#154403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154403
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385, #154402
2025-05-28 14:17:17 +00:00
473a93eb58 [ez] add docblock for _ShapeGuardPrinter (#154402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154402
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385
2025-05-28 14:13:22 +00:00
35a473e364 [ez] add docblock for guard_scalar (#154385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154385
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384
2025-05-28 14:10:07 +00:00
ee4f433963 [ez] add docblock for _guard_or (#154384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154384
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383
2025-05-28 14:06:29 +00:00
e9b97d19b1 [ez] Make SymNodeImpl comments less misleading (#154480)
As discussed in DS workchat, it's easy for users to get confused by
guarding for these supposedly non-guarding methods. The TL;DR is in the
case of non pythonic compilers like XLA, we actually do guard. I've
updated the comments accordingly to reduce confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154480
Approved by: https://github.com/pianpwk, https://github.com/Skylion007
2025-05-28 14:04:32 +00:00
a75e3a02be Revert "[dynamo, nested graph breaks] small fixes to resume function generation (#151056)"
This reverts commit 28e7aa21c522e92ea01a62dfdc5e3b74e398d8f0.

Reverted https://github.com/pytorch/pytorch/pull/151056 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
9603d6382d Revert "[dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)"
This reverts commit 1fe98429222a8ba5e16dd9381f50a8fb90edcf0e.

Reverted https://github.com/pytorch/pytorch/pull/153510 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
5fd7004dc9 Revert "[dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)"
This reverts commit 9a66c30bdc563c62375e5030c4103b67515b8dac.

Reverted https://github.com/pytorch/pytorch/pull/153772 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
e86439ed5b Revert "[dynamo, nested graph breaks] add skip_frame debugging function (#153773)"
This reverts commit aadf9eae63c4793e1107a3b21ede30e5289eeaca.

Reverted https://github.com/pytorch/pytorch/pull/153773 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
203b0efd63 [PP] Allow unused kwargs in ZB path (#153498)
This fixes the case where an unused kwarg is passed to the PP stage forward: we try to call `torch.autograd.grad()` and update its gradients when it shouldn't have gradients, leading to this error:

```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/init.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```

related issues: https://github.com/pytorch/torchtitan/issues/1188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
2025-05-28 13:34:04 +00:00
cf7451f279 Fix signature of torch.sparse_coo_tensor() (#152681)
Fixes #145371

@pearu I searched the codebase and found this code, and I wonder whether it is the root cause of the issue. Could you review it? Thanks a lot!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152681
Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/nikitaved
2025-05-28 13:16:41 +00:00
f58143b945 [Typing] Refactor torch.types.Device in torch/cuda/__init__.py (#153447)
Part of: #152952
Follow up: #153027

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `Optional[Union[Device, int]]` is equivalent to `torch.types.Device`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153447
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-28 10:09:31 +00:00
fdc339003b Revert "[AOTI] Support multi-arch when using package_cpp_only (#154414)"
This reverts commit a84d8c4a1cc515db274366537afd0b1492800c2d.

Reverted https://github.com/pytorch/pytorch/pull/154414 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm trunk job ([comment](https://github.com/pytorch/pytorch/pull/154414#issuecomment-2915597821))
2025-05-28 09:23:31 +00:00
853958f82c Fix: Replacements can cause runtime assertions to disappear and can cause invalid inductor code. (#153661)
Let's first explore a couple of problems related to replacements and runtime assertions.

#### example problem 1
If we have a runtime assertion that u0==s0, where u0 is an input coming from mark_unbacked, a replacement u0=s0 will be added and the function f(u0, s0) will become f(s0, s0). This leads to the assert not being inserted during insert_deferred_runtime_asserts.
The reason is that the insert_deferred_runtime_asserts logic inserts each assertion once all its inputs are seen, but u0 will never be seen. The same thing can happen when we defer assertions on backed symbols, e.g. s0==s2.
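
A toy model of this failure mode, assuming a simplified insertion loop that walks the placeholders in order and emits an assertion once all of its symbols have been seen. `insert_deferred_asserts` is a hypothetical simplification, not the real `insert_deferred_runtime_asserts`:

```python
def insert_deferred_asserts(placeholders, asserts):
    """Emit each assertion once every symbol it mentions has been seen."""
    seen = set()
    inserted = []
    pending = list(asserts)
    for symbol in placeholders:
        seen.add(symbol)
        for assertion in list(pending):
            if assertion["symbols"] <= seen:
                inserted.append(assertion["expr"])
                pending.remove(assertion)
    return inserted, [assertion["expr"] for assertion in pending]


deferred = [{"expr": "u0 == s0", "symbols": {"u0", "s0"}}]

# Without the replacement, both symbols appear as inputs and the assert is emitted.
ok_inserted, ok_dropped = insert_deferred_asserts(["u0", "s0"], deferred)

# After the replacement u0 = s0, f(u0, s0) becomes f(s0, s0): u0 is never seen.
bad_inserted, bad_dropped = insert_deferred_asserts(["s0", "s0"], deferred)
```

In the second call the assertion stays pending forever, which is the silent drop described above.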

#### example problem 2
Consider u0==s0, where u0 is coming from a call to .item(). Imagine that later on s0 is specialized to 2. In that case s0 as an input won't be seen during insert_deferred_runtime_asserts and the assertion won't be inserted in the graph. Worse, Inductor will generate code that refers to s0 in the cpp wrapper even though it does not exist, causing a failure.
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1669766396994898/

## The solution:
The runtime assertion insertion loop depends on detecting that the symbols used in the runtime assertions have been seen. Note that those symbols are either graph inputs or generated in the graph by data-dependent ops like .item().

The issues above happen when symbols are graph inputs. To force the symbols to exist in the graph and to be seen by the runtime assertions, we do not do replacements on placeholder expressions during codegen and during runtime assertion insertion.

This should not have performance overhead, since we have already optimized the graph with replacements; the only effect is that we no longer mistakenly drop graph inputs that are used in runtime assertions.
I added extended testing. An unrelated follow-up that I noticed is that we might want to rename unbacked symbols in runtime assertions when we do unbacked renaming, but that's a different issue.

Other approaches that did not work:
#### Ban replacements on unbacked symbols.
1. Does not work when we defer runtime assertions on backed symbols, e.g. s0==s1. We could also ban such replacements, but then problem 2 becomes more problematic.
2. It degrades the quality of reasoning.

#### Apply specialization on runtime assertions before codegen.
1. Can fix some issues, but may also lead to runtime assertions becoming NOPs.
2. Does not fix the issue of runtime assertions not being inserted during insert_deferred_runtime_asserts due to inputs not being detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153661
Approved by: https://github.com/jansel
2025-05-28 09:08:05 +00:00
aadf9eae63 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 08:54:09 +00:00
9a66c30bdc [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 08:54:09 +00:00
1fe9842922 [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 08:54:09 +00:00
28e7aa21c5 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 08:54:09 +00:00
cyy
9d04c0f352 Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-28 08:44:58 +00:00
1d9b7dd2d1 [PGO] suggest dynamic whitelist for recompilations (#154189)
Suggests `TORCH_COMPILE_DYNAMIC_SOURCES` based on tensor size changes in PGO code state, including parameters.

Closing #153442 which took the dynamo guards approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154189
Approved by: https://github.com/bobrenjc93
2025-05-28 07:11:43 +00:00
fe760b6636 [ez] add docblock for _free_unbacked_symbols_with_path (#154383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154383
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381
2025-05-28 05:53:50 +00:00
8e25ba6963 [ez] add docblock for find_symbol_binding_fx_nodes (#154381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154381
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380
2025-05-28 05:44:26 +00:00
08c29deb5f [ez] add docblock to is_symbol_binding_fx_node (#154380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154380
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379
2025-05-28 05:41:19 +00:00
07405a6cff [ez] add docblock for free_unbacked_symbols (#154379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154379
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378
2025-05-28 05:37:25 +00:00
dcdaef5206 [ez] add docblock for free_symbols (#154378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154378
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377
2025-05-28 05:34:25 +00:00
abc3fdc7ac [ez] add docblock for _iterate_exprs (#154377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154377
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405
2025-05-28 05:28:58 +00:00
ab6cb85cb0 [ez] add docblock for _remove_effect_token_unbacked_bindings (#154405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154405
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404
2025-05-28 05:16:14 +00:00
fde8f6a8b8 [ez] add docblock for _suggest_torch_checks (#154404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154404
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386, #154401
2025-05-28 04:45:55 +00:00
b82fb57b67 [ez] add docblock for RuntimeAssert (#154401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154401
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386
2025-05-28 04:43:22 +00:00
d64b4a91dd [ez] remove unused function _constrain_symbol_range (#154386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154386
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376
2025-05-28 04:41:00 +00:00
ef90cc18d7 use definitely_contiguous for _prim_elementwise_meta short circuit (#153441)
This verifies that the check short circuit is not material. https://github.com/pytorch/pytorch/pull/153431
```
import torch
from torch.export import Dim, export
class MyModel(torch.nn.Module):
    def forward(self, x, ranks):
        first_k = ranks.max().item()
        torch._check_is_size(first_k)
        narrow = x.narrow(dim = 1, start = 0, length = first_k)
        lt = narrow < narrow.size(1)
        return lt
inps = (
    torch.randn((8, 16), device="cuda"),
    torch.arange(8, device="cuda", dtype=torch.int8)
)
spec = {
    "x": (Dim.AUTO, Dim.AUTO),
    "ranks": (Dim.AUTO,),
}
traced = export(MyModel(), inps, dynamic_shapes=spec, strict=True).run_decompositions({})

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153441
Approved by: https://github.com/jansel
ghstack dependencies: #153432
2025-05-28 03:41:26 +00:00
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors.
In that case we can't really evaluate is_contiguous. In many places in the code base, we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors. In that case we probably want
to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute that says the tensor is contiguous when it is always contiguous. We store that only if definitely_contiguous is true.
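
A toy illustration of the "definitely" semantics, assuming unknown (unbacked) sizes and strides are modelled as `None`. This is a hypothetical pure-Python sketch, not the actual implementation, and it is deliberately conservative (for example, empty tensors are not special-cased):

```python
from typing import Optional, Sequence


def definitely_contiguous(
    sizes: Sequence[Optional[int]], strides: Sequence[Optional[int]]
) -> bool:
    """Return True only when row-major contiguity can be proven.

    None models an unbacked (unknown) size or stride. Any unknown that
    matters yields False instead of raising.
    """
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size is None:
            return False  # nothing can be proven about this dimension
        if size == 1:
            continue  # the stride of a size-1 dimension is irrelevant
        if stride is None or stride != expected:
            return False
        expected *= size
    return True


def reshape_path(sizes, strides) -> str:
    # Fast path only when contiguity is certain; the general path handles both cases.
    return "fast" if definitely_contiguous(sizes, strides) else "general"
```

Returning `False` for "unknown" is what lets callers fall back to the general path without guarding on the unbacked symbol.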

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
54f1f29fed [dynamo] dynamic gb_type -> static gb_type (#154435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154435
Approved by: https://github.com/williamwen42
2025-05-28 03:14:26 +00:00
f12ce4e36b [Intel GPU] convolution fusion at XPU backend (#154202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154202
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/etaf
ghstack dependencies: #140365
2025-05-28 03:14:18 +00:00
c6fc11af76 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more information

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-28 02:47:16 +00:00
855eff8e8e Don't CSE unbacked nodes (#154387)
* #154440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154387
Approved by: https://github.com/TroyGarden
ghstack dependencies: #154440
2025-05-28 02:21:56 +00:00
919a1a17e3 [ez] Replace misleading implementations with NYI (#154440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154440
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-28 02:21:56 +00:00
a84d8c4a1c [AOTI] Support multi-arch when using package_cpp_only (#154414)
Summary: Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Differential Revision: [D75452096](https://our.internmc.facebook.com/intern/diff/D75452096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154414
Approved by: https://github.com/angelayi
ghstack dependencies: #154412, #154413
2025-05-28 01:20:38 +00:00
cde82d25b7 [AOTI] Add a multi_arch_kernel_binary option (#154413)
Summary: CUDA can support multi-arch with the fatbin format. Add this multi_arch_kernel_binary option, so the compiled model binary can run across different GPU archs.

Differential Revision: [D75452094](https://our.internmc.facebook.com/intern/diff/D75452094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154413
Approved by: https://github.com/angelayi
ghstack dependencies: #154412
2025-05-28 01:20:38 +00:00
4d8f3d537a [AOTI][refactor] Rename embed_cubin to embed_kernel_binary (#154412)
Summary: Rename as it is not CUDA specific.

Differential Revision: [D75452095](https://our.internmc.facebook.com/intern/diff/D75452095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154412
Approved by: https://github.com/angelayi
2025-05-28 01:20:28 +00:00
e79790e14b [ez] add docblock for _sympy_from_args (#154376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154376
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375
2025-05-27 23:43:13 +00:00
fe082c5ffe Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 23:16:21 +00:00
3f10c9d8af Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one, although a better way to skip the whole suite would be to add it to the skip list in run_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/jeffdaily
2025-05-27 23:10:25 +00:00
4b39832412 [CI] Update torchbench pin (#154453)
Related to https://github.com/pytorch/pytorch/issues/154446
Pins the torchbench repo to https://github.com/pytorch/benchmark/pull/2620, which pins opacus to version ``1.5.3``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154453
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-05-27 23:08:42 +00:00
247ea229ba Create issue template: Release highlight for proposed Feature (#154125)
Authors: @anitakat @atalman

This is related to: https://github.com/pytorch/pytorch/issues/152134 . Adding RFC template for feature submissions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154125
Approved by: https://github.com/anitakat, https://github.com/ZainRizvi, https://github.com/albanD
2025-05-27 22:45:21 +00:00
53affa273b [MTIA Aten Backend][1.3/n] Migrate remaining view ops, which all need explicit register in native_functions.yaml (#154337)
See context in D75266206.

This diff/PR migrates all the remaining view ops, which all need changes in `native_functions.yaml` and thus need to be exported to PR.

Ops covered by this diff:
- _reshape_alias
- unfold

internal: Also delete the entire aten_mtia_view_ops.cpp file, and update corresponding build config.

Differential Revision: [D75385411](https://our.internmc.facebook.com/intern/diff/D75385411/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154337
Approved by: https://github.com/nautsimon
ghstack dependencies: #154336
2025-05-27 22:18:12 +00:00
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
241f8dc84d Revert "Remove outdated CUDA 11 conditions (#154313)"
This reverts commit 3936e6141c09dab94f21e4fdab7bea4bddf62ac2.

Reverted https://github.com/pytorch/pytorch/pull/154313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/154313#issuecomment-2914230005))
2025-05-27 21:54:41 +00:00
6be829535f [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 20:49:32 +00:00
555fc05868 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 6169ca0b65bcb382faa1a2287278b3717c18f127.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/benjaminglass1 due to Appears to have broken main ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2913975736))
2025-05-27 20:39:09 +00:00
7359705232 Add CPython tests for unittest (#150788)
Tests:
* test_assertions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150788
Approved by: https://github.com/williamwen42
2025-05-27 20:26:17 +00:00
12fc06d267 Add CPython complex tests (#152015)
Tests:
* test_complex.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152015
Approved by: https://github.com/williamwen42
2025-05-27 20:24:28 +00:00
3b218e56dc Add CPython tests for iter/sort (#150797)
Tests:
* test_iter.py
* test_sort.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150797
Approved by: https://github.com/williamwen42
2025-05-27 20:22:34 +00:00
4fd8a54a41 [ez] add docblock for is_accessor_node (#154375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154375
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374
2025-05-27 19:47:32 +00:00
b367e5f6a6 [ROCm][Windows] Fix building torch 2.8 wheel with ROCm (added hipblasLt and rocblas directories) (#153144)
Since rocblas.dll and hipblaslt.dll are copied to torch/lib, the rocblas and hipblaslt directories need to be stored there too (otherwise we get an error after wheel installation while searching for files in rocblas/library and hipblaslt/library, which don't exist). This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153144
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-27 19:40:28 +00:00
fa6ca59079 Revert "Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)"
This reverts commit 2bd95f3a1f07132aa00f5c438c5228866d7dd1f8.

Reverted https://github.com/pytorch/pytorch/pull/154153 on behalf of https://github.com/malfet due to Broke inductor tests, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/154153#issuecomment-2913738047))
2025-05-27 19:23:28 +00:00
6169ca0b65 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-27 19:17:41 +00:00
75bbd4989c [dynamo] Support using symint from dispatcher-style tensor subclass (#154130)
Fixes #146932.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154130
Approved by: https://github.com/laithsakka
2025-05-27 19:05:46 +00:00
8c0f07f944 Revert "[ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)"
This reverts commit 0d4de7872ac019abbd6e87b3391b2276d9d05bd4.

Reverted https://github.com/pytorch/pytorch/pull/153634 on behalf of https://github.com/malfet due to Broke inductor jobs, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/153634#issuecomment-2913619071))
2025-05-27 19:02:59 +00:00
b8452e55bc [Kineto x Insight] Update Kineto submodule (#154426)
Summary: We add a new ActivityType::MTIA_INSIGHT in 20f652846f

Test Plan: CI

Differential Revision: D75454945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154426
Approved by: https://github.com/Skylion007
2025-05-27 18:29:29 +00:00
5075df6fee Make torch importable if compiled without TensorPipe (#154382)
By delaying the import/hiding it behind `torch.distributed.rpc.is_tensorpipe_avaiable()` check
Fixes https://github.com/pytorch/pytorch/issues/154300
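
The general pattern of hiding an optional import behind an availability check can be sketched as follows. Both helper names are hypothetical and the module names are placeholders; this is not the code from the PR:

```python
import importlib
import importlib.util


def is_backend_available(module_name: str) -> bool:
    """Hypothetical availability check: True if the optional module can be found."""
    return importlib.util.find_spec(module_name) is not None


def init_backend(module_name: str):
    """Import the optional backend lazily, only when it is actually requested."""
    if not is_backend_available(module_name):
        raise RuntimeError(f"{module_name} is not available in this build")
    return importlib.import_module(module_name)
```

Because nothing is imported at module load time, the parent package stays importable even when the optional backend was compiled out.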

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382
Approved by: https://github.com/Skylion007
ghstack dependencies: #154325
2025-05-27 18:13:38 +00:00
f472ea63bb [BE] Fix typos in SyntaxError description (#154436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154436
Approved by: https://github.com/seemethere, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-27 18:08:58 +00:00
cfbd99fdfd [Pytorch] Add option to CPU Blas GEMM to avoid output downcast (#154012)
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3, it will always downcast to scalar_t even when opmath_t == output_t (and then do an upcast back to output_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel
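
To illustrate the precision loss, here is a small pure-Python sketch, assuming IEEE half precision as the input type and a Python float standing in for the wider accumulation/output type. `to_half` and `dot` are illustrative helpers, not the BlasKernel code:

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot(a, b, downcast_to_input_type: bool) -> float:
    # Steps 1 and 2: multiply and reduce in the wider accumulation type.
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    # Step 3: the old behaviour always rounded to the input type first.
    return to_half(acc) if downcast_to_input_type else acc


a = [to_half(v) for v in (1.0, 0.001, 0.001)]
b = [1.0, 1.0, 1.0]
kept = dot(a, b, downcast_to_input_type=False)
lossy = dot(a, b, downcast_to_input_type=True)
```

Here `lossy` is rounded to the nearest half-precision value near 1.002, while `kept` retains the extra bits that the wider output type can represent.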

Test Plan: Attention CI passes

Differential Revision: D75023858

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154012
Approved by: https://github.com/Valentine233, https://github.com/aditew01, https://github.com/CaoE, https://github.com/drisspg
2025-05-27 17:43:21 +00:00
1ca082d9a1 [ez] Rewrite comment to be more friendly to non haskellers (#151421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151421
Approved by: https://github.com/aorenste
2025-05-27 17:32:34 +00:00
70fbd5e08c [ez] Add docblock for resolve_unbacked_bindings (#154374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154374
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-27 17:05:49 +00:00
2560c1f3f0 add sticky cache pgo (#154418)
It's a reland of https://github.com/pytorch/pytorch/pull/154394 that hit some mergebot bug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154418
Approved by: https://github.com/malfet
2025-05-27 16:40:18 +00:00
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
0ddfd1ed43 [Intel GPU] Enable mkdnn._linear_pointwise at XPU backend (#140365)
# Motivation

This PR adds post-op fusion support for Linear. The linear-pointwise fusion is expected to be used in graph mode, e.g. torch.compile. The FusionUtils.cpp file defines utility APIs for generating primitive attributes. These APIs will also be used for conv-pointwise fusion, which is in #140372.

# Validation
```bash
   python test/xpu/test_fusion.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140365
Approved by: https://github.com/etaf, https://github.com/guangyey, https://github.com/EikanWang
2025-05-27 15:57:15 +00:00
0d4de7872a [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 15:38:43 +00:00
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
d88699308f [CI][MacOS] Move more dependencies to pypi (#154309)
Hopefully last step before all Mac build/tests could be switched away from conda
- Update cmake version from 3.22 to 3.25, as 3.22 from PyPI seems to be unusable with python-3.12
- Add `--plat-name macosx_11_0_arm64` to setup.py command
- Remove `codesign` for cmake workaround (that was probably never really necessary)
- Install `libpng` and `jpeg-turbo` when building torchbench and build torchaudio without OpenMP (to be fixed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154309
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-27 13:49:40 +00:00
11a51a11af Revert "introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)"
This reverts commit 5c6d7caaaa08f134c3b17ce032cb014527b53417.

Reverted https://github.com/pytorch/pytorch/pull/153432 on behalf of https://github.com/malfet due to Looks like it broke flex attention tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=g6.4xlarge&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/153432#issuecomment-2912562570))
2025-05-27 13:42:34 +00:00
c52a002a22 Add getDeviceProperties api to torch mtia device (#153577)
topic: not user facing

Test Plan: Internal benchmark.

Differential Revision: D74256550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153577
Approved by: https://github.com/nautsimon
2025-05-27 11:55:58 +00:00
2bd95f3a1f Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 11:53:47 +00:00
6f86c1ce1d Add pyrefly.toml (#154144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154144
Approved by: https://github.com/Skylion007
2025-05-27 10:16:30 +00:00
5c6d7caaaa introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors.
In that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors; in those cases we probably want
to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute saying the tensor is contiguous only when it is always contiguous, i.e. we store it only if definitely_contiguous is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-27 08:54:31 +00:00
dec5ab8d98 [MTIA Aten Backend][1.2/n] Migrate as_strided to in-tree, and add unit tests (#154336)
See context in PR https://github.com/pytorch/pytorch/pull/153670

This diff migrates as_strided to in-tree. I found it's not covered by `test_kernel_eager_ci`, so I am also adding unit tests.

Differential Revision: [D75385404](https://our.internmc.facebook.com/intern/diff/D75385404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154336
Approved by: https://github.com/nautsimon
2025-05-27 06:32:38 +00:00
ef6306e1c6 Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit 8d6139b8d8a75aab5ead4262ff59d48615ebee31.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2911206795))
2025-05-27 06:02:14 +00:00
870133b2a0 Use get_device_context in aoti runtime for XPU directly (#154360)
# Motivation
Reuse [c10::xpu::get_device_context](1bebe0424e/c10/xpu/XPUFunctions.h (L27)) directly to reduce overhead, as it returns a cached `sycl::context` managed by PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154360
Approved by: https://github.com/EikanWang
2025-05-27 05:55:59 +00:00
8d89cdceb6 fix a compilation issue when TORCH_XPU_ARCH_LIST is an empty string (#153604)
When `XPU_ARCH_FLAGS` is an empty string, compilation will fail on `C10_STRINGIZE(XPU_ARCH_FLAGS)` in file `torch/csrc/xpu/Module.cpp` on Windows.
This PR fixes this issue by setting `TORCH_XPU_ARCH_LIST` to `""` to avoid an empty string conversion in `C10_STRINGIZE()` when compiling without an AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153604
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-27 05:26:46 +00:00
8d6139b8d8 [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-27 04:54:46 +00:00
912af9b2c2 update torchbench pin (#154256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154256
Approved by: https://github.com/huydhn
2025-05-27 04:40:54 +00:00
8d319607a7 [CPU][Brgemm] add s8s8 GEMM microkernel API (#154358)
As the title. `u8s8` and `u8u8` have already been supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154358
Approved by: https://github.com/leslie-fang-intel, https://github.com/Skylion007, https://github.com/Valentine233
2025-05-27 03:47:56 +00:00
f8010e7b93 [nativert] Move file_util to pytorch core (#153162)
Summary: fbcode//sigmoid/core/common -> fbcode//caffe2/torch/nativert/common

Test Plan: Github CI

Differential Revision: D74328089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153162
Approved by: https://github.com/zhxchen17
2025-05-27 03:42:47 +00:00
70d12ccc3f [Torch] Fix error message formatting in fp8 comparison logic (#153647)
Summary: Using `\` includes all the tabs from the next line in the error message.
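
A minimal sketch of the formatting problem in Python (the message text is made up for illustration and spaces stand in for the tabs): a backslash continuation inside a string literal pulls the next line's indentation into the message, whereas adjacent literals do not.

```python
def bad_message() -> str:
    # The backslash joins the lines inside the literal, so the next line's
    # leading indentation becomes part of the message.
    return "fp8 comparison failed: \
            values differ"


def good_message() -> str:
    # Adjacent string literals are concatenated without the indentation.
    return (
        "fp8 comparison failed: "
        "values differ"
    )


print(repr(bad_message()))
print(repr(good_message()))
```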

Test Plan: Nothing, simply error message fixing

Reviewed By: exclamaforte

Differential Revision: D74539234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153647
Approved by: https://github.com/exclamaforte
2025-05-27 02:51:05 +00:00
100ec0b34a [Inductor] Allow passing in custom lowering dict to register_lowering() (#154344)
This PR adds support for passing a custom lowering dict to `register_lowering()`, which allows systems that use Inductor (e.g. Helion, https://github.com/pytorch-labs/helion/pull/80) to maintain their own lowering dict instead of using the Inductor global `lowerings` dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154344
Approved by: https://github.com/jansel
2025-05-27 01:35:26 +00:00
cyy
3936e6141c Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-27 00:30:14 +00:00
6006352ed3 [BE] Refactor manywheel build scripts (#154372)
1. Remove `CentOS Linux` cases, since it's deprecated
2. Remove logic for old CUDA versions
3. Remove logic for `CUDA_VERSION=12.4` since we deprecated CUDA 12.4 support
4. Simplify setting `USE_CUFILE=1` - only supported on CUDA 12.6 and 12.8 builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154372
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-05-26 23:17:23 +00:00
b643076e4e Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit b6868f290e4882f9c895b1c9476327974288eaba.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2910692163))
2025-05-26 22:09:16 +00:00
aaf5cc13d9 [EASY] use guard_or_false instead of gso in Meta converter (#154234)
This was added in https://github.com/pytorch/pytorch/pull/141659; the current change keeps the same intention:
"I do not want to fail here if I can't tell whether the size is zero or not."
I am not familiar enough with the code to know whether we need a runtime check here, but looking at the current
implementation, guard_or_false seems appropriate: it matches the current behaviour and has the same effect as guard_size_oblivious here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154234
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167, #154172
2025-05-26 21:59:52 +00:00
e33feddb72 used guard_or_false instead of guard_size_oblivious inside maybe_reduce (#154172)
This was added in https://github.com/pytorch/pytorch/pull/119562
the idea in this loop seems to be the following.
```
    if (TORCH_GUARD_SIZE_OBLIVIOUS(size.sym_eq(1))) {
      // NB: we could short circuit this once needs_reduce is true but there's
      // no point since the reduction function will guard on this anyway
      if (!c10::guard_or_false(size.sym_eq(target), __FILE__, __LINE__)) {
        needs_reduce = true;
      }
    } else {
      if (!size.sym_eq(target).expect_true(__FILE__, __LINE__)) {
        fail();
      }
    }
  ```
  1. If we know size == 1:
       1.1: if we know for sure that size == target --> no reduction needed.
       1.2: if we know for sure that size != target --> we do the reduction.
       1.3: if we cannot tell whether size == target --> we do the reduction.
  2. If we do not know whether size == 1:
     we add a runtime assertion that size == target and fail at runtime if size is not equal to target.
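
The cases above can be sketched in plain Python. This is only an illustration: `None` stands for "cannot be decided statically", and the `guard_or_false` and `decide` helpers here are hypothetical stand-ins, not the c10 functions.

```python
from typing import Optional


def guard_or_false(cond: Optional[bool]) -> bool:
    """Return the condition if it is statically known, otherwise False."""
    return bool(cond) if cond is not None else False


def decide(size_is_one: Optional[bool], size_eq_target: Optional[bool]) -> str:
    if guard_or_false(size_is_one):
        # Case 1: size is known to be 1.
        if guard_or_false(size_eq_target):
            return "no reduce"  # 1.1
        return "reduce"  # 1.2 and 1.3
    # Case 2: fall back to a runtime assertion.
    return "runtime assert size == target"


print(decide(True, None))
```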

We could have simplified 1.1 and always done the reduction under 1.1, since doing 1.3 without runtime checks implies
that it is safe, but I suspect the reason is perf, though I am not sure.

Either way, using TORCH_GUARD_OR_FALSE instead of TORCH_GUARD_SIZE_OBLIVIOUS here is appropriate.
There is no clear reason for size-oblivious reasoning, or for this logic not to apply when size is not size-like:
size is always >= 0 anyway, but poor reasoning can leave us unable to infer that even though we know it is true here.

 python test/dynamo/test_misc.py -k test_validate_outputs_unbacked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154172
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167
2025-05-26 21:59:52 +00:00
ab5137b048 used guard_or_false instead of guard_size_oblivious in is_int_or_symint (#154167)
This is a short circuit that we should not fail on. Before this PR we would not fail on u0 or u0+u1
only if they are size-like, but we would fail needlessly on u0-u1, etc.
guard_or_false seems appropriate for that reason.

This was added in https://github.com/pytorch/pytorch/pull/122145. There were no unit tests for me to verify
why it was added, and I could not repro using the associated issue; the example does not work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154167
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164
2025-05-26 21:59:45 +00:00
1da2cc52bc [EASY] remove guard_size_oblivious from is_nonzero proxy call check (#154164)
This was added in https://github.com/pytorch/pytorch/pull/149637.
torch._check can handle unbacked symbols, so there is no need for size-oblivious reasoning here.

Note this does not make is_nonzero unbacked-friendly, but that is a different story.
I ran the test added in https://github.com/pytorch/pytorch/pull/149637 for verification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154164
Approved by: https://github.com/aorenste, https://github.com/bobrenjc93
ghstack dependencies: #154154
2025-05-26 21:59:29 +00:00
f8a2998832 [EASY] used guard_or_false instead of guard_sizes_oblivious in pointless_view (#154154)
The change is direct and clear: the optimization removes pointless_view iff all sizes are the same; if not, we want to return false. There is no need for size-oblivious reasoning.

This was added in https://github.com/pytorch/pytorch/pull/139136; I ran the existing tests that were added in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154154
Approved by: https://github.com/bobrenjc93
2025-05-26 21:59:21 +00:00
e89ee1e217 Pin almalinux version to 8.10-20250519 (#154367)
This PR pins Almalinux version to latest supported 8.10

This is related to: https://github.com/pytorch/pytorch/pull/154364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154367
Approved by: https://github.com/jeanschmidt, https://github.com/wdvr, https://github.com/malfet, https://github.com/huydhn
2025-05-26 20:08:20 +00:00
839c9c6156 Use property instead of ClassVar for Uniform.arg_constraints and Wishart.arg_constraints (#154361)
Fixes #154355

For these two distributions, the constraints depend on the actual values, and so `arg_constraints` cannot be a `ClassVar`.
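
A small sketch of the difference, using hypothetical classes rather than the real `torch.distributions` ones (the constraint values are plain tuples here, purely for illustration):

```python
from typing import ClassVar


class FixedConstraints:
    # Same constraints for every instance, so a class-level attribute works.
    arg_constraints: ClassVar[dict] = {"scale": "positive"}


class UniformLike:
    """Hypothetical stand-in for a distribution whose constraints depend on its values."""

    def __init__(self, low: float, high: float) -> None:
        self.low = low
        self.high = high

    @property
    def arg_constraints(self) -> dict:
        # Each bound is constrained by this instance's other bound.
        return {
            "low": ("less_than", self.high),
            "high": ("greater_than", self.low),
        }


print(UniformLike(0.0, 1.0).arg_constraints)
print(UniformLike(2.0, 3.0).arg_constraints)
```

Because the property is evaluated per instance, two objects with different bounds report different constraints, which a single class-level value could not express.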

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154361
Approved by: https://github.com/Skylion007
2025-05-26 17:48:28 +00:00
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
187d38185e [cutlass backend] Do not raise hard error when re worker has cuda compilation error (#154173)
fbcode specific

Differential Revision: D75262641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154173
Approved by: https://github.com/bertmaher
2025-05-26 17:10:36 +00:00
f55f2f42a7 Add missing docstring for sym_ite (#154201)
`sym_ite` is listed in [the reference page](https://docs.pytorch.org/docs/stable/torch.html) but has no documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154201
Approved by: https://github.com/Skylion007
2025-05-26 15:59:21 +00:00
02445ec8f0 Almalinux image, install glibc-langpack-en (#154364)
After update to: https://hub.docker.com/layers/amd64/almalinux/8/images/sha256-4f63eb966695df3c993deeacec7c73d87728e2ea66d3b48fed4b40cb547fa7c2

Started seeing warning: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
and random Segfaults when using python like:
https://github.com/pytorch/test-infra/actions/runs/15216565225/job/42901732536
```
+++ python -c 'import torch'
./check_binary.sh: line 258:  2276 Segmentation fault      (core dumped) python -c 'import torch'
```

Installing langpack does  resolve these issues: https://github.com/pytorch/test-infra/actions/runs/15256338815/job/42904808826#step:15:2311

Almalinux Docker build without setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15030284546/job/42240978131

Almalinux Docker build with setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15246391200/job/42873875745#step:3:7180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154364
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-05-26 15:56:42 +00:00
4b0ee3f4f2 [BE] Do not templatize unnecessarily (#154305)
`${{ os.runner }}` would always evaluate to macOS for those files
And architecture is always ARM64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154305
Approved by: https://github.com/atalman
2025-05-26 15:00:48 +00:00
7ab4fae62a Fix s390x vectorization compilation in inductor (#153946)
Fix s390x vectorization compilation in inductor.

One of failing tests is
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_add_complex_cpu
but it is still disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153946
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-05-26 12:54:25 +00:00
1bebe0424e Fix platform detection in MKLDNN CMake file (#142067)
When building PyTorch with `USE_XPU=True` and Clang,
the user sees misleading errors related to incorrect platform
detection that assumes that all users that are not using the GNU
compilers are on Windows. We can fix this by simply using CMake's
builtin platform detection variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang, https://github.com/min-jean-cho, https://github.com/guangyey
2025-05-26 06:09:37 +00:00
21e42c5d62 More descriptive error message for torch.nanmean() with complex dtypes (#153252)
Fixes #153132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153252
Approved by: https://github.com/colesbury
2025-05-26 05:42:57 +00:00
b6868f290e [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-26 04:43:10 +00:00
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes done this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-creating the ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
8c16d0e404 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return for *negative* distributed tests, which check the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.
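
A minimal sketch of such a check using only the standard library, assuming a POSIX system. The child command is a stand-in for a test process that aborts; it is not the helper added by this PR.

```python
import signal
import subprocess
import sys

# Child process that aborts itself, standing in for a test that hits a fatal assert.
proc = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    stderr=subprocess.DEVNULL,
)

# On POSIX, a process killed by signal N is reported with return code -N.
aborted = proc.returncode == -signal.SIGABRT
print(proc.returncode, aborted)
```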

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-26 00:56:05 +00:00
b04852e404 Fix deterministic indexing with broadcast (#154296)
Fixes #79987, now for real.
Also removed thrust sort path that was needed for cuda <=11.2 because we no longer support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154296
Approved by: https://github.com/soumith
2025-05-25 21:14:50 +00:00
c3100067ae [ONNX] Update onnx to 1.18 (#153746)
Update onnx python package to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153746
Approved by: https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/malfet
2025-05-25 20:58:47 +00:00
43b2716e89 PYFMT lint grandfathered files 1 (#154261)
lint:
-  test/test_fake_tensor.py
-  test/test_flop_counter.py
- torch/_export/verifier.py

with the same rules as other files. It was a nightmare for me to update tests in one of the skipped files
without being able to lint them locally with lintrunner -a like other files.
Note that those files are under active development; they are not old, untouched files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-05-25 17:36:14 +00:00
5677ab9aab [BE] Correctly pass exceptions raised from rpc_init to CPython (#154325)
By decorating function body with `HANDLE_TH_ERRORS`

Partially addresses https://github.com/pytorch/pytorch/issues/154300

I.e. after that change, importing torch no longer crashes but raises a readable (and actionable) exception
```
>>> import torch
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import torch
  File "/Users/malfet/git/pytorch/pytorch/torch/__init__.py", line 2134, in <module>
    from torch import _VF as _VF, functional as functional  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/functional.py", line 8, in <module>
    import torch.nn.functional as F
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/__init__.py", line 8, in <module>
    from torch.nn.modules import *  # usort: skip # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Bilinear, Identity, LazyLinear, Linear  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/linear.py", line 7, in <module>
    from torch.nn import functional as F, init
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/functional.py", line 11, in <module>
    from torch._jit_internal import (
    ...<5 lines>...
    )
  File "/Users/malfet/git/pytorch/pytorch/torch/_jit_internal.py", line 42, in <module>
    import torch.distributed.rpc
  File "/Users/malfet/git/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 37, in <module>
    from torch._C._distributed_rpc import (  # noqa: F401
    ...<33 lines>...
    )
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154325
Approved by: https://github.com/Skylion007
2025-05-25 17:01:45 +00:00
31ae07b5e7 [CI] Do not install libuv on MacOS (#154307)
It's a tensorpipe submodule and is built from source
Same for `dataclasses`, as it's needed only for python-3.6
And get rid of `nvidia-ml-py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154307
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #154304
2025-05-25 15:30:38 +00:00
6968386385 [BE] Sort requirements files alphabetically (#154304)
Using `sort` tool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154304
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 15:30:38 +00:00
ed27ee8355 Bump setuptools from 70.0.0 to 78.1.1 in /tools/build/bazel (#154075)
Bumps [setuptools](https://github.com/pypa/setuptools) from 70.0.0 to 78.1.1.
Changelog (sourced from [setuptools's changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)):

# v78.1.1
## Bugfixes
- More fully sanitized the filename in PackageIndex._download. ([#4946](https://redirect.github.com/pypa/setuptools/issues/4946))

# v78.1.0
## Features
- Restore access to _get_vc_env with a warning. ([#4874](https://redirect.github.com/pypa/setuptools/issues/4874))

# v78.0.2
## Bugfixes
- Postponed removals of deprecated dash-separated and uppercase fields in `setup.cfg`. All packages with deprecated configurations are advised to move before 2026. ([#4911](https://redirect.github.com/pypa/setuptools/issues/4911))

# v78.0.1
## Misc
- [#4909](https://redirect.github.com/pypa/setuptools/issues/4909)

# v78.0.0
## Bugfixes
- Reverted distutils changes that broke the monkey patching of command classes. ([#4902](https://redirect.github.com/pypa/setuptools/issues/4902))

## Deprecations and Removals
- Setuptools no longer accepts options containing uppercase or dash characters in `setup.cfg`.

... (truncated)

Commits:
- `8e4868a` Bump version: 78.1.0 → 78.1.1
- `100e9a6` Merge pull request [#4951](https://redirect.github.com/pypa/setuptools/issues/4951)
- `8faf1d7` Add news fragment.
- `2ca4a9f` Rely on re.sub to perform the decision in one expression.
- `e409e80` Extract _sanitize method for sanitizing the filename.
- `250a6d1` Add a check to ensure the name resolves relative to the tmpdir.
- `d8390fe` Extract _resolve_download_filename with test.
- `4e1e893` Merge https://github.com/jaraco/skeleton
- `3a3144f` Fix typo: `pyproject.license` -> `project.license` ([#4931](https://redirect.github.com/pypa/setuptools/issues/4931))
- `d751068` Fix typo: pyproject.license -> project.license
- Additional commits viewable in [compare view](https://github.com/pypa/setuptools/compare/v70.0.0...v78.1.1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154075
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-25 15:13:03 +00:00
c113cf5a8f [BE] Remove unused conda-env-Linux-X64 (#154303)
According to https://github.com/search?type=code&q=conda-env-++repo%3Apytorch%2Fpytorch it's not referenced anywhere and has been replaced with `conda-env-ci` a while ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154303
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 14:24:28 +00:00
d8aed0703e [BE][Ez]: Enable ruff rule PLW1507. os.environ is not copied (#154120)
Enables a Ruff rule that checks against shallow-copying os.environ: since it's actually a proxy object, not a dict, a shallow copy still reads and writes the real process environment, which is rarely the desired behavior.
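
A short sketch of the pitfall the rule targets, using only the standard library (the variable name is made up for the demo): `copy.copy(os.environ)` still shares state with the real environment, while `os.environ.copy()` returns a detached plain dict.

```python
import copy
import os

key = "EXAMPLE_ENV_COPY_DEMO"  # hypothetical variable name
os.environ.pop(key, None)

shallow = copy.copy(os.environ)  # still backed by the real environment
shallow[key] = "1"
leaked = key in os.environ  # the write is visible through os.environ

os.environ.pop(key, None)
snapshot = os.environ.copy()  # a plain dict, detached from the process
snapshot[key] = "1"
isolated = key not in os.environ

print(leaked, isolated)
```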
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154120
Approved by: https://github.com/malfet
2025-05-25 14:22:57 +00:00
54932d865e Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 03e102dbe8cbffc2e42a3122b262d02f03571de7.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))
2025-05-25 13:17:27 +00:00
c4ef4090c5 Fix segfault on exit in CachingHostAllocator by signaling background thread to exit (#154117)
Fixes #152008

This PR fixes a segmentation fault that occurred when exiting the program due to improper background thread management in CachingHostAllocator.

Previously, the background thread continued running and called process_events() even after the allocator object was destroyed, leading to a crash on exit.

f12d8d60b1/aten/src/ATen/core/CachingHostAllocator.h (L218)

```cpp
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
  getBackgroundThreadPool()->run([&]() {
    while (true) {
      process_events();  // <-- This line may cause segfault on exit
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
  return true;
}();
```

The fix adds a mechanism to signal the background thread to exit before the object is destructed, ensuring the thread stops safely.
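
The same idea, sketched as a Python analogue rather than the actual C++ change: the polling loop checks a stop flag, and the owner sets the flag and joins the thread before the object goes away. The class and method names are hypothetical.

```python
import threading


class HostAllocatorLike:
    """Hypothetical analogue of an object that owns a background polling thread."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.processed = 0
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def process_events(self) -> None:
        self.processed += 1

    def _loop(self) -> None:
        # Poll until asked to stop instead of looping forever.
        while not self._stop.is_set():
            self.process_events()
            self._stop.wait(0.0001)  # sleep about 100 microseconds, wake early on stop

    def close(self) -> None:
        # Signal the thread and wait for it before the object is torn down.
        self._stop.set()
        self._thread.join()


alloc = HostAllocatorLike()
alloc.close()
print(alloc._thread.is_alive())
```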
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154117
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-25 07:46:12 +00:00
9d922b55ef [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file).

2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. In turn, the child processes inform the main process of a test's completion via a completion queue.

3. Added a test template.
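
The queue handshake in item 2 can be sketched roughly as follows. To keep the example small and portable, threads stand in for the child processes, and the per-rank task queues, the `None` sentinel, and the function names are all assumptions rather than the actual `MultiProcContinousTest` implementation.

```python
import queue
import threading


def worker(rank, tasks, done):
    # Loop until a None sentinel arrives; each item is a test ID to run.
    while True:
        test_id = tasks.get()
        if test_id is None:
            break
        # A real worker would look up and run the test here.
        done.put((rank, test_id))


def run_tests(test_ids, world_size=2):
    task_queues = [queue.Queue() for _ in range(world_size)]
    done = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(rank, task_queues[rank], done))
        for rank in range(world_size)
    ]
    for w in workers:
        w.start()  # started once, analogous to spawning in setUpClass
    results = []
    for test_id in test_ids:
        for q in task_queues:
            q.put(test_id)  # every rank runs the same test
        # Wait for one completion message per rank before moving on.
        results.append(sorted(done.get() for _ in range(world_size)))
    for q in task_queues:
        q.put(None)  # tell the workers to exit
    for w in workers:
        w.join()
    return results


print(run_tests(["test_a", "test_b"]))
```

The workers are created once and reused for every test, which is the point of the rework: per-test setup cost is paid a single time.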

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-25 03:49:29 +00:00
03e102dbe8 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which check for the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.
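
A minimal sketch of the return-code check, assuming a POSIX system and using only the standard library. The child command here is a stand-in for a failing negative test, not the test harness added by this PR.

```python
import signal
import subprocess
import sys

# Child process that aborts itself, standing in for a negative test.
child = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    stderr=subprocess.DEVNULL,
)

# On POSIX, subprocess reports death-by-signal as the negated signal number.
died_from_sigabrt = child.returncode == -signal.SIGABRT
print(child.returncode, died_from_sigabrt)
```

On Windows the return code convention differs, so this check is POSIX-only.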

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-25 03:48:34 +00:00
10c51b11ff Bump protobuf version and refactor tensorboard tests (#154244)
In preparation for https://github.com/pytorch/pytorch/pull/153746, I am bumping protobuf to 5.29.4 and fixing the tensorboard tests first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154244
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-05-25 00:50:07 +00:00
53ecb8159a Introduce statically_known_false (#154291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154291
Approved by: https://github.com/mengluy0125
2025-05-24 14:23:55 +00:00
2dfc0e3327 [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
2025-05-24 09:51:33 +00:00
cyy
8fe7ec6721 Add /Zc:preprocessor for torch libraries in MSVC builds (#147825)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
6503b4a96e Update to using mypy 1.15 (#154054)
The BC break isn't real - mypy decided to start complaining about the way we were typing that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054
Approved by: https://github.com/Skylion007
2025-05-24 04:30:57 +00:00
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
1ab2993345 Add a link to transformer_building_blocks tutorial (#154281)
Cross-link to https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154281
Approved by: https://github.com/mikaylagawarecki
2025-05-24 02:50:24 +00:00
e904d01c16 Make inductor UT to be generic (#154196)
# Motivation
https://github.com/pytorch/pytorch/pull/151773 introduced the UT `test_triton_template_generated_code_caching`, which fails on XPU;
https://github.com/pytorch/pytorch/pull/153895 introduced the UT `test_mutation_rename`, which fails on XPU;

fix https://github.com/pytorch/pytorch/issues/154218

# Additional Context
With this PR, both failing UTs pass on a local machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154196
Approved by: https://github.com/jansel
2025-05-24 02:47:46 +00:00
a19f2cdf29 [draft export] skip when no LOC found (#154190)
Couldn't repro error, but verified fix with @ColinPeppler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154190
Approved by: https://github.com/ColinPeppler
2025-05-24 02:29:34 +00:00
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes the following:
 - Fixes a correctness problem with large integer types (though it probably makes the op slower, this cannot be avoided if one wants to compute an accurate answer)
 - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph)
 - Eliminates need for several correctness workarounds

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
8f08bdb7f2 [MPS][BE] Code dedup (#154290)
Eliminate some copy-pasta by introducing `REGISTER_FLOAT_BINARY_OP` and `REGISTER_INTEGER_BINARY_OP` macros
Use `_METAL_310_PLUS` to guard bfloat dtype use
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154290
Approved by: https://github.com/yangw-dev, https://github.com/wdvr
ghstack dependencies: #154275
2025-05-24 01:41:31 +00:00
e5f63f4f66 [CI] Move Mac testing to 3.12 (#154177)
Prep step to completely move away from Conda during the builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154177
Approved by: https://github.com/huydhn, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269, #154270
2025-05-24 01:41:20 +00:00
11a490f32f [CI] Reuse old whl on more workflows (#154285)
Still only on main branch, not PRs, so that we can monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154285
Approved by: https://github.com/malfet
2025-05-24 01:25:35 +00:00
308beeeb56 [dynamo] Use UUID for compiled function variable names. (#154148)
Summary:
We previously assigned each compiled function variable a name based on an in-process global counter. This works fine within the same process, but when we're trying to serialize the states with precompile, we need a way to load these compiled functions back without colliding with names in the existing global scope.

Changing the counter to a true global uuid seems to resolve this issue.

For example, the new variable name will look like:
```
__compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43
```
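
A minimal sketch of how a name in that shape could be produced. `unique_compiled_fn_name` is a hypothetical helper for illustration; the actual dynamo code may build the name differently.

```python
import uuid


def unique_compiled_fn_name(index: int) -> str:
    """Hypothetical helper: build a globally unique variable name."""
    # uuid4() is random, so names do not collide across processes.
    # Dashes are replaced so the result is a valid Python identifier.
    suffix = str(uuid.uuid4()).replace("-", "_")
    return f"__compiled_fn_{index}_{suffix}"


name = unique_compiled_fn_name(0)
print(name, name.isidentifier())
```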

Test Plan: CI

Differential Revision: D75244901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154148
Approved by: https://github.com/jansel
2025-05-24 01:08:42 +00:00
7ba6fb69e6 [Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #152417, #152418, #153364
2025-05-23 23:20:02 +00:00
84b657d0b5 Add Vectorized FP8 E5M2 (#153364)
**Summary**
This PR mainly adds the `Vectorized<Float8_e5m2>` class to support the vectorization of `FP8 E5M2` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods such as `mul`, `abs`, `eq`, etc.
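
For intuition about the conversion itself: E5M2 has the same sign and exponent layout as IEEE half precision, with only the top 2 mantissa bits kept. The scalar sketch below relies on that fact and truncates the low mantissa bits. It is an assumption-laden illustration, not the `Vectorized` implementation, which is SIMD code and would normally round rather than truncate; both function names are hypothetical.

```python
import struct


def float_to_e5m2_bits(x: float) -> int:
    """Hypothetical scalar conversion: float -> fp16 -> top byte."""
    # struct's "e" format is IEEE half precision; it raises OverflowError
    # for values outside the fp16 range.
    half = int.from_bytes(struct.pack("<e", x), "little")
    # Keep sign (1), exponent (5), and the top 2 mantissa bits. Truncation.
    return half >> 8


def e5m2_bits_to_float(bits: int) -> float:
    """Hypothetical scalar conversion: E5M2 byte -> float."""
    # Re-expand to fp16 by zero-filling the dropped mantissa bits.
    return struct.unpack("<e", (bits << 8).to_bytes(2, "little"))[0]


roundtrip = e5m2_bits_to_float(float_to_e5m2_bits(1.5))
print(hex(float_to_e5m2_bits(1.0)), roundtrip)
```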

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E5M2Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153364
Approved by: https://github.com/jgong5, https://github.com/CaoE, https://github.com/vkuzo
ghstack dependencies: #152417, #152418
2025-05-23 23:11:25 +00:00
b77a6504fa [Inductor][CPP] Enable vectorized fp8 quant dequant (#152418)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE
ghstack dependencies: #152417
2025-05-23 23:05:17 +00:00
080b74ce67 Add Vectorized FP8 E4M3 (#152417)
**Summary**
This PR mainly adds the `Vectorized<Float8_e4m3fn>` class to support the vectorization of `FP8 E4M3` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods such as `mul`, `abs`, `eq`, etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E4M3Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152417
Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/vkuzo
2025-05-23 22:56:56 +00:00
bab59d3c28 Upgrade to CUDA 12.8.1 for nightly binaries (#152923)
Upgrade current CUDA 12.8 builds to 12.8.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152923
Approved by: https://github.com/atalman
2025-05-23 22:37:05 +00:00
f0b2706914 remove sleef_arm target (#154166)
Summary:
X-link: https://github.com/pytorch/executorch/pull/11082

We shouldn't need an ARM-specific variant; we have select() where we should need it.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D74356413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154166
Approved by: https://github.com/kimishpatel, https://github.com/malfet, https://github.com/Skylion007
2025-05-23 22:16:01 +00:00
86a160353e [BE] Don't run windows builds in pull.yml (#154264)
We already run windows builds and tests [during trunk.yml](c13eeaa718/.github/workflows/trunk.yml (L115-L130)).

Spot checking for failures of this job in pull.yml shows that most of the time this job fails, the failure correlates with other build jobs failing as well, so it's not offering much unique signal.

Given that we'll run this job before merging the PR as part of trunk.yml anyways, the trade off of extra signal from getting a windows build signal a little earlier doesn't seem worth the infra investment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154264
Approved by: https://github.com/malfet
2025-05-23 22:03:19 +00:00
65f0cf3df5 [mergebot] Do not block on autoformat workflow (#154236)
Helps with https://github.com/pytorch/pytorch/issues/154084

Merge sometimes fails due to autoformat failing.  I believe it's because the author doesn't have write perms/workflow running perms -> needs approval for workflows.  On merge, the bot adds the merge label -> triggers autoformat workflow -> needs approval (even though it will end up getting skipped because the label doesn't match) -> merge sees and fails

So I put an ugly exception for the workflow in mergebot

Some restrictions to keep in mind:
* Need to checkout the PRs code changes to run lint/format on them -> possible security issue if someone modifies a linter/formatter
* The (third party) reusable action used in the autoformat workflow requires the trigger to be pull_request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154236
Approved by: https://github.com/malfet
2025-05-23 22:00:34 +00:00
bb17f9c98b [AOTAutogradCache] Fix CHROMIUM_EVENT_LOG being none (#154258)
It turns out if you import something that's None at import time in python, and later update the value, the one you imported stays none:

```
import torch
from torch._dynamo.utils import CHROMIUM_EVENT_LOG
class Foo:
  pass
torch._dynamo.utils.CHROMIUM_EVENT_LOG =  Foo()

print(CHROMIUM_EVENT_LOG) # None
```

This fixes the bug so we get AOTAutogradCache instant events again
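
The same binding behavior can be reproduced without torch, together with one way around it: look the value up through the module at use time. The module name `fake_utils` and the attribute `EVENT_LOG` are made up for this sketch, and the PR itself may resolve the issue differently.

```python
import sys
import types

# Build a throwaway module so the example does not depend on torch.
mod = types.ModuleType("fake_utils")  # hypothetical module name
mod.EVENT_LOG = None
sys.modules["fake_utils"] = mod

from fake_utils import EVENT_LOG  # copies the current binding (None)
import fake_utils

fake_utils.EVENT_LOG = object()  # the module attribute is rebound later

stale = EVENT_LOG              # still None: the local name was bound once
fresh = fake_utils.EVENT_LOG   # attribute lookup sees the updated value
print(stale is None, fresh is not None)
```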

Differential Revision: [D75305770](https://our.internmc.facebook.com/intern/diff/D75305770/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154258
Approved by: https://github.com/oulgen
2025-05-23 21:53:31 +00:00
0e4f1b8a06 [CI] Update MacOS conda requirements (#154270)
Pick package versions which are compatible with both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154270
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269
2025-05-23 21:44:50 +00:00
5db1503846 [CI] Update MacOS numba and scipy versions (#154269)
Pick versions that are supported by both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154269
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271
2025-05-23 21:44:49 +00:00
aa3eab2ce6 Fix tcp init when using port 0 (#154156)
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to the bug in the conditional and will get error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`
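
The conditional in torch is not shown here, so the following is only a sketch of how this class of bug arises: a truthiness test treats port `0` as missing, while an explicit `None` check does not. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def has_port_truthy(url: str) -> bool:
    """Buggy pattern: 0 is falsy, so port 0 looks like 'no port'."""
    return bool(urlparse(url).port)


def has_port_explicit(url: str) -> bool:
    """Safer pattern: only a truly absent port is reported as missing."""
    return urlparse(url).port is not None


url = "tcp://localhost:0"
print(has_port_truthy(url), has_port_explicit(url))
```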

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2025-05-23 21:41:58 +00:00
3c0b93afc5 Re-enable link linter (#153280)
And make URL linter always succeed for now.
I'll monitor the logs manually and experiment with it further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153280
Approved by: https://github.com/albanD
2025-05-23 20:56:25 +00:00
6f34d141ab [MPS][BE] Delete complex_div (#154275)
An absolute no-op: delete `complex_div` from `UnaryKernel.metal` and use identical one from `c10/metal/utils.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154275
Approved by: https://github.com/dcci
2025-05-23 20:53:50 +00:00
dec6a47996 [BE] Delete unused pip-requirements-iOS.txt (#154271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154271
Approved by: https://github.com/clee2000
ghstack dependencies: #154237, #154268
2025-05-23 20:08:19 +00:00
acd0873d3b [CI] Fix TestDynamoTimed.test_ir_count for 3.12 (#154268)
Python-3.12 emits the same bytecode as 3.13 for code in question
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154268
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237
2025-05-23 20:08:19 +00:00
28af44285b Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 499a76b844bbcbc5465cb76c617b3076c1b0fd65.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see fe784c5a2c/1 ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))
2025-05-23 19:44:08 +00:00
fe784c5a2c Fix torchbind path in AOTI package loader (#154265)
Summary: as title, fix the path in package loader and fix the test to take the additional dir into consideration.

Test Plan:
```
buck run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:torchbind
```

Reviewed By: angelayi

Differential Revision: D75308904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154265
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-23 19:32:53 +00:00
90855835ff Revert "[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)"
This reverts commit 269fa8028f68b29176e21886108634f48b1eced7.

Reverted https://github.com/pytorch/pytorch/pull/154155 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/154155#issuecomment-2905514934))
2025-05-23 19:08:40 +00:00
3b21d79225 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers to use the native PytorchFileReader/Writer, the open source PT2 archive now has an additional folder. However, this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 19:04:36 +00:00
499a76b844 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which check for the effectiveness of NaN asserts, watchdog throws, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-23 19:04:28 +00:00
561a11aa68 Revert "Patch the _is_conv_node function (#153749)"
This reverts commit c985cec5b2545d46af682d486b18866eee5dffd5.

Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))
2025-05-23 19:04:20 +00:00
4ff19ecf66 Revert "[export] Move PT2ArchiveWriter/Reader to torch/export (#153795)"
This reverts commit 7e80f23516a86e18ae5bc5579d3005c1e7610102.

Reverted https://github.com/pytorch/pytorch/pull/153795 on behalf of https://github.com/malfet due to Looks like it broke lots of tests, see ec368a1903/1 ([comment](https://github.com/pytorch/pytorch/pull/153795#issuecomment-2905415496))
2025-05-23 18:29:08 +00:00
ec368a1903 Add sitemap (#154158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154158
Approved by: https://github.com/albanD
2025-05-23 18:01:00 +00:00
0d62fd5c3c [MTIA Aten Backend][2/n] Migrate clamp ops(clamp.out/clamp_min.out/clamp_max.out) from out-of-tree to in-tree (#154015)
Summary:
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree (had to migrate the 3 ops together, because clamp.out calls all 3 stubs, which are also called by the other 2 ops):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which is needed by clamp
3. Also introduced the `--mtia` flag to torchgen to prevent OSS from generating MTIA code. (Otherwise we got a link error such as `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`.)

Differential Revision: D74674418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-23 17:59:47 +00:00
bcb2125f0a [BE][CI] Update expecttest version to 0.3.0 (#154237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154237
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/atalman
2025-05-23 17:27:41 +00:00
cae25ef4e5 [c10d] Enhance Error Logging in new_subgroups() for Non-Divisible World Sizes (#154124)
Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness.
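
As a rough illustration of an error message that carries the actual numbers, here is a sketch with a hypothetical helper. The exact wording and structure used in `new_subgroups()` are not shown in this log, so both are assumptions.

```python
def check_group_size(world_size: int, group_size: int) -> int:
    """Hypothetical validation helper; returns the number of subgroups."""
    if group_size <= 0 or world_size % group_size != 0:
        # Include both values so the message is useful even when the caller
        # exposes group_size under a different name.
        raise ValueError(
            f"The world size ({world_size}) must be divisible by "
            f"'group_size' ({group_size})"
        )
    return world_size // group_size


print(check_group_size(8, 4))
```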

Test Plan: contbuild & OSS CI

Differential Revision: D75226925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
2025-05-23 17:12:43 +00:00
e927ba6dbd [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. Swizzles are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.
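
The two-stage search described above can be sketched as follows. The `bench` callable, the config names, and the toy cost model are all assumptions made for illustration; this is not the Inductor implementation.

```python
from typing import Callable, Sequence, Tuple


def prescreen_autotune(
    configs: Sequence[str],
    swizzles: Sequence[int],
    bench: Callable[[str, int], float],
    prescreen_swizzle: int = 2,
    top_x: int = 5,
) -> Tuple[str, int]:
    """Greedy two-stage search; returns the best (config, swizzle) found."""
    # Stage 1: rank every config using a single fixed swizzle.
    ranked = sorted(configs, key=lambda c: bench(c, prescreen_swizzle))
    finalists = ranked[:top_x]
    # Stage 2: sweep all swizzles, but only for the top X configs.
    candidates = [(c, s) for c in finalists for s in swizzles]
    return min(candidates, key=lambda cs: bench(*cs))


# Toy cost model (made up): per-config base time plus a swizzle penalty.
base = {f"cfg{i}": 1.0 + 0.1 * i for i in range(20)}
penalty = {1: 0.00, 2: 0.01, 4: 0.02}


def toy_bench(config: str, swizzle: int) -> float:
    return base[config] + penalty[swizzle]


best = prescreen_autotune(list(base), [1, 2, 4], toy_bench)
print(best)
```

With 20 configs, 3 swizzles, and X = 5, this benchmarks 20 + 15 combinations instead of 60. The result matches the exhaustive search only when the prescreen ranking keeps the true winner among the top X, which is the empirical observation the commit relies on.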

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-23 17:12:25 +00:00
04a6fe7914 Update provenance tracking doc (#154062)
Summary: Update the doc to reflect the changes in https://github.com/pytorch/pytorch/pull/153584/files#diff-e0cdb58c0f84f56f20c5433339b6d83c470dcde47847e2328effea6bedd4cd27 and https://github.com/pytorch/tlparse/pull/110

Test Plan: CI

Differential Revision: D75155981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154062
Approved by: https://github.com/svekars, https://github.com/desertfire
2025-05-23 17:09:52 +00:00
7d8ea5db69 Disable cache and utilization stats uploading steps on s390x (#150297)
There are no AWS credentials available on s390x runners. These steps are failing anyway due to that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150297
Approved by: https://github.com/seemethere
2025-05-23 16:49:38 +00:00
7e80f23516 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```
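A minimal sketch of how a reader could tolerate both layouts shown above, assuming the archive is a zip file and that the `archive_format` entry marks the archive root. `find_archive_root`, `make_archive`, and `read_entry` are hypothetical helpers for illustration, not the API of `torch.export.pt2_archive`, and the file contents are placeholder values.

```python
import io
import zipfile

# Assumption: the entry named "archive_format" (seen in both directory
# listings) sits directly under the archive root in either layout.
MARKER = "archive_format"


def find_archive_root(names):
    """Return the folder prefix containing MARKER ("" for the old layout)."""
    for name in names:
        prefix, _, leaf = name.rpartition("/")
        if leaf == MARKER:
            return prefix + "/" if prefix else ""
    raise ValueError("no archive_format entry found")


def make_archive(prefix):
    """Build a tiny in-memory zip whose entries live under `prefix`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(prefix + "archive_format", "placeholder")
        zf.writestr(prefix + "byteorder", "little")
        zf.writestr(prefix + ".data/version", "1")
    buf.seek(0)
    return buf


def read_entry(buf, relative_name):
    """Read an entry by a name that does not depend on the layout."""
    with zipfile.ZipFile(buf) as zf:
        root = find_archive_root(zf.namelist())
        return zf.read(root + relative_name).decode()
```

With this, `read_entry(make_archive(""), "byteorder")` and `read_entry(make_archive("tmp/"), "byteorder")` resolve to the same entry, which is the backward-compatibility property the PR describes.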

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Differential Revision: D74616598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 15:40:25 +00:00
214e4cef9f Fix RMSNorm doc rendering (#154205)
By removing the `::func::` decorator, which adds unneeded parentheses

Test plan: Check https://docs-preview.pytorch.org/pytorch/pytorch/154205/generated/torch.nn.RMSNorm.html#rmsnorm
that now renders as
<img width="704" alt="image" src="https://github.com/user-attachments/assets/443f605d-75a6-41ef-8971-21e7dc8ef9f6" />

Fixes https://github.com/pytorch/pytorch/issues/154184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154205
Approved by: https://github.com/mikaylagawarecki
2025-05-23 15:39:29 +00:00
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation does not cache.
This new approach propagates the fallback value to eval and returns it, allowing eval to cache and reducing log spam and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
4b7abce6a4 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

The latest version of this also:
1. Addresses the problem that caused #153891.
    The issue was that with caching ops are required to support `__eq__`.  Unfortunately _RecordFunction is minimalistic and doesn't support that - so in the off-chance that two keys hash to the same value the `__eq__` check would raise an exception.

    Apparently this was much more common on MacOS where memory patterns end up with more reuse (so the object IDs are the same and give you the same hash value for objects that use pointer hash).

    Tested locally on MacOS where running
```
python test/inductor/test_torchinductor.py GPUTests
```
was pretty much guaranteed to fail (at least for me) somewhere around test 100-200 and passed all 800 tests after this change.

Another way to test this is to run the inductor tests with `torch._subclasses.fake_tensor._DispatchCacheKey.__hash__` monkey-patched to return a constant (causing all values to hash-collide) but this can't really be checked-in since it causes the cache lookup to turn into an O(n) lookup which takes a crazy long time to run through all the tests...

2. Folds in #153780 to ensure that exceptions raised from the op don't include the context from the cache key bypass.
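To illustrate the hash-collision failure described in item 1: a plain-Python sketch, assuming nothing about the real `_RecordFunction` beyond "hashable but not comparable". `PointerHashedKey` and `lookup` are hypothetical names used only for this example.

```python
class PointerHashedKey:
    """Hypothetical stand-in for a key part that hashes but cannot be compared."""

    def __init__(self, forced_hash=None):
        self._forced_hash = forced_hash

    def __hash__(self):
        # Identity-based hash by default; optionally force a chosen value
        # to simulate two distinct objects landing on the same hash.
        return id(self) if self._forced_hash is None else self._forced_hash

    def __eq__(self, other):
        raise TypeError("equality is not supported for this object")


def lookup(cache, key):
    """Classify a dict lookup as 'hit', 'miss', or 'eq-raised'."""
    try:
        cache[key]
        return "hit"
    except KeyError:
        return "miss"
    except TypeError:
        return "eq-raised"


stored = PointerHashedKey(forced_hash=1)
cache = {stored: "entry"}
same_object = lookup(cache, stored)                           # identity match, no __eq__ call
different_hash = lookup(cache, PointerHashedKey(forced_hash=2))  # hashes differ, no __eq__ call
colliding_hash = lookup(cache, PointerHashedKey(forced_hash=1))  # equal hashes force __eq__
```

A dict only calls `__eq__` when two distinct objects have equal hashes, which is why the failure was rare in general and more likely when object IDs get reused.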

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-23 15:03:31 +00:00
866142ff16 Revert "Update the heuristic for AArch64 bmm/baddbmm (#149122)"
This reverts commit d759a517af3e6b2337bf8f8e0d1734e64e470f1b.

Reverted https://github.com/pytorch/pytorch/pull/149122 on behalf of https://github.com/jeanschmidt due to breaking internal models, @malfet may you help merge this? ([comment](https://github.com/pytorch/pytorch/pull/149122#issuecomment-2904703075))
2025-05-23 14:54:54 +00:00
5859582ee4 [BE][MPS] Delete unused complex_mul_out (#154175)
It's no longer called, after `mul` has been migrated to binary op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154175
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-05-23 13:44:24 +00:00
2225231a14 Enable AArch64 CI scripts to be used for local dev (#143190)
- Allow user to specify custom ComputeLibrary directory, which is then built rather than checking out a clean copy
- Remove `setup.py clean` in build. The CI environment should be clean already, removing this enables incremental rebuilds
- Use all cores for building ComputeLibrary

Mostly a port of https://github.com/pytorch/builder/pull/2028 with the conda part removed, because aarch64_ci_setup.sh has changed and can now handle being called twice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143190
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
2025-05-23 12:09:59 +00:00
25149cd173 [c10d] Add more tests to prevent extra context (#154174)
Loop a bunch of sync ops and see if any of them creates extra context.
Requires nvml to check number of processes resident on a device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154174
Approved by: https://github.com/atalman
2025-05-23 09:54:01 +00:00
ba5d45d22e Add assertion to align with cuda (#153233)
Fixes #153137

Aligned batch_norm_cpu_out assertion to [batch_norm_cuda_out](a7ea115494/aten/src/ATen/native/cuda/Normalization.cu (L436)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153233
Approved by: https://github.com/malfet
2025-05-23 07:32:43 +00:00
5623d30228 [Minimizer] Gracefully exit when there is no discrepancy in block mode (#154076)
Summary:
Previously, when there is no discrepancy in results for block mode, net_min_base will throw an OOB error.

This occurs due to the block _block_traverse_impl returning an OOB after exhausting subgraphs all the way down to a single node

There is also an issue where we may get an unsound subgraph (i.e. mark an earlier node as the "end" even if the correct end is later). This is due to an incorrect check (start_idx == mid), where there can still be two values left before the program prematurely returns

Test Plan:
Buck UI: https://www.internalfb.com/buck2/52524c26-ace5-4593-8a4b-843a54eb206a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3096224973363310
Network: Up: 0B  Down: 15MiB  (reSessionID-cd404e97-395f-49fc-8381-373e90a1378f)
Executing actions. Remaining     0/1
Command: test.
Time elapsed: 53.7s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D75143242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154076
Approved by: https://github.com/jfix71
2025-05-23 06:42:07 +00:00
8342b9371e [ROCm] Prefer hipblaslt for gfx1200, gfx1201 (#153610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153610
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2025-05-23 06:01:53 +00:00
26471fc203 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
b33b7d5c8c [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-23 05:45:35 +00:00
269fa8028f [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

Also saw an error saying .o not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-23 04:51:36 +00:00
5bb156a7fd [dynamo] raise observed exception for module attribute errors (#153659)
Fixes https://github.com/pytorch/pytorch/issues/153605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153659
Approved by: https://github.com/StrongerXi
2025-05-23 03:56:26 +00:00
db1f33147b [audio hash update] update the pinned audio hash (#154001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154001
Approved by: https://github.com/pytorchbot
2025-05-23 03:51:21 +00:00
c1055f41a6 Data dependent free reshape. (#153198)
#### change 1: if compute_strides fails for reshape, just clone.

Let's consider the most general case: if torch.compile is asked to reshape [u0, u1][u3, u4] -> [u5, u6], what should it do?
The shape is general enough to represent both contiguous and non-contiguous tensors, tensors where a clone-free reshape can happen and others where it can't.  The current algorithm will fail with data-dependent errors.

The general idea is that if it's impossible to tell whether the reshape can happen in place (because for some concrete inputs
it will and for others it will not), then it's OK to take the general path and clone, instead of failing or asking the user to give hints.
**Because the user wants a single graph (a single compilation)**, this is the only way it can be done.
Had this been a view, the user would be explicitly asking for a copy-free reshape, and we would fail asking for more
information (hints in torch.checks form).

With this change, reshape works as follows:
1. If we know the input is contiguous, we convert the reshape to a view.
2. If compute_strides succeeds, we use a view. (compute_strides was changed to not fail when unbacked symbols are present; instead it returns nullptr if it can't compute the strides, meaning we should use a clone.)
3. If neither 1 nor 2 works, clone and use a view.
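The three steps can be sketched in plain Python with concrete integer shapes. This is a simplified illustration, not the ATen code: `contiguous_strides`, `compute_view_strides`, and `plan_reshape` are hypothetical names, and returning `None` plays the role of the nullptr result. With concrete integers every comparison is decidable; in the PR's setting the sizes may be unbacked symbols, and "cannot decide" is treated like "cannot view".

```python
def contiguous_strides(shape):
    """Row-major strides for a shape."""
    strides, running = [], 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return strides[::-1]


def compute_view_strides(old_shape, old_strides, new_shape):
    """Strides that expose the same memory as new_shape, or None.

    Assumes at least one dimension, all sizes >= 1, equal element counts.
    """
    new_strides = [0] * len(new_shape)
    view_d = len(new_shape) - 1
    chunk_base_stride = old_strides[-1]
    tensor_numel = 1
    view_numel = 1
    for tensor_d in range(len(old_shape) - 1, -1, -1):
        tensor_numel *= old_shape[tensor_d]
        # A chunk is a run of old dims that is contiguous within itself.
        end_of_chunk = tensor_d == 0 or (
            old_shape[tensor_d - 1] != 1
            and old_strides[tensor_d - 1] != tensor_numel * chunk_base_stride
        )
        if not end_of_chunk:
            continue
        # Consume new dims until they cover exactly this chunk.
        while view_d >= 0 and (view_numel < tensor_numel or new_shape[view_d] == 1):
            new_strides[view_d] = view_numel * chunk_base_stride
            view_numel *= new_shape[view_d]
            view_d -= 1
        if view_numel != tensor_numel:
            return None  # a new dim would straddle two chunks
        if tensor_d > 0:
            chunk_base_stride = old_strides[tensor_d - 1]
            tensor_numel = 1
            view_numel = 1
    return new_strides if view_d == -1 else None


def plan_reshape(shape, strides, new_shape):
    """Return ("view" | "clone+view", resulting strides)."""
    if list(strides) == contiguous_strides(shape):  # step 1
        return "view", contiguous_strides(new_shape)
    view_strides = compute_view_strides(shape, strides, new_shape)
    if view_strides is not None:  # step 2
        return "view", view_strides
    return "clone+view", contiguous_strides(new_shape)  # step 3
```

For example, a transposed `[3, 2]` tensor with strides `[1, 3]` cannot be flattened to `[6]` without a copy, so `plan_reshape` takes the clone path, while a `[2, 2, 3]` slice with strides `[12, 3, 1]` can still be viewed as `[2, 6]`.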

Side note: having a view does not mean that inductor will not clone; inductor has a pass that converts all views back to reshapes, and it has its own logic for dealing with those.

#### change 2: skip _reshape_view_helper and fall back to simpler logic if it fails.
We trace _reshape_view_helper during fake tensor tracing, but not during proxy tracing. Hence such tracing won't affect the graph (it only computes output shapes of several operations). We should not fail there, because it should always be possible for us to pass it in the case of reshape.

i.e. when reshape_symint was called we would have either cloned, or compute_strides succeeded, so the view should pass. What I did is the following: we run _reshape_view_helper; if it fails due to unbacked symbols, we call _view_simple, which will always succeed for reshapes (it might fail for views when it's impossible to do the view, in which case we throw the DDE that was thrown by the original algorithm).

Ideally I would want to register _view_simple as the meta for view and avoid calling _reshape_view_helper completely, but I am running into some issues with the dispatcher with subclasses and I do not have time to debug it. Namely, one test
would end up calling some C++ view function that does not support symints during meta dispatch when I register a
Python meta decomposition:
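The try-then-fall-back control flow described here can be sketched generically. `DataDependentError` and `view_with_fallback` are hypothetical stand-ins, not the names used in the PR; the point is only that a failure of the simple path re-raises the original error.

```python
class DataDependentError(RuntimeError):
    """Hypothetical stand-in for the error raised on an undecidable guard."""


def view_with_fallback(primary, simple, *args):
    """Try the full algorithm first; fall back to the simple one on a DDE."""
    try:
        return primary(*args)
    except DataDependentError as original:
        try:
            return simple(*args)
        except Exception:
            # The simple path could not do the view either: report the
            # error from the original algorithm, not the fallback's.
            raise original from None
```

Keeping the original error means callers see the same diagnostics they would have seen before the fallback existed.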
```python test/dynamo/test_subclasses.py SubclassTests.test_subclass_views_dynamic_True ```
 https://github.com/pytorch/pytorch/issues/153303. I will follow up with that change in a separate PR.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @bdhirsh

 Two other alternatives to registering _view_simple as meta, or to the try/catch approach in this PR, are:
 1. Call _view_simple if any input is dynamic; see #153521.
 2. If we make is_compiling work for framework code tracing (it does not work right now), we can call _view_simple
 if is_compiling.

#### Note:
Reshape can still fail when is_contiguous is called. The next PR will handle that by calling is_known_contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153198
Approved by: https://github.com/etaf, https://github.com/bobrenjc93
2025-05-23 01:45:16 +00:00
f74842d665 [DTensor] enable SimpleFSDP's composability with Tensor Parallel (#152286)
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.

`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` was not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
2025-05-23 01:40:38 +00:00
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect.  After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard atm.  In a period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.
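A rough sketch of the filtering idea, assuming each record is a dict with `benchmark` and `metric` keys. That record layout and the name `records_to_upload` are hypothetical; only the three metric names come from the query output above.

```python
# Metric names from the query output; everything else here is illustrative.
DEBUG_ONLY_METRICS = frozenset({"user_stack", "reason", "model"})


def records_to_upload(records):
    """Drop TorchInductor debug-only records before uploading."""
    return [
        record
        for record in records
        if not (
            record["benchmark"] == "TorchInductor"
            and record["metric"] in DEBUG_ONLY_METRICS
        )
    ]
```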

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
3c0cbf4b44 Update GH action to use the correct label (#154126)
Update GH action to use the correct label for the docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154126
Approved by: https://github.com/AlannaBurke, https://github.com/clee2000
2025-05-23 00:29:43 +00:00
31f3ee0966 [BE][Ez]: Enable PT014 check for duplicate parameterize test cases (#154118)
Enables the Ruff rule [PT014](https://docs.astral.sh/ruff/rules/pytest-duplicate-parametrize-test-cases/), which flags duplicate test cases in `pytest.mark.parametrize`; a duplicate is likely an error since it tests the same thing twice.
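For illustration, a decorator such as `@pytest.mark.parametrize("x", [1, 2, 1])` repeats the case `1`. The sketch below finds such repeats in a list of parameter sets; `duplicate_cases` is a hypothetical helper that compares by runtime equality, whereas the linter works on source text, so the two can disagree on edge cases.

```python
def duplicate_cases(cases):
    """Return (first_index, duplicate_index) pairs for repeated cases."""
    seen = []  # a list rather than a set, so unhashable cases still work
    duplicates = []
    for index, case in enumerate(cases):
        for first_index, earlier in seen:
            if earlier == case:
                duplicates.append((first_index, index))
                break
        else:
            # No earlier equal case was found: remember this one.
            seen.append((index, case))
    return duplicates
```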
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154118
Approved by: https://github.com/malfet
2025-05-23 00:00:53 +00:00
7b25ff7cf2 [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-05-22 23:37:03 +00:00
59c5fff2aa Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404)"
This reverts commit a79e621c1c11bcef5f816b9770b751237b84f620.

Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))
2025-05-22 22:26:59 +00:00
f2cce45657 [libc++ readiness][caffe2] No reason to check for "ext/stdio_filebuf.h" (#154080)
Summary: There should be no reason to check for existence of this GNU C++ header here in this file. It doesn't include it. Removing this condition to make it build under libc++.

Differential Revision: D75179136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154080
Approved by: https://github.com/soumith
2025-05-22 22:23:39 +00:00
c985cec5b2 Patch the _is_conv_node function (#153749)
Summary: torch.ops.aten.conv2d.padding is also conv2d node

Differential Revision: D74898941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749
Approved by: https://github.com/andrewor14, https://github.com/Skylion007
2025-05-22 22:17:02 +00:00
413664b3c5 catch CSE recursion depth errors (#154039)
Fixes #153777

CSE is an optimization and shouldn't block a compile if it hits recursion depth limits. Unfortunately we can't write this iteratively due to a dependency on `ast.unparse`, which necessarily needs to do recursion. This PR catches recursion depth errors and opts out of CSE when they occur.
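The opt-out pattern can be sketched with a toy recursive pass. `simplify` is a hypothetical stand-in for the CSE pass (it is not CSE and does not use `ast.unparse`); the relevant part is that `RecursionError` is caught and the input is returned unoptimized.

```python
def simplify(expr):
    """Toy recursive pass: rewrite ("add", x, 0) to x."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "add" and right == 0:
        return left
    return (op, left, right)


def simplify_or_skip(expr):
    """Run the optional pass; skip it if the expression is too deep."""
    try:
        return simplify(expr)
    except RecursionError:
        # The pass is only an optimization, so the unoptimized input
        # is still a valid result.
        return expr


def nested_add(depth):
    """Build a left-nested expression iteratively (no recursion needed)."""
    expr = "x"
    for _ in range(depth):
        expr = ("add", expr, 1)
    return expr
```

A shallow input such as `("add", ("add", "x", 0), 0)` is simplified to `"x"`, while `nested_add(10_000)` exceeds the default recursion limit and comes back unchanged.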

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154039
Approved by: https://github.com/Microve
2025-05-22 20:17:19 +00:00
cad0727fe1 Rename the provenance tracing artifact name for kernel <-> post_grad nodes mapping (#154046)
Summary:
Context:

Recently we've added support for a couple more kernel types besides inductor-generated triton kernels, such as cpu cpp kernels and extern kernels.

The name that appears in the tlparse chrome link can be confusing to users.

Rename from

`inductor_triton_kernel_to_post_grad_nodes.json`

to `inductor_generated_kernel_to_post_grad_nodes.json`

Test Plan: CI

Differential Revision: D75159042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
2025-05-22 19:20:56 +00:00
4277907d02 [binary builds] Linux aarch64 CUDA builds. Make sure tag is set correctly (#154045)
1. This should set the Manylinux 2.28 tag correctly for CUDA Aarch builds.
I believe we used to have something similar in the old script:
https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/build_aarch64_wheel.py#L811

``Tag: cp311-cp311-linux_aarch64 ``-> ``Tag: cp311-cp311-manylinux_2_28_aarch64``

2. Remove section for CUDA 12.6, since we are no longer building CUDA 12.6 aarch64 builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154045
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-05-22 18:36:13 +00:00
788d9cb2d7 [3/n][Optimus][Auto-AC][reland] Support any fp8 quantization type and set scaling as the default" (#154057)
Summary:
This is a reland of D74910193.
We change the dtype to torch.float8_e5m2 in unit test since it is not supported.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Differential Revision: D75169792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154057
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:26:34 +00:00
c2660d29a5 [ROCm] Added unit test to test the cuda_pluggable allocator (#154041)
Added a unit test that includes the cuda_pluggable allocator and replicates the apex setup.py to build the nccl_allocator extension

This test checks whether https://github.com/pytorch/pytorch/pull/152179 helps to build the cuda pluggable allocator in ROCm/Apex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154041
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-05-22 18:22:15 +00:00
5b8f422561 [PT2][Optimus] Fix a typo in decompose_mm (#154048)
Summary: As titled

Differential Revision: D75160513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154048
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:11:40 +00:00
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps`

Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
7421c21b5e remove unused code. (#153979)
Remove the unused cmake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
fc859077a0 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return  x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-22 17:25:38 +00:00
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974eae105aad837188d5d2254ea9783b37.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
7d3dab6b90 Revert "[BE]: Type previously untyped decorators (#153726)"
This reverts commit b7d08defe9cfe1595ff680f845b39f5e03a89555.

Reverted https://github.com/pytorch/pytorch/pull/153726 on behalf of https://github.com/yangw-dev due to sorry, it seems like your pr failed typecheck error internally, [D75155486](https://www.internalfb.com/diff/D75155486) ([comment](https://github.com/pytorch/pytorch/pull/153726#issuecomment-2901911114))
2025-05-22 16:49:08 +00:00
a15550b776 [Cutlass] Use env var for EVT flag (#154099)
Swaps out hard flag for environment variable in inductor config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154099
Approved by: https://github.com/eellison
2025-05-22 16:36:57 +00:00
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d36188f419a47f3b1315f9fb373035ed66.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30336b370298dbe9677b95019882f2a8.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
also a lowering with a reduction for large window_sizes for
fractional_max_pool2d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
9a8c42ff94 Get rid of unused code in linters (#154043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154043
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-05-22 15:24:54 +00:00
35ddad284d update mutation renames (#153895)
Thanks to @PaulZhang12 for original find. When we finalize a multi template buffer, we need to reflect mutation renaming in dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153895
Approved by: https://github.com/PaulZhang12
2025-05-22 14:54:39 +00:00
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the models, `phlippe_resnet`, needs a higher tolerance for fp16 on CUDA 12.8. I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scenes, but this should help fix the CI failure.

Also removes some leftover expected accuracy results for CUDA 12.4, which is no longer used in CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
4439255148 [aotd] Support saved tensors hooks in aot_autograd (#150032)
https://github.com/pytorch/pytorch/issues/148222

Goal:

At the moment, autograd saved tensors hooks run in eager mode after the compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce the amount of memory used for saved tensors, for example by quantizing them or offloading them to CPU.
Running all hooks together after the forward is suboptimal for peak memory.
A better solution is to put the hooks in the graph, as close as possible to the last usage of the tensor.

The goal is to get user-specified autograd saved tensors hooks into the graph.

Logic:

UX:
If the user specifies `with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm)`, where pack_gm and unpack_gm are torch.fx.GraphModule instances, then AotAutograd retraces those graph modules, applying decompositions and functionalization in aot_autograd, and inlines the resulting graphs in the forward epilogue and backward prologue.

The user may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.

This is also possible: the user can put the logic into a torch.fx.wrap function and use symbolic trace to make a GraphModule.

In that case AotAutograd caching works only if the user explicitly sets the "user_cache_hash" metadata on the torch.fx.wrap call_function node.

If this metadata is set, the aot_autograd cache can use the saved cache artifact.
If the metadata is not set, the cache is bypassed.

Dynamo:
Dynamo traces the pack and unpack hooks, installs them as subgraphs, and explicitly adds them to the output_graph (those subgraphs are otherwise unused and would not be copied into the result by default).

The complexity here is that at this point we do not have example inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The resulting subgraphs are added to the hashing of the AotAutograd cache.

In AotAutograd we retrace the graph with the true saved tensors coming from the partitioner.

Backwards Compatibility:
Because current hooks are executed in eager mode and not all of them are traceable, we only try to put into the graph those hooks that the user has explicitly marked with an annotation (@_inlineable_saved_tensors_hooks).
For other hooks, or if compiled autograd is enabled, the existing logic is kept.

Recompilations:
Hooks are guarded with a lambda guard matching the function id, to cause recompilation if the user reruns the compiled function.

Aot_autograd:
After the partitioner has prepared the forward and backward modules, we trace the graphs prepared by Dynamo for the pack and unpack hooks and inline them in the forward epilogue and the backward prologue. Forward outputs and backward inputs change accordingly, transparently to the user.

We do not try to place the hooks close to the last usage, etc.; we rely on Inductor to do this optimization.

```
INFO: TRACED GRAPH
 ===== Forward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== saved_tensors_pack_hook add 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== saved_tensors_unpack_hook add 3 =====
 <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== Forward graph 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]);  add = None
        return (view, _to_copy, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph 3 =====
 <eval_with_key>.21 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32);  add_packed_2 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy);  tangents_1 = _to_copy = None
        return (None, None, add_7)

```

Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
2025-05-22 14:09:38 +00:00
f12d8d60b1 Add hint message when parameters is empty in clip_grad_norm_ (#151529)
Fixes #148259

## Changes

- Print a warning message when the `parameters` generator is exhausted

## Test Result
### print warning
```python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()

params_to_clip = model.parameters()

for p in params_to_clip:
    print(p.shape)

max_norm = 1.0
norm_type = 2.0
total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type)
print(f"total_norm: {total_norm}")
```

```bash
/home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur.
  warnings.warn(
total_norm: 0.0
```
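
The warning fires in the script above because `model.parameters()` returns a generator, and the `for` loop that prints the shapes consumes it before `clip_grad_norm_` receives it. A minimal plain-Python sketch of that exhaustion, using a hypothetical `fake_parameters` generator as a stand-in for `model.parameters()` (an assumption for illustration, no PyTorch involved):

```python
def fake_parameters():
    """Hypothetical stand-in for model.parameters(): yields two items."""
    yield "weight"
    yield "bias"


params = fake_parameters()

# The first loop consumes every item the generator can produce.
first_pass = [p for p in params]

# A second pass over the same generator object sees nothing.
second_pass = [p for p in params]

print(first_pass)
print(second_pass)

# Materializing the generator into a list makes it safe to iterate twice.
params_list = list(fake_parameters())
reused = [p for p in params_list] + [p for p in params_list]
```

Under this reading, passing a list (or calling `model.parameters()` again) at the clipping call site would avoid the empty-generator case.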

### UT

```bash
pytest test/test_nn.py -k test_clip_grad_norm
```

![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529
Approved by: https://github.com/jbschlosser
2025-05-22 11:23:39 +00:00
40e6ca24ef Update CPU Inductor merge rules by adding more CPP Template (#152086)
**Summary**
Add more CPP Template into the CPU Inductor merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152086
Approved by: https://github.com/atalman
2025-05-22 09:46:26 +00:00
2f57ee579d S390x update docker image (#153619)
Add ninja-build for pytorch tests.
Switch to gcc 14, which includes a fix for the interaction between precompiled headers and s390x vectorization.
Disable -Werror when building onnxruntime.
Pin onnx version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153619
Approved by: https://github.com/huydhn
2025-05-22 09:34:46 +00:00
d7a83ab67b Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 (#149312)
Fixes #102261

## Changes

- Use the `_is_initial` flag instead of the `self.last_epoch == 0` condition to decide whether `lr` should be the initial value
- Add a test for the `ExponentialLR` checkpoint use case

## Test Result

```bash
pytest -s test/optim/test_lrscheduler.py  -vv
```

![image](https://github.com/user-attachments/assets/6fd32bcc-b4fb-4421-b891-620bd4900dc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149312
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-22 08:42:37 +00:00
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
c1b7dbc52a [dynamo] unimplemented -> unimplemented_v2 in variables/dict.py (#154040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154040
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-22 06:46:10 +00:00
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add a `C10_NODEPRECATED` check for XPU. This prevents the XPU codebase from using `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt`, and `c10::make_optional`; use their counterparts in std instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
482e5b6660 [inductor] Added precompilation_timeout_seconds into a config instead of hardcoded (#153788)
Fixes #153392

- Updated config.py to add the timeout as a config var to be tuned dynamically (default is 3600s).
- Passed the var as a kwarg during call on instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153788
Approved by: https://github.com/henrylhtsang
2025-05-22 06:44:02 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6.
Doing so exposed a few unit test failures. Some, if not all, of them would take a while to root-cause and fix, so they are temporarily skipped, with the issues below filed to track them.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
4bcff4af99 Move prologue_supported_inputs computations to def_kernal (#150869)
This avoids replaying load_input on a cache hit in the generate_code_cache.
The idea is that if a template has prologue_loads_all_inputs = True, then
all inputs are loaded and hence there is no need to replay.

Effect on the current benchmark on a local run on dev server.
18549985383 -> 15072230073
25697270062 -> 20738613297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150869
Approved by: https://github.com/eellison
2025-05-22 06:24:44 +00:00
4421aee558 torch.compile: Supress stdout / stderr output from subprocesses when local (#153837)
Summary:
This output is extremely noisy: e.g., on a 96-core machine with 8 ranks, you
can get ~700 duplicate sets of logs from each worker.

Differential Revision: D74907920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153837
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-05-22 05:49:43 +00:00
6536 changed files with 361429 additions and 184341 deletions


@ -2,7 +2,7 @@ build --cxxopt=--std=c++17
build --copt=-I.
# Bazel does not support including its cc_library targets as system
# headers. We work around this for generated code
# (e.g. c10/macros/cmake_macros.h) by making the generated directory a
# (e.g. torch/headeronly/macros/cmake_macros.h) by making the generated directory a
# system include path.
build --copt=-isystem --copt bazel-out/k8-fastbuild/bin
build --copt=-isystem --copt bazel-out/darwin-fastbuild/bin

.bc-linter.yml Normal file

@ -0,0 +1,15 @@
version: 1
paths:
include:
- "**/*.py"
exclude:
- ".*"
- ".*/**"
- "**/.*/**"
- "**/.*"
- "**/_*/**"
- "**/_*.py"
- "**/test/**"
- "**/benchmarks/**"
- "**/test_*.py"
- "**/*_test.py"


@ -3,10 +3,18 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
export TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
export TORCH_NVCC_FLAGS="-compress-mode=size"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
@ -20,13 +28,27 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
# Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling CUDA libraries with wheel for aarch64."
else
echo "Using nvidia libs from pypi for aarch64."
# Fix platform constraints in PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64
# Replace 'platform_machine == "x86_64"' with 'platform_machine == "aarch64"'
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS//platform_machine == \'x86_64\'/platform_machine == \'aarch64\'}"
echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
export USE_NVIDIA_PYPI_LIBS=1
fi
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi
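
The `${VAR//pattern/replacement}` expansion in the script above replaces every occurrence of the x86_64 platform marker with the aarch64 one. A rough Python equivalent, as a sketch only: the helper name `retarget_requirements` and the sample requirement string are hypothetical and not taken from the build scripts.

```python
def retarget_requirements(requirements: str) -> str:
    """Rewrite every x86_64 platform marker to aarch64, like bash ${VAR//old/new}."""
    return requirements.replace(
        "platform_machine == 'x86_64'",
        "platform_machine == 'aarch64'",
    )


# Hypothetical value in the style of PYTORCH_EXTRA_INSTALL_REQUIREMENTS.
sample = (
    "example-pkg-a==1.0; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "example-pkg-b==2.0; platform_system == 'Linux' and platform_machine == 'x86_64'"
)

retargeted = retarget_requirements(sample)
print(retargeted)
```

Like the double-slash form in bash, `str.replace` without a count rewrites all matches, so every requirement in the list is retargeted, not just the first.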


@ -31,103 +31,228 @@ def build_ArmComputeLibrary() -> None:
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path, desired_cuda) -> None:
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("Tag:"):
lines[i] = line.replace("-linux_", "-manylinux_2_28_")
print(f"Updated tag from {line} to {lines[i]}")
break
with open(filename, "w") as f:
f.writelines(lines)
def patch_library_rpath(
folder: str,
lib_name: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Apply patchelf to set RPATH for a library in torch/lib"""
lib_path = f"{folder}/tmp/torch/lib/{lib_name}"
if use_nvidia_pypi_libs:
# For PyPI NVIDIA libraries, construct CUDA RPATH
cuda_rpaths = [
"$ORIGIN/../../nvidia/cudnn/lib",
"$ORIGIN/../../nvidia/nvshmem/lib",
"$ORIGIN/../../nvidia/nccl/lib",
"$ORIGIN/../../nvidia/cusparselt/lib",
]
if "130" in desired_cuda:
cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
else:
cuda_rpaths.extend(
[
"$ORIGIN/../../nvidia/cublas/lib",
"$ORIGIN/../../nvidia/cuda_cupti/lib",
"$ORIGIN/../../nvidia/cuda_nvrtc/lib",
"$ORIGIN/../../nvidia/cuda_runtime/lib",
"$ORIGIN/../../nvidia/cufft/lib",
"$ORIGIN/../../nvidia/curand/lib",
"$ORIGIN/../../nvidia/cusolver/lib",
"$ORIGIN/../../nvidia/cusparse/lib",
"$ORIGIN/../../nvidia/nvtx/lib",
"$ORIGIN/../../nvidia/cufile/lib",
]
)
# Add $ORIGIN for local torch libs
rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
else:
# For bundled libraries, just use $ORIGIN
rpath = "$ORIGIN"
if os.path.exists(lib_path):
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
)
def copy_and_patch_library(
src_path: str,
folder: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Copy a library to torch/lib and patch its RPATH"""
if os.path.exists(src_path):
lib_name = os.path.basename(src_path)
shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusparse.so.12",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
]
if enable_cuda:
libs_to_copy += [
# Check if we should use PyPI NVIDIA libraries or bundle system libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Using nvidia libs from pypi - skipping CUDA library bundling")
# For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
# We only need to bundle non-NVIDIA libraries
minimal_libs_to_copy = [
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
# Copy minimal libraries to unzipped_folder/torch/lib
for lib_path in minimal_libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Patch torch libraries used for searching libraries
torch_libs_to_patch = [
"libtorch.so",
"libtorch_cpu.so",
"libtorch_cuda.so",
"libtorch_cuda_linalg.so",
"libtorch_global_deps.so",
"libtorch_python.so",
"libtorch_nvshmem.so",
"libc10.so",
"libc10_cuda.so",
"libcaffe2_nvrtc.so",
"libshm.so",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
for lib_name in torch_libs_to_patch:
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
else:
print("Bundling CUDA libraries with wheel")
# Original logic for bundling system CUDA libraries
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# CUDA version-specific libraries
if "130" in desired_cuda:
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.0",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
]
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy libraries to unzipped_folder/torch/lib
for lib_path in libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
if f.is_dir() and f.name.endswith(".dist-info"):
replace_tag(f"{f.path}/WHEEL")
break
os.system(f"wheel pack {folder}/tmp/ -d {folder}")
os.system(f"rm -rf {folder}/tmp/")
def complete_wheel(folder: str) -> str:
@ -194,8 +319,20 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars += "MAX_JOBS=5 "
# Handle PyPI NVIDIA libraries vs bundled libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Configuring build for PyPI NVIDIA libraries")
# Configure for dynamic linking (matching x86 logic)
build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
else:
print("Configuring build for bundled NVIDIA libraries")
# Keep existing static linking approach - already configured above
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
@ -242,6 +379,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path, desired_cuda)
package_cuda_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")
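
The `replace_tag` helper in the diff above rewrites the platform tag in the wheel's `WHEEL` metadata file. A self-contained sketch of that behavior follows; the function body mirrors the diff with indentation restored and the `print` call dropped, while the temporary file and its contents are hypothetical sample data.

```python
import os
import tempfile


def replace_tag(filename: str) -> None:
    """Rewrite the first 'Tag:' line from a linux_ tag to manylinux_2_28_."""
    with open(filename) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith("Tag:"):
            lines[i] = line.replace("-linux_", "-manylinux_2_28_")
            break  # only the first Tag line is touched
    with open(filename, "w") as f:
        f.writelines(lines)


# Hypothetical WHEEL metadata, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    wheel_file = os.path.join(tmp, "WHEEL")
    with open(wheel_file, "w") as f:
        f.write("Wheel-Version: 1.0\nTag: cp312-cp312-linux_aarch64\n")
    replace_tag(wheel_file)
    with open(wheel_file) as f:
        result = f.read()

print(result)
```

Because of the `break`, a metadata file with several `Tag:` lines would only have its first one rewritten.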


@ -438,9 +438,7 @@ def build_torchvision(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -495,9 +493,7 @@ def build_torchdata(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -553,9 +549,7 @@ def build_torchtext(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -613,9 +607,7 @@ def build_torchaudio(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
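
The four hunks above switch from `split('-')[0]` to `split('-', maxsplit=1)[0]`. Both return the same first element; the second just stops splitting after the first separator. A small sketch with a hypothetical branch name (the value `v2.1.0-rc3-extra` is an assumption for illustration):

```python
branch = "v2.1.0-rc3-extra"  # hypothetical branch name

# Old form: splits on every '-' and then discards everything but the head.
all_parts = branch[1:].split("-")
old_version = all_parts[0]

# New form: splits at most once, so the tail stays in one piece.
two_parts = branch[1:].split("-", maxsplit=1)
new_version = two_parts[0]

print(all_parts)
print(two_parts)
print(old_version == new_version)
```

Since only index 0 is used, limiting the split changes no behavior here; it only avoids building list entries that are thrown away.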


@ -5,7 +5,7 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
pip install click mock tabulate networkx==2.0
pip -q install --user "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
pip -q install "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
fi
# Skip tests in environments where they are not built/applicable
@ -147,8 +147,8 @@ export DNNL_MAX_CPU_ISA=AVX2
if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
# TODO(sdym@meta.com) remove this when the linked issue resolved.
# py is temporary until https://github.com/Teemu/pytest-sugar/issues/241 is fixed
pip install --user py==1.11.0
pip install --user pytest-sugar
pip install py==1.11.0
pip install pytest-sugar
# NB: Warnings are disabled because they make it harder to see what
# the actual erroring test is
"$PYTHON" \


@ -36,3 +36,104 @@ See `build.sh` for valid build environments (it's the giant switch).
# Set flags (see build.sh) and build image
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
```
## [Guidance] Adding a New Base Docker Image
### Background
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
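The pull, run, and execute pattern described above can be sketched as a dry run. This is a simplified illustration, not the workflow's actual commands: the image reference, the `run` helper, and the `docker run` flags are assumptions, and in CI the image is supplied by the `calculate-docker-image` step.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image reference; in CI the calculate-docker-image step provides it.
DOCKER_IMAGE="${DOCKER_IMAGE:-example.registry/pytorch/ci-image:some-tag}"

# Print each command instead of running it, so the sketch works without Docker.
run() { echo "+ $*"; }

# 1. Pull the image chosen for this job.
run docker pull "${DOCKER_IMAGE}"

# 2. Start a container from it and run the PyTorch build script inside.
run docker run --rm -v "${PWD}:/workspace" -w /workspace "${DOCKER_IMAGE}" \
  bash .ci/pytorch/build.sh
```

Replacing the body of `run` with `"$@"` would execute the commands for real. The test workflows follow the same shape with a test script in place of `build.sh`.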
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
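A minimal sketch of how an install script might consume such a pin file, assuming a hypothetical library `mylib` pinned in `mylib.txt`. The `read_pin` helper, the temporary directory, and the commit hash are placeholders for illustration and are not part of the repository.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the commit stored in a pin file, stripping whitespace.
read_pin() {
  local pin_file="$1"
  if [[ ! -f "${pin_file}" ]]; then
    echo "missing pin file: ${pin_file}" >&2
    return 1
  fi
  tr -d '[:space:]' < "${pin_file}"
}

# Throwaway directory standing in for .ci/docker/ci_commit_pins/.
pins_dir="$(mktemp -d)"
echo "0123456789abcdef0123456789abcdef01234567" > "${pins_dir}/mylib.txt"

# An install script would pass this value to `git checkout`.
MYLIB_COMMIT="$(read_pin "${pins_dir}/mylib.txt")"
echo "would check out mylib at ${MYLIB_COMMIT}"

rm -rf "${pins_dir}"
```

Keeping the commit in a one-line text file means a pin update touches only that file.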
#### 2. Configure the Base Docker Image
1. **Add a new base Docker image configuration** (if applicable):
Add the configuration in `.ci/docker/build.sh`. For example:
```bash
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
NEW_ARG_1=yes
;;
```
2. **Add build arguments to Docker build command**:
If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
```bash
docker build \
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
```
3. **Update Dockerfile logic**:
Update the Dockerfile to use the new argument. For example, in `ubuntu/Dockerfile`:
```dockerfile
ARG NEW_ARG_1
# Set up environment for NEW_ARG_1
RUN if [ -n "${NEW_ARG_1}" ]; then bash ./do_something.sh; fi
```
4. **Add the Docker configuration** in `.github/workflows/docker-builds.yml`:
The `docker-builds.yml` workflow pre-builds the Docker images whenever changes occur in the `.ci/docker/` directory, including updates to pinned commits.
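The path from a `case` entry to a `--build-arg` can be traced without running Docker. The sketch below is a simplified stand-in for `.ci/docker/build.sh`, assuming the placeholder tag and `NEW_ARG_1` from the steps above. The glob pattern `*-new1` is a shortcut for this illustration, whereas the real script matches full tag names.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder tag reused from the configuration example in this guide.
tag="pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1"

# Default to empty so images that do not set the flag still build.
NEW_ARG_1=""
case "$tag" in
  *-new1)
    NEW_ARG_1=yes
    ;;
esac

# Collect the arguments in an array and print them instead of invoking docker.
build_args=(--build-arg "NEW_ARG_1=${NEW_ARG_1:-}")
printf 'docker build %s\n' "${build_args[*]}"
```

Because the Dockerfile guards on `[ -n "${NEW_ARG_1}" ]`, an empty value leaves the extra setup step skipped for every other image.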

View File

@ -1,7 +1,7 @@
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
ARG BASE_TARGET=cuda${CUDA_VERSION}
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8 as base
FROM amd64/almalinux:8.10-20250519 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
@ -11,6 +11,8 @@ ARG DEVTOOLSET_VERSION=11
RUN yum -y update
RUN yum -y install epel-release
# Install glibc-langpack-en to make sure the en_US.UTF-8 locale is available
RUN yum -y install glibc-langpack-en
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
@ -50,10 +52,6 @@ ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
@ -62,6 +60,14 @@ FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
ENV DESIRED_CUDA=12.8
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
ENV DESIRED_CUDA=12.9
FROM cuda as cuda13.0
RUN bash ./install_cuda.sh 13.0
ENV DESIRED_CUDA=13.0
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
@ -74,9 +80,10 @@ ADD ./common/install_mnist.sh install_mnist.sh
RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.4 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.9 /usr/local/cuda-12.9 /usr/local/cuda-12.9
COPY --from=cuda13.0 /usr/local/cuda-13.0 /usr/local/cuda-13.0
# Final step
FROM ${BASE_TARGET} as final

View File

@ -50,30 +50,23 @@ if [[ "$image" == *xla* ]]; then
exit 0
fi
if [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *-jammy* ]]; then
if [[ "$image" == *-jammy* ]]; then
UBUNTU_VERSION=22.04
elif [[ "$image" == *-noble* ]]; then
UBUNTU_VERSION=24.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
elif [[ "$image" == *centos* ]]; then
extract_version_from_image_name centos CENTOS_VERSION
fi
if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
else
echo "Unable to derive operating system base..."
exit 1
fi
DOCKERFILE="${OS}/Dockerfile"
# When using ubuntu - 22.04, start from Ubuntu docker image, instead of nvidia/cuda docker image.
if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
if [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
@ -83,10 +76,13 @@ elif [[ "$image" == *cuda*linter* ]]; then
elif [[ "$image" == *linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter/Dockerfile"
elif [[ "$image" == *riscv* ]]; then
# Use RISC-V specific Dockerfile
DOCKERFILE="ubuntu-cross-riscv/Dockerfile"
fi
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
_UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152
_UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
@ -98,9 +94,8 @@ tag=$(echo $image | awk -F':' '{print $2}')
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$tag" in
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
@ -109,9 +104,28 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.0
CUDNN_VERSION=9
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11)
CUDA_VERSION=13.0.0
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -121,56 +135,18 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -179,44 +155,24 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
pytorch-linux-jammy-py3-clang12-onnx)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
VISION=yes
ONNX=yes
;;
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
VISION=yes
TRITON=yes
;;
pytorch-linux-focal-py3.11-clang10)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=10
VISION=yes
TRITON=yes
;;
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-rocm-n-1-py3)
pytorch-linux-jammy-py3.10-clang12)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
CLANG_VERSION=12
VISION=yes
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)
if [[ $tag =~ "jammy" ]]; then
ANACONDA_PYTHON_VERSION=3.10
else
ANACONDA_PYTHON_VERSION=3.12
fi
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
@ -225,25 +181,40 @@ case "$tag" in
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-noble-rocm-alpha-py3)
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.0
ROCM_VERSION=7.0
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-xpu-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
pytorch-linux-jammy-xpu-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.2
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3-gcc11-inductor-benchmarks)
# TODO (huydhn): Upgrade this to Python >= 3.10
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
@ -252,32 +223,20 @@ case "$tag" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDNN_VERSION=9
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-clang12)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
CUDA_VERSION=12.8.1
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
@ -303,21 +262,22 @@ case "$tag" in
GCC_VERSION=11
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
VISION=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -327,11 +287,15 @@ case "$tag" in
GCC_VERSION=11
ACL=yes
VISION=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-noble-riscv64-py3.12-gcc14)
GCC_VERSION=14
;;
*)
# Catch-all for builds that are not hardcoded.
VISION=yes
@ -341,7 +305,6 @@ case "$tag" in
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
@ -370,14 +333,6 @@ esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
no_cache_flag=""
progress_flag=""
# Do not use cache and progress=plain when in CI
@ -394,7 +349,6 @@ docker build \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "VISION=${VISION:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
@ -402,9 +356,6 @@ docker build \
--build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
@ -422,6 +373,7 @@ docker build \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "UNINSTALL_DILL=${UNINSTALL_DILL}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "OPENBLAS=${OPENBLAS:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
@ -464,7 +416,14 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
fi
if [ -n "$GCC_VERSION" ]; then
if !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
if [[ "$image" == *riscv* ]]; then
# Check RISC-V cross-compilation toolchain version
if !(drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
echo "RISC-V GCC_VERSION=$GCC_VERSION, but:"
drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version
exit 1
fi
elif !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
echo "GCC_VERSION=$GCC_VERSION, but:"
drun gcc --version
exit 1

View File

@ -39,6 +39,7 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt

View File

@ -1 +1 @@
b173722085b3f555d6ba4533d6bbaddfd7c71144
56392aa978594cc155fa8af48cd949f5b5f1823a

View File

@ -0,0 +1,2 @@
transformers==4.54.0
soxr==0.5.0

View File

@ -1 +0,0 @@
243e186efbf7fb93328dd6b34927a4e8c8f24395

View File

@ -1 +1 @@
v2.26.5-1
v2.27.5-1

View File

@ -0,0 +1 @@
v2.27.7-1

View File

@ -0,0 +1 @@
74a23feff57432129df84d8099e622773cf77925

View File

@ -1 +1 @@
b0e26b7359c147b8aa0af686c20510fb9b15990a
1b0418a9a454b2b93ab8d71f40e59d2297157fae

View File

@ -1 +1 @@
c8757738a7418249896224430ce84888e8ecdd79
fccfc522864cf8bc172abe0cd58ae5581e2d44b9

View File

@ -23,6 +23,10 @@ conda_install() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION" $*
}
conda_install_through_forge() {
as_jenkins conda install -c conda-forge -q -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION" $*
}
conda_run() {
as_jenkins conda run -n py_$ANACONDA_PYTHON_VERSION --no-capture-output $*
}

View File

@ -15,6 +15,9 @@ install_ubuntu() {
elif [[ "$UBUNTU_VERSION" == "22.04"* ]]; then
cmake3="cmake=3.22*"
maybe_libiomp_dev=""
elif [[ "$UBUNTU_VERSION" == "24.04"* ]]; then
cmake3="cmake=3.28*"
maybe_libiomp_dev=""
else
cmake3="cmake=3.5*"
maybe_libiomp_dev="libiomp-dev"
@ -30,18 +33,6 @@ install_ubuntu() {
maybe_libomp_dev=""
fi
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not rely on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
@ -70,7 +61,6 @@ install_ubuntu() {
libasound2-dev \
libsndfile-dev \
${maybe_libomp_dev} \
${maybe_libnccl_dev} \
software-properties-common \
wget \
sudo \

View File

@ -4,12 +4,8 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
@ -21,7 +17,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -64,11 +59,16 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Miniforge installer doesn't install sqlite by default
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
conda_install sqlite
fi
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.29=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
if [[ $(uname -m) != "aarch64" ]]; then
pip_install mkl==2024.2.0
pip_install mkl-static==2024.2.0
pip_install mkl-include==2024.2.0
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -82,6 +82,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_run ${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION})
fi
if [[ "$UBUNTU_VERSION" == "24.04"* ]] ; then
conda_install_through_forge libstdcxx-ng=14
fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt

View File

@ -3,11 +3,10 @@
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads # @lint-ignore
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t 3.14.0 3.14.0t"}
function check_var {
if [ -z "$1" ]; then
@ -24,9 +23,8 @@ function do_cpython_build {
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
if [[ "$py_ver" == *"t" ]]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
@ -68,32 +66,29 @@ function do_cpython_build {
ln -s pip3 ${prefix}/bin/pip
fi
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
# packaging is needed to create symlink since wheel no longer provides needed information
${prefix}/bin/pip install packaging==25.0 wheel==0.45.1 setuptools==80.9.0
local abi_tag=$(${prefix}/bin/python -c "from packaging.tags import interpreter_name, interpreter_version; import sysconfig ; from sysconfig import get_config_var; print('{0}{1}-{0}{1}{2}'.format(interpreter_name(), interpreter_version(), 't' if sysconfig.get_config_var('Py_GIL_DISABLED') else ''))")
ln -sf ${prefix} /opt/python/${abi_tag}
}
function build_cpython {
local py_ver=$1
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
local py_suffix=$py_ver
local py_folder=$py_ver
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PY_VER_SHORT
else
wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_ver
# Special handling for nogil
if [[ "${py_ver}" == *"t" ]]; then
py_suffix=${py_ver::-1}
py_folder=$py_suffix
fi
# Update to rc2 due to https://github.com/python/cpython/commit/c72699086fe4
if [ "$py_suffix" == "3.14.0" ]; then
py_suffix="3.14.0rc2"
fi
wget -q $PYTHON_DOWNLOAD_URL/$py_folder/Python-$py_suffix.tgz -O Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_suffix
rm -f Python-$py_ver.tgz
}

View File

@ -10,6 +10,8 @@ else
arch_path='sbsa'
fi
NVSHMEM_VERSION=3.3.24
function install_cuda {
version=$1
runfile=$2
@ -40,18 +42,42 @@ function install_cudnn {
rm -rf tmp_cudnn
}
function install_118 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"
install_cuda 11.8.0 cuda_11.8.0_520.61.05_linux
function install_nvshmem {
cuda_major_version=$1 # e.g. "12"
nvshmem_version=$2 # e.g. "3.3.9"
install_cudnn 11 $CUDNN_VERSION
case "${arch_path}" in
sbsa)
dl_arch="aarch64"
;;
x86_64)
dl_arch="x64"
;;
*)
dl_arch="${arch}"
;;
esac
CUDA_VERSION=11.8 bash install_nccl.sh
tmpdir="tmp_nvshmem"
mkdir -p "${tmpdir}" && cd "${tmpdir}"
CUDA_VERSION=11.8 bash install_cusparselt.sh
# nvSHMEM license: https://docs.nvidia.com/nvshmem/api/sla.html
# This pattern is a lie as it is not consistent across versions; for 3.3.9 it was cuda_ver-arch-nvshmem-ver
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
suffix=".tar.xz"
url="https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}${suffix}"
ldconfig
# download, unpack, install
wget -q "${url}"
tar xf "${filename}${suffix}"
cp -a "${filename}/include/"* /usr/local/cuda/include/
cp -a "${filename}/lib/"* /usr/local/cuda/lib64/
# cleanup
cd ..
rm -rf "${tmpdir}"
echo "nvSHMEM ${nvshmem_version} for CUDA ${cuda_major_version} (${arch_path}) installed."
}
function install_124 {
@ -69,12 +95,14 @@ function install_124 {
}
function install_126 {
CUDNN_VERSION=9.5.1.17
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.6 bash install_nccl.sh
CUDA_VERSION=12.6 bash install_cusparselt.sh
@ -82,114 +110,35 @@ function install_126 {
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
function install_129 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.9.1 in the same container
install_cuda 12.9.1 cuda_12.9.1_575.57.08_linux
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
install_nvshmem 12 $NVSHMEM_VERSION
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
CUDA_VERSION=12.9 bash install_nccl.sh
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
CUDA_VERSION=12.9 bash install_cusparselt.sh
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
ldconfig
}
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
# install CUDA 12.8.0 in the same container
install_cuda 12.8.0 cuda_12.8.0_570.86.10_linux
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.8 bash install_nccl.sh
CUDA_VERSION=12.8 bash install_cusparselt.sh
@ -197,17 +146,37 @@ function install_128 {
ldconfig
}
function install_130 {
CUDNN_VERSION=9.13.0.50
echo "Installing CUDA 13.0 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 13.0 in the same container
install_cuda 13.0.0 cuda_13.0.0_580.65.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 13 $CUDNN_VERSION
install_nvshmem 13 $NVSHMEM_VERSION
CUDA_VERSION=13.0 bash install_nccl.sh
CUDA_VERSION=13.0 bash install_cusparselt.sh
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
12.4) install_124;
;;
12.4) install_124; prune_124
12.6|12.6.*) install_126;
;;
12.6) install_126; prune_126
12.8|12.8.*) install_128;
;;
12.8) install_128;
12.9|12.9.*) install_129;
;;
13.0|13.0.*) install_130;
;;
*) echo "bad argument $1"; exit 1
;;
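The new `case` arms above pair each bare minor version with a glob (for example `12.8|12.8.*`), so `12.8` and a full patch version such as `12.8.1` both reach the same installer. A minimal sketch of that matching, assuming a hypothetical `select_installer` helper that only prints the installer name rather than running it (the 12.4 arm is omitted for brevity):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the diff): prints which install_* function
# the version argument would dispatch to, using the same case patterns.
select_installer() {
  case "$1" in
    12.6|12.6.*) echo "install_126" ;;
    12.8|12.8.*) echo "install_128" ;;
    12.9|12.9.*) echo "install_129" ;;
    13.0|13.0.*) echo "install_130" ;;
    *) echo "bad argument $1" >&2; return 1 ;;
  esac
}
```

Calling `select_installer 12.8.1` prints `install_128`, while an unlisted version such as `11.8` falls through to the error arm.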

View File

@ -1,26 +0,0 @@
#!/bin/bash
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -5,13 +5,21 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ "13" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.8.0.4_cuda13-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
@ -21,9 +29,6 @@ elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
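The branch chain above picks a cuSPARSELt archive from the first four characters of `CUDA_VERSION` and the target architecture. The sketch below restates that selection as a function so it can be checked without downloading anything. The name `cusparselt_archive_name` and the optional second argument are assumptions made for illustration, and the CUDA 13 test uses a prefix match (`13*`) where the diff uses `=~ "13"`, which is a deliberately stricter swap.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the archive name the script would download.
# $1 = CUDA version (e.g. 12.8.1); $2 = architecture (defaults to uname -m).
cusparselt_archive_name() {
  local cuda_version="$1"
  local arch="${2:-$(uname -m)}"
  local arch_path='sbsa'
  if [ "$arch" = 'amd64' ] || [ "$arch" = 'x86_64' ]; then
    arch_path='x86_64'
  fi
  local prefix="${cuda_version:0:4}"
  if [[ $prefix == 13* ]]; then
    echo "libcusparse_lt-linux-${arch_path}-0.8.0.4_cuda13-archive"
  elif [[ $prefix =~ ^12\.[5-9]$ ]]; then
    echo "libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
  elif [[ $prefix == "12.4" ]]; then
    echo "libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
  else
    echo "Not sure which libcusparselt version to install for this ${cuda_version}" >&2
    return 1
  fi
}
```

Unlike the script in the diff, this sketch returns a non-zero status for an unrecognized version instead of only printing a message.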

View File

@ -5,9 +5,7 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local version
commit=$(get_pinned_commit huggingface)
pip_install "git+https://github.com/huggingface/transformers@${commit}"
pip_install -r huggingface-requirements.txt
}
function install_timm() {
@ -15,11 +13,34 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y torch torchvision triton
}
function install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
python install.py --continue_on_fail
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
chown -R jenkins torchbench
chown -R jenkins /opt/conda
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
# Stable packages are ok here, just to satisfy TorchBench check
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
install_torchbench
install_huggingface
install_timm
# Clean up
conda_run pip uninstall -y torch torchvision torchaudio triton torchao

View File

@ -7,6 +7,8 @@ if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
elif [[ ${CUDA_VERSION:0:2} == "13" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu13.txt)
else
echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
exit 1
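The hunk above adds a third branch so that a CUDA 13 toolkit reads its NCCL pin from `ci_commit_pins/nccl-cu13.txt`. As a rough sketch, the same major-version lookup can be written as one `case` statement. The function name `nccl_pin_file` is hypothetical, and it only prints the path rather than reading the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a CUDA version to the NCCL pin file path,
# keyed on the first two characters (the major version), as in the diff.
nccl_pin_file() {
  local major="${1:0:2}"
  case "$major" in
    11|12|13) echo "ci_commit_pins/nccl-cu${major}.txt" ;;
    *) echo "Unexpected CUDA_VERSION $1" >&2; return 1 ;;
  esac
}
```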

View File

@ -8,16 +8,6 @@ retry () {
"$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@")
}
# A bunch of custom pip dependencies for ONNX
pip_install \
beartype==0.15.0 \
filelock==3.9.0 \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
# onnx-weekly. Otherwise, onnx-weekly could be
# overwritten by onnx.
@ -29,12 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes
pip_install onnxruntime==1.22.1
pip_install onnxscript==0.4.0
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
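The `retry` helper shown at the top of this hunk runs a command and, on failure, tries again after 10, 20, and 40 seconds. A loop-based sketch of the same backoff follows. `retry_with_delays` and the `RETRY_DELAYS` variable are hypothetical names introduced here so the delays can be shortened in tests, and the retries run in the current shell rather than in subshells as the original does.

```shell
#!/usr/bin/env bash
# Hypothetical variant of retry(): same 10/20/40 s schedule by default,
# overridable through the assumed RETRY_DELAYS variable (space-separated seconds).
retry_with_delays() {
  local -a delays
  read -r -a delays <<< "${RETRY_DELAYS:-10 20 40}"
  # First attempt, no delay.
  "$@" && return 0
  local d
  for d in "${delays[@]}"; do
    sleep "$d"
    "$@" && return 0
  done
  return 1
}
```

For example, `retry_with_delays curl --retry 3 -OLs "$url"` would make up to four attempts in total before reporting failure.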

View File

@ -4,9 +4,9 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.30}" --depth 1 --shallow-submodules
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128
USE_OPENMP=1
@ -14,9 +14,8 @@ NO_SHARED=0
DYNAMIC_ARCH=1
TARGET=ARMV8
CFLAGS=-O3
BUILD_BFLOAT16=1
"
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}

View File

@ -8,9 +8,11 @@ ver() {
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 20.04 ]]; then
# gpg-agent is not available by default on 20.04
apt-get install -y --no-install-recommends gpg-agent
# gpg-agent is not available by default
apt-get install -y --no-install-recommends gpg-agent
if [[ $(ver $UBUNTU_VERSION) -ge $(ver 22.04) ]]; then
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
| sudo tee /etc/apt/preferences.d/rocm-pin-600
fi
apt-get install -y kmod
apt-get install -y wget
@ -26,13 +28,27 @@ Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.2"
fi
# Default url values
rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
# Special case for ROCM_VERSION == 7.0
if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
# Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
@ -66,25 +82,33 @@ EOF
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# CI no longer builds for ROCm 6.3, but
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.2) ]]; then
HIP_TAG=rocm-6.4.2
CLR_HASH=74d78ba3ac4bac235d02bcb48511c30b5cfdd457 # branch release/rocm-rel-6.4.2-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_TAG=rocm-6.4.1
CLR_HASH=efe6c35790b9206923bfeed1209902feff37f386 # branch release/rocm-rel-6.4.1-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
HIP_TAG=rocm-6.4.0
CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c # branch release/rocm-rel-6.4-statco-hotfix
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_TAG
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
git clone https://github.com/jeffdaily/clr
pushd clr
git checkout $CLR_HASH
popd
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
# Need to point CMake to the correct python installation to find CppHeaderParser
cmake .. -DPython3_EXECUTABLE=/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}/bin/python3 -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
cp hipamd/lib/libamdhip64.so.6.4.* /opt/rocm/lib/libamdhip64.so.6.4.*
popd
rm -rf HIP clr
fi
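Several hunks above compare versions through `ver` and rewrite a bare `6.4` to `6.4.2`. The body of `ver` is not visible in this diff (only its name appears in a hunk header), so the definition below is an assumption: it zero-pads each dotted component so the result compares correctly as an integer. `pin_rocm_patch` is a hypothetical wrapper around the pinning step.

```shell
#!/usr/bin/env bash
# Assumed definition of ver(): pads up to four dotted components into one
# integer-like string, e.g. 6.4.2 -> "  6004002000". Missing components are 0.
ver() {
  # Word splitting of the tr output is intentional here.
  printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ')
}

# Hypothetical helper: pin a bare 6.4 to the 6.4.2 patch release, as the
# "we want the patch version of 6.4 instead" hunks do.
pin_rocm_patch() {
  local v="$1"
  if [[ $(ver "$v") -eq $(ver 6.4) ]]; then
    v="${v}.2"
  fi
  echo "$v"
}
```

Because `6.4.1` and `6.4.2` do not compare equal to `6.4` under this scheme, an explicit patch version passes through unchanged.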

View File

@ -5,7 +5,12 @@ set -eou pipefail
function do_install() {
rocm_version=$1
rocm_version_nodot=${1//./}
if [[ ${rocm_version} =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
# chop off any patch version
rocm_version="${rocm_version%.*}"
fi
rocm_version_nodot=${rocm_version//./}
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
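The change above strips a patch component before removing the dots, so `6.4.2` and `6.4` both yield `64`. A small sketch of that normalization, wrapped in a hypothetical `rocm_version_nodot` function for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the logic in do_install: drop any patch
# version (x.y.z -> x.y), then remove the dots (x.y -> xy).
rocm_version_nodot() {
  local rocm_version="$1"
  if [[ ${rocm_version} =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    # chop off any patch version
    rocm_version="${rocm_version%.*}"
  fi
  echo "${rocm_version//./}"
}
```

Without the patch-stripping step, `6.4.2` would become `642`, which is what the old single-line `${1//./}` expansion produced.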

View File

@ -51,8 +51,13 @@ as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
pip_install pybind11==2.13.6
# Older Triton checkouts keep setup.py in ./python; newer ones keep it in the repository root
if [ ! -f setup.py ]; then
cd python
fi
pip_install pybind11==3.0.1
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
@ -93,3 +98,10 @@ fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi
# IMPORTANT: helion needs to be installed without dependencies.
# It depends on torch and triton. We don't want to install
# triton and torch from production on Docker CI images
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion --no-deps
fi

View File

@ -44,8 +44,12 @@ function install_ucc() {
./autogen.sh
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
if [[ -n "$CUDA_VERSION" && $CUDA_VERSION == 13* ]]; then
NVCC_GENCODE="-gencode=arch=compute_86,code=compute_86"
else
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
fi
if [[ -n "$ROCM_VERSION" ]]; then
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

View File

@ -34,18 +34,27 @@ function install_ubuntu() {
# The xpu-smi packages
apt-get install -y flex bison xpu-smi
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
else # rolling driver
apt-get install -y \
intel-opencl-icd libze-intel-gpu1 libze1 \
intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
apt-get install -y ${XPU_PACKAGES}
@ -134,18 +143,18 @@ function install_sles() {
}
# Default use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
# Use GPU driver rolling releases
XPU_DRIVER_VERSION=""
# Default use GPU driver rolling releases
XPU_DRIVER_VERSION=""
if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then
# Use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
fi
# Default use Intel® oneAPI Deep Learning Essentials 2025.0
if [[ "$XPU_VERSION" == "2025.1" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
# Default use Intel® oneAPI Deep Learning Essentials 2025.1
if [[ "$XPU_VERSION" == "2025.2" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.2"
else
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
fi
# The installation depends on the base OS
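This hunk flips the defaults: the rolling driver is now used unless `XPU_DRIVER_TYPE` is `lts` (compared case-insensitively via `${VAR,,}`, which needs bash 4 or newer), and the package set is 2025.1 unless `XPU_VERSION` is `2025.2`. A sketch of the two selections as standalone functions; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repository path suffix for the GPU driver.
# An empty result means the rolling release, which is the new default.
xpu_driver_version() {
  local driver_type="${1:-}"
  local version=""
  if [[ "${driver_type,,}" == "lts" ]]; then
    version="/lts/2350"
  fi
  echo "$version"
}

# Hypothetical helper: oneAPI Deep Learning Essentials package name.
xpu_packages() {
  if [[ "${1:-}" == "2025.2" ]]; then
    echo "intel-deep-learning-essentials-2025.2"
  else
    echo "intel-deep-learning-essentials-2025.1"
  fi
}
```

With these defaults, a Dockerfile that sets `ENV XPU_DRIVER_TYPE ROLLING` (or leaves it unset) gets the rolling driver path.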

View File

@ -54,16 +54,6 @@ COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
@ -74,6 +64,24 @@ RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
RUN bash ./install_magma.sh 12.9
RUN ln -sf /usr/local/cuda-12.9 /usr/local/cuda
FROM cuda as cuda13.0
RUN bash ./install_cuda.sh 13.0
RUN bash ./install_magma.sh 13.0
RUN ln -sf /usr/local/cuda-13.0 /usr/local/cuda
# Install libibverbs for libtorch and copy to CUDA directory
RUN apt-get update -y && \
apt-get install -y libibverbs-dev librdmacm-dev && \
cp /usr/lib/x86_64-linux-gnu/libmlx5.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/librdmacm.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/libibverbs.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/libnl* /usr/local/cuda/lib64/
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH

View File

@ -39,6 +39,10 @@ case ${DOCKER_TAG_PREFIX} in
DOCKER_GPU_BUILD_ARG=""
;;
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"

View File

@ -27,5 +27,7 @@ COPY ./common/install_linter.sh install_linter.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh
RUN chown -R jenkins:jenkins /var/lib/jenkins/ci_env
USER jenkins
CMD ["bash"]

View File

@ -26,7 +26,7 @@ ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
@ -103,6 +103,7 @@ ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /usr/local/lib/ /usr/local/lib/
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
@ -174,6 +175,6 @@ ENV XPU_DRIVER_TYPE ROLLING
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.1
ENV XPU_VERSION 2025.2
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -2,7 +2,7 @@ FROM quay.io/pypa/manylinux_2_28_aarch64 as base
ARG GCCTOOLSET_VERSION=13
# Language variabes
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
@ -58,12 +58,13 @@ RUN git config --global --add safe.directory "*"
FROM base as openblas
# Install openblas
ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -60,7 +60,7 @@ RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -5,7 +5,9 @@ ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# there is a bugfix in gcc >= 14 for precompiled headers and s390x vectorization interaction.
# with earlier gcc versions test/inductor/test_cpu_cpp_wrapper.py will fail.
ARG DEVTOOLSET_VERSION=14
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
@ -58,7 +60,8 @@ RUN yum install -y \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
valgrind \
ninja-build
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -103,9 +106,6 @@ CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
hdf5-devel \
python3-h5py \
git
@ -120,15 +120,22 @@ RUN python3 -mpip install cmake==3.28.0
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# h5py 3.11.0 doesn't build with numpy >= 2.3.0.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
./build.sh --config Release --parallel 0 --enable_pybind --build_wheel --enable_training --enable_training_apis --enable_training_ops --skip_tests --allow_running_as_root && \
wget https://github.com/microsoft/onnxruntime/commit/f57db79743c4d1a3553aa05cf95bcd10966030e6.patch && \
patch -p1 < f57db79743c4d1a3553aa05cf95bcd10966030e6.patch && \
./build.sh --config Release --parallel 0 --enable_pybind \
--build_wheel --enable_training --enable_training_apis \
--enable_training_ops --skip_tests --allow_running_as_root \
--compile_no_warning_as_error && \
pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
cd .. && /bin/rm -rf ./onnxruntime

View File

@ -27,6 +27,7 @@ fi
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
case ${image} in
manylinux2_28-builder:cpu)
@ -40,6 +41,7 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
@ -65,6 +67,12 @@ case ${image} in
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinux2_28-builder:cuda13*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinuxaarch64-builder:cuda*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
@ -73,6 +81,10 @@ case ${image} in
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
TARGET=rocm_final
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
@ -109,6 +121,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \

View File

@ -16,6 +16,7 @@ click
#test that import:
coremltools==5.0b5 ; python_version < "3.12"
coremltools==8.3 ; python_version == "3.12"
#Description: Apple framework for ML integration
#Pinned versions: 5.0b5
#test that import:
@ -41,18 +42,15 @@ fbscribelogger==0.1.7
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0 ; platform_machine != "s390x"
flatbuffers==24.12.23
#Description: cross platform serialization library
#Pinned versions: 2.0
#Pinned versions: 24.12.23
#test that import:
flatbuffers ; platform_machine == "s390x"
#Description: cross platform serialization library; Newer version is required on s390x for new python version
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
#Pinned versions: 3.44.6, 4.53.2
#Pinned versions: 5.35.1
#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
junitparser==2.1.1
@ -65,10 +63,12 @@ lark==0.12.0
#Pinned versions: 0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
librosa>=0.6.2 ; python_version < "3.11" and platform_machine != "s390x"
librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
#test that import: test_spectral_ops.py
#librosa depends on numba; disable it for s390x while numba is disabled too
#mkl #this breaks linux-bionic-rocm4.5-py3.7
#Description: Intel oneAPI Math Kernel Library
@ -93,10 +93,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.14.0
mypy==1.16.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -111,13 +111,15 @@ ninja==1.11.1.3
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
numba==0.49.0 ; python_version < "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#test that import: test_numba_integration.py
#For numba issue see https://github.com/pytorch/pytorch/issues/51511
#Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
#numpy
#Description: Provides N-dimensional arrays and linear algebra
@ -166,10 +168,10 @@ pillow==11.0.0
#Pinned versions: 10.3.0
#test that import:
protobuf==3.20.2
#Description: Googles data interchange format
#Pinned versions: 3.20.1
#test that import: test_tensorboard.py
protobuf==5.29.4
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#test that import: test_tensorboard.py, test/onnx/*
psutil
#Description: information on running processes and system utilization
@ -221,9 +223,9 @@ pygments==2.15.0
#Pinned versions: 2.12.0
#test that import: the doctests
#PyYAML
#pyyaml
#Description: data serialization format
#Pinned versions:
#Pinned versions: 6.0.2
#test that import:
#requests
@ -233,7 +235,7 @@ pygments==2.15.0
#rich
#Description: rich text and beautiful formatting in the terminal
#Pinned versions: 10.9.0
#Pinned versions: 14.1.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
@ -261,11 +263,6 @@ scipy==1.14.1 ; python_version >= "3.12"
#Pinned versions:
#test that import:
tb-nightly==2.13.0a20230426
#Description: TensorBoard
#Pinned versions:
#test that import:
# needed by torchgen utils
typing-extensions>=4.10.0
#Description: type hints for python
@ -307,7 +304,7 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.12.6.0
z3-solver==4.15.1.0 ; platform_machine != "s390x"
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
@ -337,12 +334,12 @@ sympy==1.13.3
#Pinned versions:
#test that import:
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
onnx==1.18.0
#Description: Required by onnx tests, and mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.2.2
onnxscript==0.4.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -361,12 +358,11 @@ pwlf==2.2.1
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
astunparse
PyYAML
pyyaml
pyzstd
setuptools
setuptools>=70.1.0
six
scons==4.5.2 ; platform_machine == "aarch64"
@ -382,3 +378,16 @@ dataclasses_json==0.6.7
cmake==4.0.0
#Description: required for building
tlparse==0.4.0
#Description: required for log parsing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py
setuptools-git-versioning==2.1.0
scikit-build==0.18.1
pyre-extensions==0.0.32
tabulate==0.9.0
#Description: These packages are needed to build FBGEMM and torchrec on PyTorch CI

View File

@ -1,11 +1,11 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@1657ad2fc1acdc98aa719eebecbb0128a7c13ce4#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
# but it doesn't seem to work and hangs around idly. The initial thought is that it is probably
# something related to the Docker setup. We can investigate this later.
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
@ -15,9 +15,14 @@ sphinxext-opengraph==0.9.1
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.9.1
matplotlib==3.5.3
sphinx_sitemap==2.6.0
#Description: This is used to generate sitemap for PyTorch docs
#Pinned versions: 2.6.0
matplotlib==3.5.3 ; python_version < "3.13"
matplotlib==3.6.3 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
#Pinned versions: 3.6.3 if python > 3.12. Otherwise 3.5.3.
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
@ -45,8 +50,8 @@ IPython==8.12.0
#Pinned versions: 8.12.0
myst-nb==0.17.2
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 0.13.2
#Description: This is used to generate PyTorch functorch and torch.compile docs.
#Pinned versions: 0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5

View File

@ -1 +1 @@
3.3.1
3.5.0

View File

@ -0,0 +1 @@
3.5.0

View File

@ -0,0 +1,155 @@
# Cross-compilation Docker container for RISC-V architecture
ARG UBUNTU_VERSION
FROM --platform=linux/amd64 ubuntu:${UBUNTU_VERSION} as base
ARG UBUNTU_VERSION
ENV GCC_VERSION=14
ENV PYTHON_VERSION=3.12.3
ENV DEBIAN_FRONTEND=noninteractive
ENV CC=riscv64-linux-gnu-gcc-${GCC_VERSION}
ENV CXX=riscv64-linux-gnu-g++-${GCC_VERSION}
ENV QEMU_LD_PREFIX=/usr/riscv64-linux-gnu/
ENV SYSROOT=/opt/sysroot
# Install basic dependencies
RUN apt-get update && apt-get install -y \
ninja-build \
autoconf \
automake \
libtool \
patchelf \
ccache \
git \
wget \
python3-pip \
python3-venv \
python-is-python3 \
cmake \
sudo \
lsb-release \
gcc-${GCC_VERSION}-riscv64-linux-gnu \
g++-${GCC_VERSION}-riscv64-linux-gnu \
pkg-config \
&& rm -rf /var/lib/apt/lists/*
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
FROM base as python
ARG ZLIB_VERSION=1.3.1
ARG FFI_VERSION=3.4.6
ARG BZ2_VERSION=1.0.8
ARG XZ_VERSION=5.4.6
ARG OPENSSL_VERSION=3.2.1
# Set up sysroot directory for dependencies
ENV PKG_CONFIG_PATH=${SYSROOT}/lib/pkgconfig
ENV PKG_CONFIG_SYSROOT_DIR=${SYSROOT}
WORKDIR /opt
# Build zlib (for compression)
RUN echo "--- Building zlib ---" \
&& wget -c https://www.zlib.net/zlib-${ZLIB_VERSION}.tar.gz \
&& tar -xf zlib-${ZLIB_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd zlib-${ZLIB_VERSION}/ \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} \
&& make -j$(nproc) && make install \
&& cd ../..
# Build libffi (for ctypes module)
RUN echo "--- Building libffi ---" \
&& wget -c https://github.com/libffi/libffi/releases/download/v${FFI_VERSION}/libffi-${FFI_VERSION}.tar.gz \
&& tar -xf libffi-${FFI_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd libffi-${FFI_VERSION}/ \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build bzip2 (for bz2 module)
RUN echo "--- Building bzip2 ---" \
&& wget -c https://sourceware.org/pub/bzip2/bzip2-${BZ2_VERSION}.tar.gz \
&& tar -xf bzip2-${BZ2_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd bzip2-${BZ2_VERSION}/ \
&& make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} bzip2 bzip2recover libbz2.a \
&& make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} -f Makefile-libbz2_so \
&& make install PREFIX=${SYSROOT} \
&& cp libbz2.so.${BZ2_VERSION} ${SYSROOT}/lib/ \
&& cd ${SYSROOT}/lib/ \
&& ln -sf libbz2.so.${BZ2_VERSION} libbz2.so.1.0 \
&& ln -sf libbz2.so.1.0 libbz2.so \
&& cd /opt/
# Build xz (for lzma module)
RUN echo "--- Building xz ---" \
&& wget -c https://github.com/tukaani-project/xz/releases/download/v${XZ_VERSION}/xz-${XZ_VERSION}.tar.gz \
&& tar -xf xz-${XZ_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd xz-${XZ_VERSION} \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build OpenSSL (for ssl module)
RUN echo "--- Building OpenSSL ---" \
&& wget -c https://www.openssl.org/source/openssl-${OPENSSL_VERSION}.tar.gz \
&& tar -xf openssl-${OPENSSL_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd openssl-${OPENSSL_VERSION}/ \
&& mkdir build && cd build \
&& ../Configure linux64-riscv64 --prefix=${SYSROOT} \
&& make -j$(nproc) && make install_sw \
&& cd ../..
# Build SQLite3 (for sqlite3 module)
RUN echo "--- Building SQLite3 ---" \
&& wget -c https://www.sqlite.org/2024/sqlite-autoconf-3450200.tar.gz \
&& tar -xf sqlite-autoconf-3450200.tar.gz --no-same-permissions --no-same-owner \
&& cd sqlite-autoconf-3450200 \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build and install RISC-V Python with all modules
RUN wget -c https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
&& tar -xf Python-${PYTHON_VERSION}.tgz --no-same-permissions --no-same-owner \
&& cd Python-${PYTHON_VERSION} \
&& mkdir build && cd build \
&& ../configure \
--host=riscv64-linux-gnu \
--build=x86_64-linux-gnu \
--prefix=${SYSROOT} \
--enable-shared \
--disable-ipv6 \
--with-build-python=/usr/bin/python3 \
--with-ensurepip=no \
ac_cv_file__dev_ptmx=yes \
ac_cv_file__dev_ptc=no \
&& make -j$(nproc) \
&& make install
FROM base as final
COPY --from=python /opt/sysroot /opt/sysroot
# Install crossenv and cmake
RUN pip install crossenv cmake==4.0.0 --break-system-packages \
&& /usr/bin/python3 -m crossenv ${SYSROOT}/bin/python3 /opt/riscv-cross-env
# Add pip-installed cmake binaries to PATH
ENV PATH="/usr/local/bin:${PATH}"
# Set up cross Python environment
SHELL ["/bin/bash", "-c"]
RUN source /opt/riscv-cross-env/bin/activate \
&& pip install setuptools pyyaml typing_extensions wheel
# Set default environment variables for PyTorch build
ENV Python_ROOT_DIR=${SYSROOT}
ENV OPENSSL_ROOT_DIR=${SYSROOT}
USER jenkins
CMD ["bash"]

View File

@ -1,170 +0,0 @@
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG IMAGE_NAME
FROM ${IMAGE_NAME} as base
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
FROM base as triton-builder
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
# See https://github.com/pytorch/pytorch/issues/82174
# TODO(sdym@fb.com):
# check if this is needed after full off Xenial migration
ENV CARGO_NET_GIT_FETCH_WITH_CLI true
RUN bash ./install_cache.sh && rm install_cache.sh
ENV CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
# Add jni.h for java host build
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
# Install Open MPI for CUDA
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
# Install CUDNN
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install NCCL
ARG CUDA_VERSION
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash install_nccl.sh
RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi
USER jenkins
CMD ["bash"]

View File

@ -25,6 +25,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
@ -95,10 +96,11 @@ ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt
# (optional) Install non-default Ninja version
ARG NINJA_VERSION

View File

@ -56,10 +56,10 @@ RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt
# Install XPU Dependencies
ARG XPU_VERSION
@ -72,7 +72,7 @@ ARG TRITON
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-xpu.txt triton-xpu.txt
COPY triton_version.txt triton_version.txt
COPY triton_xpu_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt

View File

@ -66,6 +66,7 @@ ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ARG CUDA_VERSION
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
@ -96,10 +97,11 @@ RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt
ARG TRITON
ARG TRITON_CPU
@ -147,6 +149,12 @@ RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi
RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
ARG OPENBLAS
COPY ./common/install_openblas.sh install_openblas.sh
RUN if [ -n "${OPENBLAS}" ]; then bash ./install_openblas.sh; fi
RUN rm install_openblas.sh
ENV INSTALLED_OPENBLAS ${OPENBLAS}
# Install ccache/sccache (do this last, so we get priority in PATH)
ARG SKIP_SCCACHE_INSTALL
COPY ./common/install_cache.sh install_cache.sh
@ -174,7 +182,6 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda

View File

@ -7,4 +7,4 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh
USE_NVSHMEM=0 USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh

31
.ci/lumen_cli/README.md Normal file
View File

@ -0,0 +1,31 @@
# 🔧 Lumen_cli
A Python CLI tool for building and testing PyTorch-based components, using a YAML configuration file for structured, repeatable workflows.
## Features
- **Build**
  - external projects (e.g. vLLM)
## 📦 Installation
Run this at the root of the PyTorch repo:
```bash
pip install -e .ci/lumen_cli
```
## Run the cli tool
The CLI tool must be run from the root of the PyTorch repo. For example, to build the external vLLM target:
```bash
python -m cli.run build external vllm
```
This runs the build steps for the vLLM project with default behavior.
To see the help messages, run:
```bash
python3 -m cli.run --help
```
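The command `python -m cli.run build external vllm` reads as a chain of nested subcommands. The following is a minimal sketch of that shape using only `argparse`; the function name `build_parser`, the top-level `dest="command"`, and the string returned by the lambda are illustrative assumptions, not the tool's actual entry point.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real entry point behind `python -m cli.run`.
    parser = argparse.ArgumentParser(prog="cli.run")
    commands = parser.add_subparsers(dest="command", required=True)

    # First level: `build`
    build = commands.add_parser("build", help="Build related commands")
    build_kinds = build.add_subparsers(dest="build_command", required=True)

    # Second level: `external`
    external = build_kinds.add_parser("external", help="Build external targets")
    targets = external.add_subparsers(dest="target", required=True)

    # Third level: the target itself, which stores a callable to dispatch to.
    vllm = targets.add_parser("vllm", help="Build vLLM using docker buildx.")
    vllm.set_defaults(func=lambda args: f"running {args.target} build")
    return parser


args = build_parser().parse_args(["build", "external", "vllm"])
print(args.command, args.build_command, args.target)
print(args.func(args))
```

Each level stores its choice under its own `dest`, so the selected target and its callable are available on the parsed namespace.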
## Add customized external build logic
To add a new external build target:
1. Create the build function in the `cli/lib` folder.
2. Register your target and the main build function at EXTERNAL_BUILD_TARGET_DISPATCH in `cli/build_cli/register_build.py`.
3. [Optional] Create your CI config file in `.github/ci_configs/${EXTERNAL_PACKAGE_NAME}.yaml`.
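The steps above can be sketched as follows. This is a rough, self-contained illustration: `BaseRunner` here is a local stand-in for the class in `cli.lib.common.cli_helper`, and `MyProjectBuildRunner`, the `myproject` target name, and the `TARGETS` dict are hypothetical. The registration code in this change keeps its targets in a dict of the same shape, where each entry has a `runner` and an optional `help`.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseRunner(ABC):
    """Local stand-in for cli.lib.common.cli_helper.BaseRunner."""

    def __init__(self, args: Any) -> None:
        self.args = args

    @abstractmethod
    def run(self) -> None: ...


class MyProjectBuildRunner(BaseRunner):
    """Build the hypothetical 'myproject' external target."""

    def run(self) -> None:
        # A real runner would clone, configure, and build the project here.
        print(f"building myproject with args={self.args}")


# Hypothetical registry entry, mirroring the shape of the vllm entry.
TARGETS = {
    "myproject": {
        "runner": MyProjectBuildRunner,
        "help": "Build myproject (illustrative only).",
    },
}

# Dispatch works by instantiating the registered runner and calling run().
TARGETS["myproject"]["runner"](args=None).run()
```

Keeping the runner behind a small abstract base class means the CLI only needs to know about `run()`, so a new target does not require changes to the dispatch code.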

View File

@ -0,0 +1,37 @@
import argparse
import logging

from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec
from cli.lib.core.vllm.vllm_build import VllmBuildRunner


logger = logging.getLogger(__name__)

# Maps each target to its argparse configuration and runner.
# Adding an entry exposes it as `python -m cli.run build external {target}`.
_TARGETS: dict[str, TargetSpec] = {
    "vllm": {
        "runner": VllmBuildRunner,
        "help": "Build vLLM using docker buildx.",
    }
    # add yours ...
}


def register_build_commands(subparsers: argparse._SubParsersAction) -> None:
    build_parser = subparsers.add_parser(
        "build",
        help="Build related commands",
        formatter_class=RichHelp,
    )
    build_subparsers = build_parser.add_subparsers(dest="build_command", required=True)
    overview = "\n".join(
        f" {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()
    )
    external_parser = build_subparsers.add_parser(
        "external",
        help="Build external targets",
        description="Build third-party targets.\n\nAvailable targets:\n" + overview,
        formatter_class=RichHelp,
    )
    register_targets(external_parser, _TARGETS)

View File

@ -0,0 +1,71 @@
"""
Cli Argparser Utility helpers for CLI tasks.
"""

import argparse
from abc import ABC, abstractmethod


try:
    from typing import Any, Callable, Required, TypedDict  # Python 3.11+
except ImportError:
    from typing import Any, Callable, TypedDict

    from typing_extensions import Required  # Fallback for Python <3.11


class BaseRunner(ABC):
    def __init__(self, args: Any) -> None:
        self.args = args

    @abstractmethod
    def run(self) -> None:
        """Run the main logic; subclasses must implement this."""


# Pretty help: keep newlines + show defaults
class RichHelp(
    argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter
):
    pass


class TargetSpec(TypedDict, total=False):
    """CLI subcommand specification: a required runner plus optional help, description, and argument hook."""

    runner: Required[type[BaseRunner]]
    help: str
    description: str
    add_arguments: Callable[[argparse.ArgumentParser], None]


def register_targets(
    parser: argparse.ArgumentParser,
    target_specs: dict[str, TargetSpec],
    common_args: Callable[[argparse.ArgumentParser], None] = lambda _: None,
) -> None:
    """Register target subcommands."""
    targets = parser.add_subparsers(
        dest="target",
        required=True,
        metavar="{" + ",".join(target_specs.keys()) + "}",
    )

    for name, spec in target_specs.items():
        desc = spec.get("description") or spec["runner"].__doc__ or ""

        p = targets.add_parser(
            name,
            help=spec.get("help", ""),
            description=desc.strip(),
            formatter_class=RichHelp,
        )
        p.set_defaults(
            func=lambda args, cls=spec["runner"]: cls(args).run(),
            _runner_class=spec["runner"],
        )
        if "add_arguments" in spec and callable(spec["add_arguments"]):
            spec["add_arguments"](p)
        if common_args:
            common_args(p)

View File

@ -0,0 +1,42 @@
"""
Docker Utility helpers for CLI tasks.
"""

import logging
from typing import Optional

import docker
from docker.errors import APIError, NotFound


logger = logging.getLogger(__name__)

# lazy singleton so we don't reconnect every call
_docker_client: Optional[docker.DockerClient] = None


def _get_client() -> docker.DockerClient:
    global _docker_client
    if _docker_client is None:
        _docker_client = docker.from_env()
    return _docker_client


def local_image_exists(
    image_name: str, client: Optional[docker.DockerClient] = None
) -> bool:
    """Return True if a local Docker image exists."""
    if not image_name:
        return False

    client = client or _get_client()
    try:
        client.images.get(image_name)
        return True
    except (NotFound, APIError) as e:
        logger.error(
            "Error when checking Docker image '%s': %s",
            image_name,
            e.explanation if hasattr(e, "explanation") else str(e),
        )
        return False

View File

@ -0,0 +1,110 @@
"""
Environment Variables and Dataclasses Utility helpers for CLI tasks.
"""
import os
from dataclasses import field, fields, is_dataclass, MISSING
from pathlib import Path
from textwrap import indent
from typing import Optional, Union
from cli.lib.common.utils import str2bool
def get_env(name: str, default: str = "") -> str:
"""Get environment variable with default fallback."""
return os.environ.get(name) or default
def env_path_optional(
name: str,
default: Optional[Union[str, Path]] = None,
resolve: bool = True,
) -> Optional[Path]:
"""Get environment variable as optional Path."""
val = get_env(name) or default
if not val:
return None
path = Path(val)
return path.resolve() if resolve else path
def env_path(
name: str,
default: Optional[Union[str, Path]] = None,
resolve: bool = True,
) -> Path:
"""Get environment variable as Path, raise if missing."""
path = env_path_optional(name, default, resolve)
if not path:
raise ValueError(f"Missing path value for {name}")
return path
def env_bool(
name: str,
default: bool = False,
) -> bool:
val = get_env(name)
if not val:
return default
return str2bool(val)
def env_bool_field(
name: str,
default: bool = False,
):
return field(default_factory=lambda: env_bool(name, default))
def env_path_field(
name: str,
default: Union[str, Path] = "",
*,
resolve: bool = True,
) -> Path:
return field(default_factory=lambda: env_path(name, default, resolve=resolve))
def env_str_field(
name: str,
default: str = "",
) -> str:
return field(default_factory=lambda: get_env(name, default))
def generate_dataclass_help(cls) -> str:
"""Auto-generate help text for dataclass fields."""
if not is_dataclass(cls):
raise TypeError(f"{cls} is not a dataclass")
def get_value(f):
if f.default is not MISSING:
return f.default
if f.default_factory is not MISSING:
try:
return f.default_factory()
except Exception as e:
return f"<error: {e}>"
return "<required>"
lines = [f"{f.name:<22} = {repr(get_value(f))}" for f in fields(cls)]
return indent("\n".join(lines), " ")
def with_params_help(params_cls: type, title: str = "Parameter defaults"):
"""
Class decorator that appends a help table generated from another dataclass
(e.g., VllmParameters) to the decorated class's docstring.
"""
if not is_dataclass(params_cls):
raise TypeError(f"{params_cls} must be a dataclass")
def _decorator(cls: type) -> type:
block = generate_dataclass_help(params_cls)
cls.__doc__ = (cls.__doc__ or "") + f"\n\n{title}:\n{block}"
return cls
return _decorator

View File

@ -0,0 +1,143 @@
from __future__ import annotations
import logging
import os
import textwrap
from pathlib import Path
from typing import TYPE_CHECKING
from cli.lib.common.utils import get_wheels
from jinja2 import Template
if TYPE_CHECKING:
from collections.abc import Iterable, Mapping
logger = logging.getLogger(__name__)
_TPL_CONTENT = Template(
textwrap.dedent("""\
## {{ title }}
```{{ lang }}
{{ content }}
```
""")
)
_TPL_LIST_ITEMS = Template(
textwrap.dedent("""\
## {{ title }}
{% for it in items %}
- {{ it.pkg }}: {{ it.relpath }}
{% else %}
_(no item found)_
{% endfor %}
""")
)
_TPL_TABLE = Template(
textwrap.dedent("""\
{%- if rows %}
| {{ cols | join(' | ') }} |
|{%- for _ in cols %} --- |{%- endfor %}
{%- for r in rows %}
| {%- for c in cols %} {{ r.get(c, "") }} |{%- endfor %}
{%- endfor %}
{%- else %}
_(no data)_
{%- endif %}
""")
)
def gh_summary_path() -> Path | None:
"""Return the Path to the GitHub step summary file, or None if not set."""
p = os.environ.get("GITHUB_STEP_SUMMARY")
return Path(p) if p else None
def write_gh_step_summary(md: str, *, append_content: bool = True) -> bool:
"""
Write Markdown content to the GitHub Step Summary file if GITHUB_STEP_SUMMARY is set.
append_content: default true, if True, append to the end of the file, else overwrite the whole file
Returns:
True if written successfully (in GitHub Actions environment),
False if skipped (e.g., running locally where the variable is not set).
"""
sp = gh_summary_path()
if not sp:
logger.info("[gh-summary] GITHUB_STEP_SUMMARY not set, skipping write.")
return False
md_clean = textwrap.dedent(md).strip() + "\n"
mode = "a" if append_content else "w"
with sp.open(mode, encoding="utf-8") as f:
f.write(md_clean)
return True
def md_heading(text: str, level: int = 2) -> str:
"""Generate a Markdown heading string with the given level (1-6)."""
return f"{'#' * max(1, min(level, 6))} {text}\n"
def md_details(summary: str, content: str) -> str:
"""Generate a collapsible <details> block with a summary and inner content."""
return f"<details>\n<summary>{summary}</summary>\n\n{content}\n\n</details>\n"
def summarize_content_from_file(
output_dir: Path,
freeze_file: str,
title: str = "Content from file",
code_lang: str = "", # e.g. "text" or "ini"
) -> bool:
f = Path(output_dir) / freeze_file
if not f.exists():
return False
content = f.read_text(encoding="utf-8").strip()
md = render_content(content, title=title, lang=code_lang)
return write_gh_step_summary(md)
def summarize_wheels(path: Path, title: str = "Wheels", max_depth: int = 3):
items = get_wheels(path, max_depth=max_depth)
if not items:
return False
md = render_list(items, title=title)
return write_gh_step_summary(md)
def md_kv_table(rows: Iterable[Mapping[str, str | int | float]]) -> str:
"""
Render a list of dicts as a Markdown table using Jinja template.
"""
rows = list(rows)
cols = list(dict.fromkeys(k for r in rows for k in r))  # first-seen order keeps columns stable across runs
md = _TPL_TABLE.render(cols=cols, rows=rows).strip() + "\n"
return md
def render_list(
items: Iterable[str],
*,
title: str = "List",
) -> str:
tpl = _TPL_LIST_ITEMS
md = tpl.render(title=title, items=items)
return md
def render_content(
content: str,
*,
title: str = "Content",
lang: str = "text",
) -> str:
tpl = _TPL_CONTENT
md = tpl.render(title=title, content=content, lang=lang)
return md

View File

@ -0,0 +1,69 @@
"""
Git Utility helpers for CLI tasks.
"""
import logging
from pathlib import Path
from cli.lib.common.path_helper import remove_dir
from git import GitCommandError, RemoteProgress, Repo
logger = logging.getLogger(__name__)
class PrintProgress(RemoteProgress):
"""Simple progress logger for git operations."""
def __init__(self, interval: int = 5):
super().__init__()
self._last_percent = -1
self._interval = interval
def update(self, op_code, cur, max=None, message=""):
msg = self._cur_line or message
if max and cur:
percent = int(cur / max * 100)
if percent != self._last_percent and percent % self._interval == 0:
self._last_percent = percent
logger.info("Progress: %d%% - %s", percent, msg)
elif msg:
logger.info(msg)
def clone_external_repo(target: str, repo: str, dst: str = "", update_submodules=False):
"""Clone repository with pinned commit and optional submodules."""
dst = dst or target
try:
logger.info("Cloning %s to %s", target, dst)
# Clone and fetch
remove_dir(dst)
r = Repo.clone_from(repo, dst, progress=PrintProgress())
r.git.fetch("--all", "--tags")
# Checkout pinned commit
commit = get_post_build_pinned_commit(target)
logger.info("Checking out pinned %s commit %s", target, commit)
r.git.checkout(commit)
# Update submodules if requested
if update_submodules and r.submodules:
logger.info("Updating %d submodule(s)", len(r.submodules))
for sm in r.submodules:
sm.update(init=True, recursive=True, progress=PrintProgress())
logger.info("Successfully cloned %s", target)
return r, commit
except GitCommandError as e:
logger.error("Git operation failed: %s", e)
raise
def get_post_build_pinned_commit(name: str, prefix=".github/ci_commit_pins") -> str:
path = Path(prefix) / f"{name}.txt"
if not path.exists():
raise FileNotFoundError(f"Pin file not found: {path}")
return path.read_text(encoding="utf-8").strip()

View File

@ -0,0 +1,14 @@
"""
Logger Utility helpers for CLI tasks.
"""

import logging
import sys


def setup_logging(level: int = logging.INFO):
    logging.basicConfig(
        level=level,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        stream=sys.stdout,
    )

View File

@ -0,0 +1,62 @@
"""Path utility helpers for CLI tasks."""

import logging
import shutil
from pathlib import Path
from typing import Union


logger = logging.getLogger(__name__)


def get_path(path: Union[str, Path], resolve: bool = False) -> Path:
    """Convert to Path object, optionally resolving to absolute path."""
    if not path:
        raise ValueError("Path cannot be None or empty")
    result = Path(path)
    return result.resolve() if resolve else result


def ensure_dir_exists(path: Union[str, Path]) -> Path:
    """Create directory if it doesn't exist."""
    path_obj = get_path(path)
    path_obj.mkdir(parents=True, exist_ok=True)
    return path_obj


def remove_dir(path: Union[str, Path, None]) -> None:
    """Remove directory if it exists."""
    if not path:
        return
    path_obj = get_path(path)
    if path_obj.exists():
        shutil.rmtree(path_obj)


def force_create_dir(path: Union[str, Path]) -> Path:
    """Remove directory if exists, then create fresh empty directory."""
    remove_dir(path)
    return ensure_dir_exists(path)


def copy(src: Union[str, Path], dst: Union[str, Path]) -> None:
    """Copy file or directory from src to dst."""
    src_path = get_path(src, resolve=True)
    dst_path = get_path(dst, resolve=True)

    if not src_path.exists():
        raise FileNotFoundError(f"Source does not exist: {src_path}")

    dst_path.parent.mkdir(parents=True, exist_ok=True)

    if src_path.is_file():
        shutil.copy2(src_path, dst_path)
    elif src_path.is_dir():
        shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    else:
        raise ValueError(f"Unsupported path type: {src_path}")


def is_path_exist(path: Union[str, Path, None]) -> bool:
    """Check if path exists."""
    return bool(path and get_path(path).exists())

View File

@ -0,0 +1,71 @@
import glob
import logging
import shlex
import shutil
import sys
from collections.abc import Iterable
from importlib.metadata import PackageNotFoundError, version # noqa: UP035
from typing import Optional, Union
from cli.lib.common.utils import run_command
logger = logging.getLogger(__name__)
def pip_install_packages(
packages: Iterable[str] = (),
env=None,
*,
requirements: Optional[str] = None,
constraints: Optional[str] = None,
prefer_uv: bool = False,
) -> None:
use_uv = prefer_uv and shutil.which("uv") is not None
base = (
[sys.executable, "-m", "uv", "pip", "install"]
if use_uv
else [sys.executable, "-m", "pip", "install"]
)
cmd = base[:]
if requirements:
cmd += ["-r", requirements]
if constraints:
cmd += ["-c", constraints]
cmd += list(packages)
logger.info("pip installing packages: %s", " ".join(map(shlex.quote, cmd)))
run_command(" ".join(map(shlex.quote, cmd)), env=env)
def pip_install_first_match(pattern: str, extras: Optional[str] = None, pref_uv=False):
wheel = first_matching_pkg(pattern)
target = f"{wheel}[{extras}]" if extras else wheel
logger.info("Installing %s...", target)
pip_install_packages([target], prefer_uv=pref_uv)
def run_python(args: Union[str, list[str]], env=None):
"""
Run the python in the current environment.
"""
if isinstance(args, str):
args = shlex.split(args)
cmd = [sys.executable] + args
run_command(" ".join(map(shlex.quote, cmd)), env=env)
def pkg_exists(name: str) -> bool:
try:
pkg_version = version(name)
logger.info("%s already exists with version: %s", name, pkg_version)
return True
except PackageNotFoundError:
logger.info("%s is not installed", name)
return False
def first_matching_pkg(pattern: str) -> str:
matches = sorted(glob.glob(pattern))
if not matches:
raise FileNotFoundError(f"No wheel matching: {pattern}")
return matches[0]

View File

@ -0,0 +1,139 @@
"""
General Utility helpers for CLI tasks.
"""
import logging
import os
import shlex
import subprocess
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import Optional
logger = logging.getLogger(__name__)
def run_command(
cmd: str,
use_shell: bool = False,
log_cmd: bool = True,
cwd: Optional[str] = None,
env: Optional[dict] = None,
check: bool = True,
) -> int:
"""Run a command with optional shell execution."""
if use_shell:
args = cmd
log_prefix = "[shell]"
executable = "/bin/bash"
else:
args = shlex.split(cmd)
log_prefix = "[cmd]"
executable = None
if log_cmd:
display_cmd = cmd if use_shell else " ".join(args)
logger.info("%s %s", log_prefix, display_cmd)
run_env = {**os.environ, **(env or {})}
proc = subprocess.run(
args,
shell=use_shell,
executable=executable,
stdout=sys.stdout,
stderr=sys.stderr,
cwd=cwd,
env=run_env,
check=False,
)
if check and proc.returncode != 0:
logger.error(
"%s Command failed (exit %s): %s", log_prefix, proc.returncode, cmd
)
raise subprocess.CalledProcessError(
proc.returncode, args if not use_shell else cmd
)
return proc.returncode
def str2bool(value: Optional[str]) -> bool:
"""Convert environment variables to boolean values."""
if not value:
return False
if not isinstance(value, str):
raise ValueError(
f"Expected a string value for boolean conversion, got {type(value)}"
)
value = value.strip().lower()
true_value_set = {"1", "true", "t", "yes", "y", "on", "enable", "enabled", "found"}
false_value_set = {"0", "false", "f", "no", "n", "off", "disable"}
if value in true_value_set:
return True
if value in false_value_set:
return False
raise ValueError(f"Invalid string value for boolean conversion: {value}")
@contextmanager
def temp_environ(updates: dict[str, str]):
"""
Temporarily set environment variables and restore them after the block.
Args:
updates: Dict of environment variables to set.
"""
missing = object()
old: dict[str, str | object] = {k: os.environ.get(k, missing) for k in updates}
try:
os.environ.update(updates)
yield
finally:
for k, v in old.items():
if v is missing:
os.environ.pop(k, None)
else:
os.environ[k] = v # type: ignore[arg-type]
@contextmanager
def working_directory(path: str):
"""
Temporarily change the working directory inside a context.
"""
if not path:
# No-op context
yield
return
prev_cwd = os.getcwd()
try:
os.chdir(path)
yield
finally:
os.chdir(prev_cwd)
def get_wheels(
output_dir: Path,
max_depth: Optional[int] = None,
) -> list[dict[str, str]]:
"""Return a list of wheels found in the given output directory."""
root = Path(output_dir)
if not root.exists():
return []
items = []
for dirpath, _, filenames in os.walk(root):
depth = Path(dirpath).relative_to(root).parts
if max_depth is not None and len(depth) > max_depth:
continue
for fname in sorted(filenames):
if fname.endswith(".whl"):
pkg = fname.split("-")[0]
relpath = str((Path(dirpath) / fname).relative_to(root))
items.append({"pkg": pkg, "relpath": relpath})
return items
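
Review note: `temp_environ` restores both overwritten and newly created variables on exit. The sketch below repeats the same save/restore logic so that it runs standalone; the variable names `DEMO_KEEP` and `DEMO_NEW` are made up for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(updates: dict[str, str]):
    """Same save/restore logic as the helper in the diff, copied for a standalone run."""
    missing = object()  # sentinel that distinguishes "unset" from an empty value
    old = {k: os.environ.get(k, missing) for k in updates}
    try:
        os.environ.update(updates)
        yield
    finally:
        for k, v in old.items():
            if v is missing:
                os.environ.pop(k, None)  # the variable did not exist before
            else:
                os.environ[k] = v  # put the previous value back


os.environ["DEMO_KEEP"] = "before"  # made-up variable names
os.environ.pop("DEMO_NEW", None)

with temp_environ({"DEMO_KEEP": "inside", "DEMO_NEW": "1"}):
    inside = (os.environ["DEMO_KEEP"], os.environ["DEMO_NEW"])

after = (os.environ["DEMO_KEEP"], "DEMO_NEW" in os.environ)
```

The sentinel object matters because `os.environ.get(k)` alone could not tell an unset variable from one that was set to an empty string.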


@ -0,0 +1,296 @@
import logging
import os
import textwrap
from typing import Any
from cli.lib.common.gh_summary import write_gh_step_summary
from cli.lib.common.git_helper import clone_external_repo
from cli.lib.common.pip_helper import pip_install_packages
from cli.lib.common.utils import run_command, temp_environ, working_directory
from jinja2 import Template
logger = logging.getLogger(__name__)
_TPL_VLLM_INFO = Template(
textwrap.dedent("""\
## Vllm against Pytorch CI Test Summary
**Vllm Commit**: [{{ vllm_commit }}](https://github.com/vllm-project/vllm/commit/{{ vllm_commit }})
{%- if torch_sha %}
**Pytorch Commit**: [{{ torch_sha }}](https://github.com/pytorch/pytorch/commit/{{ torch_sha }})
{%- endif %}
""")
)
def sample_vllm_test_library():
"""
Simple sample to unblock vLLM CI development. It mimics
https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml
See run_test_plan for more details.
"""
# TODO(elainewy): Read from yaml file to handle the env and tests for vllm
return {
"vllm_basic_correctness_test": {
"title": "Basic Correctness Test",
"id": "vllm_basic_correctness_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"steps": [
"pytest -v -s basic_correctness/test_cumem.py",
"pytest -v -s basic_correctness/test_basic_correctness.py",
"pytest -v -s basic_correctness/test_cpu_offload.py",
"VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py",
],
},
"vllm_basic_models_test": {
"title": "Basic models test",
"id": "vllm_basic_models_test",
"steps": [
"pytest -v -s models/test_transformers.py",
"pytest -v -s models/test_registry.py",
"pytest -v -s models/test_utils.py",
"pytest -v -s models/test_vision.py",
"pytest -v -s models/test_initialization.py",
],
},
"vllm_entrypoints_test": {
"title": "Entrypoints Test ",
"id": "vllm_entrypoints_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"steps": [
" ".join(
[
"pytest",
"-v",
"-s",
"entrypoints/llm",
"--ignore=entrypoints/llm/test_lazy_outlines.py",
"--ignore=entrypoints/llm/test_generate.py",
"--ignore=entrypoints/llm/test_generate_multiple_loras.py",
"--ignore=entrypoints/llm/test_collective_rpc.py",
]
),
"pytest -v -s entrypoints/llm/test_lazy_outlines.py",
"pytest -v -s entrypoints/llm/test_generate.py ",
"VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode",
],
},
"vllm_regression_test": {
"title": "Regression Test",
"id": "vllm_regression_test",
"package_install": ["modelscope"],
"steps": [
"pytest -v -s test_regression.py",
],
},
"vllm_lora_tp_test_distributed": {
"title": "LoRA TP Test (Distributed)",
"id": "vllm_lora_tp_test_distributed",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s -x lora/test_chatglm3_tp.py",
"pytest -v -s -x lora/test_llama_tp.py",
"pytest -v -s -x lora/test_llm_with_multi_loras.py",
],
},
"vllm_distributed_test_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
"vllm_lora_28_failure_test": {
"title": "LoRA pytorch 2.8 failure test",
"id": "vllm_lora_28_failure_test",
"steps": ["pytest -v lora/test_quant_model.py"],
},
"vllm_multi_model_processor_test": {
"title": "Multi-Modal Processor Test",
"id": "vllm_multi_model_processor_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py",
],
},
"vllm_multi_model_test_28_failure_test": {
"title": "Multi-Model Test (Failed 2.8 release)",
"id": "vllm_multi_model_test_28_failure_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/generation/test_voxtral.py",
"pytest -v -s models/multimodal/pooling",
],
},
"vllm_pytorch_compilation_unit_tests": {
"title": "PyTorch Compilation Unit Tests",
"id": "vllm_pytorch_compilation_unit_tests",
"steps": [
"pytest -v -s compile/test_pass_manager.py",
"pytest -v -s compile/test_fusion.py",
"pytest -v -s compile/test_fusion_attn.py",
"pytest -v -s compile/test_silu_mul_quant_fusion.py",
"pytest -v -s compile/test_sequence_parallelism.py",
"pytest -v -s compile/test_async_tp.py",
"pytest -v -s compile/test_fusion_all_reduce.py",
"pytest -v -s compile/test_decorator.py",
],
},
"vllm_languagde_model_test_extended_generation_28_failure_test": {
"title": "Language Models Test (Extended Generation) 2.8 release failure",
"id": "vllm_languagde_model_test_extended_generation_28_failure_test",
"package_install": [
"--no-build-isolation",
"git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8",
],
"steps": [
"pytest -v -s models/language/generation/test_mistral.py",
],
},
"vllm_distributed_test_2_gpu_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_2_gpu_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
# TODO(elainewy):need to add g6 with 4 gpus to run this test
"vllm_lora_test": {
"title": "LoRA Test %N",
"id": "lora_test",
"parallelism": 4,
"steps": [
"echo '[checking] list sharded lora tests:'",
" ".join(
[
"pytest -q --collect-only lora",
"--shard-id=$$BUILDKITE_PARALLEL_JOB",
"--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",
"--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",
]
),
"echo '[checking] Done. list lora tests'",
" ".join(
[
"pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB",
"--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",
"--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",
]
),
],
},
}
def check_parallelism(tests: Any, title: str, shard_id: int = 0, num_shards: int = 0):
"""
Check whether the test plan is sharded (parallel) and validate the shard arguments.
"""
parallelism = int(tests.get("parallelism", "0"))
is_parallel = parallelism and parallelism > 1
if not is_parallel:
return False
if shard_id > num_shards:
raise RuntimeError(
f"Test {title} expects {num_shards} shards, but invalid {shard_id} is provided"
)
if num_shards != parallelism:
raise RuntimeError(
f"Test {title} expects {parallelism} shards, but invalid {num_shards} is provided"
)
return True
def run_test_plan(
test_plan: str,
test_target: str,
tests_map: dict[str, Any],
shard_id: int = 0,
num_shards: int = 0,
):
"""
Run the list of test steps defined by the given test plan.
"""
logger.info("run %s tests.....", test_target)
if test_plan not in tests_map:
raise RuntimeError(
f"test {test_plan} not found, please add it to test plan pool"
)
tests = tests_map[test_plan]
pkgs = tests.get("package_install", [])
title = tests.get("title", "unknown test")
is_parallel = check_parallelism(tests, title, shard_id, num_shards)
if is_parallel:
title = title.replace("%N", f"{shard_id}/{num_shards}")
logger.info("Running tests: %s", title)
if pkgs:
logger.info("Installing packages: %s", pkgs)
pip_install_packages(packages=pkgs, prefer_uv=True)
with (
working_directory(tests.get("working_directory", "tests")),
temp_environ(tests.get("env_vars", {})),
):
failures = []
for step in tests["steps"]:
logger.info("Running step: %s", step)
if is_parallel:
step = replace_buildkite_placeholders(step, shard_id, num_shards)
logger.info("Running parallel step: %s", step)
code = run_command(cmd=step, check=False, use_shell=True)
if code != 0:
failures.append(step)
logger.info("Finish running step: %s", step)
if failures:
logger.error("Failed tests: %s", failures)
raise RuntimeError(f"{len(failures)} pytest runs failed: {failures}")
logger.info("Done. All tests passed")
def clone_vllm(dst: str = "vllm"):
_, commit = clone_external_repo(
target="vllm",
repo="https://github.com/vllm-project/vllm.git",
dst=dst,
update_submodules=True,
)
return commit
def replace_buildkite_placeholders(step: str, shard_id: int, num_shards: int) -> str:
mapping = {
"$$BUILDKITE_PARALLEL_JOB_COUNT": str(num_shards),
"$$BUILDKITE_PARALLEL_JOB": str(shard_id),
}
for k in sorted(mapping, key=len, reverse=True):
step = step.replace(k, mapping[k])
return step
def summarize_build_info(vllm_commit: str) -> bool:
torch_sha = os.getenv("GITHUB_SHA")
md = (
_TPL_VLLM_INFO.render(vllm_commit=vllm_commit, torch_sha=torch_sha).strip()
+ "\n"
)
return write_gh_step_summary(md)
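
Review note: `replace_buildkite_placeholders` sorts the keys longest-first because one placeholder is a prefix of the other. The sketch below contrasts that with a naive insertion-order replacement; `replace_placeholders` is a simplified stand-in written for this note, and the shard values are made up.

```python
def replace_placeholders(step: str, mapping: dict[str, str]) -> str:
    """Replace longer keys first so a short key never clobbers a longer one."""
    for key in sorted(mapping, key=len, reverse=True):
        step = step.replace(key, mapping[key])
    return step


# The shorter key is listed first on purpose to expose the prefix problem.
mapping = {
    "$$BUILDKITE_PARALLEL_JOB": "2",
    "$$BUILDKITE_PARALLEL_JOB_COUNT": "4",
}
step = (
    "pytest lora --shard-id=$$BUILDKITE_PARALLEL_JOB "
    "--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT"
)

good = replace_placeholders(step, mapping)

# Naive variant: replace in insertion order, so the short key fires first.
naive = step
for key, val in mapping.items():
    naive = naive.replace(key, val)
```

In the naive variant the `_COUNT` suffix is left behind after the shorter key has been substituted, which is the failure the length-based sort avoids.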


@ -0,0 +1,285 @@
import logging
import os
import textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from cli.lib.common.cli_helper import BaseRunner
from cli.lib.common.docker_helper import local_image_exists
from cli.lib.common.envs_helper import (
env_bool_field,
env_path_field,
env_str_field,
with_params_help,
)
from cli.lib.common.gh_summary import (
gh_summary_path,
summarize_content_from_file,
summarize_wheels,
)
from cli.lib.common.path_helper import (
copy,
ensure_dir_exists,
force_create_dir,
get_path,
is_path_exist,
)
from cli.lib.common.utils import run_command
from cli.lib.core.vllm.lib import clone_vllm, summarize_build_info
logger = logging.getLogger(__name__)
# Default path for docker build artifacts
_DEFAULT_RESULT_PATH = "./shared"
# Temp folder in vllm work place to cp torch whls in vllm work directory for docker build
_VLLM_TEMP_FOLDER = "tmp"
@dataclass
class VllmBuildParameters:
"""
Parameters defining the vllm external input configurations.
Combine with VllmDockerBuildArgs to define the vllm build environment
"""
# USE_TORCH_WHEEL: when true, use local Torch wheels; requires TORCH_WHEELS_PATH.
# Otherwise the docker build pulls torch nightly during the build
# TORCH_WHEELS_PATH: directory containing local torch wheels when use_torch_whl is True
use_torch_whl: bool = env_bool_field("USE_TORCH_WHEEL", True)
torch_whls_path: Path = env_path_field("TORCH_WHEELS_PATH", "./dist")
# USE_LOCAL_BASE_IMAGE: when true, use an existing local Docker base image; requires BASE_IMAGE
# Otherwise, pull dockerfile's default image remotely
# BASE_IMAGE: name:tag (only needed when use_local_base_image is True)
use_local_base_image: bool = env_bool_field("USE_LOCAL_BASE_IMAGE", True)
base_image: str = env_str_field("BASE_IMAGE")
# USE_LOCAL_DOCKERFILE: when true("1"), use a local Dockerfile; requires DOCKERFILE_PATH.
# otherwise, use vllm's default dockerfile.torch_nightly for build
# DOCKERFILE_PATH: path to Dockerfile used when use_local_dockerfile is True
use_local_dockerfile: bool = env_bool_field("USE_LOCAL_DOCKERFILE", True)
dockerfile_path: Path = env_path_field(
"DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile.tmp_vllm"
)
# OUTPUT_DIR: where docker buildx (local exporter) will write artifacts
output_dir: Path = env_path_field("OUTPUT_DIR", "external/vllm")
# --- Build args ----------------------------------------------------------
target_stage: str = env_str_field("TARGET_STAGE", "export-wheels")
tag_name: str = env_str_field("TAG", "vllm-wheels")
cuda_version: str = env_str_field("CUDA_VERSION", "12.8.1")
python_version: str = env_str_field("PYTHON_VERSION", "3.12")
max_jobs: str = env_str_field("MAX_JOBS", "64")
sccache_bucket: str = env_str_field("SCCACHE_BUCKET")
sccache_region: str = env_str_field("SCCACHE_REGION")
torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")
def __post_init__(self):
checks = [
(
self.use_torch_whl, # flag
True, # trigger_value
"torch_whls_path", # resource
is_path_exist, # check_func
"TORCH_WHEELS_PATH is not provided, but USE_TORCH_WHEEL is set to 1",
),
(
self.use_local_base_image,
True,
"base_image",
local_image_exists,
f"BASE_IMAGE {self.base_image} is not found, but USE_LOCAL_BASE_IMAGE is set to 1",
),
(
self.use_local_dockerfile,
True,
"dockerfile_path",
is_path_exist,
"DOCKERFILE_PATH is not found, but USE_LOCAL_DOCKERFILE is set to 1",
),
]
for flag, trigger_value, attr_name, check_func, error_msg in checks:
value = getattr(self, attr_name)
if flag == trigger_value:
if not value or not check_func(value):
raise ValueError(error_msg)
else:
logger.info("skipping %s check: its flag is not set", attr_name)
if not self.output_dir:
raise ValueError("missing required output_dir")
@with_params_help(VllmBuildParameters)
class VllmBuildRunner(BaseRunner):
"""
Build vLLM using docker buildx.
Environment variable options:
"USE_TORCH_WHEEL": "1: use local wheels; 0: pull nightly from pypi",
"TORCH_WHEELS_PATH": "Path to local wheels (when USE_TORCH_WHEEL=1)",
"USE_LOCAL_BASE_IMAGE": "1: use local base image; 0: default image",
"BASE_IMAGE": "name:tag to indicate base image the dockerfile depends on (when USE_LOCAL_BASE_IMAGE=1)",
"USE_LOCAL_DOCKERFILE": "1: use local Dockerfile; 0: vllm repo default dockerfile.torch_nightly",
"DOCKERFILE_PATH": "Path to Dockerfile (when USE_LOCAL_DOCKERFILE=1)",
"OUTPUT_DIR": "e.g. './shared'",
"TORCH_CUDA_ARCH_LIST": "e.g. '8.0' or '8.0;9.0'",
"CUDA_VERSION": "e.g. '12.8.1'",
"PYTHON_VERSION": "e.g. '3.12'",
"MAX_JOBS": "e.g. '64'",
"SCCACHE_BUCKET": "e.g. 'my-bucket'",
"SCCACHE_REGION": "e.g. 'us-west-2'",
"""
def __init__(self, args=None):
self.work_directory = "vllm"
def run(self):
"""
main function to run vllm build
1. prepare vllm build environment
2. prepare the docker build command args
3. run docker build
"""
inputs = VllmBuildParameters()
logger.info("Running vllm build with inputs: %s", inputs)
vllm_commit = clone_vllm()
self.cp_dockerfile_if_exist(inputs)
# copy torch wheels from the root directory to the vllm workspace if they exist
self.cp_torch_whls_if_exist(inputs)
# make sure the output dir to store the build artifacts exist
ensure_dir_exists(Path(inputs.output_dir))
cmd = self._generate_docker_build_cmd(inputs)
logger.info("Running docker build: \n %s", cmd)
try:
run_command(cmd, cwd="vllm", env=os.environ.copy())
finally:
self.genearte_vllm_build_summary(vllm_commit, inputs)
def genearte_vllm_build_summary(
self, vllm_commit: str, inputs: VllmBuildParameters
):
if not gh_summary_path():
return logger.info("Skipping: GH summary env var not detected")
logger.info("Generate GH Summary ...")
# summarize vllm build info
summarize_build_info(vllm_commit)
# summarize vllm build artifacts
vllm_artifact_dir = inputs.output_dir / "wheels"
summarize_content_from_file(
vllm_artifact_dir,
"build_summary.txt",
title="Vllm build env pip package summary",
)
summarize_wheels(
inputs.torch_whls_path, max_depth=3, title="Torch Wheels Artifacts"
)
summarize_wheels(vllm_artifact_dir, max_depth=3, title="Vllm Wheels Artifacts")
def cp_torch_whls_if_exist(self, inputs: VllmBuildParameters) -> str:
if not inputs.use_torch_whl:
return ""
tmp_dir = f"./{self.work_directory}/{_VLLM_TEMP_FOLDER}"
tmp_path = Path(tmp_dir)
force_create_dir(tmp_path)
copy(inputs.torch_whls_path, tmp_dir)
return tmp_dir
def cp_dockerfile_if_exist(self, inputs: VllmBuildParameters):
if not inputs.use_local_dockerfile:
logger.info("using vllm default dockerfile.torch_nightly for build")
return
dockerfile_path = get_path(inputs.dockerfile_path, resolve=True)
vllm_torch_dockerfile = Path(
f"./{self.work_directory}/docker/Dockerfile.nightly_torch"
)
copy(dockerfile_path, vllm_torch_dockerfile)
def get_result_path(self, path):
"""
Get the absolute path of the result path
"""
if not path:
path = _DEFAULT_RESULT_PATH
abs_path = get_path(path, resolve=True)
return abs_path
def _get_torch_wheel_path_arg(self, torch_whl_dir: Optional[Path]) -> str:
if not torch_whl_dir:
return ""
return f"--build-arg TORCH_WHEELS_PATH={_VLLM_TEMP_FOLDER}"
def _get_base_image_args(self, inputs: VllmBuildParameters) -> tuple[str, str, str]:
"""
Returns:
- base_image_arg: docker buildx arg string for base image
- final_base_image_arg: docker buildx arg string for vllm-base stage
- pull_flag: --pull=true or --pull=false depending on whether the image exists locally
"""
if not inputs.use_local_base_image:
return "", "", ""
base_image = inputs.base_image
# set both base image and final base image to the same local image
base_image_arg = f"--build-arg BUILD_BASE_IMAGE={base_image}"
final_base_image_arg = f"--build-arg FINAL_BASE_IMAGE={base_image}"
if local_image_exists(base_image):
pull_flag = "--pull=false"
return base_image_arg, final_base_image_arg, pull_flag
logger.info(
"[INFO] Local image not found: %s, will try to pull from remote", base_image
)
return base_image_arg, final_base_image_arg, ""
def _generate_docker_build_cmd(
self,
inputs: VllmBuildParameters,
) -> str:
base_image_arg, final_base_image_arg, pull_flag = self._get_base_image_args(
inputs
)
torch_arg = self._get_torch_wheel_path_arg(inputs.torch_whls_path)
return textwrap.dedent(
f"""
docker buildx build \
--output type=local,dest={inputs.output_dir} \
-f docker/Dockerfile.nightly_torch \
{pull_flag} \
{torch_arg} \
{base_image_arg} \
{final_base_image_arg} \
--build-arg max_jobs={inputs.max_jobs} \
--build-arg CUDA_VERSION={inputs.cuda_version} \
--build-arg PYTHON_VERSION={inputs.python_version} \
--build-arg USE_SCCACHE={int(bool(inputs.sccache_bucket and inputs.sccache_region))} \
--build-arg SCCACHE_BUCKET_NAME={inputs.sccache_bucket} \
--build-arg SCCACHE_REGION_NAME={inputs.sccache_region} \
--build-arg torch_cuda_arch_list='{inputs.torch_cuda_arch_list}' \
--target {inputs.target_stage} \
-t {inputs.tag_name} \
--progress=plain .
"""
).strip()
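
Review note: the build command above relies on two details. In a non-raw f-string a trailing backslash joins the physical lines, and `run_command` later tokenizes the result with `shlex.split`, so empty placeholders simply disappear. The sketch below shows this with made-up values and a shortened command; it does not invoke docker.

```python
import shlex
import textwrap

pull_flag = ""  # empty, as when no local base image is used
torch_arg = "--build-arg TORCH_WHEELS_PATH=tmp"  # made-up value

# Backslash-newline inside a non-raw f-string is a line continuation,
# so the rendered command contains no newline characters.
cmd = textwrap.dedent(
    f"""
    docker buildx build \
        {pull_flag} \
        {torch_arg} \
        --build-arg torch_cuda_arch_list='8.0;9.0' \
        --progress=plain .
    """
).strip()

# shlex.split collapses the whitespace runs left by empty placeholders
# and strips the single quotes that protect the ';' in the arch list.
argv = shlex.split(cmd)
```

Note that the single quotes around the arch list only matter when the string goes through a shell or `shlex.split`; after splitting, the `;` is an ordinary character inside one argument.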


@ -0,0 +1,269 @@
import logging
import os
import re
import subprocess
import sys
from collections.abc import Iterable
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any
from cli.lib.common.cli_helper import BaseRunner
from cli.lib.common.envs_helper import env_path_field, env_str_field, get_env
from cli.lib.common.path_helper import copy, remove_dir
from cli.lib.common.pip_helper import (
pip_install_first_match,
pip_install_packages,
pkg_exists,
run_python,
)
from cli.lib.common.utils import run_command, working_directory
from cli.lib.core.vllm.lib import clone_vllm, run_test_plan, sample_vllm_test_library
logger = logging.getLogger(__name__)
@dataclass
class VllmTestParameters:
"""
Parameters defining the vllm external test input
!!!DO NOT ADD SECRETS IN THIS CLASS!!!
you can put an environment variable name in VllmTestParameters if it is not the same as the secret one;
fetch secrets directly from env variables at runtime
"""
torch_whls_path: Path = env_path_field("WHEELS_PATH", "./dist")
vllm_whls_path: Path = env_path_field(
"VLLM_WHEELS_PATH", "./dist/external/vllm/wheels"
)
torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")
def __post_init__(self):
if not self.torch_whls_path.exists():
raise ValueError("missing torch_whls_path")
if not self.vllm_whls_path.exists():
raise ValueError("missing vllm_whls_path")
class TestInpuType(Enum):
TEST_PLAN = "test_plan"
UNKNOWN = "unknown"
class VllmTestRunner(BaseRunner):
def __init__(self, args: Any):
self.work_directory = "vllm"
self.test_plan = ""
self.test_type = TestInpuType.UNKNOWN
self.shard_id = args.shard_id
self.num_shards = args.num_shards
if args.test_plan:
self.test_plan = args.test_plan
self.test_type = TestInpuType.TEST_PLAN
# Matches the structure in the artifacts.zip from torch build
self.TORCH_WHL_PATH_REGEX = "torch*.whl"
self.TORCH_WHL_EXTRA = "opt-einsum"
self.TORCH_ADDITIONAL_WHLS_REGEX = [
"vision/torchvision*.whl",
"audio/torchaudio*.whl",
]
# Match the structure of the artifacts.zip from vllm external build
self.VLLM_TEST_WHLS_REGEX = [
"xformers/*.whl",
"vllm/vllm*.whl",
"flashinfer-python/flashinfer*.whl",
]
def prepare(self):
"""
Prepare the test environment for vllm: clone the vllm repo, install all wheels and test dependencies, and set env vars.
"""
params = VllmTestParameters()
logger.info("Display VllmTestParameters %s", params)
self._set_envs(params)
clone_vllm(dst=self.work_directory)
with working_directory(self.work_directory):
remove_dir(Path("vllm"))
self._install_wheels(params)
self._install_dependencies()
# verify the torches are not overridden by test dependencies
check_versions()
def run(self):
"""
main function to run vllm test
"""
self.prepare()
try:
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
else:
run_test_plan(
self.test_plan, "vllm", sample_vllm_test_library()
)
else:
raise ValueError(f"Unknown test type {self.test_type}")
finally:
# double check the torches are not overridden by other packages
check_versions()
def _install_wheels(self, params: VllmTestParameters):
logger.info("Running vllm test with inputs: %s", params)
if not pkg_exists("torch"):
# install torch from local whls if it's not installed yet.
torch_p = f"{str(params.torch_whls_path)}/{self.TORCH_WHL_PATH_REGEX}"
pip_install_first_match(torch_p, self.TORCH_WHL_EXTRA)
torch_whls_path = [
f"{str(params.torch_whls_path)}/{whl_path}"
for whl_path in self.TORCH_ADDITIONAL_WHLS_REGEX
]
for torch_whl in torch_whls_path:
pip_install_first_match(torch_whl)
logger.info("Done. Installed torch and other torch-related wheels ")
logger.info("Installing vllm wheels")
vllm_whls_path = [
f"{str(params.vllm_whls_path)}/{whl_path}"
for whl_path in self.VLLM_TEST_WHLS_REGEX
]
for vllm_whl in vllm_whls_path:
pip_install_first_match(vllm_whl)
logger.info("Done. Installed vllm wheels")
def _install_test_dependencies(self):
"""
This method replaces torch dependencies with local torch wheel info in
requirements/test.in file from vllm repo. then generates the test.txt
in runtime
"""
logger.info("generate test.txt from requirements/test.in with local torch whls")
preprocess_test_in()
copy("requirements/test.txt", "snapshot_constraint.txt")
run_command(
f"{sys.executable} -m uv pip compile requirements/test.in "
"-o test.txt "
"--index-strategy unsafe-best-match "
"--constraint snapshot_constraint.txt "
"--torch-backend cu128"
)
pip_install_packages(requirements="test.txt", prefer_uv=True)
logger.info("Done. installed requirements for test dependencies")
def _install_dependencies(self):
pip_install_packages(packages=["-e", "tests/vllm_test_utils"], prefer_uv=True)
pip_install_packages(packages=["hf_transfer"], prefer_uv=True)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# using script from vllm repo to remove all torch packages from requirements txt
run_python("use_existing_torch.py")
# install common packages
for requirements in ["requirements/common.txt", "requirements/build.txt"]:
pip_install_packages(
requirements=requirements,
prefer_uv=True,
)
# install test packages
self._install_test_dependencies()
def _set_envs(self, inputs: VllmTestParameters):
os.environ["TORCH_CUDA_ARCH_LIST"] = inputs.torch_cuda_arch_list
if not validate_cuda(get_env("TORCH_CUDA_ARCH_LIST")):
logger.warning(
"Missing supported TORCH_CUDA_ARCH_LIST. "
"Currently support TORCH_CUDA_ARCH_LIST env var "
"with supported arch [8.0, 8.9, 9.0]"
)
os.environ["HF_TOKEN"] = os.getenv("VLLM_TEST_HUGGING_FACE_TOKEN", "")
if not get_env("HF_TOKEN"):
raise ValueError(
"missing required HF_TOKEN, please set VLLM_TEST_HUGGING_FACE_TOKEN env var"
)
if not get_env("TORCH_CUDA_ARCH_LIST"):
raise ValueError(
"missing required TORCH_CUDA_ARCH_LIST, please set TORCH_CUDA_ARCH_LIST env var"
)
def preprocess_test_in(
target_file: str = "requirements/test.in", additional_packages: Iterable[str] = ()
):
"""
This modifies the target_file file in place in vllm work directory.
It removes torch and unwanted packages in target_file and replace with local torch whls
package with format "$WHEEL_PACKAGE_NAME @ file://<LOCAL_PATH>"
"""
additional_package_to_move = list(additional_packages or ())
pkgs_to_remove = [
"torch",
"torchvision",
"torchaudio",
"xformers",
"mamba_ssm",
] + additional_package_to_move
# Read current requirements
target_path = Path(target_file)
lines = target_path.read_text().splitlines()
pkgs_to_add = []
# Remove lines starting with the package names (==, @, >=) — case-insensitive
pattern = re.compile(rf"^({'|'.join(map(re.escape, pkgs_to_remove))})\s*(==|@|>=)", re.IGNORECASE)
kept_lines = [line for line in lines if not pattern.match(line)]
# Get local installed torch/vision/audio from pip freeze
# This is hacky, but it works
pip_freeze = subprocess.check_output([sys.executable, "-m", "pip", "freeze"], text=True)
header_lines = [
line
for line in pip_freeze.splitlines()
if re.match(
r"^(torch|torchvision|torchaudio)\s*@\s*file://", line, re.IGNORECASE
)
]
# Write back: header_lines + blank + kept_lines
out_lines = header_lines + [""] + kept_lines
if pkgs_to_add:
out_lines += [""] + pkgs_to_add
out = "\n".join(out_lines) + "\n"
target_path.write_text(out)
logger.info("[INFO] Updated %s", target_file)
def validate_cuda(value: str) -> bool:
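VALID_VALUES = {"8.0", "8.9", "9.0"}
# Accept space- or semicolon-separated lists; an empty list is not valid
archs = [v for v in re.split(r"[;\s]+", value or "") if v]
return bool(archs) and all(v in VALID_VALUES for v in archs)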
def check_versions():
"""
Log the versions of the installed torch and vllm related packages.
"""
logger.info("Double check installed packages")
patterns = ["torch", "xformers", "torchvision", "torchaudio", "vllm"]
for pkg in patterns:
pkg_exists(pkg)
logger.info("Done. checked installed packages")
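
Review note: the filter in `preprocess_test_in` only drops lines where a listed package name is followed directly by `==`, `@`, or `>=`. The sketch below runs the same kind of pattern on an in-memory list; the requirement lines are made up, and `re.escape` is added here as a precaution for names that contain regex metacharacters.

```python
import re

pkgs_to_remove = ["torch", "torchvision", "torchaudio", "xformers", "mamba_ssm"]
pattern = re.compile(
    rf"^({'|'.join(map(re.escape, pkgs_to_remove))})\s*(==|@|>=)",
    re.IGNORECASE,
)

# Made-up requirement lines for illustration.
lines = [
    "torch==2.8.0",
    "torchvision >= 0.23",
    "Torchaudio @ file:///tmp/torchaudio.whl",
    "torchmetrics==1.4",
    "numpy>=1.26",
    "# torch==2.8.0 is pinned elsewhere",
]

# Keep everything that does not start with a listed package plus a specifier.
kept = [line for line in lines if not pattern.match(line)]
```

A package such as `torchmetrics` survives because the name must be followed immediately by one of the three specifiers. By the same rule, a line like `torch~=2.8` or a bare `torch` would not be removed.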

.ci/lumen_cli/cli/run.py Normal file

@ -0,0 +1,40 @@
# run.py
import argparse
import logging
from cli.build_cli.register_build import register_build_commands
from cli.lib.common.logger import setup_logging
from cli.test_cli.register_test import register_test_commands
logger = logging.getLogger(__name__)
def main():
# Define top-level parser
parser = argparse.ArgumentParser(description="Lumos CLI")
subparsers = parser.add_subparsers(dest="command", required=True)
parser.add_argument(
"--log-level", default="INFO", help="Log level (DEBUG, INFO, WARNING, ERROR)"
)
# registers second-level subcommands
register_build_commands(subparsers)
register_test_commands(subparsers)
# parse args after all options are registered
args = parser.parse_args()
# setup global logging
setup_logging(getattr(logging, args.log_level.upper(), logging.INFO))
logger.debug("Parsed args: %s", args)
if hasattr(args, "func"):
args.func(args)
else:
parser.print_help()
if __name__ == "__main__":
main()
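
Review note: `main()` dispatches through `args.func`, which suggests each registered subcommand attaches a callable to the parsed namespace. The source of `register_targets` is not shown here, so the sketch below is an assumption about that pattern, using only standard `argparse` calls and made-up names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--log-level", default="INFO")
    sub = parser.add_subparsers(dest="command", required=True)

    # Each subcommand stores its handler under the "func" attribute.
    build = sub.add_parser("build")
    build.set_defaults(func=lambda ns: f"build at {ns.log_level}")
    return parser


# Top-level options such as --log-level must come before the subcommand.
args = build_parser().parse_args(["--log-level", "debug", "build"])
result = args.func(args)
```

With `required=True` on the subparsers, a missing subcommand makes `argparse` exit with a usage error, so the `parser.print_help()` fallback is reached only when a subcommand sets no `func`.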


@ -0,0 +1,62 @@
import argparse
import logging
from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec
from cli.lib.core.vllm.vllm_test import VllmTestRunner
logger = logging.getLogger(__name__)
# Maps targets to their argparse configuration and runner
# each entry adds a new target reachable as: python -m cli.run test external {target}
_TARGETS: dict[str, TargetSpec] = {
"vllm": {
"runner": VllmTestRunner,
"help": "test vLLM with pytorch main",
}
# add yours ...
}
def common_args(parser: argparse.ArgumentParser) -> None:
"""
Add common CLI arguments to the given parser.
"""
parser.add_argument(
"--shard-id",
type=int,
default=1,
help="the id of the shard to run (one of e.g. 0, 1, 2, 3)",
)
parser.add_argument(
"--num-shards",
type=int,
default=1,
help="a number of shards to run, e.g. '4'",
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
"-tp",
"--test-plan",
type=str,
help="a pre-defined test plan to run, e.g. 'basic_correctness_test'",
)
def register_test_commands(subparsers: argparse._SubParsersAction) -> None:
build_parser = subparsers.add_parser(
"test",
help="test related commands",
formatter_class=RichHelp,
)
build_subparsers = build_parser.add_subparsers(dest="test_command", required=True)
overview = "\n".join(
f" {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()
)
external_parser = build_subparsers.add_parser(
"external",
help="Test external targets",
description="Test third-party targets.\n\nAvailable targets:\n" + overview,
formatter_class=RichHelp,
)
register_targets(external_parser, _TARGETS, common_args=common_args)


@ -0,0 +1,23 @@
[project]
name = "lumen-ci"
version = "0.1.0"
dependencies = [
"pyyaml==6.0.2",
"GitPython==3.1.45",
"docker==7.1.0",
"pytest==7.3.2",
"uv==0.8.6"
]
[tool.setuptools]
packages = ["cli"]
[tool.setuptools.package-dir]
cli = "cli"
[tool.ruff.lint]
# Enable preview mode for linting
preview = true
# Now you can select your preview rules, like RUF048
extend-select = ["RUF048"]


@ -0,0 +1,47 @@
# tests/test_cli.py
import io
import sys
import unittest
from contextlib import redirect_stderr, redirect_stdout
from unittest.mock import patch
from cli.run import main
class TestArgparseCLI(unittest.TestCase):
@patch("cli.build_cli.register_build.VllmBuildRunner.run", return_value=None)
@patch("cli.build_cli.register_build.VllmBuildRunner.__init__", return_value=None)
def test_cli_run_build_external(self, mock_init, mock_run):
from cli.run import main # import after patches if needed
test_args = ["cli.run", "build", "external", "vllm"]
with patch.object(sys, "argv", test_args):
# argparse may call sys.exit on error; capture to avoid test aborts
try:
main()
except SystemExit:
pass
mock_init.assert_called_once() # got constructed
mock_run.assert_called_once_with() # run() called
def test_build_help(self):
test_args = ["cli.run", "build", "--help"]
with patch.object(sys, "argv", test_args):
stdout = io.StringIO()
stderr = io.StringIO()
# --help always raises SystemExit(0)
with self.assertRaises(SystemExit) as cm:
with redirect_stdout(stdout), redirect_stderr(stderr):
main()
self.assertEqual(cm.exception.code, 0)
output = stdout.getvalue()
self.assertIn("usage", output)
self.assertIn("external", output)
if __name__ == "__main__":
unittest.main()
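
Review note: `test_build_help` depends on `argparse` printing help to stdout and then raising `SystemExit(0)`. The sketch below shows that behaviour on a throwaway parser (`demo` is a made-up program name), without the project's CLI.

```python
import argparse
import io
from contextlib import redirect_stdout

parser = argparse.ArgumentParser(prog="demo")
buf = io.StringIO()

try:
    # --help prints the usage text to stdout, then exits with status 0.
    with redirect_stdout(buf):
        parser.parse_args(["--help"])
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code

help_text = buf.getvalue()
```

Argument errors, by contrast, exit with status 2 and write to stderr, which is why the test captures both streams but asserts only on stdout.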


@ -0,0 +1,115 @@
import argparse
import io
import unittest
from contextlib import redirect_stderr
from unittest.mock import patch
from cli.lib.common.cli_helper import BaseRunner, register_targets, RichHelp, TargetSpec
# ---- Dummy runners for unit tests ----
class FooRunner(BaseRunner):
"""Foo description from docstring."""
def run(self) -> None: # replaced by mock
pass
class BarRunner(BaseRunner):
def run(self) -> None: # replaced by mock
pass
def add_foo_args(p: argparse.ArgumentParser) -> None:
p.add_argument("--x", type=int, required=True, help="x value")
def common_args(p: argparse.ArgumentParser) -> None:
p.add_argument("--verbose", action="store_true", help="verbose flag")
def build_parser(specs: dict[str, TargetSpec]) -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(prog="app", formatter_class=RichHelp)
register_targets(
parser=parser,
target_specs=specs,
common_args=common_args,
)
return parser
def get_subparser(
parser: argparse.ArgumentParser, name: str
) -> argparse.ArgumentParser:
subparsers_action = next(
a
for a in parser._subparsers._group_actions # type: ignore[attr-defined]
if isinstance(a, argparse._SubParsersAction)
)
return subparsers_action.choices[name]
class TestRegisterTargets(unittest.TestCase):
def test_metavar_lists_targets(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
"bar": {"runner": BarRunner},
}
parser = build_parser(specs)
subparsers_action = next(
a
for a in parser._subparsers._group_actions # type: ignore[attr-defined]
if isinstance(a, argparse._SubParsersAction)
)
self.assertEqual(subparsers_action.metavar, "{foo,bar}")
def test_add_arguments_and_common_args_present(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
foo = get_subparser(parser, "foo")
help_text = foo.format_help()
self.assertIn("--x", help_text)
self.assertIn("--verbose", help_text)
def test_runner_constructed_with_ns_and_run_called(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
with (
patch.object(FooRunner, "__init__", return_value=None) as mock_init,
patch.object(FooRunner, "run", return_value=None) as mock_run,
):
ns = parser.parse_args(["foo", "--x", "3", "--verbose"])
ns.func(ns) # set by register_targets
# __init__ received the Namespace
self.assertEqual(mock_init.call_count, 1)
(called_ns,), _ = mock_init.call_args
self.assertIsInstance(called_ns, argparse.Namespace)
# run() called with no args
mock_run.assert_called_once_with()
def test_runner_docstring_used_as_description_when_missing(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
foo = get_subparser(parser, "foo")
help_text = foo.format_help()
self.assertIn("Foo description from docstring.", help_text)
def test_missing_target_raises_systemexit_with_usage(self):
specs: dict[str, TargetSpec] = {"foo": {"runner": FooRunner}}
parser = build_parser(specs)
buf = io.StringIO()
with self.assertRaises(SystemExit), redirect_stderr(buf):
parser.parse_args([])
err = buf.getvalue()
self.assertIn("usage:", err)
if __name__ == "__main__":
unittest.main()
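
The tests above pin down the observable behavior of `register_targets`: a `{foo,bar}` metavar, per-target and common arguments, the runner docstring as the fallback description, and a `func` default that constructs the runner with the parsed namespace and calls `run()`. A minimal sketch of a helper with that behavior, assuming nothing about the real `cli.lib.common.cli_helper` code, might look like the following. The names `register_targets_sketch`, `DemoRunner`, and `add_demo_args` are hypothetical, and unlike the real helper this sketch returns the subparsers action for convenience.

```python
import argparse


def register_targets_sketch(parser, target_specs, common_args=lambda p: None):
    """Hypothetical stand-in for register_targets; not the project's code."""
    subparsers = parser.add_subparsers(
        dest="target",
        required=True,
        # Produces e.g. "{foo,bar}", the form the metavar test expects.
        metavar="{" + ",".join(target_specs) + "}",
    )
    for name, spec in target_specs.items():
        runner_cls = spec["runner"]
        # Fall back to the runner's docstring when no description is given.
        description = spec.get("description") or (runner_cls.__doc__ or "")
        sub = subparsers.add_parser(
            name, help=spec.get("help"), description=description
        )
        common_args(sub)
        add_arguments = spec.get("add_arguments")
        if add_arguments is not None:
            add_arguments(sub)
        # Bind runner_cls as a default argument so each target keeps its own class.
        sub.set_defaults(func=lambda ns, cls=runner_cls: cls(ns).run())
    return subparsers


class DemoRunner:
    """Demo description."""

    calls = []

    def __init__(self, ns):
        self.ns = ns

    def run(self):
        DemoRunner.calls.append(self.ns.x)


def add_demo_args(p):
    p.add_argument("--x", type=int, required=True)


parser = argparse.ArgumentParser(prog="app")
action = register_targets_sketch(
    parser, {"demo": {"runner": DemoRunner, "add_arguments": add_demo_args}}
)
ns = parser.parse_args(["demo", "--x", "3"])
ns.func(ns)  # constructs DemoRunner(ns) and calls run()
```

Binding the class through a default argument avoids the late-binding closure pitfall, where every target would otherwise dispatch to the last runner in the loop.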

@ -0,0 +1,75 @@
import unittest
from unittest import mock
from unittest.mock import MagicMock
import docker.errors as derr
from cli.lib.common.docker_helper import _get_client, local_image_exists
class TestDockerImageHelpers(unittest.TestCase):
def setUp(self):
# Reset the singleton in the target module
patcher = mock.patch("cli.lib.common.docker_helper._docker_client", None)
self.addCleanup(patcher.stop)
patcher.start()
def test_local_image_exists_true(self):
# Mock a docker client whose images.get returns an object (no exception)
mock_client = MagicMock()
mock_client.images.get.return_value = object()
ok = local_image_exists("repo:tag", client=mock_client)
self.assertTrue(ok)
def test_local_image_exists_not_found_false(self):
mock_client = MagicMock()
# Raise docker.errors.NotFound
mock_client.images.get.side_effect = derr.NotFound("nope")
ok = local_image_exists("missing:latest", client=mock_client)
self.assertFalse(ok)
def test_local_image_exists_api_error_false(self):
mock_client = MagicMock()
mock_client.images.get.side_effect = derr.APIError("boom", None)
ok = local_image_exists("broken:tag", client=mock_client)
self.assertFalse(ok)
def test_local_image_exists_uses_lazy_singleton(self):
# Patch docker.from_env used by _get_client()
with mock.patch(
"cli.lib.common.docker_helper.docker.from_env"
) as mock_from_env:
mock_docker_client = MagicMock()
mock_from_env.return_value = mock_docker_client
# First call should create and cache the client
c1 = _get_client()
self.assertIs(c1, mock_docker_client)
mock_from_env.assert_called_once()
# Second call should reuse cached client (no extra from_env calls)
c2 = _get_client()
self.assertIs(c2, mock_docker_client)
mock_from_env.assert_called_once() # still once
def test_local_image_exists_without_client_param_calls_get_client_once(self):
# Ensure _get_client is called and cached; local_image_exists should reuse it
with mock.patch("cli.lib.common.docker_helper._get_client") as mock_get_client:
mock_client = MagicMock()
mock_get_client.return_value = mock_client
# 1st call
local_image_exists("repo:tag")
# 2nd call
local_image_exists("repo:tag2")
# local_image_exists should call _get_client each time;
# _get_client itself caches the result of docker.from_env.
self.assertEqual(mock_get_client.call_count, 2)
self.assertEqual(mock_client.images.get.call_count, 2)
mock_client.images.get.assert_any_call("repo:tag")
mock_client.images.get.assert_any_call("repo:tag2")
if __name__ == "__main__":
unittest.main()
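
The tests above describe two behaviors: a lazily created, cached client, and an existence check that maps lookup errors to `False`. The sketch below illustrates that pattern without importing the `docker` package; the client factory and the exception types are injected, and all names (`get_client_sketch`, `image_exists_sketch`, `FakeNotFound`) are hypothetical rather than the module's real API.

```python
from unittest.mock import MagicMock

_cache = {}  # holds the single client instance once created


def get_client_sketch(factory):
    """Create the client on first use, then reuse it (lazy singleton)."""
    if "client" not in _cache:
        _cache["client"] = factory()
    return _cache["client"]


def image_exists_sketch(ref, client, lookup_errors):
    """Return True if client.images.get(ref) succeeds, False on a lookup error."""
    try:
        client.images.get(ref)
        return True
    except lookup_errors:
        return False


class FakeNotFound(Exception):
    """Stand-in for the exception a real client raises for a missing image."""


factory = MagicMock(return_value=MagicMock())
first = get_client_sketch(factory)
second = get_client_sketch(factory)  # served from the cache

missing_client = MagicMock()
missing_client.images.get.side_effect = FakeNotFound("nope")
found = image_exists_sketch("repo:tag", first, (FakeNotFound,))
missing = image_exists_sketch("missing:latest", missing_client, (FakeNotFound,))
```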

@ -0,0 +1,149 @@
import os
import unittest
from dataclasses import dataclass
from pathlib import Path
from unittest.mock import patch
import cli.lib.common.envs_helper as m
class TestEnvHelpers(unittest.TestCase):
def setUp(self):
# Keep a copy of the original environment to restore later
self._env_backup = dict(os.environ)
def tearDown(self):
# Restore environment to original state
os.environ.clear()
os.environ.update(self._env_backup)
# -------- get_env --------
def test_get_env_unset_returns_default(self):
with patch.dict(os.environ, {}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "default")
def test_get_env_empty_returns_default(self):
with patch.dict(os.environ, {"FOO": ""}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "default")
def test_get_env_set_returns_value(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "bar")
def test_get_env_not_exist_returns_default(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("TEST_NOT_EXIST", "default"), "default")
def test_get_env_not_exist_without_default(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("TEST_NOT_EXIST"), "")
# -------- env_bool --------
def test_env_bool_uses_default_when_unset(self):
with patch.dict(os.environ, {}, clear=True):
self.assertTrue(m.env_bool("FLAG", default=True))
self.assertFalse(m.env_bool("FLAG", default=False))
def test_env_bool_uses_str2bool_when_set(self):
# Patch str2bool used by env_bool so we don't depend on its exact behavior
def fake_str2bool(s: str) -> bool:
return s.lower() in {"1", "true", "yes", "on", "y"}
with (
patch.dict(os.environ, {"FLAG": "yEs"}, clear=True),
patch.object(m, "str2bool", fake_str2bool),
):
self.assertTrue(m.env_bool("FLAG", default=False))
# -------- env_path_optional / env_path --------
def test_env_path_optional_unset_returns_none_by_default(self):
with patch.dict(os.environ, {}, clear=True):
self.assertIsNone(m.env_path_optional("P"))
def test_env_path_optional_unset_returns_none_when_env_var_is_empty(self):
with patch.dict(os.environ, {"P": ""}, clear=True):
self.assertIsNone(m.env_path_optional("P"))
def test_env_path_optional_unset_returns_default_str(self):
# default as string; resolve=True by default -> absolute path
default_str = "x/y"
with patch.dict(os.environ, {}, clear=True):
p = m.env_path_optional("P", default=default_str)
self.assertIsInstance(p, Path)
self.assertIsNotNone(p)
if p:
self.assertTrue(p.is_absolute())
self.assertEqual(p.parts[-2:], ("x", "y"))
def test_env_path_optional_unset_returns_default_path_no_resolve(self):
d = Path("z")
with patch.dict(os.environ, {}, clear=True):
p = m.env_path_optional("P", default=d, resolve=False)
self.assertEqual(p, d)
def test_env_path_optional_respects_resolve_true(self):
with patch.dict(os.environ, {"P": "a/b"}, clear=True):
p = m.env_path_optional("P", resolve=True)
self.assertIsInstance(p, Path)
if p:
self.assertTrue(p.is_absolute())
def test_env_path_optional_respects_resolve_false(self):
with patch.dict(os.environ, {"P": "rel/dir"}, clear=True):
p = m.env_path_optional("P", resolve=False)
self.assertEqual(p, Path("rel/dir"))
if p:
self.assertFalse(p.is_absolute())
def test_env_path_raises_when_missing_and_default_none(self):
with patch.dict(os.environ, {}, clear=True):
with self.assertRaises(ValueError):
m.env_path("P", None, resolve=True)
def test_env_path_returns_path_when_present(self):
tmp = Path("./b").resolve()
with patch.dict(os.environ, {"P": str(tmp)}, clear=True):
p = m.env_path("P", None, resolve=True)
self.assertEqual(p, tmp)
# -------- dataclass field helpers --------
def test_dataclass_fields_read_env_at_instantiation(self):
@dataclass
class Cfg:
flag: bool = m.env_bool_field("FLAG", default=False)
out: Path = m.env_path_field("OUT", default="ab", resolve=True)
name: str = m.env_str_field("NAME", default="anon")
# First instantiation
with patch.dict(
os.environ, {"FLAG": "true", "OUT": "outdir", "NAME": "alice"}, clear=True
):
cfg1 = Cfg()
self.assertTrue(cfg1.flag)
self.assertIsInstance(cfg1.out, Path)
self.assertTrue(cfg1.out.is_absolute())
self.assertEqual(cfg1.name, "alice")
cfg1.name = "bob" # change instance value
self.assertEqual(cfg1.name, "bob") # change is reflected
# Change env; new instance should reflect new values
with patch.dict(os.environ, {"FLAG": "false", "NAME": ""}, clear=True):
cfg2 = Cfg()
self.assertFalse(cfg2.flag) # str2bool("false") -> False
self.assertTrue("ab" in str(cfg2.out))
self.assertIsInstance(cfg2.out, Path)
self.assertTrue(cfg2.out.is_absolute())
self.assertEqual(cfg2.name, "anon") # empty -> fallback to default
def test_dataclass_path_field_with_default_value(self):
@dataclass
class C2:
out: Path = m.env_path_field("OUT", default="some/dir", resolve=False)
with patch.dict(os.environ, {}, clear=True):
c = C2()
self.assertEqual(c.out, Path("some/dir"))
if __name__ == "__main__":
unittest.main()
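
The dataclass test above depends on fields that read the environment when an instance is created, not when the class is defined. One way to get that behavior is `dataclasses.field(default_factory=...)`. The sketch below assumes that approach and uses hypothetical names (`get_env_sketch`, `env_path_optional_sketch`, `env_str_field_sketch`); the real `envs_helper` module may be implemented differently.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


def get_env_sketch(name, default=""):
    # Unset and empty values both fall back to the default.
    return os.environ.get(name) or default


def env_path_optional_sketch(name, default=None, resolve=True):
    raw = get_env_sketch(name) or default
    if raw is None or raw == "":
        return None
    path = Path(raw)
    return path.resolve() if resolve else path


def env_str_field_sketch(name, default=""):
    # default_factory runs on every instantiation, so each new instance
    # sees the environment as it is at that moment.
    return field(default_factory=lambda: get_env_sketch(name, default))


@dataclass
class SketchCfg:
    name: str = env_str_field_sketch("SKETCH_NAME", default="anon")


os.environ.pop("SKETCH_NAME", None)
unset_name = SketchCfg().name
os.environ["SKETCH_NAME"] = "alice"
set_name = SketchCfg().name
os.environ["SKETCH_NAME"] = ""
empty_name = SketchCfg().name
os.environ.pop("SKETCH_NAME", None)
```

A plain default such as `name: str = get_env_sketch("SKETCH_NAME")` would be evaluated once at class definition time, which is why the factory form matters here.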

@ -0,0 +1,122 @@
# test_path_utils.py
# Run: pytest -q
import os
import unittest
from pathlib import Path
from tempfile import TemporaryDirectory
from cli.lib.common.path_helper import (
copy,
ensure_dir_exists,
force_create_dir,
get_path,
is_path_exist,
remove_dir,
)
class TestPathHelper(unittest.TestCase):
def setUp(self):
self.tmpdir = TemporaryDirectory()
self.tmp_path = Path(self.tmpdir.name)
def tearDown(self):
self.tmpdir.cleanup()
# -------- get_path --------
def test_get_path_returns_path_for_str(self):
# Use relative path to avoid absolute-ness
rel_str = "sub/f.txt"
os.chdir(self.tmp_path)
p = get_path(rel_str, resolve=False)
self.assertIsInstance(p, Path)
self.assertFalse(p.is_absolute())
self.assertEqual(str(p), rel_str)
def test_get_path_resolves(self):
rel_str = "sub/f.txt"
p = get_path(str(self.tmp_path / rel_str), resolve=True)
self.assertTrue(p.is_absolute())
self.assertTrue(str(p).endswith(rel_str))
def test_get_path_with_path_input(self):
p_in = self.tmp_path / "sub/f.txt"
p_out = get_path(p_in, resolve=False)
self.assertTrue(str(p_out) == str(p_in))
def test_get_path_with_none_raises(self):
with self.assertRaises(ValueError):
get_path(None) # type: ignore[arg-type]
def test_get_path_invalid_type_raises(self):
with self.assertRaises(TypeError):
get_path(123) # type: ignore[arg-type]
# -------- ensure_dir_exists / force_create_dir / remove_dir --------
def test_ensure_dir_exists_creates_and_is_idempotent(self):
d = self.tmp_path / "made"
ensure_dir_exists(d)
self.assertTrue(d.exists() and d.is_dir())
ensure_dir_exists(d)
def test_force_create_dir_clears_existing(self):
d = self.tmp_path / "fresh"
(d / "inner").mkdir(parents=True)
(d / "inner" / "f.txt").write_text("x")
force_create_dir(d)
self.assertTrue(d.exists())
self.assertEqual(list(d.iterdir()), [])
def test_remove_dir_none_is_noop(self):
remove_dir(None) # type: ignore[arg-type]
def test_remove_dir_nonexistent_is_noop(self):
ghost = self.tmp_path / "ghost"
remove_dir(ghost)
def test_remove_dir_accepts_str(self):
d = self.tmp_path / "to_rm"
d.mkdir()
remove_dir(str(d))
self.assertFalse(d.exists())
# -------- copy --------
def test_copy_file_to_file(self):
src = self.tmp_path / "src.txt"
dst = self.tmp_path / "out" / "dst.txt"
src.write_text("hello")
copy(src, dst)
self.assertEqual(dst.read_text(), "hello")
def test_copy_dir_to_new_dir(self):
src = self.tmp_path / "srcdir"
(src / "a").mkdir(parents=True)
(src / "a" / "f.txt").write_text("content")
dst = self.tmp_path / "destdir"
copy(src, dst)
self.assertEqual((dst / "a" / "f.txt").read_text(), "content")
def test_copy_dir_into_existing_dir_overwrite_true_merges(self):
src = self.tmp_path / "srcdir"
dst = self.tmp_path / "destdir"
(src / "x").mkdir(parents=True)
(src / "x" / "new.txt").write_text("new")
dst.mkdir()
(dst / "existing.txt").write_text("old")
copy(src, dst)
self.assertEqual((dst / "existing.txt").read_text(), "old")
self.assertEqual((dst / "x" / "new.txt").read_text(), "new")
def test_is_str_path_exist(self):
p = self.tmp_path / "x.txt"
p.write_text("1")
self.assertTrue(is_path_exist(str(p)))
self.assertTrue(is_path_exist(p))
self.assertFalse(is_path_exist(str(self.tmp_path / "missing")))
self.assertFalse(is_path_exist(self.tmp_path / "missing"))
self.assertFalse(is_path_exist(""))
if __name__ == "__main__":
unittest.main()

@ -0,0 +1,185 @@
# tests/test_run_test_plan.py
import importlib
from contextlib import nullcontext
from types import SimpleNamespace
from unittest.mock import MagicMock
import pytest
MOD = "cli.lib.core.vllm.lib"
# We import inside tests so the MOD override above applies everywhere
run_test_plan_import_path = f"{MOD}.run_test_plan"
def _get_cmd(c):
# Support both kwargs and positional args
return c.kwargs.get("cmd", c.args[0] if c.args else None)
def _get_check(c):
if "check" in c.kwargs:
return c.kwargs["check"]
# If positional, assume second arg is 'check' when present; default False
return c.args[1] if len(c.args) > 1 else False
@pytest.fixture
def patch_module(monkeypatch):
"""
Patch helpers ('pip_install_packages', 'temp_environ', 'working_directory',
'run_command', 'logger') inside the target module and expose them.
"""
module = importlib.import_module(MOD)
# Create fakes/mocks
pip_install_packages = MagicMock(name="pip_install_packages")
run_command = MagicMock(name="run_command", return_value=0)
# temp_environ / working_directory: record calls but act as context managers
temp_calls: list[dict] = []
workdir_calls: list[str] = []
def fake_working_directory(path: str):
workdir_calls.append(path)
return nullcontext()
def fake_temp_env(map: dict[str, str]):
temp_calls.append(map)
return nullcontext()
logger = SimpleNamespace(
info=MagicMock(name="logger.info"),
error=MagicMock(name="logger.error"),
)
# Apply patches (raise if attribute doesn't exist)
monkeypatch.setattr(
module, "pip_install_packages", pip_install_packages, raising=True
)
monkeypatch.setattr(module, "run_command", run_command, raising=True)
monkeypatch.setattr(
module, "working_directory", fake_working_directory, raising=True
)
monkeypatch.setattr(module, "temp_environ", fake_temp_env, raising=True)
monkeypatch.setattr(module, "logger", logger, raising=True)
return SimpleNamespace(
module=module,
run_test_plan=module.run_test_plan, # expose to avoid getattr("constant") (Ruff B009)
pip_install_packages=pip_install_packages,
run_command=run_command,
temp_calls=temp_calls,
workdir_calls=workdir_calls,
logger=logger,
)
def test_success_runs_all_steps_and_uses_env_and_workdir(monkeypatch, patch_module):
run_test_plan = patch_module.run_test_plan
tests_map = {
"basic": {
"title": "Basic suite",
"package_install": [],
"working_directory": "tests",
"env_vars": {"GLOBAL_FLAG": "1"},
"steps": [
"export A=x && pytest -q",
"export B=y && pytest -q tests/unit",
],
}
}
# Exit codes returned by run_command, one per call; the two steps consume the first two
patch_module.run_command.side_effect = [0, 0, 0]
run_test_plan("basic", "cpu", tests_map)
calls = patch_module.run_command.call_args_list
cmds = [_get_cmd(c) for c in calls]
checks = [_get_check(c) for c in calls]
assert cmds == [
"export A=x && pytest -q",
"export B=y && pytest -q tests/unit",
]
assert all(chk is False for chk in checks)
assert patch_module.workdir_calls == ["tests"]
assert patch_module.temp_calls == [{"GLOBAL_FLAG": "1"}]
def test_installs_packages_when_present(monkeypatch, patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"with_pkgs": {
"title": "Needs deps",
"package_install": ["timm==1.0.0", "flash-attn"],
"steps": ["pytest -q"],
}
}
patch_module.run_command.return_value = 0
run_test_plan("with_pkgs", "gpu", tests_map)
patch_module.pip_install_packages.assert_called_once_with(
packages=["timm==1.0.0", "flash-attn"],
prefer_uv=True,
)
def test_raises_on_missing_plan(patch_module):
run_test_plan = patch_module.module.run_test_plan
with pytest.raises(RuntimeError) as ei:
run_test_plan("nope", "cpu", tests_map={})
assert "test nope not found" in str(ei.value)
def test_aggregates_failures_and_raises(monkeypatch, patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"mix": {
"title": "Some pass some fail",
"steps": [
"pytest test_a.py", # 0 → pass
"pytest test_b.py", # 1 → fail
"pytest test_c.py", # 2 → fail
],
}
}
# Simulate pass, fail, fail
patch_module.run_command.side_effect = [0, 1, 2]
with pytest.raises(RuntimeError) as ei:
run_test_plan("mix", "cpu", tests_map)
msg = str(ei.value)
assert "2 pytest runs failed" in msg
# Ensure logger captured failed tests list
patch_module.logger.error.assert_called_once()
# And we attempted all three commands
assert patch_module.run_command.call_count == 3
def test_custom_working_directory_used(patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"customwd": {
"title": "Custom wd",
"working_directory": "examples/ci",
"steps": ["pytest -q"],
}
}
patch_module.run_command.return_value = 0
run_test_plan("customwd", "cpu", tests_map)
assert patch_module.workdir_calls == ["examples/ci"]
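
The failure-aggregation test above expects every step to run even after an earlier one fails, with a single `RuntimeError` raised at the end. A minimal sketch of that control flow follows, assuming a `run_command(cmd=..., check=False)` callable that returns an exit code. `run_steps_sketch` and `fake_run` are hypothetical names, and the message format is only modeled on the string the test asserts.

```python
def run_steps_sketch(steps, run_command):
    """Run every step, collect the ones that fail, and raise once at the end."""
    failures = []
    for step in steps:
        # check=False: a non-zero exit code is returned instead of raised,
        # so later steps still get a chance to run.
        code = run_command(cmd=step, check=False)
        if code != 0:
            failures.append(step)
    if failures:
        raise RuntimeError(f"{len(failures)} pytest runs failed: {failures}")


exit_codes = iter([0, 1, 2])
seen = []


def fake_run(cmd, check):
    seen.append(cmd)
    return next(exit_codes)


try:
    run_steps_sketch(["pytest a", "pytest b", "pytest c"], fake_run)
    message = ""
except RuntimeError as exc:
    message = str(exc)
```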

@ -0,0 +1,143 @@
import os
import tempfile
import unittest
from pathlib import Path
from cli.lib.common.utils import temp_environ, working_directory
class EnvIsolatedTestCase(unittest.TestCase):
"""Base class that snapshots os.environ and CWD for isolation."""
def setUp(self):
import os
import tempfile
self._env_backup = dict(os.environ)
# Snapshot/repair CWD if it's gone
try:
self._cwd_backup = os.getcwd()
except FileNotFoundError:
# If CWD no longer exists, switch to a safe place and record that
self._cwd_backup = tempfile.gettempdir()
os.chdir(self._cwd_backup)
# Create a temporary directory for the test to run in
self._temp_dir = tempfile.mkdtemp()
os.chdir(self._temp_dir)
def tearDown(self):
import os
import shutil
import tempfile
# Restore cwd first (before cleaning up temp dir)
try:
os.chdir(self._cwd_backup)
except OSError:
os.chdir(tempfile.gettempdir())
# Clean up temporary directory
try:
shutil.rmtree(self._temp_dir, ignore_errors=True)
except Exception:
pass # Ignore cleanup errors
# Restore env
to_del = set(os.environ.keys()) - set(self._env_backup.keys())
for k in to_del:
os.environ.pop(k, None)
for k, v in self._env_backup.items():
os.environ[k] = v
class TestTempEnviron(EnvIsolatedTestCase):
def test_sets_and_restores_new_var(self):
var = "TEST_TMP_ENV_NEW"
self.assertNotIn(var, os.environ)
with temp_environ({var: "123"}):
self.assertEqual(os.environ[var], "123")
self.assertNotIn(var, os.environ) # removed after exit
def test_overwrites_and_restores_existing_var(self):
var = "TEST_TMP_ENV_OVERWRITE"
os.environ[var] = "orig"
with temp_environ({var: "override"}):
self.assertEqual(os.environ[var], "override")
self.assertEqual(os.environ[var], "orig") # restored
def test_multiple_vars_and_missing_cleanup(self):
v1, v2 = "TEST_ENV_V1", "TEST_ENV_V2"
os.environ.pop(v1, None)
os.environ[v2] = "keep"
with temp_environ({v1: "a", v2: "b"}):
self.assertEqual(os.environ[v1], "a")
self.assertEqual(os.environ[v2], "b")
self.assertNotIn(v1, os.environ) # newly-added -> removed
self.assertEqual(os.environ[v2], "keep") # pre-existing -> restored
def test_restores_even_on_exception(self):
var = "TEST_TMP_ENV_EXCEPTION"
self.assertNotIn(var, os.environ)
with self.assertRaises(RuntimeError):
with temp_environ({var: "x"}):
self.assertEqual(os.environ[var], "x")
raise RuntimeError("boom")
self.assertNotIn(var, os.environ) # removed after exception
class TestWorkingDirectory(EnvIsolatedTestCase):
def test_changes_and_restores(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
target = Path(td) / "wd"
target.mkdir()
with working_directory(str(target)):
self.assertEqual(Path.cwd().resolve(), target.resolve())
self.assertEqual(Path.cwd(), start)
def test_noop_when_empty_path(self):
start = Path.cwd()
with working_directory(""):
self.assertEqual(Path.cwd(), start)
self.assertEqual(Path.cwd(), start)
def test_restores_on_exception(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
target = Path(td) / "wd_exc"
target.mkdir()
with self.assertRaises(ValueError):
with working_directory(str(target)):
# Normalize both sides to handle /var -> /private/var
self.assertEqual(Path.cwd().resolve(), target.resolve())
raise ValueError("boom")
self.assertEqual(Path.cwd().resolve(), start.resolve())
def test_raises_for_missing_dir(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
missing = Path(td) / "does_not_exist"
with self.assertRaises(FileNotFoundError):
# os.chdir should raise before yielding
with working_directory(str(missing)):
pass
self.assertEqual(Path.cwd(), start)
if __name__ == "__main__":
unittest.main(verbosity=2)
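
The tests above require both helpers to restore state on normal exit and on exceptions, to remove variables that did not exist before, and to treat an empty path as a no-op. A minimal sketch of context managers with those properties is shown below. The names `temp_environ_sketch` and `working_directory_sketch` are hypothetical, and the real `cli.lib.common.utils` implementations may differ.

```python
import os
from contextlib import contextmanager

_MISSING = object()  # sentinel for "variable was not set before"


@contextmanager
def temp_environ_sketch(updates):
    previous = {key: os.environ.get(key, _MISSING) for key in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is _MISSING:
                os.environ.pop(key, None)  # newly added: remove
            else:
                os.environ[key] = old  # pre-existing: restore


@contextmanager
def working_directory_sketch(path):
    if not path:
        yield  # empty path: leave the CWD untouched
        return
    previous = os.getcwd()
    os.chdir(path)  # raises FileNotFoundError before yielding if missing
    try:
        yield
    finally:
        os.chdir(previous)


os.environ.pop("SKETCH_TMP", None)
with temp_environ_sketch({"SKETCH_TMP": "1"}):
    inside_value = os.environ["SKETCH_TMP"]
removed_after = "SKETCH_TMP" not in os.environ

start_dir = os.getcwd()
with working_directory_sketch(""):
    noop_dir = os.getcwd()
```

Putting the restore logic in `finally` is what makes the exception cases in the tests pass: cleanup runs whether the body returns or raises.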

@ -0,0 +1,176 @@
import os
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
import cli.lib.core.vllm.vllm_build as vllm_build
_VLLM_BUILD_MODULE = "cli.lib.core.vllm.vllm_build"
class TestVllmBuildParameters(unittest.TestCase):
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=True)
@patch(
"cli.lib.common.envs_helper.env_path_optional",
side_effect=lambda name, default=None, resolve=True: {
"DOCKERFILE_PATH": Path("/abs/vllm/Dockerfile"),
"TORCH_WHEELS_PATH": Path("/abs/dist"),
"OUTPUT_DIR": Path("/abs/shared"),
}.get(name, Path(default) if default is not None else None),
)
@patch.dict(
os.environ,
{
"USE_TORCH_WHEEL": "1",
"USE_LOCAL_BASE_IMAGE": "1",
"USE_LOCAL_DOCKERFILE": "1",
"BASE_IMAGE": "my/image:tag",
"DOCKERFILE_PATH": "vllm/Dockerfile",
"TORCH_WHEELS_PATH": "dist",
"OUTPUT_DIR": "shared",
},
clear=True,
)
def test_params_success_normalizes_and_validates(
self, mock_env_path, mock_is_path, mock_local_img
):
params = vllm_build.VllmBuildParameters()
self.assertEqual(params.torch_whls_path, Path("/abs/dist"))
self.assertEqual(params.dockerfile_path, Path("/abs/vllm/Dockerfile"))
self.assertEqual(params.output_dir, Path("/abs/shared"))
self.assertEqual(params.base_image, "my/image:tag")
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ, {"USE_TORCH_WHEEL": "1", "TORCH_WHEELS_PATH": "dist"}, clear=True
)
def test_params_missing_torch_whls_raises(self, _is_path):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_local_base_image=False,
use_local_dockerfile=False,
)
err = cm.exception
self.assertIn("TORCH_WHEELS_PATH", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=False)
@patch.dict(
os.environ, {"USE_LOCAL_BASE_IMAGE": "1", "BASE_IMAGE": "img:tag"}, clear=True
)
def test_params_missing_local_base_image_raises(self, _local_img):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_torch_whl=False,
use_local_dockerfile=False,
)
err = cm.exception
self.assertIn("BASE_IMAGE", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ,
{"USE_LOCAL_DOCKERFILE": "1", "DOCKERFILE_PATH": "Dockerfile"},
clear=True,
)
def test_params_missing_dockerfile_raises(self, _is_path):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_torch_whl=False,
use_local_base_image=False,
)
err = cm.exception
self.assertIn("DOCKERFILE_PATH", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ,
{"OUTPUT_DIR": ""},
clear=True,
)
def test_params_missing_output_dir(self, _is_path):
with self.assertRaises(FileNotFoundError):
vllm_build.VllmBuildParameters()
class TestBuildCmdAndRun(unittest.TestCase):
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)
def test_generate_docker_build_cmd_includes_bits(self, _exists):
runner = vllm_build.VllmBuildRunner()
inputs = MagicMock()
inputs.output_dir = Path("/abs/out")
inputs.use_local_base_image = True
inputs.base_image = "img:tag"
inputs.torch_whls_path = Path("./vllm/tmp")
inputs.max_jobs = 64
inputs.cuda_version = "12.8.1"
inputs.python_version = "3.12"
inputs.sccache_bucket = "my-bucket"
inputs.sccache_region = "us-west-2"
inputs.torch_cuda_arch_list = "8.0;9.0"
inputs.target_stage = "export-wheels"
inputs.tag_name = "vllm-wheels"
cmd = runner._generate_docker_build_cmd(inputs)
squashed = " ".join(cmd.split())
self.assertIn("--output type=local,dest=/abs/out", squashed)
self.assertIn("-f docker/Dockerfile.nightly_torch", squashed)
self.assertIn("--pull=false", squashed)
self.assertIn("--build-arg TORCH_WHEELS_PATH=tmp", squashed)
self.assertIn("--build-arg BUILD_BASE_IMAGE=img:tag", squashed)
self.assertIn("--build-arg FINAL_BASE_IMAGE=img:tag", squashed)
self.assertIn("--build-arg max_jobs=64", squashed)
self.assertIn("--build-arg CUDA_VERSION=12.8.1", squashed)
self.assertIn("--build-arg PYTHON_VERSION=3.12", squashed)
self.assertIn("--build-arg USE_SCCACHE=1", squashed)
self.assertIn("--build-arg SCCACHE_BUCKET_NAME=my-bucket", squashed)
self.assertIn("--build-arg SCCACHE_REGION_NAME=us-west-2", squashed)
self.assertIn("--build-arg torch_cuda_arch_list='8.0;9.0'", squashed)
self.assertIn("--target export-wheels", squashed)
self.assertIn("-t vllm-wheels", squashed)
@patch(f"{_VLLM_BUILD_MODULE}.run_command")
@patch(f"{_VLLM_BUILD_MODULE}.ensure_dir_exists")
@patch(f"{_VLLM_BUILD_MODULE}.clone_vllm")
@patch.object(
vllm_build.VllmBuildRunner,
"_generate_docker_build_cmd",
return_value="docker buildx ...",
)
@patch.dict(
os.environ,
{
"USE_TORCH_WHEEL": "0",
"USE_LOCAL_BASE_IMAGE": "0",
"USE_LOCAL_DOCKERFILE": "0",
"OUTPUT_DIR": "shared",
},
clear=True,
)
def test_run_calls_clone_prepare_and_build(
self, mock_gen, mock_clone, mock_ensure, mock_run
):
params = MagicMock()
params.output_dir = Path("shared")
params.use_local_dockerfile = False
params.use_torch_whl = False
with patch(f"{_VLLM_BUILD_MODULE}.VllmBuildParameters", return_value=params):
runner = vllm_build.VllmBuildRunner()
runner.run()
mock_clone.assert_called_once()
mock_ensure.assert_called_once_with(Path("shared"))
mock_gen.assert_called_once_with(params)
mock_run.assert_called_once()
_, kwargs = mock_run.call_args
assert kwargs.get("cwd") == "vllm"

@ -1,7 +1,7 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_CUDA ?= 11.8
DESIRED_CUDA ?= 12.8
DESIRED_CUDA_SHORT = $(subst .,,$(DESIRED_CUDA))
PACKAGE_NAME = magma-cuda
CUDA_ARCH_LIST ?= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
@ -16,15 +16,28 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma/build_magma.sh
.PHONY: all
all: magma-cuda130
all: magma-cuda129
all: magma-cuda128
all: magma-cuda126
all: magma-cuda118
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda130
magma-cuda130: DESIRED_CUDA := 13.0
magma-cuda130: CUDA_ARCH_LIST := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda130:
$(DOCKER_RUN)
.PHONY: magma-cuda129
magma-cuda129: DESIRED_CUDA := 12.9
magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda129:
$(DOCKER_RUN)
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
@ -35,9 +48,3 @@ magma-cuda128:
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37
magma-cuda118:
$(DOCKER_RUN)

@ -28,6 +28,7 @@ pushd ${PACKAGE_DIR}/magma-${MAGMA_VERSION}
patch < ${PACKAGE_FILES}/CMake.patch
patch < ${PACKAGE_FILES}/cmakelists.patch
patch -p0 < ${PACKAGE_FILES}/thread_queue.patch
patch -p1 < ${PACKAGE_FILES}/cuda13.patch
patch -p1 < ${PACKAGE_FILES}/getrf_shfl.patch
patch -p1 < ${PACKAGE_FILES}/getrf_nbparam.patch
# The build.sh script expects to be executed from the sources root folder
@ -37,6 +38,7 @@ popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_FILES}/cuda13.patch ${PACKAGE_RECIPE}/cuda13.patch
cp ${PACKAGE_FILES}/thread_queue.patch ${PACKAGE_RECIPE}/thread_queue.patch
cp ${PACKAGE_FILES}/cmakelists.patch ${PACKAGE_RECIPE}/cmakelists.patch
cp ${PACKAGE_FILES}/getrf_shfl.patch ${PACKAGE_RECIPE}/getrf_shfl.patch

@ -0,0 +1,26 @@
diff --git a/interface_cuda/interface.cpp b/interface_cuda/interface.cpp
index 73fed1b20..e77519bfe 100644
--- a/interface_cuda/interface.cpp
+++ b/interface_cuda/interface.cpp
@@ -438,14 +438,20 @@ magma_print_environment()
cudaDeviceProp prop;
err = cudaGetDeviceProperties( &prop, dev );
check_error( err );
+ #ifdef MAGMA_HAVE_CUDA
+#if CUDA_VERSION < 13000
printf( "%% device %d: %s, %.1f MHz clock, %.1f MiB memory, capability %d.%d\n",
dev,
prop.name,
prop.clockRate / 1000.,
+#else
+ printf( "%% device %d: %s, ??? MHz clock, %.1f MiB memory, capability %d.%d\n",
+ dev,
+ prop.name,
+#endif
prop.totalGlobalMem / (1024.*1024.),
prop.major,
prop.minor );
- #ifdef MAGMA_HAVE_CUDA
int arch = prop.major*100 + prop.minor*10;
if ( arch < MAGMA_CUDA_ARCH_MIN ) {
printf("\n"

@ -5,10 +5,6 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
case "${GPU_ARCH_TYPE:-BLANK}" in
BLANK)
# Legacy behavior for CircleCI
bash "${SCRIPTPATH}/build_cuda.sh"
;;
cuda)
bash "${SCRIPTPATH}/build_cuda.sh"
;;

@ -18,12 +18,10 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
PLATFORM="manylinux2014_x86_64"
PLATFORM=""
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
@ -33,9 +31,11 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# We use the package name to test the package by passing this to 'pip install'
@ -79,8 +79,6 @@ if [[ -e /opt/openssl ]]; then
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
@ -99,6 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
@ -139,28 +138,11 @@ fi
echo "Calling setup.py bdist at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
fi
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages
@ -273,10 +255,6 @@ ls /tmp/$WHEELHOUSE_DIR
mkdir -p "/$WHEELHOUSE_DIR"
mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true
fi
if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
@ -453,16 +431,8 @@ if [[ -z "$BUILD_PYTHONLESS" ]]; then
pushd $PYTORCH_ROOT/test
# Install the wheel for this Python version
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true
fi
pip uninstall -y "$TORCH_PACKAGE_NAME"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
fi
pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
# Print info on the libraries installed in this wheel
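
The `retry` helper at the top of this script re-runs a command after 1, 2, 4 and 8 second pauses using unquoted `$*`. The same idea can be sketched with `"$@"` so that arguments containing spaces survive. `retry_with_backoff` and the `RETRY_DELAYS` variable are hypothetical names used only for illustration; they are not part of the original script.

```bash
#!/usr/bin/env bash
# Sketch of a retry helper with increasing delays.
# RETRY_DELAYS is an assumed knob (not in the original script); it defaults
# to the 1/2/4/8 second schedule used by retry() in the diff.
retry_with_backoff() {
  # Intentional word splitting: the schedule is a space-separated list.
  local delays=(${RETRY_DELAYS:-1 2 4 8})
  local delay
  # First attempt runs immediately.
  "$@" && return 0
  # Each further attempt waits for the next delay in the schedule.
  for delay in "${delays[@]}"; do
    sleep "$delay"
    "$@" && return 0
  done
  # Every attempt failed.
  return 1
}
```

A call such as `retry_with_backoff yum install -q -y zip openssl` would then behave like the original helper, assuming the default schedule.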


@ -15,6 +15,9 @@ export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
export USE_SYSTEM_NCCL=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
@ -36,10 +39,8 @@ if [[ -n "$DESIRED_CUDA" ]]; then
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
# cu126, cu128 etc...
if [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
@ -50,24 +51,26 @@ else
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
#removing sm_50-sm_60 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
#however we would like to keep sm_70 architecture see: https://github.com/pytorch/pytorch/issues/157517
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0"
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
13.0)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
*)
echo "unknown cuda version $CUDA_VERSION"
@ -91,14 +94,15 @@ fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
DEPS_LIST=(
@ -108,33 +112,19 @@ DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
# CUDA_VERSION 12.*, 13.*
if [[ $CUDA_VERSION == 12* || $CUDA_VERSION == 13* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ $CUDA_VERSION == 13* ]]; then
export TORCH_NVCC_FLAGS="$TORCH_NVCC_FLAGS -compress-mode=size"
fi
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
@ -144,13 +134,12 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
"/usr/local/cuda/lib64/libnvshmem_host.so.3"
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
@ -161,124 +150,91 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.12"
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libnvshmem_host.so.3"
"libcufile.so.0"
"libcufile_rdma.so.1"
"libnvperf_host.so"
)
if [[ $USE_CUFILE == 1 ]]; then
# Add libnvToolsExt only if CUDA version is not 12.9
if [[ $CUDA_VERSION == 13* ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
"/usr/local/cuda/lib64/libcublas.so.13"
"/usr/local/cuda/lib64/libcublasLt.so.13"
"/usr/local/cuda/lib64/libcudart.so.13"
"/usr/local/cuda/lib64/libnvrtc.so.13"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13"
"/usr/local/cuda/lib64/libibverbs.so.1"
"/usr/local/cuda/lib64/librdmacm.so.1"
"/usr/local/cuda/lib64/libmlx5.so.1"
"/usr/local/cuda/lib64/libnl-3.so.200"
"/usr/local/cuda/lib64/libnl-route-3.so.200")
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
"libcublas.so.13"
"libcublasLt.so.13"
"libcudart.so.13"
"libnvrtc.so.13"
"libcupti.so.13"
"libibverbs.so.1"
"librdmacm.so.1"
"libmlx5.so.1"
"libnl-3.so.200"
"libnl-route-3.so.200")
export USE_CUPTI_SO=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUFILE=0
else
DEPS_LIST+=(
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12")
DEPS_SONAME+=(
"libnvToolsExt.so.1"
"libcublas.so.12"
"libcublasLt.so.12"
"libcudart.so.12"
"libnvrtc.so.12"
"libcupti.so.12")
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nvshmem/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cusparselt/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
if [[ $CUDA_VERSION == 13* ]]; then
CUDA_RPATHS+=('$ORIGIN/../../nvidia/cu13/lib')
else
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"
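
The `DESIRED_CUDA` handling above accepts either a dotted version (`12.8`) or a five-character tag (`cu126`, `cu128`) and derives `CUDA_VERSION` from it. A minimal sketch of that mapping follows. It uses a regular expression instead of the length check in the script, and the function name `cuda_version_from_tag` is an assumption made for this example.

```bash
#!/usr/bin/env bash
# Sketch: turn a DESIRED_CUDA-style value into a dotted CUDA version.
# cuda_version_from_tag is a hypothetical helper, not part of the script.
cuda_version_from_tag() {
  local tag=$1
  if [[ $tag =~ ^[0-9]+\.[0-9]+$ ]]; then
    # Already dotted, e.g. "12.8": pass it through unchanged.
    echo "$tag"
  elif [[ $tag =~ ^cu([0-9]{2})([0-9])$ ]]; then
    # Five-character tag, e.g. "cu126": two-digit major, one-digit minor.
    echo "${BASH_REMATCH[1]}.${BASH_REMATCH[2]}"
  else
    echo "unsupported CUDA tag: $tag" >&2
    return 1
  fi
}
```

Rejecting anything else explicitly mirrors the removal of the four-character `cu90`/`cu92` branch in this hunk: older tags no longer resolve to a version.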


@ -22,9 +22,7 @@ retry () {
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
@ -35,6 +33,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
@ -91,6 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
@ -102,7 +104,7 @@ if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling setup.py install at $(date)"
echo "Calling 'python -m pip install .' at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
@ -118,7 +120,7 @@ fi
# TODO: Remove this flag once https://github.com/pytorch/pytorch/issues/55952 is closed
CFLAGS='-Wno-deprecated-declarations' \
BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
python setup.py install
python -m pip install --no-build-isolation -v .
mkdir -p libtorch/{lib,bin,include,share}


@ -95,6 +95,7 @@ ROCM_SO_FILES=(
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhipsparselt.so"
"libhiprtc.so"
)
@ -186,20 +187,28 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; separated arch list to bar for grep
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; seperated arch list to bar for grep
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
HIPBLASLT_ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
HIPBLASLT_OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($HIPBLASLT_ARCH_SPECIFIC_FILES $HIPBLASLT_OTHER_FILES)
# hipsparselt library files
HIPSPARSELT_LIB_SRC=$ROCM_HOME/lib/hipsparselt/library
HIPSPARSELT_LIB_DST=lib/hipsparselt/library
HIPSPARSELT_ARCH_SPECIFIC_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -E $ARCH)
#HIPSPARSELT_OTHER_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -v gfx)
HIPSPARSELT_LIB_FILES=($HIPSPARSELT_ARCH_SPECIFIC_FILES $HIPSPARSELT_OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
@ -234,12 +243,14 @@ DEPS_SONAME=(
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)
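
The rocBLAS, hipBLASLt and hipSPARSELt blocks above all select library files the same way: the `;`-separated `PYTORCH_ROCM_ARCH` list becomes a `grep -E` alternation, and files without `gfx` in their name are kept for every architecture. That selection can be sketched as one function. `select_arch_files` is a hypothetical name, and the file names in the usage note are invented for illustration.

```bash
#!/usr/bin/env bash
# Sketch: list the files in a library directory that match the requested
# GPU architectures, followed by the architecture-independent files.
# select_arch_files is an assumed helper, not part of the original script.
select_arch_files() {
  local dir=$1 arch_list=$2
  # "gfx90a;gfx942" -> "gfx90a|gfx942", usable as a grep -E alternation.
  local pattern=${arch_list//;/|}
  # Architecture-specific files; no match is not treated as an error.
  ls "$dir" | grep -E "$pattern" || true
  # Files that do not mention any gfx architecture at all.
  ls "$dir" | grep -v gfx || true
}
```

With a directory holding `Kernels_gfx90a.co`, `Kernels_gfx1100.co` and `TensileManifest.txt`, calling `select_arch_files "$dir" "gfx90a;gfx942"` would list the first and last file only.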


@ -25,6 +25,7 @@ source /opt/intel/oneapi/mpi/latest/env/vars.sh
export USE_STATIC_MKL=1
export USE_ONEMKL=1
export USE_XCCL=1
export USE_MPI=0
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"


@ -19,7 +19,7 @@ git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# TODO: This can be removed later once vision is also part of the Docker image
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
pip install -q --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
# JIT C++ extensions require ninja, so put it into PATH.
export PATH="/var/lib/jenkins/.local/bin:$PATH"
# NB: ONNX test is fast (~15m) so it's ok to retry it few more times to avoid any flaky issue, we


@ -1,34 +0,0 @@
#!/usr/bin/env bash
# DO NOT ADD 'set -x' not to reveal CircleCI secret context environment variables
set -eu -o pipefail
# This script uses linux host toolchain + mobile build options in order to
# build & test mobile libtorch without having to setup Android/iOS
# toolchain/simulator.
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
# Install torch & torchvision - used to download & trace test model.
# Ideally we should use the libtorch built on the PR so that backward
# incompatible changes won't break this script - but it will significantly slow
# down mobile CI jobs.
# Here we install nightly instead of stable so that we have an option to
# temporarily skip mobile CI jobs on BC-breaking PRs until they are in nightly.
retry pip install --pre torch torchvision \
-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \
--progress-bar off
# Run end-to-end process of building mobile library, linking into the predictor
# binary, and running forward pass with a real model.
if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-static* ]]; then
TEST_CUSTOM_BUILD_STATIC=1 test/mobile/custom_build/build.sh
elif [[ "$BUILD_ENVIRONMENT" == *-mobile-lightweight-dispatch* ]]; then
test/mobile/lightweight_dispatch/build.sh
else
TEST_DEFAULT_BUILD=1 test/mobile/custom_build/build.sh
fi
print_sccache_stats


@ -11,10 +11,6 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"
fi
echo "Python version:"
python --version
@ -27,6 +23,12 @@ cmake --version
echo "Environment variables:"
env
# The sccache wrapped version of nvcc gets put in /opt/cache/lib in docker since
# there are some issues if it is always wrapped, so we need to add it to PATH
# during CI builds.
# https://github.com/pytorch/pytorch/blob/0b6c0898e6c352c8ea93daec854e704b41485375/.ci/docker/common/install_cache.sh#L97
export PATH="/opt/cache/lib:$PATH"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
@ -48,15 +50,6 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE
fi
# Enable LLVM dependency for TensorExpr testing
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
export BUILD_EXECUTORCH=ON
export USE_CUDA=0
fi
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
@ -99,6 +92,27 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then
if [[ -f /opt/riscv-cross-env/bin/activate ]]; then
# shellcheck disable=SC1091
source /opt/riscv-cross-env/bin/activate
else
echo "Activation file not found"
exit 1
fi
export CMAKE_CROSSCOMPILING=TRUE
export CMAKE_SYSTEM_NAME=Linux
export CMAKE_SYSTEM_PROCESSOR=riscv64
export USE_CUDA=0
export USE_MKLDNN=0
export SLEEF_TARGET_EXEC_USE_QEMU=ON
sudo chown -R jenkins /var/lib/jenkins/workspace /opt
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
POSSIBLE_JAVA_HOMES=()
POSSIBLE_JAVA_HOMES+=(/usr/local)
@ -124,26 +138,8 @@ if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
fi
# Use special scripts for Android builds
if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
export ANDROID_NDK=/opt/ndk
build_args=()
if [[ "${BUILD_ENVIRONMENT}" == *-arm-v7a* ]]; then
build_args+=("-DANDROID_ABI=armeabi-v7a")
elif [[ "${BUILD_ENVIRONMENT}" == *-arm-v8a* ]]; then
build_args+=("-DANDROID_ABI=arm64-v8a")
elif [[ "${BUILD_ENVIRONMENT}" == *-x86_32* ]]; then
build_args+=("-DANDROID_ABI=x86")
elif [[ "${BUILD_ENVIRONMENT}" == *-x86_64* ]]; then
build_args+=("-DANDROID_ABI=x86_64")
fi
if [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then
build_args+=("-DUSE_VULKAN=ON")
fi
build_args+=("-DUSE_LITE_INTERPRETER_PROFILER=OFF")
exec ./scripts/build_android.sh "${build_args[@]}" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" != *android* && "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
export USE_VULKAN=1
# shellcheck disable=SC1091
source /var/lib/jenkins/vulkansdk/setup-env.sh
@ -177,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Enable XCCL build
export USE_XCCL=1
export USE_MPI=0
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
@ -198,10 +195,16 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && echo "${TORCH_CUDA_ARCH_LIST}" | tr ' ' '\n' | sed 's/$/>= 8.0/' | bc | grep -q 1; then
J=2 # default to 2 jobs
case "$RUNNER" in
linux.12xlarge.memory|linux.24xlarge.memory)
J=24
;;
esac
echo "Building FlashAttention with job limit $J"
export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j ${J}"
fi
if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
@ -216,7 +219,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export USE_ASAN=1
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then
@ -227,7 +229,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
if [[ "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
@ -237,7 +239,7 @@ fi
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -257,6 +259,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e -o pipefail
get_bazel
python3 tools/optional_submodules.py checkout_eigen
# Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
# the runner
@ -281,32 +284,38 @@ else
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *xla* && "$BUILD_ENVIRONMENT" != *riscv64* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install numpy==2.0.2
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
python3 tools/packaging/split_wheel.py bdist_wheel
else
WERROR=1 python setup.py bdist_wheel
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "USE_SPLIT_BUILD cannot be used with xla or rocm"
exit 1
else
python setup.py bdist_wheel
fi
python setup.py bdist_wheel
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *vision* ]]; then
install_torchvision
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *audio* ]]; then
install_torchaudio
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchrec* || "${BUILD_ADDITIONAL_PACKAGES:-}" == *fbgemm* ]]; then
install_torchrec_and_fbgemm
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchao* ]]; then
install_torchao
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/
@ -394,10 +403,8 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.
@ -414,7 +421,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
fi
# don't do this for bazel or s390x as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
# don't do this for bazel or s390x or riscv64 as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
print_sccache_stats
fi
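
The FlashAttention guard in this script pipes each entry of `TORCH_CUDA_ARCH_LIST` through `bc` to test whether any architecture is 8.0 or newer. The same check can be sketched in plain bash, which also tolerates `;` separators and `+PTX` suffixes. `has_arch_at_least` is a hypothetical helper, and comparing only the major version is an assumption that holds for a threshold of the form `N.0`.

```bash
#!/usr/bin/env bash
# Sketch: succeed if any entry of a CUDA arch list has major version >= N.
# has_arch_at_least is an assumed helper, not part of the original script.
has_arch_at_least() {
  local arch_list=$1 min_major=$2
  local entry major
  # Accept both space- and semicolon-separated lists.
  for entry in ${arch_list//;/ }; do
    # Keep the part before the first dot: "12.0+PTX" -> "12", "8.6" -> "8".
    major=${entry%%.*}
    if [[ $major =~ ^[0-9]+$ ]] && (( major >= min_major )); then
      return 0
    fi
  done
  return 1
}
```

A guard such as `has_arch_at_least "${TORCH_CUDA_ARCH_LIST}" 8` could then decide whether to apply the reduced FlashAttention job limit.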


@ -300,24 +300,3 @@ except RuntimeError as e:
exit 1
fi
fi
###############################################################################
# Check for C++ ABI compatibility to GCC-11 - GCC 13
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
# gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19
# gcc 11 - CUDA 11.8, xpu, rocm
# gcc 13 - CUDA 12.6, 12.8 and cpu
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'cu118' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"
fi
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"
popd
fi


@ -13,6 +13,13 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
fi
if which sccache > /dev/null; then
# Clear SCCACHE_BUCKET and SCCACHE_REGION if they are empty, otherwise
# sccache will complain about invalid bucket configuration
if [[ -z "${SCCACHE_BUCKET:-}" ]]; then
unset SCCACHE_BUCKET
unset SCCACHE_REGION
fi
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true
rm -f ~/sccache_error.log || true


@ -15,6 +15,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
export PYTORCH_TEST_WITH_ROCM=1
fi
# TODO: Renable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# TODO: Reenable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0


@ -78,6 +78,34 @@ function pip_install_whl() {
fi
}
function pip_build_and_install() {
local build_target=$1
local wheel_dir=$2
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
python3 -m pip wheel \
--no-build-isolation \
--no-deps \
--no-use-pep517 \
-w "${wheel_dir}" \
"${build_target}"
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
}
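
`pip_build_and_install` above builds a wheel only when the target directory does not already contain one, so a wheel produced by an earlier step is reused. The caching decision on its own can be sketched as follows. `has_wheel` and `build_wheel_once` are hypothetical names, and the build command is passed in by the caller rather than hard-coded to `pip wheel`.

```bash
#!/usr/bin/env bash
# Sketch: run a build command only if no wheel is cached yet.
# has_wheel and build_wheel_once are assumed helpers for illustration.
has_wheel() {
  local wheel_dir=$1 file
  # An unmatched glob stays literal, so test each candidate with -f.
  for file in "${wheel_dir}"/*.whl; do
    [[ -f "$file" ]] && return 0
  done
  return 1
}

build_wheel_once() {
  local wheel_dir=$1
  shift
  if ! has_wheel "$wheel_dir"; then
    mkdir -p "$wheel_dir"
    # Remaining arguments are the build command supplied by the caller.
    "$@"
  fi
}
```

Calling `build_wheel_once dist/vision <build command>` twice would run the build command at most once, provided the first run leaves a `.whl` file in `dist/vision`.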
function pip_install() {
# retry 3 times
@ -121,17 +149,23 @@ function get_pinned_commit() {
cat .github/ci_commit_pins/"${1}".txt
}
function detect_cuda_arch() {
if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
if command -v nvidia-smi; then
TORCH_CUDA_ARCH_LIST=$(nvidia-smi --query-gpu=compute_cap --format=csv | tail -n 1)
elif [[ "${TEST_CONFIG}" == *nogpu* ]]; then
# There won't be nvidia-smi in nogpu tests, so just set TORCH_CUDA_ARCH_LIST to the default
# minimum supported value here
TORCH_CUDA_ARCH_LIST=8.0
fi
export TORCH_CUDA_ARCH_LIST
fi
}
function install_torchaudio() {
local commit
commit=$(get_pinned_commit audio)
if [[ "$1" == "cuda" ]]; then
# TODO: This is better to be passed as a parameter from _linux-test workflow
# so that it can be consistent with what is set in build
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
else
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
fi
pip_build_and_install "git+https://github.com/pytorch/audio.git@${commit}" dist/audio
}
function install_torchtext() {
@ -139,8 +173,8 @@ function install_torchtext() {
local text_commit
data_commit=$(get_pinned_commit data)
text_commit=$(get_pinned_commit text)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/data.git@${data_commit}"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${text_commit}"
pip_build_and_install "git+https://github.com/pytorch/data.git@${data_commit}" dist/data
pip_build_and_install "git+https://github.com/pytorch/text.git@${text_commit}" dist/text
}
function install_torchvision() {
@ -153,17 +187,19 @@ function install_torchvision() {
echo 'char* dlerror(void) { return "";}'|gcc -fpic -shared -o "${HOME}/dlerror.so" -x c -
LD_PRELOAD=${orig_preload}:${HOME}/dlerror.so
fi
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}"
if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
# Not sure if both are needed, but why not
export FORCE_CUDA=1
export WITH_CUDA=1
fi
pip_build_and_install "git+https://github.com/pytorch/vision.git@${commit}" dist/vision
if [ -n "${LD_PRELOAD}" ]; then
LD_PRELOAD=${orig_preload}
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)
@ -178,25 +214,71 @@ function install_torchrec_and_fbgemm() {
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_uninstall fbgemm-gpu-nightly
# If ROCM_HOME isn't set, fall back to ROCM_PATH if set, otherwise /opt/rocm
ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"
# Find the rocm_version.h header file to extract the ROCm version
rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"
if [ ! -f "$rocm_version_h" ]; then
rocm_version_h="${ROCM_HOME}/include/rocm_version.h"
fi
# Error out if rocm_version.h not found
if [ ! -f "$rocm_version_h" ]; then
echo "Error: rocm_version.h not found in expected locations." >&2
exit 1
fi
# Extract major, minor and patch ROCm version numbers
MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')
MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')
PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')
ROCM_INT=$((MAJOR_VERSION * 10000 + MINOR_VERSION * 100 + PATCH_VERSION))
echo "ROCm version: $ROCM_INT"
export BUILD_ROCM_VERSION="$MAJOR_VERSION.$MINOR_VERSION"
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
local wheel_dir=dist/fbgemm_gpu
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}" --recurse-submodules
python setup.py bdist_wheel \
--build-variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
# Save the wheel before cleaning up
mkdir -p dist/fbgemm_gpu
cp fbgemm/fbgemm_gpu/dist/*.whl dist/fbgemm_gpu
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu
fi
}
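
The ROCm version extraction in install_torchrec_and_fbgemm can be tried in isolation. This is a sketch under stated assumptions: the header contents below are a made-up sample (a real rocm_version.h may contain more than these three lines), and the temporary file stands in for the header located under ROCM_HOME.

```shell
#!/usr/bin/env bash
# Assumed sample header written to a temporary file for illustration only.
rocm_version_h=$(mktemp)
cat > "${rocm_version_h}" <<'EOF'
#define ROCM_VERSION_MAJOR 6
#define ROCM_VERSION_MINOR 4
#define ROCM_VERSION_PATCH 1
EOF

# Same grep/awk extraction as in the diff: the third whitespace-separated
# field of each matching #define line is the number.
MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "${rocm_version_h}" | awk '{print $3}')
MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "${rocm_version_h}" | awk '{print $3}')
PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "${rocm_version_h}" | awk '{print $3}')

# Pack the three parts into one integer so versions compare numerically.
ROCM_INT=$((MAJOR_VERSION * 10000 + MINOR_VERSION * 100 + PATCH_VERSION))
BUILD_ROCM_VERSION="${MAJOR_VERSION}.${MINOR_VERSION}"

echo "ROCm version: ${ROCM_INT}"
echo "BUILD_ROCM_VERSION=${BUILD_ROCM_VERSION}"
rm -f "${rocm_version_h}"
```

With the sample values, the packed integer is 6 * 10000 + 4 * 100 + 1 = 60401 and BUILD_ROCM_VERSION is 6.4.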
@ -212,34 +294,10 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/ao.git@${commit}"
pip_build_and_install "git+https://github.com/pytorch/ao.git@${commit}" dist/ao
}
function print_sccache_stats() {


@ -1,123 +0,0 @@
from datetime import datetime, timedelta, timezone
from tempfile import mkdtemp
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
temp_dir = mkdtemp()
print(temp_dir)
def genrsa(path):
key = rsa.generate_private_key(
public_exponent=65537,
key_size=2048,
)
with open(path, "wb") as f:
f.write(
key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption(),
)
)
return key
def create_cert(path, C, ST, L, O, key):
subject = issuer = x509.Name(
[
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc) + timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
critical=True,
)
.sign(key, hashes.SHA256())
)
# Write our certificate out to disk.
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
return cert
def create_req(path, C, ST, L, O, key):
csr = (
x509.CertificateSigningRequestBuilder()
.subject_name(
x509.Name(
[
# Provide various details about who we are.
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
)
.sign(key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(csr.public_bytes(serialization.Encoding.PEM))
return csr
def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
cert = (
x509.CertificateBuilder()
.subject_name(csr_cert.subject)
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc) + timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
return cert
ca_key = genrsa(temp_dir + "/ca.key")
ca_cert = create_cert(
temp_dir + "/ca.pem",
"US",
"New York",
"New York",
"Gloo Certificate Authority",
ca_key,
)
pkey = genrsa(temp_dir + "/pkey.key")
csr = create_req(
temp_dir + "/csr.csr",
"US",
"California",
"San Francisco",
"Gloo Testing Company",
pkey,
)
cert = sign_certificate_request(temp_dir + "/cert.pem", csr, ca_cert, ca_key)


@ -35,12 +35,11 @@ fi
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distributed` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
# NB: we always build with distributed; USE_DISTRIBUTED=0 turns off all
# backends (specifically the gloo backend), so test that this case works too
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats


@ -20,14 +20,4 @@ print_cmake_info() {
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath invalidates the cmake signature, so sign it again here
# to trust the executable. Otherwise it fails with EXC_BAD_ACCESS
# (SIGKILL (Code Signature Invalid)) and exit code 137
codesign -f -s - "${CMAKE_EXEC}" || true
}


@ -5,11 +5,6 @@ set -x
# shellcheck source=./macos-common.sh
source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh"
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
@ -18,9 +13,13 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
python -mpip install -r requirements.txt
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
python -mpip install --no-input -r requirements.txt
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -162,16 +161,45 @@ test_jit_hooks() {
assert_git_not_dirty
}
# Shellcheck doesn't like it when you pass no arguments to a function
# that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2120
checkout_install_torchbench() {
local commit
commit=$(cat .ci/docker/ci_commit_pins/torchbench.txt)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
popd
pip install -r .ci/docker/ci_commit_pins/huggingface-requirements.txt
# https://github.com/pytorch/pytorch/issues/160689 to remove torchao because
# its current version 0.12.0 doesn't work with transformers 4.54.0
pip uninstall -y torchao
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
}
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
brew install jpeg-turbo libpng
pushd torchvision
git fetch
git checkout "$(cat ../.github/ci_commit_pins/vision.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
python -m pip install -e . -v --no-build-isolation
popd
pushd torchaudio
@ -179,17 +207,15 @@ torchbench_setup_macos() {
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
# TODO: Remove this once we figure out how to make TorchAudio find the brew-installed OpenMP
USE_OPENMP=0 python -m pip install -e . -v --no-build-isolation
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2119,SC2120
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
pip_benchmark_deps() {
python -mpip install --no-input requests cython scikit-learn six
}
@ -197,7 +223,7 @@ test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -224,7 +250,7 @@ test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
@ -232,55 +258,95 @@ test_torchbench_smoketest() {
mkdir -p "$TEST_REPORTS_DIR"
local device=mps
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor)
local hf_models=(GoogleFnet YituTechConvBert Speech2Text2ForCausalLM)
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
for backend in eager inductor; do
for dtype in notset float16 bfloat16; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
for model in "${hf_models[@]}"; do
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
for dtype in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
if [ "$dtype" == notset ]; then
for dtype_ in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype_}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv"
local dtype_arg="--${dtype_}"
if [ "$dtype_" == notset ]; then
dtype_arg="--float32"
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv" || true
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv" || true
done
done
done
fi
done
echo "Pytorch benchmark on mps device completed"
}
test_aoti_torchbench_smoketest() {
print_cmake_info
echo "Launching AOTInductor torchbench setup"
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
echo "Launching torchbench inference performance run for AOT Inductor and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
done
echo "Launching HuggingFace inference performance run for AOT Inductor and dtype ${dtype}"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
echo "Pytorch benchmark on mps device completed"
}
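
The smoketest functions pick a dtype by indexing an array with their first argument, which the dispatch code passes as SHARD_NUMBER. A minimal sketch of that lookup follows; `dtype_arg_for_shard` is a hypothetical helper, and the reading that index 0 is a placeholder because shard numbers start at 1 is an assumption rather than something the diff states.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the array lookup and the notset -> --float32
# mapping shown in the diff.
dtype_arg_for_shard() {
  local dtypes=(undefined float16 bfloat16 notset)
  local dtype=${dtypes[$1]}
  local dtype_arg="--${dtype}"
  if [ "${dtype}" == notset ]; then
    dtype_arg="--float32"
  fi
  echo "${dtype_arg}"
}

for shard in 1 2 3; do
  echo "shard ${shard} -> $(dtype_arg_for_shard "${shard}")"
done
```

Under this mapping, shard 1 runs float16, shard 2 runs bfloat16, and shard 3 runs the default precision passed as `--float32`.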
@ -289,7 +355,7 @@ test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
@ -305,7 +371,7 @@ test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"
@ -317,8 +383,6 @@ test_timm_perf() {
echo "timm benchmark on mps device completed"
}
install_tlparse
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
@ -330,7 +394,9 @@ elif [[ $TEST_CONFIG == *"perf_hf"* ]]; then
elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
test_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"aot_inductor_perf_smoketest"* ]]; then
test_aoti_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
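
One thing worth checking in the dispatch chain above: `[[ ... == *"perf_smoketest"* ]]` is a substring match, so a TEST_CONFIG value that literally contains `aot_inductor_perf_smoketest` would also match the earlier `perf_smoketest` branch. The sketch below reproduces only those two branches; both dispatcher functions are hypothetical, and the reordered variant is an assumption about one possible fix, not something the diff does.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher reproducing the branch order shown in the diff.
dispatch_in_diff_order() {
  local TEST_CONFIG=$1
  if [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
    echo "test_torchbench_smoketest"
  elif [[ $TEST_CONFIG == *"aot_inductor_perf_smoketest"* ]]; then
    echo "test_aoti_torchbench_smoketest"
  else
    echo "other"
  fi
}

# Assumed alternative: test the more specific pattern first.
dispatch_specific_first() {
  local TEST_CONFIG=$1
  if [[ $TEST_CONFIG == *"aot_inductor_perf_smoketest"* ]]; then
    echo "test_aoti_torchbench_smoketest"
  elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
    echo "test_torchbench_smoketest"
  else
    echo "other"
  fi
}

dispatch_in_diff_order "aot_inductor_perf_smoketest"
dispatch_specific_first "aot_inductor_perf_smoketest"
```

In the first ordering the AOTInductor branch is never reached for that value. Whether this matters in CI depends on the TEST_CONFIG names the workflows actually use, which this diff does not show.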

Some files were not shown because too many files have changed in this diff.