Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is expected by the underlying `target` method. THhis triggered a bug with large K shapes because the decompose k is `SubgraphChoiceCaller`.
Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
Rollback Plan:
Differential Revision: D82181924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Summary:
We add the parsing for list of string. This is needed for AOTInductor
profiling for input information of Triton kernels.
Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
As per comment in source code:
```
# If we are are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
# (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
# quantization. Peristent reduction forces rblock == rnumel, if the bounds between lower
# and upper are large, for the lower values we will be masking off large % of read/writes,
# when we could expand the coalescing xblock instead.
```
For the test case in question, this pr improves perf from 0.8573521325143717 -> 0.043151492193814305 because we were egregiously masking out rblock values (58/64 values).
Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.
Here is how i think the whole dynamo + frame construction logic work:
1) There is no way to create a frame object in python land as this is created in runtime from cpython. So that's why aot_compile creates FrameInfo this way. (kind of like simulating the runtime) i guess you could write your own very simple eval_frame.c where you can interject the frame construction but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals which contain both local args and closure variables (this is implementation details of cpython frame construction). So thats why closure variables end up getting LocalSource names as can be shown in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to local object self. Important thing to keep in mind here is this self is not actually model self, but the outer self.
3) When we switch to wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all local and closures for this inner module as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)) Here we are not looking into inner frame's f_locals but just directly look at closures. I guess this is because we are one more frame up so there is no access to frame f_locals at this point. And it is probably not good idea to change dynamo's logic here. As a result, i get following error message that is different from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"
My initial attempt of solving this was taking inner closures and put them to f_locals for the frame i am constructing which turned out too compilcated because we needed to muck around bytecode instructions as well. So i am thinking we should just update the test to reflect new names and follow up with better post-processing step to have better names.
Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
Introduces a variant of size-hint multi-kernel, where for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, selects one of many kernels pre-generated from multi-kernel hints, based off similarity b/w hint / runtime input & output shapes (L1 distance in log2 space).
Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does 1-d search over specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels - based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to reasonable n-d search (via user API?) is an extension
Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />
Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
Fixes#160547
### Summary:
bug
```
def test_namedtuple(self):
from collections import namedtuple
Point = namedtuple('Point', 'x y')
class M(torch.nn.Module):
def forward(self, x, y):
return x + y
inp = Point(torch.ones(3), torch.ones(3))
print(M()(*inp))
# errors
ep = torch.export.export(M(), inp, strict=False)
print(ep)
# succeeds
ep = torch.export.export(M(), inp, strict=True)
print(ep)
# workaround could be to convert namedtuple to a kwarg
inp_kwargs = {field: getattr(inp, field) for field in inp._fields}
ep = torch.export.export(M(), (), inp_kwargs)
print(ep)
```
FIx :
namedtuple is subclass of tuple
but namedtuple is not expected
So, this change handles named tuple case
I have added 🧪 test case for this as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi
Co-authored-by: Angela Yi <angelayi@meta.com>
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
Fixes part of #163314
In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**
This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Due to them losing shape, even though the mask_mod remains "interpretable", the kernel’s stride math then reads wrong offsets. Due to this we get silent numerical mismatches compared to regular SDPA, especially when single position decoding/H broadcasting.
The B=None, H=None works case is accidental: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1` and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn’t move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides which causes silent error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
Differential Revision: D82933509
over the weekend I realized that some of the cache implementation was a bit silly, and too constrained to be actually generic. for example, InMemoryCache[str, bytes] was odd since we'd probably want to be able to store more than just str keys with bytes values. so tldr; everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. this makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.
with this change, there were also some improvements made to the testing; mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later
I also hardened some things here and there, for example we now use literal_eval (forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible)
putting this PR up now for feedback, in the process of generalizing the code I did remove the documentation since it was becoming outdated but I will add that back in after the PR is green
I have the next PR ready as well (implements a fresh cache context manager), will export once this lands
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).
Since I installed my `gcc` through `apt install build-essential`, and I think that's a common way to install `gcc` for a few developers? I believe it's necessary to slightly modify the command to add `-I ./` to explicitly indicate the header file search path.
## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
This PR removes import tricks of `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They are declared to be removed in PyTorch 2.1 but not.
Before change:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
works
After change:
there is an import error exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
## Context
An example from Qwen2-7B
- This come from running torch.compile with a sequence length that is
divisible by 8 (no padding needed). Call this `Run1`.
- If we then run the compiled model with a difference length that isn't
divisible by 8 (requires padding). Call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.
### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)
## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
Summary:
Under circumstances it seems reasonable to return a callable directly without guard check when user use aot_compile on a function with single compilation result.
When having multiple entries (aot_compile_module), we should start enabling guard check to differetiate different compiled functions apart.
Test Plan: CI
Differential Revision: D82904540
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm
We are seeing crashes of the form
```
Traceback (most recent call last):
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1487, in run
while self.step():
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1348, in step
self.dispatch_table[inst.opcode](self, inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2437, in LOAD_ATTR
self._load_attr(inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2425, in _load_attr
result = BuiltinVariable(getattr).call_function(
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 1347, in call_function
return handler(tx, args, kwargs)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <lambda>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <listcomp>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 72, in realize
self._cache.realize()
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 33, in realize
self.vt = builder.VariableBuilder(tx, self.source)(self.value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 445, in __call__
vt = self._wrap(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 1043, in _wrap
torch._dynamo.utils.store_user_object_weakref(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/utils.py", line 4694, in store_user_object_weakref
user_obj_id_to_weakref[obj_id] = weakref.ref(obj)
torch._dynamo.exc.InternalTorchDynamoError: TypeError: cannot create weak reference to 'torch.Event' object
```
This pull request makes us gracefully graph break, vs explicitly crashing.
I've added a test which reproduces the issue. There is a side discussion re:
how did torch.Event support ever work here, since it appears you cannot take a
weakref to a torch.Event
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163168
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
Summary:
otherwise, may hit
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
Test Plan: UTs
Reviewed By: yushangdi
Differential Revision: D82974062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Before
Moving module.to("cuda:0") under fake tensor mode would have parameter on `meta` device.
After
parameters would be on "cuda:0" .
Test Plan: buck2 run fbcode//caffe2/test:fake_tensor -- --r test_move_module
Reviewed By: mikaylagawarecki
Differential Revision: D80102876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163433
Approved by: https://github.com/albanD
This change restricts the DLPack stride normalization to apply only to 1D tensors of shape (1,).
### Rationale
The previous implementation normalized the strides for any multi-dimensional tensor containing a dimension of size 1. While well-intentioned, this "over-normalization" discards critical memory layout information, causing issues for downstream consumers who rely on strides to infer alignment and contiguity.
For example:
* A row-major tensor with `shape=(1, 128)` and `stride=(128, 1)` would be incorrectly normalized to `stride=(1, 1)`.
* A column-major tensor with `shape=(1024, 1)` and `stride=(1, 1024)` would also be normalized to `stride=(1, 1)`.
This loss of stride information makes it impossible for consumers to detect the original memory layout (e.g., row-major vs. column-major) and breaks assumptions about memory alignment needed for optimized indexing or specialized hardware APIs like GPU TMA.
The original intent of the normalization was to handle the simple case of a 1D tensor with shape=(1,) and a non-standard stride. This fix reverts to that specific, non-problematic behavior, ensuring that multi-dimensional tensors retain their precise stride information during DLPack export.
### Related Issues
#163274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163282
Approved by: https://github.com/eqy
A big pain point ppl have with custom ops is that they do not accept arbitrary input/outputs. In this PR we create the concept of an "OpaqueObject" which allows users to pass arbitrary python objects into custom operators.
Some still slightly annoying parts with this implementation:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of whatever python type
- `@torch.library.custom_op` doesn't work.. yet?
UX:
```python
from torch._library.opaque_object import make_opaque, get_payload
# your custom python class
class OpaqueQueue:
def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
super().__init__()
self.queue = queue
self.init_tensor_ = init_tensor_
def push(self, tensor: torch.Tensor) -> None:
self.queue.append(tensor)
def pop(self) -> torch.Tensor:
if len(self.queue) > 0:
return self.queue.pop(0)
return self.init_tensor_
def size(self) -> int:
return len(self.queue)
queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)
# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)
# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertTrue(queue.size(), 1)
```
Authoring a custom op:
```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")
torch.library.define(
f"_TestOpaqueObject::queue_push",
"(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
tags=torch.Tag.pt2_compliant_tag,
lib=lib,
)
@torch.library.impl(f"{libname}::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
# We can get the payload directly by get_payload(q)
queue = get_payload(q)
assert isinstance(queue, OpaqueQueue)
queue.push(b)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored in an 8-byte aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of C++ language standard so it's essentially undefined behavior.
This change eliminates that assumption by combining refcount_ and weakcount_ in a single 64-bit count and we use the lower 32 bits for refcount_ and upper 32 bits for the weakcount_.
In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
Reviewed By: yfeldblum
Differential Revision: D82869192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
all details are in readme.md
Note: one thing i want to do soonest is to switch to graph representation instead of stack representation
for the fuzzed ops should make things easier as things get more complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163417
Approved by: https://github.com/bobrenjc93
fixes#159855, was not triggered in other tests since it took
more than one round of fusion to get to the problematic code
which prunes WeakDeps. The WeakDeps are important to inhibit
fusion of kernels that read/write data into mutated buffers
with different indexing.
We modify the code to a) always prune before fusion, rather
than after, which improves its coverage and makes our basic
vertical fusion tests surface this issue as well and b)
check whether the weak dep is fusable before eliminating it
(which basically means checking that the producing code and
the consuming code are sufficiently compatible).
The tests that trigger this with change (a) is:
test_fusing_write_into_disjoint_read introduced in #118210.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162316
Approved by: https://github.com/eellison, https://github.com/mlazos, https://github.com/shunting314
### Issue
The previous `enable_triton` UI requires the user-defined Triton kernel have a "nvshmem" in its name.
If users did not do so, the kernel would miss the NVSHMEM init, and silently hit CUDA IMA.
The `@require_nvshmem` decorator eliminates the above name requirement (and the `enable_triton` call).
### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
...
foo[(1, 1)](...)
```
It also remove the need of passing `extern_lib` to `foo` (handled by the decorator now).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
No need for unnecessary copy of std::vectors. This Tensor list is copied throughout the foreach paths and this code is on a hot path for torch optimizers. Auto move elision will not happen on the return statement since it's a subelement of a vector that needs to be copied out before the std::vector is dtor'd. This should reduce quite a few list copies along this path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163416
Approved by: https://github.com/ezyang
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and we see if a special FCD implementation needs to be dispatch to first, and if there is nothing we fallback to the standard level strategy.
Because there is no longer C binding equivalent of classes, we've condensed _C.Dim and Dim together, and similar for Tensor. This resulted in some bugs as the Python API is sometimes different from the C API. I've attempted to disambiguate these but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful as Dim must abide by Tensor equality semantics, but is pointer equality in C (DimEntry doesn't have this problem). Another difference between C/Python that is subtle is we no longer get implicit conversions from Dim to DimEntry, this also caused some bugs.
Much of the mechanical porting work was done by claude code. I have a separate PR that deletes functorch._C, but it was useful having dim.cpp to point claude at it so I haven't done it in this PR. From a reviewing perspective, I need to re-review that I didn't forget to port anything, some noticeably missing "small" things are patched_dim_method. I am still in progress of carefully doing a side-by-side review of ports; "simplifications" from claude code were also a major source of bugs.
There are two major feature gaps in the implementation:
- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy, just need to do it. However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
Summary:
Today `fullgraph_capture` takes a frame, but clients usually take a callable (`nn.Module`, function, or method) and example inputs (args and kwargs) and then explicitly set up the frame to pass. This is boilerplate—and potentially tricky to get right—that can be hidden inside the API.
The original `fullgraph_capture` now becomes `_fullgraph_capture_frame`.
Test Plan:
existing tests
Rollback Plan:
Differential Revision: D82339400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162849
Approved by: https://github.com/zhxchen17
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") call.
This diff supports this.
Notice that .to("cuda") doesn't work yet, as it enquery current device idx by calling cuda API.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
buck2 run fbcode//caffe2/test:fake_tensor -- --r test_fake_gpu_no_init
Rollback Plan:
Differential Revision: D80101283
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160431
Approved by: https://github.com/henryoier, https://github.com/ezyang
Currently OutputGraphGuardsState is separated out as a serializable interface for OutputGraph, but some of the typing around it is incorrect in dynamo's guards.py and output_graph.py: more fields are used by code than claimed by OutputGraphGuardsState, and it works because either the full OutputGraph is passed in or the parts that use those fields are dead when OutputGraphGuardsState is passed in.
In this PR we try to further separate the necessary fields of OutputGraph that should be retained by a full graph capture mechanism, not just limited to dynamo (as it is currently) but also something like make_fx (in the future). Since these fields do not need to be serialized, the result is an intermediate "common" data structure that is between OutputGraphGuardsState and OutputGraph in the inheritance hierarchy.
Differential Revision: D81718791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162211
Approved by: https://github.com/zhxchen17
Summary: We observe a case then the fwd graph has duplicated return nodes, which will lead to errors due to fx renaming the node, thus we add poi info into the node name.
Test Plan:
### unit test
```
CUDA_VISIBLE_DEVICES=3 buck2 test mode/opt -m ovr_config//triton:beta -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 //caffe2/test/functorch:test_aotdispatch -- test_quantize_activation_duplicate_nodes
```
Buck UI: https://www.internalfb.com/buck2/de5eccc6-4064-4214-843d-70b8e3829afe
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4503599937670844
Network: Up: 217KiB Down: 72KiB (reSessionID-73e5c269-4f4d-4a54-896a-79c077eea326)
Executing actions. Remaining 0/2 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 45.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
before
f798417700
after
Differential Revision: D82844100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163364
Approved by: https://github.com/Yuzhen11
# why
- extra kwargs are input/op dependent and not config dependent. We don't
plan to serialize/deserialize them, and so they need to be fed in
later beore making the KTC, rather than when getting the config values
directly
# what
- move extra_kwargs into the KTC and get_ktc interface directly
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "_addmm"
```
Differential Revision: [D82871310](https://our.internmc.facebook.com/intern/diff/D82871310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163209
Approved by: https://github.com/nmacchioni
ghstack dependencies: #163305
# why
- this is not directly controlled by the config arg but rather by the
input and by the inductor wide setting
- it's always the same for every choice
- we want the config kwargs to be *programable* and this is not
programable in that sense but rather needs to use inductor config
# what
- move generating the ALLOW_TF32 kwarg in Triton templates into
get_extra_kwargs
# testing
with some annotations, this is now the kwargs and extra_kwargs on addmm
```
{'EVEN_K': True, 'USE_FAST_ACCUM': False, 'ACC_TYPE': 'tl.float32', 'num_stages': 1, 'num_warps': 2, 'BLOCK_M': 32, 'BLOCK_N': 32, 'BLOCK_K': 16, 'hint_override': None, 'GROUP_M': 8} # choice/config kwargs
{'ALLOW_TF32': True, 'epilogue_fn': <function addmm_epilogue.<locals>.epilogue at 0x7f64d54ff600>, 'epilogue_fn_hash': "['addmm_epilogue', torch.float32, 1, 1]", 'prefix_args': 1} # extra kwargs
```
they're both passed onto the template
Differential Revision: [D82871312](https://our.internmc.facebook.com/intern/diff/D82871312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163305
Approved by: https://github.com/nmacchioni
Fixes#158631
The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
# Feature
Support `torch.cond` in the FX converter. The generated FX IR is conceptually indentical to what would come from `torch.export`:
- Submodules as stored as attributes, and accessed via `getattr`.
- The conditional is represented as `torch.ops.higher_order.cond`, which takes in the subgraphs, a predicate and submodule inputs.
# Implementation overview
The FX backend generates code for subgraphs using the following steps:
1. When `codegen_conditional` is called in `WrapperFxCodegen`, we emit a `ConditionalLine`.
a. We also codegen the true/false subgraphs at this time, storing their subgms for later.
2. At the beginning of FX conversion, generate `get_attr` nodes accessing each subgraph. It's important to do this at the start, before registering the node metadata hook. This also matches the convention followed by torch.export.
3. When we see the `ConditionalLine` in the FX converter, we generate a corresponding `torch.ops.higher_order.cond`.
# Implementation details
This ended up being a substantial change, as wrapper codegen has some special logic for subgraphs.
Certain methods of `PythonWrapperCodegen` are overridden by `SubgraphPythonWrapperCodegen`. To apply these overrides, we use multiple inheritance with the registered subclass of `WrapperFxCodegen`.
Unlike most other wrapper codegen methods, which map 1:1 to Wrapper IR lines, subgraph codegen generates a number of wrapper lines including `EnterSubgraphLine` and `ExitSubgraphLine`, along with Python or C++ code calling the subgraph as a function. These lines are used for some backends' memory planning.
In contrast, FX IR typically represents a subgraph call as a single HOP node, or a `call_module` op. To account for this difference, this PR introduces a new wrapper IR line called `ConditionalLine`, which is only used by the FX backend. We override the `codegen_conditional` method to emit this line. This sidesteps having to port the existing subgraph codegen and associated memory planning to Wrapper IR. (In principle, it seems possible to adapt the existing backends to `ConditionalLine`, but it could be a larger refactor, since we'd also have to update the memory planning.)
Some of the lower-level subgraph codegen methods are still shared between the FX and Python backends, such as `generate_subgraph_common`. Those were easier to port to Wrapper IR.
This also required generalizing the way the FX converter handles graph inputs and outputs. Previously, it assumed the IO signature was the same as `V.graph.module`, but this is only true for the parent graph, and not subgraphs. Instead, we need to call `get_graph_inputs` and `get_graph_outputs` to populate the inputs and outputs for subgraphs.
# Test plan
This PR adds a couple of tests using torch.cond. Here's an example graph generated by one of them:
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
%false_graph_0 : [num_users=1] = get_attr[target=false_graph_0]
%cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (%arg0_1, %true_graph_0, %false_graph_0, (%arg1_1,)), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%cond, 0), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 6, constant_args_idx: 6, grid: [(1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 6, XBLOCK: 8}})
return buf1
```
It also removes an existing negative test which checked that a certain error was raised when subgraphs were encountered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163234
Approved by: https://github.com/angelayi, https://github.com/jansel
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.
This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.
The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch.dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().
Saving single entries and such with DynamoCache still works normally.
Test Plan:
All existing unit tests pass.
Rollback Plan:
Differential Revision: D82380307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
What supposed to be a very simple change end up being quite involved, as current Windows CI framework is quite inflexible, i.e. it takes a lots of argument, but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Fixes#161014
This bug fix introduces a fix that is consistent with the exception handling. Outlined in issue #161014, there is an edge case where the negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >=0` to include the zero dim and letting the operator return an empty tensor.
In the PR I have added the edge case where the test will now check the negative padding where the dimension gets reduced to zero. But the sample is only for the `constant` type of padding. I would like some feedback if it is necessary to put the same sample on the `reduce` type as well.
This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!
@malfet @manuelcandales @janeyx99 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:
a1f7639922/test/inductor/test_loop_ordering.py (L612-L641)
Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.
In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```
UPDATE: add the same optimization for LoopBody.reorder_iter_loops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162101
Approved by: https://github.com/jansel, https://github.com/eellison
The issue cannot be reproduced using the original repro code provided in the issue description.
However, the underlying issue mentioned by the maintainer (missing functions in `builder.py` and `trace_rules.py`) was never addressed and can still be reproduced with this test case:
```python
import torch
from torch.nn.attention import _cur_sdpa_kernel_backends
@torch.compile(fullgraph=True)
def test_function_that_triggers_error():
return _cur_sdpa_kernel_backends()
print("Calling torch.compile function...")
try:
result = test_function_that_triggers_error()
print(f"Success: {result}")
except Exception as e:
print(f"ERROR: {e}")
print(f"Error type: {type(e)}")
```
The original repro likely no longer triggers the issue due to code path changes in the SDPA implementation, while the direct call to `_cur_sdpa_kernel_backends()` exposes the underlying problem where certain torch._C functions returning non-Tensor values aren't properly handled by dynamo tracing.
I have implemented the changes by adding the missing functions to both `builder.py` and `trace_rules.py` to properly handle these cases during compilation.
@guilhermeleobas
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161169
Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi
Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keeps reference to memory allocations.
Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just re-arrange of the tests to create a sub-group test class. (no substantial change)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
Fixes#161010 by making `clone_meta` match the semantics of strides for eager mode.
This is:
* Case 1: Tensor is_non_overlapping_and_dense; in this case, stride should match input tensor stride
* Case 2: Otherwise, stride should be contiguous computed from input tensor using `compute_elementwise_output_strides`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan
Co-authored-by: morrison-turnansky <mturnans@redhat.com>
Summary:
X-link: https://github.com/meta-pytorch/tritonbench/pull/432
Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.
This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.
Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation.
In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.
Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.
Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```
Rollback Plan:
Differential Revision: D82597111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163147
Approved by: https://github.com/njriasan
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf0[0]
buf4 = buf0[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
"mylib::op1",
mutates_args=["x"],
schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
x = x + 1
return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that are auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of Fallback Kernel.
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, scaling function in Blas.cpp along with test_matmul_cuda changes triggered failures.
This patch corrects is_blockwise_1x32_scaling function code.
2. Fixes the m, n, k dimensions for ROCm mx case.
3. Modify FP4E2M1FN_LARGEST_POW2 (largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This resulted in higher SQNR value for mx fp4 test.
Testing result on gfx950 w/ ROCm7.0
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v Ran 452 tests in 22.698s
OK passed 111
This is same as before. (when PR 151360 was merged)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. Changing if TEST_CUDA to an else condition to avoid running symmetric memory UTs on MI200. For other non-rocm arch, it should return true and can be skipped using other skip decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163205
Approved by: https://github.com/ezyang
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: Platform args was a buck1 concept that we decided to port over to buck2 in order to make the migration easier. However, platforms args existing in the repo blocks some buck modernization like modefile free efforts, so we're trying to get rid of the usage.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82470032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163086
Approved by: https://github.com/malfet, https://github.com/8Keep
Update the torch-xpu-ops commit to 24fab67b6e, includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- support high priority stream for xccl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244
Approved by: https://github.com/EikanWang
Summary:
This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized.
Here's an example of what that JSON looks like:
```
{
"artifacts": {
"precompile_aot_autograd": [
"__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
],
"precompile_dynamo": [
{
"backend_ids": [
"__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
],
"fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
"num_codes": "10",
"python_version": "3.12.11+meta",
"torch_version": "2.10.0a0+fb"
}
]
},
"num_entries": 1
}
```
Test Plan:
Existing tests pass.
NanoGPT tlparse showing the new debug:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.
Rollback Plan:
Differential Revision: D82232574
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162740
Approved by: https://github.com/zhxchen17
What supposed to be a very simple change end up being quite involved, as current Windows CI framework is quite inflexible, i.e. it takes a lots of argument, but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Introduced `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
Summary:
Point people lowering to lite interpreter to the existence of ExecuTorch.
Added the typing deprecation, a warnings deprecation
Test Plan: Try using it, see deprecation warning
Reviewed By: lucylq
Differential Revision: D82759566
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163289
Approved by: https://github.com/larryliu0820
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.
Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be casted to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3)) # same as f(3)
```
Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"]
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.
Differential Revision: D81698719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162194
Approved by: https://github.com/bobrenjc93
The test used a wrong ptr to refer to remote address:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local address as input and perform translation internally. Without correct signal address, the peer would be waiting, thus hang.
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Differential Revision: D80884518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161848
Approved by: https://github.com/jerryzh168
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets to the targets must be manually defined. See CMakeLists.txt the main libraries torch_python, torch, torch_cpu, torch_cuda, torch_xpu have been targetted which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
These decompositions take precedence before CIA decomps in fake tensor prop, as a result, we would hit this implementation for all where overloads which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this doesn't matter because in post-dispatch IR, aten.where would have decomposed but when user tries to preserve aten.where this issue will surface because fake tensor will start seeing aten.where.
Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
In the .dist-info/METADATA file, the version was not being written with the new sha.
On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked that it will match both files and directories on py3.9 and py3.13
There's probably also a bunch of mismatches in RECORD but thats a problem for later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163214
Approved by: https://github.com/huydhn
resolves https://github.com/pytorch/torchtitan/issues/1136
torchtitan use cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are padded already
```
# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```
this PR make `reset_sharded_param` idempotent by checking storage data ptr and return early
unit test
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163130
Approved by: https://github.com/tianyu-l
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. Verify replicate correctly handles post-optimizer events.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
Reland of #160532
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode. User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call. This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
Summary: Updates the remaining tests to all ensure val_shapes is always passed a tuple and not a list. Enables adding an assert consistent with the other function arguments.
Test Plan:
Depends on CI.
Rollback Plan:
Differential Revision: D82383319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162887
Approved by: https://github.com/NikhilAPatel
# why
- enable a clear interface for kernel templates to declare all their
instantiation parameters and any potential defaults
- simplify KernelTemplateChoice to just have a single params, and not kwargs and extra_kwargs
# what
- KernelTemplateParams interface
- placeholder implementation where we just pass through a dict
# testing
- existing ci tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162781
Approved by: https://github.com/jansel
Summary:
In edge cases, tracer_output can be left unset if there's double exception raised which causes the following issue:
```
UnboundLocalError: local variable 'tracer_output' referenced before assignment
```
Default initialize this variable so that it's always present.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82652815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163169
Approved by: https://github.com/tugsbayasgalan
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
Summary:
I am really skeptical about inductor sizevars creating an empty shape env when not provided with one
i think we should fail there if the graph has dynamic shapes and no shape env is provided.
however i wonder if there are actually use cases that depends on the shape env not being there?
Reasoning APIs depends on facts in the shape env. and assumes some stuff exists for specific symbols.
Test Plan:
Fix the bug reported in creating simple e2e unit test is not trivial
https://www.internalfb.com/diff/D82337184
Rollback Plan:
Differential Revision: D82412384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162927
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
**Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
- Now checks that all Cache types are thread-safe
- Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
- InMemoryCache was the only one not thread-safe, so added a lock for access
- Realized that to implement MultiCache we should just have this requirement.
* Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class
Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```
Rollback Plan:
Differential Revision: D82660240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163173
Approved by: https://github.com/masnesral
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, and import feature for enabling mixed precision,
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162636
Approved by: https://github.com/mori360
ghstack dependencies: #162631
We previously asked users to seperate these because we didn't have any way of adding extern C declarations. Now we don't and we don't need this confusing flag anymore
BC breaking but is fine for this API since it doesn't have major users yet. Please just put your all your code in `kernel_source` moving forward
## BC note
The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter.
Note this only affects torch.cuda._compile_kernel, which is a private API.
Example:
Before
```python
kernel = compile_kernel(
kernel_source="global void my_kernel() { ... }",
kernel_name="my_kernel",
header_code="#define SCALE 2.0f\n__device_ float scale(float x) { return x * SCALE; }"
)
```
After
```python
kernel_source = """
#define SCALE 2.0f
device float scale(float x) { return x * SCALE; }
global void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks now, so we write a decomposition for it.
it's not registered to the default decomposition table as currently only executorch/unified runtime needs it. But it might benefit inductor as well because conv2d can generate triton kernels while there's no triton codegen for conv1d. I don't know if the conv2d triton kernel will have better perf compared to aten::conv1d, so it's not registered by default yet.
To register it, one just needs to do `import torch._decomp as decomp;decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163080
Approved by: https://github.com/angelayi
Fixes#163105
Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
"""Load the scheduler's state.
Args:
state_dict (dict): scheduler state. Should be an object returned
from a call to :meth:`state_dict`.
"""
self.__dict__.update(state_dict)
self._set_anneal_func(self._anneal_strategy)
```
If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
"""Load the scheduler's state.
Args:
state_dict (dict): scheduler state. Should be an object returned
from a call to :meth:`state_dict`.
"""
anneal_func = state_dict.pop("anneal_func", None)
strategy = state_dict.get("_anneal_strategy")
self.__dict__.update(state_dict)
if anneal_func is not None:
state_dict["anneal_func"] = anneal_func
if strategy is None:
if anneal_func == self._linear_anneal:
strategy = "linear"
elif anneal_func == self._cosine_anneal:
strategy = "cos"
if strategy is None:
strategy = getattr(self, "_anneal_strategy", "cos")
self._set_anneal_func(strategy)
```
But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163122
Approved by: https://github.com/janeyx99
# why
- KTC might regenerate a choicecaller e.g. through FlexibleLayout
optimization. This in turn would delete any annotations
# what
- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations
- ChoiceCaller users e.g. in selectalgorithm now have access to the KTC
and can register handlers do record/make decisions based on the KTC
# testing
n/a
Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163117
Approved by: https://github.com/nmacchioni
Prevents edge cases in SequentialLR and ReduceLROnPlateau which could corrupt learning rates or trigger recompilation.
Supersedes #162360Fixes#162359Fixes#163093
While putting #162360 together, I noticed the class of issue I was fixing (i.e. unintended aliasing in lr_schedulers when using Tensor lrs) appeared in several other places. @janeyx99 suggested I put together a follow-up PR.
There are several bugs resembling the one fixed in #162360. I added a helper to fix these:
```python
def _update_param_group_val(param_group: dict[str, Any], key: str, val: float | Tensor):
"""Set param_group[key] to val without aliasing or assignment when they're both tensors.
Raises a KeyError if param_group[key] does not exist.
"""
if isinstance(param_group[key], Tensor):
param_group[key].fill_(_to_scalar(val))
else:
param_group[key] = val
```
And applied it to fix bugs in `SequentialLR.__init__` and `LRScheduler._update_lr`. I also added it to `CyclicLR.__init__` which was using an equivalent pattern, and `CosineAnnealingWarmRestarts.step` which *should* have had a similar issue:
```python
for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
param_group["lr"] = lr
```
But did not, because `get_lr()` actually returns tensors when using a tensor lr (despite its `list[float]` return type annotation). Relying on this propagation seems fragile, so I conservatively added the method here as well. I'll be fixing the type annotations and several related issues in followup PRs built off of this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163098
Approved by: https://github.com/janeyx99
Summary:
What: Enables CUDA support for int8_mm woq optimization pattern by:
- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Reviewed By: jerryzh168
Differential Revision: D80882442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161680
Approved by: https://github.com/jerryzh168
Because these specs are cached by reference. So by reusing them and mutating them, we're overwriting the cached specs of another op. I'm just fixing these 2, there are more instances, we'll need to do an audit separately.
This fixes a few opinfo tests, but side note that `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_multi_head_attention_forward_cpu_float32` fails for me locally even on the base commit, but it is not marked as xfail
NOTE: I am renaming `_wrap_output_spec_tensor_meta` so that external libraries will loudly fail. You should migrate to the functional `_create_output_spec_with_new_tensor_meta` or create your own mutation wrapper and take responsibility for the cache! This should be improved in https://github.com/pytorch/pytorch/issues/162731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162702
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161458
When generating the triton template for convolution, we check `V.graph.sizevars.statically_known_equals(in_chan * groups, x.get_size()[1]) `. Note that in this check, we should consider the groups.
This check verifies, at compile time, that the total number of input channels expected by the convolution weights (in_chan * groups) exactly matches the number of channels in the input tensor (x.get_size()[1]).
This fix is good in general as it allows for conv triton template to be generated when `groups> 1`. It's also required for unified runtime to use AOTI as a backend delegate, because unified runtime is libtorch-free, so we cannot use the ATEN fallback of conv2d.
```
python test/inductor/test_select_algorithm.py -k test_convolution2_group
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163094
Approved by: https://github.com/SherlockNoMad
Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device to avoid this error:
```
what(): Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```
2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now.
Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change run is successful
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```
Rollback Plan:
Differential Revision: D80713052
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162842
Approved by: https://github.com/henryoier
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backend, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Reland of #160532
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
# Problem
When dynamic shapes are passed to AOTInductor, they usually have a very basic form like `(s0, 5, 27)`. In these cases it's straightforward to generate code defining the symbol `s0` as a specific dimension of the input tensor. However, AOTI can handle slightly more generic expressions than this, such as `(2 * s0, 5, 27)`. In these cases, we don't immediately know the value of `s0`, but we need to solve for it, since it may be referenced in other parts of the program such as kernel call arguments, launch grids, etc.
# Feature
This PR adds support for more generic dynamic input expressions in the FX backend, following the implementation already present in AOTI's C++ backend:
1. Check if the expression contains *one* undefined symbol, as multiple variables would make the equation underdetermined. Let's call this `s0`. (We could potentially generalize this, but this PR focuses on cases AOTI can already handle.)
2. Generate a new symbol for the relevant size or stride of the input tensor. Let's call this `size`. This is computed with FX nodes just as a normal symbol would be.
3. Use sympy to solve for `s0` in terms of `size`. Let's call the resulting expression `solution`.
4. Since we know `s0` is an integer, `solution == floor(solution)`. Take the floor and then convert division to `FloorDiv`. This is required to trace through the expression, since the return value of regular division is not guaranteed to be an integer.
5. Generate FX for the modified `solution`, which defines the value `s0`.
6. Override the relevant method of `PythonWrapperCodegen` to a no-op, since the FX converter handles the above on its own.
# Test plan
In addition to the existing dynamic shapes tests, this PR adds new test cases where the input shape contains a non-trivial expression. This dynamic input dimension is then multiplied by other dimensions to form the argument to a `reshape`.
Here's an example graph from one of the CI tests. In this case, the input expression was `2*x + 1`, and the solution is `x = (sym_size_int - 1) / 2`:
```
graph():
%arg0_1 : [num_users=2] = placeholder[target=arg0_1]
%sym_size_int : [num_users=1] = call_function[target=torch.ops.aten.sym_size.int](args = (%arg0_1, 0), kwargs = {})
%sym_sum : [num_users=1] = call_function[target=torch.sym_sum](args = ([-1, %sym_size_int],), kwargs = {})
%floordiv : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum, 2), kwargs = {})
%mul : [num_users=2] = call_function[target=operator.mul](args = (8, %floordiv), kwargs = {})
%sym_sum_1 : [num_users=2] = call_function[target=torch.sym_sum](args = ([4, %mul],), kwargs = {})
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([%sym_sum_1], [1]), kwargs = {dtype: torch.float32, device: cuda:0})
%sym_sum_2 : [num_users=1] = call_function[target=torch.sym_sum](args = ([35, %mul],), kwargs = {})
%floordiv_1 : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum_2, 32), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(%floordiv_1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, out_ptr0: %buf0, xnumel: %sym_sum_1, XBLOCK: 32}})
return buf0
```
The `sym_size_int` node returns the first dimension of the input tensor. Next, `floordiv` computes the input symbol in terms of the input size. Then, the launch grid is computed by `floordiv_1`, the kernel argument by `sym_sum_1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163044
Approved by: https://github.com/jansel
Note that this works only in a limited case, where you *don't* change the seed, but change only the offset of the philox generator. This captures the main use case we're interested in: Rewinding the RNG to a previous state. This is done by torch.utils.checkpoint.checkpoint in particular.
Calls to increase() change only the offset, not the seed. Thus, we allow for "no-op" calls to set_seed where the new seed is the same as the old seed. If a user does happen to try to change the seed during stream capture, they will receive an error.
Fixes#162504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162505
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/eellison, https://github.com/eee4017, https://github.com/cyyever
Summary:
Internally, we are building PyTorch on the compat layer.
Need to avoid compiling sve's box-cox, as sve is not marked as build target.
Rollback Plan:
Reviewed By: rraometa, YifanYuan3
Differential Revision:
D82544412
Privacy Context Container: L1208939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163078
Approved by: https://github.com/Skylion007, https://github.com/malfet
**Summary**
Tests parameter state management after forward and backward passes for single and multiple replicate groups
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_forward
2. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162631
Approved by: https://github.com/mori360
# why
- enable ChoiceCaller generation to provide extra information that
feedback_saver_fns (functions registered to run at the bench of
benchmarking) can use afterwards
- users that extend ChoiceCaller creation e.g. by creating their own
InductorChoices can use this to shuttle through information
# what
- add an annotations dictionary to ChoiceCaller class
# testing
n/a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162672
Approved by: https://github.com/nmacchioni
PyTorch tensor iteration (.view, contiguous, broadcasting) and NumPy array indexing all follow lexicographic (row-major) order. In Lexicographic (lex) on (i0, i1, …, i{k-1}): the leftmost index(stride is larger) changes fastest and the rightmost index changes slowest and usually last dim is contiguous.
However original pycute is all based on co-lex, after porting their code into pytorch and some cosmetic change, we now make it lex so that we can use it for use cases like device mesh internal bookkeeping and other stuff as well.
Changes included in this PR:
1. We changes all API ported in, included prefix_product(stride inferring and rename it to suffix_product), idx2crd, crd2idx, coalesce, composition, complement, right_inverse and left_inverse to make sure they are working in the lex way.
2. Added more unit test cases for some API mentioned above since existing unit tests do not have full coverage.
3. One bug fix inside composition, which will lead to infinite recursive call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162690
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534, #162414
Entering a device context takes 30 us and exiting a device context takes 11 us. If all graph partitions and cudagraph-unsafe ops happen on the same device, we can share the device context.
## Trace
Use vLLM as an example. The first trace shows dynamo graph partition.
<img width="1338" height="453" alt="image" src="https://github.com/user-attachments/assets/b81815fd-cdcb-4024-846a-5b64164f8bac" />
The second trace shows inductor graph partition prior to this PR.
<img width="1331" height="270" alt="image" src="https://github.com/user-attachments/assets/8d98b127-2053-4eae-9a31-5491661f14d8" />
Comparing with fx graph partition, we can see inductor graph partition shows extra overhead from enter/exit device contexts (13+6 us -> 30+11 us), but smaller runtime overhead (13 us -> 7 us). This motivates the PR to share default device context.
The third trace shows Inductor graph partition after this PR. We observe that the extra overhead from enter/exit device contexts have been fixed. At the same time, we observe the smaller runtime overhead.
<img width="1336" height="276" alt="image" src="https://github.com/user-attachments/assets/77be2237-34dd-4bac-ad9c-d9af3be36417" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162873
Approved by: https://github.com/shunting314
Summary:
The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update.
This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.
After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction).
Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI
Rollback Plan:
Differential Revision: D78074709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158017
Approved by: https://github.com/janeyx99
The test `test_binary_ufuncs.py::TestBinaryUfuncsCUDA::test_cuda_tensor_pow_scalar_tensor_cuda` fails with a mismatched `dtype`:
```Python
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```
This PR forces both arguments to use the same `dtype` to fix the test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163070
Approved by: https://github.com/eqy
Summary:
Add some metadata to CompileArtifacts, so that it contains the source code information about the original code while they are being traced.
For now, we will not provide a verification method to end user and instead we just provide which files are inlined. It's up to user to verify the content from these files are not changed (because it's optional for many users to validate source code changes anyway in aot precompile)
Test Plan:
buck run @mode/opt test/dynamo:test_dynamo -- -k test_file_change
buck run @mode/opt test/dynamo:test_dynamo -- -k test_aot_compile_source_info
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162983
Approved by: https://github.com/yushangdi
This test sets "NCCL_ALGO=NVLS" in NcclUserBufferRegistrationTest which affects tests run in the same process such as `test_on_completion_hook_*` that fail with
> invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.26.2
> ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
> Last error:
> Error : no algorithm/protocol available for function Broadcast with datatype ncclInt8. NCCL_ALGO was set to NVLS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163063
Approved by: https://github.com/ezyang
**Summary:** This test verifies that the replicate function automatically moves forward pass inputs to the correct device.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162629
Approved by: https://github.com/mori360
**Summary:** In its current state, FSDP collectives uses cuda synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have created edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. I have also added a test command.
**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162021
Approved by: https://github.com/mori360
After the UT suite moved to `MultiProcContinuousTest`, `skipIfRocm` decorator started failing rather than skipping UTs because now we spawn multiple threads before the skip decorator is taken into account and the skip decorator was raising an exception to exit the process. But, the parent process treated the child process exiting as a crash rather than a skip. Additionally, in `MultiProcContinuousTest`, if one UT fails all subsequent ones are also skipped which makes sense since there's one setup for the entire suite. However, this showed up as many failing/skipped UTs in the parity.
I added multiprocess version of skip decorators for ROCm, including, `skip_if_rocm_arch_multiprocess` and
`skip_if_rocm_ver_lessthan_multiprocess`. These are needed as symmetric memory feature is only supported on MI300 onwards and we need to skip them for other archs and some UTs only work after ROCm7.0.
Fixes#161249Fixes#161187Fixes#161078Fixes#160989Fixes#160881Fixes#160768Fixes#160716Fixes#160665Fixes#160621Fixes#160549Fixes#160506Fixes#160445Fixes#160347Fixes#160203Fixes#160177Fixes#160049Fixes#159921Fixes#159764Fixes#159643Fixes#159499Fixes#159397Fixes#159396Fixes#159347Fixes#159067Fixes#159066Fixes#158916Fixes#158760Fixes#158759Fixes#158422Fixes#158138Fixes#158136Fixes#158135
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162811
Approved by: https://github.com/jeffdaily
This PR refactors AOTAutograd slightly:
- It adds `simple_wraps` to various wrappers so that the reference to inner functions is stored in the output of AOTAutograd.
- It saves a `serialize()` method on the result of `aot_stage2`, in the event of an eager backward compile.
I discussed the lazy backward case with @bdhirsh, and we agreed that serialization in that case would probably use a different, more AOT API anyway, so we do not implement a serialize function for the lazy backward case. AOT precompile, at least initially, will always eagerly compile the backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162527
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162171
# Feature
This PR supports lowering `IndexPutFallback` through Inductor's FX converter. The approach is very similar to the one taken in https://github.com/pytorch/pytorch/pull/162686.
Compared to `ScatterFallback`, this required one additional change: the value of `self.op_overload` for `IndexPutFallback` was inaccurate. Previously, it used `aten.index_put`, which would result in unsound FX IR. The existing Python/C++ codegen use `aten.index_put_`, since the fallback mutates its input. This PR changes `self.op_overload` to match that.
# Test plan
Added a CI test lowering deterministic index put via the FX converter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162863
Approved by: https://github.com/angelayi
Reported by compute-sanitizer, otherwise it looks like `block_y_reduce` and `block_x_reduce` both use `shared_memory` for temporaries without synchronization between them
reproduces in e.g.,
`compute-sanitizer --tool=racecheck python test/test_matmul_cuda.py -k test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_128_rhs_block_1_cuda` (note that this test requires H100 to run unless only the non-emulated (cuBLAS impl.) is commented out)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162995
Approved by: https://github.com/msaroufim
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@d8c3ee](d8c3eefc29), includes:
- Optimize adaptive average pool for channel-last memory format
- Add unregister wait_tensor
- Replace deprecated `[[intel::reqd_sub_group_size(SgSize)]]` with `[[sycl::reqd_sub_group_size(SIMD)]]` and remove unnecessary attributes
- Revert "Roll back to original usage of sycl::get_kernel_bundle"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162804
Approved by: https://github.com/EikanWang
Summary:
This change adds a new environment variable (`TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION`) and configuration in `torch._inductor.config` which can be set to `"1"` to allow a user to disable triton's device detection logic in [torch/utils/_triton.py:has_triton()](c9e57d7e9f/torch/utils/_triton.py (L128)). This function is used at import scope in several places but the function has a side effect of initializing the mtia device if it is available which is causing some of our autotuning workflows to crash.
Worth noting that when enabled this configuration disables all device detection not just mtia and this is because the logic in has_triton will initialize the mtia device as a side effect even when checking for a cuda or other device via the [get_interface_for_device()](c9e57d7e9f/torch/_dynamo/device_interface.py (L570)) function.
I've tagged it `topic: not user facing` since I don't anticipate any outside of meta users making use of this, however this is my first PR here, so please indicate if it should be handled differently.
Test Plan: This has been tested in the context of internal workflows.
Differential Revision: D82347853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162974
Approved by: https://github.com/xmfan
Summary:
Mismatch tail is used as a fixed variable and there are cases that there are more than 10 mismatches FR gives up producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff added the mismatch tail in the parsed args so make this configuarble.
Also tho the variable name is `mismatch_tail`(last 10) it is used as `mismatch_head` (the first 10). Updated it to be `num_mismatch_to_print`
Test Plan:
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err`
Confirm no error and output 20 mismatches.
Rollback Plan:
Differential Revision: D82335995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991
Approved by: https://github.com/fduwjj
Summary:
Implemented caching abstractions: `Cache` and `AsyncCache`.
`Cache` provides an abstraction for defining simple key -> value stores with get and put functionality. We propose using `Cache` for implementations with very low (microseconds) overhead, for example an in-memory cache.
`AsyncCache` provides an abstraction for defining simple key -> value stores with asynchronous get and put functionality. We propose using `AsyncCache` for implementations with medium to high (> millisecond) overhead, for example an on-disk cache.
We provide an initial extension of `Cache` in the form of `InMemoryCache`. `InMemoryCache` provides fast, in-memory caching that can be later used to memoize more expensive cache accesses. `InMemoryCache` also provides a custom constructor `InMemoryCache.from_env_var` that can be used to pre-populate the in-memory cache, which will be helpful for enabling determinism in the future.
We also provides extensions of `AsyncCache`. `OnDiskCache` subclasses `AsyncCache` and serves as a generic on-disk caching implementation with atomic, write-once guarantees. `OnDiskCache` is semi-generic, allowing subclassing to alter the output directory. `InductorOnDiskCache` subclasses `OnDiskCache` to create an Inductor-specific on-disk cache that outputs to Inductor's default caching directory.
Test Plan:
`Cache` Tests:
1. Get -> Set -> Get
- Checks that `get(key)` returns `None` when `key` is not cached, and that after calling `put(key, value)` subsequent `get(key)` calls return `value`
2. Set -> Set
- Checks that with duplicated `set(key, value)` calls only the initial call is successful
3. From env var
- Checks that constructing an `InMemoryCache` from an environment variable works.
`AsyncCache` Tests:
1. Get -> Set -> Get
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
2. Set -> Set
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
3. Set -> Set Concurrent
- Checks that of two concurrent `set(key, value)` operations, only one passes
```
cd ~/fbsource/fbcode && buck test mode/opt //caffe2/test/inductor:pcache
```
{F1981926248}
Rollback Plan:
Differential Revision: D82269762
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162777
Approved by: https://github.com/masnesral, https://github.com/aorenste
#162978 identified an issue that distributed test failures were wrongly muted.
Per discussion with @malfet, one solution is to disable rerun of distributed tests in `run_test.py`.
The PR makes use of the `is_distributed_test` flag to identify those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163025
Approved by: https://github.com/malfet
Summary:
Cleaning up checkpoint background process can currently block trainer thread indefinitely if the process is hanging (notably due to Gloo pg init timeout).
This diff adds a 5s grace period for normal termination and sends SIGTERM if unable to shut down in that period.
Rollback Plan:
Differential Revision: D82268979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
Note this could change suggest_memory_format behaviour for unbacked
we used to return True for are_strides_like_channels_last sometimes even when results undecided
now when its not decided we return False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162354
Approved by: https://github.com/aorenste
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.
We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.
I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).
# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.
# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).
# Benchmarks
* I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431
# Change in generated PTX
This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```
## with printf
This is the version of the code that appears in this diff:
https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a
## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d
(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan:
Running this minimal test case:
```
import torch
def main():
x = torch.ones(10, dtype=torch.int64, device="cuda:0")
torch.repeat_interleave(x, x, output_size=0)
```
Now we see the new message (from printf) alongside the assert failure:
```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...[
```
Rollback Plan:
Reviewed By: mradmila
Differential Revision: D79310684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
Summary:
This adds the Triton Tutorial Matmul persistent matmul with device side TMA for Blackwell and adds it as a template option for blackwell. This uses newer Triton features such as automatic warp specialization and loop flattening, which while still containing flaws can improve performance on blackwell. This does not include the Epilogue subtiling section, as that will be a followup PR.
This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a followup PR with better defaults soon.
Test Plan:
Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162916
Approved by: https://github.com/NikhilAPatel
pytest summarizes test failures by printing a truncated first line of the test of the OUTERMOST wrapped exception.
Prior to this PR, it looked like this:
```
FAILED [0.0454s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db_H_cpu_float32 - Exception: Caused by sample input at index 0: SampleInput(input=Tensor[size=(12, 12), device="cpu", dtype=torch.float32], args=(), kwargs={}, ...
```
I argue this is not so useful. If I have a lot of test failures, I look to the test summary to understand what /kind/ of errors I have, so I can assess which ones I should look at first. In other words, this is better:
```
FAILED [0.1387s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db__softmax_backward_data_cpu_float32 - Exception: Tensor-likes are not close!
```
Now I know specifically this is a numerics problem!
This PR does it by prepending the old exception text to the wrapped exception. This is slightly redundant, as we are exception chaining, but it does the job. Open to bikeshedding.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162961
Approved by: https://github.com/malfet
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.
Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.]])
```
The new implementation guards against overflow and produces correct results for all valid input values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
Summary:
Currently the C10_CUDA_CHECK only shows source location in CUDAException like below:
```
Exception raised from c10_cuda_check_implementation at fbcode/caffe2/c10/cuda/CUDAException.cpp:44
```
which is not terribly useful.
By checking the original diff D39619861 that introduced c10_cuda_check_implementation, it seems the original macro would show the source location correctly but c10_cuda_check_implementation broke it.
This diff will propagate caller source location to c10_cuda_check_implementation to fix the issue.
Test Plan:
CI
Observed desired error message after the change:
```
CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Device-side assertion tracking was not enabled by user.
Exception raised from operator() at fbcode/sigrid/predictor/aed/AedContainer.cpp:659 (most recent call first):
```
Note the last line reports actual caller location.
Rollback Plan:
Reviewed By: Raymo111
Differential Revision: D81880552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162808
Approved by: https://github.com/janeyx99
This PR adds a new interface _aot_compile to `OptimizedModule`, so that the following is possible:
```
mod = SimpleLinearModule()
inputs = [
ModelInput(
args=(torch.randn(3, 3),),
kwargs={},
contexts=[torch.no_grad(), eval_mode(model)],
),
ModelInput(
args=(torch.randn(3, 3),), kwargs={}, contexts=[train_mode(model)]
),
]
assert isinstance(model, torch._dynamo.eval_frame.OptimizedModule)
model._aot_compile(
inputs,
)
```
After this PR, you can AOT precompile NanoGPT and use it to train directly. I'll share my fork of the repo to make this work.
## ModelInput
The `ModelInput` API is a work in progress; for now it represents a set of inputs and contexts to instruct the compiler to compile. Most commonly, this is "compile an eval mode with no grad, and a training mode with grad", but also contains things like autocasting contexts, etc.
## Dispatch
Dispatching is super simple here, we just iterate through all the precompiled fullgraphs and check guards for each one until there's one htat passes. I'm a bit worried that having this in python code is going to be too expensive. The guard checks are happening in C++ anyway, though, so the only python bottlenecked step here is just the for loop, so perhaps the overhead will not be high. I'll work on measuring this, though.
## TODOs
This PR does not support `mod.compile()`, only `torch.compile(mod)`. In order to support `mod.compile()`, we'll need to update torch.nn.Module with an updated implementation — I can add that frontend later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162171
Approved by: https://github.com/zhxchen17
Summary:
The check is introduced in D82262053
- `scalar_value` could be a numpy object
- Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`.
Test Plan:
```
vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10
```
Rollback Plan:
Differential Revision: D82383428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888
Approved by: https://github.com/xush6528
Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps
NVRTC side
- (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>`
- `nvrtcCompileProgram`
- (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed
- `nvrtcDestroyProgram`
CUDA side
- use mangled name instead of normal name -> profit
- `extern "C"` is not even needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875
Approved by: https://github.com/msaroufim
Summary:
Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and when met constructs indices/range_trees to hook into the existing code base's TMA store support.
This also includes a couple notable changes:
- Adds support in the TMA template support for checking the output layout.
- Adds support for "hoisting" the tensor descriptor to the top of the kernel. This will currently only be used by template code right now, but in principle it can be generalized to other implementation.
- Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general since the TMA support is derived from the index it doesn't seems reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches.
Test Plan:
Tested with test_max_autotune.py unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480
Approved by: https://github.com/NikhilAPatel
Summary:
When rewriting sympy expressions in the compiler codebase we want to generate
FloorDiv(a, b) CleanDiv(a, b) directly and not a//b. since the later become floor(a*pow(b, -1))
For symnodes we automatically handle that conversions in the symnode op dispatch.
I will follow up with an issue to track all other usages of //.
Block internal Model.
Test Plan:
add test
run existing tests.
dakechen1993 testing on the model.
Rollback Plan:
Differential Revision: D82362241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869
Approved by: https://github.com/ezyang
Summary:
Please see D21021645 for details about the optimization and why it's beneficial.
A similar change has been added to libstdc++ as well, see dbf8bd3c2f
Rollback Plan:
Reviewed By: yfeldblum
Differential Revision: D81960754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784
Approved by: https://github.com/swolchok
Summary:
One of our builds fails because the return value of fread is discarded. Explicit cast to void fixes the build.
```log
In file included from fbcode/caffe2/torch/csrc/jit/mobile/import.cpp:15:
fbcode/caffe2/torch/csrc/jit/mobile/file_format.h:156:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
156 | fread(data.get(), size, 1, f);
| ^~~~~ ~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
...
BUILD FAILED
Failed to build 'fbcode//caffe2:libtorch (cfg:opt-linux-x86_64-clang19-no-san-opt-by-default#fef256f7ee896871)'
```
Test Plan:
No runtime behavior change. CI.
Rollback Plan:
Differential Revision: D82265002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162767
Approved by: https://github.com/Skylion007
Summary:
The SIMD path is using SLEEF version of `pow` which is slightly different from `std::pow`. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well to ensure consistency between results.
Rollback Plan:
Differential Revision: D82265247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772
Approved by: https://github.com/swolchok
Investigated together with @pyemma and @taotaohuang001
## Problem
when calling exported module with dict nested in the args tuple, it will make following complaits
```
Traceback (most recent call last):
File "/home/chzhu/infinitrain/test_torch_export.py", line 32, in <module>
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 424, in __call__
raise e
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 411, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1806, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
return fn(*args, **kwargs)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 81, in _check_input_constraints_pre_hook
flat_args_with_path = _check_inputs_match(args, kwargs, self._in_spec)
File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 64, in _check_inputs_match
raise ValueError( # noqa: B904
ValueError: Trying to flatten user inputs with exported input tree spec:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a1', 'a2'], [*,
*])]),
TreeSpec(dict, [], [])])
but actually got inputs with tree spec of:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a2', 'a1'], [*,
*])]),
TreeSpec(dict, [], [])]).
Please check that the inputs have the same number and type of args and kwargs as the ones you used when tracing.
```
## How to reproduce the issue
```python
import torch
# create a nn.Module with data_batch as input and output as output
class MyModel(torch.nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.linear = torch.nn.Linear(10, 1)
def forward(self, data_batch):
h1 = self.linear(data_batch["a1"])
h2 = self.linear(data_batch["a2"])
return h1 + h2
# torch export this module
model = MyModel()
example_args_forward = (
{
"a1": torch.randn(10),
"a2": torch.randn(10),
},
)
exported_model = torch.export.export(model, example_args_forward, strict=True)
# save the exported model
torch.export.save(exported_model, "exported_model.pt2")
# load the exported model
exported_model = torch.export.load("exported_model.pt2").module()
# run the exported model
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
```
## Root Cause
Input spec is encoded as [TreeSpec](582d278983/torch/utils/_pytree.py (L1059)) in torch export. With (args, kwargs) at the top level. When we call the exported model, it has a pre-execution [hook](582d278983/torch/export/_unlift.py (L66)) to check the input TreeSpec matches the received TreeSpec, where in Treespec, the dict key order is preserved. Something like
TreeSpec(dict, ['a2', 'a1'], [*,*])
To workaround this, the input check reorders [kwargs](582d278983/torch/export/_unlift.py (L67)), that is why kwargs can be out of order. But the dict nested in the args is not re-ordered, so any re-ordering of the keys will throw errors.
## Solution
Update eq_spec to handle the dict case, where we only guarantee that key set is the same without ordering constraints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162618
Approved by: https://github.com/angelayi
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
CI
Rollback Plan:
Differential Revision: D80181887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
Summary:
Sometimes checkpoint background process creation times out during gloo pg init.
Attempting to destroy the process during that time can block the trainer thread until the timeout completes.
This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81724668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
Summary: Updates the NoValidChoicesError logic to include some additional context for if not choices exists or if no choices compiled.
Test Plan:
NFC. Depending on CI.
Rollback Plan:
Differential Revision: D82312035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162814
Approved by: https://github.com/mlazos
Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS).
Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918
Approved by: https://github.com/albanD
ghstack dependencies: #161917
**Background:**
Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic.
**How to do:**
So how do we verify the quality of the third-party device integration mechanism?
We will maintain a `test_openreg` entrypoint in `test/run_test.py`.
This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available.
**Next:**
We will also improve `torch_openreg's` test coverage in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917
Approved by: https://github.com/albanD
Minor bug fix for the `nll_loss` test.
Before this PR, it runs `torch.randint(high=0)`, which will fail because it would try to generate a number that >= low and < high, i.e. x>=0 and x<0.
The test did not fail because that line is not run when testing on CPU because it failed earlier because of a unsupported dtype.
However, as we support TPUs at Google, this line is reached first before the dtype check, which triggers the bug.
To my understanding, these OpInfo should be general enough to support different hardware.
Fixing this obvious bug would make it more general cross different hardware.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763
Approved by: https://github.com/soulitzer
# why
- now everything is in place to just gather templates and run
the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
have a full view of the options for an op, not just for
one template
# what
- replace multiple calls to V.choices.get_mm_configs with
calls to gather the active templates, and then using those
in a single call
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
# why
- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
here, and provide deeper passes more chances to change it
# what
- if all the kernel template choices (KTC) are with a ExternKernelChoice
template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
only extern, and FixedLayout if Triton is involved)
- caveats:
- because CPP, CUTLASS, and CK are not using
V.choices.get_mm_configs yet, we turn off the optimization
if either of those backends are in use. This will be relaxed
once they support this too
- because Triton templates are still using their own calls
(not a single call) to get_mm_configs, it's also turned
off there. The next diff unifies Triton + ATEN to a single
call to get_mm_configs and that in turn allows the optimization
there too
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
The tests were comparing raw exported strings for protobuf comparison, which is not backward/forward compatible with different versions of protobuf.
This PR parses the strings into protobuf and compares the protobufs directly, similar to what we did in assertImageProto.
Our test failed because we used a different version of protobuf, which output 44100.0 instead of 44100, which resulted in an error. However, they are equal, but only different in the exported strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162644
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
Fixes the case described below which occurs when:
- A user `torch.compile`s a function that uses a triton kernel.
- `TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1` .
Problem:
If the user defined triton kernel is not autotuned:
```python
import os
os.environ["TORCHINDUCTOR_DUMP_LAUNCH_PARAMS"] = "1"
@triton.jit
def kernel(..., BLOCK_SIZE: tl.constexpr):
...
@torch.compile
def fn(..)
kernel[..](..., 128)
fn(..)
```
Then In `triton_heuristics. _interpret_args_grid`, `filter_signature` function:
```python
def filtered_signature() -> list[str]:
# constexprs are not passed in as args
return [
x
for x in self.triton_meta["signature"].keys()
if x not in cfg.kwargs.keys()
]
```
because `triton.autotune` is not used on the the `triton.jit` function, `cfg` above will be empty, and so `BLOCK_SIZE` will not be removed from the signature even though it is constexpr, even though it is removed from the arguments that are passed in to `interpret_args_grid`. This results in a mismatch between the number of parameters in the signature and the number of arguments, which leads to the error `NameError: name '_grid_2' is not defined`.
Fix:
Use the triton jit kernel `constexprs` for args to remove. Not sure if this is a good fix so suggestions are welcome.
Test plan:
Added a parameter to an existing triton kernel to test for this edge case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161924
Approved by: https://github.com/davidberard98
# Problem
Inductor has a `ScatterFallback` op with custom Python and C++ wrapper codegen macros. This is used in certain situations where the default Triton codegen doesn't apply, and especially for reductions which need to be deterministic. Since this op used direct Python/C++ codegen, it wasn't compatible with the FX backend.
# Feature
This PR refactors the associated wrapper codegen to support `ScatterFallback`. This follows the same basic steps that were used for other fallback ops including `MultiOutput` and `ExternKernel`:
1. Create a new wrapper IR op called `ScatterFallbackLine`. Move the logic in `ScatterFallback.cogeden` to `ScatterFallbackLine.codegen`, to prevent it from affecting the FX backend. This logic is unsafe for FX because it may generate Python or C++ strings with methods like `codegen_reference()`.
2. To eleminate the dependence on `V.graph`, move language-specific logic to the respective wrapper codegen subclasses. In this case, C++ codegen has some special logic, which is moved to `CppWrapperCpu`.
3. Create a new method in `FXWrapperCodegen` to handle `ScatterFallbackLine`.
# Test plan
Added a couple of CI tests for the FX backend with scatter fallbacks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162686
Approved by: https://github.com/jansel
Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm.
Tested on Windows with gfx1201.
Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via. `setuptools` due to linker errors when using `cl` directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577
Approved by: https://github.com/ezyang
We create a wrapper class acting as a layout for device mesh so that we can add new methods more specific to DeviceMesh and keep the core logic of CuTe manipulation inside pycute module. This PR create the main body of the code and then next PR will come with actual implementation and unit test for device mesh layout. (Actual implementation can be found in https://github.com/pytorch/pytorch/pull/161016)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162414
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534
Users can specify the following to get a libtorch_free `.so`.
"aot_inductor.use_libtorch": False,
The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is because a lot of the symbol definitions in torchnative repo is only in header files.
"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)
The main motivating use case is for executorch to produce a libtorch free `.so`.
TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
Excessive re-recording CUDAGraphs lead to bad performance. We previously warns once if this happens.
However, the limit (=50) is too high and users may just observe bad performance before actually seeing the warning message. Even worse, users may not see the warning message when there are many other logs. @anijain2305 reported that he never saw this warning message when using transformer library, but he DOES observe slowdown due to cudagraph re-recording & needs to turn off cudagraph.
#162663 attempts to hard error when re-recording too many times due to dynamic shapes. But it is a bc-breaking change. Actually, hf-t5-generate model in torchbench failed due to 256 re-recordings.
This PR a) reduces to smaller limit (=8); and b) makes the warning more spam, i.e., warn once for every distinct shapes once the limit is reached.
Fixes#162299
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162696
Approved by: https://github.com/mlazos
With the new change we only log the warning if we're running non distributed code or if we're in rank 0. Unit testing that certain messages get printed on certain ranks only feels kinda jank so test plan is below instead
Test plan
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how how `TORCH_CUDA_ARCH_LIST`only shows up once if we explicitly set the the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`
Then we get
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
# Implement OpenReg device autoload mechanism
## Overview
The **Autoload** mechanism in PyTorch simplifies the integration of third-party device backends by enabling automatic discovery and initialization at runtime. Traditionally, integrating a new backend required explicit imports or manual initialization, which could be cumbersome and error-prone. With Autoload, PyTorch dynamically detects and initializes device backends, providing a seamless user experience.
This mechanism leverages Python entry points (e.g., `torch.backends`) and dynamic module loading. When PyTorch starts, it scans for registered entry points and invokes their initialization hooks, ensuring that all available backends are ready for use without requiring explicit imports.
## Motivation
This PR aims to apply [device autoload mechanism](https://github.com/pytorch/pytorch/issues/122468) to the OpenReg module with some simple changes.
## Change
### Before
```python
import torch
import torch_openreg
x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```
### After
```python
import torch
# No need to import torch_openreg manually!
x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158555
Approved by: https://github.com/FFFrog, https://github.com/albanD
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
if we detect compiled model is using cuda in meaningful way, we should store information about cuda + hardware
Example: `SystemInfo(python_version='3.12.9', torch_version='2.9.0a0+gite02b0e6', cuda_version='12.6', triton_version=(3, 4), gpu_name='NVIDIA PG509-210')`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162438
Approved by: https://github.com/zhxchen17
Biggest difference between both conda and homebrew CPython builds and one from python.org, is that later are universal binaries and they are always trying to build universal extension...
Workaround lots of universal binary build attempts by explicitly specifying both `_PYTHON_PLATFORM` and `--plat-name` as well as `ARCH_FLAGS`
Suppressed actionlint warning on use of `freethreaded` flag which is document in https://github.com/actions/setup-python/tree/v5
TODO: Remove lots of temporary workarounds when `3.14` is out in October 2025
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162136
Approved by: https://github.com/atalman, https://github.com/huydhn
ghstack dependencies: #162297, #162265
This PR fixes the errors like below:
```
[rank3]: RuntimeError: The following operation failed in the TorchScript interpreter.
[rank3]: Traceback of TorchScript (most recent call last):
[rank3]: RuntimeError: /tmp/comgr-28f951/input/CompileSourceACC062:67:7: error: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
[rank3]: 67 | uint32_t int32;
[rank3]: | ^~~~~~~~
[rank3]: | __hip_internal::uint32_t
```
Earlier uint32_t was defined in HIP headers in std namespace. Now it is moved to __hip_internal namespace in hip headers. This change is made in ROCm 7.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160587
Approved by: https://github.com/jeffdaily
I saw a failure where the reference error was 0.0, and the compiled error was 0.035. Although the failure still occurs with or without this change, it was confusing to see RMSE of 0.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162088
Approved by: https://github.com/drisspg
raise value error on init `ParametrizationList`, if `original.device != new.device`.
currently `_maybe_set` will throw below error in such situations, which I think it's not convenient to debug.
```
[rank1]: RuntimeError: Attempted to set the storage of a tensor on device "cuda:1" to a storage on different device "cpu". This is no longer allowed; the devices must match.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162717
Approved by: https://github.com/lezcano
Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is expected by the underlying `target` method. THhis triggered a bug with large K shapes because the decompose k is `SubgraphChoiceCaller`.
Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
Rollback Plan:
Differential Revision: D82181924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Summary:
When exahaustively autotuning a new template you may hit situations that lead to compilation failures. This template will still attempt to autotune because nothing was marking this as failed and in my experiments lead to a crash/segfault if I didn't set `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`.
To help eliminate this issue this PR marks any template that fails to compile as "failed" and then removes all of the failed templates from the choice candidates. In the case where it would have just failed to compile twice, this should at least reduce compilation time.
Test Plan:
Tested locally when experminenting with the new blackwell templates and a Triton version that contains a bug related to `num_warps < 4`.
Rollback Plan:
Differential Revision: D82172207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162673
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos
Summary:
Add new `scaled_mm` and `scaled_persistent_mm` configs to `template_heuristics.py` for Inductor FP8 Triton templates. These configs are a representative subset of the most performant configs generated from exhaustively autotuning FP8 Triton kernels with per-tensor and per-row scaling.
See this [spreadsheet](https://docs.google.com/spreadsheets/d/1Fal1vhFUJIUcLpM2kJect6IkgeUFvCY-nUr3RTupM_4/edit?gid=1732602731#gid=1732602731) for benchmarks and performance metrics.
Test Plan:
Verify that configs do not error, i.e.
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+i
nductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only pt2_fp8_gemm --metrics tflops,accuracy --input-loader={input_path} --output="{output_csv}" --atol=1e-2 --rtol=0.5 2>&1 | tee {log_file}
```
Rollback Plan:
Reviewed By: NikhilAPatel, PaulZhang12
Differential Revision: D81651226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162699
Approved by: https://github.com/PaulZhang12
This change addresses confusing error messages users encounter when using the ONNX exporter with default settings. Previously, `fallback=True` was the default, which would attempt to fall back to the TorchScript exporter when the dynamo path failed, leading to mixed error messages that obscured the actual issues.
## Problem
When `fallback=True` by default:
- Users get confusing error messages mixing dynamo and TorchScript export failures
- Error messages tell users to provide the `f` argument unnecessarily
- Dynamo error messages get flushed with TorchScript errors when both paths fail
- Users expecting the dynamo path get unexpected fallback behavior
## Solution
Changed the default from `fallback=True` to `fallback=False` in both:
- `torch.onnx.export()` function
- `torch.onnx._internal.exporter._compat.export_compat()` function
## Impact
**Before:**
```python
# Would fallback to TorchScript on dynamo failure, causing mixed error messages
torch.onnx.export(model, args)
```
**After:**
```python
# Clean dynamo-only errors by default
torch.onnx.export(model, args)
# Advanced users can still opt-in to fallback behavior
torch.onnx.export(model, args, fallback=True)
```
Fixes#162697
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162726
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
This makes gemma3 exportable on transformers=4.55.4
In HF, there is a torch funciton mode called TransformGetItemToIndex which internally calls custom autograd function. When this custom autograd function is called under vmap, It triggers CustomFunctionHigherOrderOP which error-ed because there was no pre-dispatch proxy mode implementation.
Since there are number of requests lately to add various operators in pre-dispatch IR, I introduce a decorator in export that works similar to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds restriction that this custom autograd function needs to take in fx-able types.
3) subclass constructor decorator is implemented similarly, so we just refactor it to use similar implementation as this new decorator. eventually we should delete the subclass constructor decorator.
4) Move some code in subclass constructor decorator to exit early in non-export environment which should shave off some inefficiency (around 1% according to @swolchok 's benchmark)
Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758
Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
Summary:
Sometimes `ShapeEnv.create_symbol` can return a `sympy.Integer`. This messes up our phantom symbol infra for derived dims.
Fixes#161902
Test Plan:
added test based on repro
Rollback Plan:
Differential Revision: D81960709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162416
Approved by: https://github.com/tugsbayasgalan
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
fbgemm adds tbb as a dep only for rocm to avoid missing tbb symbols at import. But the way it was done was in setup.py to add the linker flag to CMAKE_CXX_FLAGS and it wasn't working for reasons unknown to me. But what did work was to add tbb as a dep in the cmake file. [We have a PR against upstream fbgemm](https://github.com/pytorch/FBGEMM/pull/4859) for that. Meanwhile, a much smaller patch is applied here in this PR until the fbgemm rocm ci commit hash is moved forward to include the tbb patch from upstream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162649
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
use torch.accelerator and `_get_device_module` instead of cuda to make DataParallel more device agnostic.
Fixes#162152
recently, I've done some works to support my own privateuse1 backend in DataParallel module, but I found some cuda related APIs exist in parallel_apply.py file, that makes me have to monkey patch DataParallel module to support DP on my own backend.
so I make some small changes to replace cuda.xxx to accelerator.xxx, and acquire device module by `_get_device_module`.
this is my first time to contribute to pytorch, please let me know if there is any problem about the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162573
Approved by: https://github.com/ezyang, https://github.com/guangyey
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Summary: D79674759 tried to fix the expensive prepare and convert steps, as `assert_and_get_unique_device` was called multiple times. This change fixes that issue by using `functools.cache` decorator.
Test Plan:
Verified on llm export to QNN.
LLM Quantization prepare time of ~20min reduced to ~3min.
Rollback Plan:
Differential Revision: D82073679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550
Approved by: https://github.com/andrewor14
Summary:
Add _package_executorch_files to archive apis. Allow us to package a PTE file into the archive.
I don't think there's a use-case to have more than one PTE file at the moment, but left it as `EXECUTORCH_FILES` just in case.
Test Plan:
Tested in D81992612
Rollback Plan:
Differential Revision: D81977483
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162520
Approved by: https://github.com/angelayi
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backend, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.
lock needs acquire when the op succeeds and relaxed if the op is not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.
incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.
decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be a acquire fence to make sure no read op reads stale data before the object is destructed. However, microbenchmark showed that the optimal fence for decref is not performing noticeably better than the current decref with acq-rel, so we keep decref as-is.
This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
## Summary
- pytorch is not built for *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)
## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a
## Test plan
Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9
export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```
Check build logs:
```
CMake Warning at CMakeLists.txt:901 (message):
Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```
Run unit tests:
- `pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter.
Test Plan:
The test shall pass in the current diff (after fix), and fail in the parent diff (before fix)
```
buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry
```
Rollback Plan:
Differential Revision: D81232435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716
Approved by: https://github.com/ezyang
Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets.
This update:
- Attempting to gather the information from a hardware library like `pynvml`, improving accuracy and expanding support to devices that don't have entries in the datasheet list.
- Adjusts flops and bw calculation based on these hardware values. For example, if the the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245
Approved by: https://github.com/v0i0, https://github.com/shunting314
Internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR is to enable combo kernels on inputs with data-dependent shapes.
### Example exception
```
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
kernel_code_list = self.generate_combo_kernel_code(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
src_code = kernel.codegen_kernel()
^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
code.splice(self.codegen_kernel_benchmark(num_gb=0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
var_names.extend(self.kernel_benchmark_extra_args())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
return int(out)
^^^^^^^^
File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```
Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
This PR is quite large in that it covers most of rough edges in the new strict export flow:
1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
Reopened from #158747 which got reverted since without setuptools-scm in pytorch index URL the wheel cannot be built
We reconsider the original PR idea of introducing CK as a pytorch dependency on ROCm Linux and install the CK python package in CI only -- since (1) rocm-composable-kernel depends on setuptools-scm which depends on tomli and the existing index URLs need to be modified to host the new packages and (2) there also is a packaging [bug](https://github.com/pypa/setuptools/issues/3269#issuecomment-1254507377) in Ubuntu 22.04 which prevents correct dynamic version calculation with default system pip.
Extras:
-> this PR reconsiders how TORCHINDUCTOR_CK_DIR env variable is used; previously, this var was used to point to rocm-composable-kernel package installation path on the filesystem; now, the path is inferred by trying to import ck4inductor
-> the tests are updated to reflect this change
-> since in CI clang points to a bash script which invokes sccache, we cannot patch PATH to not contain sccache, this logic is removed from the testing code
-> scaled_mm test crashes during the benchmarking when the benchmarking happens in the main process, and times out benchmarking when it happens in a subprocess, on gfx942, so it is disabled
TBD: roll back rocm-mi300 workflow before merging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162288
Approved by: https://github.com/jeffdaily
# why
- now everything is in place to just gather templates and run
the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
have a full view of the options for an op, not just for
one template
# what
- replace multiple calls to V.choices.get_mm_configs with
calls to gather the active templates, and then using those
in a single call
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
# why
- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
here, and provide deeper passes more chances to change it
# what
- if all the kernel template choices (KTC) are with a ExternKernelChoice
template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
only extern, and FixedLayout if Triton is involved)
- caveats:
- because CPP, CUTLASS, and CK are not using
V.choices.get_mm_configs yet, we turn off the optimization
if either of those backends are in use. This will be relaxed
once they support this too
- because Triton templates are still using their own calls
(not a single call) to get_mm_configs, it's also turned
off there. The next diff unifies Triton + ATEN to a single
call to get_mm_configs and that in turn allows the optimization
there too
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
----
This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.
In this PR:
- We prebuild + install Arm Compute Library in the manylinux docker image ( at /acl ), instead of a build time for every pytorch build. Also updated jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places to define openblas version, this has been reduced to 1 now ( install_openblas.sh ).
- ACL_VERSION and OPENBLAS_VERSION are now able to be overriden at build.sh level for developers, but there is only 1 version of each hardcoded for ci.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere
ghstack dependencies: #160078
Fixes aarch64 linux packaging, following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
import torch
File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.
Current Error:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging does not take any effect.
This PR does following
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Looks like after migrating from zipping the wheel to wheel pack renaming the wheel is no longer necessary. Hence removing renaming and deleting old file.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
Test Plan, Executed on local file:
```
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
to Tag: cp310-cp310-manylinux_2_28_aarch64
Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup
Using `unittest.skip` decorators avoids the starting of the test in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
Summary: `executorch_call_delegate` should have flattened inputs and outputs. So that it can be correctly serialized and the input/output specs are consistent with runtime.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82064354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162538
Approved by: https://github.com/dolpm
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets to the targets must be manually defined. See CMakeLists.txt the main libraries torch_python, torch, torch_cpu, torch_cuda, torch_xpu have been targetted which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
Since we are in the middle of big refactoring and simplying the bookkeeping for device mesh. We found an interesting bug inside DeviceMesh flatten implementation. Here is the finding:
1. In unit test, we assume users can call `dp_cp_mesh._flatten()` many times but no backend will be created (aka cached).
2. From the implementation of slicing, we actually throw exception erroring out doing the `_flatten` more than once. But there is bug which was partially fixed in https://github.com/pytorch/pytorch/pull/160709 but it does not fixed the check for the case when we call the `_flatten` twice.
What's more important question to ask is, what behavior we want for `_flatten`? Do we allow calling `_flatten` multiple times (with same mesh_name)? I think we should, why?
1. We allow slicing for the same mesh_name or name_list multiple times, and we cache the PG behinds. Although we will return a new device mesh object everytime, when we compare them they are all the same (according to __eq__).
2. We actually cached the flattened mesh today inside `root_to_flatten_mapping` and actually do the early return but that line will never be reached if we error out before that.
Also we should allow a no-op for flatten a 1D mesh into itself's mesh_dim_name, I added a unit test for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161311
Approved by: https://github.com/fegin
I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true. The error is as follows:
```
$ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129
Using Python 3.12.11+meta environment at: venv/py3.12
Resolved 28 packages in 759ms
error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129)
Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371
Approved by: https://github.com/atalman
Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated.
Test Plan:
```
buck2 test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:device_mesh -- init_backend
```
Rollback Plan:
Differential Revision: D81981777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.
In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy.
This was discovered from silent incorrect results e.g. on `torch.einsum` backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460
Approved by: https://github.com/zpcore
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.
Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052
Rollback Plan:
Differential Revision: D79230339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
Fixes#154849
This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. Changes supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and signals_to_handle parameter from laucher api
Tests:
For validations purpose:
test_signal_handling.py,
simple_test_api_signal_handling.py,
Unit Tests:
for launcher changes:launcher/test_api.py
for api changes: multiprocessing/test_api.py
E2E: test_run.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.
The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.
Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
Fixes#161610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass); improves the output format in JSON to be used by dashboard for reporting; and introduce some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:
(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.
(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)
I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated
This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):
**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
4.80 us
1 measurement, 100000 runs , 1 thread
```
**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
336.40 us
1 measurement, 1000 runs , 1 thread
```
**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
Median: 133.45 us
2 measurements, 1000 runs per measurement, 1 thread
```
benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark
fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)
mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
t0 = benchmark.Timer(
stmt='_ = x_dt.detach()',
globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())
dist.destroy_process_group()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
# why
- unnecessary as we only ever need to know the dtype and maybe the
device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
but before generating the ChoiceCallers
# what
- replace all calls in template_heuristics that used to take Layout
with now just taking out_dtype
# testing
ci
Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348, #161349
# why
- enable us to override the default configs, or fall back to them
through subclassing InductorChoices
# what
- override (private) function
- default implementationt takes the kernel template choice (ktc)
generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
out choices
- the 2nd expensive step (maybe_append_choices, choice_or_none) is
handled outside this function, in the main V.choices.get_mm_configs
this means that any overriding benefits from not generating expensive
templates that aren't going to be used
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348
\# why
- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
generators are executed, so we don't need this generator anymore
\# what
- rather than yielding the ChoiceCaller, just return the list of all
valid ChoiceCallers
\# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #161347
# why
- gather everything up to make choices, without running
potentially expensive generators
- enables overrides where we toss the entire list of configs
from inductor, without having to enumrate it (expensive)
# what
- add a holding class that just gets all the components necessary
to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
the scene
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700
Approved by: https://github.com/jeffdaily
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%\envs\py_tmp` in Command Prompt before running Git Bash.
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
@ -16,6 +16,8 @@ However, if you believe you have found a security vulnerability in PyTorch, we e
Please report security issues using https://github.com/pytorch/pytorch/security/advisories/new
All reports submitted thru the security advisories mechanism would **either be made public or dismissed by the team within 90 days of the submission**. If advisory has been closed on the grounds that it is not a security issue, please do not hesitate to create an [new issue](https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml) as it is still likely a valid issue within the framework.
Please refer to the following page for our responsible disclosure policy, reward guidelines, and those things that should not be reported:
TORCH_CHECK(state_->seed_==seed,"CUDAGeneratorImpl::set_current_seed can be called during stream capture only if new seed is the same as the original seed.");
self_.sizes()," with dimension ",i," being empty.");
}
autooheight=output_size[0];
autoowidth=output_size[1];
automemory_format=self_.suggest_memory_format();
autoself=self_.contiguous(memory_format);
autoindices=indices_.contiguous(memory_format);
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.