This PR removes import tricks of `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They are declared to be removed in PyTorch 2.1 but not.
Before change:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
works
After change:
there is an import error exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
## Context
An example from Qwen2-7B
- This come from running torch.compile with a sequence length that is
divisible by 8 (no padding needed). Call this `Run1`.
- If we then run the compiled model with a difference length that isn't
divisible by 8 (requires padding). Call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.
### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)
## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
Summary:
Under circumstances it seems reasonable to return a callable directly without guard check when user use aot_compile on a function with single compilation result.
When having multiple entries (aot_compile_module), we should start enabling guard check to differetiate different compiled functions apart.
Test Plan: CI
Differential Revision: D82904540
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm
We are seeing crashes of the form
```
Traceback (most recent call last):
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1487, in run
while self.step():
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1348, in step
self.dispatch_table[inst.opcode](self, inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2437, in LOAD_ATTR
self._load_attr(inst)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2425, in _load_attr
result = BuiltinVariable(getattr).call_function(
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 1347, in call_function
return handler(tx, args, kwargs)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <lambda>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <listcomp>
tx, [v.realize() for v in args], kwargs
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 72, in realize
self._cache.realize()
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 33, in realize
self.vt = builder.VariableBuilder(tx, self.source)(self.value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 445, in __call__
vt = self._wrap(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 1043, in _wrap
torch._dynamo.utils.store_user_object_weakref(value)
File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/utils.py", line 4694, in store_user_object_weakref
user_obj_id_to_weakref[obj_id] = weakref.ref(obj)
torch._dynamo.exc.InternalTorchDynamoError: TypeError: cannot create weak reference to 'torch.Event' object
```
This pull request makes us gracefully graph break, vs explicitly crashing.
I've added a test which reproduces the issue. There is a side discussion re:
how did torch.Event support ever work here, since it appears you cannot take a
weakref to a torch.Event
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163168
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
Summary:
otherwise, may hit
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
Test Plan: UTs
Reviewed By: yushangdi
Differential Revision: D82974062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Before
Moving module.to("cuda:0") under fake tensor mode would have parameter on `meta` device.
After
parameters would be on "cuda:0" .
Test Plan: buck2 run fbcode//caffe2/test:fake_tensor -- --r test_move_module
Reviewed By: mikaylagawarecki
Differential Revision: D80102876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163433
Approved by: https://github.com/albanD
This change restricts the DLPack stride normalization to apply only to 1D tensors of shape (1,).
### Rationale
The previous implementation normalized the strides for any multi-dimensional tensor containing a dimension of size 1. While well-intentioned, this "over-normalization" discards critical memory layout information, causing issues for downstream consumers who rely on strides to infer alignment and contiguity.
For example:
* A row-major tensor with `shape=(1, 128)` and `stride=(128, 1)` would be incorrectly normalized to `stride=(1, 1)`.
* A column-major tensor with `shape=(1024, 1)` and `stride=(1, 1024)` would also be normalized to `stride=(1, 1)`.
This loss of stride information makes it impossible for consumers to detect the original memory layout (e.g., row-major vs. column-major) and breaks assumptions about memory alignment needed for optimized indexing or specialized hardware APIs like GPU TMA.
The original intent of the normalization was to handle the simple case of a 1D tensor with shape=(1,) and a non-standard stride. This fix reverts to that specific, non-problematic behavior, ensuring that multi-dimensional tensors retain their precise stride information during DLPack export.
### Related Issues
#163274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163282
Approved by: https://github.com/eqy
A big pain point ppl have with custom ops is that they do not accept arbitrary input/outputs. In this PR we create the concept of an "OpaqueObject" which allows users to pass arbitrary python objects into custom operators.
Some still slightly annoying parts with this implementation:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of whatever python type
- `@torch.library.custom_op` doesn't work.. yet?
UX:
```python
from torch._library.opaque_object import make_opaque, get_payload
# your custom python class
class OpaqueQueue:
def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
super().__init__()
self.queue = queue
self.init_tensor_ = init_tensor_
def push(self, tensor: torch.Tensor) -> None:
self.queue.append(tensor)
def pop(self) -> torch.Tensor:
if len(self.queue) > 0:
return self.queue.pop(0)
return self.init_tensor_
def size(self) -> int:
return len(self.queue)
queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)
# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)
# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertTrue(queue.size(), 1)
```
Authoring a custom op:
```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")
torch.library.define(
f"_TestOpaqueObject::queue_push",
"(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
tags=torch.Tag.pt2_compliant_tag,
lib=lib,
)
@torch.library.impl(f"{libname}::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
# We can get the payload directly by get_payload(q)
queue = get_payload(q)
assert isinstance(queue, OpaqueQueue)
queue.push(b)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored in an 8-byte aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of C++ language standard so it's essentially undefined behavior.
This change eliminates that assumption by combining refcount_ and weakcount_ in a single 64-bit count and we use the lower 32 bits for refcount_ and upper 32 bits for the weakcount_.
In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
Reviewed By: yfeldblum
Differential Revision: D82869192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
all details are in readme.md
Note: one thing i want to do soonest is to switch to graph representation instead of stack representation
for the fuzzed ops should make things easier as things get more complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163417
Approved by: https://github.com/bobrenjc93
fixes#159855, was not triggered in other tests since it took
more than one round of fusion to get to the problematic code
which prunes WeakDeps. The WeakDeps are important to inhibit
fusion of kernels that read/write data into mutated buffers
with different indexing.
We modify the code to a) always prune before fusion, rather
than after, which improves its coverage and makes our basic
vertical fusion tests surface this issue as well and b)
check whether the weak dep is fusable before eliminating it
(which basically means checking that the producing code and
the consuming code are sufficiently compatible).
The tests that trigger this with change (a) is:
test_fusing_write_into_disjoint_read introduced in #118210.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162316
Approved by: https://github.com/eellison, https://github.com/mlazos, https://github.com/shunting314
### Issue
The previous `enable_triton` UI requires the user-defined Triton kernel have a "nvshmem" in its name.
If users did not do so, the kernel would miss the NVSHMEM init, and silently hit CUDA IMA.
The `@require_nvshmem` decorator eliminates the above name requirement (and the `enable_triton` call).
### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
...
foo[(1, 1)](...)
```
It also remove the need of passing `extern_lib` to `foo` (handled by the decorator now).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
No need for unnecessary copy of std::vectors. This Tensor list is copied throughout the foreach paths and this code is on a hot path for torch optimizers. Auto move elision will not happen on the return statement since it's a subelement of a vector that needs to be copied out before the std::vector is dtor'd. This should reduce quite a few list copies along this path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163416
Approved by: https://github.com/ezyang
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and we see if a special FCD implementation needs to be dispatch to first, and if there is nothing we fallback to the standard level strategy.
Because there is no longer C binding equivalent of classes, we've condensed _C.Dim and Dim together, and similar for Tensor. This resulted in some bugs as the Python API is sometimes different from the C API. I've attempted to disambiguate these but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful as Dim must abide by Tensor equality semantics, but is pointer equality in C (DimEntry doesn't have this problem). Another difference between C/Python that is subtle is we no longer get implicit conversions from Dim to DimEntry, this also caused some bugs.
Much of the mechanical porting work was done by claude code. I have a separate PR that deletes functorch._C, but it was useful having dim.cpp to point claude at it so I haven't done it in this PR. From a reviewing perspective, I need to re-review that I didn't forget to port anything, some noticeably missing "small" things are patched_dim_method. I am still in progress of carefully doing a side-by-side review of ports; "simplifications" from claude code were also a major source of bugs.
There are two major feature gaps in the implementation:
- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy, just need to do it. However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
Summary:
Today `fullgraph_capture` takes a frame, but clients usually take a callable (`nn.Module`, function, or method) and example inputs (args and kwargs) and then explicitly set up the frame to pass. This is boilerplate—and potentially tricky to get right—that can be hidden inside the API.
The original `fullgraph_capture` now becomes `_fullgraph_capture_frame`.
Test Plan:
existing tests
Rollback Plan:
Differential Revision: D82339400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162849
Approved by: https://github.com/zhxchen17
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") call.
This diff supports this.
Notice that .to("cuda") doesn't work yet, as it enquery current device idx by calling cuda API.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Test Plan:
buck2 run fbcode//caffe2/test:fake_tensor -- --r test_fake_gpu_no_init
Rollback Plan:
Differential Revision: D80101283
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160431
Approved by: https://github.com/henryoier, https://github.com/ezyang
Currently OutputGraphGuardsState is separated out as a serializable interface for OutputGraph, but some of the typing around it is incorrect in dynamo's guards.py and output_graph.py: more fields are used by code than claimed by OutputGraphGuardsState, and it works because either the full OutputGraph is passed in or the parts that use those fields are dead when OutputGraphGuardsState is passed in.
In this PR we try to further separate the necessary fields of OutputGraph that should be retained by a full graph capture mechanism, not just limited to dynamo (as it is currently) but also something like make_fx (in the future). Since these fields do not need to be serialized, the result is an intermediate "common" data structure that is between OutputGraphGuardsState and OutputGraph in the inheritance hierarchy.
Differential Revision: D81718791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162211
Approved by: https://github.com/zhxchen17
Summary: We observe a case then the fwd graph has duplicated return nodes, which will lead to errors due to fx renaming the node, thus we add poi info into the node name.
Test Plan:
### unit test
```
CUDA_VISIBLE_DEVICES=3 buck2 test mode/opt -m ovr_config//triton:beta -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 //caffe2/test/functorch:test_aotdispatch -- test_quantize_activation_duplicate_nodes
```
Buck UI: https://www.internalfb.com/buck2/de5eccc6-4064-4214-843d-70b8e3829afe
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4503599937670844
Network: Up: 217KiB Down: 72KiB (reSessionID-73e5c269-4f4d-4a54-896a-79c077eea326)
Executing actions. Remaining 0/2 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 45.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
before
f798417700
after
Differential Revision: D82844100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163364
Approved by: https://github.com/Yuzhen11
# why
- extra kwargs are input/op dependent and not config dependent. We don't
plan to serialize/deserialize them, and so they need to be fed in
later beore making the KTC, rather than when getting the config values
directly
# what
- move extra_kwargs into the KTC and get_ktc interface directly
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "_addmm"
```
Differential Revision: [D82871310](https://our.internmc.facebook.com/intern/diff/D82871310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163209
Approved by: https://github.com/nmacchioni
ghstack dependencies: #163305
# why
- this is not directly controlled by the config arg but rather by the
input and by the inductor wide setting
- it's always the same for every choice
- we want the config kwargs to be *programable* and this is not
programable in that sense but rather needs to use inductor config
# what
- move generating the ALLOW_TF32 kwarg in Triton templates into
get_extra_kwargs
# testing
with some annotations, this is now the kwargs and extra_kwargs on addmm
```
{'EVEN_K': True, 'USE_FAST_ACCUM': False, 'ACC_TYPE': 'tl.float32', 'num_stages': 1, 'num_warps': 2, 'BLOCK_M': 32, 'BLOCK_N': 32, 'BLOCK_K': 16, 'hint_override': None, 'GROUP_M': 8} # choice/config kwargs
{'ALLOW_TF32': True, 'epilogue_fn': <function addmm_epilogue.<locals>.epilogue at 0x7f64d54ff600>, 'epilogue_fn_hash': "['addmm_epilogue', torch.float32, 1, 1]", 'prefix_args': 1} # extra kwargs
```
they're both passed onto the template
Differential Revision: [D82871312](https://our.internmc.facebook.com/intern/diff/D82871312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163305
Approved by: https://github.com/nmacchioni
Fixes#158631
The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
# Feature
Support `torch.cond` in the FX converter. The generated FX IR is conceptually indentical to what would come from `torch.export`:
- Submodules as stored as attributes, and accessed via `getattr`.
- The conditional is represented as `torch.ops.higher_order.cond`, which takes in the subgraphs, a predicate and submodule inputs.
# Implementation overview
The FX backend generates code for subgraphs using the following steps:
1. When `codegen_conditional` is called in `WrapperFxCodegen`, we emit a `ConditionalLine`.
a. We also codegen the true/false subgraphs at this time, storing their subgms for later.
2. At the beginning of FX conversion, generate `get_attr` nodes accessing each subgraph. It's important to do this at the start, before registering the node metadata hook. This also matches the convention followed by torch.export.
3. When we see the `ConditionalLine` in the FX converter, we generate a corresponding `torch.ops.higher_order.cond`.
# Implementation details
This ended up being a substantial change, as wrapper codegen has some special logic for subgraphs.
Certain methods of `PythonWrapperCodegen` are overridden by `SubgraphPythonWrapperCodegen`. To apply these overrides, we use multiple inheritance with the registered subclass of `WrapperFxCodegen`.
Unlike most other wrapper codegen methods, which map 1:1 to Wrapper IR lines, subgraph codegen generates a number of wrapper lines including `EnterSubgraphLine` and `ExitSubgraphLine`, along with Python or C++ code calling the subgraph as a function. These lines are used for some backends' memory planning.
In contrast, FX IR typically represents a subgraph call as a single HOP node, or a `call_module` op. To account for this difference, this PR introduces a new wrapper IR line called `ConditionalLine`, which is only used by the FX backend. We override the `codegen_conditional` method to emit this line. This sidesteps having to port the existing subgraph codegen and associated memory planning to Wrapper IR. (In principle, it seems possible to adapt the existing backends to `ConditionalLine`, but it could be a larger refactor, since we'd also have to update the memory planning.)
Some of the lower-level subgraph codegen methods are still shared between the FX and Python backends, such as `generate_subgraph_common`. Those were easier to port to Wrapper IR.
This also required generalizing the way the FX converter handles graph inputs and outputs. Previously, it assumed the IO signature was the same as `V.graph.module`, but this is only true for the parent graph, and not subgraphs. Instead, we need to call `get_graph_inputs` and `get_graph_outputs` to populate the inputs and outputs for subgraphs.
# Test plan
This PR adds a couple of tests using torch.cond. Here's an example graph generated by one of them:
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
%false_graph_0 : [num_users=1] = get_attr[target=false_graph_0]
%cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (%arg0_1, %true_graph_0, %false_graph_0, (%arg1_1,)), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%cond, 0), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 6, constant_args_idx: 6, grid: [(1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 6, XBLOCK: 8}})
return buf1
```
It also removes an existing negative test which checked that a certain error was raised when subgraphs were encountered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163234
Approved by: https://github.com/angelayi, https://github.com/jansel
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.
This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.
The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch.dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().
Saving single entries and such with DynamoCache still works normally.
Test Plan:
All existing unit tests pass.
Rollback Plan:
Differential Revision: D82380307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
What supposed to be a very simple change end up being quite involved, as current Windows CI framework is quite inflexible, i.e. it takes a lots of argument, but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Fixes#161014
This bug fix introduces a fix that is consistent with the exception handling. Outlined in issue #161014, there is an edge case where the negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >=0` to include the zero dim and letting the operator return an empty tensor.
In the PR I have added the edge case where the test will now check the negative padding where the dimension gets reduced to zero. But the sample is only for the `constant` type of padding. I would like some feedback if it is necessary to put the same sample on the `reduce` type as well.
This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!
@malfet @manuelcandales @janeyx99 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:
a1f7639922/test/inductor/test_loop_ordering.py (L612-L641)
Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.
In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```
UPDATE: add the same optimization for LoopBody.reorder_iter_loops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162101
Approved by: https://github.com/jansel, https://github.com/eellison
The issue cannot be reproduced using the original repro code provided in the issue description.
However, the underlying issue mentioned by the maintainer (missing functions in `builder.py` and `trace_rules.py`) was never addressed and can still be reproduced with this test case:
```python
import torch
from torch.nn.attention import _cur_sdpa_kernel_backends
@torch.compile(fullgraph=True)
def test_function_that_triggers_error():
return _cur_sdpa_kernel_backends()
print("Calling torch.compile function...")
try:
result = test_function_that_triggers_error()
print(f"Success: {result}")
except Exception as e:
print(f"ERROR: {e}")
print(f"Error type: {type(e)}")
```
The original repro likely no longer triggers the issue due to code path changes in the SDPA implementation, while the direct call to `_cur_sdpa_kernel_backends()` exposes the underlying problem where certain torch._C functions returning non-Tensor values aren't properly handled by dynamo tracing.
I have implemented the changes by adding the missing functions to both `builder.py` and `trace_rules.py` to properly handle these cases during compilation.
@guilhermeleobas
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161169
Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi
Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keeps reference to memory allocations.
Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just re-arrange of the tests to create a sub-group test class. (no substantial change)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
Fixes#161010 by making `clone_meta` match the semantics of strides for eager mode.
This is:
* Case 1: Tensor is_non_overlapping_and_dense; in this case, stride should match input tensor stride
* Case 2: Otherwise, stride should be contiguous computed from input tensor using `compute_elementwise_output_strides`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan
Co-authored-by: morrison-turnansky <mturnans@redhat.com>
Summary:
X-link: https://github.com/meta-pytorch/tritonbench/pull/432
Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.
This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.
Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation.
In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.
Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.
Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```
Rollback Plan:
Differential Revision: D82597111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163147
Approved by: https://github.com/njriasan
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf0[0]
buf4 = buf0[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
"mylib::op1",
mutates_args=["x"],
schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
x = x + 1
return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that are auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of Fallback Kernel.
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, scaling function in Blas.cpp along with test_matmul_cuda changes triggered failures.
This patch corrects is_blockwise_1x32_scaling function code.
2. Fixes the m, n, k dimensions for ROCm mx case.
3. Modify FP4E2M1FN_LARGEST_POW2 (largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This resulted in higher SQNR value for mx fp4 test.
Testing result on gfx950 w/ ROCm7.0
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v Ran 452 tests in 22.698s
OK passed 111
This is same as before. (when PR 151360 was merged)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. Changing if TEST_CUDA to an else condition to avoid running symmetric memory UTs on MI200. For other non-rocm arch, it should return true and can be skipped using other skip decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163205
Approved by: https://github.com/ezyang
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: Platform args was a buck1 concept that we decided to port over to buck2 in order to make the migration easier. However, platforms args existing in the repo blocks some buck modernization like modefile free efforts, so we're trying to get rid of the usage.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82470032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163086
Approved by: https://github.com/malfet, https://github.com/8Keep
Update the torch-xpu-ops commit to 24fab67b6e, includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- support high priority stream for xccl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244
Approved by: https://github.com/EikanWang
Summary:
This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized.
Here's an example of what that JSON looks like:
```
{
"artifacts": {
"precompile_aot_autograd": [
"__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
],
"precompile_dynamo": [
{
"backend_ids": [
"__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
],
"fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
"num_codes": "10",
"python_version": "3.12.11+meta",
"torch_version": "2.10.0a0+fb"
}
]
},
"num_entries": 1
}
```
Test Plan:
Existing tests pass.
NanoGPT tlparse showing the new debug:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.
Rollback Plan:
Differential Revision: D82232574
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162740
Approved by: https://github.com/zhxchen17
What supposed to be a very simple change end up being quite involved, as current Windows CI framework is quite inflexible, i.e. it takes a lots of argument, but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, `setup-win` action will create an environment called `py_tmp` with specific python version + intel-openmp (that is hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Introduced `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
Summary:
Point people lowering to lite interpreter to the existence of ExecuTorch.
Added the typing deprecation, a warnings deprecation
Test Plan: Try using it, see deprecation warning
Reviewed By: lucylq
Differential Revision: D82759566
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163289
Approved by: https://github.com/larryliu0820
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.
Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be casted to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3)) # same as f(3)
```
Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"]
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.
Differential Revision: D81698719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162194
Approved by: https://github.com/bobrenjc93
The test used a wrong ptr to refer to remote address:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local address as input and perform translation internally. Without correct signal address, the peer would be waiting, thus hang.
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Differential Revision: D80884518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161848
Approved by: https://github.com/jerryzh168
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets to the targets must be manually defined. See CMakeLists.txt the main libraries torch_python, torch, torch_cpu, torch_cuda, torch_xpu have been targetted which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
These decompositions take precedence before CIA decomps in fake tensor prop, as a result, we would hit this implementation for all where overloads which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this doesn't matter because in post-dispatch IR, aten.where would have decomposed but when user tries to preserve aten.where this issue will surface because fake tensor will start seeing aten.where.
Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
In the .dist-info/METADATA file, the version was not being written with the new sha.
On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked that it will match both files and directories on py3.9 and py3.13
There's probably also a bunch of mismatches in RECORD but thats a problem for later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163214
Approved by: https://github.com/huydhn
resolves https://github.com/pytorch/torchtitan/issues/1136
torchtitan use cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are padded already
```
# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```
this PR make `reset_sharded_param` idempotent by checking storage data ptr and return early
unit test
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163130
Approved by: https://github.com/tianyu-l
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. Verify replicate correctly handles post-optimizer events.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
Reland of #160532
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode. User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call. This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
cuda_module = module.to("cuda:0")
cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
with torch.no_grad():
ep = torch.export.export(cuda_module, cuda_sample_inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
Summary: Updates the remaining tests to all ensure val_shapes is always passed a tuple and not a list. Enables adding an assert consistent with the other function arguments.
Test Plan:
Depends on CI.
Rollback Plan:
Differential Revision: D82383319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162887
Approved by: https://github.com/NikhilAPatel
# why
- enable a clear interface for kernel templates to declare all their
instantiation parameters and any potential defaults
- simplify KernelTemplateChoice to just have a single params, and not kwargs and extra_kwargs
# what
- KernelTemplateParams interface
- placeholder implementation where we just pass through a dict
# testing
- existing ci tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162781
Approved by: https://github.com/jansel
Summary:
In edge cases, tracer_output can be left unset if there's double exception raised which causes the following issue:
```
UnboundLocalError: local variable 'tracer_output' referenced before assignment
```
Default initialize this variable so that it's always present.
Test Plan:
CI
Rollback Plan:
Differential Revision: D82652815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163169
Approved by: https://github.com/tugsbayasgalan
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
Summary:
I am really skeptical about inductor sizevars creating an empty shape env when not provided with one
i think we should fail there if the graph has dynamic shapes and no shape env is provided.
however i wonder if there are actually use cases that depends on the shape env not being there?
Reasoning APIs depends on facts in the shape env. and assumes some stuff exists for specific symbols.
Test Plan:
Fix the bug reported in creating simple e2e unit test is not trivial
https://www.internalfb.com/diff/D82337184
Rollback Plan:
Differential Revision: D82412384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162927
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
**Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
- Now checks that all Cache types are thread-safe
- Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
- InMemoryCache was the only one not thread-safe, so added a lock for access
- Realized that to implement MultiCache we should just have this requirement.
* Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class
Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```
Rollback Plan:
Differential Revision: D82660240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163173
Approved by: https://github.com/masnesral
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, and import feature for enabling mixed precision,
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162636
Approved by: https://github.com/mori360
ghstack dependencies: #162631
We previously asked users to seperate these because we didn't have any way of adding extern C declarations. Now we don't and we don't need this confusing flag anymore
BC breaking but is fine for this API since it doesn't have major users yet. Please just put your all your code in `kernel_source` moving forward
## BC note
The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter.
Note this only affects torch.cuda._compile_kernel, which is a private API.
Example:
Before
```python
kernel = compile_kernel(
kernel_source="global void my_kernel() { ... }",
kernel_name="my_kernel",
header_code="#define SCALE 2.0f\n__device_ float scale(float x) { return x * SCALE; }"
)
```
After
```python
kernel_source = """
#define SCALE 2.0f
device float scale(float x) { return x * SCALE; }
global void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks now, so we write a decomposition for it.
it's not registered to the default decomposition table as currently only executorch/unified runtime needs it. But it might benefit inductor as well because conv2d can generate triton kernels while there's no triton codegen for conv1d. I don't know if the conv2d triton kernel will have better perf compared to aten::conv1d, so it's not registered by default yet.
To register it, one just needs to do `import torch._decomp as decomp;decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163080
Approved by: https://github.com/angelayi
Fixes#163105
Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
"""Load the scheduler's state.
Args:
state_dict (dict): scheduler state. Should be an object returned
from a call to :meth:`state_dict`.
"""
self.__dict__.update(state_dict)
self._set_anneal_func(self._anneal_strategy)
```
If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
"""Load the scheduler's state.
Args:
state_dict (dict): scheduler state. Should be an object returned
from a call to :meth:`state_dict`.
"""
anneal_func = state_dict.pop("anneal_func", None)
strategy = state_dict.get("_anneal_strategy")
self.__dict__.update(state_dict)
if anneal_func is not None:
state_dict["anneal_func"] = anneal_func
if strategy is None:
if anneal_func == self._linear_anneal:
strategy = "linear"
elif anneal_func == self._cosine_anneal:
strategy = "cos"
if strategy is None:
strategy = getattr(self, "_anneal_strategy", "cos")
self._set_anneal_func(strategy)
```
But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163122
Approved by: https://github.com/janeyx99
# why
- KTC might regenerate a choicecaller e.g. through FlexibleLayout
optimization. This in turn would delete any annotations
# what
- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations
- ChoiceCaller users e.g. in selectalgorithm now have access to the KTC
and can register handlers do record/make decisions based on the KTC
# testing
n/a
Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163117
Approved by: https://github.com/nmacchioni
2025-09-17 18:06:50 +00:00
464 changed files with 17121 additions and 5960 deletions
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%\envs\py_tmp` in Command Prompt before running Git Bash.
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.