Usage:
```python
import functools

from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper

# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k
    return wrapper

# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```
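A minimal call-site sketch (hypothetical; assumes the definitions above and uses `torch.sin` and `k=2` purely as placeholders): calling `my_hop_fn` inside a compiled function is what lets dynamo capture the HOP instead of tracing through the impl.
```python
import torch

# Hypothetical call site for the wrapper defined above.
@torch.compile(fullgraph=True)
def f(x):
    return my_hop_fn(torch.sin, x, k=2)

f(torch.randn(3))
```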
Notes:
- The dynamo-captured graph now stashes arbitrary callable objects (the wrapper_fn); this is equivalent to what SAC does today with policy_fn.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper` should have the signature `Callable -> Callable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:
1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide that a buffer is internal to a kernel (e.g., a triton kernel). One example is buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as a `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.
This PR supports these 3 cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
Previously, we didn't track the unbacked symbols leaked out of true_branch and false_branch if they have the same shape expr. As a result, the fake output of the cond operator itself didn't set up its unbacked_bindings meta properly (because the symbols were ignored).
In this PR, we also check whether there are leaked-out unbacked symbols, create new unbacked symbols for them, and track them as outputs of cond.
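As an illustrative sketch (not the PR's test case), the scenario is a `cond` whose branches both produce tensors with data-dependent sizes, so the same unbacked shape expression leaks out of both branches:
```python
import torch

# Both branches return a tensor whose size is data dependent, so an unbacked symint
# with the same shape expr leaks out of true_branch and false_branch.
@torch.compile(fullgraph=True)
def f(pred, x):
    return torch.cond(pred, lambda t: t.nonzero(), lambda t: t.nonzero(), (x,))
```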
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)`, it will show up as a deferred runtime assert with `Eq(lhs, rhs)` (see the sketch after this list).
2. There can be transitive replacements, for example `expr1 -> expr2 -> u0`. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol; for instance, it could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.
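A hedged sketch of case 1 (illustrative names and shapes; `capture_scalar_outputs` is assumed so that `.item()` yields an unbacked symint):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # so .item() yields an unbacked symint

@torch.compile(fullgraph=True, dynamic=True)
def f(x, y):
    u0 = x.sum().item()               # unbacked symint u0
    torch._check(u0 == y.shape[0])    # shows up as a deferred runtime assert Eq(u0, s0)
    return y + u0

f(torch.tensor([2, 2]), torch.randn(4, 3))
```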
## Device assertion msg
```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```
## Autotuning code setup
This is the autotuning code for a concat kernel which takes input tensors (`in_buf`) and writes them to the output tensor (`out_buf`).
It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256)) # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10)) # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266)) # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```
If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6` makes sure we only load when the index is in the range 256 to 264.
- `xmask` makes sure we only load when `xindex` is within `xnumel`.
- `tmp6 & xmask` together essentially checks `0 ≤ x0 < u1 + s0` and `256 ≤ x1 < 264`.
The mask logic is correct; however, `in_buf1` has shape `[8192, 10]`, which means any load where `8192 ≤ x0 < u1 + s0` is an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ... out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
https://github.com/pytorch/pytorch/pull/152708 expanded support for `get_estimated_runtime` to many more types of `SchedulerNode`s. This caused an increase in compile time because we were always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).
Update: added support for TORCH_LOGS for the metrics logging.
Test Plan:
mm_loop.py and many existing tests cover.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.
Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.
This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.
Points of discussion:
* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement. This means more instances will be compiled than necessary, but it keeps the code generation independent from internal CK logic, which checks alignment validity at runtime.
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if needed.
* Whether the gemm problem size K is ever dynamic; whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels.
**Testing**
Use the existing tests in `test/inductor/test_ck_backend.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
Works around issues like #153960, #152623.
NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around that. Previously, torch turned it on by default for eager init (i.e. when `device_id` is passed) to avoid init overhead.
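For context, "eager init" refers to passing `device_id` when creating the process group, e.g. (a sketch; assumes the usual env:// rendezvous variables and a CUDA build with NCCL):
```python
import torch
import torch.distributed as dist

# Eager init: passing device_id triggers communicator initialization up front.
dist.init_process_group(backend="nccl", device_id=torch.device("cuda", 0))
```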
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
Previously, when we lower the backward ahead of time due to symints, the post-grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace at runtime, so we had inductor operate on a deepcopy of bw_module.
But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors can fail under fake mode due to the device type mismatch between the fake tensors ("meta" device) and the real tensors. By disabling fake mode during the deepcopy, we avoid these errors. This change is a strict improvement over the current state, but it does reveal that this deepcopy can theoretically cause OOMs.
FIXES https://github.com/pytorch/pytorch/issues/153993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
I found the same issue as #147490 (@jibril-b-coulibaly).
There's an equivalent in the [doc-string](https://docs.pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) of `torch.nn.RNN`:
```python
# Efficient implementation equivalent to the following with bidirectional=False
def forward(x, hx=None):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = hx
    h_t = hx
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            h_t[layer] = torch.tanh(
                x[t] @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        output.append(h_t[-1])
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```
However, there are some problems.
1. As mentioned in #147490, line 499 is wrong:
fb55bac3de/torch/nn/modules/rnn.py (L499)
The **input for the RNNCell should be different** for different layers.
2. The code contains several hidden **reference-related issues** that may result in unintended modifications to tensors. For example, line 504 causes all elements in the final output list to point to the same tensor:
fb55bac3de/torch/nn/modules/rnn.py (L504)
3. Some variables are not **defined**. Despite being a relatively minor issue in an annotation, it can lead to significant confusion for those who are new to the concept. For example, `weight_ih` in line 499:
fb55bac3de/torch/nn/modules/rnn.py (L499)
So, I wrote a runnable version to make it clearer:
```python
# Efficient implementation equivalent to the following with bidirectional=False
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())

def forward(x, hx=None, batch_first=False):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(rnn.num_layers, batch_size, rnn.hidden_size)
    h_t_minus_1 = hx.clone()
    h_t = hx.clone()
    output = []
    for t in range(seq_len):
        for layer in range(rnn.num_layers):
            input_t = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                input_t @ params[f"weight_ih_l{layer}"].T
                + h_t_minus_1[layer] @ params[f"weight_hh_l{layer}"].T
                + params[f"bias_hh_l{layer}"]
                + params[f"bias_ih_l{layer}"]
            )
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```
This code can reproduce the computation of torch.nn.RNN.
For example:
```python
import torch
import torch.nn as nn
torch.manual_seed(0)
input_size, hidden_size, num_layers = 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
x = torch.randn(10, 4, 3)
official_imp = rnn(x)
my_imp = forward(x)
assert torch.allclose(official_imp[0], my_imp[0])
assert torch.allclose(official_imp[1], my_imp[1])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153620
Approved by: https://github.com/mikaylagawarecki
Summary: Remove the anonymous namespace in model_container.h to fix the following compiler warning:
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
Added AOTIModelContainerRunnerMps and a shim for MPS fallback ops.
I also added an MPS-specific shim which contains one operator, used to set arguments being passed to the Metal kernel:
```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
AOTIMetalKernelFunctionHandle func,
unsigned idx,
AtenTensorHandle tensor);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
In a model, we see ~40% of the time spent in mm/addmm tuning. The model has 2000 mms,
many of which receive the same input shapes.
With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not use to cache the generation of the python code and its loading for each config that we autotune on.
This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save ~20% for the model I am working on.
How do we do the caching?
For a given config and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on input
names and symbol names in the inputs), not just on the layout.
To handle those, we use a record-and-replay approach: we record the functions that are called during
code generation that affect those outputs and replay them on a cache hit.
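As a rough illustration of the record-and-replay idea (a minimal sketch, not the actual Inductor implementation; `codegen_fn` and `env` are hypothetical stand-ins):
```python
class RecordReplayCache:
    """Cache generated code keyed by (config, layout); replay recorded side effects on a hit."""

    def __init__(self):
        self._cache = {}  # key -> (generated_code, recorded_calls)

    def generate(self, key, codegen_fn, env):
        if key in self._cache:
            code, recorded = self._cache[key]
            for name, args in recorded:           # replay input-dependent side effects
                getattr(env, name)(*args)
            return code
        recorded = []

        def call(name, *args):                    # record side-effecting calls on env
            recorded.append((name, args))
            return getattr(env, name)(*args)

        code = codegen_fn(call)                   # codegen routes its env calls through `call`
        self._cache[key] = (code, recorded)
        return code
```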
Effect on the current benchmark on a local run on a dev server:
- mm_loop: 24115830838 -> 18362098019
- mm_loop_dynamic: 30506097176 -> 25697270062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
Fixes #153646
This PR refactors the logging behavior in the FX pass `insert_deferred_runtime_asserts` and in `runtime_assert.py` to separate verbose/intermediate graph logs from the final output graph log. All verbose logs generated during the FX pass are now routed to a new artifact logger, `graph_code_verbose`, while only the final output graph remains logged to the original `graph_code` artifact.
Changes
- Added a new artifact logger: `graph_code_log = torch._logging.getArtifactLogger(__name__, "graph_code_verbose")`
- Updated all verbose/intermediate FX pass logs in `insert_deferred_runtime_asserts` to use the new `graph_code_verbose` artifact.
- Ensured that only the final output graph is logged to the original `graph_code` artifact.
- No changes to the FX pass logic or output; only logging behavior is affected.
Notes
This change is backward-compatible and does not affect the functional behavior of FX passes.
No changes to user-facing APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153775
Approved by: https://github.com/williamwen42
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.
This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output
We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching: the main disadvantage is having to save the same CompiledFxGraph twice, once in the Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.
On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.
Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
Motivation:
By default, we tune the cutlass backend kernels over 3 swizzles. Swizzles are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.
Observations:
The winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune over {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.
I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well based on empirical observations.
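A minimal sketch of the greedy prescreening idea (assumed `benchmark` helper and parameter names; not Inductor's actual API):
```python
def greedy_autotune(configs, swizzles, benchmark, top_x=10, prescreen_swizzle=2):
    # Stage 1: rank every config using a single hardcoded swizzle.
    ranked = sorted(configs, key=lambda cfg: benchmark(cfg, prescreen_swizzle))
    finalists = ranked[:top_x]
    # Stage 2: full autotune only over {finalists} x {swizzles}.
    return min(
        ((cfg, sw) for cfg in finalists for sw in swizzles),
        key=lambda cfg_sw: benchmark(*cfg_sw),
    )
```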
Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```
with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
Fixes
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.cpp:1:
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:77:4: warning: extra ';' after member function definition [-Wextra-semi]
77 | };
| ^
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:81:4: warning: extra ';' after member function definition [-Wextra-semi]
81 | };
| ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154034
Approved by: https://github.com/Skylion007
# Motivation
This PR adds scalar tensor (`t.numel() == 1 and t.sizes().empty() == true and t.dim() == 0`) handling in addmm and baddbmm. The issue was found during the oneDNN upgrade: the new version of oneDNN requires the post-op binary (`self` in addmm) to have the same dimensionality as the output tensor, so we now need to explicitly expand the shape of the `self` tensor. The former oneDNN version would do the broadcasting internally.
This PR could fix issues in https://github.com/intel/torch-xpu-ops/issues/1612 and CI error in https://github.com/pytorch/pytorch/pull/151767.
# Implementation
We treat the scalar tensor as a normal tensor by `unsqueeze`ing it to a 1-D tensor. Combined with the existing shape-handling logic, it is further `unsqueeze`d to a 2-D or 3-D shape.
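A small illustration of the intended semantics (a sketch of the shape handling in eager PyTorch, not the XPU/oneDNN code path):
```python
import torch

# A 0-D `self` is unsqueezed to 1-D so the existing logic can expand it to the output shape.
self_t = torch.tensor(2.0)                        # scalar tensor: numel()==1, dim()==0
mat1, mat2 = torch.randn(4, 3), torch.randn(3, 5)
self_expanded = self_t.unsqueeze(0).expand(4, 5)  # 0-D -> 1-D -> broadcast to (4, 5)
out = torch.addmm(self_t, mat1, mat2)             # reference eager behavior
assert torch.allclose(out, self_expanded + mat1 @ mat2)
```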
# UT testing
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from pytorch out-of-tree to in-tree, with the following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.
Differential Revision: D74672464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
This PR adds enforcement of testing header only APIs.
The benefit of torch/header_only_apis.txt is twofold:
1) this gives us a clear view of what we expect to be header only
2) this allows us to enforce testing
The enforcement added in this PR is very basic--we literally string match that a symbol in `torch/header_only_apis.txt` is in a cpp test. This is meant to be a first step in verifying our APIs are properly tested and can get fancier over time. For now, I've added myself as a codeowner to learn what to look out for in terms of proper tests. Over time, I anticipate we can automate more steps, but right now let's just get something out the door.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153635
Approved by: https://github.com/albanD
ghstack dependencies: #153965
This test was never the shining star in class, but it helped check that we can properly delete a stable library. Now that we are running it in CI, this is not a good test to annoy people with, as dlclose + parallelism is likely not the move. I will miss it locally though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153975
Approved by: https://github.com/jbschlosser
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading.
The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature
Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`
Differential Revision: D73895378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
By decorating emitted kernels with `'''` rather than `"""`,
to match the regex in `torch._inductor.utils.run_and_get_kernels`.
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751
Included changes:
- Update the fbgemm external dependencies directory in setup.py
- Add the DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec
Test Plan: PyTorch OSS CI
Differential Revision: D75073516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
Related: #148920
This PR:
* Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns
* This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py`
After this PR, it is now possible to run:
```
python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py
```
and the test will build the prerequisite wheel before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274
Approved by: https://github.com/janeyx99, https://github.com/cyyever
ghstack dependencies: #153264
Related: #148920
This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()`
* Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
* Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now
After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```
and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
Fixes #153571
Summary:
1. Set the annotation callback to global to include all threads
2. Only init callbacks when enable == true and callbacks are empty, under a mutex
3. When enable == false, check if callbacks are present and, if so, remove them and set the handle to 0, under a mutex
We don't expect memory snapshots to be called from several different threads (they are almost always called just from main), but we make sure to add thread safety in the off chance that users do want to call it from different points of entry.
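For reference, a basic snapshot usage sketch (the enable/disable paths above correspond to turning recording on and off):
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100000)  # enable path: registers callbacks
x = torch.randn(1024, 1024, device="cuda")
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)        # disable path: removes callbacks
```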
Test Plan: Ran basic snapshot and saw that the callbacks were registered properly
Reviewed By: ngimel
Differential Revision: D74771491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839
Approved by: https://github.com/ngimel, https://github.com/Skylion007
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone.
Tests that should exercise both backends should explicitly parametrize this setting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.
Sometimes, the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher. This is because coordinate descent tuning happens at runtime, and the best config may not be one of the precompiled configs.
Test Plan:
New unit test that failed before
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
Currently, the XPU and XCCL build settings are not recorded in the compiled binary and are not shown by `torch.__config__.show()`, which is a quick way to check whether the binary has been built with such support.
Below is the output after adding them (see the end of the last line):
```
Python 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:24:40) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
- GCC 13.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX512
XPU backend - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
An internal test case ran into a weird issue when exporting, where the model imported a file which creates tensor constants upon importing [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create some tensor constants even though they're not used in the model code. This PR updates the lift_constant_tensors pass to remove constant nodes that are not being used instead of lifting them as tensor constants.
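An illustrative sketch of dropping unused constant attribute nodes from an FX graph (not the actual lift_constant_tensors pass):
```python
from torch import fx

def drop_unused_get_attr_nodes(gm: fx.GraphModule) -> fx.GraphModule:
    for node in list(gm.graph.nodes):
        if node.op == "get_attr" and len(node.users) == 0:
            gm.graph.erase_node(node)  # safe: nothing consumes this constant
    gm.recompile()
    return gm
```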
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800
Approved by: https://github.com/dolpm, https://github.com/pianpwk
Differential Revision: D72437251
Enable rebuilding of the bucket order when find_unused_parameters=True.
It should always be better than not rebuilding the bucket order when find_unused_parameters=True:
1. for cases where the bucket order in the first iteration is the same as the parameter order, rebuilding the bucket order will not change anything
2. for cases where the bucket order in the first iteration is not the same as the parameter order, there are two sub-cases:
a. the bucket order does not change after the 1st iteration even though the graph is dynamic and there are unused parameters; in this case, rebuilding the bucket order gives a performance gain
b. the bucket order changes after the 1st iteration due to the dynamic graph; in this case, both the parameter order and the bucket order in the 1st iteration are not ideal, so rebuilding the bucket order or not does not matter
Rebuilding the bucket order when find_unused_parameters=True helps case 2.a and does not hurt cases 1 and 2.b (a minimal usage example follows below).
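The configuration this change targets, as a minimal usage sketch (assumes a process group has already been initialized, e.g. via torchrun):
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# With this change, the bucket order can be rebuilt after the first iteration
# even when unused-parameter detection is on.
model = DDP(nn.Linear(8, 8), find_unused_parameters=True)
```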
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
Introduced by https://github.com/pytorch/pytorch/pull/153645
A semicolon is not needed after the closing curly bracket of a class method definition.
Not sure why CI did not catch it, but my local builds are now erroring out with:
```
[19/97] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/passes/dead_code_elimination.cpp.o
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/passes/dead_code_elimination.cpp:4:
/Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.h:356:64: warning: extra ';' after member function definition [-Wextra-semi]
356 | ValueAndMemoryLocationSet(const AliasDb* db) : aliasDb_(db){};
| ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153887
Approved by: https://github.com/wdvr, https://github.com/davidberard98
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl. This PR does that in CI by identifying a previous job whose artifact and whl binaries can be reused. See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details.
To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist
Pros:
* build time -> 6 min whenever this can be done
Cons:
* not sure if I have the right files
* version + whl name still remains the same
Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
Summary:
**TL;DR**: make DCE faster by replacing a Set<Value*> with a MemoryLocations sparse bitset (representing all the memory locations stored by the collection of all values in the set).
**Details**
The goal of this PR is to optimize this function from AliasDb:
```
bool AliasDb::writesToAlias(Node* n, const ValueSet& vs) const {
  const auto writtenTo = getWrites(n);
  if (writtenTo.empty()) {
    return false;
  }
  MemoryLocations locs;
  for (const auto v : vs) {
    auto it = elementMap_.find(v);
    if (it != elementMap_.end()) {
      const auto& vlocs = memoryDAG_->getMemoryLocations(it->second);
      if (writtenTo.intersects(vlocs)) {
        return true;
      }
    }
  }
  return false;
}
```
In the DCE use case, we have a ValueSet of live values, into which we insert `Value*`s; and sometimes need to check whether a node mutates any of the live values using `writesToAlias`.
Looping through all the values in the ValueSet and indexing into the elementMap_ is slow; so if we can pre-compute the MemoryLocations set, this speeds up the function. In some large model examples, I see ~15-25x speedups from this change.
**Implementation**: To avoid exposing too many details of AliasDb, I introduce a friend class `ValueAndMemoryLocationSet`, which is an insert-only set of Values, which also maintains the corresponding MemoryLocations.
Then in AliasDb, I use `ValueAndMemoryLocationSet` if we're using AliasDb for analysis, and otherwise use a `Set<Value*>` if we don't have AliasDb.
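A Python sketch of the idea behind `ValueAndMemoryLocationSet` (the real class is C++ inside AliasDb; `get_locations` stands in for the element-map/MemoryDAG lookup): keep the union of memory locations up to date as values are inserted, so the write check needs a single set intersection instead of a loop.
```python
class ValueAndMemoryLocationSetSketch:
    def __init__(self, get_locations):
        self.values = set()
        self.locations = set()          # stands in for the sparse bitset
        self._get_locations = get_locations

    def add(self, value):
        self.values.add(value)
        self.locations |= self._get_locations(value)

    def writes_to_alias(self, written_locations) -> bool:
        return not self.locations.isdisjoint(written_locations)
```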
Test Plan: Rely on unit tests.
Differential Revision: D74827086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153645
Approved by: https://github.com/eellison
In a model, we see ~40% of the time spent in mm/addmm tuning. The model has 2000 mm ops,
many of which receive the same input shapes.
With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not previously cache the generation of the python code or the loading for each config that we autotune on.
This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save 20% for the model I am working on.
How do we do the caching?
For a given configuration and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on input
names and symbol names in inputs), not just the layout.
To handle those we use a record and replay approach, where we record the functions that are called during
code generation that affect those outputs and replay them on a cache hit.
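A minimal sketch of the record-and-replay idea with hypothetical names (not the actual inductor implementation): the cached entry stores the generated code plus the input-dependent calls made while generating it, and a cache hit replays those calls against the current kernel state.
```python
class CachedTemplateCodegen:
    def __init__(self, code: str):
        self.code = code
        self.events: list[tuple[str, tuple]] = []  # recorded at cache-miss time

    def record(self, method_name: str, *args) -> None:
        self.events.append((method_name, args))

    def replay(self, kernel_state) -> str:
        for method_name, args in self.events:
            getattr(kernel_state, method_name)(*args)  # re-apply input-dependent effects
        return self.code
```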
Effect on the current benchmark from a local run on a dev server:
mm_loop: 24115830838 -> 18362098019
mm_loop_dynamic: 30506097176 -> 25697270062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
cuBLASLt matmuls have been silently allowing all reduction types, which meant that e.g., `allow_fp16_reduced_precision_reduction = False` had no effect.
In practice split-K with reduced precision reductions were unlikely to happen as the default `CUBLASLT_WORKSPACE_SIZE` of 1MiB tends to prevent this.
However this isn't guaranteed and we are on the path to increasing the default workspace size following #151163
This setting is effectively already tested in e.g., `test_cublas_addmm_size_100_cuda_float16` and `test_cublas_addmm_size_100_cuda_bfloat16` but the backend selection is not deterministic. Running the full `test_matmul_cuda.py` seems to exercise the Lt interface, but running a standalone test does not (apparently due to spurious alignment differences).
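For reference, a usage sketch of the setting whose effect this change restores for the Lt path:
```python
import torch

# With the fix, cuBLASLt matmuls also respect this flag instead of silently
# allowing reduced-precision split-K reductions.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
```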
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153095
Approved by: https://github.com/cyyever, https://github.com/Skylion007
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file).
2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of the completion of a test via a completion queue (see the sketch after this list).
3. Added a test template.
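A rough sketch of the child-process loop described in item 2; the names (including `run_test`) are illustrative, not the actual `MultiProcContinousTest` implementation.
```python
import multiprocessing as mp

def run_test(test_id: str) -> None:
    pass  # placeholder for executing one test case

def worker(task_queue: mp.Queue, completion_queue: mp.Queue) -> None:
    while True:
        test_id = task_queue.get()
        if test_id is None:            # sentinel from the main process: exit
            break
        run_test(test_id)
        completion_queue.put(test_id)  # tell the main process this test is done
```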
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call.
Test Plan: contbuild & OSS CI
Differential Revision: D74964213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
Fixes #146411, following #148654
After testing, it seems this can be enabled for all ipynb files.
```bash
lintrunner --take RUFF --all-files
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153820
Approved by: https://github.com/Skylion007
Two things here, other than cleanups and refactoring:
1) do not use propagate_real_tensors to resolve evaluation under guard_or_true/guard_or_false.
2) do not guard on dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
Summary:
## Context
See https://github.com/pytorch/pytorch/pull/149164 for more context.
Originally, this fix worked, but more recently including `cmath` by itself no longer provides access to math constants on Windows platforms. I found that including `math.h` resolves this.
I'm not sure exactly what changed, but this PR updates the header to use both includes to fix the symbols not being found. It might be a bug with a recent Windows update, perhaps?
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153742
Approved by: https://github.com/swolchok, https://github.com/Skylion007
In the FakeTensor cache when we get a bypass exception while computing the cache key (call this exc_1) we need to dispatch to the original operation.
It's possible for the dispatch to the original operation to get its own exception which we want to bubble up to the caller (call this exc_2).
If we directly dispatch from within the handler for exc_1 then exc_2 will have a `__context__` of exc_1 - which can cause deviations between cached and non-cached behavior - so we need to be a bit careful when we call the dispatch.
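A sketch of the control flow described above, with the helpers passed in as hypothetical stand-ins (not the actual FakeTensor cache code):
```python
class BypassCacheKey(Exception):
    """Stand-in for the cache-key bypass exception (exc_1)."""

def dispatch_with_cache(op, args, compute_cache_key, dispatch_original, cache):
    try:
        key = compute_cache_key(op, args)
    except BypassCacheKey:
        key = None  # note the bypass, but do not dispatch inside the handler
    if key is None:
        # Dispatching here, outside the except block, means an exception raised
        # by the original op (exc_2) does not carry exc_1 as its __context__.
        return dispatch_original(op, args)
    if key not in cache:
        cache[key] = dispatch_original(op, args)
    return cache[key]
```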
Testing:
test_aotdispatch.py::TestAOTExport::test_aot_export_predispatch_outdtype fails before this change and passes after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153780
Approved by: https://github.com/oulgen
for messages like
```/workspace/pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp:1396:38: warning: narrowing conversion of ‘(char)(& q)->at::Tensor::<anonymous>.at::TensorBase::get_device()’ from ‘char’ to ‘c10::DeviceIndex’ {aka ‘signed ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153643
Approved by: https://github.com/Skylion007
Summary:
This change introduces a fallback path from `bmm` to `mm` when the batch dimension is `1`.
The motivation is to unlock specialized `mm` kernel paths (e.g., `decomposeK`, `persistent+TMA`, etc.) which often don't have `bmm` equivalents.
### Rationale
- **No regression:** On shapes where the fallback triggers, we see no performance loss.
- **Performance wins:** On select shapes (especially with large `K`), we observe measurable speedups by triggering `mm`-specific optimizations.
For example, on `bmm` shapes of the form `(1, H, K, H)` where `H ∈ {16, 32, 48, 64}` and `K ∈ {4096 ... 32768}`, we see an **average speedup of 10%**.
- **Prevalence in prod:** Internal workloads frequently emit `bmm` ops with `batch=1`, making this fallback broadly useful in practice.
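A minimal sketch of the rewrite this fallback enables, illustrative only rather than the inductor lowering itself: a batch-1 `bmm` is equivalent to an `mm` on the squeezed operands, which unlocks `mm`-only kernel paths.
```python
import torch

def bmm_maybe_as_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    if a.size(0) == 1 and b.size(0) == 1:
        return torch.mm(a[0], b[0]).unsqueeze(0)
    return torch.bmm(a, b)
```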
Test Plan:
contbuild & OSS CI
Tests in test/inductor/test_torchinductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153572
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Enables a variety of misc ruff rules and fixes some incorrect indentation in the file. Now that we recently updated ruff, we can enable these rule lints. Most of these lints I've already applied, but now that they are out of preview they can be applied as stable lints.
Including:
* Do not bother typing a union with Never as it gets cancelled out
* Simplify nested Literal into a single Literal
* Properly use packaging to parse versions instead of `map(int(`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153624
Approved by: https://github.com/atalman, https://github.com/malfet
Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.
Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.
Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
Tests:
* test_set.py
This PR adds test_set.py from the CPython 3.13 branch and ~400 files to test/dynamo_expected_failures. Most of these are expected to be fixed in upcoming PRs. Only minimal changes were made to test_set.py to enable compilation with Dynamo using the PYTORCH_TEST_WITH_DYNAMO=1 environment variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150792
Approved by: https://github.com/anijain2305
Summary:
Split out second pass of LayerNorm so it's more likely to show up in
profiler output. In my testing with perf, the samples from the lambda in the
current implementation are attributed somewhat haphazardly.
Differential Revision: D74181627
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153578
Approved by: https://github.com/hl475
Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).
What I did was, before passing inputs to dynamo.export, I first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I will then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If there is no dynamism specified, then this will get unwrapped back to an integer. When dynamo tracing, when we encounter an `_IntWrapper`, we will convert this to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo will then trace a graph that contains symint inputs, which will get passed to AOTAutograd and so on.
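A rough sketch of the wrapper idea described above (a hypothetical class, not the actual torch.export internals):
```python
class IntWrapper:
    def __init__(self, val: int, dynamism=None):
        self.val = val            # the concrete integer supplied by the user
        self.dynamism = dynamism  # e.g. Dim.DYNAMIC / Dim.AUTO, or None

    def resolve(self):
        # With no dynamism specified, unwrap back to a plain int; otherwise the
        # tracer converts this input into a symint.
        return self.val if self.dynamism is None else self
```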
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
DSD currently pops tensors if these tensors are on the meta device. This forbids the use cases where users would like to let DCP directly initialize the tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above feature, is not realistic, and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides similar performance to its counterpart for the QuantizedCPU device, `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
Fixes #147336
## Context
NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.
Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.
To summarize:
In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.
This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).
i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.
## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.
Before fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```
After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7) which maps over a set of nodes (in this case type `IDListFeatureListField`) and turns the results into strings to be displayed later. However, during pytree.tree_map we call pytree.tree_unflatten, which calls the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.
I initially thought to disable the assert_blob while tracing (D74684309), but I think we should actually flatten the list first, because tlparse expects just string outputs instead of the actual structure.
Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅
Differential Revision: D74744326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
The C++ Reducer is silently incorrect under CA; its implementation no-ops the collective. I'm guessing that it was no-op'd because in DDP + python reducer, the C++ reducer is still being initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
I slap disable on the recomputation hook, otherwise the partitioner may save fewer/more activations and mismatch the expected eager count in checkpoint. See the code comment `Note: [compiled autograd and checkpoint unpack hook]`.
This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.
This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising _StopRecomputationError and others raising the same count mismatch error as CA.
FIXES https://github.com/pytorch/pytorch/issues/127115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
This enables Gloo CUDA when used with a backend that supports GPUDirect which currently is only the IBVERBS backend.
This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441
Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)
    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    start = time.perf_counter()
    for i in range(iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    dur = (time.perf_counter() - start)
    qps = iters/dur
    bandwidth_gb = t.nbytes * iters / dur / 1e9
    gb = t.nbytes / 1e9
    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
Summary:
Currently things are hardcoded to only work with nccl backend. Extend it
to allow NCCL + custom plugin backend.
The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.
I am open to making them part of the base Backend if folks prefer.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.
This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.
However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all because what would that really mean?
So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
Added a test which checks for this case.
While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
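A self-contained sketch of the negative-caching idea, using hypothetical names rather than the real FakeTensor cache:
```python
class BypassCache(Exception):
    """Raised when a result cannot be represented in the cache."""

_cache: dict = {}
_negative: set = set()  # keys that previously failed to cache

def cached_dispatch(key, compute, validate):
    if key in _negative:
        return compute()      # negative hit: skip the caching machinery entirely
    if key in _cache:
        return _cache[key]
    result = compute()
    try:
        validate(result)      # may raise BypassCache, e.g. when the output has
        _cache[key] = result  # a symbol that doesn't come from the inputs
    except BypassCache:
        _negative.add(key)
    return result
```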
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was ran manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.
## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))
## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.
Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).
Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
Change trigger for auto format to be pull_request b/c the reusable action used gets the pr number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR. Tested this on a different PR, and it seems to be working
Changed tag name because ciflow prefixed labels have special handling
Also changed it to run on all files so it mimics the normal CI lintrunner call, and because lintrunner, either by itself or using -m mergebase, can miss some things. I don't know if it would miss things for format, but it does for checking lint. Format seems to take less time than normal lint. I don't know if the comment about making suggestions on non-edited file changes is a concern; I didn't really test this part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.
Test Plan: existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.
Test plan:
```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```
```py
import os
rank = int(os.environ["RANK"])
import time
from torch import distributed as dist
store = dist.TCPStore(
    host_name="localhost",
    port=1234,
    is_master=(rank == 0),
    wait_for_workers=False,
)
time.sleep(1)
print("starting")
if rank != 0:
    store.get("foo")
else:
    time.sleep(1)
print("done")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately reraised the exception or didn't print any exception information where it could be useful.
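A small, generic example of the difference (paths and names are illustrative): `logging.exception` records the active traceback, while a bare `logging.error` drops it.
```python
import logging

logger = logging.getLogger(__name__)

def load(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # logs the message *and* the current stack trace
        logger.exception("failed to load %s", path)
        raise
```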
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
### Summary
Fixes#148494
Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.
Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is being used to prevent branching causing a slowdown due to unnecessary prefetching.
\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance
### Performance data with BS 1
Machine: 32 cores of one socket of a Intel Xeon SP Gen 5 machine
| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms| 9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms| 7.01% |
While the input shapes of the GEMMs corresponding to linear layers for next-token computation remain the same across different numbers of input & output tokens, the difference in next-token latency is due to attention in those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen
Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.
------
- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.
Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:
- #117449
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
When using Bazel, it’s common to encounter issues like [this](https://github.com/bazelbuild/bazel/issues/14640) and [this](https://github.com/bazel-contrib/rules_python/issues/792) where the `PYTHONPATH` environment variable becomes too long and results in an error such as: `OSError: [Errno 7] Argument list too long` . To work around this, users often resort to custom logic to manipulate PYTHONPATH.
Currently, PyTorch Inductor constructs the PYTHONPATH for a subprocess using sys.path, which can lead to this issue in certain environments.
This PR introduces support for a new environment variable, `TORCH_CUSTOM_PYTHONPATH`, allowing users to override the default `PYTHONPATH` passed to the subprocess. This provides a clean way to avoid an exception when using PyTorch in Bazel.
Please let me know if I need to add some documentation to support this PR. I haven't found an open issue specific to this change, but I'm confident that this change (or a similar one) would be appreciated by a few.
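A usage sketch of the new override (the paths below are placeholders): set it before the compile workload starts so the inductor subprocess receives a short, explicit path instead of one derived from sys.path.
```python
import os

# Placeholder paths; the goal is just a short, explicit PYTHONPATH for the
# inductor subprocess instead of one derived from sys.path.
os.environ["TORCH_CUSTOM_PYTHONPATH"] = "/opt/deps/site-packages:/opt/my_project"
```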
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152832
Approved by: https://github.com/masnesral
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accept inputs on the CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and provide similar or better performance, depending on shape, than their counterparts for the QuantizedCPU device, `quantized.add` and `quantized.add_relu`.
The new ops support output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Summary:
CUDA Post: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/2020094788475989/
# Context
In this diff, we want to enable the on-demand mode of memory snapshot to allow users to trace any remote process via the dyno command line.
# Design decision
**How do we send the on-demand signal to a remote process**
We leverage the dyno-Kineto approach.
Since dyno runs on every machine at Meta, it can send a request to the remote machine to start Kineto.
Kineto will start another thread for the memoryProfiler (https://fburl.com/code/dxsmmrok)
**Why we use a different approach than CUDA**
On the CUDA side, we are using pybind to load the torch module and invoke the python api to start/stop the profiling. However, this requires us to compile the whole torch binary in the predictor, which is not recommended by runtime (andruwang).
Thus, we decided to use the C++ api directly to avoid unnecessary dependencies.
**Why the snapshot is saved as a json string directly instead of pickle**
Pickle is primarily designed for use with Python and doesn't have good support in C++. Also, it is hard for users to download the snapshot file and open it locally.
Due to the dependency issue, it is hard to import the gzip/pickle library to decode the data. Thus, let's use JSON for now. I will work on the visualizer to speed up rendering and support other formats later.
**Plan**:
* Now, we will encode the file into gz for MTIA on-demand only and update the visualizer to support both types.
* Update auto-trace and the CUDA side to encode in gzip as well
* Fully remove the pickle dependency.
Test Plan:
# Remote cogwheel test
Servicelab: https://fburl.com/servicelab/pckux7a3
snapshot file manifold: https://fburl.com/manifold/fnotk18c
snapshot file in pastry: P1805522232
Visualization on D74399684
{F1977786422}
# Local Predictor Test
url: https://fburl.com/pytorch_memory_visualizer/y06kskkm
{F1977787329}
Differential Revision: D74179606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153171
Approved by: https://github.com/sraikund16
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.
Adds a regression test for this behavior.
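A small example of the kind of call involved (shapes and values are illustrative): the second argument is a numpy float64 scalar, which torch.compile represents as a torch.float64 scalar tensor.
```python
import numpy as np
import torch

@torch.compile
def f(x, scale):
    return x * scale

out = f(torch.ones(3, dtype=torch.float32), np.exp(0.3))
```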
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
Shameful admission: I have encountered this error 1-2 times, but don't have a repro.
torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures
elapsed_times[future],
~~~~~~~~~~~~~^^^^^^^^
torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153417
Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler
Addresses #151097, including the changes below:
- Add XPU support package 2025.1 build and test in CI for both Linux and Windows
- Keep the XPU support package 2025.0 build in CI to ensure nothing breaks until the PyTorch 2.8 release
- Upgrade the XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows
- Enable XCCL in the Linux CD wheel and oneMKL integration in both Linux and Windows
- Update the XPU runtime pypi packages of the CD wheels
- Remove the deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151899
Approved by: https://github.com/EikanWang, https://github.com/atalman
Async compile workers generally don't respect inductor configs that get changed in the middle of execution because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
MOTIVATION
This PR includes a minor change to check for TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU causing them to fail.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66
CHANGES
add the TEST_HPU flag to some of the conditions checking the environment
use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API since the latter is not supported on out-of-tree devices like Intel Gaudi.
@ankurneog, @EikanWang, @cyyever, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:
```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial
max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)
d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```
which results in:
```
Traceback (most recent call last):
File "test.py", line 11, in <module>
d.log_prob(d.sample())
File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
self._validate_sample(x)
File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
valid = support.check(value)
^^^^^^^^^^^^^^^^^^^^
File "pytorch\torch\distributions\constraints.py", line 307, in check
(value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```
### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)
- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its validation is not defined elementwise (the validation itself is elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).
### Updated fix explanation (for all cases)
- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its components distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
By computing matmuls of only one random non-zero batch on CPU
This reduces test runtime from 11 minutes to 14 sec
```
% python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok
----------------------------------------------------------------------
Ran 2 tests in 27.495s
```
TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.
`OptimizedModule` also inherits `nn.module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.
However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.
To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.
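A small example of the behavior the warning covers (assuming compilation succeeds in the environment): the global forward hook fires for the `OptimizedModule` wrapper and again for the wrapped module, one more time than in eager.
```python
import torch
import torch.nn as nn

def global_hook(module, args, output):
    print("forward hook fired for", type(module).__name__)

handle = nn.modules.module.register_module_forward_hook(global_hook)

compiled = torch.compile(nn.Linear(2, 2))
compiled(torch.randn(1, 2))  # fires for OptimizedModule and for Linear
handle.remove()
```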
Fixes #149502
Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Summary:
### Diff Context
- PR introduces Pipes for multiprocess comms with checkpointer process.
- Pipes allow easier comms contract management due to close() API and catch-all feature when background process is dead (e.g. seg faults).
Test Plan: CI
Differential Revision: D74668559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
Since we have pyproject.toml metadata for [project] and [build-requires], let's turn on the linter rules which validates this optional metadata to make sure it's properly formatted and follows the correct schema for standard Python build tools.
Right now, incorrect metadata could silently error with how our CI is invoked or only provide warnings for invalid metadata. This check will help surface those errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153543
Approved by: https://github.com/albanD
Helion relies on torch/fx/experimental's fake_tensor tracing but does its own dtype checking, which conflicts with some meta kernels' existing dtype checking. This PR adds a config so that we skip that dtype checking in meta kernels and rely on the calling system to do the dtype checking.
Currently it only applies to `baddbmm`, but I expect that similar changes will need to be done to other meta kernels in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for users in a sensible way.
The behavior of set_device before:
* If the user calls init_process_group to init a world process group, we assume the user already called set_device and we don't set the device for them
* If the user does not init a world process group themselves, we init a world process group for them and follow a heuristic to set the device.
This is ok, but sometimes the set_device heuristic wouldn't work well (i.e. if the user uses CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to:
* If the default cuda context is initialized by the time we init DeviceMesh, then we assume user must called some cuda operation before therefore must have selected the device by themselves
* If not the above, then we check if envvars have "LOCAL_RANK" and "WORLD_SIZE" from the launcher (i.e. torchrun), if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISBILE_DEVICES issue)
* If not above, then we throw warning to users about situation, and fallback to the old heuristic.
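A minimal sketch of this selection order (a hypothetical helper, not the actual DeviceMesh code):
```python
import os
import warnings

import torch

def _maybe_set_device() -> None:
    # 1) The default CUDA context is already initialized: trust the user's own device choice.
    if torch.cuda.is_initialized():
        return
    # 2) Launched via torchrun: LOCAL_RANK/WORLD_SIZE are set, so follow the standard practice.
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None and "WORLD_SIZE" in os.environ:
        torch.cuda.set_device(int(local_rank))
        return
    # 3) Otherwise warn and fall back to the old heuristic (omitted here).
    warnings.warn("DeviceMesh could not infer a device; falling back to the old heuristic.")
```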
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turns out to saturate only 57 GB/s of bandwidth.
This PR adds more blocks for intra-node, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL for Hopper. See figure.
It also uses a simple tuning based on input size to avoid jumping to 8 CTAs directly (i.e. 1, 2, 4, then 8)
For inter-node, we cap at 8 blocks, since 57 GB/s seems bigger than regular NIC bandwidths (400 Gb/s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).
I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed into the default function.
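A hedged usage sketch of the new variant (shapes and values are illustrative only):
```python
import torch

param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.Adagrad([param], lr=torch.tensor(0.01), fused=True)

param.grad = torch.randn(4)
opt.step()  # internally, lr.item() is cast to a double and passed to the fused implementation
```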
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
Updates heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location
- The goal of the consolidation is to improve maintainability and readability of the heuristic logic. Instead of different parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where there already exists heuristic-based selection for mkldnn.
- The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True

BATCH_SIZES = [ 1, 8, 32, 64, 128, 256 ]
M_DIMS = [ 4, 8, 16, 32, 64, 256, 512 ]
N_DIMS = [ 4, 8, 16, 32, 64, 256, 512 ]
K_DIMS = [ 4, 8, 16, 32, 64, 256, 512 ]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m*n*k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b*b*m*n*k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    # warmup
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```
From the script we see that cumulative mean time from all test cases (at 16 threads) is:
- 1.6195 s for the old heuristic
- 0.7012 s for the new heuristic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
In #117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shutdown if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).
#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.
Removing both these changes restores the original behavior. The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
Fixes #150916, fixes #147064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
Summary: Use pybind11::gil_scoped_acquire instead of the old impl, as it will automatically take care of error handling. In the original implementation we missed releasing the GIL on each possible error path, which could put the program in a deadlock.
Test Plan: Induced error manually and saw that GIL was released
Differential Revision: D74593564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
Summary:
This diff adds a justknobs check for static cuda launcher. In particular, it supports a fractional rollout where each mast job/version can be consistently enrolled in the config on or off.
It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.
Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.
Differential Revision: D74599203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
This appears to be slow in production (potentially a quadratic explosion), and
logging this explicitly in pt2_compile_events and wait_counters makes it a lot easier to see how
bad of an issue this is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").
Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
Toggling on `torch._dynamo.config.compiled_autograd = True` was erroring export (optimize_assert didn't have `rebuild_ctx` defined). Separately add a way to `rebuild_ctx` for `optimize_assert` since it is a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
Summary:
For MTIA on-demand mode, since we are not using a torch Module, the data upload happens in cpp and doesn't support pickle.
Thus, we store the data as JSON at the end and need to update the visualizer to support it.
Test Plan: Check Test plan in D74179606
Differential Revision: D74406209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153166
Approved by: https://github.com/sraikund16
The `Error in dlopen: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found` error was caused by the cmake migration (as the conda build probably has some extra link rules), while `C++ exception with description "CUDA error: no kernel image is available for execution on the device` was caused by the fact that tests were built for Maxwell but run on SM_86.
The remaining test was failing before, but was probably disabled.
TODOs:
- Move build to the build step
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153423
Approved by: https://github.com/huydhn, https://github.com/cyyever
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark(), which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote, I'll be using the TritonBenchmarkRequest directly, so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".
Test Plan: Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
Support xpu tf32 matmul using torch.backends.mkldnn.allow_tf32; we will discuss in the future whether we need a new api to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decide not following torch.backends.cuda.matmul.allow_tf32 because this API actually calls setAllowTF32CuBLAS to set matmul_precison to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing new API.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
This is done by invalidating all variables created during the loop (except for the contents of the iterator_cache, as stores can happen inside the reduction loop) and by clearing the `IteratorRangeEntry` codegen cache.
This results in the following kernel for `x / x.sum()` when x has 2048 elements and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```
Fixes the compilation failures reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`
Fixes https://github.com/pytorch/pytorch/issues/152155
Some inductor tests are still failing, though; the variable invalidation needs to be refined further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
During the dump of FR, for reasons currently unknown, we see cuda errors when querying events, and this causes the whole FR dump to fail (when trying to get entries). So we use a try-catch instead of letting it fail the whole process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k
Improves typing so that all the optimizer subclasses (all of which subtype step) do not erase their type signature when this decorator is used. Now keyword-argument values and return types will propagate.
This complements @tsunghsienlee's PR #153367, as the type signature of step() was being erased on all the optimizer subclasses by this untyped decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153374
Approved by: https://github.com/janeyx99, https://github.com/tsunghsienlee
Summary:
To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following:
1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected
3. Piping for compile context to pickle output
Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}
Differential Revision: D74028214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
- Move community and language binding links to the horizontal bar
- Add an intro to the community page.
- Fix the link in the ogp_image
- Fix the link in the version switcher
- Clean up unneeded links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153090
Approved by: https://github.com/albanD
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we add support for scaling quantization and set it as the default version.
We quantize activation nodes based on size_in_mb; the default value is 100, i.e., as long as the node is at least 100MB in size, we will quantize it.
Test Plan:
### how to enable
```
torch._inductor.config.post_grad_fusion_options = {
"activation_quantization_aten_pass": {
"quant_type": "torch.float8_e5m2", -> default is this type to quantize, you can change the type
"use_scaling": False, -> default is False, if you want to use scaling verison, set it to True
"size_in_mb": 0.0, -> default is 100, you can tune the value.
"exclude_primals": False, -> whether want to exclude quantize parameters, default is False
"allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32", -> dtype you consider to quant, use ";" to separate, default is torch.bfloat16
},
}
```
### toy model
```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```
```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```
Differential Revision: D73015996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
https://github.com/pytorch/pytorch/pull/146942 introduced an 8.3% regression on the `benchmark_torchbench_run_bert_pytorch_training:defaults-speedup-x1000` perf metric. This was flagged by internal CI testing (task T223596372).
The root cause seems to be that `FloorDiv` is now used to calculate the launch grid in certain scenarios, which is slower than the previously-used `//`. Since launch grid calculations happen at runtime, they can have a significant performance impact on some models.
The reason for switching to `FloorDiv` in https://github.com/pytorch/pytorch/pull/146942 was to allow the FX backend to generate runnable Python code. `FloorDiv(x, y)` maps to `x // y` in Python, whereas `sympy.floor(sympy.Rational(x,y))` maps to `floor(x/y)`, which crashes as FX doesn't know what `floor` is.
To get the best of both worlds, this PR reverts to using `//` to calculate launch grids, but then post-processes the resulting sympy expressions in the FX converter, converting `floor(x / y)` to `FloorDiv(x, y)`. Since this sympy manipulation happens at compile time, the perf impact should be minimal, and should only affect the FX backend. This is similar to the approach previously explored in https://github.com/pytorch/pytorch/pull/151144, but the implementation is more minimal and self-contained.
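A rough sketch of the post-processing idea (not the exact FX-converter code; `FloorDiv` is the sympy function from torch.utils._sympy, while the helper name is made up):
```python
import sympy
from torch.utils._sympy.functions import FloorDiv

def _floorify(expr: sympy.Expr) -> sympy.Expr:
    # Rewrite floor(x / y) into FloorDiv(x, y) so the generated Python uses `x // y`.
    for fl in expr.atoms(sympy.floor):
        numer, denom = fl.args[0].as_numer_denom()
        expr = expr.subs(fl, FloorDiv(numer, denom))
    return expr
```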
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153230
Approved by: https://github.com/jansel
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However, when scatter_dim is 1, the two outputs are not close; we need a follow-up on this.
Another follow-up is to change fused_scaled_matmul_reduce_scatter to make the newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.
This PR skips codegenning of these passes.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
## Context
Suppose we have this graph like this :
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```
NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."
So, we would expect u0 = s1 + u2 which is guarded on today except it's a deferred runtime assertion since unbacked symints aren't replaced today as Pian.
Notice how a has a different symbolic shape than both b and cat. Today, this will create an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration where 8192 is the unbacked symint fallback value.
```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```
## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
RuntimeError: CUDA error: device-side assert triggered
```
Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
DSD currently pops tensors if these tensors are on the meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above behavior, is not realistic, and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
Follow-up to @ezyang's PR #153020, but better uses PEP 621 to reduce redundant fields and pass through metadata better to uv, setuptools, poetry and other tooling.
* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels
These also clearly show us what fields will need to be migrated to pyproject.toml over time from setup.py per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.
Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
Unblocks #153020, which accidentally improves the CircularImportLinter to check all Python files. It doesn't set a LOGNAME so it errors; there is another FSDP script which already defaults LOGNAME to '' if not specified, and this does the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153324
Approved by: https://github.com/awgu
When trying to plot a trace graph, Inductor checks if "dot" is installed. Currently, the code runs a "which dot" command.
By default, Windows doesn't have the "which" command. This patch replaces it with the more portable alternative.
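For reference, a portable check could look like the following (assuming `shutil.which` is the replacement; the actual patch may differ):
```python
import shutil

has_dot = shutil.which("dot") is not None  # works on both Windows and POSIX
```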
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153259
Approved by: https://github.com/Skylion007
Fixes #152918
Before:
```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
```
After:
```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
User stack:
File "/home/bobren/local/a/pytorch/error.py", line 5, in foo
return torch.randn(5) * x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152924
Approved by: https://github.com/pianpwk
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.
### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
- The names are more informative. 'tf32' is more informative than a simple "high".
- Easier to extend new algorithm like `tf32x3`
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide 3 fp32 compute precisions that can be set (plus "none" for unset):
- **"ieee"**: Not allowed to use any other internal computation data types.
- **"tf32"**: Allowed to use tf32 as an internal computation data type.
- **"bf16"**: Allowed to use bf16 as an internal computation data type.
- **"none"**: Precision is not set and can be overridden by its parent node.
### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the sketch below).
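A hedged usage sketch of these knobs, using only the attribute names mentioned above (exact semantics per the overriding rules):
```python
import torch

torch.backends.cudnn.conv.fp32_precision = "tf32"     # allow TF32 for cuDNN convolutions
torch.backends.mkldnn.fp32_precision = "bf16"         # parent setting for all mkldnn ops

# Children inherit from their parent unless set explicitly.
print(torch.backends.mkldnn.matmul.fp32_precision)    # "bf16", inherited from mkldnn
torch.backends.mkldnn.matmul.fp32_precision = "ieee"  # opt a single op back out
```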
### Backward Compatible
Since the new API allows users to have more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the status of `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they will work as previously expected.
- If the user uses the **new** API to change the status to one that is **un-representable** for the old API, and then tries to access the status via the **old** API, we will raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Fixes #132644
`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour.
For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.
## Changes:
* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]` (see the sketch below).
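A minimal sketch of the resulting call pattern (simplified; the real helpers live in the pipelining code):
```python
import torch.distributed as dist

def _batch_p2p(p2p_ops):
    # dist.batch_isend_irecv may return more than one Work object on Gloo; keep all of them.
    return dist.batch_isend_irecv(p2p_ops)

def _wait_batch_p2p(works):
    # Wait on every returned Work instead of popping a single element.
    for work in works:
        work.wait()
```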
## Testing:
* `pippy_bert.py` from #132644 now works with gloo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152938
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, this includes a check that covers only torch.nn.modules.batchnorm.BatchNorm2d. As a result, the tracking stats are not properly removed when the 1D variant is used. This diff updates the check to fix this.
Test Plan:
Run N7113483 without this fix.
{F1977726982}
```
bento kernel build sensorml
```
Re-run with local version of kernel, containing this diff:
{F1977727151}
Notice that now, num_batches is removed.
Differential Revision: D74269649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)
One weird thing I'm seeing, when running all grouped_gemm tests, I'm erroring out with
```
File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
out = lowerings[target](*args, **kwargs) # type: ignore[index]
File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
out = decomp_fn(*args, **kwargs)
File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
offs is not None
File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird, there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`, now with correct meat function something's wrong with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code. The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA
Test Plan: OSS CI
Differential Revision: D74432771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211
Approved by: https://github.com/Skylion007
This PR moves `mps_linear` to use MPSNDArrays and call into the MPS kernel directly instead of going through MPSGraph. It also adds a caching mechanism for reusing MPS kernels as there is also a small overhead attached to creating the kernel object.
The impact of the improvement is relatively more significant for small input kernels where the MPSGraph overhead represents a larger portion of the overall execution time of the operation but the speedup shows for both small and large input sizes as expected.
`mps_linear` before the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x109d67110>
func(*args, **kwargs)
Median: 199.29 us
IQR: 9.56 us (196.71 to 206.27)
979 measurements, 1 runs per measurement, 1 thread
input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1063b4510>
func(*args, **kwargs)
Median: 979.29 us
IQR: 25.29 us (964.83 to 990.13)
205 measurements, 1 runs per measurement, 1 thread
```
`mps_linear` after the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10693a190>
func(*args, **kwargs)
Median: 176.08 us
IQR: 15.02 us (172.42 to 187.44)
1103 measurements, 1 runs per measurement, 1 thread
input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10d524dd0>
func(*args, **kwargs)
Median: 952.56 us
IQR: 15.63 us (945.47 to 961.10)
210 measurements, 1 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152210
Approved by: https://github.com/kulinseth, https://github.com/malfet
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925
A reland of https://github.com/pytorch/pytorch/pull/152125.
added a try-except around the justknob internally. Also added more documentation
Currently, AOTI only generates runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.
Also factored out the runtime assertion logic into a separate function.
We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from the input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into no-ops,
because when we evaluated their expressions, we would see that, since we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".
One example is below:
```
class Model(torch.nn.Module):
    def forward(self, a, b, c):
        nz = torch.nonzero(a)
        ones = a.new_ones([nz.size(0), b.size(0)])
        torch._check(ones.size(0) >= 1)
        equals = torch.add(ones, c)
        return equals

torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that a and nonzero have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.
In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
class Model(torch.nn.Module):
    def forward(self, x):
        y = x.reshape(100, -1).clone()
        y = y + 1
        return y

dynamic_shapes = {
    "x": {0: torch.export.Dim.DYNAMIC},
}
# x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.
Example:
```
def forward(self):
    arg0_1: "f32[s35]";

    arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
    sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

    #
    mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
    eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
    _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
    view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
    clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
    add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
    return (add_6,)
```
Generated cpp code:
```
auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
auto arg0_1 = std::move(inputs[0]);
auto arg0_1_size = arg0_1.sizes();
int64_t s35 = arg0_1_size[0];
inputs.clear();
auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_unbacked_equals_input_size
```
Differential Revision: D74361799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
Motivation is that `AlgorithmSelectorCache.__call__` is getting very long and hard to work with. There are nested layers of local functions in it. For example, we pass `precompile_fn`, a local variable, to `do_autotuning`, a local function, which already has a pointer to choices, a local variable, and then have `do_autotuning` calls `choices` in `self.lookup`.
When I was trying to make changes to do_autotuning, I would get `UnboundLocalError: cannot access local variable 'choices' where it is not associated with a value`. But no idea why it was even working in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153212
Approved by: https://github.com/eellison
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath, https://github.com/jeffdaily
Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com>
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.
This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.
However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all because what would that really mean?
So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
Added a test which checks for this case.
While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
This patch fixes a corner case of the `torch.isin` decomposition when both
inputs are scalars. This pattern showed up in #141196.
Fixes #141196.
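A compact repro, adapted from the test shown in the traceback below:
```python
import torch

@torch.compile(backend="inductor")
def f():
    x = torch.tensor(0)      # 0-d tensor, so both arguments to isin are "scalars"
    return torch.isin(x, x)

f()  # raised the TypeError below before this patch
```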
Error stack before this patch:
```
File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
res = opt_f()
^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
raise BackendCompilerFailed(
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
compiled_fn = AOTAutogradCache.load(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph( # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
t = dispatch_trace(
^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
graph = tracer.trace(root, concrete_args) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
(self.create_arg(fn(*args)),),
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
out = f(*tensors) # type:ignore[call-arg]
^^^^^^^^^^^
File "<string>", line 1, in <lambda>
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
outs = fn(*args)
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
f_outs = fn(*f_args)
^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
outs = fn(*args)
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
out = PropagateUnbackedSymInts(mod).run(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
self.env[node] = self.run_node(node)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
result = super().run_node(n)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
return target(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
outs_unwrapped = func._op_dk(
^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
return isin_default(elements, test_elements, invert=invert)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
* (torch.dtype dtype)
* (tuple of ints size)
While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
    def forward(self):
        # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
        x: "i64[][]" = torch.tensor(0)

        # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
        isin: "b8[][]" = torch.isin(x, x);  x = None
        return (isin,)

Original traceback:
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
    return torch.isin(x, x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
PR #151968 added `reorder_for_minimizing_partition` to minimize the number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for other nice properties, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.
This PR improves `reorder_for_minimizing_partition` to uphold the invariant that the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index, so we always schedule the node with the smallest index within a graph partition and respect the invariant. The previous implementation tried to use a queue to achieve this but failed, because node_N at the end may rely on node_1 at the start, such that node_N is added to the queue as soon as node_1 is scheduled.
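A self-contained sketch of the index-heap idea, using a plain dependency map instead of the real scheduler structures; it only illustrates how popping by original index preserves relative order:
```python
import heapq

def stable_topo_order(nodes, deps):
    """Schedule ready nodes by original index so relative order is preserved.

    `nodes` is a list; `deps[n]` is the set of nodes that must be scheduled before n.
    """
    index = {n: i for i, n in enumerate(nodes)}
    remaining = {n: set(deps.get(n, ())) for n in nodes}
    heap = [index[n] for n in nodes if not remaining[n]]
    heapq.heapify(heap)
    order = []
    while heap:
        n = nodes[heapq.heappop(heap)]      # always the ready node with the smallest index
        order.append(n)
        for m in nodes:
            if n in remaining[m]:
                remaining[m].remove(n)
                if not remaining[m]:
                    heapq.heappush(heap, index[m])
    return order

# With no reordering needed, the output equals the input order:
assert stable_topo_order(["a", "b", "c"], {"c": {"a"}}) == ["a", "b", "c"]
```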
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
Summary:
- Replace `C10_CUDA_KERNEL_LAUNCH_CHECK()` in the `KernelLauncher`, as the
latter does not print __FILE__ and __LINE__
The existing `C10_CUDA_KERNEL_LAUNCH_CHECK()` implementation does not print the source file and line number when a CUDA kernel launch throws an error, leaving users confused with a context-less message like `CUDA error: invalid arguments`. This new check is a slimmed re-implementation of the macro with extra context information added to the error (beyond just file and line number) so that we can at least locate the FBGEMM source file or template where the error first surfaces.
Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
buck2 run 'fbcode//mode/opt-amd-gpu' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```
Reviewed By: sryap
Differential Revision: D74364031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153178
Approved by: https://github.com/atalman, https://github.com/huydhn
Graph partition analyzes read_writes to collect partition input names. However, a weak dep is a fake dependency and is not actually read or written, so we should not include weak deps in graph partition input names.
The following test failure is fixed by removing weak dependencies from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
Summary:
Bug fix for constant folding states. We are not setting the correct state for each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from the main run.
(2) In the second round of updating the constant buffer, we should have set secondary as INITIALIZED, but primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock is obtained, waiting for availability at this time.
(4) The main run enters INITIALIZED, waiting for a unique_lock (while a shared_lock is being held by (3) at this moment).
Test Plan:
TBD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
Summary: I'll need some of the benchmark-related functions surfaced so I can use them for remote autotuning. This PR just lifts the main in-process benchmarking helpers to classmethods. It wasn't strictly necessary to also move the sub-process benchmarking helper, but I think it improves readability. Also added some missing types.
Test Plan: Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153084
Approved by: https://github.com/aorenste, https://github.com/eellison
Summary: HF loading when there is no metadata is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models have a backend other than the local fs (i.e. hf, s3, etc.). This diff updates the code to open the file with fsspec.open() and then use safetensors.deserialize() to get the keys.
Test Plan: unit test and e2e test reading from hf
Differential Revision: D74181513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.
# Additional Context
run the following script
```python
import torch
import torchvision.models as models
torch.manual_seed(0)
model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)
device = torch.device('xpu:1') # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)
with torch.no_grad():
    ret = model(data)
print(ret)
print("Execution finished")
```
The output is
```bash
-9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01, 6.4551e-01,
-6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01, 3.2715e-02,
-4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
-8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01, 2.0312e+00]],
device='xpu:1', dtype=torch.float16)
Execution finished
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503
`sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It's not documented for usage (see extension specification linked below) and is not guaranteed to exist. Instead, documented usage of extension suggests to rely on including `sycl/sycl.hpp` which in its turn includes `bfloat16.hpp` header (which is implementation detail).
We stepped into issues by explicitly including the `bfloat16.hpp` sycl header within a user-facing production environment where the `intel-sycl-rt` wheel is installed (which is a dependency of the `torch` wheel package built and publicly available for xpu). The compiler includes this file from `intel-sycl-rt` and, due to the `#pragma once` usage, its content is included as well, giving redefinitions of the symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
| ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
|
```
While the SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct the usage of the sycl `bfloat16.hpp` header in pytorch, i.e. drop it. This fortunately helps to address the reported issue of redefinitions, though a follow-up on the compiler side is still required.
Also, `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS`, which used to guard the inclusion of `sycl/sycl.hpp`, does not make sense since it's defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.
See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc
CC: @EikanWang, @guangyey, @gujinghui
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
The edited comment should have the info. The code change looks large, but it's copied from the install_cache script that our docker images use: 6a8006472e/.ci/docker/common/install_cache.sh (L42)
Sccache stopped working on xla at some point near Dec 17, 2023. I am not sure what commit caused it; I think it was having trouble writing to the cache.
Either way, there is an sccache already installed on the docker image, so we should use that instead of a binary from s3, which we're probably no longer sure where it came from or what commit it was built from.
The one in the docker image is installed here: 69d438ee65/.github/upstream/Dockerfile (L61) and is also very old, so I have https://github.com/pytorch/xla/pull/9102 to update it.
sccache is still not writing properly and I will investigate, but the xla build is currently broken after the above xla PR, and this should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153002
Approved by: https://github.com/malfet
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
return fn(*proxy_args, **proxy_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
self = self.expand((dim1, dim2, dim3))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.
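A minimal sketch of the idea under stated assumptions (the helper name below is illustrative, not the actual meta registration):
```
import torch

def maybe_expand_self(self: torch.Tensor, dim1, dim2, dim3) -> torch.Tensor:
    # Only broadcast when needed; in the no-broadcast Helion case the shape
    # already matches, so the symint-unfriendly expand() call is skipped.
    if self.shape != (dim1, dim2, dim3):
        self = self.expand((dim1, dim2, dim3))
    return self
```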
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
Follow up to @ezyang's PR #153020 , but better uses PEP621 to reduce redundant fields and pass through metadata better to uv, setuptools, poetry and other tooling.
* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project wide settings that are respected by linters and IDE (in this example we are able centralize the minimum supported python version).
* Currently most of the values are dynamically fetched from setuptools, eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptool arguments.
* This controls what additional metadata shows up on PyPi . Special URL Names are listed here for rendering on pypi: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels
These also clearly shows us what fields will need to be migrated to pyproject.toml over time from setup.py per #152276. Static fields be fairly easy to migrate, the dynamically built ones like requirements are a bit more challenging.
Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
Summary:
As discussed with @malfet, we're porting nativert code to torch/nativert/.
Following up some concerns over the new directory, I'm trying to setup the tooling on OSS so various things (like linters) can run on torch/nativert/ properly.
Test Plan: CI
Differential Revision: D74407808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153164
Approved by: https://github.com/dolpm, https://github.com/Skylion007
Summary: The autotuner is using zero-filled tensors to autotune
_scaled_grouped_mm and that's not appropriate for the offsets tensor, since it
essentially corresponds to "no input" and thus yields invalid perf results.
We can't really use the actual input tensors, since we might be compiling this
op in the context of an entire graph.
So instead, I decided to create a synthetic offsets tensor assuming that each
group is (roughly) the same size. I don't have data but I'd guess this
approach is OK for MoE since we're generally hoping to load-balance the
experts; I'm not sure how well it applies to other scenarios that might be more
heavy-tailed.
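For illustration, a synthetic offsets tensor with roughly equal group sizes could be built like the sketch below (function name and dtype are assumptions, not the autotuner's actual code):
```
import torch

def synthetic_offsets(total_rows: int, num_groups: int, device="cuda") -> torch.Tensor:
    # Cumulative end offsets assuming roughly equal-sized groups; the last
    # offset always equals total_rows so every row belongs to some group.
    ends = [round(total_rows * (i + 1) / num_groups) for i in range(num_groups)]
    ends[-1] = total_rows
    return torch.tensor(ends, dtype=torch.int32, device=device)
```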
Test Plan:
```
pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152968
Approved by: https://github.com/ngimel
Flex Attention may have symints in subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to errors when analyzing symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
Summary:
Added some logging and captured the indexing.
This is why the saved module path is called `/tmp/jimwan/minimizer_a_acc.pt`
Now the updated module paths are `/tmp/jimwan/minimizer_addmm_default_103_acc.pt`.
Test Plan:
```
MTIAC_USE_DIST_REF_KERNELS=all buck2 run @//mode/opt mtia/accuracy/minimizer:mtia_minimizer_runner -- --mode sequential --compare_fn allclose --pt_save_dir /tmp/debug3 --atol 1e-4 --rtol 1e-4 --all_outputs --start_idx native_layer_norm_default_80 --end_idx getitem_272 2>&1 | tee ~/test.log
```
Reviewed By: qcyuan
Differential Revision: D74369107
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153130
Approved by: https://github.com/Skylion007
As in title
I'm not sure how the install_cmake script is used: I see it being called with 3.18, but when I look at the build jobs, some say 3.18 and others 3.31.
Just make everything install cmake via requirements-ci.txt. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm defaulting to installing through pip.
Also defaulting to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
Summary:
We should generate the kernels for the const graph and the main graph separately.
The reason is that when we run autotuning, we create separate kernel calls, and we should make sure that the main graph also contains the runner.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding
Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
Summary:
X-link: https://github.com/pytorch/gloo/pull/437
This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.
This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.
Known issues:
* recv from any source is not supported
* AllreduceBcubeBase2 is failing
Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:
GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```
We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.
Differential Revision: D73291471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153015
Approved by: https://github.com/fduwjj
ShapeEnv.evaluate_expr() behaves differently based on the (tls) global "suppress_guards" - so its cache key needs to include that value.
This came up because #152662 triggered it in the test `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error` - fixing this caused that test to work again.
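Conceptually, the fix folds that flag into the memoization key, roughly as in this hedged sketch (illustrative only, not the actual ShapeEnv code):
```
import threading

_tls = threading.local()  # stand-in for the TLS that holds suppress_guards

def evaluate_expr_cache_key(expr):
    # A result computed while guards are suppressed must not be reused when
    # guards are active (and vice versa), so the flag is part of the key.
    return (expr, getattr(_tls, "suppress_guards", False))
```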
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152661
Approved by: https://github.com/laithsakka
## Summary
This PR updates the docstring for `CosineAnnealingLR` to accurately reflect its recursive learning rate schedule. The previous docstring displayed only the SGDR closed-form expression, which doesn't match the actual recursive implementation in code.
Changes:
- Added the recursive update formula used in `get_lr()`
- Retained the original closed-form SGDR expression for reference
- Clarified that warm restarts are not implemented in this scheduler
This addresses confusion raised in issue #152081.
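For reference, the closed-form SGDR expression mentioned above (without warm restarts) is the standard cosine annealing schedule:
```
\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\tfrac{T_{cur}}{T_{\max}}\,\pi\right)\right)
```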
## Related issue
[#152081](https://github.com/pytorch/pytorch/issues/152081)
## Testing
Doc-only change. Ran pre-commit to verify formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152936
Approved by: https://github.com/janeyx99
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.
We don't want to store the entire cubin both in file format and in memory for caching purposes, so we delete it before caching the data. In the unfortunate/unlikely event where we can't load/find the necessary file on warm start, skip the stored triton launcher, falling back to regular triton.
This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.
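A simplified sketch of the reload-on-delete idea (illustrative names, not the actual StaticCudaLauncher code):
```
import os
import tempfile

class CachedCubin:
    def __init__(self, cubin_path: str):
        # Keep the raw bytes so the kernel can still be loaded even if the
        # triton-generated cubin file is deleted before we get to it.
        with open(cubin_path, "rb") as f:
            self._cubin_bytes = f.read()
        self._cubin_path = cubin_path

    def ensure_on_disk(self) -> str:
        # Rewrite the bytes to a fresh file if the original path is gone.
        if not os.path.exists(self._cubin_path):
            fd, path = tempfile.mkstemp(suffix=".cubin")
            with os.fdopen(fd, "wb") as f:
                f.write(self._cubin_bytes)
            self._cubin_path = path
        return self._cubin_path
```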
Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it
Fixes #153030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
The reason we did this before is that it's how our older autograd.Function x Dynamo interaction worked, but we've since adopted newer designs that don't actually need the autograd.Function.ctx proxied into the graph.
We still need an fx.Proxy for the autograd.Function.ctx object, so whenever we do, we create one via discard_graph_changes.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
As in title
I'm not sure how the install_cmake script is used: I see it being called with 3.18, but when I look at the build jobs, some say 3.18 and others 3.31.
Just make everything install cmake via requirements-ci.txt. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm defaulting to installing through pip.
Also defaulting to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
Fixes #122757
## Test Result
```python
import torch
model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)
Traceback (most recent call last):
File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
loss = loss_fn(input=model_output, target=labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
return F.cross_entropy(
^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
return torch._C._nn.cross_entropy_loss(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
# E2E
### How to enable (you can override the dtype; if nothing is given, the default is fp8)
```
post_grad_fusion_options={
"activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
},
```
Differential Revision: D70522237
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925
Currently, AOTI only generates runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.
Also factored out the runtime assertion logic into a separate function.
We need to generate runtime asserts directly in Inductor instead
of just re-using the asserts from input graphs because we reuse the
same ShapeEnv as before. In particular, on subsequent graph passes,
we would immediately turn all of these assertions into noops,
because when we evaluated their expressions, we would see that
because we had a deferred runtime assert in the ShapeEnv, we
know "oh, of course this expression is True" already.
One example is below:
```
class Model(torch.nn.Module):
def forward(self, a, b, c):
nz = torch.nonzero(a)
ones = a.new_ones([nz.size(0), b.size(0)])
torch._check(ones.size(0) >= 1)
equals = torch.add(ones, c)
return equals
torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that `a` and the nonzero result have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.
In addition to the Inductor generated runtime asserts, we also
need the runtime asserts from the input graph, because some derived
runtime asserts are not generated in Inductor. One example is
below:
```
class Model(torch.nn.Module):
def forward(self, x):
y = x.reshape(100, -1).clone()
y = y + 1
return y
dynamic_shapes = {
"x": {0: torch.export.Dim.DYNAMIC},
}
x.shape[0] needs to be a multiple of 100.
```
See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.
Example:
```
def forward(self):
arg0_1: "f32[s35]";
arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)
#
mod: "Sym(Mod(s35, 100))" = sym_size_int % 100; sym_size_int = None
eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0; mod = None
_assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'"); eq_2 = _assert_scalar = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]); arg0_1 = None
clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view); view = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1); clone = None
return (add_6,)
```
Generated cpp code:
```
auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
auto arg0_1 = std::move(inputs[0]);
auto arg0_1_size = arg0_1.sizes();
int64_t s35 = arg0_1_size[0];
inputs.clear();
auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```
Differential Revision: D73596786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`; NVSHMEM's include and lib directories live under `nvidia/nvshmem` there (see the sketch after this list)
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other NVIDIA libs)
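A rough sketch of the detection logic described in the first bullet (paths are assumptions based on the description, not the exact build code):
```
import os
import sysconfig

purelib = sysconfig.get_path("purelib")  # e.g. <env>/lib/python3.x/site-packages
nvshmem_root = os.path.join(purelib, "nvidia", "nvshmem")
include_dir = os.path.join(nvshmem_root, "include")
lib_dir = os.path.join(nvshmem_root, "lib")

# If both directories exist, the pip-installed NVSHMEM is available and
# NVSHMEMSymmetricMemory can be built; otherwise it is skipped.
nvshmem_available = os.path.isdir(include_dir) and os.path.isdir(lib_dir)
```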
### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` than torch carrying the bits)
2. Developer experience: at compile time, prefer a wheel dependency over a Git submodule
General rule: use a submodule for a small lib that torch can statically link with
If users pip install a lib, our CI build process should do the same, rather than building from a Git submodule (just for its headers, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available; if not, we don't build `NVSHMEMSymmetricMemory` at all.
For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
Fixes an error message in test/test_optim.py
Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".
Fix: The error message now says correctly: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
This fixes an issue in the TORCH_CHECK error message in the FusedSgdKernel.
Current behavior: If the LR tensor is not on the same device as the parameters, the error message reads: "found_inf must be on the same GPU device as the params".
Fix: The error message now correctly points out "lr must be on the same GPU device as the params".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153074
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
Although Dim.AUTO covers cases where a user sets more axes to be dynamic than the model actually needs, it silently falls back to STATIC when DYNAMIC fails. This increases the difficulty of debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
Differential Revision: D74218923
Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
The current FR code is built with `USE_C10D_NCCL`; we should remove that to make it generic. We keep the existing API used by NCCL so that we have some BC compatibility, because lots of use cases are around FR with NCCL. The generic version with c10::Event can then be used for other backends like Gloo, etc.
The existing unit tests should cover the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152563
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
ghstack dependencies: #152585
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.
This PR handles iterator exhaustion issues by utilizing the janky solution from #152865; it passes a function to generate kwargs and `frame_state.f_locals` is updated with fresh iterators through a second kwarg generation pass after initial tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.
Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround via accepting a func for kwarg generation and replacing `frame_state.f_locals` with newly-generated kwargs to get fresh iterators, but insights into a better approach are welcome!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
A logical block opening on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)
closes on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)
with mis-matching arguments.
```
by removing the condition in `endif`.
We could instead fix the condition; however, that is not best practice. For example, cmake_lint warns about it, and CMake says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01, https://github.com/Skylion007
Prep change for getting rid of `_mac-test-mps.yml`
A complete no-op for now, but it will be used by the PR above this one in the stack; they should be landed a few days apart to avoid forcing lots of people to rebase their PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153012
Approved by: https://github.com/wdvr
The issue was that
- symbol ids appeared out-of-order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this causes codegen to fail because it expects all the base symbols `s4, s3` to have been codegen-ed already.
- we can skip codegen-ing sympy exprs with multiple symbols, e.g. `(s3 - 1) + s4`, because `s3` and `s4` will be codegen-ed by other inputs.
```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
Noticed some of these ops were contributing to a big chunk of the runtime for OpenLLama as well as a few other benchmarks
At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions & log ops in a follow-up PR, as this one is already a bit large
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.
This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
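A minimal sketch of the shape of the fix, under stated assumptions (illustrative class, not the actual OptimizedModule implementation):
```
import torch

class OptimizedModuleSketch(torch.nn.Module):
    def __init__(self, mod: torch.nn.Module):
        super().__init__()
        self._orig_mod = mod
        # Compiling mod.__call__ already bakes in hook firing and the rest of
        # the default nn.Module.__call__ logic.
        self._compiled_call = torch.compile(mod.__call__)

    def __call__(self, *args, **kwargs):
        # Bypass the inherited nn.Module.__call__ so hooks do not fire twice.
        return self._compiled_call(*args, **kwargs)
```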
Fixes #149502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
## Summary
- Added --no-install-recommends flag to all apt-get install commands to reduce unnecessary dependencies
- Added apt-get clean after package installations to remove package cache and reduce image size
- Combined multiple apt commands into single instructions to reduce Docker image layers
## Test plan
Test by building the devcontainer and verifying functionality while ensuring reduced image size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152882
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/Skylion007
## Summary
- Replaced miniconda base image with base Ubuntu 22.04 image
- Installed Python and required dependencies using apt
- Replaced conda-based CUDA installation with apt-based version
- Updated paths in install-dev-tools.sh to reflect the new non-conda environment
- Removed conda-specific files and added requirements.txt for Python dependencies
## Test plan
Test by building and running the devcontainer in VS Code with both CPU and CUDA configurations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152881
Approved by: https://github.com/atalman
## Summary
- Changed the devcontainer context path from '../..' to './' for both CPU and CUDA configurations
- Added workspace mount configuration to properly mount the repository in the container
- Added containerEnv to disable implicit --user pip flag
## Test plan
Test by building and running the devcontainer in VS Code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152880
Approved by: https://github.com/atalman
Teach the graph printer to allow overriding how SymTypes (`SymInt`, `SymFloat`, `SymBool`) are printed, and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` to make computing the cache key faster.
On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).
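Conceptually, the override looks something like this sketch (an illustrative hook, not the actual FX printer API):
```
import torch
from torch._inductor.utils import sympy_str

def print_arg(value, default_str):
    # Fast path: stringify symbolic values via their underlying sympy
    # expression instead of the slower default printing.
    if isinstance(value, (torch.SymInt, torch.SymFloat, torch.SymBool)):
        return sympy_str(value.node.expr)
    return default_str(value)
```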
Fixes #151823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
A lot of last-minute bugfixes for CUTLASS Blackwell that we should upstream. It's a header-only library and a minor release, so this should strictly improve compiler support and fix some bugs. We needed to update some instruction numbers in torch.compile baselines for the new kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
This PR refactors CompiledFxGraph by adding a new post_compile step that only runs on cache hit. This refactors a bunch of code in _lookup_graph to its own function so that we can use it in BundledAOTAutogradCacheEntry. No difference in behavior here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152839
Approved by: https://github.com/oulgen
ghstack dependencies: #152836
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self contained, i.e. it contains all of the CompiledFxGraph directly in the entry, instead of relying on FxGraphCache._lookup_graph.
Because doing this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in caching scenarios, only for precompile use cases. Thus, it's important we make this whole setup generic, to be able to support these two workflows clearly.
This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.
Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.
For simplicity, this PR only implements the generic code refactors needed, but does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll handle and implement BundledAOTAutogradCacheEntry in the PR above this, for easier review.
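The genericization can be pictured roughly like this (names are illustrative, not the actual cache classes):
```
from dataclasses import dataclass
from typing import Generic, TypeVar

TForward = TypeVar("TForward")
TBackward = TypeVar("TBackward")

@dataclass
class GenericEntrySketch(Generic[TForward, TBackward]):
    # A cache-lookup-based entry stores lightweight handles resolved via
    # FxGraphCache._lookup_graph; a bundled entry stores the compiled
    # artifacts (e.g. a full CompiledFxGraph) inline.
    forward: TForward
    backward: TBackward
```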
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)
This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
# Sub-PRs
These PRs contain refactors from the main one. They should be reviewed and merged first.
- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587
# Feature
The goals of this PR are twofold.
## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.
In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.
This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.
## Goal 2: Convert Wrapper IR into FX IR.
One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.
It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.
The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.
Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.
# Current status
Things that seem to work:
- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
- Handled the following cases, in addition to what was already in the Memory Planning IR:
- Comments
- Triton kernels
- Extern/fallback kernels
- Freeing tensors (`del buf0`)
- MultiOutput
- Graph outputs
- ReinterpretView / StorageBox, for both call args and outputs.
- FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
- Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
- Calling wrapped Triton kernels.
- Calling extern kernels and certain types of fallback kernels.
- Support both `extern_kernels.*` and `aten.*`.
- Support multi-output kernels like `torch.topk`.
- Graphs with multiple inputs/outputs.
- Training i.e. calling `Tensor.backward()` in a compiled function.
- Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.
Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
- Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
- Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
- CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.
# Out-of-tree compilers
With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.
```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen
class MyCustomBackend(WrapperFxCodegen):
def compile_graph(self, gm):
# Add 1 to the graph's outputs
def compiled_fn(*args):
return [x + 1 for x in gm.graph.forward(*args)]
return compiled_fn
```
# Example FX graphs
This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.
Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
return (buf0,)
```
Here's a more complicated graph that calls a `torch.addmm` extern kernel.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=2] = placeholder[target=arg1_1]
%buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
%buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
return (buf2,)
```
Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
%buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
%triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
return (buf1, buf2)
```
Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
%buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
return (buf0_view_buf0_0,)
```
Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
%s6 : [num_users=0] = placeholder[target=s6]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
return buf0
```
Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
%s10 : [num_users=0] = placeholder[target=s10]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
return buf0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).
# Solution
Align with other backends by adding an `including_emulation` flag to `torch.xpu.is_bf16_supported` (see the sketch after this list), and:
- return `False` if XPU is not available
- return `True` if `including_emulation` is True
- return `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is False, it means if the device could generate SPIRV code for bf16.
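Putting the three cases together, the intended behavior is roughly the following sketch (based on the list above, not the exact implementation):
```
import torch

def is_bf16_supported(including_emulation: bool = True) -> bool:
    if not torch.xpu.is_available():
        return False
    if including_emulation:
        return True
    # Whether the device can generate SPIR-V code for bf16.
    return torch.xpu.get_device_properties().has_bfloat16_conversions
```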
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
Beyond cleanups and refactoring, this PR does two things:
1) do not use propagate_real_tensors to resolve evaluation under guard_or_true/guard_or_false.
2) do not guard on dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
Fixes #122757
## Test Result
```python
import torch
model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)
Traceback (most recent call last):
File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
loss = loss_fn(input=model_output, target=labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
return F.cross_entropy(
^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
return torch._C._nn.cross_entropy_loss(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/mikaylagawarecki
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`
Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only contains the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.
Test Plan:
Added test under `test/cpp/nativert/test_tensor_meta.cpp`
Differential Revision: D73820548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:
https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554
shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.
**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.
After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
Summary: The default for ls on the local filesystem is details=False, but this isn't the case for all filesystems (e.g. huggingface), so we set details=False explicitly so that the return type of ls is a list of strings rather than the list of dictionaries it would be with details=True.
Test Plan: tested in notebook
Differential Revision: D74080572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152693
Approved by: https://github.com/joecummings
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
Summary:
Allow TMA workspace creation to accept a `num_programs` specification, which defaults to `num_sms` when not specified.
We need a total of `num_programs * num_tma_descriptors` descriptors for a kernel.
Test Plan: CI.
Differential Revision: D74189599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152844
Approved by: https://github.com/drisspg
Summary:
1. Adding `input` field to `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously would error)
3. Pass `args` as input to `_adapt_flat_args`
These changes are made to update the InputAdapter
see more context in D73811508
Test Plan: see D73811508
Differential Revision: D73945419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
Summary: Replace the key (path) `<hash>.best_config` with `<parent_dir>/<hash>.best_config` to ensure that Autotune artifacts in MegaCache are loaded to the correct location locally.
Test Plan: NA
Differential Revision: D74052400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152678
Approved by: https://github.com/oulgen
# Sub-PRs
These PRs contain refactors from the main one. They should be reviewed and merged first.
- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587
# Feature
The goals of this PR are twofold.
## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.
In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.
This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.
## Goal 2: Convert Wrapper IR into FX IR.
One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.
It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.
The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.
Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.
# Current status
Things that seem to work:
- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
- Handled the following cases, in addition to what was already in the Memory Planning IR:
- Comments
- Triton kernels
- Extern/fallback kernels
- Freeing tensors (`del buf0`)
- MultiOutput
- Graph outputs
- ReinterpretView / StorageBox, for both call args and outputs.
- FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
- Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
- Calling wrapped Triton kernels.
- Calling extern kernels and certain types of fallback kernels.
- Support both `extern_kernels.*` and `aten.*`.
- Support multi-output kernels like `torch.topk`.
- Graphs with multiple inputs/outputs.
- Training i.e. calling `Tensor.backward()` in a compiled function.
- Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.
Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
- Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
- Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
- CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.
# Out-of-tree compilers
With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.
```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen
class MyCustomBackend(WrapperFxCodegen):
def compile_graph(self, gm):
# Add 1 to the graph's outputs
def compiled_fn(*args):
return [x + 1 for x in gm.graph.forward(*args)]
return compiled_fn
```
# Example FX graphs
This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.
Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
return (buf0,)
```
Here's a more complicated graph that calls a `torch.addmm` extern kernel.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=2] = placeholder[target=arg1_1]
%buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
%buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
return (buf2,)
```
Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
%buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
%triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
return (buf1, buf2)
```
Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
%buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
return (buf0_view_buf0_0,)
```
Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
%s6 : [num_users=0] = placeholder[target=s6]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
return buf0
```
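For illustration, here is a hedged sketch of how a consumer of this graph might look up that size hint from the placeholder's metadata; the helper name is made up for this example and is not part of the PR.
```python
# Hypothetical helper (not part of the PR) showing where the hint lives.
def size_hint_for_symbol(gm, name="s6"):
    for node in gm.graph.nodes:
        if node.op == "placeholder" and node.target == name:
            return node.meta.get("val")  # the size hint Inductor stored for this symbol
    return None
```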
Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
%s10 : [num_users=0] = placeholder[target=s10]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
return buf0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.
## Test Cases:
0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested in functionalize
5. torch.compile() nested in vmap + grad
Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
Change CI docker images to be `ci-image:<image name>-<folder sha>` instead of `<image name>:<folder sha>` so we never have to make a new ecr repo ever again
Pros:
never have to make a new ecr repo ever again
Cons:
if it ain't broken, don't fix it?
Don't need to change linux-test images since they use the "full name" of the image with the docker registry and the tag
In order to prevent others needing to rebase past this PR, also push the image to the "old name". This can be removed after this PR has been in main for a while
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152209
Approved by: https://github.com/seemethere, https://github.com/atalman
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.
This check was a bit brittle, and missed cases where input ops accept list of tensors - I updated it to check if a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.
There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.
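For illustration only, a minimal sketch of what such a check can look like; the internal helpers below (`FunctionalTensorMode`, `_get_current_dispatch_mode_stack`) are assumptions about the mechanism rather than the exact code in the PR.
```python
from torch._subclasses.functional_tensor import FunctionalTensorMode
from torch.utils._python_dispatch import _get_current_dispatch_mode_stack

def functional_mode_active() -> bool:
    # True while a functionalization mode is on the dispatch-mode stack,
    # i.e. while AOTDispatcher is in the middle of running.
    return any(isinstance(m, FunctionalTensorMode) for m in _get_current_dispatch_mode_stack())
```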
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
Summary: This diff implements an AsyncManifoldCache class that performs cache writes and TTL-update operations asynchronously. Essentially, we are OK with a fire-and-forget approach where we don't guarantee that we can observe our writes; this gives us better runtime latency.
Test Plan: added new unit test
Reviewed By: jamesjwu
Differential Revision: D73867797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24., 3.]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
### Summary
Recreating #151990 to mitigate easyCLA failure
The compute_global_tensor_shape util function takes in the local tensor shape, the device mesh, and the placements. We all-gather the shapes from the shards and construct the global shape according to the placement type.
Note: currently only implemented for the placement types Shard and Replicate; StridedShared is a TODO.
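For illustration, a minimal sketch of the idea under simplifying assumptions (a single 1-D mesh dimension, a plain process group instead of a DeviceMesh, and `shard_dim=None` standing in for Replicate); the actual util operates on DeviceMesh and placement objects.
```python
import torch.distributed as dist

def compute_global_shape_sketch(local_shape, group, shard_dim=None):
    # All-gather every rank's local shape as plain Python lists.
    gathered = [None] * dist.get_world_size(group)
    dist.all_gather_object(gathered, list(local_shape), group=group)
    if shard_dim is None:          # Replicate: every rank holds the full tensor
        return tuple(gathered[0])
    global_shape = list(gathered[0])
    # Shard(shard_dim): global size along that dim is the sum of the local sizes.
    global_shape[shard_dim] = sum(shape[shard_dim] for shape in gathered)
    return tuple(global_shape)
```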
### Test
`pytest test/distributed/tensor/test_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
Adds the `is_triton_capable` and `raise_if_triton_unavailable` class methods to the device interface, to allow device types to run their own checks for Triton _capability_ (which means a device can actually support Triton in the first place) and _availability_ (if the correct backend of Triton is installed and is functional for the device).
Using the device interface allows us to do these checks in a device-agnostic way and lets external backends attest their Triton support by simply implementing those methods. The intention is for this to back things like the `has_triton` utility method.
This has been split from #139171.
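For illustration, a hedged sketch of what an out-of-tree backend's device interface could look like; the method bodies, the class name, and the registration call are assumptions for this example rather than part of the PR.
```python
from torch._dynamo.device_interface import DeviceInterface, register_interface_for_device

class MyDeviceInterface(DeviceInterface):
    @classmethod
    def is_triton_capable(cls) -> bool:
        # "capability": can this device type support Triton at all?
        return True

    @classmethod
    def raise_if_triton_unavailable(cls) -> None:
        # "availability": is a functional Triton backend installed for this device?
        try:
            import triton  # noqa: F401
        except ImportError as exc:
            raise RuntimeError("Triton is not available for this device") from exc

# Out-of-tree backends would register their interface so utilities like
# has_triton can consult these checks.
register_interface_for_device("privateuseone", MyDeviceInterface)
```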
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152529
Approved by: https://github.com/jansel
Two error messages in the codebase instruct the user to use `Tensor.dense()`. This method doesn't exist, but `Tensor.to_dense()` does, and this is what the user should be using instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152631
Approved by: https://github.com/jansel
This is my suggestion for resolving #152087
This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
`torch/csrc/utils.h` should be device-independent. Currently, it contains CUDA-related implementations, which indirectly causes the [failure of ROCm testing](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038) (The reason is that the ROCm test environment shouldn't expose HIP-related header files, which causes the JIT compilation to fail during testing)
Therefore, move CUDA-related implementations to `torch/csrc/cuda/utils.h`.
**Question:**
This change may introduce a BC break.
I searched for this function globally on github and I think the impact is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152521
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #152512, #152513
Summary:
change set
1. A ShardedTensor could have 0 size initially; the current check won't pass if the size is 0, so that case is added here.
2. When we call ShardedTensor._init_from_local_shards, it assumes all the metadata is correct instead of all_gathering to double-check. In the new case, the metadata could be all 0 size while the tensor has an actual size, so we need the capability to recalculate the local/global metadata from the local tensor by all_gathering the information.
Test Plan: I don't see an associated UT; I have tested this with the diff stack D73274786.
Differential Revision: D73903933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
What was initially supposed to be a very straightforward change resulted in a small refactor of the binary-op tensor generators when invoked for mixed dtypes, which surfaced via a `test_output_grad_match_sinc_mps_float16` test failure.
If operands are of different dtypes (in particular a float16 tensor and a float32 scalar), one must perform the operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to the output dtype and performing it then, as demonstrated by the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555, 19.4844, -7.0703, -10.6562, 27.0000, 18.7812],
dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926, 8.5938, 5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516, 19.4844, -7.0664, -10.6562, 26.9844, 18.7656],
dtype=torch.float16)
```
Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that operands must be cast to opmath_t before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring tons more compile time, via async compilation. Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add for addmm
* Enable for Inference and AOTI
Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:
<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />
TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />
We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.
Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
I saw this warning when compiling a 3rd party lib and did not agree with it. I'm not sure the original reason why we would want to force people to pass in TORCH_CUDA_ARCH_LIST to cmake vs set it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning was from before 2018!!! 7 years ago! And there are no plans to actually enforce this (nor should there be), so let's remove this misleading warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet, https://github.com/zou3519
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.
When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.
This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.
Test Plan:
contbuild and OSS CI
Reviewers: paulzhan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.
This PR introduces a new wrapper IR line to represent symbolic call args. This deletes a little bit of duplicated code between the Python and C++ backends. In the main PR, having a Wrapper IR line for this also tells the FX backend what this part of the wrapper code is doing. Before this PR, symbolic call args generated raw Python lines, which confuse the FX converter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152587
Approved by: https://github.com/jansel
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:
https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554
shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.
**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.
After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
Summary:
### Diff Context
- Sometimes the background process can get stuck processing an async checkpoint request, and trainer shutdown can occur before the background process completes.
- Fix: time out the thread while reading the IPC queue for a response from the background process.
Differential Revision: D74017700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152629
Approved by: https://github.com/saumishr
This PR enables fp8 distributed tests on MI300.
To test the added feature, we ran the distributed.tensor.parallel.test_micro_pipeline_tp test; all tests passed successfully and none were skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false: it can be expressed with guard_or_true and guard_or_false.
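A minimal sketch mirroring this claim (the import path for these helpers is assumed to be `torch.fx.experimental.symbolic_shapes`):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def definitely_true_equivalent(expr):
    # per the description, definitely_true(expr) is (almost) guard_or_false(expr)
    return guard_or_false(expr)

def definitely_false_equivalent(expr):
    # definitely_false(expr) can be expressed through guard_or_true(expr)
    return not guard_or_true(expr)
```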
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
In some internal frameworks, on second attempts the actual code is copied to a different path than on previous attempts, but it is still the same code. PGO will not work in those cases because state entries before this PR used to be identified by (filepath, function name, line number); after this PR they are identified by (hash(filepath), function name, line number). This way PGO will work for those jobs on future attempts, and re-compilations of static versions will be avoided.
Sometimes we do not have access to the source code (the file does not exist). This seems to happen mostly when we re-trace a compiled function, but generally it can happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
Add additional conditions to `build_pytorch_libs.py` to avoid fetching NCCL when `USE_CUDA` or `USE_NCCL` are disabled. While at it, adjust the existing condition for `USE_SYSTEM_NCCL` to use the utility function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152533
Approved by: https://github.com/albanD
So when we use mark_unbacked, the graph will have unbacked input symInts. Right now, deferred runtime assertions that use those are never generated.
This PR changes that, such that in the forward graph we consider those symbols and generate the corresponding runtime assertions for them. We still ignore them for backward, which is not ideal.
The way we generate runtime assertions is by emitting them once all the unbacked symbols they use have been seen.
We previously skipped placeholders because for backward we have a wacky approach where we ignore input-defined unbacked symbols and assume assertions that use them were already emitted in forward, and we try to emit all other runtime assertions again. See [Note [Backwards runtime asserts]].
Doing that, we end up only emitting the runtime assertions that depend on things defined solely in backward, but we could miss checks that span inputs defined in both backward and forward (i.e., one symbol defined in forward and passed as input to backward, and another that is defined in backward). This is not ideal; an ideal approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.
However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these allocations may end up in the same mem_pool.
So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id.
Fixes #151199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison, https://github.com/ngimel
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
Today, we mark graph outputs as maybe-dynamic; this lets a compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this for saved activations, which may be used in future compilations as well. This is especially prevalent in compiled autograd, where tensor activations will always become graph inputs.
Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
Summary: Discovered when attempting to resolve arvr builds, should resolve issues around utilizing functorch through export.
Test Plan:
```
buck2 test arvr/mode/linux/opt //arvr/libraries/xrrp/ml/python/test:convert_to_etvk_test
```
Differential Revision: D74013898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152620
Approved by: https://github.com/zou3519
When multiple checkpoint regions are back-to-back with no operations in-between, we enforce the operation at the boundary to be force-saved, see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)
When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.
Adding tests would be nice, but not sure what's the best way to test this right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
Summary: Previously D70489427 changed tanh impl to `.tanh()`, and this is causing some meta internal workload perf regression. This diff will introduce a config so we can set it based on need.
Differential Revision: D73909371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
Fixes #151582
example warning for Dim.AUTO:
```
torch/_export/non_strict_utils.py:499] dimension inputs['x'].shape[1] 0/1 specialized; Dim.AUTO was specified along with a sample input with hint = 1.
```
example error when Dim.DYNAMIC specializes:
```
- Received user-specified dim hint Dim.DYNAMIC(min=None, max=None), but export 0/1 specialized due to hint of 0 for dimension inputs['x'].shape[0].
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151827
Approved by: https://github.com/angelayi
Adding NVSHMEM as a backend for `SymmetricMemory`, implementation of which is in `NVSHMEMSymmetricMemory.cu`.
Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).
Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors, currently support: `CUDA` (in-house impl), `NVSHMEM`.
The NVSHMEM feature is gated by build-time flag: `USE_NVSHMEM=1`. And `NVSHMEM_HOME` setting is required (TODO).
Ported most code from #146593.
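For illustration, a minimal sketch of selecting the backend via the new environment variable (the build must have been done with `USE_NVSHMEM=1` and `NVSHMEM_HOME` set for the NVSHMEM path):
```python
import os

# Pick the SymmetricMemory implementation used for CUDA tensors.
os.environ["TORCH_SYMMEM"] = "NVSHMEM"   # or "CUDA" for the in-house implementation
```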
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
We need to make function schemas proxyable to trace the auto_functionalized hop, which takes a function schema as input. The implementation basically follows how we support torchbind objects:
1. Upon seeing an untracked function schema arg, we create a constant get_attr node.
2. We track the function schema argument in export to support lift/unlift.
3. We need to support serde for function schemas. We'll add support for this in follow-up PRs.
However, compared with torchbind objects:
1. We don't need a dynamo implementation, because the function schema is added as an argument of auto_functionalized when we auto_functionalize a hop. One potential use case is a user re-tracing an exported program with strict mode; since non-strict is the default now, we don't see a use case yet.
2. We don't need an inductor implementation, because the function schema will go away after the auto_functionalized re-inplacing pass.
edit: we greatly simplified (and generalized) the implementation following @zou3519's suggestion of using pytree.register_constant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152073
Approved by: https://github.com/zou3519
ghstack dependencies: #152072
Test Plan:
Dumped the local net torch.package to local
Ran
```
buck2 run scripts/shengqin:test_model_export -- /tmp/mtia_local_torch_package {\"local\":null}
```
succeeded
Reviewed By: hongyang-zhao
Differential Revision: D73405271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152039
Approved by: https://github.com/houseroad
Summary:
Same with D71358949, but removing newly added log to avoid test failures.
Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238
Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938
Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675
Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`
Differential Revision: D73934142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152536
Approved by: https://github.com/wdvr
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
Today `cpp_extensions` makes heavy use of printing to stderr. This makes our life harder in KernelBot, where we typically rely on stderr to only surface real errors, but today cpp_extensions uses stderr for updates that could be qualified as INFO, WARNING, or ERROR.
Now instead we'll recommend users of our cpp extension system to do something like
```python
import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)
```
While this dramatically reduces log spew, it can be viewed as a BC breaking change if people were relying on certain strings being present in stdout or stderr
Considering different teams might want to silence errors differently, this PR proposes replacing all `print()` statements with `logging` statements with the same heuristics that the python logging module recommends
1. DEBUG: For things like detailed compilation steps or reading filepaths - by default gets logged on stdout
2. INFO: Build progress - by default gets logged on stdout
3. WARNING: Surfacing issues that might cause bad performance or slow compilation times - by default gets logged on stdout
4. ERROR: Problems that prevent proper functioning - by default gets logged on stdout
Note that warnings.warn is a different library and is not hooked up to the python logging module by default
So the goal of this PR is to make it possible for teams to set the logging that is most appropriate to them. One annoying thing is that the logger throws ruff errors if you try to use it with f-strings or .format, so we have to use old-school %s formatting.
An unrelated improvement I'd be happy to push to a separate PR is adding support for "native" in `TORCH_CUDA_ARCH_LIST`, which would just pick the arch for the current device.
An example of what's in stderr today
```
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/grayscale/build.ninja...
/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module grayscale...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module grayscale...
/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:679: UserWarning: Graph break due to unsupported builtin grayscale.PyCapsule.grayscale. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
torch._dynamo.utils.warn_once(msg)
```
Whereas after this PR users can do
`python benchmark_load_inline.py > >(tee stdout.txt) 2> >(tee stderr.txt >&2)`
```python
import os
import sys
from pathlib import Path
import shutil
import tempfile
import torch
from torch.utils.cpp_extension import load_inline
import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)
os.environ["TORCH_CUDA_ARCH_LIST"] = "native"
cpp_code = """
torch::Tensor to_gray(torch::Tensor input);
"""
cuda_kernel_code = """
torch::Tensor to_gray(torch::Tensor input) {
auto output = torch::epty({input.size(0), input.size(1)}, input.options());
return output ;
}
"""
# Avoid caching results
with tempfile.TemporaryDirectory() as build_dir:
    cuda_module = load_inline(
        name="to_gray_cuda",
        cpp_sources=cpp_code,
        cuda_sources=cuda_kernel_code,
        functions=["to_gray"],
        with_cuda=True,
        verbose=True,
        extra_cflags=["-std=c++17"],  # "-ftime-report", "-H"],
        extra_cuda_cflags=["-arch=sm_89"],
        build_directory=build_dir,
    )
```
## New logs
### On failure
Which gives a much more reasonable stdout
```
[1/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
/tmp/tmpbg_xzv0r/cuda.cu(6): error: namespace "torch" has no member "epty"
auto output = torch::epty({input.size(0), input.size(1)}, input.options());
^
1 error detected in the compilation of "/tmp/tmpbg_xzv0r/cuda.cu".
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpbg_xzv0r/main.cpp -o main.o
ninja: build stopped: subcommand failed.
```
And stderr
```
Traceback (most recent call last):
File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2874, in _run_ninja_build
subprocess.run(
File "/home/marksaroufim/.conda/envs/nv/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/marksaroufim/load_inline_slow/benchmark_load_inline.py", line 30, in <module>
cuda_module = load_inline(
File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2261, in load_inline
return _jit_compile(
File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2367, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2528, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2892, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'to_gray_cuda'
```
### On success
stdout
```
[1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpxv_ovlrf/main.cpp -o main.o
[2/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpxv_ovlrf/cuda.cu -o cuda.cuda.o
[3/3] c++ main.o cuda.cuda.o -shared -L/home/marksaroufim/pytorch/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-12.8/lib64 -lcudart -o to_gray_cuda.so
```
And an empty stderr as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152260
Approved by: https://github.com/albanD
This PR adds support for `sm_121` of the DGX Spark. The `sm_121` is binary compatible with `sm_120` (just like `sm_89` and `sm_86`), therefore a compilation targeting `sm_121` is not required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152314
Approved by: https://github.com/eqy
This relaxes the guard introduced in #100444 (which aggressively guards
on the object id, even though Dynamo is just tracing its `__call__` method).
This allows users to bypass the high compilation time issue in #150706
by compiling transformer blocks only. Without this patch, we'd get lots
of unnecessary recompilation, as the block has different attention
processor instances.
Compiling blocks only _significantly_ speeds up compilation process
(from ~310s to ~32s), and even speeds up e2e performance for some reason
(7.83s to 7.67s).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152395
Approved by: https://github.com/anijain2305
ghstack dependencies: #152369
Seems there was a typo where `set_device` was called when the intent was to use `current_device`
As-is the test will fail on multigpu systems with
`TypeError: set_device() missing 1 required positional argument: 'device'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152474
Approved by: https://github.com/Skylion007
The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB).
To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this.
Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360
Approved by: https://github.com/cyyever, https://github.com/Skylion007
For 3rd-party devices, calling ``instantiate_device_type_tests()`` while explicitly passing a ``str`` object (rather than a ``List[str]``/``Tuple[str]``) to the ``only_for`` or ``except_for`` argument currently causes unexpected results.
For example, when calling ``instantiate_device_type_tests(TestXXX, globals(), only_for="cpu")``, it goes into [filter_desired_device_types()](f38dae76ee/torch/testing/_internal/common_device_type.py (L729)) and results in ``only_for=['c', 'p', 'u']``, because the ``only_for`` we passed is the string "cpu".
This PR fixes the above unexpected behavior for the ``str`` case.
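For illustration, a minimal sketch of the usage from the example above; with the fix, the plain string behaves like `["cpu"]`:
```python
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestXXX(TestCase):
    def test_noop(self, device):
        # `device` is a device string such as "cpu"
        self.assertTrue(device.startswith("cpu"))

# After this PR, only_for="cpu" is treated like only_for=["cpu"],
# not only_for=['c', 'p', 'u'].
instantiate_device_type_tests(TestXXX, globals(), only_for="cpu")

if __name__ == "__main__":
    run_tests()
```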
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152177
Approved by: https://github.com/albanD
First of all, by extending `c10::metal::cast_to` to work correctly with complex dtypes, introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as the default Metal typecast will turn `float x` into `float2(x, x)`)
Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
As well as `.add`/`.sub` with complex alpha
Before this change `python3 -c "import torch;print(torch.rand(10, device='mps', dtype=torch.complex64).add(torch.rand(10, device='mps', dtype=torch.complex64), alpha=.5j))"` used to fail with
```
RuntimeError: value cannot be converted to type double without overflow
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152479
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466
Differential Revision: D73626732
In this PR, we add support for bmm dynamic shape, provided that the batch stride is the biggest in the stride for A, B, and D. For example, for A of size `(B, M, K)`, we support stride `(M*K, K, 1)` and `(M*K, 1, M)`. With this assumption, we can infer the batch stride from existing arguments.
The reason is we don't want to add 2-3 more runtime params. The concerns are complexity and possible perf regression, though we didn't verify the latter.
We can revisit this if there is a need for that.
We also remove `B = 1` for normal mm and addmm. We tested it and didn't see perf regression. But open to revisiting this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152393
Approved by: https://github.com/ColinPeppler
Summary:
Fixes https://github.com/pytorch/pytorch/issues/151476
The `custom_meta` collected from `mod` has keys that follow name of nodes in `mod`, which are inconsistent with the node names after the naming pass. For example a constant `b` will become `c_b`.
Test Plan: buck2 run caffe2/test:test_export -- -r test_run_decompositions_keep_tensor_constant_metadata
Differential Revision: D73703068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152241
Approved by: https://github.com/angelayi
The Power System build is failing with the below error.
It started failing after this commit:
912102b4ec
Fix the build error along with the test cases that are failing for the complex double and complex float data types.
Build Failure Logs:
```
vec_base.h:790:6: error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator[]’
  790 |     c[i] = a[i] * b[i];
      |     ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator[]’
  802 |     c[i] = a[i] / b[i];
      |     ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator[]’
  790 |     c[i] = a[i] * b[i];
      |     ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator[]’
  802 |     c[i] = a[i] / b[i];
      |     ~^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152402
Approved by: https://github.com/malfet
When PyTorch is built with OpenBLAS support and libopenblas is directly linked with libgomp.so, libtorch_cpu.so ends up getting multiple OpenMP runtimes linked against it. This may result in unexpected runtime behaviour/regressions. This patch fixes this by avoiding linking against libomp.so if OpenBLAS is linked against libgomp.so.
Fixes #146603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
This PR implements the second codegen task of CUTLASS EVT: translating inductor epilogue nodes into python code that will be traced by the EVT infra.
Details:
The implementation uses a simple ops wrapper which only supports add and mul pointwise ops today (to be extended in the future). This ops wrapper generates python code from inner_fn of the epilogue nodes in the format EVT expects. The main caveat is that one of the outputs needs to be named "D" and the accumulator input needs to be named "acc". Reads/writes are named according to the inductor buffer names otherwise.
Previously merged:
* #150904
* #150903
* #150346
* #150345
* #150344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150905
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306
This is a new pass to replace the pre-existing passes. It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.
The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does not increase curr memory beyond peak memory
The pass logs a summary table for each graph to TORCH_LOGS=overlap.
e.g. (exact format may have been tweaked but this shows the idea).
```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
[rank
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
Summary: This PR adds a private configuration to the partitioner that ensures that the decision taken is the same across all ranks. This is a temporary workaround, as when size_hints are also taken into account in compiler collectives this workaround will not be needed anymore.
Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)
Differential Revision: D73666017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
Installs setuptools since I get
https://github.com/pytorch/pytorch/actions/runs/14736804186/job/41364832984#step:5:60
```
+ python3 -m tools.generate_torch_version --is_debug=false
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/tools/generate_torch_version.py", line 9, in <module>
from setuptools import distutils # type: ignore[import]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'setuptools'
```
It should be a no op in the normal lint workflow since setuptools is in the docker image
Switched from using python3.10 to system python, which should be python3.9
Use venv to put deps not in the base?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152433
Approved by: https://github.com/huydhn
Summary: I'm just trying to fix the test again. It's out of date because it's disabled and some dynamo_timed-related fields are gone now.
Test Plan: `python test/dynamo/test_utils.py -k dynamo_timed`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152387
Approved by: https://github.com/anijain2305
This PR adds detailed logging of each triton kernel we compile, and its autotune result, to every kernel we compile with triton. We add these results to a global variable that we then clear after each triton kernel compile.
We can't keep these objects around after compile time, so we can't record the autotune cache save or coordinate descent tuning, unfortunately, but we can log at least:
- The duration of compilation
- Whether or not autotune cache hit
- The best autotuning config, if there's only one.
Example triton kernel info: https://gist.github.com/jamesjwu/493bdd0f36b0b7e3ca327f87bd6c2c75
See internal diff for an example log for internal model.
Differential Revision: [D73674443](https://our.internmc.facebook.com/intern/diff/D73674443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152197
Approved by: https://github.com/oulgen, https://github.com/eellison
Ran into this problem while re-enabling `test_float_repr_dynamic_shapes`, where `_print_Max` was called with an integer and a long argument, which resulted in the following compilation error:
```
error: call to 'max' is ambiguous
out_ptr0[x0 + x1*metal::max(1, ks0)] = static_cast<float>(tmp26);
^~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:2477:16: note: candidate function
METAL_FUNC int max(int x, int y)
^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:3686:17: note: candidate function
METAL_FUNC long max(long x, long y)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152430
Approved by: https://github.com/dcci
ghstack dependencies: #152363
This PR adds an optimal reordering for minimizing #partitions.
## Optimal reordering for minimizing #partitions
A bfs could minimize #partitions (ignore peak memory for now):
1. For each node, compute node_to_indegree: dict[node, int].
2. Maintain 2 queues: cudagraphable_nodes, and non_cudagraphable_nodes. Iterate through all nodes and add nodes to one of these 2 queues if node_to_indegree[node] == 0.
3. While non_cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
4. While cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
5. Repeat step 3 & 4 until all nodes have been scheduled.
We call this strategy `reorder_for_minimizing_partition`.
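A minimal Python sketch of this two-queue BFS (the node representation and `successors` map are illustrative; the real pass operates on inductor scheduler nodes):
```python
from collections import deque

def reorder_for_minimizing_partition_sketch(nodes, is_cudagraphable, successors):
    # successors: dict mapping each node to the list of nodes that depend on it
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for s in successors[n]:
            indegree[s] += 1

    cuda_q, non_cuda_q = deque(), deque()

    def enqueue(n):
        (cuda_q if is_cudagraphable(n) else non_cuda_q).append(n)

    for n in nodes:
        if indegree[n] == 0:
            enqueue(n)

    order = []

    def drain(q):
        while q:
            n = q.popleft()
            order.append(n)
            for s in successors[n]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    enqueue(s)

    # Steps 3-5: exhaust the ready non-cudagraphable nodes first, then the
    # cudagraphable ones, and repeat until everything is scheduled.
    while cuda_q or non_cuda_q:
        drain(non_cuda_q)
        drain(cuda_q)
    return order
```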
**Q: Why is this optimal?**
Suppose this is not optimal, we have a counter example with 2 non_cudagraphable regions:
```
[non_cudagrable1, cudagraphable2, non_cudagraphable3]
```
where we can reorder to only 1 non_cudagraphable region:
```
[non_cudagrable1, non_cudagraphable3, cudagraphable2]
```
This reorder means non_cudagraphable3 does not depend on cudagraphable2. So after we scheduled non_cudagraphable1, both non_cudagraphable3 and cudagraphable2 have in_degree as 0. If this is true, Step 3 should have already scheduled non_cudagraphable3 before cudagraphable2 such that the counter example cannot exist.
This shows we cannot find such a counter example and the bfs is optimal on minimizing #partitions.
## Minimize peak memory
`reorder_for_peak_memory` currently uses topological_sort_dfs, topological_sort_lpmf, and topological_sort_bfs, where the later 2 are bfs. ILP brings small benefits and it can hardly scale to more than 100 nodes, according to @xuanzhang816. So ILP is not used for peak memory reorder in the inductor.
Heuristic strategy (see the sketch after this list):
- Conduct reorder_for_peak_memory as the default order
- Conduct reorder_for_minimal_partitions and get the result as list[tuple[partition, bool]], where partition is a list[BaseSchedulerNode] and the bool indicates whether the partition is cudagraphable
- If the reorder increases peak memory too much, we use the default order
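A rough sketch of that heuristic, with the two reorder passes and a hypothetical `estimate_peak_memory` helper passed in as parameters (illustrative only, not the actual Inductor code):
```python
def choose_order(
    nodes,
    reorder_for_peak_memory,          # existing pass: default order
    reorder_for_minimal_partitions,   # new pass: fewer cudagraph partitions
    estimate_peak_memory,             # hypothetical helper returning bytes
    tolerance=1.1,
):
    default_order = reorder_for_peak_memory(nodes)
    candidate = [
        node
        for partition, _is_cudagraphable in reorder_for_minimal_partitions(default_order)
        for node in partition
    ]
    # fall back to the default order if minimizing partitions regresses peak memory too much
    if estimate_peak_memory(candidate) > tolerance * estimate_peak_memory(default_order):
        return default_order
    return candidate
```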
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151968
Approved by: https://github.com/eellison
I frequently come to CONTRIBUTING.md to copy-paste the snippet below to rebuild PyTorch, which in zsh gives this error because zsh interprets `#` as a command. These comments add nothing, so just remove them:
```
error: pathspec 'sync' did not match any file(s) known to git
error: pathspec 'the' did not match any file(s) known to git
error: pathspec 'submodules' did not match any file(s) known to git
Building wheel torch-2.8.0a0+git9c01c87
invalid command name '#'
```
```
git submodule update --init --recursive # very important to sync the submodules
python setup.py develop # then try running the command again
git submodule update --init --recursive
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152259
Approved by: https://github.com/janeyx99
Remove conda usage from the TD LLM retriever job.
python3 in the base is Python 3.9 right now. I'm not sure what the best way to deal with a potentially different Python version would be; dnf install?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152338
Approved by: https://github.com/huydhn
Fixes #152344
The leak seems to be on the MPSGraph side: even though there is an identity tensor, it is apparently no longer enough to bypass the SDPA sequence, which seems to leak memory.
Even adding 0.0f seems to be optimized away and still takes the SDPA sequence (that's the reason for adding 1e-20).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Add precedence to the infix printing done by sympy_str.
Without this change, sympy_str will print the same string for both `a+b*(c+d)` and `(a+b)*(c+d)`.
While there, I also cleaned up the printing for `-a` and `a - b`.
Added some tests.
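As a small illustration of the ambiguity (this is not the sympy_str code, just a naive infix printer with no precedence handling):
```python
import sympy

a, b, c, d = sympy.symbols("a b c d")

def naive_str(expr):
    # infix printing with no notion of precedence: never adds parentheses
    if expr.is_Add:
        return " + ".join(naive_str(arg) for arg in expr.args)
    if expr.is_Mul:
        return "*".join(naive_str(arg) for arg in expr.args)
    return str(expr)

# both render without parentheses (e.g. "a + b*c + d"), so the two
# structurally different expressions become indistinguishable
print(naive_str(a + b * (c + d)))
print(naive_str((a + b) * (c + d)))
```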
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151920
Approved by: https://github.com/jansel
Summary:
Bucketization involves comparing an input with border values. Without careful consideration of dtypes, this can cause dangerous implicit casting.
aten.bucketize resolves this via dtype promotion. We enable dtype promotion for the Inductor bucketization pass so as to maintain alignment with the aten op.
Test Plan:
```
python3 test/inductor/test_torchinductor.py -k "bucketize"
```
Fixes #145929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150634
Approved by: https://github.com/davidberard98, https://github.com/eellison
Replaces the janky approach of creating an NSArray from the IntArrayRef and asking it to render its contents as a string with a stringstream.
This speeds up getting the key string for caching (or reading from the cache) for shaped inputs by ~5x. While the actual wall time, depending on the number of input tensors, is only a few microseconds, this time represents a non-negligible chunk of the overall time spent preparing to dispatch work to the GPU. And since this function gets called every time a (cacheable) MPS operation is used, it should be a small but broadly impacting time saver.
Using mps_linear as an example. Note this is before PR https://github.com/pytorch/pytorch/pull/152199 so it only captures the CPU time spent in the op call:
Before the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1108f07d0>
func(*args, **kwargs)
Median: 22.75 us
IQR: 0.87 us (22.50 to 23.38)
8361 measurements, 1 runs per measurement, 1 thread
```
After the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x108875350>
func(*args, **kwargs)
Median: 18.67 us
IQR: 0.46 us (18.50 to 18.96)
10342 measurements, 1 runs per measurement, 1 thread
```
This aligns with the observed change of getTensorStringKeys() taking ~1us instead of ~5us in the mps_linear op, which I got from a point measurement sandwiching the function call with `std::chrono::high_resolution_clock`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152202
Approved by: https://github.com/Skylion007
When CompiledFxGraph is deallocated, its cudagraphified fn (i.e., `current_callable`) is expected to also be deallocated.
Without graph partition, this is true since the cudagraphified fn is only referenced by compiled_fx_graph.current_callable.
However, with graph partition, runner.partitions hold the cudagraphified fns while compiled_fx_graph.current_callable holds runner.call. Thus the cudagraphified fns may not be deallocated when CompiledFxGraph is deallocated. This leads to errors in several unit tests (e.g., test_unaligned_static_input_no_cudagraphs and test_unaligned_static_input_non_trees).
In this PR, we also clean up runner.partitions when CompiledFxGraph is deallocated. This fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152066
Approved by: https://github.com/eellison
Summary:
We suspect that switching the NVCC host compiler from GCC to Clang while targeting multiple architectures is causing issues because only _CUDA_ARCH_LIST_ is being passed, without _CUDA_ARCH_.
To resolve this c10 compilation error, we first fix that problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur.
Test Plan: CI
Reviewed By: zhuhan0
Differential Revision: D73383236
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
As titled: we can just set new_local_tensor to be the local tensor and
remove the None check, as there are cases where no
transformation is needed (i.e., src_placements and dst_placements are the same,
and we still want to return the original local_tensor).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
Summary: The HF storage reader/writer component can work for any back-end in theory, so we shouldn't force the token to be passed into fsspec reader/writer, because the specific fsspec implementation may not handle tokens. Specifically, manifold doesn't accept a token arg, but we're always passing one in, which throws an error.
Test Plan: signals
Differential Revision: D73130679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151464
Approved by: https://github.com/Skylion007
`test_reproduce_121253_issue_addmm_fusion_check` checks for "mkl._mkl_linear" being found in the generated source, which cannot be there when MKL isn't available.
Add a skip marker similar to other tests in this file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152322
Approved by: https://github.com/Skylion007
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.
Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.
`RuntimeError: shape '[384]' is invalid for input of size 512`
After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.
`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`
Change Summary:
- add a 'strict_view' kwarg to the helper methods that implement view/reshape op shard prop rules, so it can be decided op-by-op whether to raise these new errors
- enable errors just for the 'view' op in this PR
- add two specific checks/errors that can occur during view ops
Details:
- View ops are never allowed to flatten a dimension that is unevenly
sharded, since that would likely change the size/content of the
local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
dim is a Shard() placement, because it would cause contiguity errors
without redistribution
Notes:
- Disables support for several ops in test_dtensor_ops.py test, which
decompose to an illegal view that only works by performing a
redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron
Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
redistribution (ban them all, and document this)
Fixes #143372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
Using `logging.basicConfig` to set the root logger's level is not good behavior, because it affects downstream third-party testing plugins. Fix common_distributed.py to set the level for the current logger only.
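A minimal sketch of the direction of the fix (illustrative; not the exact common_distributed.py change):
```python
import logging

# Before: configuring the root logger leaks the level into every logger in the
# process, including third-party pytest plugins.
# logging.basicConfig(level=logging.INFO)

# After: scope the level to this module's logger only.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
```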
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319
Approved by: https://github.com/Skylion007
This was discussed with @eellison and he recommended using statically_known_true here. The intuition is: we already have 0/1 specializations in place; if we reach those checks with dynamic shapes that are not already specialized,
then we do not want to specialize them ("a recompilation here is not justified").
These are all non-semantics-changing optimizations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148893
Approved by: https://github.com/eellison
Summary:
# Context:
When a memory leak happens, it usually triggers an OOM in later iterations. The snapshot of a full iteration will be huge and hard to interpret.
On the CUDA side, there is an OOM observer which generates a snapshot with the latest 1,500,000 entries for debugging when an OOM happens.
In this diff, we implement the same feature on the MTIA side.
Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```
As shown, the memory snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}
Differential Revision: D71993315
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
Fixes #151522
This PR fixes the issue that Dynamo fails to trigger a graph break for sparse tensors in certain code paths. I added an additional check to handle this case, and it resolves the original problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151897
Approved by: https://github.com/jansel
Fix: #135099
This PR changes how we map the original inputs into the new set of
inputs that take in the tensor input's base instead of their aliases.
**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments into their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.
**Solution:** map the hashed arguments into a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
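A tiny sketch of the mapping change (the `hash_arg` helper is hypothetical; it stands in for however the arguments are hashed):
```python
from collections import defaultdict

def build_index_map(args, hash_arg):
    # Before: {hash: index} kept only one position per group of equal arguments.
    # After: {hash: [indices]} records every position, so each argument keeps a mapping.
    index_map = defaultdict(list)
    for i, arg in enumerate(args):
        index_map[hash_arg(arg)].append(i)
    return index_map
```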
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.
The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.
Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
MemPool is a separate pool of memory handled by the caching allocator. This PR adds the option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.
Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.
```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
Without MKL there is only 1 epilogue, not 2, because `addmm` is used instead of `packed_linear/_mkl_linear`.
This fails first at `TestSelectAlgorithmCPU.test_linear_with_in_out_buffer_batch_size_8_in_features_3_in_features2_192_image_size_224_out_features_64_bias_True_cpu_float32`.
Instead of skipping the whole test, just adjust the count for the single check.
Final numbers of `test/inductor/test_cpu_select_algorithm.py` without MKL:
```
Ran 1337 tests
OK (skipped=1211)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151548
Approved by: https://github.com/jansel
Solves the following problems with building the caffe2 HIP tests on Windows:
1. HIP tests now use `hip_add_executable` to be built with a custom_command invoking the HIP compiler, due to the lack of CMake support for HIP in 3.18 (currently used).
2. Fixes failures with "Command line too long", which resulted from `hip_add_executable` adding the same flags over and over on top of `HIP_HIPCC_FLAGS` with every test added.
3. Disables the `HasSameArgTypes` test on Windows, as `at::native::modern::detail` is nowhere to be found in the codebase (I think it must be a legacy thing). Perhaps the whole test should be removed/rewritten?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152014
Approved by: https://github.com/jeffdaily
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so exclude XPU builds until https://github.com/intel/torch-xpu-ops/pull/1615 is merged
Test plan:
- 0def9b4acc should fail MPS builds
```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
c10::optional<int64_t> extra) {
^~~~~~~~~~~~~
std::optional
```
- a769759dd4 should fail CUDA builds
```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
input, c10::nullopt, reduce_op, group_name, out);
^
1 error detected in the compilation of
```
Fixes https://github.com/pytorch/pytorch/issues/150313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature
Test plan:
- 0def9b4acc should fail MPS builds
```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
c10::optional<int64_t> extra) {
^~~~~~~~~~~~~
std::optional
```
- a769759dd4 should fail CUDA builds
```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
input, c10::nullopt, reduce_op, group_name, out);
^
1 error detected in the compilation of
```
Fixes https://github.com/pytorch/pytorch/issues/150313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
Summary: I'm investigating differences in total torch.compile overhead in our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is due to cudagraphs overheads. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead. But I'm going to argue that those timed regions that we DO NOT consider as a compiler-related overhead don't have much value in logging in the first place. So I'm suggesting we just remove those instances.
Here's the production job with the discrepancy:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty
Test Plan:
torchbench nanogpt:
* tlparse: https://fburl.com/h1n2ascc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136
Approved by: https://github.com/BoyuanFeng
Summary:
Various fixes to make fbcode work w/ ArmPL's cblas header:
1) Avoid re-declaring prototypes for internal blas methods which ArmPL already declares.
2) Fix `std::complex` conversion when using these methods.
3) Drop `extern "C"` around the include of `cblas.h`.
Test Plan: CI
Differential Revision: D72808561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151126
Approved by: https://github.com/Skylion007
By instantiating it implicitly; otherwise, attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```
Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.
The main behavioral change introduced in this diff is on CheckFunctionManager:
```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")
guards_state: bytes = check_fn_manager.guards_state
```
Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later.
When we load back the guards state, we will set `guards_serialization_mode` to `load`:
```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```
# TENSOR_MATCH
Since we have many types of guards to support, we will break the work into small diffs instead of a single diff supporting every guard.
We kick off the work with TENSOR_MATCH in this diff.
# Testing
For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
When e.g. OpenBLAS is used instead of MKL, the differences get too large:
> Greatest absolute difference: 5.91278076171875e-05 at index (7,) (up to 1e-05 allowed)
> Greatest relative difference: 3.468156592134619e-06 at index (7,) (up to 1.3e-06 allowed)
I traced some of the matmul operations and there are differences of around 8e-6 between MKL and OpenBLAS, but I haven't found where exactly the backward pass is calculated, which is where the actual differences arise. So I couldn't check if there is some difference in the low-level BLAS function used by the autograd.
However it seems odd that there is a difference at all: For the MKL case it seems to be zero up to the accuracy shown by Python.
So it seems the AOT compilation has some differences when MKL is not available.
Maybe this is also the reason why it fails for ARM and hence the test is skipped there. Maybe @zou3519 knows more as he introduced those skip markers in https://github.com/pytorch/pytorch/pull/85565
Is there any documentation on how and where `matmul_backward(_out)` is generated and how AOT transforms it with and without MKL?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152106
Approved by: https://github.com/zou3519
**Summary**
Add a new op, `onednn.qmul.tensor`, for int8 elementwise mul, which accepts inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides similar or better performance, depending on shape, than its counterpart for the QuantizedCPU device, `quantized.mul`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_mul_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151112
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).
**Design Philosophy**
It accepts either an `Int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.
Therefore, with this PR, we can do like this
```python
src = 0
dst = 1
# Set src to current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside with statement, we set dst to current device
    assert torch.accelerator.get_device_index() == dst
# Here the current device should be src
assert torch.accelerator.get_device_index() == src
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
Adding `torch.ops.fbgemm` to GraphPickler's allowlist. Otherwise, an fx graph module containing an `fbgemm` node will return an "Unable to pickle non-standard op" error.
The validation is done on the model, and the difference appears only in the graph name, not the nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152079
Approved by: https://github.com/aorenste
**Summary**
For the int8 GEMM template, the micro GEMM computes in u8s8s32 and we do the scale/zp compensation in the epilogue. In general, it is calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For the case when `x_scale`, `w_scale`, and `x_zp` are constant, we can pre-calculate the compensation to save runtime computation (a minimal sketch follows).
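A small sketch of the pre-computation, assuming per-tensor, constant `x_scale`, `w_scale`, and `x_zp` (illustrative only, not the template code):
```python
import torch

def precompute_compensation(w_int8: torch.Tensor, x_scale: float, w_scale: float, x_zp: int) -> torch.Tensor:
    # constant part of the epilogue: (x_scale * w_scale * x_zp) * sum(w, 0)
    return x_scale * w_scale * x_zp * w_int8.to(torch.float32).sum(dim=0)

def epilogue(micro_gemm_output_s32: torch.Tensor, comp: torch.Tensor, x_scale: float, w_scale: float) -> torch.Tensor:
    # scale the u8s8s32 accumulator, then subtract the precomputed compensation
    return micro_gemm_output_s32.to(torch.float32) * (x_scale * w_scale) - comp
```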
**Performance**
Test with 4 cores of XEON-5 and shapes from VIT model
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```
After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/147019 :
Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.
Motivation
Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.
Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True, since they can get the cubin/hsaco from the Triton cache; with the code provided in this PR, they can easily match the best_config with the right Triton cache directory for the "best" kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148981
Approved by: https://github.com/davidberard98
In https://github.com/pytorch/pytorch/issues/151746, users ran into an error where a custom triton op could not be resolved into an operator from its string target. We improve the error message by reminding users to register the same custom operator at de-serialization time.
Now the error looks like this:
```python
torch._export.serde.serialize.SerializeError: We failed to resolve torch.ops.triton_kernel.add.default to an operator. If it's a custom op/custom triton op, this is usally because the custom op is not registered when deserializing. Please import the custom op to register it before deserializing. Otherwise, please file an issue on github. Unsupported target type for node Node(target='torch.ops.triton_kernel.add.default', inputs=[NamedArgument(name='x', arg=Argument(as_tensor=TensorArgument(name='linear')), kind=1), NamedArgument(name='y', arg=Argument(as_tensor=TensorArgument(name='mul')), kind=1)], outputs=[Argument(as_tensor=TensorArgument(name='add'))], metadata={'stack_trace': 'File "/data/users/yidi/pytorch/test.py", line 50, in forward\n output = triton_add(dense_output, bias)', 'nn_module_stack': 'L__self__,,__main__.SimpleModel', 'torch_fn': 'add.default_1;OpOverload.add.default'}, is_hop_single_tensor_return=None): <class 'str'>.```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152029
Approved by: https://github.com/jingsh
Spot-checked builds for lines like `Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so`. I don't know if there's another way to verify it.
I am slowly trying to reduce the duplicated code in docker image installs.
Pros:
* less dup code
Cons:
* more docker copies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150600
Approved by: https://github.com/atalman
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.
The main behavioral change introduced in this diff is on CheckFunctionManager:
```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")
guards_state: bytes = check_fn_manager.guards_state
```
Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later.
When we load back the guards state, we will set `guards_serialization_mode` to `load`:
```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```
# TENSOR_MATCH
Since we have many types of guards to support, we will break the work into small diffs instead of a single diff supporting every guard.
We kick off the work with TENSOR_MATCH in this diff.
# Testing
For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs at both the reference and the loaded guard manager to see if their behaviors match.
Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
Summary: In the "too big to optimize" error message, tell the user that they should use the torch._inductor.config.aot_inductor.compile_wrapper_opt_level = 'O0' flag
Test Plan:
This is not added to the unit test cases because it runs for a while before hitting the expected failure.
```
def test_runtime_checks_error_msg(self):
    with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
        torch.library.define(
            "mylib::foo",
            "(Tensor a, Tensor b) -> Tensor",
            tags=torch.Tag.pt2_compliant_tag,
            lib=lib,
        )

        @torch.library.impl("mylib::foo", "cpu", lib=lib)
        def foo(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            return a + b

        @torch.library.impl_abstract("mylib::foo", lib=lib)
        def foo_fake_impl(a, b):
            return a + b

        class Model(torch.nn.Module):
            def __init__(self) -> None:
                super().__init__()

            def forward(self, x):
                for i in range(10000):
                    x = torch.ops.mylib.foo(x, x)
                return x

        inputs = (torch.ones(8, 8, 8), )
        model = Model()
        with self.assertRaisesRegex(Exception, "torch._inductor.config.aot_inductor.compile_wrapper_opt_level"):
            with torch.no_grad():
                AOTIRunnerUtil.compile(
                    model,
                    inputs,
                )
```
Differential Revision: D72323380
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151855
Approved by: https://github.com/desertfire
Follow-up work on top of https://github.com/pytorch/pytorch/pull/149480
Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet
Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard but would also be useful to integrate into KernelBench
This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal
For now we are planning on landing this as a private function because we expect to iterate both on the user-facing API and the internals of the implementation; we will open up a separate issue to discuss the path towards making this work public and give a broader overview of the state of custom CUDA kernel authoring in PyTorch.
Future work, as a prereq to making the work public
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
As reported in https://github.com/pytorch/pytorch/issues/149292, according to the manual, `vfmsq_f32` implements `c - a * b` rather than `a * b - c`, so its call must be prefixed with `vnegq_f32`.
Also, adjust the tests to use OpMath for FMA computation to avoid accuracy error accumulation due to non-fused multiply-and-add over lower precision dtypes
Note that `Vectorized::fmsub` is not currently instantiated anywhere, so it could safely remain broken
TODO:
- Enable C++ testing on MacOS and/or aarch64 platforms (right now Mac tests are build without C++ tests)
Fixes https://github.com/pytorch/pytorch/issues/149292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152075
Approved by: https://github.com/swolchok
ghstack dependencies: #151955
Implements the layernorm forward pass as a Metal kernel instead of MPSGraph ops. Speedups are indicated in the chart below:
(speedup chart image omitted)
Script for generating the timings; build torch with the old/new codebase and then run this with the different file name indicated at the end of the script:
```python
import csv
import time
import numpy as np
import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, elementwise_affine):
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")
            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")
                normalized_shape = (n,)
                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)
                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)
                mean_time = np.mean(times)
                std_time = np.std(times)
                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)
                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Summary: The TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature is thread unsafe for calling the initializers, but we want to allow the deferred initializer call to be safe from multiple threads. Add a mutex to ensure we have thread safe construction of the libraries post launch.
Differential Revision: D73457714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151938
Approved by: https://github.com/swolchok, https://github.com/zou3519
Summary:
We reverted D72896450 due to a weird error that happens at a seemingly unrelated test: `buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"`.
I did some investigation and found that moving the imports of AOTConfig and create_joint inside create_fw_bw_graph delays importing the modules recursively imported by AOTConfig and create_joint from test construction time to test running time. The path.exists mock gets called multiple times due to the inspect.getsource calls in multiple places in torch.
Specifically, we set a breakpoint at the side effect of the mocked os.path.exists. P1787425831 shows the importing stack trace before the change. P1787431638 shows the importing stack trace after the change.
The notable difference is that in the second paste, we trigger an os.path.exists when somewhere in triton we call inspect.getsourcelines as we construct OnDiskPreprocStateSerializer, which gets recorded by the mock.
Looking at the test, it seems what the test actually wants to test is the deserialize step. So we reset_mock before that step to avoid mocking things that happened at import time.
Test Plan:
buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"
and existing tests for map.
Differential Revision: D73138415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151479
Approved by: https://github.com/angelayi, https://github.com/zou3519
This PR reapplies #151490 and #151753 together, and adds some missing checks when applying the fast path.
Previously missed checks:
1) The indexing path has the stride in the indexed dimension in bytes; the gather path has the stride in the indexed dimension in elements. When checking if the fast path is applicable, I didn't take this difference into account and still multiplied the indexing stride by the element size. Fixed, and a test added.
2) We want to take fast path only when we are copying contiguous equally spaced slices of inputs + all the necessary alignment requirements. The effective tensor size should be 2d (after all possible flattening is applied), the index stride in the last dimension should be 0, and, since in the kernel we are not applying non-indexing-related offsets to src tensor, the src tensor stride in the second dimension should be 0. This automatically happens for gather with dim=0, so I didn't put in an explicit condition for this. Sometimes all conditions except first dim "effective" stride equal to 0 are satisfied for scatter on non-zero dim, when index size in the indexing dimension is 1 and thus it is collapsed (dimensions of size 1 are always collapsed), e.g.
```
# test gather along 1st dim that can accidentally trigger fast path
# because due to index dimension in the gather dim being 1
# an unexpected squashing in tensorIterator happens
src = make_tensor((16, 2, 16), device=device, dtype=dtype)
ind = torch.randint(2, (16, 1), device=device).view(16, 1, 1).expand(16, 1, 16)
res = torch.gather(src, dim=1, index=ind)
if res.device.type == "cuda":
    ref_cpu = torch.gather(src.cpu(), dim=1, index=ind.cpu())
    self.assertEqual(res.cpu(), ref_cpu, atol=0, rtol=0)
```
Note that if index size here was (16, 2, 16) instead of (16, 1, 16) then the middle dimension could not be collapsed and we wouldn't end up incorrectly taking fast path.
We could update the kernel to take this stride into account when computing offsets into src tensor, or we could specifically disallow non-zero stride on the first dimension. I took the second path for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151917
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/Skylion007
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`
These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.
While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where k >> m, n.
These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.
Test Plan:
contbuild & OSS CI
Reviewers: paulzhan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
Summary: Previously, when attr is defined, "if attr" will try to evaluate the data of attr, which is not intended, and we get an ugly error stack if the attr is not evaluable (like a fake tensor) before the callable(attr) check.
Test Plan: Existing tests.
Reviewed By: yushangdi, henryoier
Differential Revision: D73460905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151944
Approved by: https://github.com/yushangdi
Summary:
I can confirm that `torch.jit.Error.mro()` contains `Exception` in the inheritance hierarchy.
This avoids a bunch of `pyre-ignore`s in D73352417.
Test Plan: Sandcastle
Differential Revision: D73464544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151947
Approved by: https://github.com/Skylion007
Implement traceable config patching for Dynamo: this enables restricted patching of Dynamo config, where the user can use a context manager/decorator to change tracing behavior for parts of the code.
The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.
Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.
Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)
See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.
NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.
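A hedged usage sketch (assuming the decorator is exposed as `torch._dynamo.dont_skip_tracing`, per the name used in this description; the exact location may differ):
```python
import torch

@torch._dynamo.dont_skip_tracing  # assumed export location of the decorator named above
def helper(x):
    # code that would normally be skipped by Dynamo's trace rules
    return x + 1

@torch.compile
def fn(x):
    # with the decorator applied, Dynamo traces into helper instead of skipping it
    return helper(x) * 2
```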
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
As requested by Shuai. I also included an additional refactor to capture
changes in the whitelist over time, since previously, once it
was first set, it was impossible to override it when a new config was set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151766
Approved by: https://github.com/pianpwk
**Summary**
This is part of the task to enable max-autotune with the GEMM template for WoQ INT4 GEMM on CPU.
This PR adds AMX-based GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. It brings performance benefits on platforms where AMX is available.
**Validation results**
We have run GPT-J-6B and Llama-3-8B-Instruct on a 6th gen Xeon with 96 cores. Results show that the AMX-based microkernel outperforms the AVX512-based one by >5x for the prefill stage with 1024 input length.
**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150603
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/151589
It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even though Inductor passes Triton the wrong alignment information. I dug into this a bit. The main reason is that there are 2 things in Triton that decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process
Here is the triton code that decides vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269))
When Inductor wrongly tells Triton that an unaligned tensor is aligned, Triton may still not vectorize (or not fully vectorize) because of the second restriction.
Check this test:
```
@parametrize(
    "size",
    (
        128,
        1024,
        1024 * 1024,
    ),
)
def test_slice_view_dtype(self, size):
    offset = 1

    def f(x):
        return x[2:].view(dtype=torch.float32) + 1

    x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
    self.common(f, (x,), reference_in_float=False)
```
Before the fix, Inductor would tell Triton that the output tensor of aten.view.dtype is aligned even though it's not. That tensor will be passed to the Triton kernel for the aten.add. Triton may make different vectorization decisions depending on the tensor size:
1. when size = 128, Triton picks ld.global.b32 to load data from global memory
2. when size = 1024, Triton uses ld.global.v2.b32
3. when size = 1024 * 1024, Triton uses ld.global.v4.b32
So whether wrong alignment metadata causes an issue depends on whether Triton picks the vectorized instructions. The latter depends on the Triton config (block size) decided by Inductor and Triton's internal logic (how it assigns elements to each thread). We'd better make sure Inductor always generates correct metadata so that such hidden issues do not turn into crashes later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841
**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.
However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer `tmp_acc_array` is also defined for parallel reduction. This results in a conflict due to overlapping temporary buffers. This PR disables the problematic case to avoid the conflict until proper support is implemented.
**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
Summary:
I had a minor annoyance when debugging graphs using EXIR dialect ops: all the function normalization went away. For functions with > 5 arguments,
some of which are just simple bools and ints, it's very helpful to have
the kwarg names attached.
Enhance `normalize_target` to handle EdgeOpOverload targets. To avoid
a circular dependency on Executorch from pytorch core, I just use a `hasattr`
check for "_op". This only happens if the target is not already a recognized
torch function.
Also, I noticed that the new `fx.Node.normalized_arguments` function
didn't forward an important kwarg to `normalize_target`, so I fixed that too.
Test Plan: Tested with FxGraphDrawer and an fx Graph containing EXIR nodes.
Differential Revision: D67545909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143689
Approved by: https://github.com/angelayi
Filter the stack trace so that the stack trace on nodes looks nicer when using fx.Tracer. I just copied the filtering we have in [proxy_tensor.py](6720d23969/torch/fx/experimental/proxy_tensor.py (L1903-L1931)).
Previously the stack trace looked like:
```
File "/data/users/angelayi/pytorch/moo.py", line 3964, in <module>
run_tests()
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 1342, in run_tests
unittest.main(argv=argv)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 101, in __init__
self.runTests()
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 271, in runTests
self.result = testRunner.run(self.test)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/runner.py", line 184, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
test(result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 650, in __call__
return self.run(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3324, in run
self._run_custom(
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3296, in _run_custom
super_run(result=result)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3156, in wrapper
method(*args, **kwargs)
File "/data/users/angelayi/pytorch/moo.py", line 1495, in test_stack_trace
gm = torch.fx.GraphModule(m, tracer.trace(m))
File "/data/users/angelayi/pytorch/torch/fx/_symbolic_trace.py", line 837, in trace
(self.create_arg(fn(*args)),),
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
x = x * 2
File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 716, in impl
return tracer.create_proxy("call_function", target, args, kwargs)
File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 248, in create_proxy
proxy.node.stack_trace = "".join(CapturedTraceback.extract().format())
```
Now it looks like:
```
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
x = x * 2
```
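A rough sketch of that kind of frame filtering (illustrative only; the real logic lives in the helper copied from proxy_tensor.py):
```python
import traceback

_IGNORED_PREFIXES = ("torch/fx/", "torch/testing/", "unittest/")

def user_facing_stack_trace() -> str:
    # drop the current frame and any frames that come from tracer / test machinery
    frames = [
        frame
        for frame in traceback.extract_stack()[:-1]
        if not any(prefix in frame.filename for prefix in _IGNORED_PREFIXES)
    ]
    return "".join(traceback.StackSummary.from_list(frames).format())
```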
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151029
Approved by: https://github.com/jfix71, https://github.com/zou3519, https://github.com/jingsh
With `AdditionalInputs`, the behavior is the same as with tensors:
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

additional_inputs = torch.export.AdditionalInputs()
additional_inputs.add((5, 5))
additional_inputs.add((3, 5))
additional_inputs.add((5, 4))

ep = torch.export.export(
    M(), (6, 7), dynamic_shapes=additional_inputs, strict=False
)
```
With `ShapesCollection`, we now need to wrap integer inputs as `_IntWrapper` so that we can have a unique identifier for each integer input.
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

from torch.export.dynamic_shapes import _IntWrapper

args = (_IntWrapper(5), _IntWrapper(5))
# Or we can do `args = pytree.tree_map_only(int, lambda a: _IntWrapper(a), orig_args)`

shapes_collection = torch.export.ShapesCollection()
shapes_collection[args[0]] = Dim.DYNAMIC
shapes_collection[args[1]] = Dim.DYNAMIC

ep = torch.export.export(
    M(), args, dynamic_shapes=shapes_collection, strict=False
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151842
Approved by: https://github.com/pianpwk
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps` that is unrolled by inductor, so no need for dedicated atomic_add support
TODOs:
- Get rid of indexing kernel and compute it directly when kernel is run
- Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
- Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
In this approach, we catch any lane within a wave that is doing fastatomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
scatter_add performance on MI300x:
dtype|Baseline (before optimizations)|opportunistic fastatomics
-------|----------------------------------|----------------------------------
f32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513
Using the following reproducer
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```
co-authored by: @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except for maybe the s390x stuff, which I'm still not sure about.
For binary builds, do the thing like we do in CI where we tag each image with a hash of the .ci/docker folder to ensure a docker image built from that commit gets used. Previously it would use imagename:arch-main, which could be a version of the image based on an older commit
After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images
Release and main builds should still pull from docker io
Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (ex folder is unchanged, but retrigger docker build for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:
1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")
1 and 3 are not exactly the same. 2 can also be used for more advanced
things... (specialize on one input but not the other).
Most of this PR is tests.
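A hedged usage sketch (assuming the entry point is `torch._inductor.standalone_compile`; the three `dynamic_shapes` string values are the ones listed above, but the exact signature may differ):
```python
import torch

def f(x):
    return x * 2

# obtain an FX graph to hand to standalone_compile
gm = torch.fx.symbolic_trace(f)
example_inputs = [torch.randn(4)]

compiled_artifact = torch._inductor.standalone_compile(
    gm,
    example_inputs,
    dynamic_shapes="from_graph",  # or "from_example_inputs" / "from_tracing_context"
)
```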
Test Plan:
- a lot of new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
fixes https://github.com/pytorch/pytorch/issues/151055. Thanks @desertfire for the patch that fixed this.
I was a bit careful about the test - I wanted to make sure the test accurately ensures that we don't regress and that our error message is not spammy when users enter an invalid `TORCH_LOGS=...` argument. But I tried to avoid using expecttests, since people occasionally add new logging artifacts and I didn't want to add too much churn by forcing this to fail CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151678
Approved by: https://github.com/desertfire, https://github.com/zou3519
Fixes #150367
This PR builds the decomposition table from the ONNX registry, which includes registered ops beyond ATen and prim. This helps keep the custom ops that are specified in the custom_translation table from being decomposed during ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151826
Approved by: https://github.com/justinchuby
Summary: tree_flatten_with_map will internally call the unflatten function with a user-supplied function. But this function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behaviour.
Test Plan: CI
Differential Revision: D73388529
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh