Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
This adds integration into inductor in two parts
1) It kicks off the best config lookup at lowering time within mm.py
2) It awaits the future at scheduling time in select_algorithm.py
Notably this does not do the following
1) Support for enumerating between mm, addmm and bmm
2) Support for enumerating between exhaustive/max
3) Enumerating different hardware SKUs eg. H100, A100, etc.
those will come in the next diffs
Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121
Approved by: https://github.com/izaitsevfb
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format::
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception
traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
launchers.append(result.make_launcher())
^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
(self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(
# wrapped exception
traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
"exceptions": [
{
"choice_type": "triton",
"choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
"exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
"exception": "OutOfMemoryError",
"required_memory": "262144",
"hardware_limit": "232448"
}
]
}
```
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688
Approved by: https://github.com/stashuk-olek
Summary:
Device mismatches in tracing can most often be ignored. These are only logical mismatches not physical.
Take any intermediate computation, and that computation will not actually materialize in a compiled binary execution. So a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on CPU device during the execution, as they are temporary allocations.
If a user knows his tensors at graph input are all on the correct device, then he can ignore all tracing errors.
Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing.
Users can set
```
torch._functorch.config.fake_tensor_prefer_device_type = 'mtia'
```
to forcefully override any mismatch and prefer the non cpu device. This unblocks vLLM graph mode for MTIA.
Test Plan:
Added two unit tests.
Rollback Plan:
Differential Revision: D79698438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931
Approved by: https://github.com/jansel
Previously we only applied this move_to_device_pass to the toplevel graph. However if we have HOO, this pass will not be applied on the HOO submodules. This PR modifies the pass to run on all submodules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992
Approved by: https://github.com/yiming0416
Update HF components to not inherit from fsspec components and instead use filesystem writer/reader. The reason is because there doesn't seem to be much of a need for fsspec, since users are using mounted storage. Using local storage will allow for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s for load of 8b model), which is signifcant performance wins over reading bytes and converting to tensors which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata by bytes and loading it
Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
**Summary**
Some thoughts on view-op and `_StridedShard` interaction:
1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned)
compared to `Shard`. It only changes how shards permute across the devices.
2. `view()` op on DTensor strictly forbids shard redistribution which means if
`view()` may cause shard permutation across devices, it should be rejected.
This is enforced in today's sharding prop for `view()`.
3. Since DTensor `view()` won't introduce any redistribution, it's certain that
`placements` won't change except the inner `dim` attribute of `Shard`
or `_StridedShard`.
Therefore, to support `_StridedShard` in `view()` op, the only change required
is to keep `_StridedShard` as `_StridedShard` in the output spec.
**Test**
`pytest test/distributed/tensor/test_view_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656
Approved by: https://github.com/wconstab
Summary:
Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`.
It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header here would pollute the namespace a lot and that's not what we want in general. Therefore here we just create a Handle type which hold a pointer to decouple the actual type from header definition.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79751098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989
Approved by: https://github.com/dolpm
## Fixes https://github.com/pytorch/pytorch/issues/157683
## mini repro
* Just copy the code from the issue to reproduce it.
```python
import torch
device = "cpu"
# Input tensors
v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
def my_model(v2_0, v3_0):
v6_0 = -v3_0
v4_0 = v2_0 * v3_0
v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
v0_0 = v2_0.to(torch.int32)
v5_0 = v0_0.amax(dim=0)
return v6_0, v4_0, v1_0, v0_0, v5_0
v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
compiled_model = torch.compile(my_model, backend="inductor")
v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
print("v1_0", v1_0.shape)
print("v0_0", v0_0.shape)
print("v5_0", v5_0.shape)
```
error_stack
```
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: 附注:candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
41 | convert(const Vectorized<src_t>& src) {
| ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: 附注: template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: 错误:模板参数数目不对(不应是 4 个而应是 2 个)
37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
## summary
**The C++ kernel generated by the Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail.**
The original code that caused the problem.
```
def my_model(v2_0, v3_0):
v6_0 = -v3_0
v4_0 = v2_0 * v3_0
v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
v0_0 = v2_0.to(torch.int32)
// The original code that caused the problem.
v5_0 = v0_0.amax(dim=0)
```
## proof procedure
The c++ kernel generated by inductor:
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const int32_t* in_ptr0,
int32_t* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L))
{
{
int32_t tmp_acc0_arr[16];
for (int i = 0; i < 16; i++)
{
tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min();
}
int32_t tmp_acc0 = std::numeric_limits<int32_t>::min();
at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min());
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L))
{
{
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
{
auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16));
tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
{
for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
{
auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)];
tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0);
}
}
}
}
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
{
// impossible data type conversion which would caused the g++ compilation to fail.
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
{
for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
{
out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)];
}
}
}
}
}
}
```
the compilers complains
```text
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: 附注:candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
41 | convert(const Vectorized<src_t>& src) {
| ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: 附注: template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: 错误:模板参数数目不对(不应是 4 个而应是 2 个)
37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
so the following line have problem
```c++
// this line means that tmp_acc0_vec should be Vectorized<int64_t>, and it will convert it to Vectorized<int32_t>.
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
The issue is that tmp_acc0_vec is of type Vectorized<int32_t>, but the template parameters expect it to be Vectorized<int64_t>. and it will convert it to a Vectorized<int32_t>. this is conflict. the conversion should not be exist for tmp_acc0_vec is already Vectorized<int32_t>.The following line hardcodes the output variable type to int64, which causes unnecessary and incorrect type conversions.
d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904
Approved by: https://github.com/jgong5
Summary:
This field is not used today, and it's not useful either.
The device allocation is configured at model loading time, specified by user.
It shouldn't be part of the model definition.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79385513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653
Approved by: https://github.com/zhxchen17
This PR changes the behavior for compile wrapped op tests:
- supported_but_unclaimed_forward
- supported_but_unclaimed_backward
These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile.
~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976
Approved by: https://github.com/ezyang
If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X.
Coded with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965
Approved by: https://github.com/malfet
# Motivation
Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844
Approved by: https://github.com/jansel
The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there.
I recently found a new cause for this problem.
1df5d00145/Python/ceval.c (L5807-L5836)
In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur.
Here is the code to reproduce this problem.
```
import threading
import time
import torch
from threading import Event, Lock
lock = Lock()
lock.acquire()
event1 = Event()
event2 = Event()
event3 = Event()
def run():
event1.set()
event2.wait()
lock.acquire()
event3.set()
threading.Thread(target=run).start()
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
event1.wait()
event2.set()
time.sleep(1)
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
lock.release()
event3.wait()
```
<img width="1766" height="1250" alt="image" src="https://github.com/user-attachments/assets/6794eeca-7364-429e-91eb-62cdad116bd3" />
To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when the PyTrace_C-RETURN/PyTrace_CEXT CEPTION event occurs, we can determine whether to record this event based on these two fields.
In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed).
Do you think the fix is necessary?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574
Approved by: https://github.com/sraikund16