Fix #156261 _foreach_copy indexing (#156719)
Fixes #156261
Thanks to @ngimel's fast eyes
For testing, I had experimented with a broader test case change, but found that creating a tensor of size 2**31 + 1 was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with the new changes and fails without them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
(cherry picked from commit 4ee4863232b9e07728d85254768bcba3aadc9b9a)
Co-authored-by: Jane Xu <janeyx@meta.com>
docs: add get_default_backend_for_device to distributed documentation (#156783)
The `torch.distributed.get_default_backend_for_device()` API was added in torch 2.6 but is still missing from the distributed documentation. This commit addresses the gap.
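A minimal usage sketch of the documented API (the returned names depend on which backends your build includes):
```python
import torch.distributed as dist

# Query the default process-group backend for a device type.
print(dist.get_default_backend_for_device("cpu"))   # typically "gloo"
print(dist.get_default_backend_for_device("cuda"))  # typically "nccl"
```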
CC: @guangyey, @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
(cherry picked from commit b146ca74f01df3cf711fd0f855e05805e490156c)
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
don't error out in empty_cache under mempool context (#158152)
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This was already the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call.
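A minimal sketch of the new behavior, assuming a CUDA build where `torch.cuda.MemPool` and `torch.cuda.use_mem_pool` are available:
```python
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, device="cuda")
    # Used to raise under a mempool context (and during graph capture);
    # now it is silently a no-op.
    torch.cuda.empty_cache()
```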
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
(cherry picked from commit 9056279f8159b052599a31b591a78da1acc4224c)
Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
[autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer stream, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event; (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event when the first producer stream is the same as both the consumer stream and the accumulation stream, or when the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
(cherry picked from commit 8bda95228fbefa6ce204bf4da8b632d1516431bb)
Co-authored-by: soulitzer <soulitzer@gmail.com>
[user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
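For reference, a sketch of the user-provided allocator mentioned above for eager Triton (the allocator signature follows Triton's tensor-descriptor examples; treat it as an assumption of this note):
```python
import torch
import triton

# Device-side TMA needs scratch space for descriptor construction; in eager
# Triton the user registers an allocator, while AOTI now allocates it itself.
def alloc_fn(size: int, alignment: int, stream):
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(alloc_fn)
```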
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00dfe249a7bfc1d61a322d5bc30f164353abf)
Co-authored-by: David Berard <dberard@fb.com>
* Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.
* Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.
[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
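For illustration, a hypothetical user kernel exhibiting the pattern: any kernel whose source carries a triple-quoted docstring.
```python
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, BLOCK: tl.constexpr):
    """Adds 1 in place.

    When Inductor embeds this kernel's source in a triple-quoted block,
    the docstring's quotes must be sanitized or codegen breaks.
    """
    offs = tl.arange(0, BLOCK)
    tl.store(x_ptr + offs, tl.load(x_ptr + offs) + 1)
```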
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
(cherry picked from commit 82eefaedd98b63de8a87e34275af781f8eb177e1)
Co-authored-by: David Berard <dberard@fb.com>
[PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
The PyTorch build is failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0.
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power, and I am able to build successfully.
Hi @malfet @albanD
Please review this PR for the PyTorch build issue that we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 5e18bc333144473f1f10bc8a5ba05dba7950fb8a)
Co-authored-by: Avanish Tiwari <avanish@linux.ibm.com>
[aarch64] Add back NCCL lib to cuda arm wheel (#156888)
We discovered that the latest CUDA 12.9 arm nightly wheel fails to import because it is missing the NCCL lib. With USE_SYSTEM_NCCL=1, we need to copy libnccl.so into our big wheel environment so that it can be dynamically linked at runtime.
https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which uses the system NCCL by default rather than the one built into libtorch_cuda.so. With this PR, we add back libnccl.so to be used at runtime. This also provides the flexibility to use a different NCCL version from the one the original pytorch build came with.
related - https://github.com/pytorch/pytorch/issues/144768
```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
(cherry picked from commit de45c5f673ce261e9a82c54280beeda36cff640e)
Co-authored-by: Ting Lu <tingl@nvidia.com>
[ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
(cherry picked from commit d9577df312d477e8fa5b9d7bc61fb1f2c07b8e48)
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix environment and push env var for docker image builds for binary builds (#156910)
Changes WITH_PUSH and the environment check to be OK with giving credentials to push to docker.io if it's on the main branch, a tag starting with v, or the release branch.
Credentials for pushing to docker.io are in the environment, so without the environment you can't push to docker.io. You also don't do the push unless WITH_PUSH is true.
Binary builds on the release branch were failing because they pull from docker.io, but the docker build wasn't pushing to docker.io because it was either on the release branch (didn't have credentials, https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or on the tag (doesn't have WITH_PUSH).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
(cherry picked from commit 78ee2ee90eed957aec3dc80423b108b16938a8ae)
Co-authored-by: Catherine Lee <csl@fb.com>
By leaking the resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887) at exit, as at this point the handle to the child process might no longer be valid.
Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix, as all data loader tests will hang otherwise.
- Remove the `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the PATH environment variable)
Fixes https://github.com/pytorch/pytorch/issues/153050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
Users may want CUDAGraph for certain sizes and fallback for other sizes.
As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.
Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.
This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.
Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2, 3), (4, 5), (6, 2), (7, 3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
Don't call `sum()` on a tensor that is default-constructed.
Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:
```
Traceback (most recent call last):
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
method(*args, **kwargs)
File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
ln_out_cuda.backward(grad_output_cuda)
File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
torch.autograd.backward(
File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```
Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
At the end of the scope where std::async is launched, a wait is called on the returned future, which can make the code blocking; this is not expected for the monitoring thread. Instead, let's use a vector to contain the references to the futures, so no blocking happens. At the end of the loop, wait will still be called, but that is OK since all the checks or dumps have already finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
I had to create a new PR for this because of @atalman's request to temporarily revert the previous PR to restore diff train sync. Nothing has changed between this PR and the original one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
Differential Revision: D76843916
Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs with:
* Excessive register spillage
* Much larger amounts of shared memory than available on the hardware
This diff prunes out those configs to make exhaustive autotuning more viable, and adds support for exhaustive autotuning of the persistent+TMA template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:
```
AUTOTUNE mm(1152x21504, 21504x1024)
strides: [21504, 1], [1, 21504]
dtypes: torch.bfloat16, torch.bfloat16
mm 0.1167 ms 100.0%
triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```
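For reference, a minimal sketch of how exhaustive GEMM autotuning is enabled (the config name follows current inductor; treat it as an assumption of this note):
```python
import torch

# Search the full GEMM config space instead of the default curated subset.
torch._inductor.config.max_autotune_gemm_search_space = "EXHAUSTIVE"

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

a = torch.randn(1152, 21504, device="cuda", dtype=torch.bfloat16)
b = torch.randn(21504, 1024, device="cuda", dtype=torch.bfloat16)
out = mm(a, b)
```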
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311
Test Plan:
Just changes string output
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 60.993us 0.97% 60.993us 1.848us 0 B 0 B 33
...
```
Rollback Plan:
Differential Revision: D76857251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation of the same input in forward AND in backward.
AOTD emitted copy_ after joint_function tracing. This made the fx node correspond to the side effects of both mutations (in forward and in backward).
After that, the partitioner could put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle, which allows setting a "post_forward" callback, to be able to check the inputs' state after forward.
We do not want to apply the mutation after joint if we already applied it in forward. For that we need a "mutation_counter" and to memorize the version of the mutation that we applied for the forward mutation.
2/ Expose mutation_counter to Python.
We want to keep the invariant that copy_ exists only at the end of the joint graph.
3/ Memorize the mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.
Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.
proxy_tensor changes:
By default, proxy tensor updates the tensor_tracker. In that case, applied mutations will be chained.
But we want this copy_ to be independent and applied just to primals.
For this, introduce a context manager to be able to disable updates of the tensor_tracker when adding forward mutations.
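A hypothetical repro sketch of the failing pattern (names are illustrative, not from the original issue): the same input is mutated once in forward and again in backward.
```python
import torch

class MutateInFwdAndBwd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, counter):
        counter.add_(1)                  # mutation of the input in forward
        ctx.save_for_backward(counter)
        return x * counter

    @staticmethod
    def backward(ctx, grad_out):
        (counter,) = ctx.saved_tensors
        counter.add_(1)                  # same input mutated again in backward
        return grad_out * counter, None
```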
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Differential Revision: D76514984
Fix subgraph as a choice for when a symbolic shape is passed in as an expression, e.g. 256 * s0, which typically happens in the backward pass. The current logic assumes that all symbolic shapes are single inputs, i.e. a standalone s0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
When timing is enabled, the ROCR runtime used to sleep for a small amount, which ensured that the application saw the correct state. However, for perf reasons this sleep was removed, and now the state is not guaranteed to be "started". That's why I updated the test's state check to accept either "started" or "scheduled".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The existing torchbench `Makefile` installs all models from torchbench, which can easily take 30 minutes, even if a developer only wants to run one model.
This PR adds a config to install only the torchbench models we want to run.
Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"
# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"
# Install all models
make build-deps
# Install all models
make build-deps TORCHBENCH_MODELS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
In practice `bool(...)` is either constant folded by Dynamo or used for branching, so most of its emulation logic lived in `InstructionTranslator.generic_jump`.
This patch adds a dedicated `bool` handler (only for symbolic bool/int/float for now), and fixes #136075.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
**Summary**
Add a configuration option to enable a smaller dequantization buffer for the WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
Summary:
For the 16-GPU use case, NVSHMEM can drive only up to 49GB/s with 8 thread blocks per peer for the all-to-all-v use case. Increasing that to 16 thread blocks per peer is able to max out the perf.
Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=0 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=1 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Rollback Plan:
Differential Revision: D76937048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
Test Plan: CI
Differential Revision: D75497142
Co-authored-by: Benjamin Glass <bglass@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (the traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352
This PR uses torch._dynamo.utils.warn_once so that this warning only emits once, clarifies in the warning that silent incorrectness is potential (not observed), and doesn't warn for functions that come from torch.*.
As of right now, with this code change the terminal outputs:
if the code came from torch.* :
Nothing, as we shouldn't warn for functions that come from torch.*
else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', the debug logs show (the log below is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] torch.compile(foo, backend="eager")(torch.randn(4))
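For reference, a minimal sketch (reconstructed from the log above) that triggers the warning:
```python
import functools

import torch

@functools.lru_cache
def foo(x):
    # Dynamo ignores the lru_cache wrapper and traces foo directly.
    return x + 1

torch.compile(foo, backend="eager")(torch.randn(4))
```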
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
- Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
- Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]`, and all such `getitem` operations, to be immediately after `x = some_op(...)`, such as:
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in the pytorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory numbers are:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with the `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
- Final fence before flag update to ensure all data transfers complete before signaling
- Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
The compiled autograd (CA) initial trace just proxies nodes without dispatching any ops, so we should hide it from ambient TorchDispatchModes.
In terms of differences with the eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
- accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
- Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
Example:
```python
File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
You can disable compiled autograd for this operation by:
1. Relocating the unsupported autograd call outside the compiled region.
2. Wrapping the unsupported autograd call within a scope that disables compiled autograd.
3. Configuring the specific compilation unit to disable compiled autograd.
4. Globally disabling compiled autograd at the application's initialization.
```
No duplicate error messages for python side trace-time errors
```python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
- Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
- Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
- End-to-end validation of both data transfer and signal synchronization
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
### Summary
int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.
### Perf data
GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.
| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373 |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
- A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
- Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`
```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
    nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)

...
if rank == 0:
    print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
    wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
    print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
    print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
    torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
    print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")
if rank == 1:
    print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")
...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉
...
----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).
This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).
The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.
Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig
Rollback Plan:
Differential Revision: D76628834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
dynamo: Don't crash when someone tries to access a non-existent list member
Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
Since it's now actually used within async_compile.multi_kernel:
```
def multi_kernel(self, *args, **kwargs) -> Any:
    from torch._inductor.codegen.multi_kernel import MultiKernelCall

    # no need to call this in parallel since the sub-kernels are already parallel tasks
    return MultiKernelCall(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the kernel implementations to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Copied from original auto_functionalize Diff Summary D53776805:
This is a non-functional kernel implementation for auto_functionalize.
In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of the mutating inputs.
This mutates the input tensors in place, which is unsafe in general.
However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.
The proper functional implementation will:
- make a clone of the mutating input tensor
- return these new instances of tensors as AutoFunctionalizeKernel output
If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.
Test Plan: See internal for test plan
Differential Revision: D76926383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.
Change the callstack from (group_id, thread_id_in_group) to (thread_id, thread_id_in_group), which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Eigen's version is controlled by `eigen_pin.txt`, but it will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever, and GitLab can be down when GitHub is up, which has caused spurious CI failures in the past.
Remove the eigen submodule and replace it with eigen_pin.txt.
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices
### CHANGES
- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py
Replaced hard-coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.
- torch/testing/_internal/common_distributed.py
Extended common utility functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See the added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
The backward pass simply iterates over all 8 points the current point contributed to, and back-propagates to them with the respective weights.
TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point in unrolling it by hand).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
There is a memory layout mismatch between the `fft_r2c` XPU implementation and Inductor's meta deduction.
The original `fft_r2c` Inductor meta deduction for the XPU backend was aligned with CPU (fallback). This PR corrects the Inductor meta deduction and updates the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
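As a NumPy illustration of that decomposition order (not the XPU code itself), the composed result matches a full N-D real FFT:
```python
import numpy as np

x = np.random.randn(4, 6, 8)
out = np.fft.rfft(x, axis=-1)   # R2C on the last dimension first
for axis in (1, 0):             # then iterative C2C on the remaining dims
    out = np.fft.fft(out, axis=axis)
assert np.allclose(out, np.fft.rfftn(x))
```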
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`), it will now report how many workers were enqueued and dequeued (should be the same), as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).
Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph or the weights. This is introduced to decouple the state management of the top-level runtime from the kernel executions so that subgraphs from higher-order ops can be supported.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
CI
Rollback Plan:
Differential Revision: D76830436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Commit fixes AOT compilation in the SYCL C++ extension, which got accidentally dropped in aca2c99a652 (a fallback to JIT compilation had happened). Commit also fixes the override logic for default SYCL targets, allowing the flexibility to specify targets externally. Further, commit extends test coverage to cover such a case and fixes an issue in the test where consecutive tests executed the same (first) compiled extension due to name conflicts.
Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")
CC: @pengxin99, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
There are 32 H100 `linux.aws.h100` runners and they are still not fully utilized, with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours. I added 1 more shard for `TIMM` and 3 more for `TorchBench`, using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090
With this computing power, we could also run the whole suite every 4 hours now. I could run this less frequently later if I see queueing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
This was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`.
Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
The recursive function `collect_temp_source`, a closure in PyCodegen, caused a reference-cycle issue when torch.compile is used.
This issue may cause major tensors to not be freed timely, even when there are no user references to these tensors.
We saw OOM issues because of this problem in many cases, including training and inference using torch.compile.
The fix replaces the recursive implementation with an iterative one.
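The general shape of the fix, as an illustrative sketch only (`base_sources` is a hypothetical attribute, not the real PyCodegen structure):
```python
def collect_temp_source(root_source):
    # An explicit work list replaces recursion, so no closure cell keeps a
    # function <-> cell reference cycle alive across frames.
    collected, stack = [], [root_source]
    while stack:
        source = stack.pop()
        collected.append(source)
        stack.extend(getattr(source, "base_sources", ()))
    return collected
```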
Fixes#155778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/cpp/nativert:c10_kernel_test
```
Differential Revision: D76825830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name.
Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
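A hedged end-to-end sketch of the flow these bullets describe; the compile entry point, import path, and config spelling follow the text above but should be treated as assumptions:
```python
import torch
import torch._inductor
from torch._inductor.package import load_pt2  # import path assumed

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = torch.export.export(M(), (torch.randn(3),))
pt2_path = torch._inductor.aoti_compile_and_package(
    ep, inductor_configs={"aot_inductor.package_constants_on_disk": True}
)
loaded = load_pt2(pt2_path, load_weights_from_disk=True)
```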
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```
Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7
Rollback Plan:
Differential Revision: D74747190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
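A minimal sketch of the resulting producer-side negotiation, assuming the keyword-only `max_version` parameter from the DLPack 1.0 spec:
```python
import torch

t = torch.arange(4)
versioned = t.__dlpack__(max_version=(1, 0))  # DLManagedTensorVersioned capsule
legacy = t.__dlpack__()                       # no max_version -> old capsule
u = torch.from_dlpack(t)
assert torch.equal(t, u)
```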
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python. This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.
Fixes #150988. Fixes #153478.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.
Change the callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Summary: We noticed `std::future_error: Broken promise` errors in logging, so let's disable this for now; we will investigate more.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76929722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache by first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).
The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that tensor constructors will be lifted as constants, something we will address separately.
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to be close to torch.save/load semantics; when async checkpointing is used, it messes up shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict with a CUDA tensor in it; this is deep-cloned, causing the staging tensor to be created on GPU. This can cause OOMs and is hard to debug.
This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing storages created for staging to avoid recreating them on each checkpoint, while also being flexible enough to handle any changes: clean up old storages or create new ones as needed.
The lifetime of staging storages is tied to the original storage object. When the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify the underlying storage object has the same shape/size, etc., to make the caching logic work. The current implementation is much simpler and cleaner.
The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
with staging_context(stager):
cpu_state_dict = copy.deepcopy(state_dict)
```
Also, adds support for pinned-memory.
One problem this implementation does not address is that we lose the original device.
The only alternative here is to pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug as needed, so we would need to unpickle in the background process. I think this is flexible, but not performant and not very different from the current solution, while needing more code. One idea, if we really want to address this, is to stick the original device in a variable on the storage and then use it to recover on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.
Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on `__reduce_ex__` to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like `_clear_non_serializable_cached_data()` or other logic. Would like thoughts on which logic should be copied, or if everything should be.
3. If any object implements `__deepcopy__`, we will not be able to handle any tensors in its attrs with this logic, because it will likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to work around this.
The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```
Test Plan:
unit tests
Differential Revision: D75993324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
Context on torch.cuda.memory._record_memory_history buffer behavior
## Description
Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
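For reference, a minimal usage sketch of the API in question (the snapshot path is arbitrary):
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=2000)
x = torch.randn(1024, 1024, device="cuda")
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```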
Fixes #129674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than one dimension has a dynamic size
2. no_x_dim persistent reduce op
Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```
Rollback Plan:
Differential Revision: D76699002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
Summary:
### Diff Context
1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail, so we take a best guess at how to replicate the tensor to maintain symmetry in the returned state dict.
2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs; it is handled in this diff.
Test Plan: CI
Differential Revision: D75553096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4-bit types, and it allows us to represent any shape with an even last dimension, which is much more reasonable in my opinion.
Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
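The new rule, written out as a tiny helper (illustrative, not the PR's code):
```python
def unpacked_float4_shape(packed_shape):
    *lead, last = packed_shape
    return (*lead, last * 2)

assert unpacked_float4_shape((3, 4)) == (3, 8)  # new: double the last dim
# previous behavior, for contrast: (3, 4) -> (3, 4, 2)
```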
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
# Context
MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.
# This diff
Suggested by Alban (PyTorch core): instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.
The benefits of this approach:
1. The previous solution had problems handling ops that have a default dispatch key (e.g. CompositeImplicitAutograd), and couldn't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.
----------------
p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time
Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing symbols from user-provided inputs, then fall back.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid
Rollback Plan:
Differential Revision: D76769711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.
Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.
Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```
Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455
USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```
Rollback Plan:
Differential Revision: D76520301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
Fixes#156309
Instead of any kind of locking and busy waits, we leave room for multiple script downloads to happen; only one `rename` will succeed while the others silently fail, removing any temporary files created during the process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet
Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.
Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.
This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).
Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
## Summary
Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.
Fixes#115992
## Details
`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
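A quick check of the behavior this enables:
```python
import torch
from torch.func import vmap

A = torch.randn(8, 4, 4)
out = vmap(torch.matrix_exp)(A)  # no fallback warning, runs batched natively
assert torch.allclose(out, torch.matrix_exp(A), atol=1e-6)
```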
## Testing
The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76734378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as:
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in the PyTorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory numbers are:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.
Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.
What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel).
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.
Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2} and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.
Test Plan:
CI + dry run.
Rollback Plan:
Differential Revision: D76552340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See the added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
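A hedged usage sketch, assuming the API lands as `torch._dynamo.set_fullgraph` per the issue:
```python
import torch
import torch._dynamo

@torch.compile(fullgraph=True)
def f(x):
    x = x + 1
    with torch._dynamo.set_fullgraph(False):
        torch._dynamo.graph_break()  # tolerated: fullgraph locally disabled
    return x + 2

print(f(torch.ones(3)))
```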
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
This is enough to get @XilunWu's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198
There are two problems:
(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.
This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.
Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.
**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.
This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4
Co-authored-by: drisspg <drisspguessous@gmail.com>
Summary:
# Why
speed up cutlass kernel generation and retrieval
# What
Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.
This is the OSS-only part of the change, to facilitate integration.
Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## prove that we can download errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## showing timing information
```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```
Reviewed By: henrylhtsang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
There are a few considerations here:
1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404
2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.
3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.
4. There are two foot guns:
a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).
b. The following sequence won't work as intended:
```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```
This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.
I think these two foot guns are probably okay given that this is a bit of an experts' API.
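For contrast, the working version of that sequence, mirroring the pseudo-Python above (`foo` and `modify` remain placeholders, and per point 3 the Python bindings do not exist yet):
```python
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.instantiate()  # must re-instantiate after modifying the cudaGraph_t
g.replay()
```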
Fixes#155106
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
Followup on https://github.com/pytorch/pytorch/pull/125971
`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overridden, but it does mean we do less work!
Example demonstration:
```python
class Base:
def register_buffer(self, buffer):
pass
class InheritedOk(Base):
pass
class InheritedOverride(Base):
def register_buffer(self, buffer):
pass
b = Base()
ok = InheritedOk()
override = InheritedOverride()
print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False
print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```
(I can make an associated issue if needed, but didn't see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
Fixes#134840
Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:
- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive
These changes help prevent runtime errors when users attempt to use large padding values.
## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
I realize I was passing stable::Tensors by value (thus making a copy every time), which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.
I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).
Update: The LLM failed to mention primitives, with which we should not pass references around, so we are only changing the signatures of std::optional<T> and stable::Tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to express fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
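A minimal sketch of the new flag in use (the `external` kwarg follows the description above):
```python
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event(external=True)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=s):
    # Recorded with cudaEventRecordExternal rather than acting as a
    # fork/join point in the captured graph.
    e.record(s)
```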
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Summary: We ran into an interesting case where we saw many mismatches, while a lot of the mismatches turned out to be full matches. The reason is that we used the dump ranks (which go from 0 to 79) to compare against the local PG ranks (0 to 7); this leads to false positives for mismatches. We can just check whether the dump ranks contain all expected ranks; that should be sufficient.
Test Plan:
Tested with the failed case with the script, and we now see the correct behavior, plus a new unit test case.
Rollback Plan:
Differential Revision: D76775373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)
What do we expect to see when we run two identical matmuls back to back? We expect the second one to spend no time in precompilation, autotuning, and prescreening.
However, the introduction of prescreening brings some non-determinism. Basically, we have:
1. prescreening of the first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.
With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.
`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2
Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1, rocSolver <3.26)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS).
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64
Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
scale_A = torch.ones(M, 1).cuda()
scale_B = torch.ones(1, N).cuda()
out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
torch.testing.assert_close(out, out_ref)
latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
tflops = (2 * M * N * K) / latency_us / 1e9
print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS
According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS
Small M slightly improves, but large M is still bad.
If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.
To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than current configs, but still very far away from hardware limit.
@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?
Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than default configs, but still far away from the config above for compute-bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.
The name of the last-run test is saved in order to do more efficient reruns (target the last-run test for a rerun without rerunning the entire file). This is usually correct: e.g., a test fails and pytest catches it -> lastrun = the test that failed; a test segfaults (pytest doesn't catch it) -> lastrun = the test that segfaulted. But sometimes pytest reports a success while the process has a non-zero exit code. The two cases I know of are hangs and double-freeing at exit. In this case, it's unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so that the next session will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's no difference between resetting and not in the normal case), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused the failure. Then on the next run, it starts from the last test, and the process now exits cleanly.
Short-term solution here: ensure that lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path, because the initial value shouldn't change in that case.
Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this
I think I have a PR that fixes the above but its old so I need to take another look
Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp`, but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`.
Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
// C APIs with AtenTensorHandle. Under the hood we still call to these C shim
// APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
// around ownership. This makes it difficult to pass one input into 2
// different functions, e.g., doing something like c = a(t) + b(t) for
// stable::Tensor t. Thus, we use a shared_ptr here.
```
This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.
I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.
Fixes#130722
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.
To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
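A hedged sketch of the check-and-wrap logic (names are illustrative; the real change lives in Inductor's Triton glue code):
```python
import inspect
import triton.language as tl

def wrap_for_semantic(helper):
    # New Triton (PR 7054): builtins take `_semantic`; accept it here and
    # forward it to an old-style helper under the `_builder` name.
    if "_semantic" in inspect.signature(tl.core.view).parameters:
        def wrapper(*args, _semantic=None, **kwargs):
            return helper(*args, _builder=_semantic, **kwargs)
        return wrapper
    return helper
```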
(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`
Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis.
Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```
After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
Currently, we do it in a bit of a hacky way: look at all the .o files we have from this session and add them all to AOTI. This, for example, doesn't work if we do multiple AOTI compilations in one session without clearing the inductor cache.
Also I want to change how cutlass .so are compiled. Hence this change.
This change is broken down since @coconutruben is trying to make a change to the same files too.
Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
Rollback Plan:
Differential Revision: D76525939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.
A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.
Test Plan:
test/dynamo/test_pgo.py
Rollback Plan:
Differential Revision: D76630499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:
1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile
One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.
I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
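A hedged sketch of those three steps (illustrative only; the real logic lives inside dynamo's guard evaluation, and the names here are made up):
```python
import torch
import torch.distributed as dist

def lookup_with_guard_collective(lookup_compiled_code, group=None):
    # Step 1: evaluate guards locally via the compiled code lookup.
    code = lookup_compiled_code()
    # Step 2: every rank reports whether it found a cache hit.
    found = torch.tensor([code is not None], dtype=torch.int64)
    dist.all_reduce(found, op=dist.ReduceOp.MIN, group=group)
    # Step 3: if any rank missed, all ranks recompile together.
    return code if found.item() == 1 else None
```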
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
Fixes#128796
This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:
1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
- Masking before division
- Using MaskedTensor (experimental API)
The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.
This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
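For illustration, a small self-contained example of the pitfall and the recommended masking-before-division fix, in the spirit of the added notes:
```python
import torch

# Masking AFTER division: 1/x is inf at x=0, and the chain rule multiplies
# an infinite gradient by the mask's zero, producing nan.
x = torch.tensor([0.0, 1.0], requires_grad=True)
y = torch.where(x == 0, torch.zeros_like(x), 1.0 / x)
y.sum().backward()
print(x.grad)  # tensor([nan, -1.])

# Masking BEFORE division keeps every intermediate finite.
x2 = torch.tensor([0.0, 1.0], requires_grad=True)
safe = torch.where(x2 == 0, torch.ones_like(x2), x2)
y2 = torch.where(x2 == 0, torch.zeros_like(x2), 1.0 / safe)
y2.sum().backward()
print(x2.grad)  # tensor([0., -1.])
```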
Additional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.
Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1); getitem_1 = None
copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1); copy_ = None
```
- AOTAutograd: functionalizes the copy_
After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.
While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check; I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that check for Error nodes in the compiled case.
In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
Delays code generation for arguments to fallback ops. This is inspired by #155642, and likely fixes similar memory leaks.
Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In jit tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way; it prints the file which the user should have run (a sketch follows this list).
- Raise a RuntimeError on tests which have been disabled (not run)
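A hedged sketch of what the helper could look like (the actual implementation lives in the shared test utilities):
```python
def raise_on_run_directly(file_to_run: str) -> None:
    # Sketch only: abort with a message naming the file the user
    # should have run instead of this one.
    raise RuntimeError(
        "This test file is not meant to be run directly. "
        f"Run it via: python {file_to_run}"
    )

if __name__ == "__main__":
    raise_on_run_directly("test/test_jit.py")
```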
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there are two triton kernels - one for the old API and one for the new API - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
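For reference, host-side construction with the stable API looks roughly like this (import path per recent Triton releases; treat it as an assumption):
```python
import torch
from triton.tools.tensor_descriptor import TensorDescriptor  # path assumed

x = torch.randn(1024, 1024, device="cuda")
# Host-side TMA descriptor over x with a 64x64 block; this is the call
# that ir.py now recognizes and wrapper.py codegens.
desc = TensorDescriptor.from_tensor(x, block_shape=[64, 64])
# `desc` is then passed as an argument to a @triton.jit kernel.
```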
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.
Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (i.e., the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata).
Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
Fixes#155048
The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:
New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
when there are multiple input elements with the same minimum or maximum value.`
cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
Fixes#155688
Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) whose inner value, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a Python `int`. This is where the `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.
Fix:
Define an `__int__` method for `torch.utils._sympy.functions.Identity`, and likewise a `__float__` method for `float`.
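A minimal sketch of the shape of the fix (illustrative; the real class lives in torch/utils/_sympy/functions.py):
```python
import sympy

class Identity(sympy.Function):
    # Sketch of the fix: delegate numeric conversion to the wrapped
    # expression so TypedExpr.__post_init__ can cast Identity(x) like
    # a plain number.
    def __int__(self):
        return int(self.args[0])

    def __float__(self):
        return float(self.args[0])
```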
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 mins, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold the lock forever. I also audited the locks within FR and found one more place where a critical section can be shrunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of incorrect usages, which I will fix in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (whose current shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
Summary: #buildall
Test Plan: CI
Differential Revision: D74582970
When we decompose to inference IR, aten.to can sometimes disappear. As a result, the export module call graph tree starts containing dead nodes because the previous provenance tracking was insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to submodules whose signatures the user wants to preserve, because we always desugar the tensor subclass into its constituent tensors in inference IR, making it impossible to preserve the original calling convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
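In Python pseudocode, the fixed computation looks roughly like this (a sketch; the actual change is in the C++ kernel):
```python
import torch

def fake_quant_sketch(input, inv_scale, zero_point, quant_min, quant_max):
    q = input * inv_scale + zero_point
    # Clamp while still in floating point, so an inf input can no longer
    # hit undefined float -> int64 conversion behavior.
    q = torch.clamp(torch.round(q), quant_min, quant_max)
    return q.to(torch.int64)
```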
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.
**Fixes #155718**
## Usage Examples
```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"
# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"
# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"
# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
Changes:
- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).
This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.
Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]
Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
Prior to this PR, the exporter did not support dynamic dims with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.
However, a known pitfall still exists because of the difference between eager and export semantics. The compiler needs to decide the exported shape ahead of time, and whether "hidden broadcasting" is applied results in a different export.
For example,
```python
import torch
class Model(torch.nn.Module):
def forward(self, x, y, z):
return torch.cat((x, y), axis=1) + z
model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)
DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}
with torch.fx.experimental._config.patch(backed_size_oblivious=True):
ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))
print(ep)
"""
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
#
sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1); x = y = None
#
eq: "Sym(True)" = sym_size_int_2 == sym_size_int; sym_size_int_2 = None
_assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'"); eq = _assert_scalar_default = None
add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3; sym_size_int_1 = sym_size_int_3 = None
eq_1: "Sym(True)" = add_1 == sym_size_int_5; add_1 = sym_size_int_5 = None
_assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'"); eq_1 = _assert_scalar_default_1 = None
eq_2: "Sym(True)" = sym_size_int == sym_size_int_4; sym_size_int = sym_size_int_4 = None
_assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'"); eq_2 = _assert_scalar_default_2 = None
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z); cat = z = None
return (add,)
Graph signature:
# inputs
x: USER_INPUT
y: USER_INPUT
z: USER_INPUT
# outputs
add: USER_OUTPUT
Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
ep.module()(x, y, z)
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
raise e
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
return inner()
^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
_check_input_constraints_for_graph(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
_check_symint(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```
The explanation (from @pianpwk):
In the model we have `return torch.cat((x, y), axis=1) + z`.
Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while z has shape, say, `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler is making a decision: does broadcasting apply or not? The choices are:
1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.
Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).
For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Fixes#155028
This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
`use_max_autotune()` is likely not what people expect it to be;
Originally, `use_max_autotune()` was set up to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back, `use_max_autotune()` is also True when `search_autotune_cache=True`, even though `search_autotune_cache=True` should never trigger autotuning.
Over time, people have used `use_max_autotune()` without realizing that it gives unexpected behavior when `search_autotune_cache=True`. We could rename the method to be clearer, but we prefer to phase it out entirely for maximal clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76558294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
Fixes#148950.
During graph construction, when running the add node under the [interpreter](https://github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument holding the conjugated complex tensor gets cloned. This results in *.is_conj()* always evaluating to false in the decomposition function.
The proposed fix calls resolve_conj() in the decomposition of complex tensor add.
Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
Fixes#155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
When an op is unimplemented for a specific dtype, raise `NotImplementedError`, which makes more sense than a `RuntimeError`.
Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```
release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes it has not been implemented for.
Mark a few more unary ops as unimplemented to keep foreach-test error reporting consistent between CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
I guess this is more of an RFC
Goal:
Enable keep-going so that we get information immediately on failures. We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly.
Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before running the next test, it will upload a fake log with enough information that viewing it tells you what failed, with any stack traces/error logs, and that the log classifier can parse to pick a line.
I am getting the log by concatenating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with the `ci-verbose-test-logs` label). There are obviously many things this won't catch, e.g., output outside of run_test.py and some output inside of run_test.py, but it should be enough.
After a job finishes, its raw log is eventually uploaded to the ossci-raw-job-status s3 bucket, and the log classifier reads it to do classification. This means we will have to change the log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one
To overwrite the conclusion on HUD, I'm thinking of a lambda with an S3 put trigger on the fake log being put into S3, which does something similar to the log classifier: it mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run.
Then we change HUD/ClickHouse to point the raw log url to the alternate place, use the new "will_fail" field as the conclusion, and use the temp log classifier result if needed.
Why always write to a temp attribute/column? I am unsure about overwriting the real results with fake ones.
Pros:
Not many changes outside of HUD/UI
Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
Addressing #154890
Not really a proper fix but at least it's more informative than the current crash.
For a longer-term solution, I'm testing whether we can use the TopK API released in macOS 14, as it does not have the same MPSScan op issue that Sort and ArgSort are hitting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781
Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
Summary: Change HF classes to not have a leading underscore, thereby making them public; we will add documentation for them following this.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D76364024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
Extracting both the monitor thread and the watchdog thread into separate classes helps us see which dependencies each thread has, and simplifies the work of consolidating the threads (from one thread per PG instance to one per PG class).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/155023
Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with integer specialization, which should make it much faster than before while still compliant with the Python way of rounding negative numbers.
This allows one to remove the complex type detection logic from the mps codegen and rely on the Metal (C++) type system to figure out input and output types.
This fixes compilation of something like
```python
@torch.compile
def f(x, y):
return x[y % 5]
```
which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
#include <c10/metal/utils.h>
kernel void generated_kernel(
device float* out_ptr0,
constant long* in_ptr0,
constant float* in_ptr1,
uint xindex [[thread_position_in_grid]]
) {
int x0 = xindex;
auto tmp0 = in_ptr0[x0];
auto tmp1 = 12;
auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
auto tmp3 = 1024;
auto tmp4 = static_cast<long>(tmp3);
auto tmp5 = tmp2 + tmp4;
auto tmp6 = tmp2 < 0;
auto tmp7 = tmp6 ? tmp5 : tmp2;
if ((tmp7 < 0) && (tmp7 > 1024)) return;
auto tmp9 = in_ptr1[tmp7];
out_ptr0[x0] = static_cast<float>(tmp9);
}
with program_source:372:28: error: array subscript is not an integer
auto tmp9 = in_ptr1[tmp7];
^~~~~
```
This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes.
This causes an overflow error with a huge number, 9223372036854776048.
Test Plan: CIs should pass.
Differential Revision: D74354663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of the slowest targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734
This target had to run locally because it uses the manifold backend for dotslash. This diff moves `glslc` to the CAS backend so that it can run on RE.
Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar
% ls
-rw-r--r-- 1 navidq staff 2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r-- 1 navidq staff 4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r-- 1 navidq staff 4.4M Jun 12 10:03 glslc-windows-v2024_3.tar
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```
Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling, dump it all into bytes, and then stitch it back together into a cache of callables.
## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.
## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:
(figure omitted)
Whereas we'd visualize PrecompileContext's result like so:
(figure omitted)
For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.
After this PR, precompile consists of three main interfaces:
### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` loads precompile artifacts into the function's dynamo state with a dictionary of backends
- `cache_entry()` returns a serializable cache entry to save
### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path)`: Save a CompilePackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)
### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key
<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
Adding a per-torch.compile() CompilePackage object which tracks dynamo artifacts. CompilePackage is considered a low-level component and should not be directly exposed to end users. It has the following interface:
1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
a. when the `dynamo` argument is None, it will construct a brand new CompilePackage object.
b. when the `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.
This diff focuses on the low-level mechanism for precompile. It is left to higher-level interfaces to use these APIs to build a more user-facing frontend.
Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu
Co-authored-by: James Wu <jjwu@meta.com>
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.
This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.
Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.
Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
Fixes#136849
## Test Result
```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
streamdata = torch._C._cuda_getCurrentStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.default_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
streamdata = torch._C._cuda_getDefaultStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.set_per_process_memory_fraction(0.5, device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.
Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
Motivation:
PyTorch maintains a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269, which indicates the default distributed backend when no backend name is specified in the user frontend (like `init_process_group`).
Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group backend is registered as the XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.
Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
9df2e8020f/18e8d4b13b0 (43980565863-box)
started OOMing after switching to CUDA 12.8.
Maybe because I made some changes to fix the per-process memory fraction, so each process has less memory.
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
Fixes#132140
As described in the issue, I've added an example that showcases the use of higher-dimensional inputs and outputs, batched inputs, and the vectorize=True with `torch.autograd.functional.jacobian`.
Could you please review?
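For reference, an example in that spirit (not necessarily the exact one added to the docs):
```python
import torch
from torch.autograd.functional import jacobian

def f(x):  # batched input (B, 3) -> higher-dimensional output (B, 2)
    return torch.stack([x.sum(dim=1), (x ** 2).sum(dim=1)], dim=1)

x = torch.randn(4, 3)
J = jacobian(f, x, vectorize=True)  # vmap-based, avoids a Python loop
print(J.shape)  # torch.Size([4, 2, 4, 3]) -- output dims + input dims
```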
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155683
Approved by: https://github.com/soulitzer
This change will add the ability to support re-sharding for hf safetensors checkpoints.
This is done by adding more metadata when saving each file. This metadata captures the size and offset of the saved shard. This can be used to re-shard on load by using this information to create the chunks belonging to TensorStorageMetadata class.
Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
Summary: This diff modifies the elastic agent's API to pass the event log handler to the record function calls. This change enables the elastic agent to log events to a specific destination, improving the monitoring and debugging capabilities of the distributed training process.
Test Plan:
unit tests
ran an e2e training job.
Differential Revision: D75194115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155457
Approved by: https://github.com/d4l3k
- Fix typo in class name `OpenRgistration`->`OpenRegistration`
- Use existing `common` alias of `torch.testing._internal.common_utils`, i.e. `s/torch.testing._internal.common_utils.markDynamoStrictTest/common.markDynamoStrictTest/`
- Remove `TEST_CUDA` and `TEST_ROCM`, which are unused in that file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155760
Approved by: https://github.com/albanD
Summary:
We generate AtenTensorHandles for fallback kernels regardless of the arg type. If we indeed fall back, we regenerate the AtenTensorHandles, which causes the first handle generated to never be recycled, so a memory leak occurs.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406. Only limit the scope to files in c10.
Summary from the original PR,
```
Looking in the code I see
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].
Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```
Test Plan: CI
Differential Revision: D72762767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
Instead of `setup-miniconda`
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the path environment variable)
- Skip `TestMultiprocessing.test_fs_sharing` as even though it completes, it hangs on the shutdown both in CI and in all local setups I have
- Skip `TestCppExtensionOpenRgistration.test_base_device_registration` as it hangs on the shutdown as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515, #155697
Summary:
Moves DelegateExecutor base class to PyTorch core. It provides the extension point of backend delegation for NativeRT.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
This is only a virtual base class. So relying on internal CI is sufficient.
Rollback Plan:
Differential Revision: D76351984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155581
Approved by: https://github.com/zhxchen17
vLLM's profiler sets with_stack=True, which surfaces dict_getitem frames in the profile, both inflating the numbers and confusing compile users. This PR keeps BINARY_SUBSCR for regular dicts, while using `dict.__getitem__` only for dict subclasses.
Using binary_subscr is a little bit faster, but not enough to make any major latency improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155727
Approved by: https://github.com/zou3519, https://github.com/StrongerXi, https://github.com/jansel
Summary:
Currently, node.meta["stack_trace"] is not preserved when we torch.package/load a GraphModule, which means the original stack trace is lost. When we re-trace the packaged graph module, we just get a stack trace like fx-generated._0...
This adds node.meta["stack_trace"] to the torch-packaged graph module.
Test Plan:
```
buck2 run @//mode/dev-nosan fbcode//caffe2/test:package -- -r TestPackageFX
```
Rollback Plan:
Differential Revision: D76379692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155638
Approved by: https://github.com/angelayi
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if the input tensor has shape [M, K]. However, we find that the AMX kernel brings a performance benefit when 4 < M < 16. For example, on a 6th-gen Intel(R) Xeon(R) platform, E2E latency can be improved by more than 20% when running Llama-3.1-8B on 32 cores for M = 8. So, this PR changes the threshold so that AMX is used when M > 4.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
as titled. It's sometimes confusing to use PlacementStrategy as a name,
as we also have OpStrategy and TupleStrategy, the latter two contain
the former, so it is better to make the naming clearer.
Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.
Also found some utils that can be merged into OpSchema, so they are included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.
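A usage sketch under the new behavior, assuming the argument now defaults to None as described (single-process gloo group for illustration):
```python
import torch.distributed as dist

dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
# With this change, omitting `group` (or passing None) returns the ranks
# of the default process group instead of requiring an explicit group.
print(dist.get_process_group_ranks())      # [0]
print(dist.get_process_group_ranks(None))  # [0]
dist.destroy_process_group()
```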
Test Plan:
contbuild & OSS CI
Rollback Plan:
Differential Revision: D75817800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header, because we were loading the contents of the file directly into tensors with safetensors.load(), which handled the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save, so that will be added in a subsequent PR.
In addition this PR adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.
Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
Summary:
Previously, we only added stack traces in class _ModuleStackTracer(PythonKeyTracer) for non-strict export. I moved this stack trace logic to the parent class PythonKeyTracer, so the graph traced from a Module using make_fx will have stack_trace as well.
Motivation: we've observed some use cases where users first use make_fx on the Module, and then run export on the resulting graph. If the result of make_fx doesn't have stack traces, the stack trace information is lost.
**User needs to turn this on by passing in `stack_trace=True` to make_fx. We don't make this the default option since this might increase inductor compilation time (`make_fx` is used in inductor to trace graph patterns for pattern matching). It's also turned on if `_inductor.config.trace.enabled` is True.**
**preserving stack trace is on by default for ModuleStackTracer, which is used for non-strict export.**
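A hedged usage sketch (kwarg name taken from the description above; check the final API):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x.sin() + x.cos()

# Opt in to stack-trace preservation; off by default to avoid slowing
# inductor's pattern tracing.
gm = make_fx(f, stack_trace=True)(torch.randn(3))
for node in gm.graph.nodes:
    print(node.op, node.target, node.meta.get("stack_trace"))
```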
Test Plan:
```
buck run test:test_export -- -r test_stack_trace
buck run fbcode//caffe2/test/dynamo:test_dynamo -- -k test_autocast_ordering
```
Rollback Plan:
Differential Revision: D76298692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155486
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary:
The new coremltools exports an mlpackage instead of an mlmodel by default. When we use the new coremltools 8.0 to convert to the backend, the error is
```
Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Please provide that as well, using the 'weights_dir' argument.
```
Test Plan:
tested with internal workflow
Rollback Plan:
Differential Revision: D76325462
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155543
Approved by: https://github.com/shoumikhin
This is a new PR duplicating #154675 due to merge issues with that PR coming from my old (now updated) version of ghstack.
I am a Vulkan noob, but this extension and flag seem to be necessary. See "Encountered VK_ERROR_INCOMPATIBLE_DRIVER" at https://vulkan-tutorial.com/Drawing_a_triangle/Setup/Instance .
(For anyone trying to repro at home, I have the following homebrew packages installed, not all of which may be necessary: molten-vk, vulkan-headers, vulkan-loader, vulkan-tools, vulkan-utility-libraries. I also have VK_ICD_FILENAMES set to /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json, and I built PyTorch with USE_VULKAN=1. Making sure vkcube works helped me debug this setup.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155595
Approved by: https://github.com/malfet
~~The PR: https://github.com/pytorch/pytorch/pull/152478 did not respect the release policy that the deprecation should happen after the deprecation message has been set for 2 releases. This PR postpones 2.8 to the rightful version 2.10.~~
~~NOTE: "as early as" 2.10 shall give ONNX users more time to adapt and provide feedback.~~
To follow the upcoming torchscript deprecation, `torch.onnx.export` expects to switch to dynamo=True (also turning on fallback=True for BC) in torch 2.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155580
Approved by: https://github.com/justinchuby, https://github.com/tugsbayasgalan
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
Previously, when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead to a data-dependent error on individual checks: e.g. we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10`.
This handles that, and also makes `sym_and/or` user-code friendly, for testing.
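A hedged sketch of the behavior this enables, assuming `sym_and` is importable from `torch.fx.experimental.symbolic_shapes` as this PR suggests:
```python
import torch
from torch.fx.experimental.symbolic_shapes import sym_and

torch._dynamo.config.capture_scalar_outputs = True  # allow .item() in the graph

def f(x):
    u0 = x.item()
    torch._check(sym_and(u0 >= 0, u0 <= 10))
    # Each conjunct is now stored individually, so this no longer raises a
    # data-dependent error:
    torch._check(u0 <= 10)
    return x.new_zeros(u0)

torch.compile(f, fullgraph=True)(torch.tensor([7]))
```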
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
Improves the GEMM overview logging in PyTorch Inductor to properly display batch size information for batched matrix operations like `torch.bmm` and `torch.baddbmm`.
**Fixes #155307**
## Problem
The current GEMM logging for `torch.bmm` omits the batch size. Repro:
```python
# Repro
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch
M, N, K = 1024, 1024, 1024
dtype = torch.bfloat16
A = torch.randn(10, M, K, device="cuda", dtype=dtype)
B = torch.randn(10, K, N, device="cuda", dtype=dtype)
compiled_model = torch.compile(torch.bmm, fullgraph=True)
_ = compiled_model(A, B)
```
**Before:**
```
Name | M | N | K | Count
----------------------------------------------------------------------------------------------------
aten.bmm | 1024 | 1024 | 1024 | 1
----------------------------------------------------------------------------------------------------
```
The batch size (10) is missing from the logs, making it unclear what the actual operation dimensions were.
## Solution
**After:**
```
Name | B | M | N | K | Count
----------------------------------------------------------------------------------------------------------------------------------
aten.bmm | 10 | 1024 | 1024 | 1024 | 1
aten.mm | - | 1024 | 1024 | 1024 | 2
----------------------------------------------------------------------------------------------------------------------------------
```
## Changes Made
### 1. Enhanced Parsing Logic in compile_fx.py
- Detects batched operations by checking if operation name ends with `'bmm'` or `'baddbmm'`
- For batched operations: takes last 4 parts as `batch, m, n, k`
- For non-batched operations: takes last 3 parts as `m, n, k`
- **Dedicated "B" column**: Added separate column for batch size instead of embedding in operation name
- Shows batch size for batched operations, shows "-" for non-batched operations (see the parsing sketch after this list)
### 2. Updated All MM Operations for Consistency
- **bmm.py**:
- Extract batch size from `mat1.get_size()[0]` for both `tuned_bmm` and `tuned_baddbmm`
- Use positional counter keys: `aten.bmm_{batch_size}_{m}_{n}_{k}`
- Enhanced log messages to include batch size information
- **mm.py**: Updated counter keys for consistency:
- `aten.mm_{m}_{n}_{k}` (no batch dimension)
- `aten.addmm_{m}_{n}_{k}` (no batch dimension)
- `aten._int_mm_{m}_{n}_{k}` (no batch dimension)
- `aten._scaled_mm.default_{m}_{n}_{k}` (no batch dimension)
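To make the key format concrete, here is a hedged reconstruction of the parsing described above (illustrative only, not the exact code in compile_fx.py):
```python
def parse_gemm_counter_key(key: str):
    # e.g. "aten.bmm_10_1024_1024_1024" or "aten.mm_1024_1024_1024"
    parts = key.split("_")
    if parts[0].endswith(("bmm", "baddbmm")):
        name, batch = "_".join(parts[:-4]), parts[-4]  # batched: last 4 parts are B, M, N, K
    else:
        name, batch = "_".join(parts[:-3]), "-"        # non-batched: "-" in the B column
    m, n, k = parts[-3:]
    return name, batch, m, n, k

print(parse_gemm_counter_key("aten.bmm_10_1024_1024_1024"))   # ('aten.bmm', '10', '1024', '1024', '1024')
print(parse_gemm_counter_key("aten._int_mm_1024_1024_1024"))  # ('aten._int_mm', '-', '1024', '1024', '1024')
```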
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155544
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` when the non-contiguous case is fine if we encounter a DDE.
- use ref's contiguous over Aten's contiguous because Aten's version will DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.
## Example DDEs
- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)). (Size-like symbols: u0)
Caused by: layer_norm = self.layer_norm(linear) # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```
- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```
- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2). (Size-like symbols: u0)
Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
Example new error message
```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
Framework stack:
File "??", line 0, in _start
File "", line 0, in __libc_start_main_alias_2
File "??", line 0, in __libc_start_call_main
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
foo(x)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
return fn(*args, **kwargs)
File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
return self._torchdynamo_orig_callable(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
result = self._inner_convert(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
return _compile(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
return function(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
tracer.run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
super().run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
while self.step():
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
self.dispatch_table[inst.opcode](self, inst)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
self.push(fn_var.call_function(self, self.popn(nargs), {}))
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
return lambda tx, args, kwargs: obj.call_function(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
return wrap_fx_proxy(tx, proxy)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
return _wrap_fx_proxy(
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
ret_val = wrap_fake_exception(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
return fn()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
return node.target(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
return self._dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
final_shape = infer_size(final_shape, shape)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
torch._check(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
_check_with(RuntimeError, cond, message)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
if expect_true(cond):
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
return a.node.expect_true(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
return self.guard_bool(file, line)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
r = self.evaluate()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
return self.shape_env.evaluate_sym_node(self, size_oblivious)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
return self.evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
return self._inner_evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
return self._evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
self._maybe_guard_rel(g)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
self._refine_ranges(expr)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
self._set_replacement(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
File "??", line 0, in torch::unwind::unwind()
User stack:
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
return torch.randn(5) * x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
Fixes https://github.com/pytorch/torchtitan/issues/1185
It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
I'm not sure how worried we should be on the blast radius of this change. On the one hand:
(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)
(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
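For illustration, a hedged sketch of the default filtering described above (not the exact inductor code; the private asyncTP config is assumed to be `torch._inductor.config._micro_pipeline_tp`):
```python
# Illustrative only: entries whose names start with "_" were left out of the
# cache key by default.
def cache_key_entries(config: dict) -> dict:
    return {k: v for k, v in config.items() if not k.startswith("_")}

entries = cache_key_entries({"_micro_pipeline_tp": True, "max_autotune": False})
assert "_micro_pipeline_tp" not in entries  # silently dropped from the key
```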
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
This PR adds a workflow: whenever a dev makes a PR that touches files under torch/_dynamo, we check for any unimplemented_v2() callsites. If any of them have been modified, the workflow fails, lists exactly which callsites changed, and tells the dev which commands to run to update the registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155610
Approved by: https://github.com/williamwen42
Summary:
As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
- As a result we can probably remove the HFSavePlanner, but we're keeping it as a placeholder for now
- The current naming of files doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names they will overwrite each other, so this adds a "shard-00001" prefix so that the rank files don't overwrite each other (see the sketch after this list)
- Don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1:1 mapping between tensors and filenames, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file
- Make the "fqn_to_file_index" map optional. This describes which files to save which tensors in, but if users don't want to provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with the 5GB HF remote storage file-size soft limit.
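A hedged sketch of the resulting naming scheme (the exact format string in HFStorageWriter may differ; 1-based rank numbering is an assumption):
```python
def sharded_filename(rank: int, index: int, total: int) -> str:
    # "shard-00001" prefix is from the description; the rest mirrors the
    # existing "model-00001-of-00004.safetensors" convention.
    return f"shard-{rank:05d}-model-{index:05d}-of-{total:05d}.safetensors"

print(sharded_filename(1, 1, 4))  # shard-00001-model-00001-of-00004.safetensors
```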
Test Plan: test_hf_storage.py
Reviewed By: saumishr
Differential Revision: D75099862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.
This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
We have reached a point where many files on the cpp side relate to symmetric memory and are scattered around. Let's first put all symmetric-memory-related code into a separate folder; we can do further refactoring later if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
When we export a scripted function, we inline the original callable stored in "_torchdynamo_inline"; this is the same strategy as the torch.compile path.
We do the same thing for script methods, where a `__wrapped__` attribute points to the original callable in most cases. There are some corner cases we identified: methods of top-level jit.scripted modules don't have a `__wrapped__`. In this case, we fall back to the original scripted approach. There may be more such cases, but they need verification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155180
Approved by: https://github.com/zou3519
Sccache wasn't working for nvcc on jammy, so this manually sets the path to include where nvcc is.
I had problems when always making nvcc a wrapper: in some inductor tests I got
```
sccache: encountered fatal error
sccache: error: PCH not supported by nvcc
sccache: caused by: PCH not supported by nvcc
```
and I also got an error (only on clang) when trying to set CMAKE_CUDA_COMPILER_LAUNCHER to /opt/cache/bin/sccache or sccache
```
ccache: error: failed to execute compile
sccache: caused by: Compiler not supported: "nvcc warning : Support for offline compilation for architectures prior to \'<compute/sm/lto>_75\' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\nnvcc fatal : Failed to preprocess host compiler properties.\n"
```
Non-jammy CUDA jobs' docker images used a different Dockerfile, which sets CMAKE_CUDA_COMPILER_LAUNCHER: e895e9689c/.ci/docker/ubuntu-cuda/Dockerfile (L110)
Alt solution:
Given that I only get the error on clang, I could set CMAKE_CUDA_COMPILER_LAUNCHER=sccache only when not using clang
Setting CUDA_NVCC_EXECUTABLE doesn't fail but also doesn't result in cache hits/misses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155464
Approved by: https://github.com/malfet, https://github.com/huydhn
Since its introduction ~4 years ago, the test `test_sort_large` has always been deselected because it requires 200GB of CUDA memory. Now that we do have GPUs this big, it gets selected, but fails because `var_mean` is not a member of `torch.Tensor` and `var_mean` accepts only floating-point tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155546
Approved by: https://github.com/eqy
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In quantization tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154728
Approved by: https://github.com/ezyang
This is the first PR of a series in an attempt to get the content of #134592 merged as smaller PRs (Given that the original one was closed due to a lack of reviewers).
This specific PR contains:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Update ao tests.
There will be follow-up PRs to update the other test suites, but I don't have permissions to create branches directly on pytorch/pytorch, so I can't create a stack and will therefore have to create them one at a time.
Cc @jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154612
Approved by: https://github.com/jcaip
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel` and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.
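A minimal usage sketch, assuming an MPS-capable machine so the new Metal kernels are exercised:
```python
import torch
import torch.nn.functional as F

x = torch.randn(8, device="mps", requires_grad=True)
y = F.hardshrink(x, lambd=0.5)  # forward Metal kernel; lambd is the threshold
y.sum().backward()              # backward Metal kernel
```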
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
Developers can now pass an --additional-info arg to the add and update dev terminal commands to include any additional remarks the dev might want to make about the graph break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155526
Approved by: https://github.com/williamwen42
A precompiled bytecode looks like the following:
```
pre-graph bytecode
...
compiled graph code
...
post-graph bytecode
```
In pre-graph bytecode we have calls into helper functions like torch._dynamo.utils.call_size which will invoke @disable inside the bytecode.
Normally torch.compile() will handle these frames fine, but for precompile we will load bytecode from a clean state of dynamo, and we want a way to assert that recompiles never happen, so the current way to ensure this is by doing set_stance("fail_on_recompile") (open to any other idea to test this, but IMO this is the closest thing we have today).
This approach doesn't work when util functions like call_size() are involved, and this PR fixes a bunch of places to make sure "fail_on_recompile" can skip through the functions meant to be skipped during compilation.
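For context, a minimal sketch of the `fail_on_recompile` pattern described above (`torch.compiler.set_stance` is the public API):
```python
import torch

@torch.compile
def fn(x):
    return x + 1

fn(torch.randn(2))  # warm up: compile once
torch.compiler.set_stance("fail_on_recompile")
fn(torch.randn(2))  # OK: served by the existing compiled entry, no recompile
```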
Differential Revision: [D76156867](https://our.internmc.facebook.com/intern/diff/D76156867/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155363
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #155329
While loading deserialized dynamo states back from disk, precompile will need a direct way to access ExtraState and populate guarded bytecode as cache entries.
This diff adds two APIs at the code level to load precompiled guard + bytecode entries:
1. _load_precompile_entry() will append an entry to a precompile entry list per code object. This precompile entry will be looked up before normal compiled entries.
2. _reset_precompile_entries() will clean up all the installed existing entries. This is useful to prevent the case where a user calls loading multiple times and explodes the number of entries in the list.
Differential Revision: [D76083247](https://our.internmc.facebook.com/intern/diff/D76083247/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155329
Approved by: https://github.com/jamesjwu, https://github.com/jansel
To support https://github.com/pytorch/ao/issues/2228:
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops
Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a don't-constant-fold flag (see the sketch below).
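A hedged illustration of the idea (the names below are illustrative, not the exact flag added by this patch):
```python
# Illustrative only: tag ops so the constant folder treats them as impure and
# leaves the weight -> dequantize -> op pattern visible to later fusion passes.
DONT_CONSTANT_FOLD = ("quantized_decomposed.dequantize_per_channel",)

def is_impure(node) -> bool:
    # The constant folder would skip nodes reported as impure.
    return node.op == "call_function" and str(node.target).startswith(DONT_CONSTANT_FOLD)
```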
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
Triton XPU shares its version file with the community one. When the community updates the Triton version, it temporarily breaks the XPU CI/CD because they use different repositories and commits. To decouple Triton version bumps between the community and XPU, we propose splitting the version into two separate files.
Refer to the latest community Triton version bump PR #153117.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155313
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/atalman
Follow-up to #154858.
Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API.
First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr).
Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim.
Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155182
Approved by: https://github.com/drisspg
Summary:
When a custom op has an int return type in its schema, the returned value will be specialized; this behaviour is different from a symint return type. This diff **only adds support for the int return type**.
As the returned int will be specialized and fused into downstream kernels (if being used), we can simply skip the int return type in the proxy executor.
Note that in the eager run, the returned int will be specialized to the value defined in the real impl of the custom op. In exported program or in AOTI, the returned int will be specialized to the value defined in the fake impl of the custom op. So the definitions of the return value should be consistent across real and fake impl of the custom op. Otherwise the eager run and AOTI run will have different results.
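A hedged sketch of that consistency requirement ("mylib::num_chunks" is a hypothetical op defined via `torch.library`):
```python
import torch

# "mylib::num_chunks" is hypothetical; the int it returns must agree between
# the real and fake impls, since the value is specialized at trace time.
torch.library.define("mylib::num_chunks", "(Tensor x) -> int")

@torch.library.impl("mylib::num_chunks", "CompositeExplicitAutograd")
def num_chunks(x):
    return 4

@torch.library.register_fake("mylib::num_chunks")
def _(x):
    return 4  # must match the real impl, or eager and AOTI results diverge
```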
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_int_output
```
Rollback Plan:
Differential Revision: D76159406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155465
Approved by: https://github.com/angelayi
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.
On-device TMA works like this:
```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
tl._experimental_descriptor_store(workspace_ptr, data, ...)
```
The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain descriptor data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.
The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); and then during mutation analysis, any tma creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `data` in the above example).
Consider this example, which I'll use to describe the pros/cons of this approach.
```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)
@triton.jit
def kernel(A, B, workspace_ptr):
create_tma(A, workspace_ptr)
workspace_B = workspace_ptr + 128
create_tma(B, workspace_B)
data = tl._experimental_descriptor_load(workspace_ptr, ...)
tl._experimental_descriptor_store(workspace_B, data, ...)
```
An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).
A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.
Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
Some core modules for versions of python installed in /opt depend on libraries in /usr/local, but those libraries are not copied over from the base container. For example:
```
/opt/python/cp312-cp312/bin/python3 -c "import sqlite3"
ImportError: libsqlite3.so: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155211
Approved by: https://github.com/huydhn
Summary:
Serialization contains utilities to deserialize a graph saved on disk in json format as defined in `torch/csrc/utils/generated_serialization_types.h` to the in-memory representation as defined in `torch/nativert/graph/Graph.h`
Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test
Rollback Plan:
Differential Revision: D76012641
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
... and fix ck-tile instances not being generated due to incorrect caching
### Testing
Added test cases for CKTILE instances
```
pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294
Approved by: https://github.com/coconutruben
Summary:
Previously, we only added the stack trace in `class _ModuleStackTracer(PythonKeyTracer)` for non-strict export. I moved this stack trace logic to the parent class `PythonKeyTracer`; this way the graph traced from a Module using make_fx will have stack_trace as well.
Motivation: we've observed some use cases where users first use `make_fx` on the Module and then run `export` on the resulting graph. If the result of `make_fx` doesn't have a stack trace, the stack trace information is lost.
Test Plan:
```
buck run test:test_export -- -r test_stack_trace
```
Rollback Plan:
Differential Revision: D75985427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155155
Approved by: https://github.com/angelayi, https://github.com/zou3519
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](a3a196ccdb), which includes:
- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL as USE_ONEMKL_XPU and set it as default ON
- Refactor USE_XCCL & USE_C10D_XCCL option
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154962
Approved by: https://github.com/EikanWang
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of the tensor storage object to a unique id. If a new memory address is found, it is put into that hash table and associated with a new id.
This implementation does not guarantee the storage object is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Differential Revision: D75749065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154859
Approved by: https://github.com/eellison, https://github.com/ngimel
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.
On an AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect in certain cases of CPU tensors' backwards, where the backward is being done in Python land and dynamo unnecessarily tries to trace through the inductor-generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already.
```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]
```
This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical.
Fixes https://github.com/pytorch/pytorch/issues/154536
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b
Even though a check for an empty tensor list exists in `at::cat`, a crash might still happen while resolving a named dimension to a position, by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
556 bool has_names() const {
557 // If a user is using unnamed tensors, then we can short-circuit right here.
558 // Otherwise, impl::has_names attempts to retrieve names.
-> 559 if (!impl_->has_named_tensor_meta()) {
560 return false;
561 }
562 return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
20 int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
21 TORCH_CHECK(dim.type() != NameType::WILDCARD,
22 "Please look up dimensions by name, got: name = None.");
-> 23 TORCH_CHECK(tensor.has_names(),
24 "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
25 const auto names = tensor.names();
26
```
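A plausible repro sketch based on that backtrace (hedged; the exact call from the linked issue may differ): an empty tensor list plus a named dimension reaches `dimname_to_position` with no tensor to inspect.
```python
import torch

# crashed with a null-tensor dereference before this fix; afterwards it should
# raise a proper error about the empty tensor list
torch.cat([], dim="a")
```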
TODOs:
- Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
- Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests
Fixes https://github.com/pytorch/pytorch/issues/155306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: #155382
Add comprehensive module docstring explaining built-in function and type
variable tracking, including handling of Python built-ins, type constructors,
operators, and special constructs during symbolic execution.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155402
Approved by: https://github.com/Skylion007
ghstack dependencies: #155403
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
Add comprehensive module docstring explaining side effect tracking and
management, including mutation tracking, context changes, aliasing,
and state preservation during symbolic execution.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155403
Approved by: https://github.com/williamwen42
Add comprehensive module docstring explaining the tracing rules and policies
that govern TorchDynamo's compilation decisions, including skip rules,
inlining policies, and library-specific handling.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155401
Approved by: https://github.com/williamwen42
Fixes#151829
Summary:
Currently, inductor has a lazy init which causes certain aten ops to run during a profiling run. This ends up cluttering the function events, especially for smaller traces. One attempt to fix this was to simply remove that import from the profiler entirely, but the import happens somewhere downstream anyway and the events still flood our profile.
To fix this, we trigger the inductor import during prepare trace if inductor is present. This way, regardless of how the workload imports inductor, the actual init process is done before tracing starts, resulting in more accurate traces.
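A minimal sketch of the scenario this improves (assuming a CUDA machine; the exact events will vary): the first profiled region should no longer be flooded by inductor's lazy-init aten ops.
```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.ones(4, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    x.mul_(2)  # before the fix, the first run also recorded inductor init ops
print(prof.key_averages().table(sort_by="cpu_time_total"))
```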
Test Plan:
Added a test; also ran N7316820 manually and went from getting many events on the first run to the following output (the only difference is the Runtime Triggered Module Loading entries, which are CUPTI overhead events):
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::mul_ 1.40% 340.638us 99.92% 24.390ms 24.390ms 1.535us 100.00% 4.605us 4.605us 1
cudaLaunchKernel 0.60% 146.533us 98.52% 24.049ms 24.049ms 0.000us 0.00% 3.070us 3.070us 1
Runtime Triggered Module Loading 6.14% 1.500ms 6.14% 1.500ms 1.500ms 1.535us 100.00% 1.535us 1.535us 1
Runtime Triggered Module Loading 91.78% 22.403ms 91.78% 22.403ms 22.403ms 1.535us 100.00% 1.535us 1.535us 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.535us 100.00% 1.535us 1.535us 1
cudaDeviceSynchronize 0.08% 20.031us 0.08% 20.031us 20.031us 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::mul_ 82.81% 484.396us 94.26% 551.378us 551.378us 1.440us 100.00% 1.440us 1.440us 1
cudaLaunchKernel 11.45% 66.982us 11.45% 66.982us 66.982us 0.000us 0.00% 0.000us 0.000us 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.440us 100.00% 1.440us 1.440us 1
cudaDeviceSynchronize 5.74% 33.581us 5.74% 33.581us 33.581us 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D76056511
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155243
Approved by: https://github.com/ngimel
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.
Follow up to: https://github.com/pytorch/pytorch/pull/152715
Before this PR,
On a CPU-only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case no GPU would be detected and an excessive list of cuda architectures might be used. This also means that there was no way to build a binary targeting a different SM than that of the machine a developer is using.
After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.
p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.
See draft, developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md
This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:
* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Add a few missing returns in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
It's better because you return less data, encapsulate more, and no longer need special handling of the symbolic vs non-symbolic vector in `dim()`. Also removes a few casts and fixes a few potential edge cases relating to unsigned comparisons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155334
Approved by: https://github.com/ezyang
The downstream consumer of the 2D all-to-all-v is often a grouped GEMM.
Today the GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._grouped_mm` requires an alignment of 8.
This PR adds that alignment capability: when the user passes in a `major_align` argument, no extra padding step is needed.
The key to supporting that is making the output offsets aligned to this value. (Output offsets are returned to the users in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, is unaffected by this alignment value -- i.e. it reflects the true number of tokens for an expert.)
The algorithm computes the aligned output offsets via a prefix sum over the padded chunk sizes (the illustrating figure is omitted here).
In the detailed implementation, we use a warp scan to calculate that prefix sum. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
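A rough sketch of the idea in plain PyTorch (host-side; not the actual device-side warp-scan code): pad each chunk size up to `major_align`, then take an exclusive prefix sum to get the aligned offsets, while the true (unpadded) splits are reported separately.
```python
import torch

def aligned_offsets(splits: torch.Tensor, major_align: int) -> torch.Tensor:
    # round each chunk size up to the alignment
    padded = ((splits + major_align - 1) // major_align) * major_align
    # exclusive prefix sum: offset of chunk i is the padded size of chunks < i
    return torch.cumsum(padded, dim=0) - padded

print(aligned_offsets(torch.tensor([5, 12, 3]), 8))  # tensor([ 0,  8, 24])
```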
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
Summary: Log backward no-op to tlparse and pt2 compile events.
Test Plan:
$ rm -rf /tmp/r && TORCH_TRACE=/tmp/r buck2 run //scripts/jovian:backward_noop_repro_compile
Used print statements to verify we enter the logging code region.
Differential Revision: D75231665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154544
Approved by: https://github.com/c00w
# Feature
If `config.triton.autotune_at_compile_time` is set to `True`, autotune Triton kernels during FX conversion. Else, stick with the existing behavior of using the first precompiled config.
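A usage sketch of the flag named above (the config attribute is quoted from this PR; whether it takes effect for the FX conversion path is what this PR adds):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.autotune_at_compile_time = True  # opt in at compile time

@torch.compile
def f(x):
    return x * 2

f(torch.randn(8, device="cuda"))
```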
# Test plan
Added CI tests verifying that the tuner is called iff this flag is set, with and without dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155049
Approved by: https://github.com/jansel
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
Source: | Rank 0 | Rank 1 |
| c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |
Dest : | Rank 0 | Rank 1 |
| c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` is a slice of the `input` tensor targeting expert `i`, with length indicated by the input splits (in `in_out_splits[0]`).
That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
The purpose of this change is to reduce the scope of s390x CI so it stops potentially blocking usual workflows for other users, while still keeping nightly builds and tests for me to look at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155208
Approved by: https://github.com/malfet
Metal arguments must be 8-byte aligned (or perhaps 16-byte), so running any strided (or typecasted) binary op with MTL_DEBUG_LAYER leads to an exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```
Extend it to 4 elements and pass the output dtype, which will be used by binary_op later on anyway.
Test plan: Run the abovementioned command with `MTL_DEBUG_LAYER=1` and make sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
Prompt for Sonnet 3.7 in Claude Code: Only inspect tools/nightly.py, all
other files are irrelevant to your task. Do not use any shell commands.
Task: Add a --detach argument to this script which instead of making a
new branch just directly checks out the correct commit in detached mode.
With two interventions:
- Branch and detach are mutually exclusive. So you should consolidate
them into a single argument. Why don't we take over the 'None' option?
- Do you know that nightly_version is guaranteed to be a commit hash? It
seems it would be safer to explicitly pass --detach
I tested by running `python tools/nightly.py checkout` and observing
that my worktree was detached at this point.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154314
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
## Description
Fixes a typo in the comment of `torch/fx/experimental/proxy_tensor.py`, changing "intialize" to "initialize".
## Issue
None
## Type of change
- [x] Typo fix
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155301
Approved by: https://github.com/jingsh, https://github.com/ezyang, https://github.com/cyyever
Fixes#122757
The fix was lost after the revert and rebase of the previous PR https://github.com/pytorch/pytorch/pull/150750 (only the test changes were merged).
## Test Result
```python
>>> import torch
>>>
>>> model_output = torch.randn(10, 5).cuda()
>>> labels = torch.randint(0, 5, (10,)).cuda()
>>> weights = torch.randn(5)
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
>>> loss = loss_fn(input=model_output, target=labels)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
return F.cross_entropy(
^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3476, in cross_entropy
return torch._C._nn.cross_entropy_loss(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155085
Approved by: https://github.com/mikaylagawarecki
Summary:
For simple FSDP, this `visualize_overlap` function throws errors.
This seems to be a mistake: `visualize_overlap` is called twice here, once inside a try-except and once not, so this PR does the same for both places.
Test Plan:
:)
Reviewed By: Microve
Differential Revision: D75985733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
Summary: If a model name is specified in aoti config, the generated files will use that model name as file stem.
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```
Differential Revision: D75102034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
Summary:
Adding "from_node" information that indicates which nodes are unlifted in `.module()` call.
The lifted nodes will have "ExportedProgram.module().unlift()" passname in the last entry of from_node.
Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```
Reviewed By: angelayi
Differential Revision: D75837494
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155053
Approved by: https://github.com/angelayi
Previously, we didn't disable the functionalization key when materializing the backward graph. This caused the `torch.zeros_like` call (for the case where grad is None) to return a functional tensor that's not tracked by the proxy tensor mode.
This PR fixes it by putting the tracing code under the disable-functionalization context manager.
Fixes https://github.com/pytorch/pytorch/issues/153437.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
Fixes#154938
When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8
Regarding the revert & reland:
* This PR causes the python 3.13 version to be bumped from 3.13.2 to 3.13.3. test_deopt_from_append_list starts unexpectedly passing on 3.13.3, so I originally modified the test in https://github.com/pytorch/pytorch/pull/155167 to xfail only for <=3.13.2.
* However, there was a land race with https://github.com/pytorch/pytorch/pull/150796, which introduced another test that passes only for >=3.13.3.
Resolution:
* @guilhermeleobas reverted https://github.com/pytorch/pytorch/pull/150796, so I will reland this (and I've merged the test_deopt_from_append_list change into this PR). Based on Guilherme's feedback, I'm just skipping the test instead of selectively failing/passing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
Fixes#153133
Fixes an inconsistency in torch.arange on the CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in the validation logic.
The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.
This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.
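A minimal sketch of the validation idea in Python (Python floats are already float64; this is not the actual C++ implementation):
```python
def check_arange_args(start: float, end: float, step: float) -> None:
    # promote to double before validating, so float32 rounding cannot make an
    # invalid range (e.g. start > end with a positive step) look valid
    s, e, st = float(start), float(end), float(step)
    if not ((st > 0 and s <= e) or (st < 0 and s >= e)):
        raise ValueError("upper bound and lower bound inconsistent with step sign")
```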
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops
Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a "don't constant fold" flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
Fixes#132031
## Test Result
```python
In [1]: import torch
...: torch.manual_seed(0)
...: torch.cuda.manual_seed(0)
...: a = torch.randn(3, 4)
...: b = torch.randn(3, 4)
...: torch.cross(a, b, out=a)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 6
4 a = torch.randn(3, 4)
5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)
Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases, since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)
In rare cases, this does not work. One example is saved tensors for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression, not a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and later there would be an error in `_verify_input_symbol_assignment`.
The fix is to add symints to `get_graph_inputs`. An alternative would be to update `codegen_input_symbol_assignment`, but I want to limit the change to graph partition only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
Not sure why, apparently this test starts passing on python 3.13.3 (while it fails on python <=3.13.2) and it's causing unexpected passes on xfail-ed tests when newer versions of python are used, e.g. in #155056.
Verified locally in a python 3.13.1 vs. python 3.13.3 conda env.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155167
Approved by: https://github.com/williamwen42
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll end up triggering a "too many values to unpack" issue: https://github.com/pytorch/pytorch/issues/154536
Specifically, if you populate PyCodeCache's linemap via caching and then request the stack frames of an inductor-generated output file that has an empty linemap, this function will try to unpack too many arguments.
Test plan:
```
import os
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"
import torch
@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad
x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```
Run this twice and see that everything works as expected.
It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:
- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time
- The inductor generated output needs to call *back* into dynamo via `assert_size_stride`
- In our test case, the CompiledBackward needs to not have linemaps, and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order in which PyCodeCache.linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.
All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.
I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call in between the `backend._set_enable_nan_check` calls.
Why was it "working" previously? Because previously only cu118 distributed was running, and this `backend._set_enable_nan_check` change was not tested in the merge process (the skip logic is: if not CUDA 12 or above, skip).
Workaround #153479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
Tests:
* test_generators.py
* test_generator_stop.py
* test_contextlib.py
Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.
The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:
```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
Summary:
There are some cases where we want only local annotations for the memory snapshot, such as when executing inside a cudaStream callback, which cannot execute CUDA operators; otherwise CUDA errors occur: `Exception in RecordFunction callback: CUDA error: operation not permitted`.
However, we need an option to turn annotations on globally so that on-demand snapshots can get annotations. Additionally, there may be some cases in which auto-trace will also want annotations using record functions, so we expose the flag to auto-trace as well.
Test Plan:
Run MVAI executable and see that the errors go away
Differential Revision: D75831687
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932
Approved by: https://github.com/mzzchy, https://github.com/sanrise
Summary: The Gloo PG doesn't release the GIL, which results in Python code hanging until the destructor completes. The destructor waits for all work on the PG to complete, which can take a long time.
Test Plan: Ran
```
$ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py
```
with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976
Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj
"Can't be indexed using 32-bit iterator" is not really helpful error
This PR distinguishes between error from old indexing helper function as well as to binaryTensorIterator
Adds the same warning to unary op, otherwise it just runs and returns incorrect value
Test plan (manual, don't have machine with enough RAM to run it reliable in CI):
```
% python -c "import torch;print(torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps'))"
RuntimeError: add can't be indexed using 32-bit iterator for shape [1048576, 5000]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155150
Approved by: https://github.com/Skylion007, https://github.com/dcci
The user can now use the terminal to update the registry whenever they update an existing gb_type's properties. Additionally, if the user changes the gb_type description itself, they can update the registry as well.
Terminal command template for updating an existing gb_type: `python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite]`
Terminal command template for renaming an existing gb_type (can also be used if the user changed the other properties as well, including the gb_type name): `python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite] --new_gb_type "new_name_for_existing_gb_type"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154985
Approved by: https://github.com/williamwen42
Fixes#144522
## Description
The FX QAT docs for `prepare_qat_fx` incorrectly used `get_default_qat_qconfig` where `get_default_qat_qconfig_mapping` should be used: `prepare_qat_fx` requires a `qconfig_mapping`, not a single `qconfig`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155100
Approved by: https://github.com/jerryzh168
Summary:
looks like CPU was enabled for update_constant_buffer in D71177509, so enable these tests as well.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_without_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_user_managed_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_update_weights" -v
```
Differential Revision: D75908993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155078
Approved by: https://github.com/angelayi
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.
In distributed tests:
- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of `unittest.main()`
Cc @wconstab @clee2000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
This PR is part of a series attempting to re-submit #134592 as smaller PRs.
In fx tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of `unittest.main()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715
Approved by: https://github.com/Skylion007
Summary: As title. See added test for more context.
Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output_2
Differential Revision: D75900658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155073
Approved by: https://github.com/angelayi
Refactor the pt2 archive saving to consolidate the format of torch.export.save and torch._inductor.package.package_aoti.
This PR adds the following functions, which torch.export.save and AOTI packaging calls into:
```python
package_pt2(
    f: FileLike,
    *,
    exported_programs: Optional[Union[ExportedProgram, dict[str, ExportedProgram]]] = None,
    aoti_files: Optional[Union[list[str], dict[str, list[str]]]] = None,
    extra_files: Optional[dict[str, Any]] = None,
) -> FileLike

@dataclass
class PT2ArchiveContents:
    exported_programs: dict[str, ExportedProgram]
    aoti_runners: dict[str, AOTICompiledModel]
    extra_files: dict[str, Any]

load_pt2(f: FileLike) -> PT2ArchiveContents
```
Power users can directly call into these APIs if they want to bundle multiple exported programs, aoti files, or extra metadata.
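A usage sketch of these APIs (the import path is an assumption; this description only pins down the signatures):
```python
import torch
from torch.export.pt2_archive import package_pt2, load_pt2  # path assumed

class MyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(MyModel(), (torch.randn(2, 4),))
package_pt2("bundle.pt2", exported_programs={"model1": ep})

contents = load_pt2("bundle.pt2")
restored = contents.exported_programs["model1"]
```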
This is what the pt2 archive looks like ([spec](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?tab=t.0)):
```
├── archive_format
├── version
├── .data
├── data
│ ├── aotinductor
│ │ └── model1
│ │ ├── model1.cpp
│ │ ├── model1.so # currently AOTI automatically moves weights in here, TODO to move it out
│ │ ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│ │ └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│ ├── weights
│ │ ├── model1.pt # TODO to dedup weights between model1/model2
│ │ └── model2.pt
│ └── constants
│ │ ├── model1.pt # TODO to dedup weights between model1/model2
│ │ └── model2.pt
│ └── sample_inputs
│ ├── model1.pt # TODO to dedup weights between model1/model2
│ └── model2.pt
├── extra
│ └── user_metadata.txt
└── models
├── model1.json
└── model2.json
```
Future todos:
- unbundle the weights -- instead of .pt, we can use bin files, which will also allow us to dedup weights if we store multiple models
- update aoti_compile_and_package to also save the exported program
- integrate TNR with this packaging flow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152495
Approved by: https://github.com/yushangdi
In OneDNN v3.7, SDPA has the defects below:
1. The dtype of intermediate values is the same as QKV, while PyTorch uses FP32 for intermediate values to ensure better accuracy.
2. Only head dim sizes <= 256 are supported.
3. The implicit causal mask is not supported when QKV is FP32; we need to build an attention mask explicitly with aten ops.
OneDNN v3.8 fixes these defects. Since these are tiny changes, I decided to put them in a single PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, aka the watchdog and heartbeat-monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PGs or subPGs, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Both threads are currently assigned so many responsibilities that it is hard to do the consolidation in one shot, so we split it into at least two steps (PRs) to make it easier to test and review.
We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR does the first step of making the monitoring thread a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.
What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, which is more relevant. This is totally fine since we have fully rolled out EventCache, so watchdog hangs are rare now.
Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. Check TCPStore every 30 sec to see if any watchdog timeout has happened on other ranks; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)
Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
# Feature
This PR fixes two bugs with Inductor's FX backend.
1. When extracting offsets from `ReinterpretView`'s, we accidentally took the offset of the parent layout instead of the view's layout. This case is triggered when multiple kernels write into the same buffer due to `torch.cat`.
2. In certain rare cases, `V.graph.graph_inputs` can contain a constant input value. In case this happens, create a new `sympy.Symbol` for the input, for compatibility with the existing `SymbolBuffer` abstraction mapping to an FX placeholder. This case is triggered when calling `torch._inductor.compile` on certain modules coming from `torch.export`.
# Test plan
Added a couple of tests exposing these bugs.
1. Concat with multiple kernels writing to the same buffer.
2. `Export` -> `torch._inductor.compile` with a constant input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154958
Approved by: https://github.com/jansel
This PR registers the RotaryEmbedding op in the `torch.ops.onnx` namespace and allows the exporter to recognize and export onnx operators.
## Design
ONNX operators of their respective opset versions are implemented in torch/onnx/ops/_impl.py and are registered in the torch.ops.onnx namespace following this rule:
`OpType-version => torch.ops.onnx.OpType.opset{version}`
For example, `RotaryEmbedding-23` becomes `torch.ops.onnx.RotaryEmbedding.opset23`
This name is parsed by the exporter to create an onnx node in the graph without having to go through translation.
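A small sketch of that naming rule in use (availability depends on the installed torch build):
```python
import torch

# RotaryEmbedding-23 maps to torch.ops.onnx.RotaryEmbedding.opset23
versioned_op = torch.ops.onnx.RotaryEmbedding.opset23
```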
When users use the ops in a model, we provide more convenient, unversioned functions under `torch.onnx.ops` that will dispatch to the implementations based on user input (type and provided attributes). For example, users can directly call `torch.onnx.ops.rotary_embedding()` to use the op natively in their PyTorch models. I chose snake_case naming to make the functions more Pythonic and aligned with other torch APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154745
Approved by: https://github.com/titaiwangms
This PR uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling; then we would sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristic scores each tiling by its globally coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) and information we can use in generating block sizes.
In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.
The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.
While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:
1. Add header file
2. Add source file
3. **Add test and build rules**
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:
1. Add header file
2. **Add source file**
3. Add test and build rules.
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75492405](https://our.internmc.facebook.com/intern/diff/D75492405/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154531
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530
We split the large PR for adding Graph.h and Graph.cpp to `nativert` into 3 smaller PRs:
1. **Add header file**
2. Add source file
3. Add test and build rules.
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75491860](https://our.internmc.facebook.com/intern/diff/D75491860/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154530
Approved by: https://github.com/SherlockNoMad
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I add a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed the `--output-dir` to `--output-csv` for a clearer and more intuitive expression.
An example of the JSON output of the operator benchmark.
```
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  }
]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
Triton 3.4 will remove the experimental TMA apis: https://github.com/triton-lang/triton/pull/6488
To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API.
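A minimal sketch of such a shim (both attribute names are taken from PR descriptions in this log; treat the exact fallback name as an assumption):
```python
import triton.language as tl

# prefer the stable descriptor builder when this Triton version provides it,
# otherwise fall back to the experimental API slated for removal in 3.4
if hasattr(tl, "make_tensor_descriptor"):
    make_descriptor = tl.make_tensor_descriptor
else:
    make_descriptor = tl._experimental_make_tensor_descriptor
```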
Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda`, which previously failed with triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8 but now passes.
Note: we'll need to apply this for other things in inductor, this just does it for flex attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154858
Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.
To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
By changing the functor to look as follows
```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10::metal::xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10::metal::xlog1py(float(a), float(b));
  }
};
```
```
Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #154936
## Summary
Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.
## Changes
- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed `# mypy: allow-untyped-defs` comment
- Created Literal types for the kaiming initialization mode and nonlinearity.
- Created `__all__`
## Why
Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements.
Tested with existing test suite and lintrunner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154504
Approved by: https://github.com/Skylion007
Find variables that coalesce the reads and writes, and score the total size. If uncoalesced memory expressions are found, look for additional tilings of variables that will coalesce the memory accesses.
For instance, for the expression `(32*p0) // 2048`, tiling p0 by 64 will make this expression coalesced, as checked below.
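A quick arithmetic check of that claim: with `p0 = 64*q + r` and `0 <= r < 64`, the expression equals `q`, so it is constant within each tile of 64.
```python
# verify (32*p0) // 2048 == p0 // 64 for a range of indices
for p0 in range(4096):
    assert (32 * p0) // 2048 == p0 // 64
```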
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153748
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730
In order to take the globally best tiling, we need to normalize all the node read and writes to a common iteration space. This first pr finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153723
Approved by: https://github.com/jansel
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
During the integration of FR with Gloo, I found that putting all the logic inside one cpp file guarded by both build macros does not work with the current linkage setup in the Bazel file. If we put the cpp in libtorch_cpu, the cuda-side build fails; if we put it in both, we get `ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter`. To fix this, we move the common logic into a separate header file and use different cpp files for CPU and CUDA so that FR can be used in both cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154929
Approved by: https://github.com/kwen2501
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.
Test Plan: CI
Differential Revision: D75756619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
resolve https://github.com/pytorch/pytorch/issues/154655
`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model would be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, FLUX; we should reshard the parameters of CLIP and T5 immediately after their forward
* for recommendation models, we may have multiple roots for the dense part
Change the default behavior to always respect `reshard_after_forward=True`.
Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
Triton has added another artifact that gets generated (triton-lang/triton#6992), so `test_cache_load_function` started failing as there are now 8 (instead of 7) artifacts.
Instead of figuring out a way to check exactly which set of artifacts will get generated, I instead modified the test to just check that there are _at least_ 6 artifacts, to account for different platforms (intel/amd/nvidia) and different triton versions (which may or may not have a `.source` artifact)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154879
Approved by: https://github.com/oulgen, https://github.com/masnesral
Summary:
This was implemented in SR because caching of runtime instances built up and caused memory usage spikes after a large amount of traffic went through the model; once traffic went down, SR was still caching all the previous usage.
We need something similar on the Sigmoid side to make sure the static dispatch modules aren't hogging memory. Currently, all ExecutionFrame objects are cached and never freed when stale.
Test Plan:
Added extra execution frames in tmp commit D75257998 and ran local replayer test to confirm extra execution frames get cleaned up down to min size, which is set at 8
{F1978532047}
Also tested by modifying load_net_predictor (modifications also in D75257998) to run benchmarkNumIterations twice - once with benchmarkNumThreads, and once with only one thread. Also set clearing interval at one second. Verified that execution frames get cleared when we drop down to one thread.
{F1978558984}
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```
Differential Revision: D75257992
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154636
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.
Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #154769
Replace the git version so the whl name goes from `torch-something+git<old commit>` to `torch-something+git<new commit>`.
Renamed a bunch of variables to hopefully be clearer.
Tested on ef210ad54b
* Removed gating that prevents it from running on PRs (which is going to be merged soon)
* Removed gating that checks which files can be changed (since this PR has stuff outside of the acceptable list)
* The above two allow the whl to be reused, and I added `assert 1 == 2` in common_utils and checked that jobs failed (meaning they were using the updated code despite not building)
Checked that the whl in the docker image has the right commit sha, didn't check torch.__version__ though
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154773
Approved by: https://github.com/malfet
----
* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py
Minor changes were made to each test to run them inside Dynamo
One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
Fixes#154615
Enables ConvTranspose3D, since support seems to exist on both macOS 14 and 15.
For the half dtypes, the discrepancy between the CPU and GPU implementations is too large to conclude whether there is a bug in the implementation without a more rigorous study of what bounds on the expected error are reasonable. So they are left unsupported for now, and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.
Tests for ConvTranspose3D were enabled for the supported data types.
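A minimal usage sketch on a supported dtype (float32; fp16/bf16 now assert, per the note above):
```python
import torch

conv = torch.nn.ConvTranspose3d(4, 8, kernel_size=3, device="mps")
out = conv(torch.randn(1, 4, 5, 5, 5, device="mps"))
print(out.shape)  # torch.Size([1, 8, 7, 7, 7])
```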
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
Updates cpp-httplib to 0.20.1. This mostly brings OSS a bunch of CMake fixes, CXX compiler error fixes, and bugfixes from upstream. It's a header-only library, so the upgrade should be pretty straightforward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154825
Approved by: https://github.com/malfet
Which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`
`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch
x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
x[y] = 2
print(x)
except torch.AcceleratorError as e:
print("Exception was raised", e.args[0])
print("Captured error code is ", e.error_code)
```
which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0
Captured error code is 710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
The last update to this submodule was 3 years ago; the API is pretty stable, and this is a minor version release update. Part of a series of PRs to eradicate low required CMake versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154796
Approved by: https://github.com/jansel
Update NVTX3 submodule to 3.2.1.
* Mostly improved compiler support, Python support, and better CMake and C++ support.
* Also has a few new APIs to support fancy new features.
* This is a header-only library, so it should be an easy, non-invasive change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154797
Approved by: https://github.com/jansel
Previously, @Chillee wrote a script (https://github.com/pytorch/pytorch/pull/125811) to remove the inductor dependency from inductor-compiled triton kernels. We'd like to automate the process of obtaining the launch parameters.
Added functionality to the torch/utils/_get_clean_triton.py to automatically generate the launch_params file if it does not exist and the auto_generate_params flag is set to True. This includes running the input file in a subprocess with the appropriate environment variable. Updated the get_clean_triton function and the main script to support this new feature, allowing users to disable auto-generation via a command-line argument.
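A hypothetical programmatic invocation; the exact signature of `get_clean_triton` and the flag name are assumed from the prose above, not verified against the source:
```python
from torch.utils._get_clean_triton import get_clean_triton

# auto_generate_params=True (assumed kwarg) runs the input file in a
# subprocess with the launch-params environment variable set, producing
# the launch_params file if it does not already exist.
get_clean_triton(
    "torchinductor/model__0_forward_1.0/output_code.py",
    "triton_only_repro.py",
    auto_generate_params=True,
)
```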
# Test Plan
test embedding op in TritonBench
```
# generate inductor compiled triton kernels
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python run.py --op embedding --mode fwd --precision fp32 --metrics nsys_rep --only inductor_embedding --num-inputs 1 --input-id 11
# run the script to get rid of inductor dependency. By default, triton_only_repro.py is the output file name.
python ~/pytorch/torch/utils/_get_clean_triton.py ~/tritonbench/torch_compile_debug/run_2025_05_29_14_47_50_497790-pid_849274/torchinductor/model__0_forward_1.0/output_code.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154666
Approved by: https://github.com/davidberard98
Upstream triton has moved setup.py from python/ to the repository root. This PR keeps both layouts buildable by checking the location of setup.py and choosing the cwd of the build commands accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154635
Approved by: https://github.com/atalman
**Context**:
AMD triton kernels can be launched with special kwargs, like `waves_per_eu`. Triton configs with these kwargs look like this:
```
triton.Config({
    "BLOCK_SIZE": 64,
    "waves_per_eu": 2,
})
```
In comparison, NVIDIA's special kwargs are explicit parameters on the config, e.g. num_warps:
```
triton.Config(
    {"BLOCK_SIZE": 64},
    num_warps=4,
)
```
**Problem**: this causes custom triton kernels w/ PT2 to error out, because there's a kwarg in the triton.Config that doesn't appear in the kernel signature.
**Solution**: When splicing in the constexpr values into the arg list, ignore any values in the config kwargs list if they don't appear in the function signature.
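A minimal sketch of the filtering idea (not the actual inductor code; names here are illustrative):
```python
import inspect

def filter_config_kwargs(kernel_fn, config_kwargs):
    # Keep only config kwargs that appear in the kernel signature, so
    # AMD-only kwargs like waves_per_eu are dropped for kernels that
    # do not declare them.
    sig_params = set(inspect.signature(kernel_fn).parameters)
    return {k: v for k, v in config_kwargs.items() if k in sig_params}

def kernel(x_ptr, BLOCK_SIZE):  # stand-in for a @triton.jit kernel
    pass

print(filter_config_kwargs(kernel, {"BLOCK_SIZE": 64, "waves_per_eu": 2}))
# -> {'BLOCK_SIZE': 64}
```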
Differential Revision: [D75599629](https://our.internmc.facebook.com/intern/diff/D75599629/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D75599629/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154605
Approved by: https://github.com/njriasan
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
If we avoid storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation.
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
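As a rough illustration of the change in 2) (a sketch only; the `.real_tensor` placement is per the description above, and the helper is hypothetical):
```python
import torch.fx as fx

def strip_intermediate_real_tensors(gm: fx.GraphModule) -> None:
    # Keep real tensors only on placeholder metadata; intermediates can
    # then be GC-ed as their variables go out of scope during fake prop.
    for node in gm.graph.nodes:
        if node.op != "placeholder":
            val = node.meta.get("val")
            if val is not None and hasattr(val, "real_tensor"):
                val.real_tensor = None
```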
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.
Test Plan:
NaN asserts properly generated in example gemm script:
vars = (buf1, primals_2, buf2, primals_1, )
for var in vars:
    if isinstance(var, torch.Tensor):
        assert not var.isnan().any().item()
        assert not var.isinf().any().item()
Differential Revision: D74691131
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
This is a follow-up to the reverted PR https://github.com/pytorch/pytorch/pull/148981, re-opened for visibility:
Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.
Motivation
Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.
Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.
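A hedged sketch of how this could be used (the JSON field name here is assumed, not the exact schema):
```python
import json
import os

with open("kernel.best_config") as f:
    best = json.load(f)

# "triton_cache_hash" is an assumed field name for the stored cache key.
cache_key = best["triton_cache_hash"]
print(os.path.expanduser(f"~/.triton/cache/{cache_key}"))  # matching cache dir
```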
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618
Approved by: https://github.com/jansel
If `TEST_TENSORBOARD == False`, then `DataType` is not defined or imported. However, it is used unconditionally when defining the test with `parametrize`, which leads to a NameError crashing the test execution on start.
Provide a dummy to make it syntactically correct. The tests will be skipped on start.
```
File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 885, in <module>
class TestTensorProtoSummary(BaseTestCase):
File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 889, in TestTensorProtoSummary
(torch.float16, DataType.DT_HALF),
^^^^^^^^
NameError: name 'DataType' is not defined
Got exit code 1, retrying...
test_tensorboard 1/1 failed! [Errno 2] No such file or directory: '/dev/shm/build/pytorch-v2.2.1/.pytest_cache/v/cache/stepcurrent/test_tensorboard_0_0dba8bc00bbe233f'
```
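A minimal sketch of such a guard (the dummy's members are assumed; the real test file derives `TEST_TENSORBOARD` from an import check):
```python
TEST_TENSORBOARD = False  # stand-in for the test file's import check

if TEST_TENSORBOARD:
    from tensorboard.compat.proto.types_pb2 import DataType
else:
    class DataType:  # dummy so parametrize is syntactically valid
        DT_HALF = None
        DT_FLOAT = None
        DT_BFLOAT16 = None
```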
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154709
Approved by: https://github.com/Skylion007
Use system NCCL by default. The correct NCCL version is already built into the Manylinux docker image.
Will follow up with a PR that detects whether the user has NCCL installed and enables USE_SYSTEM_NCCL by default in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
Summary: When setting the memory snapshot callback, we register and unregister callbacks for performance reasons. For ease of use, it makes sense to just remove all callbacks regardless of which flags are enabled. Enabling stays behind a feature flag; this just changes the disable path to ignore the flag.
Test Plan: Ran without any flags and saw all callbacks removed.
Differential Revision: D75636035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154664
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary:
This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.
There have been a lot of dynamic shape fixes done this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.
Test Plan: Reran the tests listed in T196779132 and they pass.
## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.
Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.
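A minimal sketch of the idea, assuming the `torch._guards` internals (`tracing` and `TracingContext`); not the exact code from the PR:
```python
from torch._guards import TracingContext, tracing

def compile_with_shape_env(gm, example_inputs, fake_mode, compile_fx):
    # The ShapeEnv rides along on fake_mode inside the TracingContext,
    # so FXGraphCache can reach it during the specialized compile.
    with tracing(TracingContext(fake_mode)):
        return compile_fx(gm, example_inputs)
```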
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward.
Here, I remove the inductor event, and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports, this will let internal UIs track each graph's contribution to the timeout.
```python
class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```
Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
## Summary
@laithsakka and I spoke offline about this one. TL;DR: we wanted the `guard_or_false`/`guard_or_true` behavior to also hold for Inductor. In that vein, we added two new APIs to size-vars, `guard_or_false` and `guard_or_true`, with the following semantics:
These APIs may add guards, but will never fail with data-dependent errors. They will try to evaluate the expression, possibly adding guards; if that fails due to data dependency, False or True is returned instead of hard-failing.
When to use this?
* Performance optimizations that warrant a recompilation.
* Taking the general path and adding a runtime check.
```
# Consider this branching:
if x == 0:
    return 1
else:
    return 10

# To make it data-dependent friendly, it can be written as follows:
if guard_or_false(x == 0):
    return 1
else:
    torch.check(x != 0)  # runtime check
    return 10
```
However, there is still one more API needed to make this example work: `torch.check`, which works with expressions. I will leave that to @laithsakka.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154672
Approved by: https://github.com/laithsakka
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.
There are really two parts to this work:
**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.
**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
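An illustrative sketch of the lazy dispatch in 3) (not the actual dynamo code; the predicate/compile interfaces are assumptions):
```python
class LazySpecializedDispatch:
    def __init__(self, specializations, compile_spec, generic_fn):
        # specializations: list of (matches_fn, spec) pairs, strictest first
        self.specializations = specializations
        self.compile_spec = compile_spec   # compiles one specialization
        self.generic_fn = generic_fn       # pre-compiled generic graph
        self.cache = {}

    def __call__(self, *args):
        for idx, (matches, spec) in enumerate(self.specializations):
            if matches(*args):
                if idx not in self.cache:  # compile lazily on first hit
                    self.cache[idx] = self.compile_spec(spec)
                return self.cache[idx](*args)
        return self.generic_fn(*args)      # fall back to the generic graph
```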
I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433
Summary:
When a C++ custom op returns an uninitialized tensor, it will be marked as None in Python. For this scenario, the user should mark the possibly uninitialized return as Tensor? in the custom op schema.
This diff adds `as_optional_tensor` type to export schema and the support for optional tensor in AOTI proxy executor.
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output
```
Differential Revision: D75262529
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154286
Approved by: https://github.com/desertfire
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.
There are really two parts to this work:
**The frontend changes (this PR):**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.
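A hedged usage sketch of the kwarg (the exact predicate form is an assumption):
```python
import torch
import torch._dynamo

x = torch.randn(8, 16)
# Trace once with a dynamic dim 0; the backend may additionally compile
# specialized graphs for sizes matching each predicate.
torch._dynamo.mark_dynamic(
    x, 0, specialize_on=[lambda s: s == 8, lambda s: s % 64 == 0]
)
```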
**The backend changes:**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153433
Approved by: https://github.com/zou3519
Summary: Add the conv padding ops in PyTorch; the corresponding PR in torchao is https://github.com/pytorch/ao/pull/2257
Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Differential Revision: D75494468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/torchtitan/issues/1185
It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
I'm not sure how worried we should be on the blast radius of this change. On the one hand:
(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)
(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
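A minimal sketch of the keying rule the fix implies (not the actual inductor code):
```python
def cache_key_items(config_dict, save_config_ignore):
    # Include private (leading-underscore) configs in the cache key too,
    # skipping only the configs explicitly listed as ignorable.
    return sorted(
        (k, v) for k, v in config_dict.items() if k not in save_config_ignore
    )
```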
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
ghstack dependencies: #153766
Summary: AMD streams are lazily initialized, and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while the stream is initializing, which would lead to an invalid configuration error.
Differential Revision: D75456460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, aka the watchdog and heartbeat-monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PG or subPG instances, we end up with twice that many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned so many functionalities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.
We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR takes the first step by making the monitoring thread a class. The next step is to extract the watchdog into a separate class as well, so that we know its dependencies.
What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, which is more relevant. This is totally fine since we fully rolled out EventCache, so watchdog hangs are rare now.
Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. We check TCPStore every 30 sec to see if any watchdog timeout happened on other ranks; if so, we initiate a dump signal on the current rank as well (we do this only in the default PG).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
The `parent` in fallback_node_due_to_unsupported_type is a duplication of the `unsupported_output_tensor` logic; remove it. Tested that the tests in test_add_complex give the same codegen. This fixes an issue in mx that @drisspg was running into.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
Fix for https://github.com/pytorch/pytorch/issues/152425
Inductor specializes on whether or not a tensor is 16-byte aligned at the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.
If we infer alignment, then run with unaligned, and mutate the input, we need to reflect back the mutation to the input. This pr adds back that mutation.
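A conceptual sketch of the fix (the 16-byte alignment constant follows the specialization described above; the helper is hypothetical):
```python
def run_mutating_kernel(kernel, x):
    # Kernel was compiled assuming 16-byte-aligned inputs.
    if x.data_ptr() % 16 != 0:
        tmp = x.clone()   # fresh allocation, hence aligned
        kernel(tmp)
        x.copy_(tmp)      # reflect the mutation back to the caller's tensor
    else:
        kernel(x)
```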
We could also have been less aggressive about inferring alignment for mutated tensors, but that has a pretty big perf hit. See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.
In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions and handles un-alignment in a prologue/epilogue. But it's yet to be seen whether this is a huge issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
## Overview
This PR optimizes the FlexAttention CPP template with block sparsity.
Block sparsity is natively supported in FlexAttention's block mask structures, so following the kv-block logic from `kv_indice` and `full_kv_indice` is the straightforward way to add this optimization.
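An illustrative sketch of the traversal (field names lightly normalized from the prose; exact layouts are assumptions):
```python
def iter_kv_blocks(q_block, kv_num_blocks, kv_indices,
                   full_kv_num_blocks, full_kv_indices):
    # Partially-masked KV blocks: the mask must still be applied per element.
    for i in range(int(kv_num_blocks[q_block])):
        yield int(kv_indices[q_block, i]), True
    # Fully-unmasked KV blocks: mask evaluation can be skipped entirely.
    for i in range(int(full_kv_num_blocks[q_block])):
        yield int(full_kv_indices[q_block, i]), False
```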
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147196
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~
We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
Summary: as the title says, the clamp type promotion should take the min/max args into consideration as well.
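A short sketch of the behavior under test (the expected dtype follows standard type-promotion rules; this is an illustration, not the test itself):
```python
import torch

x = torch.randint(0, 10, (4,), dtype=torch.int32)
y = torch.clamp(x, max=torch.tensor(5.5))  # float max participates in promotion
print(y.dtype)  # torch.float32 once min/max are considered
```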
Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```
Differential Revision: D75490124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78