By leaking resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887 ) at exit, as at this point handle to child process might no longer be valid
Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix as all data loader tests will hang otherwise
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other action are messing with path environment variable)
Fixes https://github.com/pytorch/pytorch/issues/153050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
Users may want CUDAGraph for certain sizes and fallback for other sizes.
As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.
Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.
This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.
Example:
```python
import torch
torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]
def f(x):
return x + 1
f = torch.compile(f, mode="reduce-overhead", dynamic=False)
def run(batch_size, seq_len, d):
x = torch.randn((batch_size, seq_len, d), device="cuda")
# Need to mark the dimension as dynamic. Automated-dynamic
# may have some ux issues on matching `cudagraph_capture_sizes`
# with the actual dynamic shapes, since there are specialization and
# multiple dynamo graphs.
torch._dynamo.mark_dynamic(x, 0)
torch._dynamo.mark_dynamic(x, 1)
for _ in range(3):
f(x)
for i in range(2, 10):
for j in range(2, 10):
run(i, j, 8)
num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
Don't call `sum()` on a tensor that is default constructed.
Previously we could call `sum()` on a tensor that was default-contructed. That would lead to an error like this:
```
Traceback (most recent call last):
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
method(*args, **kwargs)
File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
ln_out_cuda.backward(grad_output_cuda)
File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
torch.autograd.backward(
File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```
Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
At the end of the scope when std::async is launched, a wait will be called which could makes the code blocking, this is not expected for monitoring thread. Instead, let's use a vector to contain the reference to it. So no blocking will happen. And at the end of loop, wait will still be called but it is ok since all the checks or dump has already been finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
I had to create a new PR for this because of @atalman request of temporary reverting the previous PR to restore diff train sync. Nothing has changed from this PR and the original one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
Differential Revision: D76843916
Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially with configs of the following nature:
Excessive register spillage
Using much larger amounts of shared memory than available on the hardware
This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang, now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:
```
AUTOTUNE mm(1152x21504, 21504x1024)
strides: [21504, 1], [1, 21504]
dtypes: torch.bfloat16, torch.bfloat16
mm 0.1167 ms 100.0%
triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311
Test Plan:
Just changes string output
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 60.993us 0.97% 60.993us 1.848us 0 B 0 B 33
...
```
Rollback Plan:
Differential Revision: D76857251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation for the same input in forward AND in backward.
AOTD emited copy_ after joint_function tracing. This made this fx-node to correspond to the side effects of both mutations (in forward and in backward).
After that partitioner can put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle that allows to set "post_forward" callback, to be able to check inputs state after forward
We do not want to apply the mutation after joint, if we already applied it in forward. For that we need "mutation_counter" and memorize the version of mutation that we applied for forward mutation.
2/ Exposing mutation_counter to python
We want to keep invariant that copy_ exist only in the end of joint graph.
3/ We memorize mutation_counter and state of the inputs after forward, using the handle post_forward.
Emit post_forward mutations after joint graph fully traced.
add for post_forward mutations "must_be_in_forward" tag (similar to existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this set MUST_SAVE for the source of mutation in forward.
proxy_tensor changes:
By default proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want that this copy_ will be independent and applied just to primals.
For this introducing a contextmanager to be able to disable update of tensor_tracker for adding forward mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Differential Revision: D76514984
Fix subgraph as a choice for when a symbolic shape is inputted as an expression, i.e. 256 * s0, which typically happens in the backwards pass. The current logic assumes that all symbolic shapes are single inputs, i.e. standalone s0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
When timing is enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why I updated the test state check to be either "started" or "scheduled"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only want to run 1 model.
This PR adds a config to only install torchbench models we want to run.
Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"
# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"
# Install all models
make build-deps
# Install all models
make build-deps TORCHBENCH_MODELS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
In practice `bool(...)` is either constant folded by Dynamo or used for
branching (so most of its emulation logic lived in
`InstructionTranslator.generic_jump`.
This patch adds a dedicated `bool` hanlder (only for symbolic
bool/int/float for now), and fixes#136075.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
Summary:
For 16 GPU use-case. NVSHMEM can drive only upto 49GB/s with 8 thread blocks per peer for all to all V use-case. Increasing that to 16 threads per block is able to max out the perf.
Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=0 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=1 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Rollback Plan:
Differential Revision: D76937048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.
Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
Test Plan: CI
Differential Revision: D75497142
Co-authored-by: Benjamin Glass <bglass@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352
This PR uses torch._dynamo.utils.warn_once so that this warning only emits once, clarifies in the warning that silent incorrectness is potential, not observed, Doesn't warn for functions that come from torch.*
As of right now with this code change the terminal outputs:
if the code came from torch.* :
Nothing, as we shouldn't warn for functions that come from torch.*
else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', in the debug logs it shows(this log below is based on chillee's repro:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] torch.compile(foo, backend="eager")(torch.randn(4))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
- Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
- Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
- Final fence before flag update to ensure all data transfers complete before signaling
- Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
The CA initial trace just proxies nodes without dispatching any ops, we should hide it from ambient TorchDispatchModes
In terms of differences with eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
- accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
- Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
Example:
```python
File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
You can disable compiled autograd for this operation by:
1. Relocating the unsupported autograd call outside the compiled region.
2. Wrapping the unsupported autograd call within a scope that disables compiled autograd.
3. Configuring the specific compilation unit to disable compiled autograd.
4. Globally disabling compiled autograd at the application's initialization.
```
No duplicate error messages for python side trace-time errors
```python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
- Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
- Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
- End-to-end validation of both data transfer and signal synchronization
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
### Summary
int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.
### Perf data
GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.
| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373 |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
- A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
- Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`
```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)
@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)
...
if rank == 0:
print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")
if rank == 1:
print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")
...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉
...
----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).
This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).
The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.
Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig
Rollback Plan:
Differential Revision: D76628834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.
Remove eigen submodule and replace it with eigen_pin.txt
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
dynamo: Don't crash when someone tries to access a non existent list member
Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
Since it's now actually used within async_compile.multi_kernel
```
def multi_kernel(self, *args, **kwargs) -> Any:
from torch._inductor.codegen.multi_kernel import MultiKernelCall
# no need to call this in parallel since the sub-kernels are already parallel tasks
return MultiKernelCall(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Copied from original auto_functionalize Diff Summary D53776805:
This is a non-functional kernel implementation for auto_functionalize
In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.
This would mutates the input tensors inplace, which is unsafe in general.
However, Sigmoid is not doing any graph optimization, or node reordering at the moment, so it's ok do take this short cut.
In the proper functional implementation, it will
make a clone of the mutating input tensor
return these new instance of tensors as AutoFunctionalizeKernel output.
If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.
Test Plan: See internal for test plan
Differential Revision: D76926383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
In preparation of adding integer addmm, move matmul computation part into matmul_inner function
Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.
Remove eigen submodule and replace it with eigen_pin.txt
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices
### CHANGES
- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py
Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.
- torch/testing/_internal/common_distributed.py
extended common utility functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
Backwards pass simply iterates over all 8 points current point contributed to, and back propagates them with the respective weights
TODO: Benchmark the performance of similar loop for the forward pas (i.e. compiler should be able to do loop unrolling, so no point of unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
There is a memory layout mismatching between `fft_r2c` XPU and Inductor meta deducing.
Original `fft_r2c` Inductor meta deducing for XPU backend is aligned with CPU (fallback). This PR is to correct the Inductor meta deducing and update the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued & dequeued (should be the same) as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start.
Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph nor the weights. This is introduced to decouple the state management of the top level runtime from the kernel executions so that sub graphs from higher order ops can be supported.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
CI
Rollback Plan:
Differential Revision: D76830436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix#154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Commit fixes AOT compilation in sycl cpp extension which got accidentally dropped on aca2c99a652 (fallback to JIT compilation had happened). Commit also fixes override logic for default sycl targets allowing flexibility to specify targets externally. Further, commit extends test coverage to cover such a case and fixes issue in the test where consequent tests executed same (first) compiled extension due to name conflicts.
Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")
CC: @pengxin99, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
There are 32 H100 `linux.aws.h100` and they are still not fully utilized with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours. I add 1 more for `TIMM` and 3 more for `TorchBench` using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090
With this computing power, we could also run the whole suite every 4 hours now. I could run this less frequently later if I see queueing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`
Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
Recursive function collect_temp_source with closure in PyCodegen caused cycle reference issue when torch.compile is used.
This issue may cause major tensors will not freed timely even there are no user references to these tensors.
We saw OOM issues because of this problem in many cases including training and inference using torch.compile.
The fix is to use iterative function implementation to replace the recursive function implementation.
Fixes#155778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/cpp/nativert:c10_kernel_test
```
Differential Revision: D76825830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions the grid often start with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.
Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storate()`).
- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```
Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7
Rollback Plan:
Differential Revision: D74747190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python. This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.
Fixes#150988Fixes#153478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
In preparation of adding integer addmm, move matmul computation part into matmul_inner function
Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Summary: We noticed std::future_error: Broken promise errors in logging, so let's disable for now and will investigate more.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76929722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately.
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
Summary:
This implements staging in way that doesnt mess up checkpointing semantics. We want to be close to torch.save/load semantics and when async checkpointing is used it messes up shared storages, doesnt handle custom objects or tensors well. EG: users passes a state_dict with a cuda tensor in datatype. this is deepcloned causing the staging tensor to be created on GPU. This can cause ooms is hard to debug.
This diffs hooks into deepcopy of storages to move them to cpu using the cached storages created for async checkpoint staging. This allows reusing storages created for staging to avoid recreating them on each checkpoint while also being flexible enough to handle any changes - clean up old storages or create new ones as needed.
Lifetime of staging storages is tied to the original storage object. when the original storage object is gc-ed, we delete the corresponding staging storage from cache possibly causing it to gc-ed is there are no other references. I am using data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use fqn's instead of storage_id and verify the underlying storage object has same shape/size,etc to make the caching logic work. Current implementation is much simpler and cleaner.
The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
with staging_context(stager):
cpu_state_dict = copy.deepcopy(state_dict)
```
Also, adds support for pinned-memory.
One problem this implementation does not address is that we lose the original device.
The only alternatives here are - pickle synchronously like torch.save but with special handling for storages. It is valuable to keep state_dict throughout the checkpointing process. so users can manipulate and debug as needed. so we need to unpickle in the background process. I think this is flexible, not performant and not very different to current solution but needs more code. One idea if we really want to address is this to stick the original device in a some variable on storage and then use it recover on load side. I think we do not need this for now and can be explicit about losing device type for async checkpointing.
Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, nested dataclasses. deepcopy logic is relying on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor like _clear_non_serializable_cached_data(), or other logic. Would like thoughts on which logic or if everything should be copied?
3. If any object implemented deepcopy , we will not be able to handle any tensors in the attrs with this logic because they likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to workaround this.
The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
cpu_state_dict = copy.stage(state_dict)
```
Test Plan:
unit tests
Differential Revision: D75993324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
Context on torch.cuda.memory._record_memory_history buffer behavior
## Description
Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
Fixes#129674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim presistent reduce op
Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```
Rollback Plan:
Differential Revision: D76699002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
Summary:
### Diff Context
1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.
2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.
Test Plan: CI
Differential Revision: D75553096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to existing shape, but this doesn't really make sense because it prevents use from being able to represent any shape other than those with a list dim [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]` which doubles the last dimension. This is more in line with what we see in practice when people are using 4bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.
Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
# Context
MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.
# This diff
Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.
The benefits of this approach:
1. The previous solution has problem handling ops that have default dispatch key(e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.
----------------
p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time
Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
Summary:
When we encounter unbacked symint during autotuning, we try to reuse existing
symbols from user provided inputs, then fallback.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid
Rollback Plan:
Differential Revision: D76769711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.
Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.
Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```
Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455
USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```
Rollback Plan:
Differential Revision: D76520301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
Fixes#156309
Instead of any kind of locking and busy waits leaving room for multiple script downloads to happen, while only one `rename` will succeed and others will silently fail, removing any temporary files created during this process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet
Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was prefering AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.
Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.
This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).
Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
## Summary
Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.
Fixes#115992
## Details
`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
## Testing
The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76734378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.
Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.
What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.
Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2} and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.
Test Plan:
CI + dry run.
Rollback Plan:
Differential Revision: D76552340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198
There are two problems:
(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.
This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.
Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.
**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.
This is the tlparse that his DTensor test gnerated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4
Co-authored-by: drisspg <drisspguessous@gmail.com>
Summary:
# Why
speed up cutlass kernel generation and retrieval
# What
using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally
this is the OSS only part of the change, to facilitate integration
Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## prove that we can download errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## showing timing information
```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```
Reviewed By:
henrylhtsang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
There are a few considerations here:
1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404
2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.
3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.
4. There are two foot guns:
a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).
b. The following seuquence won't work as intended:
```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```
This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.
I think these two foot guns are probably okay given that this a bit of an experts' API.
Fixes#155106
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
Followup on https://github.com/pytorch/pytorch/pull/125971
`self.register_buffer` will always be a a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, lets check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overrridden, but it does mean we do less work!
Example demonstration:
```python
class Base:
def register_buffer(self, buffer):
pass
class InheritedOk(Base):
pass
class InheritedOverride(Base):
def register_buffer(self, buffer):
pass
b = Base()
ok = InheritedOk()
override = InheritedOverride()
print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False
print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```
(I can make an associated issue if needed, but didnt see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
Fixes#134840
Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:
- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive
These changes help prevent runtime errors when users attempt to use large padding values.
## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.
I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).
Update: The LLM failed to mention primitives, with which we should not pass references around, so we are only changing the signatures of std::optional<T> and stable::Tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to experess fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Summary: We run into an interesting case when we see so many mismatches while lot of mismatch turns out to be a fully match. The reason is that we use the dump ranks (which is from 0 to 79) to compare against the local pg ranks (0 to 7) this leads to false positive of mismatches. We can just check whether dump ranks contain all expected ranks or not, that should be sufficient.
Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.
Rollback Plan:
Differential Revision: D76775373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)
What do we expect to see when we run two identical matmul back to back? We expect to see the second one spending no time in precompilation, autotuning and prescreening.
However, the introduction of prescreening bring some non-deterministics-ness. Basically, we have
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.
With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.
`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2
Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1, rocSolver <3.26)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out Cutlass 3.x API and it's impressive (rated specs is 419 TFLOPS)
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64
Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
scale_A = torch.ones(M, 1).cuda()
scale_B = torch.ones(1, N).cuda()
out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
torch.testing.assert_close(out, out_ref)
latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
tflops = (2 * M * N * K) / latency_us / 1e9
print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS
Accodring to [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS
Small M slightly improves, but large M is still bad.
If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.
To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than current configs, but still very far away from hardware limit.
@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?
Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than default configs, but still far away from the config above for compute-bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.
The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file). This usually correct, ex test fails and pytest catches it -> lastrun = the test that failed, test segfaults (pytest doesn't catch) -> lastrun is the test that segfaulted. But sometimes pytest reports a success, but the process has non zero exit code. The two cases I know of are hangs and double freeing at exit. In this case, its unclear which test caused the failure, so lastrun is set to be the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why, I'm pretty sure there's no difference between resetting and not normally), so lastrun becomes the last test that ran, and its not always true that lastrun is the one that caused it. Then on the next run, it starts from the last test and the process now exits cleanly
Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path because initial value shouldn't change in that case
Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this
I think I have a PR that fixes the above but its old so I need to take another look
Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp` but `torch_cpu` were dependent on `aten/src/ATen/metallib_dummy.cpp`
Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
// C APIs with AtenTensorHandle. Under the hood we still call to these C shim
// APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
// around ownership. This makes it difficult to pass one input into 2
// different functions, e.g., doing something like c = a(t) + b(t) for
// stable::Tensor t. Thus, we use a shared_ptr here.
```
This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.
I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.
Fixes#130722
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.
To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`
Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on per-group basis.
Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```
After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
Currently, we do it a bit hacky: Look at all the .o we have from this session, add them all to AOTI. This for example doesn't work if we do multiple AOTI compilation in one session, without clearing the inductor cache.
Also I want to change how cutlass .so are compiled. Hence this change.
This change is broken down since @coconutruben is trying to make a change to the same files too.
Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
Rollback Plan:
Differential Revision: D76525939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.
A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.
Test Plan:
test/dynamo/test_pgo.py
Rollback Plan:
Differential Revisi,on: D76630499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:
1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile
One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.
I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
Fixes#128796
This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:
1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
- Masking before division
- Using MaskedTensor (experimental API)
The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.
This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
dditional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.
Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1); getitem_1 = None
copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1); copy_ = None
```
- AOTAutograd: functionalizes the copy_
After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.
While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check, I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that for Error nodes in the compiled case.
In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
Delays code generation for arguments to fallback ops. This is inspired by #155642, and likely fixes similar memory leaks.
Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:
1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile
One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.
I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In jit tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.
Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)
Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
Fixes#155048
The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:
New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
when there are multiple input elements with the same minimum or maximum value.`
cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
Fixes#155688
Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity `) and the inner value of the identity, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a python `int`. This is where to `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.
Fix:
Define `__int__` method for `torch.utils._sympy.functions.Identity`.
wlog for `float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
While looking into a case when FR dump (actual dump not monitoring thread) takes 30 mins, I realized that our global write lock is grabbed too early so the second effort to dump FR without stack trace will fail because of a deadlock because the global write lock is still hold. So we should only grab the lock when we are ready to write so that we are less likely to keep the lock forever. Also I did an audit to the lock within FR as well and found that there is one place we can shrink as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of wrong usages !! that i will fix in the next PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as CUDA::nvperf_host so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
Summary: #buildall
Test Plan: CI
Differential Revision: D74582970
When we decompose to inference IR, aten.to can sometimes disappear. As a result, export module call graph tree will start containing dead nodes because previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to submodule that user wants to preserve signature because we always desugar the tensor subclass into constituent tensors in inference IR making it impossible to preserve the original calling convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -382,3 +382,7 @@ cmake==4.0.0
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"#removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
# WAR to resolve the ld error in libtorch build with CUDA 12.9
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.