Summary:
Dump the active proxyOp status per rank and per communicator when the watchdog times out or aborts.
Added an `#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in the console.
This PR contains the PTD-side changes.
Test Plan:
Job with an A2A hang due to the receiver failing to post receive operations: https://fburl.com/mlhub/95vg12r3
{F1971449692}
Reviewed By: c-p-i-o
Differential Revision: D67036093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation, which can cause significant accuracy issues.
This diff upcasts the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`.
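For context, a minimal accuracy check in the spirit of the new test might look like the sketch below; the function, shape, and tolerances are illustrative rather than taken from the actual test, and it assumes a CUDA device so the Triton codegen path is exercised.
```python
import torch

def fn(x):
    return x.sum()

x = torch.randn(2**20, dtype=torch.bfloat16, device="cuda")

# Reference computed with an fp32 accumulator; with the upcast in place,
# the compiled bf16 reduction should stay close to it.
ref = fn(x.to(torch.float32)).to(torch.bfloat16)

compiled = torch.compile(fn)
out = compiled(x)
torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)
```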
Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```
Differential Revision: D65965032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
Summary:
Fix https://github.com/pytorch/pytorch/issues/142035 and https://github.com/pytorch/pytorch/issues/143621
When a Linear module's parameters are tied to another parameter, like this:
```
class SimpleLinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLinearModel, self).__init__()
        # Define a linear layer
        self.linear = nn.Linear(input_size, output_size)
        self.tied_weight = self.linear.weight

    def forward(self, x):
        # Forward pass through the linear layer
        b = self.tied_weight + 1
        return self.linear(x), b
```
We get a graph like below:
```
graph():
    %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight]
    %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight]
    %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias]
    %x : [num_users=1] = placeholder[target=x]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {})
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {})
    return (linear, add)
```
Notice that ` %p_linear_weight : [num_users=2]`.
When we get source partitions, we should exclude attribute nodes like `p_linear_weight` from the outputs.
A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035.
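For illustration, a hedged sketch of how the partitioner can be exercised on the model above; the exact assertions in `test_module_partitioner_weight_tied` may differ.
```python
import torch
import torch.nn as nn
from torch.export import export
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions

model = SimpleLinearModel(4, 2)
ep = export(model, (torch.randn(1, 4),))
partitions = get_source_partitions(ep.graph, [nn.Linear])

# With the fix, the tied-weight placeholder is no longer reported as a
# partition output; only the linear result should appear there.
[linear_partition] = partitions[nn.Linear]
print(linear_partition.output_nodes)
```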
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied
```
Differential Revision: D66998592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446
Approved by: https://github.com/angelayi
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.
Replaces https://github.com/ROCm/pytorch/pull/1592
This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`; similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, leaving this option unset results in aotriton being used as the backend. When CK is selected and PyTorch deems flash attention usable, the CK path is used in all the same places aotriton would have been used.
This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). The CK path only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.
Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is credited as a co-author.
NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
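As a usage sketch (assuming a ROCm build with USE_CK_FLASH_ATTENTION=1 and the backend-selection API named above; the tensor shapes are illustrative):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Force the CK flash-attention path; "aotriton" or "default" restores the incumbent backend.
torch.backends.cuda.preferred_rocm_fa_library("ck")

q, k, v = (torch.randn(2, 8, 128, 64, dtype=torch.float16, device="cuda") for _ in range(3))
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```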
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
As titled, this PR exposes this dunder method as a public API in the docs, so that different checkpoint implementations can leverage this protocol instead of exposing a separate API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
# Issue
This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.
# Fix
Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.
Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`.
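As a toy illustration of the bookkeeping described above (plain Python, not inductor internals):
```python
# Global iteration counts for a reduction kernel: 8 pointwise ("x") iterations
# and 8 reduction ("r0_") iterations, for 64 iterations in total.
global_numel = 64
partial_tiling = {"x": 8}  # only the pointwise extent is known

covered = 1
for size in partial_tiling.values():
    covered *= size

# Solve for the missing extent by dividing the global numel by what the
# partial tiling already covers: 64 // 8 == 8.
missing = global_numel // covered
full_tiling = {**partial_tiling, "r0_": missing}
print(full_tiling)  # {'x': 8, 'r0_': 8}
```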
Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.
# Test plan
This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).
This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.
```
// in AOTI codegen
kernels.cuda_fused_0(
(const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
(int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
(size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
When calling a fallback op in cpp_wrapper mode where any of the inputs are complex numbers, utilize the runtime-dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.
This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.
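A hedged repro sketch of the kind of case this path covers; the op mix is illustrative, and `cpp_wrapper` is assumed to be toggled through the standard `torch.compile` inductor options.
```python
import torch

@torch.compile(options={"cpp_wrapper": True})
def fn(x, y):
    # The conjugate view puts the Conjugate dispatch key on the fallback's input.
    return torch.conj(x) @ y

x = torch.randn(4, 4, dtype=torch.complex64)
y = torch.randn(4, 4, dtype=torch.complex64)
print(fn(x, y))
```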
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.
Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats. More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
Fixes #105203 and is a follow-up PR to #141833
When `in_order` is True (the default), tasks are handed out to workers in a round-robin fashion. When `in_order` is False this is no longer needed: we give up reproducibility guarantees, and tasks should instead go to whichever workers are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to its queue and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False the task is only added to a worker's queue if that worker has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
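A minimal sketch of the behaviour in question; the dataset and sleep are illustrative, and `in_order` is the DataLoader flag discussed above.
```python
import time
from torch.utils.data import DataLoader, Dataset

class SlowFirstItem(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        if idx == 0:
            time.sleep(1.0)  # a single straggler task
        return idx

if __name__ == "__main__":
    # With in_order=False, a worker that finishes early keeps receiving new
    # tasks (bounded by its outstanding-task count) instead of waiting for the
    # round-robin hand-out, so later items can arrive before item 0.
    loader = DataLoader(SlowFirstItem(), num_workers=2, in_order=False)
    print([int(batch) for batch in loader])
```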
Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho