This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'
## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files
## Files Modified
This PR updates files that contained the target text.
Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
In the current implementation of reductions in three dimensions for AMD GPUs the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors. This of course is bad for performance. This patch fixes this issue by increasing the parallelism and thus lowering the number of value per thread to reasonable limits i.e. less than 2048 values per thread. The performance gains can be between 10x-17x for certain examples where the number of values per thread was originally very high.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
* Opportunistic fast atomics works better with small sizes, since there is more chance of lanes doing atomics on the same address
Co-author: @amd-hhashemi
Reproducer:
```
import time
import torch
x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)
for _ in range(20):
x.index_add_(0, ind, src)
start_time = time.time()
for i in range(100):
x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```
Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us
After:
Avg time for index_add_: 2675.15 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
# Description
`.coalesce` cannot handle large inputs on ROCM due to maximal grid size limit.
This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation.
Confirmed the new approach can handle large inputs. Correctness needs validation.
# Testing Command
`python torch_spmv.py 22500000 272500000`
## Script `torch_spmv.py`
``` python
import torch
import argparse
def parse_args():
parser = argparse.ArgumentParser(
description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
)
parser.add_argument("n", type=int, help="Size of the NxN matrix")
parser.add_argument("nnz", type=int, help="Number of non-zero entries")
return parser.parse_args()
def main():
args = parse_args()
n = args.n
nnz = args.nnz
dtype = torch.float32
device = torch.device('cuda')
# Generate random indices for the sparse matrix in COO format.
torch.manual_seed(42)
rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
indices = torch.stack([rows, cols], dim=0)
# Generate random values.
values = torch.randn(nnz, dtype=torch.float32, device=device)
# Create the sparse COO matrix and move it to the target device.
sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
sparse_matrix = sparse_matrix.coalesce()
# Generate a random dense vector.
dense_vector = torch.randn(n, dtype=torch.float32, device=device)
# Perform sparse matrix - dense vector multiplication.
# Using torch.sparse.mm which expects a 2D tensor for the vector.
result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
# result = torch.mv(sparse_matrix, dense_vector)
# Print the result.
print("Result of the multiplication:")
print(torch.sum(result))
if __name__ == "__main__":
main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods.
Test Plan:
`torch._C._get_accelerator()`
Rollback Plan:
Differential Revision: D79981115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304
Approved by: https://github.com/nautsimon
By adding `addmm` kernel, which is a logical continuation of `mm` one. The only tricking part are how alpha and beta constants are handled, which are passed as `optmath_t`, i.e. that it could be, int64, int32 or float
Unified all MM flavors instantiations thru `INSTANTIATE_MM_OPS` and tested that `addmm` metal kernel works as expected for floating types as well by testing it via
```
PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```
Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
Refactors how the enablement/disablement of CK Gemms and SDPA works.
- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849
**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads
**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/
Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6
**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:
Differential Revision: D79042656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
# Moativation
This PR is used to enable _int_mm on Intel GPU. And _int_mm is used by int8 quantization on torchao.
# Model Test Result:
We run meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configs as below:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
**Summary**
`_weight_int8pack_mm` on CPU may cause segmentation fault if output shape is large (i.e., M * N is large). It's because the kernel compute output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
fixes typo in word `enought` to correct `enough` at 3 places in these files
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.
~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.
FIXES https://github.com/pytorch/pytorch/issues/159097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`
With bf16 autocast, bias converted into BFloat16, but fp8_qlinear_onednn_ref not support bf16 bias.
In this pr, convert bias into bf16 on fp8_qlinear_onednn_ref.
Add this case into ut and reproduce:
`python test/test_quantization.py -k test_qlinear_fp8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE