Summary:
Add an optional user-passed `alpha` argument to
`at::cuda::blas::scaled_gemm`, necessary for two-level-scaled NVFP4 gemm
calls (where the global de-scales are folded into the `alpha` argument).
Global de-scales are naturally device tensors, but using cuBLAS'
device-pointer mode for `alpha`/`beta` has an interesting lifetime
implication: the `alpha` tensor must remain valid and correct until the end
of the matmul call, *not* just until the launch (as for host values). To
enable this, I added device-constant memory for `one` and `zero`, along
with a statically held single-fp32-value tensor, which is valid from the
first passed-`alpha` invocation of `scaled_gemm` to the end of the
program. User-passed values are copied into this perpetual buffer to
ensure the lifetime requirements are met.
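As a rough Python analogy to the lifetime handling described above (the real implementation lives in C++ inside `at::cuda::blas::scaled_gemm`; `_persistent_alpha` and `_alpha_buffer` below are purely illustrative names): a user-passed `alpha` is copied into a single, statically held one-element fp32 CUDA tensor so the device pointer handed to cuBLAS stays valid until the asynchronous matmul finishes, not just until the launch returns.
```python
import torch

# Illustrative sketch only: the real persistent buffer is a static C++ tensor.
_alpha_buffer = None  # process-lifetime, one-element fp32 CUDA tensor

def _persistent_alpha(user_alpha: torch.Tensor) -> torch.Tensor:
    """Copy a user-supplied global de-scale into a long-lived device buffer."""
    global _alpha_buffer
    if _alpha_buffer is None:
        _alpha_buffer = torch.zeros(1, dtype=torch.float32, device="cuda")
    # The copy is stream-ordered, so it lands before any gemm enqueued later
    # on the same stream reads the device pointer.
    _alpha_buffer.copy_(user_alpha.reshape(1).to(torch.float32))
    return _alpha_buffer
```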
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165563
Approved by: https://github.com/drisspg, https://github.com/eqy
Replace (more) exact calculation with hardware approximation.
Benefits:
Reduced code size.
Improved performance for certain scenarios.
Experiments show only a small reduction in precision.
Experiments show no significant performance regressions; bfloat16- and float16-related calculations may benefit the most from this change.
Co-authored-by: @mhalk @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165589
Approved by: https://github.com/jeffdaily
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Differential Revision: D84868162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165700
Approved by: https://github.com/Skylion007
Fixes #158232
The autocast caching heuristic in `aten/src/ATen/autocast_mode.cpp:139` did not account for gradient mode state when deciding whether to cache. FSDP2 is not directly related.
~~This PR adds a `GradMode::is_enabled()` check to the caching condition. Caching is now disabled in `no_grad()` contexts to prevent storing tensors with incorrect gradient state. This ensures correctness at the cost of not using the cache.~~
This PR proposes separate caches for gradient-enabled and gradient-disabled modes.
Adds tests.
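A minimal sketch of the failure mode the separate caches address (an assumed repro shape, not the exact test added by this PR):
```python
import torch

# A weight cast cached under no_grad() must not be reused when gradients are
# enabled (and vice versa); with per-grad-mode caches, backward works here.
lin = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):
    with torch.no_grad():
        lin(x)    # may populate the autocast weight-cast cache without autograd tracking
    out = lin(x)  # grad-enabled call must get a cast that is tracked by autograd

out.float().sum().backward()
assert lin.weight.grad is not None
```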
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165068
Approved by: https://github.com/ngimel, https://github.com/janeyx99
Add mx fp4 support in Blas.cpp.
Updated the scale_kernel_dispatch array and ScaledGemmImplementation enum to include MXFP4 support.
Modify the tests under test_scaled_matmul_cuda accordingly.
PYTORCH_TEST_WITH_ROCM=1 python test/test_scaled_matmul_cuda.py -v -k test_blockwise
115 tests passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165528
Approved by: https://github.com/jeffdaily
Adding bf16 for the backward pass of `torch._fake_quantize_learnable_per_tensor_affine()`.
Note that for testing, we modified the seed to avoid increasing tolerance due to cases where difference in Python vs CPP downcasting causes tensor mismatches. (e.g. 27.87704 vs 27.8408 before downcasting, 27.7500 vs 27.8750 after downcasting for Python vs CPP op)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165362
Approved by: https://github.com/andrewor14
Commit f4d8bc46c7706f872abcb4ec41f0b32207d5d826 added TF32 support for x86 CPUs,
which causes build failures on PowerPC systems with mkldnn.
This patch disables TF32 paths on PowerPC while keeping x86 TF32 support intact,
allowing PyTorch to build successfully on PowerPC.
I have run the mkldnn test case on PowerPC, and it passed successfully.
```
pytest test/test_mkldnn.py
87 passed, 2 skipped in 1709.02s (0:28:29)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163454
Approved by: https://github.com/jgong5, https://github.com/malfet
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup.
The changes are from Yu-Yun <YuYun.Chang@amd.com>.
# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X
- The original "scatter-add" approach
- Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
- Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
- Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`).
- Each thread processed one output pixel.
- The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`).
- Each thread is responsible for calculating the final gradient for a single input pixel.
- The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function for calculating the range of output pixels that could have used the input pixel (`input_pos`) during the forward-pass interpolation (see the Python sketch after this list)
- This is essentially the mathematical inverse of the forward pass.
- This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation approach switching from "scatter-add" to "gather-sum"
- Scatter-add
- For each output pixel, the thread calculated 4 gradient contributions and used `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
- Gather-sum
- A thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value.
- The thread iterates through this region, and for each output pixel in the region, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
- All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`).
- Without any global memory access, this accumulation is extremely fast.
- When the loops are done, the thread performs a single, direct write (non-atomic) of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
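A minimal Python sketch of the inverse mapping behind `compute_output_range`, assuming the usual bilinear source-coordinate formulas (the actual device function may differ in details such as border clamping). Slightly over-inclusive bounds are harmless because the gather loop recomputes the interpolation weights for every candidate output pixel.
```python
import math

def compute_output_range(p, in_size, out_size, align_corners=False):
    """Range of output indices whose bilinear sample can touch input pixel p."""
    if out_size <= 1 or in_size <= 1:
        return 0, out_size - 1
    if align_corners:
        scale = (in_size - 1) / (out_size - 1)
        # source coordinate of output index o is x = o * scale
        lo, hi = (p - 1) / scale, (p + 1) / scale
    else:
        scale = in_size / out_size
        # source coordinate of output index o is x = (o + 0.5) * scale - 0.5
        lo, hi = (p - 0.5) / scale - 0.5, (p + 1.5) / scale - 0.5
    # The forward pass reads input pixels floor(x) and floor(x) + 1, so input
    # pixel p is touched only by outputs whose x lies in [p - 1, p + 1).
    return max(0, math.ceil(lo)), min(out_size - 1, math.floor(hi))
```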
# Why performance gets boosted
- Analysis of the root cause of performance drop
- Ref. (internal only) - https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
- Many parallel threads frequently called `atomicAdd`, attempting to update the exact same memory location in the input gradient tensor at the same time.
- The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallel execution at those contention points.
- The chiplet-based CDNA 3 architecture of MI300X/MI325X amplified the issue.
- When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
- The implementation change eliminates hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd`.
- Improved memory access pattern and locality
- Write coalescing
- The regular sum writes `idata[index] = static_cast<scalar_t>(grad_sum);` can be perfectly coalesced by GPUs.
- Read locality
- Even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
- The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X.
- Removal of atomic operations avoids expensive memory synchronization.
---
Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164572
Approved by: https://github.com/jeffdaily
Summary:
* Add `torch._scaled_grouped_mm_v2` with more functionality and
extensibility for future formats
* Add `torch.nn.functional.scaled_grouped_mm` as public entrypoint
* Test both original and v2 functionality
Test Plan:
```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165154
Approved by: https://github.com/drisspg, https://github.com/danielvegamyhre
Fixes #165110
The `PUBLIC` scope causes FBGEMM's CUTLASS to be included for all PyTorch targets, including the special matmuls (RowwiseScaledMM, ScaledGroupMM and GroupMM). Due to the version mismatch between FBGEMM/CUTLASS and PyTorch/CUTLASS it is unacceptable to use FBGEMM/CUTLASS in PyTorch targets. This PR limits the scope of FBGEMM/CUTLASS to the `fbgemm_genai` target only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165424
Approved by: https://github.com/cthi, https://github.com/eqy, https://github.com/danielvegamyhre
https://github.com/pytorch/pytorch/pull/164790 modifies ATen to perform a different intra-warp reduction order. However, this change exposed a large difference in a sum for complex32, namely the case:
```
import torch
a = torch.tensor([[ 4.82031250+7.34765625j,  -3.37109375-1.9501953125j],
                  [ 3.7832031250-2.43359375j, -6.07812500+5.32812500j]],
                 dtype=torch.complex32, device='cuda:0')
sum_out = torch.sum(a)
nansum_out = torch.nansum(a)
torch.testing.assert_close(
sum_out,
nansum_out,
rtol=0,
atol=0,
)
```
Here, the results of `sum` and `nansum` differed significantly, by about 1e-2. Further investigation showed that the explicit cast of `b` back to `arg_t` from `scalar_t` was the root cause. `arg_t` is the dtype of the accumulator, ComplexFloat, and `scalar_t` the input dtype, ComplexHalf. When we cast to the accumulator dtype inside the reduction, the value has already been stored as ComplexHalf, which loses precision because it cannot hold the intermediate values exactly.
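A small CPU-side float16 analogy of the dtype effect described above (not the actual CUDA reduction; the values are the real/imaginary parts from the repro): keeping every intermediate partial sum in the low-precision input dtype drifts away from accumulating in float32 and casting once at the end.
```python
import torch

vals = [4.82031250, -3.37109375, 3.7832031250, -6.07812500,
        7.34765625, -1.9501953125, -2.43359375, 5.32812500]

# Low-precision intermediates: each partial sum is squeezed back into float16.
acc = torch.tensor(0.0, dtype=torch.float16)
for v in vals:
    acc = (acc.float() + v).half()

# Accumulate in float32 and cast a single time at the end.
ref = torch.tensor(vals, dtype=torch.float32).sum().half()

print(acc.item(), ref.item())  # the running-half accumulator differs in the low bits
```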
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165494
Approved by: https://github.com/ngimel
reproducer
```
import torch
# does not crash
a = torch.rand((0), device="cpu")
b = torch.rand((0), device="cpu")
a.dot(b)
# crashes due to internal assert
a = torch.rand((0), device="mps")
b = torch.rand((0), device="mps")
a.dot(b)
```
Discovered while implementing an op for the SparseMPS backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165237
Approved by: https://github.com/malfet
Fixes #160752
# Background:
`torch.func.jacfwd` is implemented as vmap over forward-mode JVP. With torch.compile(dynamic=True), FakeTensor + SymInt shape reasoning is used while tracing through the transform. The old vmap rule for one_hot decomposed into “zeros_symint + scatter,” which interacted poorly with the transform stack and dynamic shapes, leading to failures mid-trace. Using a functional equality construction makes one_hot composable with vmap/JVP and friendly to dynamic shape tracing.
# Changes:
- functorch vmap batching rule for `aten::one_hot` now uses a purely functional formulation:
- Replace “zeros + scatter” with `eq(self.unsqueeze(-1), arange(num_classes)).to(kLong)` under FuncTorchBatched (see the sketch after this list).
- The one_hot native path remains unchanged for regular eager; the vmap transform no longer relies on scatter, which was fragile under dynamic shape tracing.
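A standalone Python sketch of the functional formulation (the actual batching rule is implemented in C++; this just mirrors the eq-against-arange construction):
```python
import torch
import torch.nn.functional as F

def one_hot_functional(idxs: torch.Tensor, num_classes: int) -> torch.Tensor:
    # eq against arange instead of zeros + scatter: purely functional, so it
    # composes with vmap/JVP and with dynamic-shape tracing
    classes = torch.arange(num_classes, device=idxs.device)
    return (idxs.unsqueeze(-1) == classes).to(torch.int64)

idxs = torch.randint(5, (7,))
assert torch.equal(one_hot_functional(idxs, 5), F.one_hot(idxs, 5))
```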
The minimal repro from the issue is now fixed:
```python
import torch
import torch.nn.functional as F
MAX, BATCH = 3, 37
def func(x, idxs):
return x.square() * F.one_hot(idxs, MAX)
def jacfunc(x, idxs):
return torch.func.jacfwd(func, argnums=0)(x, idxs)
idxs = torch.randint(MAX, (BATCH,), dtype=torch.int64)
x = torch.rand((BATCH, MAX), dtype=torch.float64)
# eager
out_eager = jacfunc(x, idxs)
# compiled dynamic
jacfunc_c = torch.compile(jacfunc, dynamic=True)
out_comp = jacfunc_c(x, idxs)
torch.testing.assert_close(out_eager, out_comp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160837
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
Summary:
The `nDims` variable is mutated inside the loop but never restored to its original value.
This affects subsequent iterations of the outer loop: each batch iteration after the first may see an incorrect `nDims`.
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D84612194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165446
Approved by: https://github.com/ngimel
Summary: Add note mentioning which scaling type pairs are supported in Inductor ATen, since this was a source of confusion and also informs which scaling strategies we choose to support for other backends, like Triton.
Test Plan: n/a
Reviewed By: lw
Differential Revision: D84522373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165450
Approved by: https://github.com/NikhilAPatel
Skip test_compiled_autograd_attribution on s390x
It fails on both s390x and x86_64, at least under some circumstances. Disable it on s390x for now until it works reliably.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163647
Approved by: https://github.com/malfet
Follow up to #165098 - adding bf16 support for the backward pass. To avoid BC breaking changes/losing precision, we upcast the parameters to fp32 after the op gets called, and downcast the gradients to bf16 before returning.
For testing, we upcast to fp32 before calling the reference function. We increase the tolerance to 1e-2 for bf16 inputs because of a difference in casting calculations between Python's `x.to(torch.bfloat16)` and C++'s `x.to(at::kBFloat16)` (after comparing intermediate tensors, we found that the numerics diverge only after the final cast). We don't explicitly cast in the C++ op but rather let autograd/the optimizer handle it.
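A hedged sketch of the comparison pattern described above (illustrative only: shapes, quant range, and tolerance are assumptions, and it presumes bf16 support for this op on the device in use, which is what this PR and #165098 add):
```python
import torch

# fp32 reference inputs and their bf16 copies
x32 = torch.randn(32, requires_grad=True)
s32 = torch.tensor([0.1], requires_grad=True)
z32 = torch.tensor([0.0], requires_grad=True)
x16 = x32.detach().to(torch.bfloat16).requires_grad_()
s16 = s32.detach().to(torch.bfloat16).requires_grad_()
z16 = z32.detach().to(torch.bfloat16).requires_grad_()

for x, s, z in ((x32, s32, z32), (x16, s16, z16)):
    out = torch._fake_quantize_learnable_per_tensor_affine(x, s, z, 0, 255, 1.0)
    out.sum().backward()

# relaxed tolerance for bf16, as described above
torch.testing.assert_close(x16.grad.float(), x32.grad, rtol=1e-2, atol=1e-2)
```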
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165325
Approved by: https://github.com/andrewor14
Summary:
### Problem
ArrayRef's `equals()` does elementwise equality using the `==` operator. This can cause a DDE for unbacked symints, since the `==` operator calls `guard_bool`.
```
// SymInt.h
bool operator==(const SymInt& o) const {
return sym_eq(o).guard_bool(__FILE__, __LINE__);
}
```
### Solution
Adds `sym_equals()` to do elementwise equality for `SymIntArrayRef`. Use this instead of `equals()` for `SymIntArrayRef`.
Reviewed By: guangy10, pianpwk, muchulee8
Differential Revision: D84168401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165112
Approved by: https://github.com/Skylion007
This function is relatively hot; inlining here reduces time reported by `python -m timeit --setup 'import torch; t = torch.tensor([1])' 't._cdata'` from about 125 nsec/loop to about 110 nsec/loop. (To be fair, variance is high, but I did confirm with perf that time in this path seems to have roughly halved during torchtitan training.)
Note that locally I am getting bitten by a GCC bug that I documented in a comment. I would be interested to hear whether this does anything for clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164617
Approved by: https://github.com/ezyang