pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Aaron Gokaslan	d3c2123ea6	[BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686 ) * This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still better set the CMake variable properly. * Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazels. * Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time. * These kernels and settings will be needed for Flash Attention 3 whenever we add that too. Fixes #133695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686 Approved by: https://github.com/ezyang	2024-09-28 21:11:15 +00:00
drisspg	fcf2a1378b	Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989 ) # Summary First PR got reverted and needed a redo. This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use: `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989 Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo	2024-06-19 04:49:39 +00:00
PyTorch MergeBot	a5b86a1ec0	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))	2024-06-06 16:12:34 +00:00
drisspg	5dc9128229	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use: `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw, https://github.com/malfet	2024-06-05 15:46:40 +00:00
PyTorch MergeBot	d05cddfe23	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))	2024-06-03 15:00:21 +00:00
drisspg	923edef31c	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use: `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw	2024-05-31 20:09:08 +00:00
Aaron Gokaslan	dfba65be8b	Update Cutlass to v3.1 (#94188 ) Now that we are on CUDA 11+ exclusively, we can update Nvidia's Cutlass to the next version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94188 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/malfet	2023-04-25 22:02:42 +00:00
Christian Puhrsch	f6c6048b10	Use CUTLASS GEMM for NT bmm (#85894 ) Copy of https://github.com/pytorch/pytorch/pull/85710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894 Approved by: https://github.com/drisspg	2022-10-18 23:11:47 +00:00
PyTorch MergeBot	d169f950da	Revert "Use CUTLASS GEMM for NT bmm [OSS-only] (#85894 )" This reverts commit ef58a132f223d5abf2bd3f8bee380aca6c29d17f. Reverted https://github.com/pytorch/pytorch/pull/85894 on behalf of https://github.com/DanilBaibak due to Break internal build	2022-10-13 15:28:09 +00:00
Christian Puhrsch	ef58a132f2	Use CUTLASS GEMM for NT bmm [OSS-only] (#85894 ) OSS-only copy of https://github.com/pytorch/pytorch/pull/85710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894 Approved by: https://github.com/drisspg	2022-10-12 20:03:28 +00:00

10 Commits
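
Note on the FP8 rowwise scaling commits listed above: their messages say the rowwise kernel is selected from the shapes of the scale tensors (length `M` for `x`, length `N` for `y`). The following is a minimal pure-Python sketch of that selection rule and of what per-row scaling computes. The helper names `selects_rowwise` and `rowwise_scaled_mm` are hypothetical and are not PyTorch APIs, and the sketch assumes the scales act as multipliers on the matmul result; it is a reference for the arithmetic, not the CUTLASS kernel.

```python
def selects_rowwise(x_shape, y_shape, x_scale_shape, y_scale_shape):
    """Hypothetical check mirroring the condition in the commit message:
    x is [M, K], y is [K, N]; rowwise scaling applies when x's scale is
    1-D of length M and y's scale is 1-D of length N."""
    m, k = x_shape
    k2, n = y_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    return tuple(x_scale_shape) == (m,) and tuple(y_scale_shape) == (n,)


def rowwise_scaled_mm(x, y, x_scale, y_scale):
    """Reference arithmetic (assumption: scales are multipliers):
    out[i][j] = x_scale[i] * y_scale[j] * sum_p x[i][p] * y[p][j]."""
    m, k, n = len(x), len(y), len(y[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(x[i][p] * y[p][j] for p in range(k))
            out[i][j] = acc * x_scale[i] * y_scale[j]
    return out
```

With scalar (0-dimensional) scales, `selects_rowwise` returns `False`, which corresponds to the case where the rowwise kernel would not be chosen under the rule described in the commit message.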