179 Commits

SHA1 Message Date
6e588da0f4 [Build/CI] Fix CUDA 11.8 build (#17679)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-22 12:13:54 -07:00
c7852a6d9b [Build] Allow shipping PTX on a per-file basis (#18155)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-15 16:41:55 -07:00
60f7624334 Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844) 2025-05-12 19:52:47 -07:00
0c0fdae84f [Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362) 2025-05-09 16:24:41 -07:00
376786fac1 Add cutlass support for blackwell fp8 blockwise gemm (#14383)
Signed-off-by: Shu Wang <shuw@nvidia.com>
2025-05-08 15:09:55 -07:00
1d0c9d6b2d [Kernel] some optimizations for dense marlin and moe marlin (#16850)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-05-05 09:39:30 -07:00
f62cad6431 [Build/CI] Upgrade CUTLASS to 3.9.2 (#17641)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-05-04 19:23:17 -07:00
c8386fa61d [Build/CI] Upgrade CUTLASS to 3.9.1 (#17602)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-05-02 22:25:14 -07:00
3e887d2e0c permute/unpermute kernel for moe optimization (#14568)
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
2025-05-02 11:31:55 -07:00
460a2b1100 [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations (#10867)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-05-01 07:59:28 -07:00
2c4f59afc3 Update PyTorch to 2.7.0 (#16859) 2025-04-29 19:08:04 -07:00
ed7a29d9f8 [NVIDIA] Support Cutlass MLA for Blackwell GPUs (#16032)
Signed-off-by: kaixih <kaixih@nvidia.com>
2025-04-27 06:29:21 -07:00
188b7f9b8c [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830)
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
2025-04-21 20:46:22 -07:00
d06ba4ed3f [Kernel] moe wna16 marlin kernel (#14447)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-04-14 20:05:22 -07:00
e9528f6dc6 [Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173)
Signed-off-by: DefTruth <qiustudent_r@163.com>
2025-04-11 06:50:50 -06:00
b7b7676d67 [Distributed] Add custom allreduce support for ROCM (#14125)
Signed-off-by: ilmarkov <imarkov@redhat.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
2025-03-31 22:49:12 -07:00
e6e3c55ef2 Move dockerfiles into their own directory (#14549)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-31 13:47:32 -07:00
555aa21905 [V1] Fully Transparent Implementation of CPU Offloading (#15354)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-31 20:22:34 +08:00
c802f5430d [ROCm][AMD][Build] Update AMD supported arch list (#15632)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-03-28 20:39:18 -07:00
9239bf718e [Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
2025-03-27 00:54:44 +00:00
14f301b541 Update to torch==2.6.0 (#12721)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-14 16:58:30 -04:00
977a16772c [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 (#14430)
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
2025-03-14 09:55:14 -07:00
916836bbfb [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. (#14664)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-03-12 09:31:19 -07:00
45f3f3f59e [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. (#14629)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-12 08:00:28 -04:00
07b4b7a37f [BugFix/Build] Fix sparse kernels not getting built on hopper (#14572)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-11 17:09:03 +00:00
90e88ab756 [Kernel] moe wna16 cuda kernel (#13321)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-10 20:12:40 -04:00
7caff01a7b [Build/BugFix] Fix hopper 12.8 build (#14354)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-08 08:11:56 +00:00
e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 12:25:24 +08:00
f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) 2025-03-04 07:55:07 -08:00
6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) 2025-02-28 22:30:59 -08:00
f95903909f [Kernel] FlashMLA integration (#13747)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-02-27 10:35:08 +08:00
094b7d9496 [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) 2025-02-25 18:52:03 -08:00
aabeb2688f [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) 2025-02-25 00:39:59 -08:00
e109e598c7 [NVIDIA] Support nvfp4 cutlass gemm (#13571) 2025-02-22 05:24:05 -08:00
0023cd2b9d [ROCm] MI300A compile targets deprecation (#13560) 2025-02-19 23:05:00 -08:00
c1e37bf71b [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00
4fc5c23bb6 [NVIDIA] Support nvfp4 quantization (#12784) 2025-02-12 19:51:51 -08:00
da317197dd [Build] Fix cuda link target of cumem_allocator in CPU env (#12863)
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-11 21:55:57 +08:00
ef533d25fb [Bugfix] FA2 illegal memory access (#12848) 2025-02-06 19:54:07 -08:00
9798b2fb00 [Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) 2025-01-30 18:33:00 -08:00
73aa6cfdf7 Revert "[Build/CI] Fix libcuda.so linkage" (#12552) 2025-01-29 21:12:24 +00:00
103bd17ac5 [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-27 10:40:00 -05:00
72bac73067 [Build/CI] Fix libcuda.so linkage (#12424)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-26 21:18:19 +00:00
68f11149d8 [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-26 11:09:34 -08:00
3132a933b6 [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 20:20:59 +00:00
ab5bbf5ae3 [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 15:27:59 +00:00
978b45f399 [Kernel] Flash Attention 3 Support (#12093)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-23 06:45:48 -08:00
68ad4e3a8d [Core] Support fully transparent sleep mode (#11743)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:39:32 +08:00
73001445fb [V1] Implement Cascade Attention (#11635)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
970d6d0776 [Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-30 17:22:13 +08:00