vllm-ascend

Mirror of https://github.com/vllm-project/vllm-ascend.git, synced 2025-10-20 05:33:51 +08:00

Files

Chen Chen bcc313e8f2 add mla_preprocess kernel (#3226)

### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>

2025-10-12 07:39:45 +08:00

kernels

[Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672)

2025-09-02 11:46:59 +08:00

mla_preprocess

add mla_preprocess kernel (#3226)

2025-10-12 07:39:45 +08:00

camem_allocator.cpp

Add sleep mode feature for Ascend NPU (#513)

2025-04-18 13:11:39 +08:00

ops.h

add mla_preprocess kernel (#3226)