[bugfix][torchair] fix missing weight nz cast for w13_weight in torchair_w8a8_dynamic.py (#3446)

### What this PR does / why we need it?
Fix the missing NZ format conversion of the quantized w13_weight used by
the GMM that runs after the moe_dispatch operator in the torchair
scenario. The aclgraph and single scenarios are not involved.
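
The intent of the fix can be sketched without NPU hardware. This is a minimal illustration, not the code in torchair_w8a8_dynamic.py: `fake_format_cast_`, `cast_moe_weights`, `make_layer`, and the string constant `FRACTAL_NZ` are hypothetical stand-ins for `torch_npu.npu_format_cast_`, the patched branch, a MoE layer, and `ACL_FORMAT_FRACTAL_NZ`. It only shows that both expert weights should be cast when NZ is enabled.

```python
from types import SimpleNamespace

# Stand-in for ACL_FORMAT_FRACTAL_NZ (assumption: the real constant is a format id).
FRACTAL_NZ = "FRACTAL_NZ"


def fake_format_cast_(weight, fmt):
    """Hypothetical stand-in for torch_npu.npu_format_cast_: tags the weight in place."""
    weight.fmt = fmt
    return weight


def cast_moe_weights(layer, enable_nz, cast_fn=fake_format_cast_):
    """Cast both expert weights when NZ is enabled, mirroring the patched branch."""
    if enable_nz:
        # Before the fix only w2_weight was cast; w13_weight was skipped.
        for name in ("w13_weight", "w2_weight"):
            cast_fn(getattr(layer, name), FRACTAL_NZ)
    return layer


def make_layer():
    """Build a fake layer whose weights start in a plain 'ND' layout."""
    return SimpleNamespace(
        w13_weight=SimpleNamespace(fmt="ND"),
        w2_weight=SimpleNamespace(fmt="ND"),
    )


layer = cast_moe_weights(make_layer(), enable_nz=True)
print(layer.w13_weight.fmt, layer.w2_weight.fmt)
```

Passing the cast function as a parameter keeps the sketch runnable on a host without torch_npu; on an NPU the real in-place cast would take its place.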

### How was this patch tested?
vLLM serving passed with lower latency (~5 ms TPOT with bs_per_rank=28
and ep_size=32).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
linfeng-yuan
2025-10-14 21:11:05 +08:00
committed by GitHub
parent 5fe883fa43
commit c55d99d13e

@@ -1052,6 +1052,7 @@ class TorchairAscendW8A8DynamicFusedMoEMethod:
             layer.w2_weight.data = layer.w2_weight.data.transpose(
                 1, 2).contiguous()
         if is_enable_nz():
+            torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
             torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
         layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
             layer.w13_weight_scale.data.shape[0], -1)