Commit Graph

8 Commits

Author SHA1 Message Date
9935d45728 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
f6af1d2471 [MISC] fix logger (#515)
logger in vllm-ascend doesn't work. This PR fix the issue.

Fix: https://github.com/vllm-project/vllm-ascend/issues/431

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-15 10:18:05 +08:00
f6cf92e7d5 [quant][bugfix] fix deepseek quant bug (#478)
see #465

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-08 09:15:56 +08:00
344228a5da [deepseek][bugfix] support deepseek quant (#469)
- support deepseek quant
  - add w8a8_dynamic quant
see #391

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-07 10:56:12 +08:00
7330416de3 [BugFix] Fix bugs when using ascend quantization (#275)
### What this PR does / why we need it?
It fixes following bugs:
1. When searching a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear is required to
identify correponding quant type.
2. The exception is narrowed down to ImportError when importing
MindIETurboQuantizer to better throw other errors.
3. The api of AscendKVCacheMethod.apply is aligned with that in
AscendAttentionBackendImpl.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/d63804cf-c060-451f-9cb0-d012e06b5333)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-03-12 11:33:21 +08:00
3217f0d10f [Feature] Modify description and api for ascend quantization (#243)
### What this PR does / why we need it?
1. It adds more description for classes in quant_config.py
2. It renames AscendQKVQuantAttentionMethod to AscendKVCacheMethod to
align with vLLM naming style.
3. It modifies the process when AscendLinearMethod or
AscendKVCacheMethod calls create_weights.


### Does this PR introduce _any_ user-facing change?
Yes. When creating weights, now AscendLinearMethod uses get_weight,
get_pertensor_param and get_perchannel_param api from linear quant
implementation, while AscendKVCacheMethod passes layer into linear quant
implementation.

### How was this patch tested?
By performing offline inference

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-03-06 15:17:25 +08:00
whx
0d3463400a [Performance] Change the shape of kv_cache to avoid view of k_cache and v_cache. (#204)
This PR changes the shape of kv cache to avoid the view of k_cache and
v_cache.
What's more, cache the metadata of k_cache and v_cache to avoid
duplicative slice operations to improve performance.

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
2025-03-05 10:51:07 +08:00
5f465010de [Core] Cherry pick from 0.7.1 to keep the main code newest (#127)
Cherry pick from 0.7.1 to keep the main code newest

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-21 17:07:37 +08:00