Files
verl/docs/ascend_tutorial/ascend_quick_start.rst

229 lines
11 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

verl x Ascend
===================================
Last updated: 08/15/2025.
我们在 verl 上增加对华为昇腾设备的支持。
硬件支持
-----------------------------------
Atlas 200T A2 Box16
Atlas 900 A2 PODc
安装
-----------------------------------
基础环境准备
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------+-------------+
| software | version |
+-----------+-------------+
| Python | == 3.10 |
+-----------+-------------+
| CANN | == 8.1.RC1 |
+-----------+-------------+
| torch | == 2.5.1 |
+-----------+-------------+
| torch_npu | == 2.5.1 |
+-----------+-------------+
基础环境准备请参照这份 `文档 <https://gitee.com/ascend/pytorch>`_
vllm & vllm-ascend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
为了能够在 verl 中正常使用 vllm需使用以下命令编译安装 vllm 和 vllm-ascend。请注意根据机器类型区分安装方式。
.. code-block:: bash
# vllm
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
# for Atlas 200T A2 Box16
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
# for Atlas 900 A2 PODc
VLLM_TARGET_DEVICE=empty pip install -e .
.. code-block:: bash
# vllm-ascend
git clone -b v0.7.3.post1 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install
安装verl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install -r requirements-npu.txt
pip install -e .
其他三方库说明
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+--------------+---------------+
| software | description |
+--------------+---------------+
| transformers | v4.52.4 |
+--------------+---------------+
| flash_attn | not supported |
+--------------+---------------+
| liger-kernel | not supported |
+--------------+---------------+
1. 支持通过 transformers 使能 --flash_attention_2 transformers 需等于 4.52.4版本。
2. 不支持通过 flash_attn 使能 flash attention 加速。
3. 不支持 liger-kernel 使能。
4. 针对 x86 服务器,需要安装 cpu 版本的 torchvision。
.. code-block:: bash
pip install torchvision==0.20.1+cpu --index-url https://download.pytorch.org/whl/cpu
快速开始
-----------------------------------
正式使用前建议您通过对Qwen2.5-0.5B GRPO的训练尝试以检验环境准备和安装的正确性。
1.下载数据集并将数据集预处理为parquet格式以便包含计算RL奖励所需的必要字段
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
2.执行训练
.. code-block:: bash
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=128 \
data.max_prompt_length=512 \
data.max_response_length=128 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-7 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=20 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=1 \
trainer.device=npu $@
(可选) 设置MindSpeed训练后端指导
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. 参考 `MindSpeed README <https://gitee.com/ascend/MindSpeed>`_ 说明安装 MindSpeed 加速库。
2. 使能 verl worker 模型 ``strategy`` 配置为 ``megatron`` ,例如 ``actor_rollout_ref.actor.strategy=megatron``
3. MindSpeed 自定义入参可通过 ``override_transformer_config`` 参数传入,例如对 actor 模型开启 FA 特性可使用 ``+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True``
4. 更多特性信息可参考 `MindSpeed+verl 文档 <https://gitee.com/ascend/MindSpeed/blob/master/docs/user-guide/verl.md>`_
支持现状
-----------------------------------
**表1** RL类算法
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| algorithm | model | rewards mae | throughput ratio | actor.strategy | rollout.name | hardware |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| GRPO | Qwen2.5-7B-instruct | 0.38% | 0.588 | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| GRPO | Qwen2.5-32B-instruct | 0.30% | 0.685 | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| GRPO | Qwen2.5-VL-3B-instruct | 3.14% | 0.470 | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| GRPO | Qwen2.5-VL-7B-instruct | 3.30% | 0.380 | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| GRPO | Qwen2.5-VL-32B-instruct | 0.79% | 0.568 | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| DAPO | Qwen2.5-7B-instruct | 3.83% | pending | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| DAPO | Qwen2.5-32B | 3.42% | pending | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| DAPO | Qwen3-8B-base | 5.3% | pending | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| DAPO | Qwen3-14B-base | 5.9% | pending | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
| DAPO | Qwen3-30B-base | 1.08% | pending | FSDP | vllm-ascend | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+-------------------+-------------------+--------------------------+
**表2** SFT类算法
+-----------+-------------------------+----------------+-------------------+-------------------+----------------------+
| algorithm | model | train loss mae | total time ratio | actor.strategy | hardware |
+-----------+-------------------------+----------------+-------------------+-------------------+----------------------+
| SFT-PEFT | Qwen3-8B | 0.09% | 0.705 | FSDP | Atlas 900 A2 PODc |
+-----------+-------------------------+----------------+-------------------+-------------------+----------------------+
| ReTool-SFT| Qwen2.5-7B-instruct | 0.08% | 0.775 | FSDP | Atlas 900 A2 PODc |
+-----------+-------------------------+----------------+-------------------+-------------------+----------------------+
精度对比说明
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
对于 SFT 类算法,我们期望在相同配置下华为昇腾设备与 A100 的 loss 平均绝对误差<= 2%。计算方式如下图。更多信息请参考 `精度计算说明 <https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html>`_
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/loss_comparison.png?raw=true
:alt: loss_comparison
根据经验,对于 GRPO 等 RL 类算法,我们期望在相同配置下华为昇腾设备与 A100 的 rewards 平均绝对误差<= 4%,计算方式参考上图。
吞吐对比说明
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ascend npu 和 A100 分别取日志中前4个 step 的 "perf/throughput" 做平均, throughput ratio = npu 平均值 / A100 平均值。
计划
-----------------------------------
查看 `roadmap <https://github.com/volcengine/verl/discussions/2171>`_ 获取更多特性的支持进度。
声明
-----------------------------------
verl中提供的ascend支持代码皆为参考样例如在生产环境中使用请通过官方正式途径沟通谢谢。