[trainer, recipe] feat: fully async training recipe (#2981)

### What does this PR do?

To implement a purely asynchronous training workflow, we further split the training process into a Trainer and a Rollouter based on the existing one-step-off-policy code, with samples transmitted via a message queue. We will continue to integrate partial rollout to mitigate the impact of long-tail samples on training.

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Related PRs:
- https://github.com/volcengine/verl/pull/2231
- https://github.com/volcengine/verl/pull/2200

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: meituan-search <machi04@meituan.com>
Co-authored-by: wangshulin02 <wangshulin02@meituan.com>
Co-authored-by: arron <arron@MBP-2G17FXQ05P-2332.local>
Co-authored-by: wangshulin02 <953550366@qq.com>
Co-authored-by: hadoop-ai-search <hadoop-ai-search@set-zw04-mlp-codelab-pc1189.mt>
Co-authored-by: sl-1314 <82856253+sl-1314@users.noreply.github.com>
Co-authored-by: arron <arron@MBP-VH9RV7LTJC-1907.local>
Co-authored-by: arron <arron@MBP-JFQXPWR11F-1943.local>

docs/advance/fully_async.md (new file, 428 lines)

@@ -0,0 +1,428 @@

# Recipe: Fully Async Policy Async Trainer

**Author:** `https://github.com/meituan-search`

Last updated: 10/17/2025.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and the Rollouter, supporting asynchronous sample generation and training.
Under this system, we achieved a 2.35x-2.67x performance improvement when training the Qwen2.5-7B model with 128 GPUs, without significantly affecting the results.

## Introduction

### Background

Compared with the colocated architecture, a separated rollout/train architecture can allocate resources more flexibly and allows more flexible training logic, which helps address the low GPU utilization and poor training efficiency caused by long-tail generation.
The one_step_off_policy recipe alleviates long rollout times and gains some training efficiency by adopting a separated architecture and running rollout and training asynchronously, one step apart.
However, it is forced to train on data that is exactly one step stale, which is inflexible and cannot completely eliminate the impact of long-tail samples on training efficiency.
Other frameworks such as AReaL, Magistral, StreamRL, and AsyncFlow implement asynchronous and streaming training on top of a separated architecture and report gains.
We draw on their methods and implement them in verl: fully_async_policy supports asynchronous, streaming, and partial-rollout training.
With reasonable settings for resource allocation, parameter synchronization frequency, and related parameters, fully_async_policy can significantly improve training efficiency.

> Magistral. https://arxiv.org/abs/2506.10910
>
> AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. https://arxiv.org/abs/2505.24298
>
> StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation. https://arxiv.org/abs/2504.15930
>
> AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. https://arxiv.org/abs/2507.01663

### Core Contributions

* **Resource Isolation**: Unlike the colocated hybrid_engine setup, the Rollouter and the Trainer use separate computing resources, and the resources each occupies must be specified separately.
* **Parallel Generation and Training**: While the Trainer is training, the Rollouter keeps generating new samples.
* **Multi-Step Asynchrony**: Compared with one-step-off policy, it supports asynchrony ranging from 0.x steps to multiple steps, making the asynchronous setup more flexible.
* **NCCL Parameter Synchronization**: Uses NCCL communication primitives for parameter synchronization between the Rollouter and the Trainer.
* **Streaming Inference and Training**: The Rollouter generates data sample by sample, and a single sample is the minimum unit of data transmission.
* **Asynchronous Training and Freshness Control**: By setting `async_training.staleness_threshold`, it supports training with samples generated by older parameter versions.
* **Partial Rollout**: The Rollouter's inference process supports partial-rollout logic. During parameter synchronization, `sleep()` and `resume()` logic saves samples from ongoing rollouts and resumes them in the next rollout, reducing the time spent waiting for in-flight requests to finish at synchronization time.

Currently, the supported combination is FSDP + vLLM, and vLLM must run in server mode based on AgentLoop.

## Design

The overall architecture of fully_async_policy is shown in the figure below. It consists of four main parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.



1. The Rollouter generates sequences sample by sample and puts the generated samples into the MessageQueue; its production speed is controlled by the freshness constraint.
2. The MessageQueue temporarily stores the samples generated by the Rollouter.
3. The Trainer fetches samples from the MessageQueue one by one. After fetching `require_batches * ppo_mini_batch_size` samples it performs a training step, and after `async_training.trigger_parameter_sync_step` such steps it triggers a parameter synchronization with the Rollouter.
4. The ParameterSynchronizer implements synchronous NCCL-based parameter synchronization.
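
The control flow can be summarized with a minimal sketch. This is an illustration only: it uses plain Python threads, a `threading.Event`, and `queue.Queue` instead of the recipe's actual Ray-based components, and `generate_sample`, `train_step`, and `sync_parameters` are hypothetical stand-ins rather than real recipe APIs.

```python
import queue
import threading

REQUIRE_BATCHES = 1          # async_training.require_batches
PPO_MINI_BATCH_SIZE = 32     # actor_rollout_ref.actor.ppo_mini_batch_size
TRIGGER_SYNC_STEP = 4        # async_training.trigger_parameter_sync_step
TOTAL_ROLLOUT_SAMPLES = 1024 # rollout.total_rollout_steps

message_queue: queue.Queue = queue.Queue()

def rollouter(generate_sample, resume_event: threading.Event) -> None:
    """Produce samples one by one; pause while a parameter sync is in progress."""
    for i in range(TOTAL_ROLLOUT_SAMPLES):
        resume_event.wait()                # blocked during parameter synchronization
        message_queue.put(generate_sample(i))

def trainer(train_step, sync_parameters, resume_event: threading.Event) -> None:
    """Consume mini-batches and periodically trigger a parameter synchronization."""
    batch_size = REQUIRE_BATCHES * PPO_MINI_BATCH_SIZE
    consumed, local_steps = 0, 0
    while consumed < TOTAL_ROLLOUT_SAMPLES:
        batch = [message_queue.get() for _ in range(batch_size)]
        consumed += len(batch)
        train_step(batch)
        local_steps += 1
        if local_steps % TRIGGER_SYNC_STEP == 0:
            resume_event.clear()           # stop the Rollouter from starting new samples
            sync_parameters()              # NCCL weight broadcast in the real recipe
            resume_event.set()
```

A real run would launch the two loops as separate workers with the event initially set; the recipe replaces the in-process queue and event with its MessageQueue component and NCCL-based synchronization.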

The source of the gains over the colocated baseline is that, in the colocated case, giving rollout more resources cannot remove the idle time caused by long-tail samples.
After resource isolation, rollout and training may each take longer than before (because each uses fewer resources), but overlapping them reduces the end-to-end time, as the toy model below illustrates.


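
To make the overlap argument concrete, here is a toy timing model. All numbers are made up for illustration and are not measurements from the recipe.

```python
# Toy per-step timing model (seconds), illustrative numbers only.
# Colocated: rollout time is dominated by a few long-tail responses, so using
# all GPUs barely shortens it; training then runs on all GPUs afterwards.
long_tail_rollout = 180.0        # rollout, limited by the longest responses
train_all_gpus = 120.0           # training on all GPUs
colocated_step = long_tail_rollout + train_all_gpus             # 300.0

# Disaggregated: each side keeps half the GPUs. Rollout is still limited by the
# long tail (roughly unchanged), training slows down on half the GPUs, but the
# two phases now overlap, so the steady-state step time is their maximum.
rollout_half_gpus = 200.0        # long tail dominates, only slightly slower
train_half_gpus = 2 * train_all_gpus                            # 240.0
disaggregated_step = max(rollout_half_gpus, train_half_gpus)    # 240.0

print(colocated_step, disaggregated_step)                       # 300.0 vs 240.0
```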

## Usage

### Parameter Description

| parameter | meaning |
|---|---|
| `trainer.nnodes` | Number of nodes for the Trainer |
| `trainer.n_gpus_per_node` | Number of GPUs per node for the Trainer |
| `rollout.nnodes` | Number of nodes for the Rollouter |
| `rollout.n_gpus_per_node` | Number of GPUs per node for the Rollouter |
| `data.train_batch_size` | Not effective in the fully async strategy (default is 0) |
| `data.gen_batch_size` | The fully async strategy uses streaming sample production (default is 1) |
| `rollout.total_rollout_steps` | Total number of rollout samples |
| `rollout.test_freq` | Number of Rollouter parameter updates between validations |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | Global mini-batch size across all workers/GPUs |
| `async_training.require_batches` | Number of `ppo_mini_batch_size` batches the FullyAsyncTrainer fetches at once |
| `async_training.trigger_parameter_sync_step` | Number of local updates the FullyAsyncTrainer performs before a parameter synchronization |
| `async_training.staleness_threshold` | Freshness control |
| `async_training.partial_rollout` | Whether to perform partial rollout |
| `async_training.use_rollout_log_probs` | Use the log_probs produced by rollout |

**Further Explanation:**

* `rollout.total_rollout_steps`

  Compared with colocated training, the total sample count can be aligned by multiplying the batch size and the number of steps:
  `rollout.total_rollout_steps = data.train_batch_size * step`.

* `async_training.trigger_parameter_sync_step`

  In the fully async strategy, this indicates how many local updates the Trainer performs (i.e., how many times it fetches
  `require_batches * ppo_mini_batch_size` samples) before a parameter synchronization with the Rollouter.
  Between two consecutive parameter synchronizations, the Trainer processes
  `trigger_parameter_sync_step * require_batches * ppo_mini_batch_size` samples.
  For a fair speed comparison with colocated training, trigger_parameter_sync_step should be set to
  `data.train_batch_size / (require_batches * ppo_mini_batch_size)` (see the sketch after this list).

* `async_training.staleness_threshold`

  In the fully async strategy, this is the maximum allowed proportion of stale samples.

  * staleness_threshold = 0 means synchronous training.
    The Rollouter generates a fixed number of samples between two parameter updates:

    $$rollout\_num = trigger\_parameter\_sync\_step \times require\_batches \times ppo\_mini\_batch\_size$$

  * staleness_threshold > 0 means asynchronous training; it can be set to a fraction for more flexible asynchrony.
    The Rollouter generates at most the following number of samples between two parameter updates:

    $$rollout\_num = (1+staleness\_threshold) \times (trigger\_parameter\_sync\_step \times require\_batches \times ppo\_mini\_batch\_size) - num\_stale\_samples$$

    where num_stale_samples is the number of extra stale samples generated before the previous parameter synchronization.

  Since this is a streaming system, the Rollouter keeps producing while the Trainer keeps consuming. If the Rollouter is slower,
  the Trainer triggers parameter synchronization earlier and the Rollouter never actually produces rollout_num samples.
  When rollout is fast enough, setting staleness_threshold to 1 is roughly equivalent to the one-step-off policy.
  To keep stale samples from hurting training accuracy, we recommend setting this value below 1 (a budget calculator is sketched after this list).

* `async_training.partial_rollout`

  partial_rollout only takes effect when staleness_threshold > 0.

* `async_training.use_rollout_log_probs`

  In reinforcement learning algorithms, log_probs are implicitly tied to the parameter version and the sampled tokens. Given how
  PPO/GRPO/DAPO compute importance sampling, old_log_prob must be the log_probs corresponding to the rollout parameters and tokens
  to keep the algorithm correct. In the fully async strategy, old_log_prob therefore defaults to being computed by rollout rather than by the trainer.

* `async_training.require_batches`

  In streaming training, require_batches should be set to 1, meaning training starts as soon as ppo_mini_batch_size samples have been produced.
  In practice we found that dispatching fewer samples at once can, because of the order in which data is distributed, cause training instability and longer responses.
  We therefore expose require_batches to control streaming dispatch and the number of samples that participate in one training step.
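
The sample-count bookkeeping above is easy to get wrong, so here is a small illustrative calculator. It only restates the formulas in this section; the function names are hypothetical and not part of the recipe's API.

```python
def fair_trigger_parameter_sync_step(train_batch_size: int,
                                     require_batches: int,
                                     ppo_mini_batch_size: int) -> int:
    """trigger_parameter_sync_step that matches one colocated training step."""
    return train_batch_size // (require_batches * ppo_mini_batch_size)

def rollout_budget(trigger_parameter_sync_step: int,
                   require_batches: int,
                   ppo_mini_batch_size: int,
                   staleness_threshold: float = 0.0,
                   num_stale_samples: int = 0) -> int:
    """Maximum number of samples the Rollouter may produce between two parameter syncs."""
    per_sync = trigger_parameter_sync_step * require_batches * ppo_mini_batch_size
    if staleness_threshold == 0:
        return per_sync
    return int((1 + staleness_threshold) * per_sync) - num_stale_samples

# Example matching the colocated baseline used later in this document:
# train_batch_size=512, require_batches=4, ppo_mini_batch_size=32
print(fair_trigger_parameter_sync_step(512, 4, 32))       # -> 4
print(rollout_budget(4, 4, 32, staleness_threshold=0.3))  # -> 665 (upper bound, minus stale carry-over)
```

And a minimal sketch of why `use_rollout_log_probs` matters for importance sampling; the tensor names are hypothetical and this is not the recipe's actual loss code.

```python
import torch

def importance_ratio(trainer_log_probs: torch.Tensor,
                     rollout_log_probs: torch.Tensor) -> torch.Tensor:
    """PPO-style ratio pi_theta(a|s) / pi_rollout(a|s).

    With use_rollout_log_probs=True, old_log_prob comes from the rollout policy that
    actually sampled the tokens, so the ratio reflects the true behavior policy.
    Recomputing old_log_prob with the trainer's newer weights would bias the ratio
    toward 1 and weaken the off-policy correction.
    """
    return torch.exp(trainer_log_probs - rollout_log_probs)
```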

### Supported Modes

1. on-policy pipeline:
   1. **trigger_parameter_sync_step=1, staleness_threshold=0**
   2. The Rollouter produces `require_batches * ppo_mini_batch_size` samples at once, the Trainer fetches these samples and trains on them, and after training completes the Trainer and the Rollouter perform a parameter synchronization.
   3. During the rollout phase, if there are long-tail samples but only a few samples to roll out, the shorter samples cannot fill the idle resources, which wastes some resources.
   4. Shown in figure (a).

2. stream off-policy pipeline:
   1. **trigger_parameter_sync_step>1, staleness_threshold=0**
   2. Synchronous streaming training. The Rollouter produces `require_batches * ppo_mini_batch_size * trigger_parameter_sync_step` samples at once, the Trainer performs one local update every time it fetches `require_batches * ppo_mini_batch_size` samples, and after trigger_parameter_sync_step such updates the Trainer and the Rollouter perform a parameter synchronization.
   3. Compared with (a), more samples are generated at once, so resources sit idle less.
   4. Within one training step there are still two idle periods: while fetching the first batch, the Trainer waits for `require_batches * ppo_mini_batch_size` samples to be produced, and during the last parameter update the Rollouter waits for training to complete.
   5. Shown in figure (b).

3. async stream pipeline with stale samples:
   1. **trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=False**
   2. After each parameter update, the Rollouter plans to produce at most rollout_num samples (in practice it may produce fewer, depending on rollout speed).
   3. If rollout is relatively fast, the Rollouter generates num_stale_samples extra samples before parameter synchronization so the Trainer can use them immediately after synchronization. When parameter synchronization is triggered, any in-flight rollout tasks are allowed to finish, but no new tasks are added.
   4. Compared with (b), every step after the first no longer waits for the first batch to be rolled out, but it does wait for in-flight tasks to finish.
   5. Shown in figure (c).

4. async stream pipeline with partial rollout:
   1. **trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=True**
   2. Compared with (c), if the Rollouter still has samples in progress when parameter synchronization is triggered, the rollout is interrupted and the parameters are synchronized; the interrupted samples continue generating after synchronization. This reduces the time spent waiting for in-flight tasks to finish.
   3. Shown in figure (d).



### Key Metrics

| metric | meaning |
|---|---|
| `trainer/idle_ratio` | Trainer idle ratio |
| `rollouter/idle_ratio` | Rollouter idle ratio |
| `fully_async/count/stale_samples_processed` | Total number of stale samples used in training |
| `fully_async/count/stale_trajectory_processed` | Total number of stale trajectories used in training (one sample produces rollout.n trajectories) |
| `fully_async/partial/total_partial_num` | Number of partial samples processed by the Trainer between two parameter synchronizations |
| `fully_async/partial/partial_ratio` | Ratio of partial samples processed by the Trainer between two parameter synchronizations |
| `fully_async/partial/max_partial_span` | Maximum parameter-version span of partial samples processed by the Trainer between two parameter synchronizations |

### Parameter Tuning Recommendations

* Resource Allocation and Adjustment:
  * Reasonable resource allocation is the prerequisite for good training efficiency. Ideally, resources are split so that rollout time and train time are close, which minimizes pipeline bubbles, avoids idle resources, and keeps the Trainer from consuming stale samples. In real training, resource allocation can be adjusted based on the observed idle time of rollout and train, available as rollouter/idle_ratio and trainer/idle_ratio. If rollouter/idle_ratio is high and trainer/idle_ratio is low, shift resources from the Rollouter to the Trainer, and vice versa (a toy adjustment rule is sketched after this list).

* Key Parameters:
  * staleness_threshold: Setting it too high causes more stale samples to be used, hurting model performance. We recommend values below 1.
  * require_batches: The closer to 1, the closer to pure streaming, the smaller the training bubbles, and the larger the speedup, but it affects the order in which samples are processed.
  * trigger_parameter_sync_step: Smaller values are closer to on-policy training but cause frequent parameter synchronization, and long-tail samples leave idle resources that short samples cannot fill, lowering utilization. Larger values improve computational efficiency, but accuracy suffers from being more off-policy.
  * rollout.test_freq: Validation occupies Rollouter resources, so do not set this value too small (i.e., do not validate too often).

* Mode Selection: By adjusting these parameters, the fully async architecture supports different levels of acceleration for different scenarios.
  * For small-scale tasks that must preserve training stability and on-policy behavior and have modest speed requirements, try the on-policy pipeline mode (Mode 1).
  * For scenarios that need higher training throughput but are sensitive to staleness, try the stream off-policy pipeline mode: set trigger_parameter_sync_step > 1 to improve training efficiency while keeping the synchronization mechanism (staleness_threshold = 0) (Mode 2).
  * For large-scale tasks with high speed requirements that can tolerate some off-policyness and staleness, set staleness_threshold > 0 and partial_rollout=True to improve training efficiency, i.e., use the async stream pipeline modes (Mode 3 or 4).
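
As a rough illustration of the idle-ratio rule above, the following toy heuristic decides which side should get more GPUs. The function name and the tolerance value are made up for illustration and are not part of the recipe.

```python
def suggest_resource_shift(rollouter_idle_ratio: float,
                           trainer_idle_ratio: float,
                           tolerance: float = 0.1) -> str:
    """Toy heuristic: move GPUs toward the side that is busier (less idle)."""
    gap = rollouter_idle_ratio - trainer_idle_ratio
    if gap > tolerance:
        return "Rollouter is mostly waiting: move GPUs from Rollouter to Trainer"
    if gap < -tolerance:
        return "Trainer is mostly waiting: move GPUs from Trainer to Rollouter"
    return "rollout and training are roughly balanced: keep the current split"

print(suggest_resource_shift(rollouter_idle_ratio=0.35, trainer_idle_ratio=0.05))
```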

### Quick Start

```shell
rollout_mode="async"
rollout_name="vllm" # sglang or vllm
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

use_dynamic_bsz=True # referenced below; assumed True (not set in the original snippet)
train_prompt_bsz=0
gen_prompt_bsz=1
n_resp_per_prompt=16
train_prompt_mini_bsz=32
total_rollout_steps=$(((512*400)))
test_freq=10
staleness_threshold=0
trigger_parameter_sync_step=16
partial_rollout=False

python -m recipe.fully_async_policy.fully_async_main \
    data.train_batch_size=${train_prompt_bsz} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.return_raw_chat=${return_raw_chat} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    critic.strategy=fsdp2 \
    actor_rollout_ref.hybrid_engine=False \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.name=${rollout_name} \
    actor_rollout_ref.rollout.mode=${rollout_mode} \
    actor_rollout_ref.rollout.calculate_log_probs=True \
    trainer.nnodes="${NNODES_TRAIN}" \
    trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
    rollout.nnodes="${NNODES_ROLLOUT}" \
    rollout.n_gpus_per_node="${NGPUS_PER_NODE}" \
    rollout.total_rollout_steps="${total_rollout_steps}" \
    rollout.test_freq="${test_freq}" \
    async_training.staleness_threshold="${staleness_threshold}" \
    async_training.trigger_parameter_sync_step="${trigger_parameter_sync_step}" \
    async_training.partial_rollout="${partial_rollout}"
```

## Experiments

### Asynchronous Training on a 7B Model

We used Qwen2.5-Math-7B to verify the benefits of the fully async strategy with long responses at several resource scales.
Using the `async stream pipeline with stale samples` strategy, we achieved roughly a 2x speedup on 32, 64, and 128 GPUs without significantly affecting the experimental results.

* Machine: H20
* Model: Qwen2.5-Math-7B
* Rollout length: max_response_length: 28K tokens
* Algorithm: DAPO
* Dataset: TRAIN_FILE: dapo-math-17k.parquet, TEST_FILE: aime-2024.parquet
* Engine: vLLM + FSDP2
* rollout.n: 16
* ppo_mini_batch_size: 32
* test_freq: 20

* colocate sync:
  * step: 400
  * train_batch_size: 512

* fully_async_policy:
  * total_rollout_steps: 512*400
  * require_batches: 4
  * trigger_parameter_sync_step: 4
  * staleness_threshold: 0.3
  * partial_rollout: True

| training mode | resource allocation | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| colocate sync | 32 | 790.10 | 357.41 | 107.71 | 313.81 | 13h 44m | 1d 3h 43m | 2d 9h 22m | 3d 17h 5m | max: 0.3313<br>last: 0.2448 |
| fully_async_policy | 16:16 | | | \ | | | | | | max: <br>last: |
| colocate sync | 64 | 365.28 | 150.72 | 70.26 | 133.41 | 10h 22m | 20h 45m | 1d 7h 6m | 1d 17h 32m | max: 0.3365<br>last: 0.2333 |
| fully_async_policy | 32:32 | 189.26 | 28.46 | \ | 156.98 | 4h 57m<br>(2.09x) | 10h 14m<br>(2.03x) | 16h 58m<br>(1.83x) | 21h 40m<br>(1.92x) | max: 0.3677<br>last: 0.3406 |
| colocate sync | 128 | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | max: 0.3573<br>last: 0.2958 |
| fully_async_policy | 64:64 | 150.63 | 33.14 | \ | 113.16 | 3h 13m<br>(2.67x) | 6h 46m<br>(2.65x) | 10h 53m<br>(2.67x) | 17h 22m<br>(2.35x) | max: 0.3521<br>last: 0.3094 |

> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-colocate_async?nw=nwuserhouzg

### 128-GPU 7B Asynchronous Mode Experiment

We used Qwen2.5-Math-7B to verify the effect of each mode supported by fully async.
The benefit from streaming alone is approximately 0.6x, and after adding staleness and partial_rollout the benefit reaches 2.35x.

| mode | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| colocate sync | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | max: 0.3573<br>last: 0.2958 |
| `stream off policy pipeline`<br>(+fully async: trigger_parameter_sync_step=4,<br>require_batches=4) | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844<br>last: 0.2604 |
| `async stream pipeline with stale samples`<br>(+staleness_threshold=0.5) | | | | | | | | | |
| `async stream pipeline with partial rollout`<br>(+partial_rollout=True) | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521<br>last: 0.3094 |

> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg

### 128-GPU Staleness Ablation Experiment

Under the `async stream pipeline with partial rollout` mode, we verified the impact of the staleness setting on training efficiency.
The larger the staleness, the larger the end-to-end gains.
We also noticed that the times for staleness values of 0.3 and 0.5 are quite close, because as training progresses the response length changes significantly, causing training instability.
This issue needs further analysis and optimization.

| staleness_threshold | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844<br>last: 0.2604 |
| 0.1 | 171.30 | 58.17 | \ | 109.12 | 3h 53m | 8h 37m | 14h 25m | 19h 59m | max: 0.3542<br>last: 0.2979 |
| 0.3 | 146.11 | 38.88 | \ | 103.22 | 3h 18m | 6h 49m | 11h 40m | 17h 20m | max: 0.3469<br>last: 0.2865 |
| 0.5 | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521<br>last: 0.3094 |

> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg

### 128-GPU 7B require_batches Ablation Experiment

Across multiple tests, we found that the number of samples dispatched at a time in streaming affects the response length during training, which in turn affects training time. We verified the impact by varying `async_training.require_batches`.

| require_batches | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | acc/mean@1 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 203.47 | 30.88 | \ | 181.08 | 3h 31m | 8h 29m | 17h 36m | max: 0.349<br>last: 0.326 |
| 2 | 158.72 | 26.32 | \ | 128.08 | 3h 35m | 7h 38m | 13h 57m | max: 0.351<br>last: 0.3406 |
| 4 | 124.64 | 25.62 | \ | 95.06 | 3h 13m | 6h 46m | 10h 53m | max: 0.3521<br>last: 0.3521 |

> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-ablation_require_batches?nw=nwuserhouzg

### 30B Model Mode Experiment

TODO: The 30B experiment is still in progress.

* Machine: H20
* Model: Qwen2.5-32B
* Rollout length: max_response_length: 20K tokens
* Algorithm: DAPO
* Engine: vLLM + FSDP2
* rollout.n: 16
* ppo_mini_batch_size: 32
* test_freq: 20

* colocate sync:
  * step: 200
  * train_batch_size: 512

* fully_async_policy:
  * total_rollout_steps: 512*200
  * trigger_parameter_sync_step: 512/32 = 16
  * staleness_threshold: 0
  * partial_rollout: False

| training mode | resource allocation | mode | step | generate_sequences | old_log_prob | update_actor | total time | acc/best@32/mean |
|---|---|---|---|---|---|---|---|---|
| colocate sync | 128 | | | | | | | |
| fully_async_policy | 64:64 | stream off policy pipeline | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with stale samples | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with partial rollout | | | | | | |

## Future Plans

* GRPO experiments
* Megatron adaptation
* SGLang integration
* Transfer queue integration
* Asynchronous parameter synchronization
* AReaL asynchronous algorithm implementation
* TPPO algorithm implementation
* Multi-turn and tool support
@@ -124,6 +124,7 @@ verl is fast with:
    advance/rollout_is_migration.md
    advance/one_step_off
    advance/agent_loop
+   advance/fully_async

 .. toctree::
    :maxdepth: 1