[trainer, recipe] feat: fully async training recipe (#2981)

### What does this PR do?

To implement a purely asynchronous training workflow, we further split
the training process into a Trainer and a Rollouter based on the
existing one-step-off policy code, with samples transmitted via a
message queue.

We will continue to integrate partial rollout to mitigate the impact of
long-tail training.

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.
>
> Related PRs: https://github.com/volcengine/verl/pull/2231, https://github.com/volcengine/verl/pull/2200

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: meituan-search <machi04@meituan.com>
Co-authored-by: wangshulin02 <wangshulin02@meituan.com>
Co-authored-by: arron <arron@MBP-2G17FXQ05P-2332.local>
Co-authored-by: wangshulin02 <953550366@qq.com>
Co-authored-by: hadoop-ai-search <hadoop-ai-search@set-zw04-mlp-codelab-pc1189.mt>
Co-authored-by: sl-1314 <82856253+sl-1314@users.noreply.github.com>
Co-authored-by: arron <arron@MBP-VH9RV7LTJC-1907.local>
Co-authored-by: arron <arron@MBP-JFQXPWR11F-1943.local>

docs/advance/fully_async.md (new file)
@@ -0,0 +1,428 @@
# Recipe: Fully Async Policy Trainer
**Author:** `https://github.com/meituan-search`
Last updated: 10/17/2025.
This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
supporting asynchronous sample generation and training.
Under this system, training Qwen2.5-Math-7B on 128 GPUs achieved a 2.35x-2.67x end-to-end speedup without
significantly affecting model quality.
## Introduction
### Background
Compared to the colocate architecture, a separated rollout/train architecture allocates resources more flexibly and
allows more flexible training logic, which addresses the low GPU utilization and poor training efficiency caused by
long-tail generation.
The existing one_step_off_policy recipe alleviates long rollout times and gains some training efficiency by separating
the two roles and running rollout and training one step apart.
However, it is fixed to exactly one step of staleness, which is not flexible enough and cannot completely eliminate the
impact of long-tail samples on training efficiency.
Other frameworks such as AReaL, Magistral, StreamRL, and AsyncFlow have implemented asynchronous and streaming training
on top of a separated architecture and reported gains.
We drew on their methods and implemented them in verl: fully_async_policy supports asynchronous, streaming, and
partial-rollout training.
With reasonable settings for resource allocation, parameter synchronization frequency, and related parameters,
fully_async_policy can significantly improve training efficiency.
> Magistral https://arxiv.org/abs/2506.10910
>
> AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language
> Reasoning https://arxiv.org/abs/2505.24298
>
> StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream
> Generation https://arxiv.org/abs/2504.15930
>
> AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training https://arxiv.org/abs/2507.01663
>
### Core Contributions
* **Resource Isolation**: Unlike using hybrid_engine, Rollouter and Trainer use separate computing resources and need to
specify the resources they occupy separately.
* **Parallel Generation and Training**: While the Trainer is training, the Rollouter is generating new samples.
* **Multi-step Asynchronous**: Compared to one step off policy, it supports asynchronous settings from 0.x steps to
multiple steps, making the asynchronous solution more flexible.
* **NCCL Parameter Synchronization**: Uses NCCL communication primitives for parameter communication between Rollouter
and Trainer.
* **Stream Inference and Training**: Rollouter generates data sample by sample, and data transmission uses a single
sample as the minimum transmission unit.
* **Asynchronous Training and Freshness Control**: By setting the parameter async_training.staleness_threshold, it
supports training with samples generated by old parameters.
* **PartialRollout**: The Rollouter's inference supports partial rollout. During parameter synchronization,
`sleep()` / `resume()` logic saves samples from in-flight rollouts and continues them after synchronization, reducing
the time spent waiting for ongoing requests to finish.
Currently, the supported usage mode is fsdp + vllm, and vllm must use the server mode based on AgentLoop.
## Design
The overall architecture of fully_async_policy is shown in the figure below. fully_async_policy mainly consists of four
parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.
![fully_async_policy_structure](
https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_structure.svg?raw=true)
1. Rollouter generates sequences sample by sample and puts the generated samples into the MessageQueue, with the
production speed controlled by freshness.
2. MessageQueue is used to temporarily store samples generated by Rollouter.
3. Trainer fetches samples from MessageQueue sample by sample. After fetching `require_batches * ppo_mini_batch_size`
samples, it performs one local training step. After `async_training.trigger_parameter_sync_step` local steps, it
triggers a parameter synchronization with Rollouter.
4. ParameterSynchronizer implements NCCL-based synchronous parameter synchronization between Trainer and Rollouter.
The gains over the colocate baseline come from the fact that, in the colocate case, giving more resources to rollout
cannot remove the idle time caused by long-tail samples.
After resource isolation, rollout and training may each take longer than before (because each uses fewer resources),
but overlapping them reduces the end-to-end time.
![fully_async_policy_revenue](
https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_revenue.svg?raw=true)
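To make the data flow concrete, the sketch below shows the producer/consumer pattern in plain Python. It is purely
illustrative: the queue, thread, and function names are hypothetical stand-ins rather than the recipe's actual classes;
in the real system the Rollouter and Trainer run on separate resources and synchronize weights via NCCL.

```python
import queue
import threading

# Hypothetical constants mirroring the config keys described in this document.
PPO_MINI_BATCH_SIZE = 32
REQUIRE_BATCHES = 1
TRIGGER_PARAMETER_SYNC_STEP = 4

message_queue = queue.Queue(maxsize=1024)  # stands in for MessageQueue: buffers streamed samples
param_version = 0                          # bumped on every parameter synchronization


def rollouter(total_samples: int) -> None:
    """Generate samples one by one and push them into the queue (streaming)."""
    for i in range(total_samples):
        sample = {"prompt_id": i, "param_version": param_version}  # placeholder sample
        message_queue.put(sample)  # blocks when the queue is full (back-pressure)


def trainer(total_samples: int) -> None:
    """Consume require_batches * ppo_mini_batch_size samples per local update and
    trigger a parameter sync every trigger_parameter_sync_step local updates."""
    global param_version
    batch_size = REQUIRE_BATCHES * PPO_MINI_BATCH_SIZE
    local_updates = 0
    for _ in range(total_samples // batch_size):
        batch = [message_queue.get() for _ in range(batch_size)]
        # ... one PPO mini-batch update on `batch` would happen here ...
        local_updates += 1
        if local_updates % TRIGGER_PARAMETER_SYNC_STEP == 0:
            param_version += 1  # stands in for the NCCL Trainer -> Rollouter weight sync


if __name__ == "__main__":
    total = 2 * TRIGGER_PARAMETER_SYNC_STEP * REQUIRE_BATCHES * PPO_MINI_BATCH_SIZE
    threading.Thread(target=rollouter, args=(total,), daemon=True).start()
    trainer(total)
    print("final param_version:", param_version)
```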
## Usage
### Parameter Description
| parameter | meaning |
|-----------------------------------------------|------------------------------------------------------------------------------------------------|
| `trainer.nnodes` | Number of nodes for Trainer |
| `trainer.n_gpus_per_node` | Number of GPUs per node for Trainer |
| `rollout.nnodes` | Number of nodes for Rollouter |
| `rollout.n_gpus_per_node` | Number of GPUs per node for Rollouter |
| `data.train_batch_size` | In the fully async strategy, this value is not effective (default is 0) |
| `data.gen_batch_size` | In the fully async strategy, uses streaming sample production logic (default is 1) |
| `rollout.total_rollout_steps` | Total number of rollout samples |
| `rollout.test_freq` | How many times Rollouter updates parameters before performing a validation |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | Global PPO mini-batch size across all workers/GPUs |
| `async_training.require_batches` | Number of mini-batches (of ppo_mini_batch_size samples) FullyAsyncTrainer fetches per local update |
| `async_training.trigger_parameter_sync_step` | Number of local updates FullyAsyncTrainer performs before each parameter synchronization |
| `async_training.staleness_threshold` | Freshness control: maximum allowed proportion of stale samples |
| `async_training.partial_rollout` | Whether to enable partial rollout |
| `async_training.use_rollout_log_probs` | Whether to use the log_probs produced by rollout as old_log_prob |
**Further Explanation:**
* `rollout.total_rollout_steps`
Compared to colocate, the quantity can be aligned by multiplying train_batch_size and step:
`rollout.total_rollout_steps = data.train_batch_size * step`.
* `async_training.trigger_parameter_sync_step`
In the fully async strategy, it indicates how many local updates the Trainer performs (i.e., how many times it fetches
`require_batches * ppo_mini_batch_size` samples) before a parameter synchronization with Rollouter.
Between two parameter synchronizations, the Trainer therefore processes
`trigger_parameter_sync_step * require_batches * ppo_mini_batch_size` samples.
For a fair speed comparison with colocate, trigger_parameter_sync_step should be set to
`data.train_batch_size / (require_batches * ppo_mini_batch_size)`.
* `async_training.staleness_threshold`
In the fully async strategy, it indicates the maximum proportion of stale samples allowed to be used (see the sketch
after this list).
  * staleness_threshold=0 means synchronous training.
Rollouter generates a fixed number of samples between two parameter updates:
$$rollout\_num = trigger\_parameter\_sync\_step \times require\_batches \times ppo\_mini\_batch\_size$$
  * staleness_threshold>0 means asynchronous training; it can be set to a fractional value for more flexible
asynchrony.
Rollouter generates at most the following number of samples between two parameter updates:
$$rollout\_num = (1+staleness\_threshold) \times (trigger\_parameter\_sync\_step \times require\_batches \times ppo\_mini\_batch\_size) - num\_stale\_samples$$
num_stale_samples is the number of surplus stale samples generated during the previous rollout window.
Because this is a streaming system, the Rollouter keeps producing while the Trainer keeps consuming; if the Rollouter
is slower, the Trainer triggers parameter synchronization earlier and the Rollouter does not actually produce
rollout_num samples.
When rollout is fast enough, setting staleness_threshold to 1 is roughly equivalent to the one_step_off policy.
To avoid too many stale samples degrading training accuracy, it is recommended to keep this value below 1.
* `async_training.partial_rollout`
partial_rollout only actually takes effect when staleness_threshold>0.
* `async_training.use_rollout_log_probs`
In reinforcement learning algorithms, log_probs are implicitly tied to the parameter version and the sampled tokens.
For algorithms such as PPO/GRPO/DAPO, the importance-sampling ratio requires old_log_prob to come from the parameters
that actually generated the tokens, otherwise the algorithm is not correct. In the fully async strategy, old_log_prob
therefore defaults to the log_probs computed by rollout rather than recomputed by the trainer.
* `async_training.require_batches`
For pure streaming, require_batches should be 1, meaning a local update is performed as soon as ppo_mini_batch_size
samples have been produced.
In practice we found that issuing fewer samples at a time can, because of the order in which data is distributed,
cause training instability and longer response lengths.
We therefore expose require_batches to control how many mini-batches are distributed and trained on at once.
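As a concrete illustration of the bookkeeping above (samples consumed per synchronization, the fair-comparison setting
for trigger_parameter_sync_step, and the staleness-bounded rollout budget), here is a small arithmetic sketch in plain
Python. The numbers are borrowed from the 7B experiment settings later in this document, and `num_stale_samples` is
assumed to be zero at the start of training; the actual implementation works with integer caps.

```python
# Illustrative arithmetic only; names mirror the config keys explained above.
ppo_mini_batch_size = 32
require_batches = 4
trigger_parameter_sync_step = 4
staleness_threshold = 0.3
train_batch_size = 512  # colocate baseline batch size, used only for comparison

# Samples the Trainer consumes between two parameter synchronizations.
samples_per_sync = trigger_parameter_sync_step * require_batches * ppo_mini_batch_size
print(samples_per_sync)  # 512

# Fair speed comparison with colocate:
# trigger_parameter_sync_step == train_batch_size / (require_batches * ppo_mini_batch_size)
assert trigger_parameter_sync_step == train_batch_size // (require_batches * ppo_mini_batch_size)

# Upper bound on samples the Rollouter may generate between two synchronizations.
num_stale_samples = 0  # surplus carried over from the previous window (assumed 0 here)
rollout_num = (1 + staleness_threshold) * samples_per_sync - num_stale_samples
print(rollout_num)  # 665.6 -> the real budget is an integer cap around this value
```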
### Supported Modes
1. on policy pipeline:
   1. **trigger_parameter_sync_step=1, staleness_threshold=0**
   2. Rollouter produces `require_batches * ppo_mini_batch_size` samples at once, the Trainer fetches these samples for
      training, and after training completes, Trainer and Rollouter perform a parameter synchronization.
   3. During the rollout phase, if there are long-tail samples but only a few rollout samples overall, shorter samples
      cannot fill the idle resources, causing some resource waste.
   4. As shown in figure a.
2. stream off policy pipeline:
   1. **trigger_parameter_sync_step>1, staleness_threshold=0**
   2. Synchronous streaming training. Rollouter produces
      `require_batches * ppo_mini_batch_size * trigger_parameter_sync_step` samples at once, the Trainer performs a
      local training step every time it fetches `require_batches * ppo_mini_batch_size` samples, and after training
      trigger_parameter_sync_step times, Trainer and Rollouter perform a parameter synchronization.
   3. Compared to a, more samples are generated at once, so resource idleness is lower.
   4. Within one step of training there are still two periods of resource idleness: while fetching the first batch of
      samples, the Trainer waits for `require_batches * ppo_mini_batch_size` samples to be produced, and during the
      last parameter update, rollout waits for training to complete.
   5. As shown in figure b.
3. async stream pipeline with stale samples:
   1. **trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=False**
   2. After each parameter update, Rollouter plans to produce at most rollout_num samples (in practice, the number
      generated may be smaller, depending on rollout speed).
   3. If rollout is relatively fast, Rollouter generates some additional samples (num_stale_samples) before parameter
      synchronization, for immediate use by the Trainer after synchronization.
      When parameter synchronization is triggered, if Rollouter has in-flight requests, it waits for them to complete
      and does not add new ones.
   4. Compared to b, all steps after the first no longer wait for the first batch of rollouts to finish, but they do
      wait for in-flight requests to finish.
   5. As shown in figure c.
4. async stream pipeline with partial rollout:
   1. **trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=True**
   2. Compared to c, when parameter synchronization is triggered while Rollouter still has samples in progress, the
      rollout is interrupted, parameters are synchronized, and the interrupted samples continue generating after
      synchronization. This reduces the time spent waiting for in-flight requests to finish.
   3. As shown in figure d.
![fully_async_policy_mode](
https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_mode.svg?raw=true)
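The four modes above differ in three `async_training` settings. The mapping below is a sketch: the mode keys are just
the labels used in this document, and the concrete values are illustrative rather than recommended defaults.

```python
# Illustrative parameter combinations for the four modes described above.
modes = {
    "on_policy_pipeline": dict(
        trigger_parameter_sync_step=1, staleness_threshold=0, partial_rollout=False),
    "stream_off_policy_pipeline": dict(
        trigger_parameter_sync_step=16, staleness_threshold=0, partial_rollout=False),
    "async_stream_with_stale_samples": dict(
        trigger_parameter_sync_step=4, staleness_threshold=0.5, partial_rollout=False),
    "async_stream_with_partial_rollout": dict(
        trigger_parameter_sync_step=4, staleness_threshold=0.5, partial_rollout=True),
}

# Turn a mode into command-line overrides for fully_async_main.
chosen = modes["async_stream_with_partial_rollout"]
print(" ".join(f"async_training.{k}={v}" for k, v in chosen.items()))
# async_training.trigger_parameter_sync_step=4 async_training.staleness_threshold=0.5 async_training.partial_rollout=True
```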
### Key Metrics
| metric | meaning |
|------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| `trainer/idle_ratio` | Trainer idle rate |
| `rollouter/idle_ratio` | Rollouter idle rate |
| `fully_async/count/stale_samples_processed` | Total number of old samples used in training |
| `fully_async/count/stale_trajectory_processed` | Total number of old trajectories used in training (one sample produces rollout.n trajectories) |
| `fully_async/partial/total_partial_num` | Number of partial samples processed by the Trainer between two parameter synchronizations |
| `fully_async/partial/partial_ratio` | Ratio of partial samples processed by the Trainer between two parameter synchronizations |
| `fully_async/partial/max_partial_span` | Maximum parameter-version span of partial samples processed by the Trainer between two parameter synchronizations |
### Parameter Tuning Recommendations
* Resource Allocation and Adjustment:
  * Reasonable resource allocation is the prerequisite for good training efficiency. Ideally, rollout time and train
    time should be close, which minimizes pipeline bubbles, avoids idle resources, and keeps the Trainer from consuming
    stale samples. In real training, the allocation can be adjusted based on the idle time of rollout and train,
    reported as rollouter/idle_ratio and trainer/idle_ratio. If rollouter/idle_ratio is high and trainer/idle_ratio is
    low, move resources from Rollouter to Trainer, and vice versa (a small illustrative helper is sketched at the end
    of this section).
* Key Parameters:
  * staleness_threshold: setting it too high causes more stale samples to be used, hurting model performance. A value
    below 1 is recommended.
  * require_batches: the closer to 1, the closer to a pure streaming process, the smaller the training bubbles, and the
    larger the speedup, but it affects the order in which samples are processed.
  * trigger_parameter_sync_step: smaller values are closer to on-policy training but cause frequent parameter
    synchronization, where long-tail samples leave idle resources that short samples cannot fill, lowering resource
    utilization. Larger values improve computational efficiency, but accuracy suffers from being more off-policy.
  * rollout.test_freq: validation occupies Rollouter resources, so it should not be set too small.
* Mode Selection: by adjusting these parameters, the fully async architecture supports different levels of
  acceleration, suitable for different scenarios.
  * For small-scale tasks that must preserve training stability and on-policy behavior and have low speed requirements,
    try the on policy pipeline mode (Mode 1).
  * For scenarios that need higher training throughput but are sensitive to staleness, try the stream off policy
    pipeline mode: set trigger_parameter_sync_step>1 to improve training efficiency while keeping the synchronous
    mechanism (staleness_threshold=0) (Mode 2).
  * For large-scale tasks with high training-speed requirements that can tolerate a certain degree of off-policy data
    and staleness, set staleness_threshold>0 and partial_rollout=True to improve training efficiency, i.e., the async
    stream pipeline modes (Mode 3 or 4).
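As a rough sketch of the resource-rebalancing rule of thumb above, based on the two idle-ratio metrics (the helper
function and its threshold are hypothetical, not part of the recipe):

```python
def suggest_rebalance(trainer_idle_ratio: float, rollouter_idle_ratio: float,
                      tolerance: float = 0.1) -> str:
    """Compare the two idle ratios reported in the metrics and suggest a direction.

    The tolerance is arbitrary; in practice you would shift whole nodes
    (trainer.nnodes / rollout.nnodes) and re-measure the idle ratios.
    """
    gap = rollouter_idle_ratio - trainer_idle_ratio
    if gap > tolerance:
        return "Rollouter is mostly idle: move GPUs from Rollouter to Trainer."
    if gap < -tolerance:
        return "Trainer is mostly idle: move GPUs from Trainer to Rollouter."
    return "Idle ratios are balanced: keep the current split."


print(suggest_rebalance(trainer_idle_ratio=0.05, rollouter_idle_ratio=0.40))
```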
### Quick Start
```shell
rollout_mode="async"
rollout_name="vllm" # currently this recipe supports vllm (server mode via AgentLoop); sglang support is planned
if [ "$rollout_mode" = "async" ]; then
export VLLM_USE_V1=1
return_raw_chat="True"
fi
train_prompt_bsz=0
gen_prompt_bsz=1
n_resp_per_prompt=16
train_prompt_mini_bsz=32
use_dynamic_bsz=True  # referenced by the dynamic-bsz overrides below
total_rollout_steps=$((512 * 400))
test_freq=10
staleness_threshold=0
trigger_parameter_sync_step=16
partial_rollout=False
python -m recipe.fully_async_policy.fully_async_main \
data.train_batch_size=${train_prompt_bsz} \
data.gen_batch_size=${gen_prompt_bsz} \
data.return_raw_chat=${return_raw_chat} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.strategy=fsdp2 \
critic.strategy=fsdp2 \
actor_rollout_ref.hybrid_engine=False \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.name=${rollout_name} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
actor_rollout_ref.rollout.calculate_log_probs=True \
trainer.nnodes="${NNODES_TRAIN}" \
trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
rollout.nnodes="${NNODES_ROLLOUT}" \
rollout.n_gpus_per_node="${NGPUS_PER_NODE}" \
rollout.total_rollout_steps="${total_rollout_steps}" \
rollout.test_freq="${test_freq}" \
async_training.staleness_threshold="${staleness_threshold}" \
async_training.trigger_parameter_sync_step="${trigger_parameter_sync_step}" \
async_training.partial_rollout="${partial_rollout}"
```
## Experiments
### Asynchronous Training on 7B Model
We used Qwen2.5-Math-7B to verify the benefits of the fully async strategy with long responses at different resource
scales. Using the `async stream pipeline with stale samples` strategy, we achieved roughly a 2x speedup at 32, 64, and
128 GPUs without significantly affecting experimental results.
* Machine: H20
* Model: Qwen2.5-Math-7B
* Rollout length: max_response_length: 28K tokens
* Algorithm: DAPO
* Dataset: TRAIN_FILE: dapo-math-17k.parquet TEST_FILE: aime-2024.parquet
* Engine: vllm+FSDP2
* rollout.n: 16
* ppo_mini_batch_size: 32
* test_freq: 20
* colocate sync:
* step: 400
* train_batch_size: 512
* fully_async_policy
* total_rollout_steps: 512*400
* require_batches: 4
* trigger_parameter_sync_step: 4
* staleness_threshold: 0.3
* partial_rollout: True
| training mode | resource allocation | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:--------------------:|:---------------------:|:--------:|:--------:|:--------------:|:--------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------------:|
| colocate sync | 32 | 790.10 | 357.41 | 107.71 | 313.81 | 13h 44m | 1d 3h 43m | 2d 9h 22m | 3d 17h 5m | max: 0.3313<br>last: 0.2448 |
| fully_async_policy | 16:16 | | | \ | | | | | | max: <br>last: |
| colocate sync | 64 | 365.28 | 150.72 | 70.26 | 133.41 | 10h 22m | 20h 45m | 1d 7h 6m | 1d 17h 32m | max: 0.3365<br>last: 0.2333 |
| fully_async_policy | 32:32 | 189.26 | 28.46 | \ | 156.98 | 4h 57m<br>(2.09x) | 10h 14m<br>(2.03x) | 16h 58m<br>(1.83x) | 21h 40m<br>(1.92x) | max: 0.3677<br>last: 0.3406 |
| colocate sync | 128 | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | max: 0.3573<br>last: 0.2958 |
| fully_async_policy | 64:64 | 150.63 | 33.14 | \ | 113.16 | 3h 13m<br>(2.67x) | 6h 46m<br>(2.65x) | 10h 53m<br>(2.67x) | 17h 22m<br>(2.35x) | max: 0.3521<br>last: 0.3094 |
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-colocate_async?nw=nwuserhouzg
### 128-card 7B Asynchronous Mode Experiment
We used Qwen2.5-Math-7B to verify the effect of the various modes supported by fully async.
Streaming alone brings roughly a 1.6x end-to-end speedup over colocate (about a 0.6x gain), and after adding staleness
and partial_rollout the speedup reaches 2.35x.
| mode | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:-------------------------------------------------------------------------------------------------------:|:---------------------:|:--------:|:--------------:|:--------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:-----------------------------:|
| colocate sync | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | max: 0.3573<br>last: 0.2958 |
| `stream off policy pipeline`<br>(+fully async: trigger_parameter_sync_step= 4,<br>require_batches= 4) | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844<br>last: 0.2604 |
| `async stream pipeline with stale samples`<br>(+staleness_threshold=0.5) | | | | | | | | | |
| `async stream pipeline with partial rollout`<br>(+partial_rollout=True) | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521<br>last: 0.3094 |
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg
### 128-card Stale Ablation Experiment
Under the `async stream pipeline with partial rollout` mode, we verified the impact of staleness settings on training
efficiency.
We found that the larger the staleness, the larger the end-to-end gains.
We also noticed that the times for staleness 0.3 and 0.5 are very close: as training proceeds, the response length
changes significantly and training becomes unstable.
This issue needs further analysis and optimization.
| staleness_threshold | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
|:---------------------:|:--------:|:--------:|:--------------:|:--------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:-----------------------------:|
| 0 | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844<br>last: 0.2604 |
| 0.1 | 171.30 | 58.17 | \ | 109.12 | 3h 53m | 8h 37m | 14h 25m | 19h 59m | max: 0.3542<br>last: 0.2979 |
| 0.3 | 146.11 | 38.88 | \ | 103.22 | 3h 18m | 6h 49m | 11h 40m | 17h 20m | max: 0.3469<br>last: 0.2865 |
| 0.5 | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521<br>last: 0.3094 |
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg
### 128-card 7B require_batches Ablation Experiment
In multiple tests, we found that the number of samples issued each time in streaming affects the response length during
training, which in turn affects training time. We verified the impact on results by modifying
`async_training.require_batches`.
| require_batches | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | acc/mean@1 |
|:-----------------:|:--------:|:-------:|:--------------:|:--------------:|:------------------------:|:------------------------:|:------------------------:|:-----------------------------:|
| 1 | 203.47 | 30.88 | \ | 181.08 | 3h 31m | 8h 29m | 17h 36m | max: 0.349<br>last: 0.326 |
| 2 | 158.72 | 26.32 | \ | 128.08 | 3h 35m | 7h 38m | 13h 57m | max: 0.351<br>last: 0.3406 |
| 4 | 124.64 | 25.62 | \ | 95.06 | 3h 13m | 6h 46m | 10h 53m | max: 0.3521<br>last: 0.3521 |
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-ablation_require_batches?nw=nwuserhouzg
### 30B Model Mode Experiment
TODO: The 30B experiment is still in progress.
* Machine: H20
* Model: Qwen2.5-32B
* Rollout length: max_response_length: 20K tokens
* Algorithm: DAPO
* Engine: vllm+FSDP2
* rollout.n: 16
* ppo_mini_batch_size: 32
* test_freq: 20
* colocate sync:
* step:200
* train_batch_size: 512
* fully_async_policy
* total_rollout_steps: 512*200
* trigger_parameter_sync_step: 512/32 = 16
* staleness_threshold: 0
* partial_rollout: False
| training mode | Resource allocation | mode | step | generate_sequences | old_log_prob | update_actor | total time | acc/best@32/mean |
|--------------------|---------------------|--------------------------------------------|------|--------------------|--------------|--------------|------------|------------------|
| colocate sync | 128 | | | | | | | |
| fully_async_policy | 64:64 | stream off policy pipeline | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with stale samples | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with partial rollout | | | | | | |
## Future Plans
* GRPO experiments
* Megatron adaptation
* SGLang integration
* Transfer queue integration
* Asynchronous parameter synchronization
* AReaL asynchronous algorithm implementation
* TPPO algorithm implementation
* Multi-turn and Tool support

@@ -124,6 +124,7 @@ verl is fast with:
advance/rollout_is_migration.md
advance/one_step_off
advance/agent_loop
advance/fully_async
.. toctree::
:maxdepth: 1