[doc] feat: update fully async experiment message (#3804)

commit 85d5b2ee2e (parent b25bb7d4f3)
Author: arron
Date: 2025-10-18 06:20:01 +08:00
Committed by: GitHub
3 changed files with 12 additions and 12 deletions

View File

@@ -1,8 +1,8 @@
-# Recipe: Fully Async Policy Async Trainer
+# Recipe: Fully Async Policy Trainer
 
 **Author:** `https://github.com/meituan-search`
 
-Last updated: 10/17/2025.
+Last updated: 10/18/2025.
 
 This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
 supporting asynchronous sample generation and training.
@@ -22,7 +22,7 @@ However, it forcibly uses data from one round of asynchronous training, which is
 completely eliminate the impact of long-tail on training efficiency.
 In other frameworks such as AReaL, Magistral, StreamRL, and AsyncFlow, asynchronous training and streaming training have
 been implemented based on the separated architecture and have achieved gains.
-We借鉴 their methods and implemented them in VERL. The fully_async_policy supports asynchronous, streaming, and partial
+We borrow from their methods and implemented them in VERL. The fully_async_policy supports asynchronous, streaming, and partial
 rollout training.
 By reasonably setting parameters such as resource allocation and parameter synchronization frequency, fully_async_policy
 can significantly improve training efficiency.
@@ -324,7 +324,7 @@ Using the `async stream pipeline with stale samples` strategy, we achieved about
 * total_rollout_steps: 512*400
 * require_batches: 4
 * trigger_parameter_sync_step: 4
-* staleness_threshold: 0.3
+* staleness_threshold: 0.5
 * partial_rollout: True
 
 | training mode | resource allocation | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
@@ -390,7 +390,7 @@ training, which in turn affects training time. We verified the impact on results
 
 TODO: The 30B experiment is still in progress.
 * Machine: H20
-* Model: Qwen2.5-32B~~~~
+* Model: Qwen2.5-32B
 * Rollout length: max_response_length FSDP2: 20K tokens;
 * Algorithm: DAPO
 * Engine: vllm+FSDP2

View File

@@ -1,8 +1,8 @@
-# Recipe: Fully Async Policy Async Trainer
+# Recipe: Fully Async Policy Trainer
 
 **Author:** `https://github.com/meituan-search`
 
-Last updated: 10/17/2025.
+Last updated: 10/18/2025.
 
 This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
 supporting asynchronous sample generation and training.
@@ -22,7 +22,7 @@ However, it forcibly uses data from one round of asynchronous training, which is
 completely eliminate the impact of long-tail on training efficiency.
 In other frameworks such as AReaL, Magistral, StreamRL, and AsyncFlow, asynchronous training and streaming training have
 been implemented based on the separated architecture and have achieved gains.
-We借鉴 their methods and implemented them in VERL. The fully_async_policy supports asynchronous, streaming, and partial
+We borrow from their methods and implemented them in VERL. The fully_async_policy supports asynchronous, streaming, and partial
 rollout training.
 By reasonably setting parameters such as resource allocation and parameter synchronization frequency, fully_async_policy
 can significantly improve training efficiency.
@@ -324,7 +324,7 @@ Using the `async stream pipeline with stale samples` strategy, we achieved about
 * total_rollout_steps: 512*400
 * require_batches: 4
 * trigger_parameter_sync_step: 4
-* staleness_threshold: 0.3
+* staleness_threshold: 0.5
 * partial_rollout: True
 
 | training mode | resource allocation | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
@@ -390,7 +390,7 @@ training, which in turn affects training time. We verified the impact on results
 
 TODO: The 30B experiment is still in progress.
 * Machine: H20
-* Model: Qwen2.5-32B~~~~
+* Model: Qwen2.5-32B
 * Rollout length: max_response_length FSDP2: 20K tokens;
 * Algorithm: DAPO
 * Engine: vllm+FSDP2
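This second file carries the same edits, and like the first it sets partial_rollout: True. In systems of this family (AReaL is cited above), partial rollout typically means a parameter sync does not discard in-flight generations: the tokens decoded so far are kept, and decoding resumes under the new weights. A hedged sketch of that idea, with generate_fn and all other names hypothetical:

```python
from typing import Callable, List

def resume_partial_rollout(prompt: List[int],
                           partial_response: List[int],
                           generate_fn: Callable[[List[int]], List[int]],
                           ) -> List[int]:
    """Hypothetical partial-rollout resume: when a weight sync interrupts a
    long generation, keep the tokens decoded so far and continue decoding
    from them under the new policy, instead of regenerating from scratch."""
    continuation = generate_fn(prompt + partial_response)  # new-policy decode
    return partial_response + continuation
```

Resumed samples therefore mix tokens produced under more than one policy version, which is exactly what the staleness_threshold setting has to budget for.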

View File

@@ -1,4 +1,4 @@
-# Recipe: Fully Async Policy Async Trainer
+# Recipe: Fully Async Policy Trainer
 
 **Author:** `https://github.com/meituan-search`
 
@@ -273,7 +273,7 @@ python -m recipe.fully_async_policy.fully_async_main \
 * total_rollout_steps: 512*400
 * require_batches: 4
 * trigger_parameter_sync_step: 4
-* staleness_threshold: 0.3
+* staleness_threshold: 0.5
 * partial_rollout: True
 
 | training mode | resource allocation | step | gen | old_log_prob | update_actor | total time<br>100 step | total time<br>200 step | total time<br>300 step | total time<br>400 step | acc/mean@1 |
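The staleness_threshold bump from 0.3 to 0.5 is the one substantive config change shared by all three files. One plausible reading, hedged because the diff does not show the recipe's exact semantics, is that the threshold caps how far generation may run ahead of training; a sketch of that rule follows, all names hypothetical:

```python
def rollout_budget(samples_consumed: int,
                   samples_per_batch: int,
                   staleness_threshold: float) -> int:
    """Hypothetical staleness budget: the Rollouter may run ahead of the
    Trainer by at most `staleness_threshold` batches' worth of samples.
    Raising 0.3 -> 0.5 lets generation run further ahead, trading sample
    freshness for fewer rollout stalls at parameter-sync boundaries."""
    return samples_consumed + int(staleness_threshold * samples_per_batch)
```

Under this reading, a higher threshold admits more off-policy samples per update, which is consistent with the document's note that staleness settings affect results and training time.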