Recipe: Fully Async Policy Async Trainer

Author: https://github.com/meituan-search

Last updated: 10/17/2025.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter, supporting asynchronous sample generation and training. Under this system, we achieved a 2.35x-2.67x performance improvement when training the Qwen2.5-7B model with 128 GPUs, without significantly affecting the results.

Introduction

Background

Compared with the colocate architecture, a separated rollout/train architecture allows more flexible resource allocation and training logic, addressing issues such as low GPU utilization and poor training efficiency caused by long-tail samples. The one_step_off_policy recipe alleviates long rollout times and gains some training efficiency by adopting a separated architecture and running rollout and training asynchronously, offset by one step. However, it is restricted to data that is exactly one step stale, which is inflexible and cannot completely eliminate the impact of long-tail samples on training efficiency. Other frameworks such as AReaL, Magistral, StreamRL, and AsyncFlow implement asynchronous and streaming training on top of a separated architecture and report gains from it. We drew on their methods and implemented them in verl. The fully_async_policy recipe supports asynchronous, streaming, and partial-rollout training. With reasonable settings for resource allocation, parameter synchronization frequency, and related parameters, fully_async_policy can significantly improve training efficiency.

Magistral https://arxiv.org/abs/2506.10910

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning https://arxiv.org/abs/2505.24298

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation https://arxiv.org/abs/2504.15930

AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training https://arxiv.org/abs/2507.01663

Core Contributions

  • Resource Isolation: Unlike the hybrid_engine setup, the Rollouter and Trainer use separate computing resources, and each must be assigned its own resources explicitly.
  • Parallel Generation and Training: While the Trainer is training, the Rollouter keeps generating new samples.
  • Multi-step Asynchrony: Compared to one-step-off policy, the degree of asynchrony can range from a fraction of a step to multiple steps, making the asynchronous setup more flexible.
  • NCCL Parameter Synchronization: Uses NCCL communication primitives for parameter communication between the Rollouter and the Trainer.
  • Streaming Inference and Training: The Rollouter generates data sample by sample, and a single sample is the minimum unit of data transmission.
  • Asynchronous Training and Freshness Control: Via async_training.staleness_threshold, training can use samples generated by older parameter versions.
  • PartialRollout: The Rollouter's inference process supports partial rollout. During parameter synchronization, sleep() and resume() logic saves samples from in-flight rollouts and continues them in the next rollout, reducing the time spent waiting for in-flight requests to finish during parameter synchronization.

Currently, the supported combination is FSDP + vLLM, and vLLM must be used in server mode based on AgentLoop.

Design

The overall architecture of fully_async_policy is shown in the figure below. fully_async_policy mainly consists of four parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.

[Figure: overall architecture of fully_async_policy]

  1. Rollouter generates sequences sample by sample and puts the generated samples into the MessageQueue, with the production speed controlled by freshness.
  2. MessageQueue is used to temporarily store samples generated by Rollouter.
  3. Trainer fetches samples from MessageQueue sample by sample. After fetching require_batches*ppo_mini_batch_size samples, it will perform training. After training for async_training.trigger_parameter_sync_step rounds, it triggers a parameter synchronization with Rollouter.
  4. ParameterSynchronizer implements the NCCL synchronous parameter synchronization capability.
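
The data flow between these components can be pictured as a simple producer/consumer pipeline. The sketch below is illustrative only and does not use the actual verl classes or APIs; the function names, the plain Python queue, and the callables passed in are stand-ins for the real Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.

import queue

# MessageQueue stand-in: buffers individual samples between producer and consumer.
message_queue = queue.Queue(maxsize=1024)

def rollouter_loop(generate_sample, total_rollout_steps):
    # Rollouter stand-in: produce one sample at a time and push it into the queue.
    for _ in range(total_rollout_steps):
        message_queue.put(generate_sample())

def trainer_loop(train_step, sync_parameters, require_batches, ppo_mini_batch_size,
                 trigger_parameter_sync_step):
    # Trainer stand-in: pull require_batches * ppo_mini_batch_size samples, run one local
    # update, and trigger a parameter sync every trigger_parameter_sync_step local updates.
    local_updates = 0
    while True:
        batch = [message_queue.get() for _ in range(require_batches * ppo_mini_batch_size)]
        train_step(batch)
        local_updates += 1
        if local_updates % trigger_parameter_sync_step == 0:
            sync_parameters()  # ParameterSynchronizer stand-in: NCCL sync with the Rollouter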

The speedup over the baseline scheme comes from the following: in the colocate case, devoting more resources to rollout cannot eliminate the idle time caused by long-tail samples. After resource isolation, rollout and training may each take longer than before (since each uses fewer resources), but overlapping them reduces the end-to-end time.

[Figure: source of the speedup compared to the colocate baseline]

Usage

Parameter Description

| Parameter | Description |
| --- | --- |
| trainer.nnodes | Number of nodes for the Trainer |
| trainer.n_gpus_per_node | Number of GPUs per node for the Trainer |
| rollout.nnodes | Number of nodes for the Rollouter |
| rollout.n_gpus_per_node | Number of GPUs per node for the Rollouter |
| data.train_batch_size | Not used in the fully async strategy (default is 0) |
| data.gen_batch_size | In the fully async strategy, samples are produced in a streaming fashion (default is 1) |
| rollout.total_rollout_steps | Total number of rollout samples |
| rollout.test_freq | Number of Rollouter parameter updates between two validations |
| actor_rollout_ref.actor.ppo_mini_batch_size | Global mini-batch size across all workers/GPUs |
| async_training.require_batches | Number of mini-batches (of ppo_mini_batch_size samples each) that FullyAsyncTrainer fetches at once |
| async_training.trigger_parameter_sync_step | Number of local updates FullyAsyncTrainer performs before each parameter synchronization |
| async_training.staleness_threshold | Freshness control: maximum allowed proportion of stale samples |
| async_training.partial_rollout | Whether to enable partial rollout |
| async_training.use_rollout_log_probs | Use the log_probs produced by the rollout engine |

Further Explanation:

  • rollout.total_rollout_steps

    Compared to colocate, the total sample count can be aligned by multiplying the batch size by the number of steps: rollout.total_rollout_steps = data.train_batch_size * step.

  • async_training.trigger_parameter_sync_step

    In the fully async strategy, this indicates how many local updates the Trainer performs (i.e., how many times it fetches require_batches * ppo_mini_batch_size samples) before a parameter synchronization with the Rollouter. Between two consecutive parameter synchronizations, the Trainer processes trigger_parameter_sync_step * require_batches * ppo_mini_batch_size samples. To compare speed fairly with colocate, trigger_parameter_sync_step should be set to data.train_batch_size / (require_batches * ppo_mini_batch_size).

  • async_training.staleness_threshold

    In the fully async strategy, it indicates the maximum proportion of stale samples allowed to be used.

    • staleness_threshold=0 indicates synchronous training. The Rollouter generates a fixed number of samples between two parameter updates: rollout_num = trigger_parameter_sync_step * require_batches * ppo_mini_batch_size
    • staleness_threshold>0 indicates asynchronous training; it can be set to a decimal for more flexible asynchrony. The Rollouter generates at most the following number of samples between two parameter updates: rollout_num = (1 + staleness_threshold) * (trigger_parameter_sync_step * require_batches * ppo_mini_batch_size) - num_staleness_sample

    num_staleness_sample is the number of stale samples generated in excess during the previous rollout phase (see the worked example after this list).

    Since this is a streaming system, the Rollouter keeps producing while the Trainer keeps consuming. If the Rollouter is slower, the Trainer triggers parameter synchronization earlier and the Rollouter never actually produces rollout_num samples. When rollout is fast enough, setting staleness_threshold to 1 is roughly equivalent to the one_step_off policy. To avoid too many stale samples degrading training accuracy, it is recommended to keep this value below 1.

  • async_training.partial_rollout

    partial_rollout only actually takes effect when staleness_threshold>0.

  • async_training.use_rollout_log_probs

    In reinforcement learning algorithms, log_probs are implicitly tied to the parameter version and the sampled tokens. For algorithms such as PPO/GRPO/DAPO, the old_log_prob used in importance sampling must be the log_probs produced by the rollout parameters for the rollout tokens to keep the algorithm correct. In the fully async strategy, old_log_prob therefore defaults to being computed by the rollout engine rather than recomputed by the Trainer.

  • async_training.require_batches

    In streaming training, require_batches should be set to 1, meaning training starts as soon as ppo_mini_batch_size samples have been produced. In practice we found that issuing fewer samples at a time can, because of the order in which data is distributed, cause training instability and longer response lengths. We therefore expose require_batches to control how many mini-batches are streamed and trained on at once.
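
As a worked example of the quantities above, assume ppo_mini_batch_size=32, require_batches=4, trigger_parameter_sync_step=4, and staleness_threshold=0.3 (the values used in the 7B experiments below); the arithmetic itself is generic.

# Samples the Trainer consumes between two parameter synchronizations.
ppo_mini_batch_size = 32
require_batches = 4
trigger_parameter_sync_step = 4
samples_per_sync = trigger_parameter_sync_step * require_batches * ppo_mini_batch_size  # 512

# Upper bound on samples the Rollouter may generate between two parameter updates,
# assuming no stale samples were carried over from the previous rollout phase.
staleness_threshold = 0.3
num_staleness_sample = 0
rollout_num = (1 + staleness_threshold) * samples_per_sync - num_staleness_sample  # 665.6

# For a fair speed comparison against a colocate run with train_batch_size=512:
# trigger_parameter_sync_step = 512 / (require_batches * ppo_mini_batch_size) = 4.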

Supported Modes

  1. on policy pipeline:

    1. trigger_parameter_sync_step=1, staleness_threshold=0
    2. Rollouter produces require_batches*ppo_mini_batch_size samples at once, Trainer fetches these samples for training, and after training completes, Trainer and Rollouter perform a parameter synchronization;
    3. During the rollout phase, if long-tail samples appear while the number of in-flight samples is small, shorter samples cannot fill the idle resources, causing some resource waste.
    4. As shown in figure a;
  2. stream off policy pipeline:

    1. trigger_parameter_sync_step>1, staleness_threshold=0
    2. Synchronous streaming training will be performed. Rollouter produces require_batches*ppo_mini_batch_size*trigger_parameter_sync_step samples at once, Trainer performs a local training every time it fetches require_batches*ppo_mini_batch_size samples, and after training trigger_parameter_sync_step times, Trainer and Rollouter perform a parameter synchronization;
    3. Compared to a, since more samples are generated at once, resource idleness will be lower.
    4. Within one training step there are two periods of resource idleness: while fetching the first batch of samples, the Trainer waits for require_batches*ppo_mini_batch_size samples to be produced, and during the last parameter update, the Rollouter waits for training to complete.
    5. As shown in figure b;
  3. async stream pipeline with stale samples:

    1. trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=False
    2. After each parameter update, Rollouter will plan to produce at most rollout_num samples (in practice, the number of samples generated may be less than this value depending on rollout speed).
    3. If rollout is relatively fast, the Rollouter will generate some additional samples (num_staleness_sample) before parameter synchronization, for immediate use by the Trainer after synchronization. When parameter synchronization is triggered, if the Rollouter has in-flight requests, it waits for them to complete and does not add new ones;
    4. Compared to b, from the second step onward training no longer waits for the first batch of rollouts to finish, but it does wait for in-flight requests to finish.
    5. As shown in figure c;
  4. async stream pipeline with partial rollout:

    1. trigger_parameter_sync_step>=1, staleness_threshold>0, partial_rollout=True
    2. Compared to c, when triggering parameter synchronization, if Rollouter has samples being produced, it will interrupt the rollout process and perform parameter synchronization. The interrupted samples will continue to be generated after synchronization. This reduces the time to wait for active tasks to finish.
    3. As shown in figure d;

[Figure: the four pipeline modes (a-d)]
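
For quick reference, the parameter combinations that select each mode are summarized below. The mode names in this snippet are informal labels, and the concrete values 4 and 0.5 are taken from the 128-GPU experiments later in this document; any values satisfying the constraints listed above select the same mode.

# Informal summary of the four modes; values other than 0/1/True/False are examples, not requirements.
mode_configs = {
    "on_policy_pipeline":                dict(trigger_parameter_sync_step=1, staleness_threshold=0,   partial_rollout=False),
    "stream_off_policy_pipeline":        dict(trigger_parameter_sync_step=4, staleness_threshold=0,   partial_rollout=False),
    "async_stream_with_stale_samples":   dict(trigger_parameter_sync_step=4, staleness_threshold=0.5, partial_rollout=False),
    "async_stream_with_partial_rollout": dict(trigger_parameter_sync_step=4, staleness_threshold=0.5, partial_rollout=True),
}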

Key Metrics

| Metric | Description |
| --- | --- |
| trainer/idle_ratio | Trainer idle ratio |
| rollouter/idle_ratio | Rollouter idle ratio |
| fully_async/count/stale_samples_processed | Total number of stale samples used in training |
| fully_async/count/stale_trajectory_processed | Total number of stale trajectories used in training (one sample produces rollout.n trajectories) |
| fully_async/partial/total_partial_num | Number of partial samples processed by the Trainer between two parameter synchronizations |
| fully_async/partial/partial_ratio | Ratio of partial samples processed by the Trainer between two parameter synchronizations |
| fully_async/partial/max_partial_span | Maximum parameter-version span of partial samples processed by the Trainer between two parameter synchronizations |

Parameter Tuning Recommendations

  • Resource Allocation and Adjustment:

    • Reasonable resource allocation is the prerequisite for good training efficiency. Ideally, resources are allocated so that rollout time and training time are close, which minimizes pipeline bubbles across the whole run, avoids resource idleness, and keeps the Trainer from consuming stale samples. In real training, the allocation can be adjusted based on the measured idle time of each side, available as rollouter/idle_ratio and trainer/idle_ratio: if rollouter/idle_ratio is high and trainer/idle_ratio is low, increase Trainer resources and reduce Rollouter resources, and vice versa (see the sketch at the end of this section).
  • Key Parameters:

    • staleness_threshold: Setting it too high causes more stale samples to be used, hurting model performance. A value below 1 is recommended.
    • require_batches: The closer to 1, the closer to a pure streaming process, the smaller the training bubbles and the greater the speedup, but it affects the order in which samples are processed;
    • trigger_parameter_sync_step: The smaller the value, the closer to on-policy training, but parameter synchronization becomes frequent and long-tail samples leave idle resources that short samples cannot fill, lowering utilization. The larger the value, the higher the computational efficiency, but accuracy suffers from being more off-policy.
    • rollout.test_freq: Validation occupies Rollouter resources, so it is not recommended to set this too small.
  • Mode Selection: By adjusting different parameters, the Fully Async architecture supports optimization acceleration at different levels, suitable for tasks in different scenarios.

    • For small-scale tasks that need to ensure training stability and on-policy nature, and have low speed requirements, the on policy pipeline mode (Mode 1) can be tried.
    • For scenarios that need higher training throughput but are sensitive to staleness, try the stream off policy pipeline mode: set trigger_parameter_sync_step>1 to improve training efficiency while keeping the synchronous mechanism (staleness_threshold=0) (Mode 2).
    • For large-scale tasks with high speed requirements that can tolerate some off-policyness and staleness, set staleness_threshold>0 and partial_rollout=True to improve training efficiency, i.e., the async stream pipeline modes (Mode 3 or 4).
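
The resource-adjustment rule described above can be stated as a tiny heuristic. The function below is only a sketch of that rule (the function name and tolerance are arbitrary), not part of the recipe.

def suggest_resource_shift(rollouter_idle_ratio, trainer_idle_ratio, tolerance=0.05):
    # If the Rollouter idles more than the Trainer, the Trainer is the bottleneck: give it more GPUs.
    if rollouter_idle_ratio > trainer_idle_ratio + tolerance:
        return "increase Trainer resources, reduce Rollouter resources"
    # If the Trainer idles more, rollout is the bottleneck: give the Rollouter more GPUs.
    if trainer_idle_ratio > rollouter_idle_ratio + tolerance:
        return "increase Rollouter resources, reduce Trainer resources"
    return "allocation looks balanced"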

Quick Start

rollout_mode="async"
rollout_name="vllm" # sglang or vllm
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

train_prompt_bsz=0
gen_prompt_bsz=1
n_resp_per_prompt=16
train_prompt_mini_bsz=32
total_rollout_steps=$(((512*400)))
test_freq=10
staleness_threshold=0
trigger_parameter_sync_step=16
partial_rollout=False


python -m recipe.fully_async_policy.fully_async_main \
	train_batch_size=${train_prompt_bsz} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.return_raw_chat=${return_raw_chat} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.actor.strategy=fsdp2 \
    critic.strategy=fsdp2 \
    actor_rollout_ref.hybrid_engine=False \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.name=${rollout_name} \
    actor_rollout_ref.rollout.mode=${rollout_mode} \
    actor_rollout_ref.rollout.calculate_log_probs=True \
    trainer.nnodes="${NNODES_TRAIN}" \
    trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
    rollout.nnodes="${NNODES_ROLLOUT}" \
    rollout.n_gpus_per_node="${NGPUS_PER_NODE}" \
    rollout.total_rollout_steps="${total_rollout_steps}" \
    rollout.test_freq="${test_freq}" \
    async_training.staleness_threshold="${staleness_threshold}" \
    async_training.trigger_parameter_sync_step="${trigger_parameter_sync_step}" \
    async_training.partial_rollout="${partial_rollout}"
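
For reference, these settings match a colocate baseline with train_batch_size=512 run for 400 steps: with require_batches at 1 (it is not set in this snippet), trigger_parameter_sync_step = 512 / (1 * 32) = 16 and total_rollout_steps = 512 * 400, following the alignment formulas in the Parameter Description section.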

Experiments

Asynchronous Training on 7B Model

We used Qwen2.5-Math-7B to verify the benefits of the fully async strategy with long responses and across several resource scales. Using the async stream pipeline with stale samples strategy, we achieved roughly a 2x speedup with 32, 64, and 128 GPUs without significantly affecting the results.

  • Machine: H20

  • Model: Qwen2.5-Math-7B

  • Rollout length: max_response_length: 28K tokens

  • Algorithm: DAPO

  • Dataset: TRAIN_FILE: dapo-math-17k.parquet TEST_FILE: aime-2024.parquet

  • Engine: vllm+FSDP2

  • rollout.n: 16

  • ppo_mini_batch_size: 32

  • test_freq: 20

  • colocate sync:

    • step: 400
    • train_batch_size: 512
  • fully_async_policy

    • total_rollout_steps: 512*400
    • require_batches: 4
    • trigger_parameter_sync_step: 4
    • staleness_threshold: 0.3
    • partial_rollout: True
| training mode | resource allocation | step | gen | old_log_prob | update_actor | total time (100 steps) | total time (200 steps) | total time (300 steps) | total time (400 steps) | acc/mean@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| colocate sync | 32 | 790.10 | 357.41 | 107.71 | 313.81 | 13h 44m | 1d 3h 43m | 2d 9h 22m | 3d 17h 5m | max: 0.3313 / last: 0.2448 |
| fully_async_policy | 16:16 | | | \ | | | | | | |
| colocate sync | 64 | 365.28 | 150.72 | 70.26 | 133.41 | 10h 22m | 20h 45m | 1d 7h 6m | 1d 17h 32m | max: 0.3365 / last: 0.2333 |
| fully_async_policy | 32:32 | 189.26 | 28.46 | \ | 156.98 | 4h 57m (2.09x) | 10h 14m (2.03x) | 16h 58m (1.83x) | 21h 40m (1.92x) | max: 0.3677 / last: 0.3406 |
| colocate sync | 128 | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | max: 0.3573 / last: 0.2958 |
| fully_async_policy | 64:64 | 150.63 | 33.14 | \ | 113.16 | 3h 13m (2.67x) | 6h 46m (2.65x) | 10h 53m (2.67x) | 17h 22m (2.35x) | max: 0.3521 / last: 0.3094 |

source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-colocate_async?nw=nwuserhouzg

128-card 7B Asynchronous Mode Experiment

We used Qwen2.5-Math-7B to verify the various modes supported by fully async. Streaming alone brings roughly a 0.6x additional speedup, and combining staleness and partial_rollout raises the overall speedup to 2.35x.

| mode | step | gen | old_log_prob | update_actor | total time (100 steps) | total time (200 steps) | total time (300 steps) | total time (400 steps) | acc/mean@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| colocate sync (128 GPUs) | 356.30 | 177.85 | 53.92 | 113.81 | 8h 36m | 17h 56m | 1d 5h 6m | 1d 16h 48m | |
| stream off policy pipeline (+fully async: trigger_parameter_sync_step=4, require_batches=4) | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844 / last: 0.2604 |
| async stream pipeline with stale samples (+staleness_threshold=0.5) | | | | | | | | | |
| async stream pipeline with partial rollout (+partial_rollout=True) | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521 / last: 0.3094 |

source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg

128-card Stale Ablation Experiment

Under the async stream pipeline with partial rollout mode, we verified the impact of the staleness setting on training efficiency. The larger the staleness, the more pronounced the final gains. We also noticed that the total times for staleness 0.3 and 0.5 are quite close: as the number of training steps grows, the response length changes significantly and training becomes unstable. This issue needs further analysis and optimization.

| staleness_threshold | step | gen | old_log_prob | update_actor | total time (100 steps) | total time (200 steps) | total time (300 steps) | total time (400 steps) | acc/mean@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 231.34 | 128.47 | \ | 98.77 | 4h 25m | 9h 41m | 15h 2m | 1d 1h 53m | max: 0.2844 / last: 0.2604 |
| 0.1 | 171.30 | 58.17 | \ | 109.12 | 3h 53m | 8h 37m | 14h 25m | 19h 59m | max: 0.3542 / last: 0.2979 |
| 0.3 | 146.11 | 38.88 | \ | 103.22 | 3h 18m | 6h 49m | 11h 40m | 17h 20m | max: 0.3469 / last: 0.2865 |
| 0.5 | 150.63 | 33.14 | \ | 113.16 | 3h 13m | 6h 46m | 10h 53m | 17h 22m | max: 0.3521 / last: 0.3094 |

source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-stream_stale_partial?nw=nwuserhouzg

128-card 7B require_batches Ablation Experiment

In multiple tests, we found that the number of samples issued each time in streaming affects the response length during training, which in turn affects training time. We verified the impact on results by modifying async_training.require_batches.

| require_batches | step | gen | old_log_prob | update_actor | total time (100 steps) | total time (200 steps) | total time (300 steps) | acc/mean@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 203.47 | 30.88 | \ | 181.08 | 3h 31m | 8h 29m | 17h 36m | max: 0.349 / last: 0.326 |
| 2 | 158.72 | 26.32 | \ | 128.08 | 3h 35m | 7h 38m | 13h 57m | max: 0.351 / last: 0.3406 |
| 4 | 124.64 | 25.62 | \ | 95.06 | 3h 13m | 6h 46m | 10h 53m | max: 0.3521 / last: 0.3521 |

source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-ablation_require_batches?nw=nwuserhouzg

30B Model Mode Experiment

TODO: The 30B experiment is still in progress.

  • Machine: H20

  • Model: Qwen2.5-32B

  • Rollout length: max_response_length: 20K tokens

  • Algorithm: DAPO

  • Engine: vllm+FSDP2

  • rollout.n: 16

  • ppo_mini_batch_size: 32

  • test_freq: 20

  • colocate sync:

    • step:200
    • train_batch_size: 512
  • fully_async_policy

    • total_rollout_steps: 512*200
    • trigger_parameter_sync_step: 512/32 = 16
    • staleness_threshold: 0
    • partial_rollout: False
| training mode | resource allocation | mode | step | generate_sequences | old_log_prob | update_actor | total time | acc/best@32/mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| colocate sync | 128 | | | | | | | |
| fully_async_policy | 64:64 | stream off policy pipeline | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with stale samples | | | | | | |
| fully_async_policy | 64:64 | async stream pipeline with partial rollout | | | | | | |

Future Plans

  • GRPO experiments
  • Megatron adaptation
  • SGLang integration
  • Transfer queue integration
  • Asynchronous parameter synchronization
  • AReaL asynchronous algorithm implementation
  • TPPO algorithm implementation
  • Multi-turn and Tool support