The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning.

🎉News

  • [2025/05/29] 🎉 Ranked #1 of the day on Hugging Face Daily Papers.
  • [2025/05/29] Released our paper on arXiv (arXiv:2505.22617). We provide insights into the entropy mechanism of RL for LLMs and propose two simple yet effective strategies to alleviate entropy collapse.

Getting started

After preparing the training data, you can train Qwen2.5-7B on a single node (taking the KL-Cov approach as an example) by running:

cd verl
conda activate your_env
bash recipe/entropy/7b_kl_cov.sh

To train Qwen2.5-32B on multiple nodes, run the following commands:

cd verl
conda activate your_env
bash recipe/entropy/32b_kl_cov.sh

📖Introduction

This paper addresses the entropy collapse issue in scaling reinforcement learning (RL) for large language models (LLMs), where policy entropy drops sharply during training, leading to overconfidence and performance saturation. We empirically establish a relationship between policy entropy (H) and downstream performance (R), R = -a·exp(H) + b, showing that performance is bottlenecked by entropy exhaustion: as entropy is consumed, performance rises toward the predictable ceiling R = b - a reached at H = 0.
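
Because the fitted law is linear in exp(H), its coefficients (and hence the ceiling at H = 0) can be recovered with an ordinary least-squares fit. A minimal sketch, using hypothetical (entropy, score) checkpoints rather than real training logs:

```python
import numpy as np

# Hypothetical (policy entropy, benchmark score) pairs logged across training steps.
H = np.array([0.90, 0.55, 0.35, 0.20, 0.12, 0.08])
R = np.array([0.22, 0.31, 0.36, 0.39, 0.41, 0.42])

# R = -a * exp(H) + b is linear in x = exp(H), so fit R = slope * x + intercept.
slope, intercept = np.polyfit(np.exp(H), R, deg=1)
a, b = -slope, intercept

print(f"fitted law:        R = -{a:.3f} * exp(H) + {b:.3f}")
print(f"predicted ceiling: R(H=0) = b - a = {b - a:.3f}")
```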

Theoretically, we find entropy changes are driven by the covariance between action probability and logit updates, which correlates with advantage in policy gradient methods. High-probability, high-advantage actions reduce entropy, while rare, high-advantage actions increase it. Empirically, the covariance term remains positive, explaining entropy's monotonic decline. To mitigate this, we propose Clip-Cov and KL-Cov, which restrict updates for high-covariance tokens. These methods effectively prevent entropy collapse and improve performance.
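
To make the mechanism concrete, below is a minimal sketch of the covariance signal and a KL-Cov-style correction. It is illustrative only (the function name, tensor shapes, and default ratios are assumptions, not the recipe's actual API), but it shows the idea of penalizing only the small fraction of tokens whose probability/advantage covariance is largest:

```python
import torch

def kl_cov_policy_loss(log_probs, old_log_probs, advantages,
                       k_ratio=0.002, kl_coef=1.0):
    """Illustrative KL-Cov-style loss over response tokens.

    log_probs, old_log_probs, advantages: (batch, seq_len) tensors.
    k_ratio: fraction of tokens (highest covariance) receiving the KL penalty.
    kl_coef: weight of that penalty.
    """
    # Standard importance-weighted policy-gradient surrogate, per token.
    ratio = torch.exp(log_probs - old_log_probs)
    pg_loss = -advantages * ratio

    with torch.no_grad():
        # Per-token covariance proxy between log-probability and advantage:
        # large values mark confident, high-advantage tokens whose updates
        # drive entropy down the fastest.
        cov = (log_probs - log_probs.mean()) * (advantages - advantages.mean())
        k = max(1, int(k_ratio * cov.numel()))
        mask = torch.zeros_like(cov).flatten()
        mask[torch.topk(cov.flatten(), k).indices] = 1.0
        mask = mask.view_as(cov)

    # KL-Cov: add a KL-style penalty toward the old policy, but only on the
    # top-covariance tokens, discouraging further sharpening there.
    penalty = kl_coef * mask * (log_probs - old_log_probs).abs()
    return (pg_loss + penalty).mean()
```

Roughly speaking, a Clip-Cov variant would instead detach the gradients of a small, randomly chosen subset of such high-covariance tokens rather than penalizing them.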

📃Evaluation

Our method is able to maintain a considerably higher level of entropy throughout training. For example, when the baseline's entropy reaches a plateau and can no longer be consumed, the KL-Cov method still sustains an entropy level over 10 times higher. Meanwhile, the response length of the policy model steadily increases, and its performance on the test set consistently surpasses that of the baseline. This indicates that our model is able to explore more freely during training, learning a better policy through RL.
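
The entropy tracked here is typically the average token-level entropy of the policy over sampled responses. A minimal sketch of how such a metric can be logged from per-token logits (names and shapes are illustrative, not the recipe's API):

```python
import torch

def mean_token_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average policy entropy over response tokens.

    logits: (batch, seq_len, vocab_size) pre-softmax scores.
    response_mask: (batch, seq_len) with 1.0 on response tokens, 0.0 elsewhere.
    """
    log_p = torch.log_softmax(logits, dim=-1)
    token_entropy = -(log_p.exp() * log_p).sum(dim=-1)  # (batch, seq_len)
    return (token_entropy * response_mask).sum() / response_mask.sum()
```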

| Method | AIME24 | AIME25 | AMC | MATH-500 | OMNI-MATH | OlympiadBench | Minerva | Avg. |
|---|---|---|---|---|---|---|---|---|
| **Qwen2.5-7B** | | | | | | | | |
| GRPO | 21.2 | 9.6 | 58.7 | 78.8 | 27.9 | 40.7 | 36.7 | 38.6 |
| w. Clip-higher | 18.1 | 11.5 | 56.6 | 79.2 | 29.8 | 43.3 | 40.4 | 38.8 |
| w. Clip-Cov | 22.1 | 15.8 | 58.2 | 80.4 | 30.5 | 44.1 | 41.1 | 40.4 |
| w. KL-Cov | 22.6 | 12.9 | 61.4 | 80.8 | 29.1 | 42.6 | 38.2 | 40.6 |
| **Qwen2.5-32B** | | | | | | | | |
| GRPO | 21.8 | 16.2 | 69.7 | 84.2 | 35.2 | 43.6 | 45.5 | 45.8 |
| w. Clip-higher | 35.6 | 22.3 | 69.5 | 77.2 | 35.1 | 42.5 | 43.0 | 47.2 |
| w. Clip-Cov | 32.3 | 22.7 | 67.2 | 87.0 | 42.0 | 57.2 | 46.0 | 50.3 |
| w. KL-Cov | 36.8 | 30.8 | 74.5 | 84.6 | 39.1 | 49.0 | 46.3 | 52.2 |

Both of our approaches achieve non-trivial improvements across all benchmarks, outperforming GRPO by 2.0% on average for the 7B model and by 6.4% for the 32B model. Moreover, the gains are more substantial on the larger Qwen2.5-32B: our method improves over GRPO by 15.0% and 14.6% on the most challenging benchmarks, AIME24 and AIME25, respectively.

🎈Citation

If you find this paper or repo helpful, please cite us.

@article{cui2025entropy,
  title={The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models},
  author={Cui, Ganqu and Zhang, Yuchen and Chen, Jiacheng and Yuan, Lifan and Wang, Zhi and Zuo, Yuxin and Li, Haozhan and Fan, Yuchen and Chen, Huayu and Chen, Weize and others},
  journal={arXiv preprint arXiv:2505.22617},
  year={2025}
}

🌻Acknowledgement

We implement our reinforcement learning algorithms by extending verl and use vLLM for inference. Our models are trained primarily on the Qwen2.5 family, and our training data is built from DAPO-MATH. Thanks to all of them for their great contributions!

📬 Contact

For questions, discussion, or collaboration opportunities, feel free to contact: