mirror of https://github.com/volcengine/verl.git synced 2025-10-20 21:53:50 +08:00

Files

ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 141b1d3251 [recipe] fix: DAPO rewards using sandbox fusion (#2496)

### What does this PR do?

Fix some bugs/outdated code so that we can use sandbox fusion for DAPO.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm
> implementation, new model support), validate by experiment(s) and show
> results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

Use `load_reward_manager` in `verl.trainer.ppo.reward` instead of
duplicating the code there.

Also, set `acc` in `reward_extra_info` when the returned result is only
a float number (e.g. sandbox fusion).
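The second change can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `collect_reward_info` and the sample results are hypothetical, and it assumes a scoring function returns either a dict containing a `score` key or a bare float (as described above for sandbox fusion).

```python
from collections import defaultdict


def collect_reward_info(result, reward_extra_info):
    """Hypothetical helper: return the scalar score and record extra info.

    Assumes `result` is either a dict with a "score" key or a bare number.
    """
    if isinstance(result, dict):
        # Dict results carry their own extra fields; record all of them.
        score = result["score"]
        for key, value in result.items():
            reward_extra_info[key].append(value)
    else:
        # A bare float has no extra fields, so record it under `acc`
        # to keep that metric available downstream.
        score = float(result)
        reward_extra_info["acc"].append(score)
    return score


reward_extra_info = defaultdict(list)
sample_results = (1.0, {"score": 0.0, "acc": False})  # illustrative values only
scores = [collect_reward_info(r, reward_extra_info) for r in sample_results]
```

With this shape, code that reads `reward_extra_info["acc"]` finds an entry for every sample regardless of which kind of result the scorer returned.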

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

Signed-off-by: Hollow Man <hollowman@opensuse.org>

2025-07-14 20:10:48 +08:00

char_count

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

dapo

[recipe] fix: DAPO rewards using sandbox fusion (#2496)

2025-07-14 20:10:48 +08:00

entropy

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

genrm_remote

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

minicpmo

[env] feat: safely bump py version to 3.10 (#2421)

2025-07-12 16:29:39 -07:00

prime

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

[ci] refactor: reduce ruff line-length from 300 to 120 (#2287)

2025-07-01 09:54:40 +08:00

retool

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

spin

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

sppo

[doc] fix: quickstart example can't work on zsh (#2509)

2025-07-14 13:26:32 +08:00

README.md

[recipe, doc] fix: fix dapo branch name (#2090)

2025-06-19 09:35:05 +08:00

README.md

Recipe

The examples under recipes/ are representative extensions to verl for specific end-to-end RL training recipes. To help the community reproduce experiments, the verl team provides a snapshot of the codebase taken when each recipe was initially PR'ed to verl main. You can find these snapshots via GitHub branches.

Awesome work using verl

- Logic-RL: a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
- Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming
- all-hands/openhands-lm-32b-v0.1: A strong, open coding agent model, trained with multi-turn fine-tuning
- s3: Efficient Yet Effective Search Agent Training via RL
- Rec-R1: Bridging Generative Large Language Models and Recommendation Systems via Reinforcement Learning
- Explore RL Data Scaling: Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
- FIRE: Flaming-hot initiation with regular execution sampling for large language models
- DQO: Enhancing multi-Step reasoning abilities of language models through direct Q-function optimization
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- cognition-engineering: Test time scaling drives cognition engineering.
- Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning.
- AdaRFT: Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
- critic-rl: LLM critics for code generation
- self-rewarding-reasoning-LLM: self-rewarding and correction with generative reward models
- DeepEnlighten: Reproduce R1 with social reasoning tasks and analyze key findings
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
- PURE: Credit assignment is the key to successful reinforcement fine-tuning using process reward model
- cognitive-behaviors: Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- deepscaler: iterative context scaling with GRPO
- DAPO: the fully open source SOTA RL algorithm that beats DeepSeek-R1-zero-32B