a31a8f251f
[doc] fix: quickstart example can't work on zsh ( #2509 )
...
### What does this PR do?
I followed the instructions at
https://verl.readthedocs.io/en/latest/start/quickstart.html to run the
PPO example on my devbox, which uses zsh. However, I got the error
`zsh: no matches found: trainer.logger=[console]`, because zsh interprets
`[...]` as a glob pattern.
```
(verl) ➜ verl git:(20250713-devbox-2-tmux0-verl-2) ✗ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['console'] \
trainer.val_before_train=False \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
zsh: no matches found: trainer.logger=[console]
```
This PR has 3 changes:
* `trainer.logger=['console']` -> `trainer.logger=console`
* `trainer.logger=['console','wandb']` ->
`trainer.logger='["console","wandb"]'`
* `trainer.logger=['console','tensorboard']` ->
`trainer.logger='["console","tensorboard"]'`
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
* `trainer.logger=console` (zsh)
<img width="898" height="564" alt="image"
src="https://github.com/user-attachments/assets/a957a493-75e6-462b-9974-6b1c4cdf5a80 "
/>
* ``trainer.logger='["console","wandb"]'`` (zsh)
<img width="870" height="565" alt="image"
src="https://github.com/user-attachments/assets/e20613bf-2ccc-4653-b23f-90edc3d568d1 "
/>
* `trainer.logger=console` (bash)
```bash
ubuntu@ip-xxx-xx-x-xxx:~/verl$ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
> data.train_files=$HOME/data/gsm8k/train.parquet \
> data.val_files=$HOME/data/gsm8k/test.parquet \
> data.train_batch_size=256 \
> data.max_prompt_length=512 \
> data.max_response_length=256 \
> actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
> actor_rollout_ref.actor.optim.lr=1e-6 \
> actor_rollout_ref.actor.ppo_mini_batch_size=64 \
> actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
> actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
> actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
> actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
> actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
> critic.optim.lr=1e-5 \
> critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
> critic.ppo_micro_batch_size_per_gpu=4 \
> algorithm.kl_ctrl.kl_coef=0.001 \
> trainer.logger=console \
> trainer.val_before_train=False \
> trainer.n_gpus_per_node=1 \
> trainer.nnodes=1 \
> trainer.save_freq=10 \
> trainer.test_freq=10 \
> trainer.total_epochs=15 2>&1 | tee verl_demo.log
2025-07-14 02:52:27,669 INFO worker.py:1908 -- Started a local Ray
instance. View the dashboard at 127.0.0.1:8265
(TaskRunner pid=1799248) TaskRunner hostname: ip-172-31-9-244, PID:
1799248
(TaskRunner pid=1799248) {'actor_rollout_ref': {'actor': {'checkpoint':
{'load_contents': ['model',
(TaskRunner pid=1799248) 'optimizer',
(TaskRunner pid=1799248) 'extra'],
(TaskRunner pid=1799248) 'save_contents': ['model',
(TaskRunner pid=1799248) 'optimizer',
(TaskRunner pid=1799248) 'extra']},
```
* `trainer.logger='["console","wandb"]'` (bash)
```bash
ubuntu@ip-xxx-xx-x-xxx:~/verl$ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
> data.train_files=$HOME/data/gsm8k/train.parquet \
> data.val_files=$HOME/data/gsm8k/test.parquet \
> data.train_batch_size=256 \
> data.max_prompt_length=512 \
> data.max_response_length=256 \
> actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
> actor_rollout_ref.actor.optim.lr=1e-6 \
> actor_rollout_ref.actor.ppo_mini_batch_size=64 \
> actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
> actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
> actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
> actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
> actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
> critic.optim.lr=1e-5 \
> critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
> critic.ppo_micro_batch_size_per_gpu=4 \
> algorithm.kl_ctrl.kl_coef=0.001 \
> trainer.logger='["console","wandb"]' \
> trainer.val_before_train=False \
> trainer.n_gpus_per_node=1 \
> trainer.nnodes=1 \
> trainer.save_freq=10 \
> trainer.test_freq=10 \
> trainer.total_epochs=15 2>&1 | tee verl_demo.log
2025-07-14 02:54:13,989 INFO worker.py:1908 -- Started a local Ray
instance. View the dashboard at 127.0.0.1:8265
(TaskRunner pid=1805000) TaskRunner hostname: ip-172-31-9-244, PID:
1805000
(TaskRunner pid=1805000) {'actor_rollout_ref': {'actor': {'checkpoint':
{'load_contents': ['model',
(TaskRunner pid=1805000) 'optimizer',
(TaskRunner pid=1805000) 'extra'],
(TaskRunner pid=1805000) 'save_contents': ['model',
(TaskRunner pid=1805000) 'optimizer',
(TaskRunner pid=1805000) 'extra']},
```
### API and Usage Example
No
### Design & Code Changes
No
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
> otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
---------
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2025-07-14 13:26:32 +08:00
52065c6405
[BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support ( #2257 )
...
### What does this PR do?
This PR removes support for vLLM versions 0.5.4 and 0.6.3 from the verl
repository, completing a comprehensive cleanup of legacy
version-specific code branches. The changes simplify the codebase by
eliminating conditional logic and version-specific implementations,
requiring users to upgrade to vLLM 0.7.0 or later (recommended: vLLM
0.8.3+).
**Key Changes:**
- Deleted legacy rollout implementations (`fire_vllm_rollout.py`,
`vllm_rollout.py`, `test_vllm_hf_loader.py`)
- Removed version-specific directories (`vllm_v_0_5_4`, `vllm_v_0_6_3`)
- Simplified sharding managers by removing `customized_vllm` flag
conditionals
- Updated configuration files to remove deprecated options
(`use_fire_sampling`)
- Cleaned up documentation and environment variable exports
### Checklist Before Starting
- [x] Search for similar PRs: No similar PRs found for this specific
cleanup
- [x] Format the PR title as `[BREAKING][vllm, rollout, worker]
refactor: Remove vLLM 0.5.4 and 0.6.3 support`
- Modules: `vllm`, `rollout`, `worker` (primary affected components)
- Type: `refactor` (code cleanup and simplification)
- Breaking: Yes, requires vLLM version upgrade
### Test
This PR has been validated through:
- **CI Pipeline**: All existing tests pass with vLLM 0.7.0+ (27 checks
pending/running)
- **Version Detection**: New version check logic properly rejects vLLM
0.5.4/0.6.3 with clear error messages
- **Merge Conflict Resolution**: Successfully resolved complex conflicts
during main branch merge
- **Pre-commit Checks**: All linting and formatting requirements
satisfied
### API and Usage Example
**Breaking Changes:**
- **vLLM Version Requirement**: Minimum supported version is now 0.7.0
(recommended: 0.8.3+)
- **Removed Configuration Options**: `use_fire_sampling` no longer
available in config files
- **Environment Variables**: `VLLM_ATTENTION_BACKEND=XFORMERS` exports
removed (not needed for vLLM 0.7.0+)
**Migration Guide:**
```bash
# Before: vLLM 0.5.4/0.6.3 with custom flags
pip install vllm==0.6.3
export VLLM_ATTENTION_BACKEND=XFORMERS
# After: vLLM 0.8.3+ with V1 API
pip install "vllm>=0.8.3"  # quoted so the shell does not treat ">=" as a redirection
export VLLM_USE_V1=1 # Recommended for optimal performance
```
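As a quick sanity check (not part of this PR), you can confirm which vLLM version is installed before launching training:
```bash
# Pre-flight check: verl now requires vLLM >= 0.7.0 (0.8.3+ recommended).
python3 -c 'import vllm; print(vllm.__version__)'
```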
**Updated Configuration:**
```yaml
# generation.yaml - removed use_fire_sampling option
rollout:
  name: vllm_rollout
  # use_fire_sampling: False  # <- REMOVED
  # Use standard vLLM rollout without legacy options
```
### High-Level Design
```mermaid
graph TB
    subgraph "Before: Multi-Version Support"
        A1[vLLM Version Check] --> B1{Version 0.5.4?}
        A1 --> B2{Version 0.6.3?}
        A1 --> B3{Version 0.7.0+?}
        B1 --> C1[Legacy vllm_v_0_5_4 Code]
        B2 --> C2[Legacy vllm_v_0_6_3 Code]
        B3 --> C3[Modern vLLM Code]
    end
    subgraph "After: Simplified Support"
        A2[vLLM Version Check] --> B4{Version >= 0.7.0?}
        B4 -->|Yes| C4[Modern vLLM Code Only]
        B4 -->|No| C5[Clear Error Message]
    end
```
### Specific Changes
**Deleted Files:**
- `verl/workers/rollout/vllm_rollout/fire_vllm_rollout.py`
- `verl/workers/rollout/vllm_rollout/vllm_rollout.py`
- `tests/workers/rollout/rollout_vllm/test_vllm_hf_loader.py`
- `verl/third_party/vllm/vllm_v_0_5_4/` (entire directory)
- `verl/third_party/vllm/vllm_v_0_6_3/` (entire directory)
- `pytest.ini`
**Modified Core Files:**
- `verl/third_party/vllm/__init__.py`: Simplified version detection with
clear error messages
- `verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py`: Removed
cache engine management and version conditionals
- `verl/workers/sharding_manager/fsdp_vllm.py`: Dropped
`customized_vllm` flag logic
- `verl/workers/sharding_manager/megatron_vllm.py`: Simplified weight
loading and cache management
**Configuration Updates:**
- `verl/trainer/config/generation.yaml`: Removed `use_fire_sampling`
option
- `verl/trainer/config/ppo_trainer.yaml`: Removed `use_fire_sampling`
option
- `tests/special_sanity/check_api_docs.py`: Removed `LLMEngine` from
whitelist
**Documentation Updates:**
- `docs/start/install.rst`: Updated to recommend vLLM 0.8.3+ with
`VLLM_USE_V1=1`
- `docs/perf/perf_tuning.rst`: Updated performance recommendations
- Removed 42+ `VLLM_ATTENTION_BACKEND=XFORMERS` exports from bash
scripts
**Reverted Changes:**
- `.github/workflows/vllm.yml`: Restored original container image names
- `docs/faq/faq.rst`: Restored original apptainer commands
- `docs/ascend_tutorial/ascend_quick_start.rst`: Reverted all
modifications
- `examples/tuning/*/`: Restored original `nproc_per_gpu` settings
### Checklist Before Submitting
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs): Updated install and performance tuning docs
- [x] Add unit or end-to-end test(s): Existing CI tests validate the
changes; legacy-specific tests were removed as intended
- [x] **CI Request**: Once PR is ready, message will be sent to
`ci-request` channel in verl Slack workspace
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2025-06-29 19:27:22 -07:00
072fc9feed
feat: support no reference model; fix KL issues ( #644 )
...
### Before getting started
The difference between the in-reward KL penalty and the KL loss:
> [!TIP]
>
> 1. In-reward KL penalty
>
>
> $$
> r_t = r_{\varphi}(q, o_{\leq t}) - \beta\ \boxed{\log \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\text{ref}}(o_t \mid q, o_{<t})}}
> $$
>
> 2. KL Loss
>
> $$
> L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left(\mathrm{ratio}_t A_t,\ \text{clip}(\mathrm{ratio}_t,\ 1 - \epsilon,\ 1 + \epsilon) A_t\right) \right] - \beta\ \boxed{D_{\text{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}})}
> $$
### Problems
1. The current code doesn't support running without a reference model
This feature has been half-implemented since the very first commit but was never
completed: e.g., `RayPPOTrainer` has a `use_reference_policy` attribute,
but it is always True because `role_worker_mapping` always contains
`Role.RefPolicy`.
2. Restriction of `use_kl_loss`
Currently, `use_kl_loss` determines whether to use the in-reward KL penalty
or the KL loss, so we cannot use **both or neither**.
87a813658f/verl/trainer/ppo/ray_trainer.py (L875-L879)
87a813658f/verl/workers/actor/dp_actor.py (L299-L307)
> [!CAUTION]
>
> ### You may have unintentionally adopted in-reward KL penalty
>
> For the experiments you've already run: if you set `actor.use_kl_loss=False`
> or didn't set it at all (the default is False), ***you unintentionally used the
> in-reward KL penalty.*** After this commit, if you don't want any KL at all, set
> `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False`
> (or simply leave them unset, since these are the defaults).
3. Deprecated config
After some investigation, I suspect the critic used to be responsible for
the in-reward KL, but this feature appears to be defunct.
1. Around line 290, there appear to have once been `config.algorithm.kl_ctrl.target_kl`
and `config.critic.kl_ctrl.horizon`, which are not currently supported.
3ec83117c3/verl/trainer/ppo/ray_trainer.py (L289-L293)
2. In `verl/workers/critic/megatron_critic.py`: `self.kl_ctrl` is set redundantly.
3b18b0eb74/verl/workers/critic/megatron_critic.py (L69-L73)
### What’s Changed?
1. Added support for running without a reference model.
2. Fixed the incomplete KL-controller code.
3. Added a test case that uses both KL terms.
4. Fixed some other miscellaneous issues in the code.
### How to disable the reference model
* Set `actor_rollout_ref.actor.use_kl_loss=False` and
`algorithm.use_kl_in_reward=False` (both default to False, so you can
simply leave them unset). Since the reference model is only needed to compute
the KL terms, disabling both means it is never instantiated; see the sketch below.
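A minimal sketch of the corresponding command-line overrides (a hypothetical invocation; the required data and model options from the quickstart commands earlier in this log are omitted for brevity):
```bash
# Sketch only (not taken verbatim from the PR): disable both KL terms so the
# reference-policy worker is never created. Other required overrides
# (data paths, model paths, batch sizes, ...) are omitted here.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.use_kl_loss=False \
    algorithm.use_kl_in_reward=False
```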
2025-04-01 10:14:38 +08:00