### Before getting started
The difference between the in-reward KL penalty and the KL loss:
> [!TIP]
>
> 1. In-reward KL penalty
>
>
> $$
> r_t = r_{\varphi}(q, o_{\leq t}) - \beta\ \boxed{\log \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\text{ref}}(o_t \mid q, o_{<t})}}
> $$
>
> 2. KL Loss
>
> $$
> L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left(\text{ratio}_t A_t,\ \text{clip}(\text{ratio}_t, 1 - \epsilon, 1 + \epsilon) A_t\right) \right] - \beta\ \boxed{D_{\text{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}})}
> $$
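To make the difference concrete, here is a minimal PyTorch sketch (not verl's actual implementation) of where each term enters: the in-reward penalty modifies the per-token reward before advantages are estimated, whereas the KL loss is added directly to the actor loss. The tensor shapes, the clipped surrogate, and the simple `logprob - ref_logprob` KL estimator are assumptions for illustration.

```python
import torch

def in_reward_kl_penalty(token_rewards, logprob, ref_logprob, beta):
    # 1. In-reward KL penalty: subtract beta * log(pi_theta / pi_ref) from the
    #    per-token reward *before* advantages are computed.
    #    Shapes are assumed to be [batch, response_len].
    return token_rewards - beta * (logprob - ref_logprob)

def ppo_loss_with_kl_loss(logprob, old_logprob, ref_logprob, advantages,
                          clip_ratio=0.2, beta=1e-3):
    # 2. KL loss: leave the reward untouched and add beta * KL(pi_theta || pi_ref)
    #    directly to the actor loss (the surrogate is negated because we minimize).
    ratio = torch.exp(logprob - old_logprob)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_loss = (logprob - ref_logprob).mean()  # simple k1 estimator of the KL
    return pg_loss + beta * kl_loss
```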
### Problems
1. The current code does not support training without a reference model
This feature has been half-implemented since the very first commit but was never completed; e.g., `RayPPOTrainer` has a `use_reference_policy` attribute, but it is always True because `role_worker_mapping` always contains `Role.RefPolicy`.
2. Restriction of `use_kl_loss`
Currently, `use_kl_loss` determines whether to use the in-reward KL penalty or the KL loss, so we cannot use **both or neither**.
87a813658f/verl/trainer/ppo/ray_trainer.py (L875-L879)
87a813658f/verl/workers/actor/dp_actor.py (L299-L307)
> [!CAUTION]
>
> ### You may have unintentionally adopted in-reward KL penalty
>
> For the experiments you have conducted so far: if you set `actor_rollout_ref.actor.use_kl_loss=False` or did not set it (the default is False), ***you unintentionally used the in-reward KL penalty.*** After this commit, if you do not want any KL term, set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (or simply leave them unset, since these are the default config values).
3. Deprecated config
After investigation, I suspect the Critic used to be responsible for the in-reward KL, but this feature appears to be broken.
1. At line 290, there probably used to be `config.algorithm.kl_ctrl.target_kl` and `config.critic.kl_ctrl.horizon`, which are not supported currently.
3ec83117c3/verl/trainer/ppo/ray_trainer.py (L289-L293)
2. In `verl/workers/critic/megatron_critic.py`: `self.kl_ctrl` is set redundantly.
3b18b0eb74/verl/workers/critic/megatron_critic.py (L69-L73)
### What’s Changed?
1. Add support for training without a reference model
2. Fix the incomplete code of the KL controller
3. Add a test case that uses both KL terms
4. Fix some other miscellaneous issues in the code
### How to disable the reference model
* Set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (both are False by default, so you can simply leave them unset); see the sketch below.
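As a rough sketch of the intended behavior (not verl's exact code), the trainer only needs a reference policy when at least one of the two KL terms is enabled; the helper name below is hypothetical.

```python
def needs_reference_policy(use_kl_in_reward: bool, use_kl_loss: bool) -> bool:
    # The reference model is only used to compute log pi_ref for a KL term,
    # so it can be skipped entirely when both flags are False.
    return use_kl_in_reward or use_kl_loss

# With the default config (both flags False) there is no KL term,
# and no reference model worker needs to be created.
assert needs_reference_policy(False, False) is False
```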
# Background
In `RLHFDataset`, we filter out prompts that are too long. This requires applying `apply_chat_template` to the whole dataset, which is not scalable when the dataset is large.
https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132
Instead of performing this filtering online, we probably want to move the process offline and either add an assertion to avoid truncation or simply perform truncation.
Reference: #502
# Key Changes
- Add an option `data.filter_overlong_prompts=True` to enable the above data filtering. The default value is False, but we enable it in all the example scripts.
- Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed `max_prompt_length`. The default is `error`, which does not allow `max_prompt_length` to be exceeded; users should increase `max_prompt_length` if this error is thrown. You can also set it to `left` or `right` (see the sketch below).
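A rough sketch of what these two options do, assuming a chat-format dataset and a tokenizer with `apply_chat_template`; the model name and data are placeholders, and this is not verl's `RLHFDataset` code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
max_prompt_length = 1024

dataset = [
    {"prompt": [{"role": "user", "content": "Solve 1 + 1."}]},
    # ... potentially millions of rows ...
]

def tokenize_prompt(messages):
    # apply_chat_template with tokenize=True (the default) returns token ids.
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# data.filter_overlong_prompts=True: tokenize every prompt up front and drop
# the overlong ones -- accurate, but expensive on very large datasets.
filtered = [ex for ex in dataset if len(tokenize_prompt(ex["prompt"])) <= max_prompt_length]

# data.truncation: what to do when a prompt still exceeds the budget.
def enforce_length(input_ids, truncation="error"):
    if len(input_ids) <= max_prompt_length:
        return input_ids
    if truncation == "error":
        raise ValueError("Prompt exceeds max_prompt_length; increase data.max_prompt_length.")
    return input_ids[-max_prompt_length:] if truncation == "left" else input_ids[:max_prompt_length]
```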
### Suggestion for large-scale datasets
For large-scale datasets, filtering overlong prompts can be time-consuming. You should set `data.filter_overlong_prompts=False` and set `data.truncation='left'`. Also, please note that you should increase `data.max_prompt_length` to avoid over-truncating the prompts.
Validation datasets are sent to the inference engines as a whole batch, and the engines schedule the memory themselves.
- [x] Remove `val_batch_size` from examples
- [x] Set default values of `val_batch_size` in configs as `null` and
add DEPRECATED comments
- [x] Add a deprecation warning about `val_batch_size` in `_validate_config` (sketched below)
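A rough sketch of such a warning; the function name, message, and config access style are assumptions, not the actual code in `_validate_config`.

```python
import warnings

def warn_deprecated_val_batch_size(data_config: dict) -> None:
    # val_batch_size is deprecated: validation data is sent to the inference
    # engine as one whole batch, so this setting no longer has any effect.
    if data_config.get("val_batch_size") is not None:
        warnings.warn(
            "data.val_batch_size is deprecated and ignored; validation datasets "
            "are sent to the inference engines as a whole batch.",
            DeprecationWarning,
        )
```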
- Fixed FSDP1 model offload
- With `actor_rollout_ref.actor.fsdp_config.param_offload=True` and `actor_rollout_ref.actor.fsdp_config.optimizer_offload=True`, the GPU memory utilization can be increased to 0.9.
- With actor, critic, and reference offload all enabled, there is only one model copy in GPU memory at a time. Therefore, we can further increase `micro_batch_size_per_gpu` or `max_token_per_gpu`.
**Specifically:**
- During rollout, only the rollout model and the KVCache are in GPU memory.
- While the critic computes values, only the critic model stays in GPU memory; its optimizer and other model states remain in CPU main memory.
- During the actor update, the actor model and its optimizer are stored on the GPU, while the reference model, the critic model, and the critic optimizer are offloaded to CPU (see the sketch below).
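Below is a minimal sketch of the offload pattern described above. The helper names are hypothetical and operate on a plain module; verl's actual FSDP offload utilities are more involved.

```python
import torch
import torch.nn as nn

def offload_model_to_cpu(model: nn.Module) -> None:
    # Move parameters to CPU between stages so the GPU copy is freed.
    for param in model.parameters():
        param.data = param.data.to("cpu", non_blocking=True)
    torch.cuda.empty_cache()

def load_model_to_gpu(model: nn.Module, device: str = "cuda") -> None:
    # Bring the parameters back right before this worker's stage runs.
    for param in model.parameters():
        param.data = param.data.to(device, non_blocking=True)

# Usage around the actor update, for example:
#   load_model_to_gpu(actor_module)      # actor weights on GPU for the update
#   ...run the PPO update...
#   offload_model_to_cpu(actor_module)   # free GPU memory for rollout / critic
```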
## Summary
This PR renames every `micro_batch_size` option to `micro_batch_size_per_gpu`.
**The core logic of setting batch sizes** (see the sketch after this list):
- **All algorithmic metrics** (train batch size, PPO mini batch size) are global (from the perspective of the single controller) and are normalized inside each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters that represent the data size per GPU (i.e., per Worker).
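A minimal sketch of that convention; the numbers and the `world_size` value are illustrative, not verl defaults.

```python
# Global, algorithm-level sizes: specified once from the single controller.
train_batch_size = 1024
ppo_mini_batch_size = 256

# Local, performance-only knob: already per GPU, so it is never normalized.
ppo_micro_batch_size_per_gpu = 4

world_size = 8  # number of GPUs, i.e. Workers

# Each Worker normalizes the global mini batch size by the world size...
ppo_mini_batch_size_per_worker = ppo_mini_batch_size // world_size  # 32

# ...and derives its gradient-accumulation steps from the per-GPU micro batch size.
grad_accum_steps = ppo_mini_batch_size_per_worker // ppo_micro_batch_size_per_gpu  # 8
```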
## Main Changes
1. Change the scripts and configs and delete the normalization of the micro batch size
2. Fix the CI for SFT