**This is an effort to unify the transformers monkey patch to support
Ulysses sequence parallelism for more models.**
### Basic idea
In the transformer architecture, all operations except attention are
token-wise, including RoPE, LayerNorm, MLP, etc., so we only need to patch
the attention function.
For now, Ulysses sequence parallelism relies on sequence packing and flash
attention, and transformers widely uses `_flash_attention_forward` in
each model's attention module, e.g. LlamaAttention and Qwen2Attention. So we
only need to add two all-to-all operations, one before and one after
`_flash_attention_forward` (see the sketch below).

- We introduce an additional all_gather of position_ids in each layer
because `prepare_fa2_from_position_ids` needs them. The all_gather
communication cost is `O(nnz)`, which should be negligible compared to
QKV; meanwhile, we also reduce the RoPE computation to 1/sp_size of the
original.
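
For illustration, here is a minimal sketch of the patched attention path, assuming standard PyTorch collectives; the helper names and tensor layouts are illustrative rather than verl's exact implementation (the real patch also handles position_ids and sequence packing):

```python
import torch
import torch.distributed as dist
from transformers.modeling_flash_attention_utils import _flash_attention_forward


def all_to_all_seq_to_head(x: torch.Tensor, sp_group) -> torch.Tensor:
    """(b, s/n, h, d) -> (b, s, h/n, d): gather the full sequence, scatter heads."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=2)]   # split along heads
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=1)                             # concat along sequence


def all_to_all_head_to_seq(x: torch.Tensor, sp_group) -> torch.Tensor:
    """(b, s, h/n, d) -> (b, s/n, h, d): inverse exchange after attention."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=1)]   # split along sequence
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=2)                             # concat along heads


def ulysses_flash_attention_forward(query, key, value, *args, sp_group=None, **kwargs):
    """Wrapper monkey-patched in place of `_flash_attention_forward`."""
    # Pre-attention all-to-all: each rank now holds the full sequence
    # but only 1/sp_size of the attention heads.
    query, key, value = (all_to_all_seq_to_head(t, sp_group) for t in (query, key, value))
    attn_output = _flash_attention_forward(query, key, value, *args, **kwargs)
    # Post-attention all-to-all: back to full heads, 1/sp_size of the sequence.
    return all_to_all_head_to_seq(attn_output, sp_group)
```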
### Correctness Verification
[run_qwen2-7b_seq_balance.sh](https://github.com/volcengine/verl/blob/main/examples/ppo_trainer/run_qwen2-7b_seq_balance.sh)
with `ulysses_sequence_parallel_size=2`
- red (baseline): main branch, transformers==4.47.1
- purple: dev branch, transformers==4.47.1
- green: dev branch, transformers==4.49.0

By unifying the monkey patch, we avoid per-model patches and achieve
better forward compatibility with transformers, avoiding issues like
#357 and #704.
We also remove `check_model_support_rmpad`: since we enforce
`attn_implementation="flash_attention_2"`, every model that supports
FlashAttention2 should support sequence packing.
- [x] unify LLM model patch
- [ ] clean llama/qwen attention patch
- [ ] support qwen2vl Ulysses SP
- [ ] unify VLM model patch with LLM model
# Intro
Support Megatron checkpointing for model, optimizer states, and RNG states,
with a new layer of abstraction, `MegatronCheckpointManager`, analogous to the FSDP one.
Also add checkpoint tests.
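
As a rough illustration of the abstraction, the manager bundles model, optimizer, and RNG state into a single save/load pair; the method names, arguments, and file layout below are assumptions rather than verl's exact API:

```python
import os
import random

import numpy as np
import torch


class MegatronCheckpointManager:
    """Saves and restores model weights, optimizer states, and RNG states."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def save_checkpoint(self, path: str, step: int) -> None:
        os.makedirs(path, exist_ok=True)
        state = {
            "step": step,
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            # RNG states so resumed runs stay reproducible.
            "rng": {
                "torch": torch.get_rng_state(),
                "cuda": torch.cuda.get_rng_state_all(),
                "numpy": np.random.get_state(),
                "python": random.getstate(),
            },
        }
        # Plain torch.save for now; the dist-checkpointing TODO below tracks replacing it.
        torch.save(state, os.path.join(path, "checkpoint.pt"))

    def load_checkpoint(self, path: str) -> int:
        state = torch.load(os.path.join(path, "checkpoint.pt"), map_location="cpu")
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])
        torch.set_rng_state(state["rng"]["torch"])
        torch.cuda.set_rng_state_all(state["rng"]["cuda"])
        np.random.set_state(state["rng"]["numpy"])
        random.setstate(state["rng"]["python"])
        return state["step"]
```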
# Involved Issues and PRs
This solves issues #682 and #605, incorporating PRs #510, #634, #368, and #330.
Thanks to @uygnef, @ShareLer, and @caaatch22 for their great efforts in these
contributions.
# TODOs
- [ ] Support the Megatron dist checkpointing mechanism; for now we use
torch.save/load to store/restore model weights.
- [x] Quick: also store the model in HF format.
---------
Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>
## What does this PR do?
Move the metric computation logic into `metric_utils`, avoiding
too many lines in the PPO trainer.
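
A rough sketch of the shape of the refactor; the function name and metric keys are illustrative and not necessarily verl's exact API:

```python
# Hypothetical excerpt of a metric_utils module.
import numpy as np


def compute_data_metrics(batch: dict) -> dict:
    """Aggregate per-batch statistics that used to be computed inline in the trainer."""
    return {
        "critic/rewards/mean": float(np.mean(batch["rewards"])),
        "response_length/mean": float(np.mean(batch["response_lengths"])),
    }


# In the PPO trainer's fit loop, the bulky inline logic collapses to:
#     metrics.update(compute_data_metrics(batch))
```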
## Who can review?
@vermouth1992 @PeterSH6
#556 made an effort to remove unnecessary `empty_cache` calls, but it causes
CUDA OOM at vLLM wake_up:
```text
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
with self.rollout_sharding_manager:
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
self.inference_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
self.llm_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
self.model_executor.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
self.collective_rpc("wake_up")
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
allocator.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
create_and_map(handle)
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```
This PR removes all redundant `torch.cuda.empty_cache()` calls in the FSDP worker
and only empties the cache before vLLM wake_up and after vLLM sleep, since
vLLM has its own caching memory allocator,
[CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
Outside the vLLM scope, we should avoid emptying the cache so that PyTorch can
use its caching allocator to speed up memory allocations.
- [x] Cleanup FSDP worker torch.cuda.empty_cache()
- [ ] Cleanup Megatron worker torch.cuda.empty_cache()
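
A minimal sketch of the intended pattern, assuming a context-manager style sharding manager; the class and attribute names are illustrative:

```python
import torch


class FSDPVLLMShardingManager:
    """Empties the CUDA cache only at the vLLM sleep/wake_up boundaries."""

    def __init__(self, inference_engine):
        self.inference_engine = inference_engine

    def __enter__(self):
        # Release PyTorch's cached blocks so CuMemAllocator can map vLLM's
        # weights and KV cache back onto the freed physical memory.
        torch.cuda.empty_cache()
        self.inference_engine.wake_up()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Put vLLM to sleep, empty the cache once, then let PyTorch's caching
        # allocator serve all subsequent training allocations without further
        # empty_cache() calls.
        self.inference_engine.sleep(level=1)
        torch.cuda.empty_cache()
```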
### Description
- Fix the `filter_overlong_prompts` setting in PRIME.
- Fix the incorrect padding side for Qwen in PRIME.
  - When I use the PRIME recipe to train Qwen-series models, I get
    "*ValueError: You are attempting to perform batched generation with
    padding_side='right' this may lead to unexpected behaviour for Flash
    Attention version of Qwen2. Make sure to call tokenizer.padding_side =
    'left' before tokenizing the input.*" So I set `use_cache=False` when
    calling the model to compute output logits.
- Fix a CUDA error with vLLM v0.6.3.
  - When I run PRIME, I may get *CUDA error: an illegal memory access was
    encountered*. Following https://github.com/vllm-project/vllm/issues/10389,
    I set `VLLM_ATTENTION_BACKEND=XFORMERS`. Both workarounds are sketched below.
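
A hedged sketch of the two workarounds as a standalone script; the model name and call site are placeholders, not PRIME's actual code:

```python
import os

# Must be set before vLLM (or anything that imports it) to avoid the
# illegal-memory-access error with vLLM v0.6.3 (vllm-project/vllm#10389).
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

batch = tokenizer(["hello", "a longer prompt"], return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    # use_cache=False: we only need logits, not incremental decoding, which
    # sidesteps the padding_side='right' warning path for Qwen2.
    logits = model(**batch, use_cache=False).logits
```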
Refactor and merge the PRIME algorithm (https://github.com/PRIME-RL/PRIME)
into verl main.
Breaking change:
`trainer.fsdp_config.min_num_params` has been moved to `trainer.fsdp_config.wrap_policy.min_num_params`.