59 Commits

0415b27cfd recipe: add reproducible PRIME baseline (#753)
Add an example PRIME script and a wandb log to the docs.
2025-03-25 17:29:46 -04:00
ab7b9414c3 refactor: unify ulysses flash attention patch to avoid single model patches (#735)
**This is an effort to unify the transformers monkey patch so that
Ulysses sequence parallelism is supported for more models.**

### Basic idea
In the transformer architecture, all operations except attention are
token-wise, including RoPE, LayerNorm, and MLP, so we only need to patch
the attention function.
For now, Ulysses sequence parallelism relies on sequence packing and flash
attention, and transformers widely uses `_flash_attention_forward` in
each model's attention module, e.g. LlamaAttention, Qwen2Attention. So we
just need to add two all-to-all operations, one before and one after
`_flash_attention_forward`.

![image](https://github.com/user-attachments/assets/2f7cac85-c65e-449f-8457-8bc88219f631)
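
To make the idea concrete, here is a minimal sketch of the two exchanges, assuming an SP process group `sp_group`, head counts divisible by the SP size, and the `(batch, seq, heads, head_dim)` layout that `_flash_attention_forward` uses; the helper names are illustrative, not verl's actual code.

```python
# Illustrative sketch of the unified patch idea, not verl's exact code.
# Assumes an SP process group `sp_group`, head counts divisible by sp_size,
# and Q/K/V in the (batch, seq, heads, head_dim) layout used by
# `_flash_attention_forward`.
import torch
import torch.distributed as dist


def all_to_all_seq_to_head(x: torch.Tensor, sp_group) -> torch.Tensor:
    """Exchange shards: (b, s/sp, h, d) -> (b, s, h/sp, d)."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=2)]   # split heads
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=1)                             # gather sequence


def all_to_all_head_to_seq(x: torch.Tensor, sp_group) -> torch.Tensor:
    """Inverse exchange: (b, s, h/sp, d) -> (b, s/sp, h, d)."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=1)]   # split sequence
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=2)                             # gather heads


def make_ulysses_flash_attention_forward(orig_forward, sp_group):
    """Wrap `_flash_attention_forward` with the two all-to-all exchanges."""
    def wrapped(query, key, value, *args, **kwargs):
        # each rank now sees the full sequence but only 1/sp of the heads
        query, key, value = (all_to_all_seq_to_head(t, sp_group)
                             for t in (query, key, value))
        attn_out = orig_forward(query, key, value, *args, **kwargs)
        # restore all heads on the local sequence shard
        return all_to_all_head_to_seq(attn_out, sp_group)
    return wrapped
```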

- We introduce an additional all_gather of position_ids in each layer
because `prepare_fa2_from_position_ids` needs them (see the sketch below).
The all_gather communication cost is `O(nnz)`, which should be negligible
compared to QKV; meanwhile, we also reduce the RoPE computation to
1/sp_size of the original.
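
A minimal sketch of that extra gather, assuming each rank holds its local slice of the packed position_ids; the helper name is illustrative:

```python
# Illustrative: all-gather the per-rank position_ids slices so that
# `prepare_fa2_from_position_ids` can see the full packed sequence.
import torch
import torch.distributed as dist


def gather_position_ids(local_position_ids: torch.Tensor, sp_group) -> torch.Tensor:
    """(1, s/sp) slices -> (1, s); O(nnz) int elements, cheap next to QKV."""
    sp_size = dist.get_world_size(sp_group)
    gathered = [torch.empty_like(local_position_ids) for _ in range(sp_size)]
    dist.all_gather(gathered, local_position_ids.contiguous(), group=sp_group)
    return torch.cat(gathered, dim=-1)
```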

### Correctness Verification

[run_qwen2-7b_seq_balance.sh](https://github.com/volcengine/verl/blob/main/examples/ppo_trainer/run_qwen2-7b_seq_balance.sh)
with `ulysses_sequence_parallel_size=2`
- red (baseline): main branch, transformers==4.47.1
- purple: dev branch, transformers==4.47.1
- green: dev branch, transformers==4.49.0

![image](https://github.com/user-attachments/assets/ee0f3f82-86c2-414d-a8b4-775b2a30a98a)


By unifying the monkey patch, we can avoid individual model patches and
achieve better forward compatibility with transformers, avoiding issues
like #357 #704.

Also remove `check_model_support_rmpad`: since we enforce
`attn_implementation="flash_attention_2"`, every model that supports
FlashAttention2 should support sequence packing.
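
For reference, the enforced setting corresponds to loading models via the standard transformers API as below; the model name is only a placeholder.

```python
# Standard transformers usage; the model name is a placeholder. With
# flash_attention_2 enforced, any model that loads this way is expected to
# handle packed (rmpad) sequences.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",                    # placeholder
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```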

- [x] unify LLM model patch
- [ ] clean up the llama/qwen attention patches
- [ ] support qwen2vl Ulysses SP
- [ ] unify the VLM model patch with the LLM one
2025-03-25 14:17:17 +08:00
5d0a7eaf6d [feat] Megatron checkpoint support for current Llama and Qwen models (#687)
# Intro

Support Megatron checkpointing for model weights, optimizer states, and RNG
states, with a new layer of abstraction, `MegatronCheckpointManager`,
analogous to the FSDP one. Also add checkpoint tests.
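
A rough sketch of what such a manager can look like, using plain torch.save/load per rank; the class and method names here are illustrative, not verl's actual API.

```python
# Hypothetical sketch; class, method, and file names are illustrative,
# not verl's actual MegatronCheckpointManager API.
import os
import torch


class CheckpointManagerSketch:
    def __init__(self, model, optimizer, lr_scheduler):
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler

    def save_checkpoint(self, path: str, step: int, rank: int) -> None:
        """Persist model weights, optimizer states, and RNG states per rank."""
        os.makedirs(path, exist_ok=True)
        state = {
            "step": step,
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            "lr_scheduler": self.lr_scheduler.state_dict(),
            "rng": {"cpu": torch.get_rng_state(),
                    "cuda": torch.cuda.get_rng_state_all()},
        }
        torch.save(state, os.path.join(path, f"model_optim_rng_rank{rank}.pt"))

    def load_checkpoint(self, path: str, rank: int) -> int:
        state = torch.load(os.path.join(path, f"model_optim_rng_rank{rank}.pt"),
                           map_location="cpu")
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])
        self.lr_scheduler.load_state_dict(state["lr_scheduler"])
        torch.set_rng_state(state["rng"]["cpu"])
        torch.cuda.set_rng_state_all(state["rng"]["cuda"])
        return state["step"]
```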

# Involved Issues and PRs

This solves issues #682 and #605, and incorporates PRs #510 #634 #368 #330.
Thanks to @uygnef, @ShareLer, and @caaatch22 for their great efforts on
these contributions.

# TODOs

- [ ] Support the Megatron dist checkpointing mechanism; for now we use
torch.save/load to store/restore model weights.
- [x] Quick: also store the model in HF format.

---------

Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>
2025-03-23 14:36:05 +08:00
6133ae9292 misc: separate metric utils from ppo trainer (#599)
## What does this PR do?

Move the metric-computation logic into `metric_utils`, so the PPO trainer
does not accumulate too many lines.
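
Roughly, the split looks like the sketch below: the trainer calls a helper and forwards the returned dict to its logger. The function name and metric keys are placeholders, not verl's actual API.

```python
# Illustrative only; the function name and metric keys are placeholders.
# The trainer calls a helper like this and forwards the dict to its logger,
# keeping metric bookkeeping out of the PPO trainer itself.
import numpy as np


def compute_batch_metrics(values_by_key: dict[str, np.ndarray]) -> dict[str, float]:
    metrics = {}
    for key, values in values_by_key.items():
        values = np.asarray(values, dtype=np.float64)
        metrics[f"{key}/mean"] = float(values.mean())
        metrics[f"{key}/max"] = float(values.max())
        metrics[f"{key}/min"] = float(values.min())
    return metrics
```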

## Who can review?

@vermouth1992 @PeterSH6
2025-03-15 14:43:46 +08:00
22657bade5 [config] feat: lr_warmup_steps (#564)
This PR adds the `lr_warmup_steps` configuration.

Note that `num_warmup_steps` takes priority over `lr_warmup_steps_ratio`.
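
A minimal sketch of that precedence rule: an explicit step count wins over the ratio of total training steps. The helper below is illustrative, not verl's actual resolution code.

```python
# Illustrative precedence logic, not verl's actual code: an explicit
# lr_warmup_steps (if set) wins over lr_warmup_steps_ratio.
from typing import Optional


def resolve_warmup_steps(total_training_steps: int,
                         lr_warmup_steps: Optional[int],
                         lr_warmup_steps_ratio: float) -> int:
    if lr_warmup_steps is not None and lr_warmup_steps >= 0:
        return lr_warmup_steps
    return int(lr_warmup_steps_ratio * total_training_steps)


# e.g. resolve_warmup_steps(1000, None, 0.1) -> 100
# while resolve_warmup_steps(1000, 50, 0.1)  -> 50
```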
2025-03-14 16:09:12 +08:00
3fc3e2b74a fix: remove redundant torch.cuda.empty_cache() (#575)
#556 made an effort to remove unnecessary empty_cache calls, but this
causes CUDA OOM at vllm wake_up:
```text
  File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
    with self.rollout_sharding_manager:
  File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
    self.inference_engine.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
    self.llm_engine.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
    self.model_executor.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
    self.collective_rpc("wake_up")
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
    allocator.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
    create_and_map(handle)
  File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
    python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```
This PR removes all redundant `torch.cuda.empty_cache()` calls in the FSDP
worker and only empties the cache before vllm wake_up and after vllm sleep,
since vllm has its own caching memory allocator,
[CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
Outside of the vllm scope, we should avoid emptying the cache so that
PyTorch can use its caching allocator to speed up memory allocations.
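
The intended pattern looks roughly like the sketch below; the class is a simplified stand-in for the actual sharding manager and assumes a vllm `LLM` engine that supports `wake_up()` and `sleep()`.

```python
# Simplified stand-in for the rollout sharding manager, not the actual verl
# class. The only point: empty_cache right before wake_up and right after
# sleep, and nowhere else.
import torch


class RolloutShardingManagerSketch:
    def __init__(self, inference_engine):
        self.inference_engine = inference_engine  # e.g. a vllm.LLM instance

    def __enter__(self):
        # release PyTorch's cached blocks so CuMemAllocator can map its pool
        torch.cuda.empty_cache()
        self.inference_engine.wake_up()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.inference_engine.sleep(level=1)
        # hand the freed memory back to training; no other empty_cache calls
        torch.cuda.empty_cache()
```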

- [x] Cleanup FSDP worker torch.cuda.empty_cache()
- [ ] Cleanup Megatron worker torch.cuda.empty_cache()
2025-03-13 16:31:34 +08:00
9bb02d2705 [bugfix] PRIME: filter overlong prompts & fix incorrect padding side & use xformers (#570)
### Description
- fix the filter_overlong_prompts setting in PRIME

- fix incorrect padding side for Qwen in PRIME (see the sketch after this list)

- When I use the PRIME recipe to train Qwen-series models, I get
“*ValueError: You are attempting to perform batched generation with
padding_side='right' this may lead to unexpected behaviour for Flash
Attention version of Qwen2. Make sure to call tokenizer.padding_side =
'left' before tokenizing the input.*” So I set `use_cache=False` when
calling the model to compute the output logits.

- fix a CUDA error with vllm v0.6.3

- When I run PRIME, I may hit *CUDA error: an illegal memory
access was encountered*. Following
https://github.com/vllm-project/vllm/issues/10389, I set
`VLLM_ATTENTION_BACKEND=XFORMERS`.
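
A minimal sketch of the padding-side workaround mentioned above, with placeholder names; the point is only the `use_cache=False` argument, which keeps the scoring pass away from the cache-related right-padding check.

```python
# Placeholder names; the point is passing use_cache=False so the Qwen2 +
# Flash Attention right-padding check tied to the KV cache is not triggered
# when we only need logits for scoring.
import torch


def compute_output_logits(model, input_ids: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    use_cache=False)
    return outputs.logits  # (batch, seq_len, vocab_size)
```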
2025-03-13 14:47:00 +08:00
b14299c814 update README.md (#534)
1. add [PRIME](https://arxiv.org/abs/2502.01456) to README.md
2. slightly change the example script to align with the paper
2025-03-11 19:23:38 +08:00
f0e7f9fcbe recipe: PRIME algorithm (#362)
Refactor and merge PRIME algorithm into verl/main
https://github.com/PRIME-RL/PRIME

Breaking changes:    
`trainer.fsdp_config.min_num_params` is now moved to `trainer.fsdp_config.wrap_policy.min_num_params`.
2025-03-10 11:31:43 -07:00