**This is an effort to unify the transformers monkey patch to support
Ulysses sequence parallelism for more models.**
### Basic idea
In the transformer architecture, all operations except attention are
token-wise, including RoPE, LayerNorm, MLP, etc., so we only need to patch
the attention function.
For now, Ulysses sequence parallelism relies on sequence packing and flash
attention, and transformers widely uses `_flash_attention_forward` in
each model's attention module, e.g. LlamaAttention and Qwen2Attention. So we
only need to add two all-to-all operations, one before and one after
`_flash_attention_forward` (see the sketch below).

- We introduce an additional all_gather of position_ids in each layer
because `prepare_fa2_from_position_ids` needs them. The all_gather
communication cost is `O(nnz)`, which should be negligible compared to
QKV; meanwhile, we also reduce the RoPE computation to 1/sp_size of the
original.
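
For illustration, here is a minimal sketch of the patched attention path, assuming standard PyTorch collectives; the helper names and tensor layouts are illustrative rather than verl's exact implementation (the real patch also handles position_ids and sequence packing):

```python
import torch
import torch.distributed as dist
from transformers.modeling_flash_attention_utils import _flash_attention_forward


def all_to_all_seq_to_head(x: torch.Tensor, sp_group) -> torch.Tensor:
    """(b, s/n, h, d) -> (b, s, h/n, d): gather the full sequence, scatter heads."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=2)]   # split along heads
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=1)                             # concat along sequence


def all_to_all_head_to_seq(x: torch.Tensor, sp_group) -> torch.Tensor:
    """(b, s, h/n, d) -> (b, s/n, h, d): inverse exchange after attention."""
    sp_size = dist.get_world_size(sp_group)
    inputs = [t.contiguous() for t in x.chunk(sp_size, dim=1)]   # split along sequence
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=sp_group)
    return torch.cat(outputs, dim=2)                             # concat along heads


def ulysses_flash_attention_forward(query, key, value, *args, sp_group=None, **kwargs):
    """Wrapper monkey-patched in place of `_flash_attention_forward`."""
    # Pre-attention all-to-all: each rank now holds the full sequence
    # but only 1/sp_size of the attention heads.
    query, key, value = (all_to_all_seq_to_head(t, sp_group) for t in (query, key, value))
    attn_output = _flash_attention_forward(query, key, value, *args, **kwargs)
    # Post-attention all-to-all: back to full heads, 1/sp_size of the sequence.
    return all_to_all_head_to_seq(attn_output, sp_group)
```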
### Correctness Verification
[run_qwen2-7b_seq_balance.sh](https://github.com/volcengine/verl/blob/main/examples/ppo_trainer/run_qwen2-7b_seq_balance.sh)
with `ulysses_sequence_parallel_size=2`
- red (baseline): main branch, transformers==4.47.1
- purple: dev branch, transformers==4.47.1
- green: dev branch, transformers==4.49.0

By unifying the monkey patch, we avoid per-model patches and achieve
better forward compatibility with transformers, avoiding issues like
#357 and #704.
We also remove `check_model_support_rmpad`: since we enforce
`attn_implementation="flash_attention_2"`, every model that supports
FlashAttention2 should support sequence packing.
- [x] unify LLM model patch
- [ ] clean llama/qwen attention patch
- [ ] support qwen2vl Ulysses SP
- [ ] unify VLM model patch with LLM model
# Intro
Support Megatron checkpointing for model, optimizer states, and RNG states,
with a new layer of abstraction, `MegatronCheckpointManager`, analogous to the FSDP one.
Also add checkpoint tests.
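
As a rough illustration of the abstraction, the manager bundles model, optimizer, and RNG state into a single save/load pair; the method names, arguments, and file layout below are assumptions rather than verl's exact API:

```python
import os
import random

import numpy as np
import torch


class MegatronCheckpointManager:
    """Saves and restores model weights, optimizer states, and RNG states."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def save_checkpoint(self, path: str, step: int) -> None:
        os.makedirs(path, exist_ok=True)
        state = {
            "step": step,
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
            # RNG states so resumed runs stay reproducible.
            "rng": {
                "torch": torch.get_rng_state(),
                "cuda": torch.cuda.get_rng_state_all(),
                "numpy": np.random.get_state(),
                "python": random.getstate(),
            },
        }
        # Plain torch.save for now; the dist-checkpointing TODO below tracks replacing it.
        torch.save(state, os.path.join(path, "checkpoint.pt"))

    def load_checkpoint(self, path: str) -> int:
        state = torch.load(os.path.join(path, "checkpoint.pt"), map_location="cpu")
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])
        torch.set_rng_state(state["rng"]["torch"])
        torch.cuda.set_rng_state_all(state["rng"]["cuda"])
        np.random.set_state(state["rng"]["numpy"])
        random.setstate(state["rng"]["python"])
        return state["step"]
```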
# Involved Issues and PRs
This solves issues #682 and #605, incorporating PRs #510, #634, #368, and #330.
Thanks to @uygnef, @ShareLer, and @caaatch22 for their great efforts in these
contributions.
# TODOs
- [ ] Support the Megatron dist checkpointing mechanism; for now we use
torch.save/load to store/restore model weights.
- [x] Quick: also store the model in HF format.
---------
Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>
## What does this PR do?
Move the metric computation logic into `metric_utils`, avoiding
too many lines in the PPO trainer.
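
A rough sketch of the shape of the refactor; the function name and metric keys are illustrative and not necessarily verl's exact API:

```python
# Hypothetical excerpt of a metric_utils module.
import numpy as np


def compute_data_metrics(batch: dict) -> dict:
    """Aggregate per-batch statistics that used to be computed inline in the trainer."""
    return {
        "critic/rewards/mean": float(np.mean(batch["rewards"])),
        "response_length/mean": float(np.mean(batch["response_lengths"])),
    }


# In the PPO trainer's fit loop, the bulky inline logic collapses to:
#     metrics.update(compute_data_metrics(batch))
```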
## Who can review?
@vermouth1992 @PeterSH6
#556 made an effort to remove unnecessary `empty_cache` calls, but it causes
CUDA OOM at vLLM wake_up:
```text
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
with self.rollout_sharding_manager:
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
self.inference_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
self.llm_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
self.model_executor.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
self.collective_rpc("wake_up")
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
allocator.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
create_and_map(handle)
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```
This PR removes all redundant `torch.cuda.empty_cache()` calls in the FSDP worker
and only empties the cache before vLLM wake_up and after vLLM sleep, since
vLLM has its own caching memory allocator,
[CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
Outside the vLLM scope, we should avoid emptying the cache so that PyTorch can
use its caching allocator to speed up memory allocations.
- [x] Cleanup FSDP worker torch.cuda.empty_cache()
- [ ] Cleanup Megatron worker torch.cuda.empty_cache()
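
A minimal sketch of the intended pattern, assuming a context-manager style sharding manager; the class and attribute names are illustrative:

```python
import torch


class FSDPVLLMShardingManager:
    """Empties the CUDA cache only at the vLLM sleep/wake_up boundaries."""

    def __init__(self, inference_engine):
        self.inference_engine = inference_engine

    def __enter__(self):
        # Release PyTorch's cached blocks so CuMemAllocator can map vLLM's
        # weights and KV cache back onto the freed physical memory.
        torch.cuda.empty_cache()
        self.inference_engine.wake_up()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Put vLLM to sleep, empty the cache once, then let PyTorch's caching
        # allocator serve all subsequent training allocations without further
        # empty_cache() calls.
        self.inference_engine.sleep(level=1)
        torch.cuda.empty_cache()
```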
### Description
- Fix the `filter_overlong_prompts` setting in PRIME.
- Fix the incorrect padding side for Qwen in PRIME.
  - When I use the PRIME recipe to train Qwen-series models, I get
    "*ValueError: You are attempting to perform batched generation with
    padding_side='right' this may lead to unexpected behaviour for Flash
    Attention version of Qwen2. Make sure to call tokenizer.padding_side =
    'left' before tokenizing the input.*" So I set `use_cache=False` when
    calling the model to compute output logits.
- Fix a CUDA error with vLLM v0.6.3.
  - When I run PRIME, I may get *CUDA error: an illegal memory access was
    encountered*. Following https://github.com/vllm-project/vllm/issues/10389,
    I set `VLLM_ATTENTION_BACKEND=XFORMERS`. Both workarounds are sketched below.
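
A hedged sketch of the two workarounds as a standalone script; the model name and call site are placeholders, not PRIME's actual code:

```python
import os

# Must be set before vLLM (or anything that imports it) to avoid the
# illegal-memory-access error with vLLM v0.6.3 (vllm-project/vllm#10389).
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

batch = tokenizer(["hello", "a longer prompt"], return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    # use_cache=False: we only need logits, not incremental decoding, which
    # sidesteps the padding_side='right' warning path for Qwen2.
    logits = model(**batch, use_cache=False).logits
```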
Refactor and merge the PRIME algorithm (https://github.com/PRIME-RL/PRIME)
into verl main.
Breaking change:
`trainer.fsdp_config.min_num_params` has been moved to `trainer.fsdp_config.wrap_policy.min_num_params`.