## Summary

Fixes #3787 by removing the `torch.quantile()`-based percentile metrics (`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused `RuntimeError: quantile() input tensor is too large` when using large batch sizes or response lengths.

## Problem

When using configurations with large tensor sizes (e.g., `max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`), `torch.quantile()` fails with a runtime error due to PyTorch's internal tensor size limitations (~2^24 to 2^27 elements, depending on version, GPU memory, and dtype).

The error occurred in `verl/trainer/ppo/mismatch_helper.py`:

```python
metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)
```

## Solution

Removed the three quantile-based percentile metrics from the Rollout IS framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`, `rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.) provide sufficient monitoring of importance sampling health without triggering the tensor size limitation.

## Changes

- **Modified**: [verl/trainer/ppo/mismatch_helper.py](verl/trainer/ppo/mismatch_helper.py)
  - Removed the `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations
  - All other rollout IS and mismatch metrics remain functional

## Testing

Verified that:

- The Rollout IS framework continues to function correctly without the percentile metrics
- No runtime errors occur with large tensor configurations
- All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are computed correctly

Resolves #3787
# Rollout Importance Sampling (IS) Examples

This directory contains examples and documentation for using Rollout Importance Sampling to correct the distribution mismatch between rollout and training policies.

References:
- When Speed Kills Stability: https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda
- Off-policy RL: https://fengyao.notion.site/off-policy-rl
## Overview
Rollout Importance Sampling corrects for the distribution mismatch that arises when:

- Rollout generation uses one policy implementation (e.g., vLLM with BFloat16)
- Training uses another (e.g., FSDP with FP32)

This mismatch leads to biased gradient estimates; rollout IS reweights the training signal to correct for it.
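For intuition, here is a minimal sketch (illustrative code, not the verl implementation; the tensor names are made up) of how a per-token importance weight is formed from the two sets of log probabilities:

```python
import torch

# Illustrative log probs of the sampled tokens under each policy, shape [batch, seq_len].
rollout_log_probs = torch.tensor([[-1.2, -0.8, -2.1]])   # from the rollout engine (e.g., vLLM)
training_log_probs = torch.tensor([[-1.1, -0.9, -2.0]])  # recomputed by the trainer (e.g., FSDP)

# Importance weight pi_training(token) / pi_rollout(token), computed in log space
# to avoid overflow, then exponentiated.
log_ratio = training_log_probs - rollout_log_probs
token_is_weights = torch.exp(log_ratio)

print(token_is_weights)  # values near 1.0 indicate a small rollout/training mismatch
```

After aggregation and bounding (configured below), these weights multiply the policy loss so that gradients better reflect the training policy.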
## Quick Start

### Basic Configuration
```yaml
algorithm:
  # Main control: set threshold to enable (null = disabled)
  rollout_is_threshold: 2.0
  # Whether to apply weights to the policy loss (true) or just compute metrics (false)
  rollout_is: true
  rollout_is_level: token
  rollout_is_mode: truncate

# IMPORTANT: must enable log prob calculation in the rollout
actor_rollout_ref:
  rollout:
    calculate_log_probs: true
```
### Running the Example
```bash
# Basic example with token-level truncate
bash examples/rollout_importance_sampling/run_with_rollout_is.sh
```
## Configuration Options

### Aggregation Levels (`rollout_is_level`)

| Level | Properties | Typical Threshold Range |
|---|---|---|
| `token` | Per-token weights | 1.5 - 5.0 |
| `sequence` | Per-sequence weights | 2.0 - 10.0 |
| `geometric` | Geometric mean of per-token ratios | 1.0002 - 1.001 |
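To illustrate how the three levels differ, here is a small sketch (illustrative only, not the verl implementation; it assumes sequence-level weights are the product of per-token ratios, i.e., the exponential of the summed log ratios, and geometric-level weights are their geometric mean):

```python
import torch

# Per-token log importance ratios (log pi_training - log pi_rollout), shape [batch, seq_len].
log_ratio = torch.tensor([[0.05, -0.10, 0.02, 0.00]])
response_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])  # last position is padding

# token level: one weight per token.
token_weights = torch.exp(log_ratio)

# sequence level: product of token ratios = exp(sum of log ratios); magnitudes grow
# with sequence length, hence the wider threshold range.
seq_weights = torch.exp((log_ratio * response_mask).sum(dim=-1))

# geometric level: geometric mean of token ratios = exp(mean log ratio); values stay
# very close to 1.0, hence thresholds like 1.0002.
num_tokens = response_mask.sum(dim=-1)
geo_weights = torch.exp((log_ratio * response_mask).sum(dim=-1) / num_tokens)

print(token_weights, seq_weights, geo_weights)
```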
### Bounding Modes (`rollout_is_mode`)

| Mode | Behavior |
|---|---|
| `truncate` | Cap weights at the upper threshold only |
| `mask` | Zero out weights outside `[lower, upper]` |
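A minimal sketch of the two bounding behaviors (illustrative code; `upper` and `lower` stand in for the configured thresholds):

```python
import torch

weights = torch.tensor([0.1, 0.6, 1.0, 1.8, 3.5])
upper, lower = 2.0, 0.5

# truncate: cap weights at the upper threshold; small weights are left untouched.
truncated = torch.clamp(weights, max=upper)

# mask: zero out weights outside [lower, upper], discarding those tokens/sequences.
masked = weights * ((weights >= lower) & (weights <= upper)).float()

print(truncated)  # tensor([0.1000, 0.6000, 1.0000, 1.8000, 2.0000])
print(masked)     # tensor([0.0000, 0.6000, 1.0000, 1.8000, 0.0000])
```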
### Key Parameters

- `rollout_is_threshold`: Upper threshold for IS weights (`null` = disabled, float = enabled). Main on/off switch.
- `rollout_is`: Whether to apply weights to the loss (`true`) or only compute metrics (`false`). Default: `false`.
- `rollout_is_threshold_lower`: Lower threshold (`null` = auto-reciprocal, `1/upper`)
- `rollout_is_veto_threshold`: Catastrophic outlier threshold (default: `1e-4`)
## Configuration Examples

### Example 1: Full IS Correction (Apply Weights)

```yaml
algorithm:
  rollout_is_threshold: 2.0
  rollout_is: true  # Apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4
```
### Example 2: Metrics Only (No Weight Application)

```yaml
algorithm:
  rollout_is_threshold: 2.0
  rollout_is: false  # Compute metrics only, don't apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate
```
### Example 3: Geometric Mean with Mask

```yaml
algorithm:
  rollout_is_threshold: 1.0002
  rollout_is: true
  rollout_is_threshold_lower: 0.9998
  rollout_is_level: geometric
  rollout_is_mode: mask
  rollout_is_veto_threshold: 1e-4
```
### Example 4: Sequence-Level with Truncate

```yaml
algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: null  # Auto-reciprocal: 0.2
  rollout_is_level: sequence
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4
```
### Example 5: Asymmetric Thresholds

```yaml
algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: 0.8
  rollout_is_level: token
  rollout_is_mode: mask
```
## Monitoring Metrics

Key metrics to watch (all prefixed with `mismatch/` in logs):

### Health Indicators

- `rollout_is_mean`: Mean IS weight across sequences
- `rollout_is_eff_sample_size`: Effective sample size after weighting (see the sketch below)
- `rollout_is_veto_fraction`: Fraction of sequences vetoed
### Distribution Metrics

- `rollout_is_max`, `rollout_is_min`: Weight extremes
- `rollout_is_std`: Standard deviation of the weights
### Diagnostic Metrics

- `rollout_is_ratio_fraction_high`: Fraction of ratios exceeding the upper threshold
- `rollout_is_ratio_fraction_low`: Fraction of ratios below the lower threshold
- `rollout_is_catastrophic_token_fraction`: Fraction of catastrophic tokens detected
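For reference, a sketch of an effective-sample-size computation (this assumes `rollout_is_eff_sample_size` follows the standard normalized ESS formula; the authoritative definition is in `verl/trainer/ppo/mismatch_helper.py`):

```python
import torch

def normalized_ess(weights: torch.Tensor) -> torch.Tensor:
    # Standard effective sample size, (sum w)^2 / sum(w^2), normalized by the
    # number of weights: 1.0 for uniform weights, approaching 1/N when a single
    # weight dominates.
    return weights.sum() ** 2 / (weights.pow(2).sum() * weights.numel())

weights = torch.tensor([0.9, 1.1, 1.0, 3.0])
print(normalized_ess(weights))  # ~0.75: uneven weights reduce the effective sample size
```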
### Mismatch Metrics (Training vs. Rollout Policy)

These metrics help diagnose the distribution mismatch between the rollout and training policies.

**Perplexity metrics:**

- `mismatch_training_ppl`: Perplexity of the training policy
- `mismatch_rollout_ppl`: Perplexity of the rollout policy
- `mismatch_ppl_ratio`: Ratio of training PPL to rollout PPL
- `mismatch_log_ppl_diff`: Log-perplexity difference
**KL divergence metrics:**

- `mismatch_kl`: KL divergence KL(π_rollout || π_training)
- `mismatch_k3_kl`: K3 KL estimator
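For intuition, a sketch of how such estimators are typically computed from token log probabilities on rollout samples (illustrative only; it assumes `mismatch_kl` is the sample average of `log π_rollout − log π_training` and `mismatch_k3_kl` is the k3 estimator; the authoritative definitions are in `verl/trainer/ppo/mismatch_helper.py`):

```python
import torch

# Token log probs on tokens sampled from the rollout policy.
rollout_log_probs = torch.tensor([-1.2, -0.8, -2.1])
training_log_probs = torch.tensor([-1.1, -0.9, -2.0])

# Naive estimator of KL(pi_rollout || pi_training): E_rollout[log pi_rollout - log pi_training].
kl = (rollout_log_probs - training_log_probs).mean()

# k3 estimator (lower variance, always non-negative): E_rollout[(r - 1) - log r]
# with r = pi_training / pi_rollout.
log_r = training_log_probs - rollout_log_probs
k3_kl = (torch.exp(log_r) - 1.0 - log_r).mean()

print(kl, k3_kl)
```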
## Troubleshooting

### Issue: High Variance in IS Weights

**Symptoms:** `rollout_is_std` > 1.0, `rollout_is_eff_sample_size` < 0.3
**Solutions:**

- Switch from `sequence` to `geometric` level
- Tighten the thresholds
- Check whether the rollout and training policies are too different
### Issue: Too Many Sequences Vetoed

**Symptoms:** `rollout_is_veto_fraction` > 0.1

**Solutions:**

- Relax the veto threshold: `rollout_is_veto_threshold: 1e-3`
- Check for numerical issues in the log prob computation
- Verify that the rollout and training policies aren't completely different
### Issue: Mean IS Weight Far from 1.0

**Symptoms:** `rollout_is_mean` < 0.5 or > 2.0

**Solutions:**

- Check that `calculate_log_probs: true` is set
- Verify that `rollout_log_probs` are correctly passed
- Check for systematic bias between the rollout and training policies
### Issue: Too Much Data Discarded (Mask Mode)

**Symptoms:** `rollout_is_masked_fraction` > 0.5

**Solutions:**

- Widen the thresholds
- Switch to `truncate` mode
- Use the `geometric` level for better stability
## Performance Considerations

### Memory Usage

- Rollout IS adds minimal memory overhead (~1% of model memory)
- Log-space computation prevents numerical overflow

### Computational Cost

- Token-level: ~1-2% overhead
- Sequence-level: ~2-3% overhead
- Geometric: ~2-3% overhead
## Advanced Topics

### Dual Thresholds

Specify both the upper and lower thresholds explicitly:

```yaml
rollout_is_threshold: 2.0        # Upper
rollout_is_threshold_lower: 0.5  # Lower, set explicitly (need not equal 1/upper)
```

Or use auto-reciprocal:

```yaml
rollout_is_threshold: 2.0        # Upper = 2.0, lower = 0.5 (auto)
rollout_is_threshold_lower: null
```
### Veto Mechanism

The veto mechanism zeros out entire sequences that contain catastrophic outliers (a minimal sketch follows the list):

- If any token has a ratio below `rollout_is_veto_threshold`, the entire sequence is rejected
- This prevents extreme outliers from dominating training
- Default threshold: `1e-4` (a ratio that is 10,000x off)
- Set to `null` to disable: `rollout_is_veto_threshold: null`
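A minimal sketch of this behavior (illustrative code, not the verl implementation):

```python
import torch

# Per-token IS ratios for two sequences, shape [batch, seq_len].
ratios = torch.tensor([
    [0.9, 1.1, 1.0],    # healthy sequence
    [1.0, 5e-5, 1.2],   # contains a catastrophic token (ratio < 1e-4)
])
veto_threshold = 1e-4

# A sequence is kept only if none of its tokens fall below the veto threshold.
keep = (ratios >= veto_threshold).all(dim=-1, keepdim=True).float()
vetoed_ratios = ratios * keep

print(keep.squeeze(-1))  # tensor([1., 0.])
print(vetoed_ratios)     # second sequence is zeroed out entirely
```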
## Examples

See the script in this directory:

- `run_with_rollout_is.sh`: Basic example with token-level truncate mode
## References

- Implementation: `verl/trainer/ppo/mismatch_helper.py`
- Core algorithm: `verl/trainer/ppo/core_algos.py`
- Paper: "Your Efficient RL Framework Secretly Brings You Off-Policy RL Training"