verl/examples/rollout_importance_sampling
Yingru Li 4f1c489e45 [algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error (#3810)
## Summary

Fixes #3787 by removing `torch.quantile()`-based percentile metrics
(`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused
`RuntimeError: quantile() input tensor is too large` when using large
batch sizes or response lengths.

## Problem

When using configurations with large tensor sizes (e.g.,
`max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`),
the `torch.quantile()` function fails with a runtime error due to
PyTorch's internal tensor size limitations (~2^24 to 2^27 elements
depending on version, GPU memory, and dtype).

The error occurred in `verl/trainer/ppo/mismatch_helper.py`:
```python
metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)
```

## Solution

Removed the three quantile-based percentile metrics from the Rollout IS
framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`,
`rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.)
provide sufficient monitoring capabilities for importance sampling
health without triggering tensor size limitations.

## Changes

- **Modified**: [verl/trainer/ppo/mismatch_helper.py](verl/trainer/ppo/mismatch_helper.py)
  - Removed `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations
  - All other rollout IS and mismatch metrics remain functional

## Testing

Verified that:
- Rollout IS framework continues to function correctly without
percentile metrics
- No runtime errors with large tensor configurations
- All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are
computed correctly

Resolves #3787
2025-10-20 13:04:57 +08:00

Rollout Importance Sampling (IS) Examples

This directory contains examples and documentation for using Rollout Importance Sampling to correct distribution mismatch between rollout and training policies.

Overview

Rollout Importance Sampling corrects the bias that arises when rollout and training use different policies:

  1. Rollout generation uses one policy (e.g., vLLM with BFloat16)
  2. Training uses another policy (e.g., FSDP with FP32)
  3. Left uncorrected, this mismatch leads to biased gradient estimates
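
The correction rests on the per-token importance ratio between the two policies. The following is a minimal sketch, not the code in `mismatch_helper.py`; the function name `token_is_ratios` and the sample log-probabilities are assumptions made for illustration.

```python
import math

def token_is_ratios(train_log_probs, rollout_log_probs):
    """Per-token ratio pi_train / pi_rollout, computed from log-probabilities."""
    return [math.exp(lt - lr) for lt, lr in zip(train_log_probs, rollout_log_probs)]

# Hypothetical log-probs for one 3-token response.
rollout_lp = [-1.20, -0.35, -2.10]  # reported by the rollout engine
train_lp = [-1.15, -0.40, -2.10]    # recomputed by the training engine
ratios = token_is_ratios(train_lp, rollout_lp)
```

A ratio above 1 means the training policy assigns the token more probability than the rollout policy did; a ratio of exactly 1 means the two agree on that token.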

Quick Start

Basic Configuration

algorithm:
  # Main control: set threshold to enable (null = disabled)
  rollout_is_threshold: 2.0
  # Whether to apply weights to policy loss (true) or just compute metrics (false)
  rollout_is: true
  rollout_is_level: token
  rollout_is_mode: truncate

# IMPORTANT: Must enable log prob calculation
actor_rollout_ref:
  rollout:
    calculate_log_probs: true

Running the Example

# Basic example with token-level truncate
bash examples/rollout_importance_sampling/run_with_rollout_is.sh

Configuration Options

Aggregation Levels (rollout_is_level)

| Level | Properties | Threshold Range |
|-------|------------|-----------------|
| token | Per-token | 1.5 - 5.0 |
| sequence | Per-sequence | 2.0 - 10.0 |
| geometric | Geometric mean | 1.0002 - 1.001 |
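
The three levels differ only in how per-token log-ratios are combined. The sketch below is one plausible reading of the table (product of ratios for `sequence`, geometric mean for `geometric`); the helper name `aggregate_log_ratios` is hypothetical and the actual implementation may differ in detail.

```python
import math

def aggregate_log_ratios(log_ratios, level):
    """Collapse the per-token log-ratios of one sequence into IS weights.

    token     -> one weight per token
    sequence  -> product of token ratios, shared by every token
    geometric -> geometric mean of token ratios, shared by every token
    """
    n = len(log_ratios)
    if level == "token":
        return [math.exp(x) for x in log_ratios]
    if level == "sequence":
        return [math.exp(sum(log_ratios))] * n
    if level == "geometric":
        return [math.exp(sum(log_ratios) / n)] * n
    raise ValueError(f"unknown level: {level}")

# 100 tokens, each about 1% more likely under the training policy.
log_ratios = [0.01] * 100
seq_w = aggregate_log_ratios(log_ratios, "sequence")[0]   # about e = 2.718
geo_w = aggregate_log_ratios(log_ratios, "geometric")[0]  # about 1.01
```

This also suggests why the geometric thresholds sit so close to 1: the geometric mean is a per-token average, so a small per-token deviation that compounds to a large sequence-level weight still yields a value near 1.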

Bounding Modes (rollout_is_mode)

| Mode | Behavior |
|------|----------|
| truncate | Cap weights at upper threshold only |
| mask | Zero out weights outside [lower, upper] |
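
As an illustration of the two behaviors, here is a small sketch covering `truncate` and the `mask` mode used in the configuration examples. The function `bound_weights` is a hypothetical helper written for this README, not part of verl.

```python
def bound_weights(weights, upper, lower, mode):
    """Apply a bounding mode to a list of IS weights."""
    if mode == "truncate":
        # Cap large weights; small weights pass through unchanged.
        return [min(w, upper) for w in weights]
    if mode == "mask":
        # Keep in-range weights as they are and zero out everything else.
        return [w if lower <= w <= upper else 0.0 for w in weights]
    raise ValueError(f"unknown mode: {mode}")

weights = [0.3, 1.0, 3.0]
truncated = bound_weights(weights, upper=2.0, lower=0.5, mode="truncate")
masked = bound_weights(weights, upper=2.0, lower=0.5, mode="mask")
```

With these inputs, `truncate` keeps all three samples but caps the last one at 2.0, while `mask` discards both out-of-range samples.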

Key Parameters

  • rollout_is_threshold: Upper threshold for IS weights (null = disabled, float = enabled). Main on/off switch.
  • rollout_is: Whether to apply weights to loss (true) or just compute metrics (false). Default: false.
  • rollout_is_threshold_lower: Lower threshold (null = auto 1/upper)
  • rollout_is_veto_threshold: Catastrophic outlier threshold (default: 1e-4)

Configuration Examples

Example 1: Full IS Correction (Apply Weights)

algorithm:
  rollout_is_threshold: 2.0
  rollout_is: true  # Apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4

Example 2: Metrics Only (No Weight Application)

algorithm:
  rollout_is_threshold: 2.0
  rollout_is: false  # Compute metrics only, don't apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate

Example 3: Geometric Mean with Mask

algorithm:
  rollout_is_threshold: 1.0002
  rollout_is: true
  rollout_is_threshold_lower: 0.9998
  rollout_is_level: geometric
  rollout_is_mode: mask
  rollout_is_veto_threshold: 1e-4

Example 4: Sequence-level with Truncate

algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: null  # Auto-reciprocal: 0.2
  rollout_is_level: sequence
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4

Example 5: Asymmetric Thresholds

algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: 0.8
  rollout_is_level: token
  rollout_is_mode: mask

Monitoring Metrics

Key metrics to watch (all prefixed with mismatch/ in logs):

Health Indicators

  • rollout_is_mean: Mean IS weight across sequences
  • rollout_is_eff_sample_size: Effective sample size after weighting
  • rollout_is_veto_fraction: Fraction of sequences vetoed
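
For intuition about `rollout_is_eff_sample_size`, the sketch below uses the standard normalized effective sample size, (Σw)² / (n · Σw²). Treating the logged metric as this fraction is an assumption, although it is consistent with the `< 0.3` symptom used in Troubleshooting; the function name is hypothetical.

```python
def effective_sample_size_fraction(weights):
    """Normalized ESS in (0, 1]: 1.0 for uniform weights, 1/n if one weight dominates."""
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (n * total_sq)

uniform = effective_sample_size_fraction([2.0, 2.0, 2.0, 2.0])
degenerate = effective_sample_size_fraction([1.0, 0.0, 0.0, 0.0])
```

Uniform weights give a fraction of 1.0 regardless of their scale, whereas a single dominant weight among four samples gives 0.25.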

Distribution Metrics

  • rollout_is_max, rollout_is_min: Weight extremes
  • rollout_is_std: Standard deviation

Diagnostic Metrics

  • rollout_is_ratio_fraction_high: Fraction exceeding upper threshold
  • rollout_is_ratio_fraction_low: Fraction below lower threshold
  • rollout_is_catastrophic_token_fraction: Catastrophic tokens detected

Mismatch Metrics (Training vs Rollout Policy)

These metrics help diagnose the distribution mismatch between rollout and training policies:

Perplexity Metrics:

  • mismatch_training_ppl: Perplexity of training policy
  • mismatch_rollout_ppl: Perplexity of rollout policy
  • mismatch_ppl_ratio: Ratio of training PPL to rollout PPL
  • mismatch_log_ppl_diff: Log perplexity difference
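
The perplexity metrics can be sketched for a single sequence as follows. The helper names, the single-sequence scope, and the sign convention of `mismatch_log_ppl_diff` (training minus rollout) are assumptions; consult `mismatch_helper.py` for the exact definitions and batch aggregation.

```python
import math

def perplexity(log_probs):
    """exp of the negative mean token log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

def ppl_metrics(train_log_probs, rollout_log_probs):
    train_ppl = perplexity(train_log_probs)
    rollout_ppl = perplexity(rollout_log_probs)
    return {
        "mismatch_training_ppl": train_ppl,
        "mismatch_rollout_ppl": rollout_ppl,
        "mismatch_ppl_ratio": train_ppl / rollout_ppl,
        # Assumed sign convention: log(training PPL) - log(rollout PPL).
        "mismatch_log_ppl_diff": math.log(train_ppl) - math.log(rollout_ppl),
    }

# Every token has probability 0.5 under both policies, so both PPLs are 2.
lp = [math.log(0.5)] * 4
m = ppl_metrics(lp, lp)
```

When the two policies agree, the ratio is 1 and the log difference is 0; values drifting away from these indicate growing mismatch.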

KL Divergence Metrics:

  • mismatch_kl: KL divergence KL(π_rollout || π_training)
  • mismatch_k3_kl: K3 KL estimator
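
Both KL metrics can be estimated from the same per-token log-probabilities, since the tokens were sampled from the rollout policy. The sketch below shows the direct estimator and the K3 estimator, (r - 1) - log r with r = π_training / π_rollout; the function name is hypothetical and the real implementation may mask and average differently.

```python
import math

def kl_estimates(train_log_probs, rollout_log_probs):
    """Estimate KL(pi_rollout || pi_training) from tokens sampled by the rollout policy."""
    log_r = [lt - lr for lt, lr in zip(train_log_probs, rollout_log_probs)]
    n = len(log_r)
    direct = sum(-x for x in log_r) / n                # mean of log(pi_rollout / pi_training)
    k3 = sum(math.exp(x) - 1.0 - x for x in log_r) / n  # each term is non-negative
    return direct, k3

same = kl_estimates([-1.0, -2.0], [-1.0, -2.0])
diff = kl_estimates([-1.1, -1.9], [-1.0, -2.0])
```

The direct estimate can be negative on a small sample, while the K3 estimate never is, which tends to make K3 the easier of the two to read on noisy batches.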

Troubleshooting

Issue: High Variance in IS Weights

Symptoms: rollout_is_std > 1.0, rollout_is_eff_sample_size < 0.3

Solutions:

  1. Switch from sequence to geometric level
  2. Tighten thresholds
  3. Check if rollout and training are too different

Issue: Too Many Sequences Vetoed

Symptoms: rollout_is_veto_fraction > 0.1

Solutions:

  1. Relax veto threshold: rollout_is_veto_threshold: 1e-3
  2. Check for numerical issues in log prob computation
  3. Verify rollout and training policies aren't completely different

Issue: Mean IS Weight Far from 1.0

Symptoms: rollout_is_mean < 0.5 or > 2.0

Solutions:

  1. Check that calculate_log_probs=True is set
  2. Verify rollout_log_probs are correctly passed
  3. Check for systematic bias in rollout vs training

Issue: Too Much Data Discarded (Mask Mode)

Symptoms: rollout_is_masked_fraction > 0.5

Solutions:

  1. Widen thresholds
  2. Switch to truncate mode
  3. Use geometric level for better stability
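
The symptom thresholds in this section can be collected into a simple check. This is a sketch only: `diagnose` is a hypothetical helper, it expects bare metric names without the `mismatch/` log prefix, and it treats the two high-variance symptoms as alternatives.

```python
def diagnose(metrics):
    """Return the troubleshooting issues whose symptom thresholds are met.

    Metrics that are missing from the dict are skipped.
    """
    issues = []
    std = metrics.get("rollout_is_std")
    ess = metrics.get("rollout_is_eff_sample_size")
    if (std is not None and std > 1.0) or (ess is not None and ess < 0.3):
        issues.append("high variance in IS weights")
    veto = metrics.get("rollout_is_veto_fraction")
    if veto is not None and veto > 0.1:
        issues.append("too many sequences vetoed")
    mean = metrics.get("rollout_is_mean")
    if mean is not None and (mean < 0.5 or mean > 2.0):
        issues.append("mean IS weight far from 1.0")
    masked = metrics.get("rollout_is_masked_fraction")
    if masked is not None and masked > 0.5:
        issues.append("too much data discarded")
    return issues

report = diagnose({"rollout_is_std": 1.4, "rollout_is_eff_sample_size": 0.8,
                   "rollout_is_mean": 1.02})
```

Each returned string matches one of the issues above, so the corresponding list of solutions applies.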

Performance Considerations

Memory Usage

  • Rollout IS adds minimal memory overhead (~1% of model memory)
  • Log-space computation prevents numerical overflow
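
To see why log-space matters, consider a sequence-level weight in truncate mode. The sketch below sums log-ratios and applies the upper bound before exponentiating, so no intermediate value can overflow. It illustrates the idea under these assumptions and is not the library's code; `truncated_sequence_weight` is a hypothetical name.

```python
import math

def truncated_sequence_weight(log_ratios, upper):
    """Sequence-level weight capped at `upper`, with the cap applied in log-space."""
    log_w = sum(log_ratios)
    # min(exp(s), upper) == exp(min(s, log(upper))), but the right-hand side
    # never exponentiates a huge number.
    return math.exp(min(log_w, math.log(upper)))

# 2000 tokens, each twice as likely under the training policy:
# the raw product 2**2000 does not fit in a float.
w_big = truncated_sequence_weight([math.log(2.0)] * 2000, upper=5.0)
w_one = truncated_sequence_weight([0.0] * 10, upper=5.0)
```

Multiplying the raw ratios directly would overflow long before the cap could be applied, while the log-space form stays finite for any sequence length.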

Computational Cost

  • Token-level: ~1-2% overhead
  • Sequence-level: ~2-3% overhead
  • Geometric: ~2-3% overhead

Advanced Topics

Dual Thresholds

Specify both upper and lower explicitly:

rollout_is_threshold: 2.0      # Upper
rollout_is_threshold_lower: 0.5  # Lower (set explicitly; need not equal 1/upper)

Or use auto-reciprocal:

rollout_is_threshold: 2.0      # Upper = 2.0, Lower = 0.5 (auto)
rollout_is_threshold_lower: null
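
The way the two settings combine can be sketched as below, with `None` standing in for YAML `null`. The function `resolve_thresholds` is a hypothetical helper that mirrors the documented behavior, not a verl API.

```python
def resolve_thresholds(upper, lower=None):
    """Return (upper, lower), or None when rollout IS is disabled."""
    if upper is None:
        return None  # rollout_is_threshold: null disables rollout IS
    if lower is None:
        lower = 1.0 / upper  # auto-reciprocal
    return (upper, lower)

auto = resolve_thresholds(2.0)            # lower defaults to 1/upper
explicit = resolve_thresholds(5.0, 0.8)   # asymmetric thresholds
disabled = resolve_thresholds(None)
```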

Veto Mechanism

The veto mechanism zeros out entire sequences containing catastrophic outliers:

  • If any token has ratio < rollout_is_veto_threshold, the entire sequence is rejected
  • This prevents extreme outliers from dominating training
  • Default threshold: 1e-4 (a token whose ratio is off by a factor of 10,000)
  • Set to null to disable: rollout_is_veto_threshold: null
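
The veto rule can be sketched as follows. The helper `veto_mask` is hypothetical, takes one list of per-token ratios per sequence, and uses `None` to mean the veto is disabled.

```python
def veto_mask(token_ratios_per_sequence, veto_threshold=1e-4):
    """Return 1.0 for sequences to keep and 0.0 for vetoed sequences."""
    if veto_threshold is None:
        return [1.0] * len(token_ratios_per_sequence)
    return [
        0.0 if any(r < veto_threshold for r in ratios) else 1.0
        for ratios in token_ratios_per_sequence
    ]

batch = [
    [0.9, 1.1, 1.0],    # healthy sequence
    [1.0, 5e-6, 1.2],   # one catastrophic token, so the whole sequence is rejected
]
keep = veto_mask(batch)
veto_fraction = keep.count(0.0) / len(keep)
```

A single catastrophic token is enough to reject its sequence, which is why a high `rollout_is_veto_fraction` often points to a numerical problem rather than to many bad tokens.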

Examples

See the script in this directory:

  • run_with_rollout_is.sh: Basic example with token-level truncate mode

References

  • Implementation: verl/trainer/ppo/mismatch_helper.py
  • Core algorithm: verl/trainer/ppo/core_algos.py
  • Paper: "Your Efficient RL Framework Secretly Brings You Off-Policy RL Training"