## Summary

Fixes #3787 by removing the `torch.quantile()`-based percentile metrics (`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused `RuntimeError: quantile() input tensor is too large` when using large batch sizes or response lengths.

## Problem

When using configurations with large tensor sizes (e.g., `max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`), `torch.quantile()` fails with a runtime error due to PyTorch's internal tensor size limitations (~2^24 to 2^27 elements, depending on version, GPU memory, and dtype).

The error occurred in `verl/trainer/ppo/mismatch_helper.py`:

```python
metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)
```

## Solution

Removed the three quantile-based percentile metrics from the Rollout IS framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`, `rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.) provide sufficient monitoring of importance sampling health without triggering the tensor size limitation.

## Changes

- **Modified**: [verl/trainer/ppo/mismatch_helper.py](verl/trainer/ppo/mismatch_helper.py)
  - Removed the `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations
  - All other rollout IS and mismatch metrics remain functional

## Testing

Verified that:

- The Rollout IS framework continues to function correctly without the percentile metrics
- No runtime errors occur with large tensor configurations
- All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are computed correctly

Resolves #3787
# Rollout Importance Sampling (IS) Examples

This directory contains examples and documentation for using Rollout Importance Sampling to correct the distribution mismatch between rollout and training policies.

References:
- When Speed Kills Stability: https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda
- Off-policy RL: https://fengyao.notion.site/off-policy-rl
## Overview
Rollout Importance Sampling corrects for the distribution mismatch that arises when:

- Rollout generation uses one policy implementation (e.g., vLLM with BFloat16)
- Training uses another (e.g., FSDP with FP32)

This mismatch leads to biased gradient estimates; rollout IS reweights the training signal to correct for it.
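For intuition, here is a minimal sketch (illustrative code, not the verl implementation; the tensor names are made up) of how a per-token importance weight is formed from the two sets of log probabilities:

```python
import torch

# Illustrative log probs of the sampled tokens under each policy, shape [batch, seq_len].
rollout_log_probs = torch.tensor([[-1.2, -0.8, -2.1]])   # from the rollout engine (e.g., vLLM)
training_log_probs = torch.tensor([[-1.1, -0.9, -2.0]])  # recomputed by the trainer (e.g., FSDP)

# Importance weight pi_training(token) / pi_rollout(token), computed in log space
# to avoid overflow, then exponentiated.
log_ratio = training_log_probs - rollout_log_probs
token_is_weights = torch.exp(log_ratio)

print(token_is_weights)  # values near 1.0 indicate a small rollout/training mismatch
```

After aggregation and bounding (configured below), these weights multiply the policy loss so that gradients better reflect the training policy.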
## Quick Start

### Basic Configuration
```yaml
algorithm:
  # Main control: set threshold to enable (null = disabled)
  rollout_is_threshold: 2.0
  # Whether to apply weights to the policy loss (true) or just compute metrics (false)
  rollout_is: true
  rollout_is_level: token
  rollout_is_mode: truncate

# IMPORTANT: must enable log prob calculation in the rollout
actor_rollout_ref:
  rollout:
    calculate_log_probs: true
```
### Running the Example
```bash
# Basic example with token-level truncate
bash examples/rollout_importance_sampling/run_with_rollout_is.sh
```
## Configuration Options

### Aggregation Levels (`rollout_is_level`)

| Level | Properties | Typical Threshold Range |
|---|---|---|
| `token` | Per-token weights | 1.5 - 5.0 |
| `sequence` | Per-sequence weights | 2.0 - 10.0 |
| `geometric` | Geometric mean of per-token ratios | 1.0002 - 1.001 |
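To illustrate how the three levels differ, here is a small sketch (illustrative only, not the verl implementation; it assumes sequence-level weights are the product of per-token ratios, i.e., the exponential of the summed log ratios, and geometric-level weights are their geometric mean):

```python
import torch

# Per-token log importance ratios (log pi_training - log pi_rollout), shape [batch, seq_len].
log_ratio = torch.tensor([[0.05, -0.10, 0.02, 0.00]])
response_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])  # last position is padding

# token level: one weight per token.
token_weights = torch.exp(log_ratio)

# sequence level: product of token ratios = exp(sum of log ratios); magnitudes grow
# with sequence length, hence the wider threshold range.
seq_weights = torch.exp((log_ratio * response_mask).sum(dim=-1))

# geometric level: geometric mean of token ratios = exp(mean log ratio); values stay
# very close to 1.0, hence thresholds like 1.0002.
num_tokens = response_mask.sum(dim=-1)
geo_weights = torch.exp((log_ratio * response_mask).sum(dim=-1) / num_tokens)

print(token_weights, seq_weights, geo_weights)
```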
### Bounding Modes (`rollout_is_mode`)

| Mode | Behavior |
|---|---|
| `truncate` | Cap weights at the upper threshold only |
| `mask` | Zero out weights outside `[lower, upper]` |
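A minimal sketch of the two bounding behaviors (illustrative code; `upper` and `lower` stand in for the configured thresholds):

```python
import torch

weights = torch.tensor([0.1, 0.6, 1.0, 1.8, 3.5])
upper, lower = 2.0, 0.5

# truncate: cap weights at the upper threshold; small weights are left untouched.
truncated = torch.clamp(weights, max=upper)

# mask: zero out weights outside [lower, upper], discarding those tokens/sequences.
masked = weights * ((weights >= lower) & (weights <= upper)).float()

print(truncated)  # tensor([0.1000, 0.6000, 1.0000, 1.8000, 2.0000])
print(masked)     # tensor([0.0000, 0.6000, 1.0000, 1.8000, 0.0000])
```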
### Key Parameters

- `rollout_is_threshold`: Upper threshold for IS weights (`null` = disabled, float = enabled). Main on/off switch.
- `rollout_is`: Whether to apply weights to the loss (`true`) or only compute metrics (`false`). Default: `false`.
- `rollout_is_threshold_lower`: Lower threshold (`null` = auto-reciprocal, `1/upper`)
- `rollout_is_veto_threshold`: Catastrophic outlier threshold (default: `1e-4`)
## Configuration Examples

### Example 1: Full IS Correction (Apply Weights)

```yaml
algorithm:
  rollout_is_threshold: 2.0
  rollout_is: true  # Apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4
```
### Example 2: Metrics Only (No Weight Application)

```yaml
algorithm:
  rollout_is_threshold: 2.0
  rollout_is: false  # Compute metrics only, don't apply to loss
  rollout_is_level: token
  rollout_is_mode: truncate
```
### Example 3: Geometric Mean with Mask

```yaml
algorithm:
  rollout_is_threshold: 1.0002
  rollout_is: true
  rollout_is_threshold_lower: 0.9998
  rollout_is_level: geometric
  rollout_is_mode: mask
  rollout_is_veto_threshold: 1e-4
```
### Example 4: Sequence-Level with Truncate

```yaml
algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: null  # Auto-reciprocal: 0.2
  rollout_is_level: sequence
  rollout_is_mode: truncate
  rollout_is_veto_threshold: 1e-4
```
### Example 5: Asymmetric Thresholds

```yaml
algorithm:
  rollout_is_threshold: 5.0
  rollout_is: true
  rollout_is_threshold_lower: 0.8
  rollout_is_level: token
  rollout_is_mode: mask
```
## Monitoring Metrics

Key metrics to watch (all prefixed with `mismatch/` in logs):

### Health Indicators

- `rollout_is_mean`: Mean IS weight across sequences
- `rollout_is_eff_sample_size`: Effective sample size after weighting (see the sketch below)
- `rollout_is_veto_fraction`: Fraction of sequences vetoed
### Distribution Metrics

- `rollout_is_max`, `rollout_is_min`: Weight extremes
- `rollout_is_std`: Standard deviation of the weights
### Diagnostic Metrics

- `rollout_is_ratio_fraction_high`: Fraction of ratios exceeding the upper threshold
- `rollout_is_ratio_fraction_low`: Fraction of ratios below the lower threshold
- `rollout_is_catastrophic_token_fraction`: Fraction of catastrophic tokens detected
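For reference, a sketch of an effective-sample-size computation (this assumes `rollout_is_eff_sample_size` follows the standard normalized ESS formula; the authoritative definition is in `verl/trainer/ppo/mismatch_helper.py`):

```python
import torch

def normalized_ess(weights: torch.Tensor) -> torch.Tensor:
    # Standard effective sample size, (sum w)^2 / sum(w^2), normalized by the
    # number of weights: 1.0 for uniform weights, approaching 1/N when a single
    # weight dominates.
    return weights.sum() ** 2 / (weights.pow(2).sum() * weights.numel())

weights = torch.tensor([0.9, 1.1, 1.0, 3.0])
print(normalized_ess(weights))  # ~0.75: uneven weights reduce the effective sample size
```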
### Mismatch Metrics (Training vs. Rollout Policy)

These metrics help diagnose the distribution mismatch between the rollout and training policies.

**Perplexity metrics:**

- `mismatch_training_ppl`: Perplexity of the training policy
- `mismatch_rollout_ppl`: Perplexity of the rollout policy
- `mismatch_ppl_ratio`: Ratio of training PPL to rollout PPL
- `mismatch_log_ppl_diff`: Log-perplexity difference
**KL divergence metrics:**

- `mismatch_kl`: KL divergence KL(π_rollout || π_training)
- `mismatch_k3_kl`: K3 KL estimator
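For intuition, a sketch of how such estimators are typically computed from token log probabilities on rollout samples (illustrative only; it assumes `mismatch_kl` is the sample average of `log π_rollout − log π_training` and `mismatch_k3_kl` is the k3 estimator; the authoritative definitions are in `verl/trainer/ppo/mismatch_helper.py`):

```python
import torch

# Token log probs on tokens sampled from the rollout policy.
rollout_log_probs = torch.tensor([-1.2, -0.8, -2.1])
training_log_probs = torch.tensor([-1.1, -0.9, -2.0])

# Naive estimator of KL(pi_rollout || pi_training): E_rollout[log pi_rollout - log pi_training].
kl = (rollout_log_probs - training_log_probs).mean()

# k3 estimator (lower variance, always non-negative): E_rollout[(r - 1) - log r]
# with r = pi_training / pi_rollout.
log_r = training_log_probs - rollout_log_probs
k3_kl = (torch.exp(log_r) - 1.0 - log_r).mean()

print(kl, k3_kl)
```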
## Troubleshooting

### Issue: High Variance in IS Weights

**Symptoms:** `rollout_is_std` > 1.0, `rollout_is_eff_sample_size` < 0.3
**Solutions:**

- Switch from `sequence` to `geometric` level
- Tighten the thresholds
- Check whether the rollout and training policies are too different
### Issue: Too Many Sequences Vetoed

**Symptoms:** `rollout_is_veto_fraction` > 0.1

**Solutions:**

- Relax the veto threshold: `rollout_is_veto_threshold: 1e-3`
- Check for numerical issues in the log prob computation
- Verify that the rollout and training policies aren't completely different
### Issue: Mean IS Weight Far from 1.0

**Symptoms:** `rollout_is_mean` < 0.5 or > 2.0

**Solutions:**

- Check that `calculate_log_probs: true` is set
- Verify that `rollout_log_probs` are correctly passed
- Check for systematic bias between the rollout and training policies
### Issue: Too Much Data Discarded (Mask Mode)

**Symptoms:** `rollout_is_masked_fraction` > 0.5

**Solutions:**

- Widen the thresholds
- Switch to `truncate` mode
- Use the `geometric` level for better stability
## Performance Considerations

### Memory Usage

- Rollout IS adds minimal memory overhead (~1% of model memory)
- Log-space computation prevents numerical overflow

### Computational Cost

- Token-level: ~1-2% overhead
- Sequence-level: ~2-3% overhead
- Geometric: ~2-3% overhead
## Advanced Topics

### Dual Thresholds

Specify both the upper and lower thresholds explicitly:

```yaml
rollout_is_threshold: 2.0        # Upper
rollout_is_threshold_lower: 0.5  # Lower, set explicitly (need not equal 1/upper)
```

Or use auto-reciprocal:

```yaml
rollout_is_threshold: 2.0        # Upper = 2.0, lower = 0.5 (auto)
rollout_is_threshold_lower: null
```
### Veto Mechanism

The veto mechanism zeros out entire sequences that contain catastrophic outliers (a minimal sketch follows the list):

- If any token has a ratio below `rollout_is_veto_threshold`, the entire sequence is rejected
- This prevents extreme outliers from dominating training
- Default threshold: `1e-4` (a ratio that is 10,000x off)
- Set to `null` to disable: `rollout_is_veto_threshold: null`
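A minimal sketch of this behavior (illustrative code, not the verl implementation):

```python
import torch

# Per-token IS ratios for two sequences, shape [batch, seq_len].
ratios = torch.tensor([
    [0.9, 1.1, 1.0],    # healthy sequence
    [1.0, 5e-5, 1.2],   # contains a catastrophic token (ratio < 1e-4)
])
veto_threshold = 1e-4

# A sequence is kept only if none of its tokens fall below the veto threshold.
keep = (ratios >= veto_threshold).all(dim=-1, keepdim=True).float()
vetoed_ratios = ratios * keep

print(keep.squeeze(-1))  # tensor([1., 0.])
print(vetoed_ratios)     # second sequence is zeroed out entirely
```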
## Examples

See the script in this directory:

- `run_with_rollout_is.sh`: Basic example with token-level truncate mode
## References

- Implementation: `verl/trainer/ppo/mismatch_helper.py`
- Core algorithm: `verl/trainer/ppo/core_algos.py`
- Paper: "Your Efficient RL Framework Secretly Brings You Off-Policy RL Training"