mirror of
https://github.com/volcengine/verl.git
synced 2025-10-20 13:43:50 +08:00
## Summary Fixes #3787 by removing `torch.quantile()`-based percentile metrics (`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused `RuntimeError: quantile() input tensor is too large` when using large batch sizes or response lengths. ## Problem When using configurations with large tensor sizes (e.g., `max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`), the `torch.quantile()` function fails with a runtime error due to PyTorch's internal tensor size limitations (~2^24 to 2^27 elements depending on version, GPU memory, and dtype). The error occurred in `verl/trainer/ppo/mismatch_helper.py`: ```python metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25) metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50) metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75) ``` ## Solution Removed the three quantile-based percentile metrics from the Rollout IS framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`, `rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.) provide sufficient monitoring capabilities for importance sampling health without triggering tensor size limitations. ## Changes - **Modified**: [verl/trainer/ppo/mismatch_helper.py](verl/trainer/ppo/mismatch_helper.py) - Removed `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations - All other rollout IS and mismatch metrics remain functional ## Testing Verified that: - Rollout IS framework continues to function correctly without percentile metrics - No runtime errors with large tensor configurations - All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are computed correctly Resolves #3787