verl/trainer at main

mirror of https://github.com/volcengine/verl.git synced 2025-10-20 13:43:50 +08:00

Yingru Li 4f1c489e45 [algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error (#3810)

## Summary

Fixes #3787 by removing `torch.quantile()`-based percentile metrics
(`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused
`RuntimeError: quantile() input tensor is too large` when using large
batch sizes or response lengths.

## Problem

When using configurations with large tensor sizes (e.g.,
`max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`),
the `torch.quantile()` function fails with a runtime error due to
PyTorch's internal tensor size limitations (~2^24 to 2^27 elements
depending on version, GPU memory, and dtype).

The error occurred in `verl/trainer/ppo/mismatch_helper.py`:
```python
metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)
```

## Solution

Removed the three quantile-based percentile metrics from the Rollout IS
framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`,
`rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.)
provide sufficient monitoring capabilities for importance sampling
health without triggering tensor size limitations.
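
As a rough illustration of what the remaining metrics can capture, the sketch below computes quantile-free summary statistics over a flattened weight tensor. The function name `summarize_is_weights`, the epsilon, and the effective-sample-size formula (the reciprocal of the mean squared, mean-normalized weight) are assumptions made for illustration and are not necessarily what `mismatch_helper.py` does.

```python
import torch


def summarize_is_weights(flat_weights: torch.Tensor) -> dict[str, float]:
    """Hypothetical helper: summary statistics that never call torch.quantile()."""
    w = flat_weights.float()
    # Normalize by the mean so the ESS ratio is scale-invariant (assumed formula).
    normalized = w / (w.mean() + 1e-8)
    return {
        "rollout_is_mean": w.mean().item(),
        "rollout_is_std": w.std().item(),  # unbiased sample standard deviation
        "rollout_is_min": w.min().item(),
        "rollout_is_max": w.max().item(),
        # Equals 1.0 when all weights are equal and shrinks as they become uneven.
        "rollout_is_eff_sample_size": (1.0 / normalized.pow(2).mean()).item(),
    }


if __name__ == "__main__":
    print(summarize_is_weights(torch.tensor([1.0, 3.0])))
```

All of these are plain reductions rather than `torch.quantile()` calls, so they avoid the code path that raised the error.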

## Changes

- **Modified**: [verl/trainer/ppo/mismatch_helper.py](verl/trainer/ppo/mismatch_helper.py)
  - Removed `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations
  - All other rollout IS and mismatch metrics remain functional

## Testing

Verified that:
- Rollout IS framework continues to function correctly without
percentile metrics
- No runtime errors with large tensor configurations
- All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are
computed correctly

Resolves #3787

2025-10-20 13:04:57 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| config | [trainer] fix: Add data.seed to config (#3815) | 2025-10-20 09:57:14 +08:00 |
| ppo | [algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error (#3810) | 2025-10-20 13:04:57 +08:00 |
| `__init__.py` | [ci] feat: pre-commit check all the files by default (#2017) | 2025-06-14 14:22:17 +08:00 |