66 Commits

Author SHA1 Message Date
bc7c86398c [misc] feat: create issue template for verl (#3330) 2025-09-03 20:45:20 +08:00
21b99ed741 [misc] feat: Added: "tensorboard" to the requirements.txt (#2900)
### What does this PR do?

> This PR adds tensorboard as a dependency to requirements.txt file,
across several Dockerfiles (Dockerfile.ngc.vllm, Dockerfile.ngc.vllm0.8,
Dockerfile.ngc.vllm0.8.sagemaker), a setup script
(install_vllm_sglang_mcore.sh), and the main setup.py file. This change
ensures that the tensorboard package is consistently installed, enabling
visualization of training metrics for various configurations and
deployment environments. This is a maintenance task that enhances the
project's observability without altering core functionality.

### Test

> This change is a dependency update and doesn't require specific
testing beyond confirming the installation is successful.

### API and Usage Example

> No API changes are introduced. The usage of TensorBoard would be
initiated by the user after installing the requirements.

```python
# No code snippet is applicable for this change
```
2025-08-08 22:39:53 +08:00
e2b773528f [megatron] feat: Add MindSpeed support on the NPU device (#2707)
### What does this PR do?

Add MindSpeed (Megatron) support on NPU devices.
The Megatron adapter is imported first to avoid import errors, and the
patch is then reapplied according to the configuration.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
2025-08-01 10:58:29 +08:00
4879d619fc [docker] feat: upgrade to torch 2.7, sglang 0.4.8 (#2617)
### What does this PR do?

[docker] feat: upgrade to torch 2.7, sglang 0.4.8

Stage 2: vllm 0.9.1
Stage 3: mcore 0.13.0

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-07-24 14:53:24 -07:00
ac826e0558 [tool] chore: Add log for AsyncRolloutRequest ID, and rollout viewr to support request id display and search (#2636)
### What does this PR do?

Add logging of the AsyncRolloutRequest ID in the PPO ray_trainer and
sglang_rollout. Update the rollout viewer to support request ID display
and search.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failur
2025-07-21 12:01:37 +08:00
69a467f934 [docker] fix: downgrade TransformerEngine version 2.2.1 to allow mcore image using rope fusion and provide another set of v0.5 image (#2611)
### What does this PR do?

Downgrade the TransformerEngine version so that the mcore image can use
RoPE fusion, and provide another set of v0.5 images.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-18 17:23:19 +08:00
2b2aa9d3fd [tool] chore: introduce RolloutViewer TUI tools (#2469)
### What does this PR do?

Introduce a RolloutViewer TUI tool for easily visualizing dumped rollout
and reward results. It supports:

- async data loading, so files open very quickly
- ⌨️ full keyboard shortcut operation, so no mouse is needed
- 🔍 text search and highlighting, so nothing is missed
- 📝 table or plain mode

usage:

```bash
python scripts/rollout_viewer.py ${trainer.rollout_data_dir}
```

Here is a screenshot of the main window:

<img width="2540" height="1416" alt="image"
src="https://github.com/user-attachments/assets/e34e5157-2880-4a21-afb2-73885d0dfb11"
/>



> We are from the Large Model Post-Training Team of 📕 Xiaohongshu's AI
Platform Technology Department, dedicated to developing
high-performance, easily scalable distributed post-training engines.


### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-17 13:30:41 -07:00
332c7d53c1 [cfg] refactor: add flatten megatron trainer config generation and verification script (#2582)
### What does this PR do?

- Added CONFIG_SPECS array: "config_name:output_file:config_arg" format
- Now generates both _generated_ppo_trainer.yaml and
_generated_ppo_megatron_trainer.yaml
- Maintains identical output format and verification behavior
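
The `"config_name:output_file:config_arg"` format can be illustrated with a small parser. This is only a sketch in Python (the actual script in the PR may be written differently), and the entries in `CONFIG_SPECS` below are hypothetical: only the two output file names come from the description above, while the config names and the argument string are assumptions made for illustration.

```python
from typing import NamedTuple


class ConfigSpec(NamedTuple):
    """One parsed "config_name:output_file:config_arg" entry."""

    config_name: str
    output_file: str
    config_arg: str


def parse_config_spec(spec: str) -> ConfigSpec:
    """Split a spec string into its three colon-separated fields."""
    # maxsplit=2 keeps any further colons inside the last field.
    parts = spec.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 colon-separated fields, got {spec!r}")
    return ConfigSpec(*parts)


# Hypothetical entries: config names and the argument are assumptions.
CONFIG_SPECS = [
    "ppo_trainer:_generated_ppo_trainer.yaml:",
    "ppo_megatron_trainer:_generated_ppo_megatron_trainer.yaml:--config-name=ppo_megatron_trainer.yaml",
]

specs = [parse_config_spec(s) for s in CONFIG_SPECS]
for spec in specs:
    print(spec.config_name, "->", spec.output_file)
```

Keeping each entry as one delimited string makes it easy to add another generated file by appending a single line to the array.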

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`


### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
2025-07-17 08:08:45 -07:00
6e21c0a625 [megatron] feat: support distributed megatron model converter and merger (#2281)
### What does this PR do?


- support distributed mcore model converter and merger, especially for
huge models like dpskv3 671B
- fix model merger bugs for dpskv3, related to
https://github.com/volcengine/verl/pull/2125

background:
https://github.com/volcengine/verl/pull/2125#issuecomment-2993276556
<img width="1189" height="371" alt="image"
src="https://github.com/user-attachments/assets/a317b928-963a-41e5-ae81-d4b6aa669516"
/>


> We are from the Large Model Post-Training Team of 📕 Xiaohongshu's AI
Platform Technology Department, dedicated to developing
high-performance, easily scalable distributed post-training engines.


### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-16 13:36:33 +08:00
d0c7bbbc05 [cfg] refactor: support +extra.any_key usage for the base dataclass config in verl (#2502)
### What does this PR do?

This PR updates the base config in verl:
- support `+extra.any_key` usage for the base config in verl
- allow selected subfields to be frozen
- add an auto-generated config YAML file,
`verl/trainer/config/_generated_ppo_trainer.yaml`, for reference, in
case the nested inheritance structure leaves the config information too
scattered

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

- added frozen field tests

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

Now you can pass `--xx.profiler.extra.any_new_key=any_plain_value` on
the command line to a dataclass inheriting from `verl.BaseConfig`. This
way we can still pass dataclass configs inside verl while allowing some
flexibility in accepting new keys for users' ad hoc usage.
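
The idea of a typed config that still accepts arbitrary keys under `extra` can be sketched as follows. This is a minimal illustration, not verl's implementation: `SketchConfig`, its `tool` field, and `apply_override` are hypothetical names, and the real `verl.BaseConfig` and its command-line handling may work differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config dataclass; not verl's BaseConfig."""

    tool: str = "default"  # hypothetical declared field
    extra: dict[str, Any] = field(default_factory=dict)


def apply_override(cfg: SketchConfig, override: str) -> None:
    """Apply one "key=value" override; "extra.<name>" keys land in cfg.extra."""
    key, _, value = override.lstrip("+").partition("=")
    declared = {f.name for f in fields(cfg)} - {"extra"}
    if key.startswith("extra."):
        # Any key is accepted under the "extra" namespace.
        cfg.extra[key[len("extra."):]] = value
    elif key in declared:
        setattr(cfg, key, value)
    else:
        # Keys outside "extra" must match a declared field.
        raise KeyError(f"unknown config key: {key}")


cfg = SketchConfig()
apply_override(cfg, "+extra.any_new_key=any_plain_value")
print(cfg.extra)
```

Declared fields stay strictly checked, so a typo in a regular key still fails, while ad hoc keys are confined to one dictionary.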


### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Lin <haibin@Lins-Laptop.hsd1.wa.comcast.net>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-15 09:06:56 +08:00
92758d681c [env] fix: Change the permissions of install_vllm_sglang_mcore.sh from 644 to 755 to allow execution (#2508)
### What does this PR do?

I followed the instructions at
https://verl.readthedocs.io/en/latest/start/install.html#install-dependencies
to install verl. The guide asks me to run the script
`scripts/install_vllm_sglang_mcore.sh`, but its permission is set to
644.

```
# Make sure you have activated verl conda env
# If you need to run with megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
```
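
For reference, the difference between modes 644 and 755 can be checked with a short Python sketch. It works on a temporary stand-in file (the name `install_demo.sh` is hypothetical) rather than the real script, and it assumes a POSIX file system where permission bits are honored.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for scripts/install_vllm_sglang_mcore.sh.
    script = os.path.join(tmp, "install_demo.sh")
    with open(script, "w") as f:
        f.write("#!/bin/bash\necho ok\n")

    # 644: owner read/write, group and others read-only, nobody can execute.
    os.chmod(script, 0o644)
    mode_before = stat.S_IMODE(os.stat(script).st_mode)
    executable_before = bool(mode_before & stat.S_IXUSR)

    # 755: adds the execute bit for owner, group, and others.
    os.chmod(script, 0o755)
    mode_after = stat.S_IMODE(os.stat(script).st_mode)
    executable_after = bool(mode_after & stat.S_IXUSR)

print(oct(mode_before), executable_before)
print(oct(mode_after), executable_after)
```

Running the script through `bash scripts/...` works even with mode 644 because `bash` only needs to read the file; the execute bit matters when the script is invoked directly as `./scripts/...`.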

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Here are the steps I followed to update the permission.
```sh
(verl) ➜  verl git:(20250713-devbox-2-tmux0-verl) ✗ ./scripts/install_vllm_sglang_mcore.sh
zsh: permission denied: ./scripts/install_vllm_sglang_mcore.sh
(verl) ➜  verl git:(20250713-devbox-2-tmux0-verl) ✗ ll scripts/install_vllm_sglang_mcore.sh
-rw-rw-r-- 1 ubuntu ubuntu 2.4K Jul 13 05:04 scripts/install_vllm_sglang_mcore.sh
(verl) ➜  verl git:(20250713-devbox-2-tmux0-verl) ✗ chmod +x scripts/install_vllm_sglang_mcore.sh
(verl) ➜  verl git:(20250713-devbox-2-tmux0-verl) ✗ ./scripts/install_vllm_sglang_mcore.sh
1. install inference frameworks and pytorch they need
Looking in links: https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python
Collecting sglang==0.4.6.post1 (from sglang[all]==0.4.6.post1)
...
```

### API and Usage Example

No

### Design & Code Changes

No

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2025-07-13 15:36:11 -07:00
eac4863ad7 [env] feat: safely bump py version to 3.10 (#2421)
### What does this PR do?

This PR safely bumps the Python version to 3.10 for two reasons:
1. [`removeprefix`](https://docs.python.org/3.9/whatsnew/3.9.html#new-string-methods-to-remove-prefixes-and-suffixes)
   was introduced in Python 3.9
   588f9728f3/verl/single_controller/ray/base.py (L498-L505)
2. [`match`](https://docs.python.org/3.10/whatsnew/3.10.html#simple-pattern-match-to-a-literal)
   was introduced in Python 3.10
   588f9728f3/verl/tools/utils/tool_registry.py (L81-L92)



### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`


### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-12 16:29:39 -07:00
00a10a8ef3 [ci] refactor: reduce ruff line-length from 300 to 120 (#2287)
### What does this PR do?

Previously the ruff line length was too large, making the code hard to
read. If we kept that config, manually created short lines would be
reformatted into long lines as well. This PR contains 3 commits:
- df4bbfca62f41d972c48c8a76088ae2ac29691cf set the line length to 120
and run pre-commit auto-format
- 9d03f183edd9fff4e22215cacacf62c06b7b41d3 let Devin fix the multi-line
code
- 9fc8d436f5007535fad3dc49983b01d0d457be9c skip lint for
test_sglang_async_rollout_sf_tools.py and manually adjust the format of
rope_utils.py
- last two commits:
  1. merge with main
  2. run lint after merge; add test_sglang_async_rollout_sf_tools.py and
     scripts/legacy_model_merger.py to lint.exclude

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

This PR relies on CI for testing.


### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2025-07-01 09:54:40 +08:00
024a8b8578 [ckpt, doc] chore: add backward compatibility for model merger and sync docs (#2251)
### What does this PR do?

This PR adds the doc changes missing from
https://github.com/volcengine/verl/pull/2125:
- Synchronize the checkpoint content and verl.model_merger docs with the
latest code
- Add a section on merging checkpoints to the quick start documentation

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-06-30 18:42:59 +08:00
3b3e597042 [megatron] feat: Support of dist checkpoint (#2125)
### Checklist Before Starting

- [ ] Searched for similar PR(s).
- [ ] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore, test`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

Support dist checkpoint in saving, loading, and the model merger.

### Test

Algorithm:

<img width="783" alt="image"
src="https://github.com/user-attachments/assets/9a200b47-5937-426a-8da6-c601d2d8328f"
/>

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: H <linhaibin.eric@gmail.com>
2025-06-25 17:17:29 +08:00
615f5f1461 [megatron] fix: dpskv3 convert src and dst mixed up bug (#2029)
### Checklist Before Starting

- [x] Searched for similar PR(s).
- [x] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

- Fix the DeepseekV3 conversion bug introduced in
https://github.com/volcengine/verl/pull/1995, which mixed up the `src`
and `dst` parameters of the function `safe_copy`. Apologies for my mistake.
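
One way to make this kind of argument mix-up harder to reintroduce is to make the copy helper keyword-only. The sketch below is hypothetical: it is not the `safe_copy` from the converter script, and plain lists stand in for tensors. It only illustrates the idea of forcing call sites to name `src` and `dst`.

```python
def safe_copy(*, src: list, dst: list) -> None:
    """Copy values from src into dst in place, refusing mismatched sizes.

    Hypothetical stand-in: keyword-only arguments force every call site
    to spell out which side is the source and which is the destination.
    """
    if len(src) != len(dst):
        raise ValueError(f"size mismatch: {len(src)=} {len(dst)=}")
    dst[:] = src  # in-place update, so other references to dst see the new values


hf_weight = [1.0, 2.0, 3.0]     # stands in for a loaded HF tensor
mcore_weight = [0.0, 0.0, 0.0]  # stands in for an uninitialized mcore parameter

safe_copy(src=hf_weight, dst=mcore_weight)
```

With this signature, a positional call such as `safe_copy(hf_weight, mcore_weight)` raises a `TypeError` instead of silently copying in the wrong direction.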

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.
2025-06-16 10:28:15 +08:00
c3ffce26d1 [ci] feat: pre-commit check all the files by default (#2017)
### Checklist Before Starting

- [x] Searched for similar PR(s).
- [x] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc`
  - type is in `feat, fix, refactor, chore`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

We found that the linting errors in most files have been fixed, so it may be
time to check all the files by default.

This PR

1. fixes the remaining linting errors
(4409ad0070aa11027e13e26c469d46c63cdab7fb)
2. sets the pre-commit to check all the files by default
(4c30c2bb99ffec50b038c2a7ff34e28062d7a168)

> [!NOTE]
> **About merging / rebasing overhead**
> As before, contributors only need to merge / rebase the
files they have changed, so the overhead should be acceptable.

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] New CI unit test(s) are added to cover the code path.
- [x] Rely on existing unit tests on CI that covers the code path.
2025-06-14 14:22:17 +08:00
6681e25ff4 [ckpt] fix: run converter_hf_to_mcore with --test will raise an AttributeError (#2010)
### Checklist Before Starting

- [x] Searched for similar PR(s).
- [x] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc`
  - type is in `feat, fix, refactor, chore`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

> When I convert an hf ckpt to mcore with --test, an AttributeError
is raised. This PR fixes it.

```sh
[rank0]:   File "verl/scripts/converter_hf_to_mcore.py", line 305, in convert_hf_to_mcore
[rank0]:     test_conversion(megatron_model_provider, tfconfig, output_path, model)
[rank0]:   File "verl/scripts/converter_hf_to_mcore.py", line 78, in test_conversion
[rank0]:     assert dut_data.shape == ref_state_dict.shape, f"{name=} {dut_data.shape=} {ref_data.shape=}"
[rank0]: AttributeError: 'dict' object has no attribute 'shape'
```
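
The traceback suggests that the assertion read `.shape` from the whole reference state dict rather than from one entry of it. That reading is an inference from the traceback, not from the PR's diff. A minimal sketch of the corrected comparison pattern follows, with a hypothetical `FakeTensor` standing in for real tensors:

```python
from collections import namedtuple

# Hypothetical stand-in for a tensor: only the .shape attribute matters here.
FakeTensor = namedtuple("FakeTensor", ["shape"])


def check_shapes(dut_state_dict, ref_state_dict):
    """Return the parameter names whose shapes differ between two state dicts."""
    mismatched = []
    for name, dut_data in dut_state_dict.items():
        ref_data = ref_state_dict[name]  # index the dict first
        if dut_data.shape != ref_data.shape:  # compare entry to entry, not entry to dict
            mismatched.append(name)
    return mismatched


dut = {"w": FakeTensor((4, 8)), "b": FakeTensor((8,))}
ref = {"w": FakeTensor((4, 8)), "b": FakeTensor((4,))}
print(check_shapes(dut, ref))  # prints ['b']
```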

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Co-authored-by: lixiaoguang12 <lixiaoguang12@meituan.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
2025-06-14 00:45:24 +08:00
ffeaed8c41 [megatron] feat: robust and efficient mcore converter with meta device init and numel check for dpsk (#1995)
### Checklist Before Starting

- [x] Searched for similar PR(s).
- [x] Checked PR Title format
  - [ ] In format of: [modules] type: Title
- [ ] modules are in `fsdp, megatron, sglang, vllm, rollout, trainer,
tests, training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc`
  - [ ] type is in `feat, fix, refactor, chore`
- [ ] can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

- `DeepseekV3` is too large to load and init weights directly, so `meta device`
initialization is a better approach.
- Accumulate numel to check that no model weights are missed.
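
The numel check can be sketched as follows. This is an illustration under stated assumptions, not the converter's code: the function name, the `name_map` argument, and the use of shape tuples in place of tensors are all hypothetical.

```python
import math


def convert_with_numel_check(src_shapes, name_map):
    """Map source parameters to destination names and verify nothing was skipped.

    src_shapes: source parameter name -> shape tuple (stand-in for tensors).
    name_map:   source parameter name -> destination parameter name.
    """
    expected = sum(math.prod(shape) for shape in src_shapes.values())
    copied = 0
    dst_shapes = {}
    for src_name, dst_name in name_map.items():
        shape = src_shapes[src_name]
        dst_shapes[dst_name] = shape
        copied += math.prod(shape)  # accumulate elements as each weight is handled
    if copied != expected:
        raise RuntimeError(f"numel mismatch: {copied=} {expected=}")
    return dst_shapes


shapes = {"embed.weight": (8, 4), "norm.weight": (4,)}
converted = convert_with_numel_check(
    shapes, {"embed.weight": "embedding", "norm.weight": "final_norm"}
)
```

If a weight is left out of the mapping, the accumulated count falls short of the total and the conversion fails loudly instead of producing an incomplete checkpoint.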

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.
2025-06-13 23:32:17 +08:00
f880ec4c72 [ckpt] feat: model_merger.py support processing checkpoints with LoRA adapters (#1821) 2025-06-10 20:29:16 +08:00
85fef90d51 [megatron] feat: qwen2.5vl (#1286)
works with qwen2.5vl 3b + geo3k


<img width="1148" alt="image"
src="https://github.com/user-attachments/assets/87c8746c-7f40-4189-9e82-eb1b459669f8"
/>
<img width="1143" alt="image"
src="https://github.com/user-attachments/assets/58bce88d-c53e-45a2-b89c-bfacf4ae9e85"
/>
<img width="1503" alt="image"
src="https://github.com/user-attachments/assets/284ef5c6-2057-4a73-ad56-bed2ef0ece43"
/>
2025-06-10 15:38:16 +08:00
cc9bc3fc21 [bugfix] fix megatron model merger (#1774)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Fix megatron model merger.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Fix get rank method to support just TP.
- Fix state_dict keys after convert.
- Add mla/moe convert support.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

Test with Qwen3-8B and Qwen2.5-7B.

### Additional Info.

- **Issue Number**: Fixes issue #1757
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Signed-off-by: ShareLer <ShareLe@163.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
2025-06-09 13:28:24 +08:00
2a386cf0e9 [BugFix][CI] Megatron: add ep CI (#1726)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix an ep bug and try to add CI with a 15B model; still looking for smaller
models that are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-06-05 14:02:00 +08:00
0e127b208b chore: fix typos across codebase (#1805)
Fixed typos across codebase.
2025-06-02 21:05:07 +08:00
be47ac44b2 [mcore] moonlight (small model with deepseekv3 arch) (#1284)
Achieves 74.3 on gsm8k, while Moonlight reported 77.4.

Still WIP on the performance gap.
2025-05-28 17:10:29 +08:00
3d5f15fa9a [fix] use correct variable for saving hf model (#1681) 2025-05-25 18:49:43 +08:00
54a5e6ee6d [megatron] feat: save hf model config in megatron checkpoint manager (#1562)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR enables the Megatron backend checkpoint manager to save the hf model
config into verl checkpoints, and simplifies our CI now that
`--hf_model_path` has been deprecated in
https://github.com/volcengine/verl/pull/1468. It also addresses the comment
https://github.com/volcengine/verl/pull/1468#issuecomment-2883541227.

Note: several changed lines in `verl/utils/megatron_utils.py` are
unrelated to this PR; they were automatically reformatted by pre-commit
hooks.

### Test

The current CI e2e tests should sufficiently cover this PR.

### Additional Info.

- **Training**: Megatron
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-23 14:50:48 +08:00
1cfa2be530 [Megatron][BREAKING] Allow override of transformer config to enable custom megatron features like variable PP layers distribution, with CI tests (#1555)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Allow overriding the transformer config to enable custom Megatron
features such as variable PP layer distribution, with CI tests. This is
needed for larger MoE models with 94 layers (Qwen3 MoE) or 61 layers
(DeepSeek V3).

We will first fix the e2e_prime CI by using fused kernels.

**Note that the imbalanced PP layer distribution is currently only compatible
with dist_ckpt load and save; direct huggingface load/save is not
supported.**

Also, other megatron arguments can be passed through scripts.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

Breaking APIs:

```py
class MegatronWorker(Worker):
    def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):

# and the models building
```

```yaml
  actor:
    megatron:
      override_transformer_config: {} # common transformer config for all models
```

To avoid having to enter the same transformer config arguments repeatedly, other
models reuse the actor's config, so it only needs to be specified once.

### Usage Example

```bash
run_ppo_trainer_megatron.sh \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```
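
To see what the two overrides above imply for the other stages, here is a small sketch of the arithmetic. It assumes the layers not pinned to the first or last stage are split evenly across the remaining stages; the helper name and the 9-stage pipeline are illustrative assumptions, not values from this PR.

```python
def pipeline_layer_counts(num_layers, pp_size, first=None, last=None):
    """Return the number of transformer layers on each pipeline stage.

    `first` / `last` pin the layer counts of the first and last stages;
    the remaining layers must divide evenly across the remaining stages.
    """
    if pp_size < 2 and (first is not None or last is not None):
        raise ValueError("first/last overrides need at least two pipeline stages")
    counts = [None] * pp_size
    if first is not None:
        counts[0] = first
    if last is not None:
        counts[-1] = last
    free_stages = counts.count(None)
    free_layers = num_layers - sum(c for c in counts if c is not None)
    if free_stages == 0:
        if free_layers != 0:
            raise ValueError("first + last must equal num_layers when pp_size is 2")
        return counts
    if free_layers <= 0 or free_layers % free_stages != 0:
        raise ValueError(f"{free_layers} layers cannot be split evenly over {free_stages} stages")
    per_stage = free_layers // free_stages
    return [per_stage if c is None else c for c in counts]


# 94 layers (the Qwen3 MoE depth mentioned above) on an assumed 9-stage pipeline.
print(pipeline_layer_counts(94, 9, first=13, last=11))
# prints [13, 10, 10, 10, 10, 10, 10, 10, 11]
```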

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-22 13:38:34 +08:00
7b0426a738 [Docker Image] update images and fix sglang installation (#1606)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

update images and fix sglang installation, the latest image:
`whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3`

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- vLLM: 0.8.5.post1
- SGLang: 0.4.6.post4, fix installation
- Megatron: core_v0.12.0 announcement
- TransformerEngine: 2.3

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-21 09:13:51 +08:00
530154e153 [merger] fix: avoid setting torch's global device to meta (#1564)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR fixes several issues
(https://github.com/volcengine/verl/issues/1484,
https://github.com/volcengine/verl/issues/1255) that cause the error:
"Cannot copy out of meta tensor; no data!".

The related code in our part is:

d36b5e81d6/scripts/model_merger.py (L131-L132)

The `torch.device("meta")` context manager sets the current global torch
device to "meta". During `auto_model_class.from_config`, various import
statements load third-party libraries, whose `__init__.py` files may
contain global statements that use torch for calculations.

For example, transformers imports
[torchao](5549da8af9/torchao/optim/subclass_4bit.py (L33)),
which executes the following during initialization:

```python
QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()  # no zero
```

In this case, when using the `torch.device("meta")` context manager,
`torch.linspace(0, 1, 17)` gets created on the meta device, which only
assigns metadata and cannot be moved to CPU. This causes the `.tolist()`
call to fail with the error "Cannot copy out of meta tensor; no data!"

To fix this, we're now using `init_empty_weights` from `accelerate`,
which patches `nn.Module.register_parameter` instead of patching torch's
global device
(417bc52965/src/accelerate/big_modeling.py (L96-L170)),
thus avoiding this issue.

Here's a simple illustration:

```python
>>> import torch
>>> from accelerate import init_empty_weights
>>> with init_empty_weights():
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
... 
>>> QMAP_UNSIGNED
[0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375, 1.0]
>>> with torch.device("meta"):
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

cc @ETOgaosion 

### Additional Info.

- **Issue Number**: Fixes issue
https://github.com/volcengine/verl/issues/1484,
https://github.com/volcengine/verl/issues/1255,
https://github.com/volcengine/verl/pull/1468#issuecomment-2886345570
- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-18 19:17:27 +08:00
d36b5e81d6 Add missing fi to install script (#1559) 2025-05-18 11:15:57 +08:00
b8bd596811 [Docker Image] use latest vLLM (0.8.5) to fully support Qwen3 moe (#1544) 2025-05-17 07:28:55 +08:00
3f4647f9bc [model merger] refactor model merger for better usage and maintainability (#1468)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
https://github.com/volcengine/verl/pull/1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.
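
The split into two sub-commands can be sketched with `argparse` subparsers. The sub-command and flag names below are illustrative assumptions, not a statement of the script's actual CLI.

```python
import argparse


def build_parser():
    """Build a parser with separate `merge` and `test` sub-commands (names assumed)."""
    parser = argparse.ArgumentParser(description="merge or test checkpoints (illustrative)")
    subparsers = parser.add_subparsers(dest="operation", required=True)

    # Options shared by both sub-commands live on a parent parser.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--backend", choices=["fsdp", "megatron"], required=True)
    common.add_argument("--local_dir", required=True)

    # `merge` only knows about the output location.
    merge = subparsers.add_parser("merge", parents=[common])
    merge.add_argument("--target_dir", default="tmp")

    # `test` only knows about the reference model to compare against.
    test = subparsers.add_parser("test", parents=[common])
    test.add_argument("--test_hf_dir", required=True)
    return parser


args = build_parser().parse_args(
    ["merge", "--backend", "fsdp", "--local_dir", "ckpt/actor", "--target_dir", "out"]
)
```

Each sub-command only accepts the arguments that make sense for it, so a user merging checkpoints never sees the test-only options.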

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-16 23:53:08 +08:00
H
771bd756b3 [misc] docs: move dev folder to scripts. add sandbox documentation to index.rst. (#1539)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

- move dev folder to scripts @ETOgaosion 
- add sandbox documentation to index.rst @chenhaiq  
- installation docs have been updated

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
2025-05-16 08:12:31 +08:00
ba6a2e0bb5 [FSDPCheckpointManager] feat: save huggingface model when 'hf_model' in checkpoint_contents (#1288)
Previously, `FSDPCheckpointManager` did not save the hf model when `hf_model`
was given in `checkpoint_contents`; it only saved the hf model's
config.

This PR correctly saves the huggingface model when 'hf_model' is in
`checkpoint_contents`.
2025-05-07 20:44:46 +08:00
fd3f21cb0e [megatron] qwen3 support (#1337)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Support qwen3 to run with megatron backend.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Update the offline weight conversion script (from hf to megatron) for qwen3.
- Add config converter from hf config to mcore config for qwen3.
- Add qk_layernorm weight load logic in mcore loader for qwen3(dense).
- Add model initializer and forward func for qwen3(moe).
- Add online weight converter from mcore to hf for qwen3.
- Fix typo in megatron CriticWorker.update_critic.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```bash
# example for qwen3-8B

HF_MODEL_PATH="Your hf ckpt path"
DIST_CKPT_PATH="Your mcore ckpt path"

# convert ckpt from hf to megatron
python3 scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH --output_path $DIST_CKPT_PATH

NODES=1
N_PER_NODE=8
PP=1
TP=8
CP=1
VLLM_TP=8

python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer'\
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=64 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=$HF_MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.actor.megatron.context_parallel_size=$CP \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.actor.megatron.param_offload=True \
    actor_rollout_ref.actor.megatron.grad_offload=True \
    actor_rollout_ref.actor.megatron.optimizer_offload=True \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.ref.megatron.context_parallel_size=$CP \
    actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.ref.megatron.param_offload=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$VLLM_TP \
    critic.optim.lr=1e-5 \
    critic.model.path=$HF_MODEL_PATH \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=4 \
    critic.megatron.tensor_model_parallel_size=$TP \
    critic.megatron.pipeline_model_parallel_size=$PP \
    critic.megatron.context_parallel_size=$CP \
    critic.megatron.use_dist_checkpointing=True \
    critic.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    critic.megatron.param_offload=True \
    critic.megatron.grad_offload=True \
    critic.megatron.optimizer_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_gsm8k_qwen3-8B' \
    trainer.experiment_name='qwen3_8b_gsm8k_gae_megatron' \
    trainer.n_gpus_per_node=$N_PER_NODE \
    trainer.nnodes=$NODES \
    trainer.save_freq=50 \
    trainer.test_freq=10 \
    trainer.total_epochs=100 "$@"


```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Signed-off-by: ShareLer <ShareLe@163.com>
2025-05-07 20:41:41 +08:00
c05f6c26b6 Qwen2moe[part1]: add cpu converter option, add CI test for current solutions temporarily (#1267)
Temporarily use the CPU to initialize larger models for huggingface to
dist_ckpt conversion.

Also supports GQA MoE models.

This may not require CI, as the function can be dependent on VeRL, but
the current solution may need it.
2025-05-07 13:11:02 +08:00
ba38413aa5 Option to make model private when pushing to hub, pushing the tokenizer for convenience (#1259)
Very small changes to `model_merger.py` so that the tokenizer is pushed to
the hub and the model can be pushed privately.
2025-04-28 20:17:42 +08:00
ea4cd31987 [merger] fix: merged generation config is inconsistent with hf pre-trained model (#1277)
afeac9a023/scripts/model_merger.py (L195-L200)

A model created by `from_config` does not load the `generation_config.json`
from `args.hf_model_path`; instead, it creates a generation config
separately.

This inconsistency leads to strange generation errors when users run
vllm/hf rollout without carefully overriding
sampling_params/generation_config; see the issue here:
https://github.com/volcengine/verl/issues/1246

This PR introduces a `patch_model_generation_config` function, which patches
the model created from the config so that it correctly uses the pretrained
generation config.
Fixes https://github.com/volcengine/verl/issues/1246.
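
The idea behind the patch can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the function name `patch_generation_config` and the injectable `load_config` parameter are hypothetical, added here so the logic can be exercised without downloading a model. By default the sketch assumes `transformers.GenerationConfig.from_pretrained` as the loader.

```python
def patch_generation_config(model, hf_model_path, load_config=None):
    """Replace the model's generation config with the pretrained one, if available.

    `load_config` is a hypothetical hook for illustration; when omitted, the
    sketch falls back to transformers' GenerationConfig.from_pretrained.
    """
    if load_config is None:
        # Imported lazily so the sketch can be tested without transformers.
        from transformers import GenerationConfig

        load_config = GenerationConfig.from_pretrained

    # Only models that can generate text carry a meaningful generation config.
    if not model.can_generate():
        return model

    try:
        model.generation_config = load_config(hf_model_path)
    except OSError:
        # No generation_config.json next to the checkpoint: keep the config
        # that was derived from the model config.
        print(f"Warning: no generation config found in {hf_model_path}; keeping the default one.")
    return model
```

With a model built via `from_config`, calling `patch_generation_config(model, args.hf_model_path)` before saving would keep the merged checkpoint's generation settings aligned with the pretrained checkpoint.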
2025-04-28 09:23:19 +08:00
8e5ad4688a [Lint] fix: linting errors in all files (#1280)
This PR enables checking on all files after fixing all the errors:

```
examples/data_preprocess/geo3k.py:41:121: E501 Line too long (121 > 120)
examples/data_preprocess/multiturn.py:54:121: E501 Line too long (185 > 120)
examples/data_preprocess/multiturn.py:59:121: E501 Line too long (210 > 120)
examples/data_preprocess/multiturn.py:73:121: E501 Line too long (229 > 120)
examples/data_preprocess/multiturn.py:78:121: E501 Line too long (211 > 120)
examples/ray/tutorial.ipynb:cell 9:1:121: E501 Line too long (179 > 120)
examples/ray/tutorial.ipynb:cell 15:1:121: E501 Line too long (143 > 120)
examples/ray/tutorial.ipynb:cell 42:14:1: E402 Module level import not at top of cell
recipe/prime/prime_dp_rm.py:145:121: E501 Line too long (153 > 120)
recipe/prime/prime_dp_rm.py:156:121: E501 Line too long (137 > 120)
recipe/prime/prime_dp_rm.py:292:121: E501 Line too long (148 > 120)
recipe/r1/data_process.py:56:121: E501 Line too long (289 > 120)
recipe/r1/data_process.py:113:121: E501 Line too long (166 > 120)
recipe/r1/data_process.py:118:121: E501 Line too long (137 > 120)
recipe/r1/data_process.py:123:121: E501 Line too long (297 > 120)
recipe/r1/data_process.py:131:9: E722 Do not use bare `except`
recipe/r1/tasks/livecodebench.py:61:5: E722 Do not use bare `except`
scripts/diagnose.py:55:9: F841 Local variable `ip` is assigned to but never used
scripts/diagnose.py:165:13: B028 No explicit `stacklevel` keyword argument found
scripts/model_merger.py:42:121: E501 Line too long (184 > 120)
scripts/model_merger.py:146:13: E722 Do not use bare `except`
tests/e2e/arithmetic_sequence/model/create_model_tokenizer.py:28:121: E501 Line too long (440 > 120)
tests/gpu_utility/test_memory_buffers.py:42:5: F841 Local variable `model_named_params` is assigned to but never used
tests/gpu_utility/test_memory_buffers.py:43:5: F841 Local variable `model_copy_named_params` is assigned to but never used
tests/gpu_utility/test_memory_buffers.py:53:5: F841 Local variable `model_wrapper` is assigned to but never used
tests/model/test_transformers_ulysses.py:102:5: F841 Local variable `response_length` is assigned to but never used
tests/model/test_transformers_ulysses.py:181:5: F841 Local variable `response_length` is assigned to but never used
tests/ray/detached_worker/server.py:83:13: F841 Local variable `vpp_rank` is assigned to but never used
tests/ray/test_check_worker_alive.py:37:121: E501 Line too long (121 > 120)
tests/rollout/run_fsdp_vllm.py:22:64: F811 Redefinition of unused `ShardingStrategy` from line 20
tests/rollout/test_sglang_spmd.py:210:121: E501 Line too long (157 > 120)
tests/rollout/test_vllm_spmd.py:20:64: F811 Redefinition of unused `ShardingStrategy` from line 18
tests/sandbox/test_sandbox.py:86:121: E501 Line too long (1615 > 120)
tests/sandbox/test_sandbox.py:87:121: E501 Line too long (1596 > 120)
tests/sanity/check_license.py:22:1: E402 Module level import not at top of file
tests/sanity/check_license.py:23:1: E402 Module level import not at top of file
tests/verl/utils/dataset/test_rl_dataset.py:23:5: F841 Local variable `url` is assigned to but never used
tests/verl/utils/dataset/test_rm_dataset.py:22:5: F841 Local variable `url` is assigned to but never used
tests/verl/utils/dataset/test_rm_dataset.py:36:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
tests/verl/utils/dataset/test_sft_dataset.py:22:5: F841 Local variable `url` is assigned to but never used
tests/verl/utils/dataset/test_sft_dataset.py:50:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
tests/verl/utils/dataset/test_sft_dataset.py:75:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
verl/__init__.py:22:1: E402 Module level import not at top of file
verl/__init__.py:24:1: E402 Module level import not at top of file
verl/__init__.py:25:1: E402 Module level import not at top of file
verl/__init__.py:29:1: E402 Module level import not at top of file
verl/__init__.py:29:15: F401 `.single_controller` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:16:5: F401 `.modeling_llama_megatron.ParallelLlamaForCausalLM` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:18:5: F401 `.modeling_llama_megatron.ParallelLlamaForCausalLMRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:20:5: F401 `.modeling_llama_megatron.ParallelLlamaForCausalLMRmPadPP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:21:5: F401 `.modeling_llama_megatron.ParallelLlamaForValueRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:22:5: F401 `.modeling_llama_megatron.ParallelLlamaForValueRmPadPP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/__init__.py:24:5: F401 `.modeling_llama_megatron.ParallelLlamaModel` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/checkpoint_utils/llama_loader.py:92:121: E501 Line too long (168 > 120)
verl/models/llama/megatron/checkpoint_utils/llama_loader_depracated.py:92:121: E501 Line too long (168 > 120)
verl/models/llama/megatron/checkpoint_utils/llama_loader_depracated.py:274:121: E501 Line too long (127 > 120)
verl/models/llama/megatron/checkpoint_utils/llama_saver.py:170:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/llama/megatron/checkpoint_utils/llama_saver.py:211:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/llama/megatron/checkpoint_utils/llama_saver.py:261:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/llama/megatron/layers/__init__.py:15:33: F401 `.parallel_attention.ParallelLlamaAttention` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/layers/__init__.py:16:31: F401 `.parallel_decoder.ParallelLlamaDecoderLayer` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/layers/__init__.py:16:58: F401 `.parallel_decoder.ParallelLlamaDecoderLayerRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/layers/__init__.py:17:27: F401 `.parallel_mlp.ParallelLlamaMLP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/layers/__init__.py:18:31: F401 `.parallel_rmsnorm.ParallelLlamaRMSNorm` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/llama/megatron/layers/parallel_attention.py:196:121: E501 Line too long (134 > 120)
verl/models/llama/megatron/layers/parallel_attention.py:341:1: E402 Module level import not at top of file
verl/models/llama/megatron/layers/parallel_attention.py:342:1: E402 Module level import not at top of file
verl/models/llama/megatron/layers/parallel_attention.py:343:1: E402 Module level import not at top of file
verl/models/llama/megatron/layers/parallel_attention.py:366:1: E402 Module level import not at top of file
verl/models/llama/megatron/layers/parallel_attention.py:420:121: E501 Line too long (122 > 120)
verl/models/llama/megatron/layers/parallel_linear.py:82:1: E402 Module level import not at top of file
verl/models/mcore/loader.py:273:121: E501 Line too long (134 > 120)
verl/models/mcore/util.py:26:121: E501 Line too long (202 > 120)
verl/models/qwen2/megatron/__init__.py:16:5: F401 `.modeling_qwen2_megatron.ParallelQwen2ForCausalLM` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/__init__.py:18:5: F401 `.modeling_qwen2_megatron.ParallelQwen2ForCausalLMRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/__init__.py:20:5: F401 `.modeling_qwen2_megatron.ParallelQwen2ForCausalLMRmPadPP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/__init__.py:21:5: F401 `.modeling_qwen2_megatron.ParallelQwen2ForValueRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/__init__.py:22:5: F401 `.modeling_qwen2_megatron.ParallelQwen2ForValueRmPadPP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/__init__.py:24:5: F401 `.modeling_qwen2_megatron.ParallelQwen2Model` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/checkpoint_utils/qwen2_loader.py:90:121: E501 Line too long (169 > 120)
verl/models/qwen2/megatron/checkpoint_utils/qwen2_loader.py:256:121: E501 Line too long (172 > 120)
verl/models/qwen2/megatron/checkpoint_utils/qwen2_loader_depracated.py:90:121: E501 Line too long (169 > 120)
verl/models/qwen2/megatron/checkpoint_utils/qwen2_loader_depracated.py:272:121: E501 Line too long (127 > 120)
verl/models/qwen2/megatron/checkpoint_utils/qwen2_saver.py:170:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/qwen2/megatron/checkpoint_utils/qwen2_saver.py:211:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/qwen2/megatron/checkpoint_utils/qwen2_saver.py:261:9: F841 Local variable `tp_rank` is assigned to but never used
verl/models/qwen2/megatron/layers/__init__.py:15:33: F401 `.parallel_attention.ParallelQwen2Attention` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/layers/__init__.py:16:31: F401 `.parallel_decoder.ParallelQwen2DecoderLayer` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/layers/__init__.py:16:58: F401 `.parallel_decoder.ParallelQwen2DecoderLayerRmPad` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/layers/__init__.py:17:27: F401 `.parallel_mlp.ParallelQwen2MLP` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/layers/__init__.py:18:31: F401 `.parallel_rmsnorm.ParallelQwen2RMSNorm` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/models/qwen2/megatron/layers/parallel_attention.py:163:121: E501 Line too long (134 > 120)
verl/models/qwen2/megatron/layers/parallel_attention.py:282:1: E402 Module level import not at top of file
verl/models/qwen2/megatron/layers/parallel_attention.py:283:1: E402 Module level import not at top of file
verl/models/qwen2/megatron/layers/parallel_attention.py:284:1: E402 Module level import not at top of file
verl/models/qwen2/megatron/layers/parallel_attention.py:307:1: E402 Module level import not at top of file
verl/models/qwen2/megatron/layers/parallel_attention.py:361:121: E501 Line too long (122 > 120)
verl/models/qwen2/megatron/modeling_qwen2_megatron.py:630:121: E501 Line too long (130 > 120)
verl/models/transformers/llama.py:106:121: E501 Line too long (180 > 120)
verl/models/transformers/llama.py:214:121: E501 Line too long (128 > 120)
verl/models/transformers/llama.py:215:121: E501 Line too long (135 > 120)
verl/models/transformers/monkey_patch.py:145:1: E402 Module level import not at top of file
verl/models/transformers/monkey_patch.py:146:1: E402 Module level import not at top of file
verl/models/transformers/monkey_patch.py:148:1: E402 Module level import not at top of file
verl/models/transformers/monkey_patch.py:157:9: B904 Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling
verl/models/transformers/qwen2.py:215:121: E501 Line too long (128 > 120)
verl/models/transformers/qwen2.py:216:121: E501 Line too long (135 > 120)
verl/protocol.py:303:121: E501 Line too long (125 > 120)
verl/protocol.py:352:121: E501 Line too long (171 > 120)
verl/protocol.py:578:121: E501 Line too long (142 > 120)
verl/protocol.py:580:121: E501 Line too long (150 > 120)
verl/protocol.py:583:121: E501 Line too long (167 > 120)
verl/protocol.py:715:1: E402 Module level import not at top of file
verl/protocol.py:725:121: E501 Line too long (121 > 120)
verl/protocol.py:766:1: E402 Module level import not at top of file
verl/protocol.py:768:1: E402 Module level import not at top of file
verl/single_controller/__init__.py:23:1: E402 Module level import not at top of file
verl/single_controller/__init__.py:24:1: E402 Module level import not at top of file
verl/single_controller/base/decorator.py:149:16: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
verl/single_controller/base/decorator.py:198:121: E501 Line too long (134 > 120)
verl/single_controller/base/decorator.py:310:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
verl/single_controller/base/worker.py:137:121: E501 Line too long (131 > 120)
verl/single_controller/base/worker_group.py:89:33: G003 Logging statement uses `+`
verl/single_controller/base/worker_group.py:202:21: B904 Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling
verl/single_controller/ray/__init__.py:15:19: F401 `.base.RayClassWithInitArgs` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/single_controller/ray/__init__.py:15:41: F401 `.base.RayResourcePool` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/single_controller/ray/__init__.py:15:58: F401 `.base.RayWorkerGroup` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/single_controller/ray/__init__.py:15:74: F401 `.base.create_colocated_worker_cls` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/third_party/sglang/parallel_state.py:135:5: F841 Local variable `rank` is assigned to but never used
verl/third_party/vllm/__init__.py:40:40: F401 `.vllm_v_0_6_3.llm.LLMEngine` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/third_party/vllm/__init__.py:45:22: F401 `vllm.LLM` imported but unused
verl/third_party/vllm/__init__.py:46:34: F401 `vllm.distributed.parallel_state` imported but unused
verl/third_party/vllm/__init__.py:50:121: E501 Line too long (141 > 120)
verl/third_party/vllm/vllm_v_0_5_4/dtensor_weight_loaders.py:189:1: E402 Module level import not at top of file
verl/third_party/vllm/vllm_v_0_5_4/llm.py:136:121: E501 Line too long (132 > 120)
verl/third_party/vllm/vllm_v_0_5_4/llm.py:196:121: E501 Line too long (161 > 120)
verl/third_party/vllm/vllm_v_0_5_4/megatron_weight_loaders.py:174:5: F811 Redefinition of unused `llama_megatron_core_te_weight_loader` from line 90
verl/third_party/vllm/vllm_v_0_5_4/megatron_weight_loaders.py:205:5: F811 Redefinition of unused `llama_megatron_core_weight_loader` from line 121
verl/third_party/vllm/vllm_v_0_5_4/megatron_weight_loaders.py:254:121: E501 Line too long (150 > 120)
verl/third_party/vllm/vllm_v_0_5_4/model_loader.py:36:21: F811 Redefinition of unused `LoadConfig` from line 24
verl/third_party/vllm/vllm_v_0_5_4/model_loader.py:36:45: F811 Redefinition of unused `ModelConfig` from line 26
verl/third_party/vllm/vllm_v_0_5_4/model_loader.py:323:1: E402 Module level import not at top of file
verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py:127:5: F841 Local variable `rank` is assigned to but never used
verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py:245:5: F841 Local variable `rank` is assigned to but never used
verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py:147:121: E501 Line too long (144 > 120)
verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py:152:121: E501 Line too long (143 > 120)
verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py:232:5: F841 Local variable `port` is assigned to but never used
verl/third_party/vllm/vllm_v_0_5_4/worker.py:220:121: E501 Line too long (127 > 120)
verl/third_party/vllm/vllm_v_0_6_3/config.py:46:92: B026 Star-arg unpacking after a keyword argument is strongly discouraged
verl/third_party/vllm/vllm_v_0_6_3/dtensor_weight_loaders.py:225:1: E402 Module level import not at top of file
verl/third_party/vllm/vllm_v_0_6_3/llm.py:141:121: E501 Line too long (132 > 120)
verl/third_party/vllm/vllm_v_0_6_3/llm.py:169:121: E501 Line too long (161 > 120)
verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:52:24: F811 Redefinition of unused `EngineArgs` from line 35
verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:53:21: F811 Redefinition of unused `LoadConfig` from line 25
verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:53:33: F811 Redefinition of unused `ModelConfig` from line 27
verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:354:9: F841 Local variable `distributed_executor_backend` is assigned to but never used
verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py:360:121: E501 Line too long (152 > 120)
verl/third_party/vllm/vllm_v_0_6_3/megatron_weight_loaders.py:199:5: F841 Local variable `params_mapping` is assigned to but never used
verl/third_party/vllm/vllm_v_0_6_3/megatron_weight_loaders.py:229:121: E501 Line too long (150 > 120)
verl/third_party/vllm/vllm_v_0_6_3/model_loader.py:28:21: F811 Redefinition of unused `LoadConfig` from line 22
verl/third_party/vllm/vllm_v_0_6_3/model_loader.py:28:45: F811 Redefinition of unused `ModelConfig` from line 22
verl/third_party/vllm/vllm_v_0_6_3/model_loader.py:312:1: E402 Module level import not at top of file
verl/third_party/vllm/vllm_v_0_6_3/model_runner.py:44:21: F811 Redefinition of unused `LoadConfig` from line 27
verl/third_party/vllm/vllm_v_0_6_3/model_runner.py:44:33: F811 Redefinition of unused `ModelConfig` from line 29
verl/third_party/vllm/vllm_v_0_6_3/parallel_state.py:129:5: F841 Local variable `rank` is assigned to but never used
verl/third_party/vllm/vllm_v_0_6_3/parallel_state.py:247:5: F841 Local variable `rank` is assigned to but never used
verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py:147:121: E501 Line too long (144 > 120)
verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py:152:121: E501 Line too long (143 > 120)
verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py:232:5: F841 Local variable `port` is assigned to but never used
verl/third_party/vllm/vllm_v_0_6_3/worker.py:217:121: E501 Line too long (127 > 120)
verl/trainer/fsdp_sft_trainer.py:298:121: E501 Line too long (158 > 120)
verl/trainer/fsdp_sft_trainer.py:501:121: E501 Line too long (121 > 120)
verl/trainer/fsdp_sft_trainer.py:550:1: E402 Module level import not at top of file
verl/trainer/fsdp_sft_trainer.py:551:1: E402 Module level import not at top of file
verl/trainer/fsdp_sft_trainer.py:553:1: E402 Module level import not at top of file
verl/trainer/fsdp_sft_trainer.py:553:43: F811 Redefinition of unused `FSDPSFTTrainer` from line 82
verl/trainer/fsdp_sft_trainer.py:554:1: E402 Module level import not at top of file
verl/utils/__init__.py:16:24: F401 `.tokenizer.hf_processor` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/__init__.py:16:38: F401 `.tokenizer.hf_tokenizer` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/checkpoint/checkpoint_manager.py:48:37: B006 Do not use mutable data structures for argument defaults
verl/utils/checkpoint/fsdp_checkpoint_manager.py:51:37: B006 Do not use mutable data structures for argument defaults
verl/utils/checkpoint/fsdp_checkpoint_manager.py:56:13: B028 No explicit `stacklevel` keyword argument found
verl/utils/checkpoint/fsdp_checkpoint_manager.py:81:121: E501 Line too long (121 > 120)
verl/utils/checkpoint/fsdp_checkpoint_manager.py:98:121: E501 Line too long (124 > 120)
verl/utils/checkpoint/megatron_checkpoint_manager.py:64:37: B006 Do not use mutable data structures for argument defaults
verl/utils/checkpoint/megatron_checkpoint_manager.py:219:121: E501 Line too long (124 > 120)
verl/utils/dataset/__init__.py:15:25: F401 `.rl_dataset.RLHFDataset` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/dataset/__init__.py:16:25: F401 `.rm_dataset.RMDataset` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/dataset/__init__.py:17:26: F401 `.sft_dataset.SFTDataset` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/dataset/multiturn_sft_dataset.py:96:9: F841 Local variable `current_length` is assigned to but never used
verl/utils/dataset/sft_dataset.py:95:79: B023 Function definition does not bind loop variable `key`
verl/utils/dataset/sft_dataset.py:103:83: B023 Function definition does not bind loop variable `key`
verl/utils/debug/__init__.py:15:26: F401 `.performance.GPUMemoryLogger` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/debug/__init__.py:15:43: F401 `.performance.log_gpu_memory_usage` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/utils/debug/performance.py:68:121: E501 Line too long (127 > 120)
verl/utils/debug/performance.py:71:121: E501 Line too long (126 > 120)
verl/utils/debug/profile.py:15:1: I001 [*] Import block is un-sorted or un-formatted
verl/utils/debug/profile.py:19:15: UP039 [*] Unnecessary parentheses after class definition
verl/utils/debug/profile.py:50:23: F541 [*] f-string without any placeholders
verl/utils/debug/profile.py:52:49: F541 [*] f-string without any placeholders
verl/utils/debug/profile.py:53:47: F541 [*] f-string without any placeholders
verl/utils/debug/profile.py:54:67: F541 [*] f-string without any placeholders
verl/utils/debug/profile.py:54:121: E501 Line too long (122 > 120)
verl/utils/flops_counter.py:175:121: E501 Line too long (124 > 120)
verl/utils/hdfs_io.py:135:32: G004 Logging statement uses f-string
verl/utils/import_utils.py:78:9: B904 Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling
verl/utils/logger/aggregate_logger.py:46:121: E501 Line too long (131 > 120)
verl/utils/logger/aggregate_logger.py:64:41: G004 Logging statement uses f-string
verl/utils/megatron/tensor_parallel.py:152:121: E501 Line too long (123 > 120)
verl/utils/megatron_utils.py:17:1: I001 [*] Import block is un-sorted or un-formatted
verl/utils/megatron_utils.py:22:20: F401 [*] `torch.nn` imported but unused
verl/utils/megatron_utils.py:34:38: F401 [*] `verl.utils.memory_buffer.build_memory_reference_from_module` imported but unused
verl/utils/megatron_utils.py:332:30: B009 [*] Do not call `getattr` with a constant attribute value. It is not any safer than normal property access.
verl/utils/megatron_utils.py:366:27: B009 [*] Do not call `getattr` with a constant attribute value. It is not any safer than normal property access.
verl/utils/model.py:464:121: E501 Line too long (124 > 120)
verl/utils/rendezvous/ray_backend.py:39:25: G004 Logging statement uses f-string
verl/utils/rendezvous/ray_backend.py:41:22: G004 Logging statement uses f-string
verl/utils/rendezvous/ray_backend.py:63:30: G004 Logging statement uses f-string
verl/utils/rendezvous/ray_backend.py:65:30: G004 Logging statement uses f-string
verl/utils/rendezvous/ray_backend.py:72:26: G004 Logging statement uses f-string
verl/utils/reward_score/gsm8k.py:47:121: E501 Line too long (201 > 120)
verl/utils/reward_score/math.py:213:121: E501 Line too long (142 > 120)
verl/utils/reward_score/prime_code/__init__.py:16:8: F401 `re` imported but unused
verl/utils/reward_score/prime_code/testing_util.py:131:121: E501 Line too long (688 > 120)
verl/utils/reward_score/prime_code/testing_util.py:168:13: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:222:9: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:254:13: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:255:17: B018 Found useless expression. Either assign it to a variable or remove it.
verl/utils/reward_score/prime_code/testing_util.py:259:13: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:260:17: B018 Found useless expression. Either assign it to a variable or remove it.
verl/utils/reward_score/prime_code/testing_util.py:264:13: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:265:17: B018 Found useless expression. Either assign it to a variable or remove it.
verl/utils/reward_score/prime_code/testing_util.py:269:121: E501 Line too long (132 > 120)
verl/utils/reward_score/prime_code/testing_util.py:293:21: E722 Do not use bare `except`
verl/utils/reward_score/prime_code/testing_util.py:294:25: B018 Found useless expression. Either assign it to a variable or remove it.
verl/utils/reward_score/prime_code/testing_util.py:335:121: E501 Line too long (165 > 120)
verl/utils/reward_score/prime_code/testing_util.py:386:121: E501 Line too long (209 > 120)
verl/utils/reward_score/prime_code/testing_util.py:390:121: E501 Line too long (183 > 120)
verl/utils/reward_score/prime_code/testing_util.py:455:121: E501 Line too long (211 > 120)
verl/utils/reward_score/prime_code/testing_util.py:459:121: E501 Line too long (185 > 120)
verl/utils/reward_score/prime_code/testing_util.py:582:121: E501 Line too long (197 > 120)
verl/utils/reward_score/prime_code/testing_util.py:586:121: E501 Line too long (171 > 120)
verl/utils/reward_score/prime_math/__init__.py:106:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/__init__.py:119:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/__init__.py:246:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/__init__.py:315:121: E501 Line too long (128 > 120)
verl/utils/reward_score/prime_math/__init__.py:331:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/__init__.py:407:1: E402 Module level import not at top of file
verl/utils/reward_score/prime_math/__init__.py:429:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/grader.py:302:21: B005 Using `.strip()` with multi-character strings is misleading
verl/utils/reward_score/prime_math/grader.py:302:21: B005 Using `.strip()` with multi-character strings is misleading
verl/utils/reward_score/prime_math/math_normalize.py:54:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/math_normalize.py:70:17: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/math_normalize.py:101:5: E722 Do not use bare `except`
verl/utils/reward_score/prime_math/math_normalize.py:181:121: E501 Line too long (142 > 120)
verl/utils/tokenizer.py:30:9: B028 No explicit `stacklevel` keyword argument found
verl/utils/tokenizer.py:33:9: B028 No explicit `stacklevel` keyword argument found
verl/utils/tokenizer.py:55:9: B028 No explicit `stacklevel` keyword argument found
verl/utils/torch_functional.py:86:72: E741 Ambiguous variable name: `l`
verl/utils/torch_functional.py:177:5: F841 Local variable `total_params` is assigned to but never used
verl/utils/torch_functional.py:397:1: E402 Module level import not at top of file
verl/utils/torch_functional.py:399:1: E402 Module level import not at top of file
verl/utils/torch_functional.py:400:1: E402 Module level import not at top of file
verl/utils/ulysses.py:246:5: F841 Local variable `sp_size` is assigned to but never used
verl/workers/actor/dp_actor.py:244:13: F841 Local variable `response_mask` is assigned to but never used
verl/workers/actor/megatron_actor.py:22:1: I001 [*] Import block is un-sorted or un-formatted
verl/workers/actor/megatron_actor.py:85:121: E501 Line too long (122 > 120)
verl/workers/actor/megatron_actor.py:86:121: E501 Line too long (128 > 120)
verl/workers/actor/megatron_actor.py:89:121: E501 Line too long (133 > 120)
verl/workers/actor/megatron_actor.py:96:121: E501 Line too long (126 > 120)
verl/workers/actor/megatron_actor.py:175:121: E501 Line too long (135 > 120)
verl/workers/actor/megatron_actor.py:237:121: E501 Line too long (150 > 120)
verl/workers/actor/megatron_actor.py:243:121: E501 Line too long (144 > 120)
verl/workers/actor/megatron_actor.py:245:121: E501 Line too long (130 > 120)
verl/workers/actor/megatron_actor.py:247:121: E501 Line too long (122 > 120)
verl/workers/actor/megatron_actor.py:286:9: F841 Local variable `input_shapes` is assigned to but never used
verl/workers/critic/dp_critic.py:227:21: F841 Local variable `input_ids` is assigned to but never used
verl/workers/critic/dp_critic.py:230:21: F841 Local variable `position_ids` is assigned to but never used
verl/workers/megatron_workers.py:18:1: I001 [*] Import block is un-sorted or un-formatted
verl/workers/reward_manager/__init__.py:15:20: F401 `.batch.BatchRewardManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_manager/__init__.py:16:19: F401 `.dapo.DAPORewardManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_manager/__init__.py:17:20: F401 `.naive.NaiveRewardManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_manager/__init__.py:18:20: F401 `.prime.PrimeRewardManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_manager/prime.py:61:121: E501 Line too long (217 > 120)
verl/workers/reward_model/__init__.py:15:19: F401 `.base.BasePPORewardModel` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_model/megatron/__init__.py:15:27: F401 `.reward_model.MegatronRewardModel` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/reward_model/megatron/reward_model.py:65:9: F841 Local variable `ori_bs` is assigned to but never used
verl/workers/reward_model/megatron/reward_model.py:89:121: E501 Line too long (132 > 120)
verl/workers/reward_model/megatron/reward_model.py:215:9: F841 Local variable `input_shapes` is assigned to but never used
verl/workers/rollout/naive/__init__.py:15:28: F401 `.naive_rollout.NaiveRollout` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/rollout/sglang_rollout/__init__.py:14:29: F401 `.sglang_rollout.SGLangRollout` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/rollout/vllm_rollout/fire_vllm_rollout.py:22:121: E501 Line too long (129 > 120)
verl/workers/rollout/vllm_rollout/fire_vllm_rollout.py:51:121: E501 Line too long (157 > 120)
verl/workers/rollout/vllm_rollout/fire_vllm_rollout.py:153:13: F841 Local variable `log_probs` is assigned to but never used
verl/workers/rollout/vllm_rollout/vllm_rollout.py:22:121: E501 Line too long (129 > 120)
verl/workers/rollout/vllm_rollout/vllm_rollout.py:60:121: E501 Line too long (157 > 120)
verl/workers/sharding_manager/__init__.py:16:5: F401 `verl.utils.import_utils.is_megatron_core_available` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/sharding_manager/__init__.py:17:5: F401 `verl.utils.import_utils.is_sglang_available` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/sharding_manager/__init__.py:21:19: F401 `.base.BaseShardingManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/sharding_manager/__init__.py:22:27: F401 `.fsdp_ulysses.FSDPUlyssesShardingManager` imported but unused; consider removing, adding to `__all__`, or using a redundant alias
verl/workers/sharding_manager/__init__.py:29:121: E501 Line too long (149 > 120)
verl/workers/sharding_manager/__init__.py:32:121: E501 Line too long (126 > 120)
verl/workers/sharding_manager/fsdp_sglang.py:99:9: F841 Local variable `load_format` is assigned to but never used
verl/workers/sharding_manager/fsdp_sglang.py:123:121: E501 Line too long (178 > 120)
verl/workers/sharding_manager/fsdp_ulysses.py:59:13: F841 Local variable `sp_size` is assigned to but never used
Found 305 errors.
```

---------

Co-authored-by: Haibin Lin <haibin.lin@bytedance.com>
2025-04-27 15:24:30 -07:00
7b6b7cb5b8 clean codes (#1219)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-04-23 18:11:23 +08:00
4fa7ed6c0d [mcore] qwen2moe support (#1139)
Support running the qwen2moe structure with megatron-core,
including:
* qwen2moe config converter 
* qwen2moe model initializer
* refactor the online weight converter from mcore to vllm
* qwen2moe online weight converter
* qwen2moe offline weight conversion script from hf to mcore
* a script to run training qwen1.5moe_a2.7b with 4 nodes

TODO
add option to freeze the MoE router weight during training
2025-04-20 12:48:46 +08:00
568239fb38 CI: limit ruff checks and enable push tests (#1157) 2025-04-19 13:54:45 +08:00
b00f77d855 [dev] feat: immigrate from yapf & pylint to ruff based on pre-commit (#1010)
> [!WARNING]
> We are [migrating to `ruff` as the linter and formatter and
`pre-commit` as the managing
tool](https://github.com/volcengine/verl/pull/1010).
>
> If your branch is based on a previous commit using `yapf` and
`pylint`, simply merging might trigger overwhelming linting errors,
while **you are only expected to resolve ones in the files related to
your PR**.
>
> To resolve this issue, please try the following workaround to only
include the files you **really changed** in the PR:
>
> 1. In your branch, fix linting and format with `ruff`: `ruff check
--fix && ruff format`
> 2. Squash into a single commit in a new branch: `git reset --soft
$(git merge-base main HEAD) && git add -A && git commit -m "feat: ..."`
> 3. Merge with the latest main: `git merge origin/main`
> 4. Force push to your branch: `git push --force`

We add the reminder above to the documentation to tell contributors how
to avoid overwhelming linting errors.

### Motivation

According to the discussion in #896, this PR migrates from yapf & pylint
to ruff, managed by pre-commit, which allows unified version control and
an automatic hook on commit.

### Summary

The `pre-commit` hook and CI

- checks staged / committed files in commits / PRs
- checks all files each month (this is expected to fail until all the
files conform to the ruff standard)

### Explanation for the Failing CI Workflow `pre-commit`

For now, we only apply `ruff format` and `ruff check --fix` **without
resolving all the errors**, since there are too many to resolve at once,
which causes the CI workflow `pre-commit` to fail.

Resolving the remaining errors is left to future commits.
Specifically, the `pre-commit` hook and CI will require every commit to
fix its related files with `ruff`, so all the files get fixed
incrementally.

### Reviewing Suggestion

The commit
3d93f51ba8
is huge since we apply `ruff` to all the files. To review the main
changes, please check the commits before and after it.
2025-04-18 07:49:31 -07:00
d7978b66d9 chore: update diagnose.py (#1078)
occured -> occurred
2025-04-14 21:35:57 +08:00
f976b1853d Update vllm 0.8.2 with megatron 0.11.0 (#1054)
Part of #851

Includes a minimal upgrade:

1. vllm 0.8.2 with megatron
2. part of per-tensor allgather and load weights
3. fix bugs with context parallel caused by the dataloader random seed;
the behavior seems to have changed in torch 2.6.0
2025-04-14 09:27:35 +08:00
d4cae44726 [mcore] option to use dist checkpoint (#1030)
mcore dist checkpointing is a parallel-invariant weight format: you can
save and load in arbitrary parallel settings, e.g. save in tp2pp2 and
load in tp4pp1.
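
To illustrate what "parallel-invariant" means here, a minimal sketch in
plain Python. It assumes a single weight split row-wise across
tensor-parallel ranks and leaves out pipeline parallelism. The helpers
`shard_rows`, `save_dist` and `load_dist` are hypothetical and are not
the mcore dist checkpoint API; they only show that once the checkpoint
records the logical tensor, any rank layout can be cut from it at load
time.

```python
def shard_rows(full, tp_size):
    """Split a list of rows evenly across tp_size ranks."""
    assert len(full) % tp_size == 0, "rows must divide evenly across ranks"
    per_rank = len(full) // tp_size
    return [full[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def save_dist(shards):
    """Hypothetical save: record the logical (unsharded) row order."""
    return [row for shard in shards for row in shard]


def load_dist(ckpt, tp_size, rank):
    """Hypothetical load: a rank reads only the slice it owns."""
    return shard_rows(ckpt, tp_size)[rank]


# A toy 8-row weight, saved from a tp2 layout and loaded into a tp4 layout.
weight = [[float(i), i + 0.5] for i in range(8)]
ckpt = save_dist(shard_rows(weight, tp_size=2))
tp4_shards = [load_dist(ckpt, tp_size=4, rank=r) for r in range(4)]
```

Each of the four ranks ends up with two rows, and concatenating them in
rank order gives back the original weight.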

This PR introduces an option to use dist checkpoint with the mcore
backend. It is *disabled* by default for backward compatibility, but
future support for *mcore MoE models and VLM models* will work only when
dist ckpt is enabled, for an easier implementation.

Before this PR, when initializing actor and critic workers, each GPU
would load the entire huggingface weights and then re-shard them into
the correct mcore model state dict, making the procedure slow and
complicated.
With this PR, we convert hf weights to dist ckpt with offline scripts,
and each GPU loads only its part from the dist ckpt. Loading is faster
and online resharding is no longer needed.

When loading `Qwen2-7B-Instruct` for the critic worker, the loading time
dropped from 109s to 25s, a *4.36x* speedup.

The `converter_hf_to_mcore.py` in this version uses the existing online
resharding function to convert weights. It should be refactored for
better efficiency and to support MoE/VLM models.
Thanks to #998 for the optimization of loading hf weights only at GPU 0.

Future TODO:
* refactor the converter for efficiency
* support converting MoE models
* support converting VLM models
* re-design `megatron_checkpoint_manager.py` with dist ckpt
* implement converter from mcore dist ckpt to hf / `model_merger.py`
* add docs and example scripts
2025-04-13 17:59:43 +08:00
550bbbbffe [vllm] fix oom when vllm wakeup (vllm >=0.8.3) (#987)
This is a memory optimization method implemented based on this
[fix](https://github.com/vllm-project/vllm/pull/15500). I just
successfully ran a 72B model on 8*H800 cards. Before the fix, I would
encounter an OOM issue. Please note that this fix is only effective for
vLLM >= 0.8.3.
2025-04-10 18:07:10 +08:00
fefe951f2a Add support to HSDP model merging. (#971)
Currently the model merger does not support HSDP (the `ddp` mesh dim is
not considered). This PR fixes this.
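
A minimal sketch of why the `ddp` mesh dim matters when merging. It
assumes a row-major `(ddp, fsdp)` device mesh, where a parameter is
sharded along `fsdp` and replicated along `ddp`. The function
`merge_hsdp` and the rank layout are illustrative assumptions, not the
model merger's actual code.

```python
def merge_hsdp(shards_by_rank, ddp_size, fsdp_size):
    """Rebuild a flat parameter from per-rank shards of an HSDP run.

    Assumed layout: rank = ddp_idx * fsdp_size + fsdp_idx, so every
    ddp replica holds the same fsdp_size shards.
    """
    assert len(shards_by_rank) == ddp_size * fsdp_size
    merged = []
    for fsdp_idx in range(fsdp_size):
        # Read only the ddp_idx == 0 replica; the others are copies.
        merged.extend(shards_by_rank[0 * fsdp_size + fsdp_idx])
    return merged


# Toy parameter with 8 elements, fsdp_size=4, ddp_size=2 (8 ranks).
full = list(range(8))
one_replica = [full[i * 2:(i + 1) * 2] for i in range(4)]
shards_by_rank = one_replica + one_replica  # second ddp replica is a copy

merged = merge_hsdp(shards_by_rank, ddp_size=2, fsdp_size=4)
# Ignoring the ddp dim and concatenating every rank duplicates the data.
naive = [x for shard in shards_by_rank for x in shard]
```

Here `merged` has the original 8 elements, while `naive` has 16.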
2025-04-09 07:55:39 +08:00
8400beb87c [merger] fix: move megatron import into megatron related branch (#958)
Users of the fsdp backend may not have megatron installed, so running
this script directly leads to an import error.
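
The pattern can be sketched as follows. This is a minimal illustration,
not the merger script itself: `merge_checkpoint`, its return strings and
the `megatron_module` parameter are hypothetical, and the parameter
exists only so the example can be run without megatron installed.

```python
import importlib


def merge_checkpoint(backend, megatron_module="megatron"):
    """Dispatch on backend, importing megatron only where it is needed."""
    if backend == "fsdp":
        # No megatron import on this path, so fsdp-only installs work.
        return "merged with fsdp"
    if backend == "megatron":
        # Deferred import: fails only if this branch is actually taken.
        megatron = importlib.import_module(megatron_module)
        return f"merged with {megatron.__name__}"
    raise ValueError(f"unknown backend: {backend}")


fsdp_result = merge_checkpoint("fsdp")

try:
    merge_checkpoint("megatron", megatron_module="surely_not_installed_pkg_xyz")
    megatron_import_failed = False
except ImportError:
    megatron_import_failed = True
```

With a module-level import, the `fsdp` call above would already fail on
a machine without megatron. With the import inside the branch, only the
megatron path raises `ImportError`.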
2025-04-07 09:50:21 -07:00