Commit Graph

49 Commits

Author SHA1 Message Date
515f2255ac [ci] fix: use local models/configs/datasets to increase stability (#3616)
### What does this PR do?

- As title

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
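
The title convention above could be checked mechanically. The following is a minimal sketch, not the project's actual CI check: `check_pr_title`, `MODULES`, `TYPES`, and the regular expression are hypothetical names chosen for illustration, and the module and type lists are copied from the checklist above.

```python
import re

# Module and type names copied from the checklist; the real CI list may differ.
MODULES = {
    "fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc", "data",
}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Optional [BREAKING] prefix, then [modules], a type, a colon, and a description.
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\]\s+(\w+):\s+(.+)$")


def check_pr_title(title: str) -> bool:
    """Return True if the title follows `[{modules}] {type}: {description}`."""
    match = TITLE_RE.match(title)
    if match is None:
        return False
    modules = [name.strip() for name in match.group(2).split(",")]
    return all(name in MODULES for name in modules) and match.group(3) in TYPES


print(check_pr_title("[BREAKING][fsdp, megatron] feat: dynamic batching"))
```

A real check would likely need a longer module list, since this sketch rejects any prefix that is not in the checklist.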

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
2025-09-25 22:14:56 +08:00
6e6fafdc74 [model] feat: add FSDP/Megatron critic worker with model engine (#3439)
### What does this PR do?

- As title
- Add a test that compares the output of the FSDP/Megatron engine with the
Hugging Face model

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-13 12:18:58 +08:00
b03866768f [ci] feat: move more tests to volcano engine (#3455) 2025-09-12 18:54:55 +08:00
5c46f4f437 [model] feat: replace DataProto with TensorDict in engine (#3422) 2025-09-09 22:28:25 +08:00
d7a0469977 [model] feat: polish model engine (#3321) 2025-09-03 20:44:39 +08:00
91ee0a2c08 [fsdp, model] feat: support FSDP model engine (#3270)
### What does this PR do?

- Support FSDPEngine and FSDPEngineWithLMHead
- Add tests showing that the FSDP engine matches mcore and Hugging Face
on the Qwen 2.5 0.5B model

---------

Co-authored-by: ziheng.jiang <ziheng.jiang@bytedance.com>
2025-09-01 16:17:45 +08:00
1065a29d14 [megatron, model] feat: add MegatronEngine, MegatronEngineForCausalLM (#3235) 2025-08-28 19:36:05 +08:00
27b63c724a [env, sglang] feat: Bump new sglang version to fix vlm OOM (#3216)
### What does this PR do?
- Bump sglang to a new version
- This sglang version fixes the VLM OOM issue; details are in:
https://github.com/sgl-project/sglang/issues/9365

### Test

Following the instructions in
https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/release_log/latest_sglang.md

The new version of sglang is now installed:
<img width="786" height="154" alt="image"
src="https://github.com/user-attachments/assets/bcec557e-196c-40c0-aa0f-c19d9f5c3e98"
/>

`gsm8k`:
using `verl/examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh`

[Wandb](https://wandb.ai/popsoda-university-of-washington/multi-turn-grpo-qwen2.5-3b-sglang/runs/dtcdin9b?nw=nwuserpopsoda)
<img width="532" height="329" alt="image"
src="https://github.com/user-attachments/assets/12f67d1a-a57e-497d-bfe5-6ff8c642e83f"
/>

It works well.

2025-08-26 13:29:36 +08:00
9b6a07fa77 [docker] feat: update to vllm 0.10.0, mcore 0.13, transformers 4.55.4 (#3192) 2025-08-26 05:17:57 +08:00
ae46f5a41a [ci] fix: model tests, transformers 4.55 has troubles with backward (#3139)
### What does this PR do?

[ci] fix: model tests, transformers 4.55 has troubles with backward

2025-08-20 13:33:12 +08:00
3ebe6717ad [megatron] fix: retain MLA config in mcore config converter (#2933)
### What does this PR do?

- In the current `check_and_disable_incompatible_configs` function, a config
entry is dropped if it is not an attribute of `TransformerConfig`. With
`MLATransformerConfig`, however, this function drops MLA settings such as
`q_lora_rank`, which causes many problems in the downstream pipeline.
- This PR refactors `check_and_disable_incompatible_configs` into a factory
function, `check_and_construct_configs`, which accepts a class type bounded
by `TransformerConfig` and returns a `TransformerConfig` instance.
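
The factory idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the two dataclasses are small stand-ins for Megatron's `TransformerConfig` and `MLATransformerConfig`, and the argument names and signature of `check_and_construct_configs` are guesses. The point it shows is that filtering against the fields of the requested class, rather than the base class, keeps MLA-only entries such as `q_lora_rank`.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict, Type, TypeVar


@dataclass
class TransformerConfig:
    """Stand-in for the real base config (hypothetical fields)."""
    num_layers: int = 1
    hidden_size: int = 8


@dataclass
class MLATransformerConfig(TransformerConfig):
    """Stand-in for the MLA config; adds an MLA-only field."""
    q_lora_rank: int = 0


ConfigT = TypeVar("ConfigT", bound=TransformerConfig)


def check_and_construct_configs(original: Dict[str, Any], cls: Type[ConfigT]) -> ConfigT:
    """Drop entries that `cls` does not define, then build an instance of `cls`."""
    valid_names = {field.name for field in dataclasses.fields(cls)}
    kept = {key: value for key, value in original.items() if key in valid_names}
    return cls(**kept)


raw = {"num_layers": 2, "q_lora_rank": 512, "not_a_config_key": 1}
mla_config = check_and_construct_configs(raw, MLATransformerConfig)
print(mla_config)
```

Passing `TransformerConfig` instead of `MLATransformerConfig` in this sketch would drop `q_lora_rank`, which mirrors the behavior the PR describes as the original problem.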

@ETOgaosion 

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
2025-08-07 12:35:18 +08:00
f32e54deaa [docker] feat: Upgrade sglang 0.4.9 + transformers 4.53.2 (#2794)
### What does this PR do?

feat: Upgrade sglang 0.4.9 + transformers 4.53.2

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

2025-07-31 00:49:27 +08:00
f98ee1c697 [cfg] fix: fix failing rollout config test on main (#2771)
### What does this PR do?

The CPU unit test broke after
https://github.com/volcengine/verl/pull/2757/files was merged.

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
2025-07-28 16:43:56 +08:00
4879d619fc [docker] feat: upgrade to torch 2.7, sglang 0.4.8 (#2617)
### What does this PR do?

[docker] feat: upgrade to torch 2.7, sglang 0.4.8

Stage 2: vllm 0.9.1
Stage 3: mcore 0.13.0

---------

Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-07-24 14:53:24 -07:00
69a467f934 [docker] fix: downgrade TransformerEngine version 2.2.1 to allow mcore image using rope fusion and provide another set of v0.5 image (#2611)
### What does this PR do?

Downgrade TransformerEngine version to allow mcore image using rope
fusion and provide another set of v0.5 image.

2025-07-18 17:23:19 +08:00
ebb21b7fc7 [docker] refactor: Migrate images to verlai, support latest flash attention and newer CUDA versions in future (#2085)
### Checklist Before Starting

- [ ] Searched for similar PR(s).
- [ ] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore, test`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

Migrate images to verlai, upgrade CUDA support to 12.6 and support
latest flash attention

```txt
docker
├── README.md
├── verl0.4-cu124-torch2.6-fa2.7.4
│   ├── Dockerfile.app.sglang.vllm.mcore0.12
│   ├── Dockerfile.app.sglang.vllm.mcore0.13.preview
│   ├── Dockerfile.app.vllm.mcore0.12
│   ├── Dockerfile.app.vllm.mcore0.13.preview
│   ├── Dockerfile.base
│   └── README.md
├── verl0.5-cu126-torch2.7.1-fa2.8.0
│   ├── Dockerfile.app.sglang.mcore0.12
│   ├── Dockerfile.app.sglang.mcore0.13.preview
│   ├── Dockerfile.base.fi0.2.6
│   └── README.md
└── verl0.5-preview-cu128-torch2.7.1-fa2.8.0
    ├── Dockerfile.app.sglang.megatron
    ├── Dockerfile.base.fi0.2.6
    └── README.md
```

- verlai/verl
  - verl0.4
    - base
    - app.sglang.vllm.mcore
    - app.vllm.mcore
  - verl0.5
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet; for debugging]
  - verl0.5-preview
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet; for debugging]


### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.
2025-07-04 14:32:02 +08:00
cfc5ff2452 [ci] fix: add tests for vllm (#2036)
### Checklist Before Starting

- [x] Searched for similar PR(s).
- [x] Checked PR Title format
  - In format of: [modules] type: Title
- modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci,
training_utils, recipe, hardware, deployment, ray, worker,
single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore`
- can involve multiple modules, separated by `,` or space, like
`[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

Fix the failing vllm test

### Test

Added one more test to make sure that a problematic tool class fails
during initialization

---------

Co-authored-by: wuxibin <wuxibin@bytedance.com>
2025-06-16 18:27:28 +08:00
5fa911b3ce [ci] refactor: setup testing guidance (#1958) 2025-06-12 06:16:58 -07:00
9afa8d6dff fix error when ci failed by incorrect sgl-kernel version (#1872)
### Checklist Before Starting

- [ done ] Search for similar PR(s).

### What does this PR do?

Fix the CI failure caused by an incorrect sgl-kernel version in the Docker image:

```
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 647, in assert_pkg_version
    raise Exception(
Exception: sgl-kernel is installed with version 0.1.0, which is less than the minimum required version 0.1.1. Please reinstall the latest version with `pip install sgl-kernel --force-reinstall`
```
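
The failure above comes from a minimum-version check at import time. As a rough illustration of that kind of guard (not sglang's `assert_pkg_version` implementation), the sketch below compares dotted numeric versions; `parse_version` and `assert_min_version` are hypothetical helpers, and suffixes such as `.post1` are not handled.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '0.1.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def assert_min_version(package: str, installed: str, minimum: str) -> None:
    """Raise if the installed version is lower than the required minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise RuntimeError(
            f"{package} is installed with version {installed}, which is less "
            f"than the minimum required version {minimum}."
        )


try:
    assert_min_version("sgl-kernel", "0.1.0", "0.1.1")
except RuntimeError as error:
    print(error)
```

Comparing integer tuples rather than raw strings matters here: as strings, `"0.10.0"` would sort before `"0.9.1"`.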
2025-06-06 13:55:08 +08:00
7b0426a738 [Docker Image] update images and fix sglang installation (#1606)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Update images and fix the sglang installation. The latest image:
`whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3`

### Specific Changes

- vLLM: 0.8.5.post1
- SGLang: 0.4.6.post4, fix installation
- Megatron: core_v0.12.0 announcement
- TransformerEngine: 2.3

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-21 09:13:51 +08:00
b8bd596811 [Docker Image] use latest vLLM (0.8.5) to fully support Qwen3 moe (#1544) 2025-05-17 07:28:55 +08:00
43782a24bd [Doc/Docker Image] Update mcore image to use vLLM which support qwen3 and rewrite installation from conda (#1505)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Update the mcore image to use a vLLM version that supports Qwen3, and
rewrite the installation from conda.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

Docker image and docs

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: both
- **Inference**: both

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-14 14:40:13 +08:00
d4a11ebb44 [utils] Enrich and fix utils from fsdp_utils and seqlen_balancing (#1495)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Enrich and fix utility functions in `verl/utils/fsdp_utils.py` and
`verl/utils/seqlen_balancing.py`.

* In `get_fsdp_wrap_policy`, introduce a unified `_get_attr` helper so
both dict‑based (OmegaConf) and dataclass‑style configs can work.

* In `rearrange_micro_batches`, add two new parameters
(`same_micro_num_in_dp`, `min_num_micro_batch`).

* Also re-organized the workflow pipeline structure to make it align
better with the verl file structure.

### API

In `verl.utils.seqlen_balancing.rearrange_micro_batches`, add two new
parameters (`same_micro_num_in_dp`, `min_num_micro_batch`).

### Usage Example

```python
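# A toy example; input_ids and attention_mask are assumed to be (batch, seq_len) tensors
from verl import DataProto
from verl.utils.seqlen_balancing import rearrange_micro_batches

dataproto = DataProto.from_single_dict({"input_ids": input_ids, "attention_mask": attention_mask})
micros, idx_map = rearrange_micro_batches(dataproto.batch, max_token_len=300, same_micro_num_in_dp=False, min_num_micro_batch=2)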
```
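
To illustrate what `max_token_len` and `min_num_micro_batch` control, here is a minimal pure-Python sketch. It is not verl's implementation: the helper name `split_by_token_budget` and the greedy placement strategy are assumptions made for illustration, and `same_micro_num_in_dp` is ignored because the sketch runs in a single process.

```python
import math


def split_by_token_budget(seq_lens, max_token_len, min_num_micro_batch=1):
    """Hypothetical helper: group sequence indices into micro-batches.

    The token budget and the minimum count together decide how many
    micro-batches to create; sequences are then placed greedily,
    longest first, into the currently lightest micro-batch.
    Note: greedy placement balances load but does not guarantee that
    every micro-batch stays under max_token_len.
    """
    total_tokens = sum(seq_lens)
    num_bins = max(math.ceil(total_tokens / max_token_len), min_num_micro_batch)
    num_bins = min(num_bins, len(seq_lens))  # never create more bins than sequences

    bins = [[] for _ in range(num_bins)]
    loads = [0] * num_bins
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        target = loads.index(min(loads))  # lightest micro-batch so far
        bins[target].append(idx)
        loads[target] += seq_lens[idx]
    return bins, loads


if __name__ == "__main__":
    lengths = [120, 80, 200, 50, 150]  # made-up sequence lengths
    bins, loads = split_by_token_budget(lengths, max_token_len=300, min_num_micro_batch=2)
    print(bins, loads)
```

Raising `min_num_micro_batch` above the count implied by the token budget forces a finer split, which is the only effect of that parameter in this sketch.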

### Test
* Added in `tests/utils/gpu_tests/test_seqlen_balancing.py`

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Signed-off-by: Hongpeng Guo <hg5@illinois.edu>
2025-05-13 17:01:16 +08:00
249c26fdc8 [tests] BREAKING: move recipe.dapo.src to recipe.dapo; move test files to their own namespaces (tests/verl/xxx -> tests/xxx) (#1392) 2025-05-10 11:21:53 +08:00
8bb009bf47 [CI] feat: separate FSDP2 test & fix: CI trigger (#1389)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

1. Separate the FSDP2 test to avoid blocking other tests.
2. Fix the CI trigger rule to avoid redundant runs (since I find the
original PR triggers unrelated tests, so I fix the rule based on [the
doc](https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore))

### Test

For 2, I test by commenting out the matching path for workflow `.yml`,
and see only related workflows are triggered:

Before: <img width="870" alt="image"
src="https://github.com/user-attachments/assets/2f7dbe0c-f638-4a75-8cbc-a364081271fc"
/>

After: <img width="869" alt="image"
src="https://github.com/user-attachments/assets/f5a35d85-f03c-452e-abed-3ca3ce22d699"
/>

### Additional Info.

- **Issue Number**: https://github.com/volcengine/verl/issues/1388
- **Training**: FSDP
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-05 07:20:35 -07:00
e0d035cd4a [sglang] feat: Add SGLang async multi-turn rollout with tool support (#1037)
A redesigned version of #917 

## Current Status
[Develop log &
Tracker](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/113)

**What Has Been Done**
- Async Rollout Refactoring: Integrate with the tool server to
coordinate tool calls during generation, using request IDs for state and
progress tracking. This supports async multi-turn conversations in
Agentic RL training (with tool support).
- Async Request Management: Encapsulate rollout requests into a unified
structure, enabling efficient tracking and handling of concurrent
multi-turn dialogues with ChatML-style messages.
- Extensible Tools: A modular design for adapting tools in the
OpenAIFunctionTool format, which is supported by both SGLang and vLLM.
Each tool creates a separate instance, executes on a tool call,
calculates a score according to the tool environment state, and releases
its resources.
- Multi-turn support has been implemented for the GSM8K task (a new
version is in progress). However, training has not yet converged, and we
hope the community can help investigate the issue.

**What Is WIP**
- [x] Merge loss mask to training process from last version
- [x] Add more user friendly tool config and e2e tests for gsm8k with
tool training
- [ ] We are going to validate our multiturn feature in open-source
sandbox environments.

## Key Features to Be Introduced in Future Versions

- Integrate a Ray-based agent trainer to enable explicit separation of
the rollout and training pipeline. Provide support for partial rollout
handling and fine-grained request state management.
- Extend the framework to support simulated user interactions (e.g.,
roleplay, interactive feedback) and more complex environment-in-the-loop
RL tasks.

**Future Plan**
[Discussion
Thread](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/74#issuecomment-2763192625)
[RFC
doc](https://github.com/SwordFaith/verl-sglang-dev-log/blob/main/rlhf/verl/multi-turn/veRL-multiturn-rollout-RFC.md)
will be updated soon.

## Contributors & Acknowledgement

- Xiang Long [mid.of.change@gmail.com](mailto:mid.of.change@gmail.com)
@SwordFaith (Design RFC & core-dev of refactor part)
- Yuzhen Zhou [zyzshishui@gmail.com](mailto:zyzshishui@gmail.com)
@zyzshishui (Core-dev)
- Chenyang Zhao [zhaochen20@outlook.com](mailto:zhaochen20@outlook.com)
@zhaochenyang20 (PM)
- Guanhua Wang @WANG-GH 
- Junrong Lin @ocss884 (verl-sglang support)
- Hanchen Zhang
[zhanghanchen77@gmail.com](mailto:zhanghanchen77@gmail.com)
- Haoran Wang [ubecwang@gmail.com](mailto:ubecwang@gmail.com)
- Rui Lu [learningrate1@gmail.com](mailto:learningrate1@gmail.com)
- Yujiang Li [liyujiang2020@gmail.com](mailto:liyujiang2020@gmail.com)
- Jiajun Li [guapisolo@gmail.com](mailto:guapisolo@gmail.com)
- Jin Pan [jpan236@wisc.edu](mailto:jpan236@wisc.edu)
- Zhi Zheng [zhengzhi@modelbest.cn](mailto:zhengzhi@modelbest.cn)
@zh-zheng

---------

Co-authored-by: zyzshishui <492129152@qq.com>
Co-authored-by: guanhua <281484683@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>
Co-authored-by: HL <linhaibin.eric@gmail.com>
2025-04-29 13:20:06 -07:00
0234d8e3ab fix reward model and add CI test (#1252)
Fix bugs related to #1165 .

The Megatron backend reward model had no CI test; add one to the current
PPO trainer tests.

Fix `micro_batch_size_per_gpu` but not sure whether it is right for
reward config.

The output format is also not right with current `forward_micro_batch`
implementation.
2025-04-29 21:20:21 +08:00
fbb93e44b1 [CI] feat: only test for push to main (#1271) 2025-04-27 09:51:09 +08:00
a35c044627 Migrate to new image with FlashInfer 0.2.2 + vLLM 0.8.3 + SGLang 0.4.5 + MCore 0.12.0 + TE 2.2 + cuDNN 9.8.0 (#1237)
Since both are supported, we now let TE choose the attention backend.

New Image:
`whatcanyousee/verl:ngc-cu124-vllm0.8.3-sglang0.4.5-mcore0.12.0-te2.2`
2025-04-24 16:14:48 +08:00
5313d96f9b [CI] fix: add additional pre-commit test before ppo trainer tests (#1175) 2025-04-20 11:16:19 -07:00
568239fb38 CI: limit ruff checks and enable push tests (#1157) 2025-04-19 13:54:45 +08:00
5ba1dbc606 [ci] feat: improve CI speed to 1-2min per test (#1032)
### Summary

#### Minimize Test Workloads

This PR minimizes the test workloads while keeping them meaningful,
reducing the time cost of a test from >10 min to 1~2 min. Specifically,
we

1. set batch sizes and steps as small but still meaningful numbers:

```bash
train_traj_micro_bsz_per_gpu=2 # b
n_resp_per_prompt=4 # g

train_traj_micro_bsz=$((train_traj_micro_bsz_per_gpu * NUM_GPUS)) # b * n
train_traj_mini_bsz=$((train_traj_micro_bsz * 2)) # 2 * b * n
train_prompt_mini_bsz=$((train_traj_mini_bsz * n_resp_per_prompt)) # 2 * b * n * g
train_prompt_bsz=$((train_prompt_mini_bsz * 2)) # 4 * b * n * g
# ...
TOT_TRAIN_STEPS=${TOT_TRAIN_STEPS:-1}
```

2. disable validation (this costs a lot!) / saving / resuming for
training tests by default and leave them to specialized tests

```bash
# Validation
VAL_BEFORE_TRAIN=${VAL_BEFORE_TRAIN:-False}
TEST_FREQ=${TEST_FREQ:--1}
# Save & Resume
RESUME_MODE=${RESUME_MODE:-disable}
SAVE_FREQ=${SAVE_FREQ:--1}
```
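
The batch-size arithmetic from step 1 can be checked by hand with a short sketch. It mirrors the shell expressions exactly as written; the function name `ci_batch_sizes` is hypothetical, and `num_gpus=8` is an assumption, since the value of `NUM_GPUS` is not shown in the snippet.

```python
def ci_batch_sizes(num_gpus, micro_bsz_per_gpu=2, n_resp_per_prompt=4):
    """Hypothetical helper mirroring the shell arithmetic line by line."""
    traj_micro_bsz = micro_bsz_per_gpu * num_gpus          # b * n
    traj_mini_bsz = traj_micro_bsz * 2                     # two micro-batches per mini-batch
    prompt_mini_bsz = traj_mini_bsz * n_resp_per_prompt    # as written in the script
    prompt_bsz = prompt_mini_bsz * 2                       # two mini-batches per step
    return {
        "train_traj_micro_bsz": traj_micro_bsz,
        "train_traj_mini_bsz": traj_mini_bsz,
        "train_prompt_mini_bsz": prompt_mini_bsz,
        "train_prompt_bsz": prompt_bsz,
    }


if __name__ == "__main__":
    # num_gpus=8 is an assumed value for illustration.
    for name, value in ci_batch_sizes(num_gpus=8).items():
        print(f"{name}={value}")
```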

#### Improve Triggering Mode

This PR introduces more comprehensive triggering logic.
Specifically, we

1. consider all Python code by default
2. include related entrypoints (the workflow config, scripts used by it
and hydra config, etc.)
3. exclude unrelated Python code from other components (e.g., recipes,
examples, Megatron, SFT, generation, evaluation, etc. for FSDP training)

An example from `e2e_ppo_trainer`:

```yaml
on:
    paths:
      - "**/*.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      - "tests/e2e/ppo_trainer"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_trainer.yaml"
      - "!examples"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe"
      # Megatron
      - "!verl/workers/**/megatron_*.py"
```

#### Avoid missing errors

Some test scripts didn't end with the main Python command, so a failure
of that command could go unnoticed.

To address this issue, this PR introduces the following options:

```bash
set -xeuo pipefail
```

These options mean:

- `x`: Print each command before executing it (useful for debugging)
- `e`: Exit immediately if any command fails (returns non-zero exit
status)
- `u`: Treat unset variables as an error
- `o pipefail`: Return the exit status of the last command in a pipeline
that failed, or zero if all succeeded

Together, these options make the script fail fast and provide verbose
output, which helps with debugging and ensuring the script doesn't
continue after encountering errors.
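
The effect of each option can be observed with a small sketch that runs one-line bash snippets and compares exit codes with and without the option. The helper `run_bash` and the variable name `UNSET_DEMO_VAR_XYZ` are made up for this illustration, and the sketch assumes `bash` is available on the `PATH`.

```python
import subprocess


def run_bash(script):
    """Run a snippet in a fresh bash process; return (exit code, stdout)."""
    proc = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    snippets = [
        # -e: without it, the script keeps going after `false` fails.
        "false; echo reached",
        "set -e; false; echo reached",
        # pipefail: without it, only the last command of a pipeline counts.
        "false | true",
        "set -o pipefail; false | true",
        # -u: without it, an unset variable silently expands to nothing.
        'echo "value=${UNSET_DEMO_VAR_XYZ}"',
        'set -u; echo "value=${UNSET_DEMO_VAR_XYZ}"',
    ]
    for snippet in snippets:
        code, out = run_bash(snippet)
        print(f"{snippet!r}: exit={code} stdout={out!r}")
```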

#### Others

Besides, we also

1. unify runner labels into `"L20x8"` to enable preemptive scheduling of
jobs
2. merge test scripts that differ only minimally, grouped by entrypoint
(e.g. `ppo_trainer`, `ppo_megatron_trainer`, recipes, etc.), into a base
script with options
2025-04-14 09:48:10 -07:00
866e9808d4 [CI] feat: unify CI label to enable preemptive schedule for jobs (#1072) 2025-04-14 16:52:30 +08:00
f976b1853d Update vllm 0.8.2 with megatron 0.11.0 (#1054)
Parts of #851 

This includes a minimal set of upgrades:

1. vllm 0.8.2 with megatron
2. part of per-tensor allgather and load weights
3. fix bugs with context parallel caused by the dataloader random seed;
the behavior seems to have changed in torch 2.6.0
2025-04-14 09:27:35 +08:00
d882b62b01 tests: add import utils tests (#1042) 2025-04-11 18:55:54 -07:00
c9e3c57cf8 [megatron] feat: optimize entropy loss (#1007) 2025-04-11 09:37:37 +08:00
526c0908be [ci] chore: reduce CI load (#934) 2025-04-06 10:06:10 -07:00
5d0a7eaf6d [feat] Megatron checkpoint support for current Llama and Qwen models (#687)
# Intro

Support Megatron checkpoint for Model, Optimizer States and RNG states,
with a new layer of abstraction: `MegatronCheckpointManager` like FSDP.
Also add checkpoint tests.

# Involved Issues and PRs

This resolves issues #682 and #605, and incorporates PRs #510, #634,
#368 and #330. Thanks for the great efforts of @uygnef, @ShareLer and
@caaatch22 in these contributions.

# TODOs

- [ ] Support Megatron's distributed checkpointing mechanism; for now,
`torch.save`/`torch.load` is used to store and restore model weights.
- [x] Quick: Also store hf format model.

---------

Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>
2025-03-23 14:36:05 +08:00
3f6d45d95b fix: support transformers==4.50.0 (#704)
https://github.com/volcengine/verl/issues/703
2025-03-22 13:54:34 +08:00
0cc2bdada0 [misc] feat: add allgather method to dataproto (#497)
- Add allgather method to dataproto
- Add tests
- Replace existing raw allgather with this function
2025-03-06 22:05:51 +08:00
c15c6447ca [ci] feat: add ci timeout (#487)
Set timeout in CI to avoid infinite hang.
close #468
2025-03-06 08:52:05 +08:00
d36422be5c feat: add support for ulysses sequence parallel for transformers >= 4.48 (#357)
close #312 

Add support for ulysses sp for transformers >= 4.48

I've tested transformers 4.45.0, 4.46.0, 4.47.0, 4.48.0 and 4.49.0,
using sp=2 with the following script in my local env:
```bash
#!/bin/bash

set -ex
VERSIONS=("4.45.0" "4.46.0" "4.47.0" "4.48.0" "4.49.0")

for version in "${VERSIONS[@]}"; do
    echo "Testing with Transformers version ${version}"
    echo "----------------------------------------"
    
    pip install "transformers==${version}"
    
    PYTHONPATH=./ torchrun --nproc_per_node=2 tests/model/test_transformers_ulysses.py
    
    echo "----------------------------------------"
    echo "Completed testing for version ${version}"
    echo ""
done
```
2025-02-24 18:54:39 +08:00
0a1b16f800 distro: bump up version to v0.2.0.dev, limit vllm version (#327) 2025-02-20 15:21:43 +08:00
dd09d47fe2 Added content permissions of the workflow (#303)
We need to specify the minimum permission in the workflow.
2025-02-19 10:23:22 +08:00
27484a7bbb [misc] feat: add ckpt manager in utils (#216)
- Support FSDPCheckpointManager
- Support hdfs_io import if installed
- Add CI for FSDPCheckpointManager

TODO:
- Will integrate in the next PR
2025-02-07 09:09:03 +08:00
695bdbb030 [misc] fix: gradient accumulation in seq balance and modify default vllm log level (#141)
- Previous gradient accumulation value is computed by micro_batch_size,
which is wrong when using dynamic_bsz
- Fix ci script to avoid overlooking this issue
- Change vLLM state log default value to True to disable log.
- We will check the `self.config.actor.ppo_mini_batch_size %
self.config.actor.ppo_micro_batch_size_per_gpu == 0` after normalization
in fsdp_workers instead of in dp_actor and dp_critic.
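
A minimal sketch of why a gradient accumulation count derived from `micro_batch_size` breaks under dynamic batch sizes. All names here are hypothetical, and the sequential token-budget packing is an illustrative assumption, not verl's algorithm.

```python
def fixed_grad_accum(mini_batch_size, micro_batch_size):
    """Accumulation steps assuming every micro-batch holds the same number of samples."""
    return mini_batch_size // micro_batch_size


def pack_by_tokens(seq_lens, max_token_len):
    """Hypothetical dynamic batching: fill each micro-batch up to a token budget."""
    batches, current, used = [], [], 0
    for idx, length in enumerate(seq_lens):
        if current and used + length > max_token_len:
            batches.append(current)  # budget exceeded: start a new micro-batch
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    seq_lens = [900, 100, 100, 100, 800, 700, 50, 50]  # made-up lengths
    assumed_steps = fixed_grad_accum(mini_batch_size=len(seq_lens), micro_batch_size=2)
    micro_batches = pack_by_tokens(seq_lens, max_token_len=1024)
    actual_steps = len(micro_batches)
    print(f"assumed accumulation steps: {assumed_steps}")
    print(f"actual micro-batches:       {actual_steps}")
    # Scaling each micro-batch loss by 1/assumed_steps while running
    # actual_steps backward passes mis-scales the accumulated gradient.
    print(f"gradient scale error:       {actual_steps / assumed_steps:.2f}x")
```

With dynamic batch sizes, the accumulation count has to come from the number of micro-batches actually produced rather than from a fixed per-micro-batch sample count.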
2025-01-27 21:44:25 +08:00
ff0c7ccd41 [ci] fix: add force stop in ray e2e ci to clean env (#112)
- As titled
2025-01-17 21:41:50 +08:00
1facb9d2fb [misc] feat: support different flash_attn versions with variable num returns (#100)
* add ci

* fix reward model and write  more ci script

* support different flash_attn version with variable num returns

* update transformers rmpad workflow

* balance workload

* lint

* lint
2025-01-13 16:38:51 +08:00
569210e06c [misc] feat: support rmpad/data-packing in FSDP with transformers (#91)
* init commit of rmpad

* add rmpad test

* support rmpad in actor model

* add test for value model

* support rmpad in critic and rm

* fix actor return and fix num_labels and clean not used rmpad

* fix critic and benchmark

* update script

* fix critic

* lint

* fix util issue

* fix unnecessary unpad

* address issues

* fix args

* update test and update rmpad support model list

* fix typo

* fix typo and fix name

* rename rmpad to remove padding

* fix arch to model_type

* add ci for e2e rmpad and fix typo

* lint

* fix ci

* fix typo

* update tests for customize tokenizer in actor

* fix rmpad test

* update requirement of transformers as hf_rollout may have issue
2025-01-11 16:50:15 +08:00