12 Commits

Author SHA1 Message Date
fd1a121324 [hardware] fix: update source in dockerfile.rocm (#3284)
### What does this PR do?

> Update the source in `Dockerfile.rocm`

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
docker build -f Dockerfile.rocm -t verl-rocm:local .
docker run --rm -it verl-rocm:local python -c "import torch; print('ok')"
```

### Design & Code Changes

> Update the source in `Dockerfile.rocm`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
2025-09-01 11:32:44 +08:00
526098d664 [Hardware] feat: Support AMD (ROCm Kernel) - Update Dockerfile/Docker Image (#2390)
### What does this PR do?

> Update Dockerfile/Docker Image

### Checklist Before Starting
- [X] Search for similar PRs. 
- [X] Format the PR title (This will be checked by the CI)

### Test
>  Done

### API and Usage Example

>  Usage example(s): see the [AMD tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).


### Design & Code Changes

>  Dockerfile/Docker Image dependencies:
- ROCm: 6.3.4 (patch version)
- PyTorch: 2.7.0
- vllm: >=0.8.5
- sglang: >=v0.4.6.post4
- megatron-lm: TransformerEngine==1.14.0, megatron-core==0.12.0
- Ray: >=2.45

This update also enables VLM training.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-09 10:05:43 -07:00
4a846aa8f5 [hardware] chore: Enable Generation of Wheel File During Docker Build (#2332)
### What does this PR do?

This PR enhances `Dockerfile.rocm` by generating a Python wheel (`.whl`) as part of the Docker build process.
Changes introduced:
- Add `python setup.py bdist_wheel` immediately after `pip install -e . --no-deps`; a sketch of the resulting build step follows the list.
- The wheel is created inside the container under the `dist/` directory.
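
For illustration, a minimal sketch of the commands this build step runs inside the image (the final `ls` check is illustrative, not part of the PR):

```bash
# Install verl in editable mode without dependencies, then build the wheel.
pip install -e . --no-deps
python setup.py bdist_wheel   # the .whl lands under ./dist inside the container
ls dist/*.whl                 # illustrative check that the wheel was produced
```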

Co-authored-by: HIREMATH <rhiremat@ctr2-alola-ctrl-01.amd.com>
2025-07-02 13:10:51 -07:00
d2665c5eb5 [hardware] fix typo in dockerfile (#1950) 2025-06-11 06:46:46 +08:00
d02b3d5134 Dockerfile.rocm update tensordict==0.6.2 (#1898)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Update the tensordict version to 0.6.2.

This resolves the following PPO training error (a command-line equivalent of the pin follows the trace):
```
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=gae
data.train_files=/root/data/gsm8k/train.parquet
data.val_files=/root/data/gsm8k/test.parquet data.train_batch_size=256
data.max_prompt_length=512 data.max_response_length=512
data.return_raw_chat=True
actor_rollout_ref.model.path=/root/models/Qwen/Qwen2.5-0.5B
actor_rollout_ref.model.use_liger=True
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.use_dynamic_bsz=False
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1
actor_rollout_ref.actor.fsdp_config.param_offload=False
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
actor_rollout_ref.actor.use_kl_loss=False
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.8
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2
critic.optim.lr=1e-5 critic.ulysses_sequence_parallel_size=1
critic.model.use_remove_padding=True
critic.optim.lr_warmup_steps_ratio=0.05
critic.model.path=/root/models/Qwen/Qwen2.5-0.5B
critic.model.enable_gradient_checkpointing=False
critic.use_dynamic_bsz=False critic.ppo_max_token_len_per_gpu=32768
critic.ppo_micro_batch_size_per_gpu=2
critic.model.fsdp_config.param_offload=False
critic.model.fsdp_config.optimizer_offload=False
reward_model.enable=True reward_model.ulysses_sequence_parallel_size=1
reward_model.model.path=/root/models/Qwen/Qwen2.5-0.5B
reward_model.model.use_remove_padding=True
reward_model.model.fsdp_config.param_offload=True
reward_model.use_dynamic_bsz=False
reward_model.forward_max_token_len_per_gpu=32768
reward_model.micro_batch_size_per_gpu=2 algorithm.use_kl_in_reward=False
trainer.critic_warmup=0 'trainer.logger=[console]'
trainer.project_name=verl-test
trainer.experiment_name=qwen2.5-0.5b-model-reward-minimal
trainer.nnodes=1 trainer.n_gpus_per_node=8
trainer.val_before_train=False trainer.test_freq=False
trainer.save_freq=-1 trainer.resume_mode=disable trainer.total_epochs=2
trainer.total_training_steps=1
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/sgl-workspace/verl/__init__.py", line 22, in <module>
    from .protocol import DataProto
  File "/sgl-workspace/verl/protocol.py", line 30, in <module>
    import tensordict
  File "/usr/local/lib/python3.12/dist-packages/tensordict/__init__.py", line 6, in <module>
    import tensordict._reductions
  File "/usr/local/lib/python3.12/dist-packages/tensordict/_reductions.py", line 11, in <module>
    from tensordict._lazy import LazyStackedTensorDict
  File "/usr/local/lib/python3.12/dist-packages/tensordict/_lazy.py", line 38, in <module>
    from tensordict.memmap import MemoryMappedTensor
  File "/usr/local/lib/python3.12/dist-packages/tensordict/memmap.py", line 25, in <module>
    from torch.multiprocessing.reductions import ForkingPickler
ImportError: cannot import name 'ForkingPickler' from 'torch.multiprocessing.reductions'
(/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py)
```
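
The change itself amounts to pinning the package in `Dockerfile.rocm`; a hedged command-line equivalent:

```bash
# Pin tensordict so its memmap module no longer imports the removed
# torch.multiprocessing.reductions.ForkingPickler symbol.
pip install "tensordict==0.6.2"
```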

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [x] Rely on existing unit tests on CI that cover the code path.

Signed-off-by: Vicky Tsang <vtsang@amd.com>
2025-06-07 08:09:12 +08:00
07897f84e5 [AMD] fix: Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES (Fix AMD support) (#1465)
### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for `RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES` and fix AMD support.

### High-Level Design

The current approach for supporting AMD in verl is fundamentally incorrect and has only been working by luck:

Calls such as `torch.cuda.is_available()` or `torch.cuda.get_device_name()` will initialize the CUDA/ROCm environment:

c65ee728f0/torch/cuda/__init__.py (L342-L392)

Setting `CUDA/HIP/ROCR_VISIBLE_DEVICES` after CUDA/ROCm is initialized will not take effect (see https://github.com/pytorch/pytorch/issues/141678), which means that all the current code wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.

`CUDA_VISIBLE_DEVICES` also works for AMD, but only because a lot of AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- https://github.com/ROCm/TransformerEngine/pull/183
- https://github.com/vllm-project/vllm/pull/15246

Meanwhile, ray/vllm manipulate those `*_VISIBLE_DEVICES` variables at runtime, so those `torch.cuda.*` calls poison the current process whenever the CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution here is to use a single environment variable (`CUDA_VISIBLE_DEVICES`) for consistency and hardware-agnosticism, folding all the other `*_VISIBLE_DEVICES` variables into the CUDA one. Note that we must pay attention when both the HIP/CUDA and ROCR env vars are set, as they have different meanings: both accept either a list of ints or a list of UUIDs, but the ROCR variable is processed first, which then reduces the set of GPUs that HIP can select from (see https://github.com/pytorch/pytorch/pull/144026). To avoid this complexity, we simply raise an error if both are set (also keeping consistency with ray's practice as of 2.45.0).

For the poisoning issue, before those two PRs are merged, we need to ask users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer manipulates these variables and verl stays workable when no `*_VISIBLE_DEVICES` is set.

Note that for the latest ray (after its switch to `HIP_VISIBLE_DEVICES`), we also need this patch: https://github.com/ray-project/ray/pull/52794
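
To make the intended behavior concrete, here is a minimal shell sketch of the consolidation described above (hedged: the variable handling is paraphrased from this PR's description, and the hard error on conflicting variables mirrors ray 2.45.0's practice):

```bash
# HIP_* and ROCR_* have different semantics, so refuse ambiguous setups,
# consistent with ray 2.45.0's behavior.
if [ -n "${HIP_VISIBLE_DEVICES:-}" ] && [ -n "${ROCR_VISIBLE_DEVICES:-}" ]; then
  echo "Error: both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set" >&2
  exit 1
fi

# Fold the AMD-specific variable into the hardware-agnostic CUDA one before
# anything calls torch.cuda.* and freezes the visible-device list.
export CUDA_VISIBLE_DEVICES="${HIP_VISIBLE_DEVICES:-${ROCR_VISIBLE_DEVICES:-}}"
unset HIP_VISIBLE_DEVICES ROCR_VISIBLE_DEVICES

# Until the upstream fixes land, stop ray from rewriting these variables.
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
```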

### Test

Tested manually on both megatron and fsdp backends with vllm.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-06-02 10:12:45 -07:00
4de247fe4d [sglang] refactor: Unify async rollout under SGLangRollout, and support sglang==0.4.6.post5 (#1717)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

- Unify the functionality of SGLangRollout and AsyncSGLangRollout: remove the original SGLangRollout and rename AsyncSGLangRollout to SGLangRollout.
- Make trivial changes required by sglang==0.4.6.post5.

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: zyzshishui <@qq.com>
Co-authored-by: Xiang Long <mindsculptor@yeah.net>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: H <linhaibin.eric@gmail.com>
2025-05-31 19:47:25 -07:00
8160ec6a58 Bump to sglang 0.4.6.post4 & unified generate sequences ability between sgl and sgl async (#1577)
### Checklist Before Starting

- [x] Search for similar PR(s).
- Thanks to:
  - closes #1558 (a mix of PRs)
  - closes #1449 (a partial fix for the sglang new-version issue)
  - closes #1300 (part of the current PR)
- This PR is co-authored with @ocss884

### What does this PR do?

- bump sglang to 0.4.6.post4
- unify the sglang and sglang_async `generate_sequences` API behavior, e.g. image support
- fix the CUDA-barrier warning at the start of fsdp_workers

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: ocss884 <ocss.lin@gmail.com>
2025-05-20 09:39:07 +08:00
6b8706cd4f [Hardware] Support AMD (ROCm Kernel) - hardware-agnostic (remove the redundant code) (#1453)
### Checklist Before Starting

- [X] Search for similar PR(s): [PR#1369](https://github.com/volcengine/verl/pull/1369), [issue#1448](https://github.com/volcengine/verl/issues/1448)

### What does this PR do?

- Completes [issue#1448](https://github.com/volcengine/verl/issues/1448)

### High-Level Design

- New PR for hardware-agnostic sglang rollout

### Specific Changes

- `verl/workers/rollout/sglang_rollout/async_sglang_rollout.py`
- `verl/workers/rollout/sglang_rollout/sglang_rollout.py`

> We've already submitted the PR to ray, released in `ray>=2.45`. In that version, a hardware-agnostic rollout implementation is already supported within the verl codebase; you just need to assign `HIP_VISIBLE_DEVICES` in the training script. Thus, I discarded the patch I had added to the verl codebase last time.

### Usage Example


[amd_tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst)
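
For instance, a minimal sketch of that assignment in a training script (GPU indices and the single override shown are illustrative):

```bash
# With ray>=2.45, selecting AMD GPUs only needs the HIP variable.
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 -m verl.trainer.main_ppo trainer.n_gpus_per_node=4  # plus the usual config overrides
```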

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
2025-05-09 09:22:34 -07:00
76084d36cb [AMD] upgrade: Upgrade dockerfile and verl codebase (#1369)
## Checklist Before Starting

- [x] Search for similar PR(s). 

## What does this PR do?

1. Base Docker Image: Upgraded the base sglang docker to
`lmsysorg/sglang:v0.4.6.post1-rocm630` along with `torch_memory_saver
(hip version)`, which resolves the ROCm/aiter compatibility
[issue](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/amd-verl-dev/dev.md).

2. vLLM-0.6.3 Rollout Fix: Adjusted the rollout logic so that the latest VeRL upstream codebase remains compatible with both `vLLM versions ≤ 0.6.3` (sync mechanism) and `vLLM versions > 0.6.3` (async mechanism).

3. Update the ray version to [2.45.0](https://github.com/ray-project/ray/releases/tag/ray-2.45.0) ([PR#52794](https://github.com/ray-project/ray/pull/52794)) and support `ray>=2.45.0` within verl, resolving [verl-issue#1399](https://github.com/volcengine/verl/issues/1399). (A hedged pull/pin sketch follows the to-do list below.)

- [To-do-1] 3rd-party lib `torch_memory_saver`: the ROCm virtual-memory-allocator issue should be resolved within the [HIP version](https://github.com/fzyzcjy/torch_memory_saver/issues/9).
- [To-do-2] New PR for hardware-agnostic vllm/sglang rollout.
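
For reference, a hedged sketch of picking up the upgraded pieces named above:

```bash
docker pull lmsysorg/sglang:v0.4.6.post1-rocm630   # upgraded base image
pip install "ray>=2.45.0"                          # release containing PR#52794
```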


## Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting)
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
2025-05-06 18:06:05 -07:00
b0e3f1361e [AMD] docker: Support AMD (ROCm Kernel) - Support SGLang (#1179)
[Done]
- Update the Dockerfile and Apptainer file to support the SGLang engines
- Add the 3rd-party [torch_memory_saver](https://github.com/ExtremeViscent/torch_memory_saver) within the Dockerfile in the ROCm version
2025-04-20 12:51:10 -07:00
4a291fa760 [Hardware] Support AMD (ROCm kernel) (#360) 2025-03-06 13:56:20 +08:00