12 Commits

Author SHA1 Message Date
fd1a121324 [hardware] fix: update source in dockerfile.rocm (#3284)
### What does this PR do?

> Update the source in `Dockerfile.rocm`

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```
docker build -f Dockerfile.rocm -t verl-rocm:local .
docker run --rm -it verl-rocm:local python -c "import torch; print('ok')"
```

### Design & Code Changes

> Update the source in `Dockerfile.rocm`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
2025-09-01 11:32:44 +08:00
526098d664 [Hardware] feat: Support AMD (ROCm Kernel) - Update Dockerfile/Docker Image (#2390)
### What does this PR do?

> Update Dockerfile/Docker Image

### Checklist Before Starting
- [X] Search for similar PRs. 
- [X] Format the PR title (This will be checked by the CI)

### Test
>  Done

### API and Usage Example

>  Usage example(s): see the [AMD tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).


### Design & Code Changes

>  Dockerfile/Docker Image dependencies:
- ROCm: 6.3.4 (patch version)
- PyTorch: 2.7.0
- vllm: >=0.8.5
- sglang: >=v0.4.6.post4
- megatron-lm: TransformerEngine==1.14.0, megatron-core==0.12.0
- Ray: >=2.45

This update also enables VLM training.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-09 10:05:43 -07:00
4a846aa8f5 [hardware] chore: Enable Generation of Wheel File During Docker Build (#2332)
### What does this PR do?

This PR enhances `Dockerfile.rocm` by generating a Python wheel (`.whl`) as part of the Docker build process.
Changes introduced:
- Add `python setup.py bdist_wheel` immediately after `pip install -e . --no-deps`; a sketch of the resulting build step follows the list.
- The wheel is created inside the container under the `dist/` directory.
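
For illustration, a minimal sketch of the commands this build step runs inside the image (the final `ls` check is illustrative, not part of the PR):

```bash
# Install verl in editable mode without dependencies, then build the wheel.
pip install -e . --no-deps
python setup.py bdist_wheel   # the .whl lands under ./dist inside the container
ls dist/*.whl                 # illustrative check that the wheel was produced
```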

Co-authored-by: HIREMATH <rhiremat@ctr2-alola-ctrl-01.amd.com>
2025-07-02 13:10:51 -07:00
d2665c5eb5 [hardware] fix typo in dockerfile (#1950) 2025-06-11 06:46:46 +08:00
d02b3d5134 Dockerfile.rocm update tensordict==0.6.2 (#1898)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Update the tensordict version to 0.6.2.

This resolves the following PPO training error (a command-line equivalent of the pin follows the trace):
```
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=gae
data.train_files=/root/data/gsm8k/train.parquet
data.val_files=/root/data/gsm8k/test.parquet data.train_batch_size=256
data.max_prompt_length=512 data.max_response_length=512
data.return_raw_chat=True
actor_rollout_ref.model.path=/root/models/Qwen/Qwen2.5-0.5B
actor_rollout_ref.model.use_liger=True
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.use_dynamic_bsz=False
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1
actor_rollout_ref.actor.fsdp_config.param_offload=False
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
actor_rollout_ref.actor.use_kl_loss=False
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.8
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2
critic.optim.lr=1e-5 critic.ulysses_sequence_parallel_size=1
critic.model.use_remove_padding=True
critic.optim.lr_warmup_steps_ratio=0.05
critic.model.path=/root/models/Qwen/Qwen2.5-0.5B
critic.model.enable_gradient_checkpointing=False
critic.use_dynamic_bsz=False critic.ppo_max_token_len_per_gpu=32768
critic.ppo_micro_batch_size_per_gpu=2
critic.model.fsdp_config.param_offload=False
critic.model.fsdp_config.optimizer_offload=False
reward_model.enable=True reward_model.ulysses_sequence_parallel_size=1
reward_model.model.path=/root/models/Qwen/Qwen2.5-0.5B
reward_model.model.use_remove_padding=True
reward_model.model.fsdp_config.param_offload=True
reward_model.use_dynamic_bsz=False
reward_model.forward_max_token_len_per_gpu=32768
reward_model.micro_batch_size_per_gpu=2 algorithm.use_kl_in_reward=False
trainer.critic_warmup=0 'trainer.logger=[console]'
trainer.project_name=verl-test
trainer.experiment_name=qwen2.5-0.5b-model-reward-minimal
trainer.nnodes=1 trainer.n_gpus_per_node=8
trainer.val_before_train=False trainer.test_freq=False
trainer.save_freq=-1 trainer.resume_mode=disable trainer.total_epochs=2
trainer.total_training_steps=1
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/sgl-workspace/verl/__init__.py", line 22, in <module>
    from .protocol import DataProto
  File "/sgl-workspace/verl/protocol.py", line 30, in <module>
    import tensordict
  File "/usr/local/lib/python3.12/dist-packages/tensordict/__init__.py", line 6, in <module>
    import tensordict._reductions
  File "/usr/local/lib/python3.12/dist-packages/tensordict/_reductions.py", line 11, in <module>
    from tensordict._lazy import LazyStackedTensorDict
  File "/usr/local/lib/python3.12/dist-packages/tensordict/_lazy.py", line 38, in <module>
    from tensordict.memmap import MemoryMappedTensor
  File "/usr/local/lib/python3.12/dist-packages/tensordict/memmap.py", line 25, in <module>
    from torch.multiprocessing.reductions import ForkingPickler
ImportError: cannot import name 'ForkingPickler' from 'torch.multiprocessing.reductions'
(/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py)
```
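
The change itself amounts to pinning the package in `Dockerfile.rocm`; a hedged command-line equivalent:

```bash
# Pin tensordict so its memmap module no longer imports the removed
# torch.multiprocessing.reductions.ForkingPickler symbol.
pip install "tensordict==0.6.2"
```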

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [x] Rely on existing unit tests on CI that cover the code path.

Signed-off-by: Vicky Tsang <vtsang@amd.com>
2025-06-07 08:09:12 +08:00
07897f84e5 [AMD] fix: Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES (Fix AMD support) (#1465)
### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for `RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES` and fix AMD support.

### High-Level Design

The current approach for supporting AMD in verl is fundamentally incorrect and has only been working by luck:

Calls such as `torch.cuda.is_available()` or `torch.cuda.get_device_name()` will initialize the CUDA/ROCm environment:

c65ee728f0/torch/cuda/__init__.py (L342-L392)

Setting `CUDA/HIP/ROCR_VISIBLE_DEVICES` after CUDA/ROCm is initialized will not take effect (see https://github.com/pytorch/pytorch/issues/141678), which means that all the current code wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.

`CUDA_VISIBLE_DEVICES` also works for AMD, but only because a lot of AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- https://github.com/ROCm/TransformerEngine/pull/183
- https://github.com/vllm-project/vllm/pull/15246

Meanwhile, ray/vllm manipulate those `*_VISIBLE_DEVICES` variables at runtime, so those `torch.cuda.*` calls poison the current process whenever the CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution here is to use a single environment variable (`CUDA_VISIBLE_DEVICES`) for consistency and hardware-agnosticism, folding all the other `*_VISIBLE_DEVICES` variables into the CUDA one. Note that we must pay attention when both the HIP/CUDA and ROCR env vars are set, as they have different meanings: both accept either a list of ints or a list of UUIDs, but the ROCR variable is processed first, which then reduces the set of GPUs that HIP can select from (see https://github.com/pytorch/pytorch/pull/144026). To avoid this complexity, we simply raise an error if both are set (also keeping consistency with ray's practice as of 2.45.0).

For the poisoning issue, before those two PRs are merged, we need to ask users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer manipulates these variables and verl stays workable when no `*_VISIBLE_DEVICES` is set.

Note that for the latest ray (after its switch to `HIP_VISIBLE_DEVICES`), we also need this patch: https://github.com/ray-project/ray/pull/52794
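
To make the intended behavior concrete, here is a minimal shell sketch of the consolidation described above (hedged: the variable handling is paraphrased from this PR's description, and the hard error on conflicting variables mirrors ray 2.45.0's practice):

```bash
# HIP_* and ROCR_* have different semantics, so refuse ambiguous setups,
# consistent with ray 2.45.0's behavior.
if [ -n "${HIP_VISIBLE_DEVICES:-}" ] && [ -n "${ROCR_VISIBLE_DEVICES:-}" ]; then
  echo "Error: both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set" >&2
  exit 1
fi

# Fold the AMD-specific variable into the hardware-agnostic CUDA one before
# anything calls torch.cuda.* and freezes the visible-device list.
export CUDA_VISIBLE_DEVICES="${HIP_VISIBLE_DEVICES:-${ROCR_VISIBLE_DEVICES:-}}"
unset HIP_VISIBLE_DEVICES ROCR_VISIBLE_DEVICES

# Until the upstream fixes land, stop ray from rewriting these variables.
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
```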

### Test

Tested manually on both megatron and fsdp backends with vllm.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-06-02 10:12:45 -07:00
4de247fe4d [sglang] refactor: Unify async rollout under SGLangRollout, and support sglang==0.4.6.post5 (#1717)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

- Unify the functionality of SGLangRollout and AsyncSGLangRollout: remove the original SGLangRollout and rename AsyncSGLangRollout to SGLangRollout.
- Make trivial changes required by sglang==0.4.6.post5.

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: zyzshishui <@qq.com>
Co-authored-by: Xiang Long <mindsculptor@yeah.net>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: H <linhaibin.eric@gmail.com>
2025-05-31 19:47:25 -07:00
8160ec6a58 Bump to sglang 0.4.6.post4 & unified generate sequences ability between sgl and sgl async (#1577)
### Checklist Before Starting

- [x] Search for similar PR(s).
- Thanks to:
  - closes #1558 (a mix of PRs)
  - closes #1449 (a partial fix for the sglang new-version issue)
  - closes #1300 (part of the current PR)
- This PR is co-authored with @ocss884

### What does this PR do?

- bump sglang to 0.4.6.post4
- unify the sglang and sglang_async `generate_sequences` API behavior, e.g. image support
- fix the CUDA-barrier warning at the start of fsdp_workers

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: ocss884 <ocss.lin@gmail.com>
2025-05-20 09:39:07 +08:00
6b8706cd4f [Hardware] Support AMD (ROCm Kernel) - hardware-agnostic (remove the redundant code) (#1453)
### Checklist Before Starting

- [X] Search for similar PR(s): [PR#1369](https://github.com/volcengine/verl/pull/1369), [issue#1448](https://github.com/volcengine/verl/issues/1448)

### What does this PR do?

- Completes [issue#1448](https://github.com/volcengine/verl/issues/1448)

### High-Level Design

- New PR for hardware-agnostic sglang rollout

### Specific Changes

- `verl/workers/rollout/sglang_rollout/async_sglang_rollout.py`
- `verl/workers/rollout/sglang_rollout/sglang_rollout.py`

> We've already submitted the PR to ray, released in `ray>=2.45`. In that version, a hardware-agnostic rollout implementation is already supported within the verl codebase; you just need to assign `HIP_VISIBLE_DEVICES` in the training script. Thus, I discarded the patch I had added to the verl codebase last time.

### Usage Example


[amd_tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst)
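
For instance, a minimal sketch of that assignment in a training script (GPU indices and the single override shown are illustrative):

```bash
# With ray>=2.45, selecting AMD GPUs only needs the HIP variable.
export HIP_VISIBLE_DEVICES=0,1,2,3
python3 -m verl.trainer.main_ppo trainer.n_gpus_per_node=4  # plus the usual config overrides
```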

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
2025-05-09 09:22:34 -07:00
76084d36cb [AMD] upgrade: Upgrade dockerfile and verl codebase (#1369)
## Checklist Before Starting

- [x] Search for similar PR(s). 

## What does this PR do?

1. Base Docker Image: Upgraded the base sglang docker to
`lmsysorg/sglang:v0.4.6.post1-rocm630` along with `torch_memory_saver
(hip version)`, which resolves the ROCm/aiter compatibility
[issue](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/amd-verl-dev/dev.md).

2. vLLM-0.6.3 Rollout Fix: Adjusted the rollout logic so that the latest VeRL upstream codebase remains compatible with both `vLLM versions ≤ 0.6.3` (sync mechanism) and `vLLM versions > 0.6.3` (async mechanism).

3. Update the ray version to [2.45.0](https://github.com/ray-project/ray/releases/tag/ray-2.45.0) ([PR#52794](https://github.com/ray-project/ray/pull/52794)) and support `ray>=2.45.0` within verl, resolving [verl-issue#1399](https://github.com/volcengine/verl/issues/1399). (A hedged pull/pin sketch follows the to-do list below.)

- [To-do-1] 3rd-party lib `torch_memory_saver`: the ROCm virtual-memory-allocator issue should be resolved within the [HIP version](https://github.com/fzyzcjy/torch_memory_saver/issues/9).
- [To-do-2] New PR for hardware-agnostic vllm/sglang rollout.
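
For reference, a hedged sketch of picking up the upgraded pieces named above:

```bash
docker pull lmsysorg/sglang:v0.4.6.post1-rocm630   # upgraded base image
pip install "ray>=2.45.0"                          # release containing PR#52794
```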


## Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting)
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
2025-05-06 18:06:05 -07:00
b0e3f1361e [AMD] docker: Support AMD (ROCm Kernel) - Support SGLang (#1179)
[Done]
- Update the Dockerfile and Apptainer file to support the SGLang engines
- Add the 3rd-party [torch_memory_saver](https://github.com/ExtremeViscent/torch_memory_saver) within the Dockerfile in the ROCm version
2025-04-20 12:51:10 -07:00
4a291fa760 [Hardware] Support AMD (ROCm kernel) (#360) 2025-03-06 13:56:20 +08:00