### What does this PR do?
> Update the resource in `Dockerfile.rocm`
### Checklist Before Starting
- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
```
docker build -f Dockerfile.rocm -t verl-rocm:local .
docker run --rm -it verl-rocm:local python -c "import torch; print('ok')"
```
### Design & Code Changes
> Update the resource in `Dockerfile.rocm`
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
### What does this PR do?
This PR enhances `Dockerfile.rocm` by generating a Python wheel (`.whl`)
as part of the Docker build process.
Changes introduced:
- Run `python setup.py bdist_wheel` immediately after `pip install -e . --no-deps`.
- The wheel is created inside the container under the `dist/` directory
(see the sketch below).
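A sketch of the resulting build steps as they run inside the image (ordering taken from the description above; not the verbatim Dockerfile contents):
```bash
# Install verl in editable mode without dependencies, then build the wheel.
pip install -e . --no-deps
python setup.py bdist_wheel   # writes the wheel into dist/
```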
Co-authored-by: HIREMATH <rhiremat@ctr2-alola-ctrl-01.amd.com>
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
Update the `tensordict` version to resolve the following PPO training error:
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=gae
data.train_files=/root/data/gsm8k/train.parquet
data.val_files=/root/data/gsm8k/test.parquet data.train_batch_size=256
data.max_prompt_length=512 data.max_response_length=512
data.return_raw_chat=True
actor_rollout_ref.model.path=/root/models/Qwen/Qwen2.5-0.5B
actor_rollout_ref.model.use_liger=True
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1
actor_rollout_ref.actor.ppo_mini_batch_size=128
actor_rollout_ref.actor.use_dynamic_bsz=False
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1
actor_rollout_ref.actor.fsdp_config.param_offload=False
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
actor_rollout_ref.actor.use_kl_loss=False
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.8
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=32768
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2
critic.optim.lr=1e-5 critic.ulysses_sequence_parallel_size=1
critic.model.use_remove_padding=True
critic.optim.lr_warmup_steps_ratio=0.05
critic.model.path=/root/models/Qwen/Qwen2.5-0.5B
critic.model.enable_gradient_checkpointing=False
critic.use_dynamic_bsz=False critic.ppo_max_token_len_per_gpu=32768
critic.ppo_micro_batch_size_per_gpu=2
critic.model.fsdp_config.param_offload=False
critic.model.fsdp_config.optimizer_offload=False
reward_model.enable=True reward_model.ulysses_sequence_parallel_size=1
reward_model.model.path=/root/models/Qwen/Qwen2.5-0.5B
reward_model.model.use_remove_padding=True
reward_model.model.fsdp_config.param_offload=True
reward_model.use_dynamic_bsz=False
reward_model.forward_max_token_len_per_gpu=32768
reward_model.micro_batch_size_per_gpu=2 algorithm.use_kl_in_reward=False
trainer.critic_warmup=0 'trainer.logger=[console]'
trainer.project_name=verl-test
trainer.experiment_name=qwen2.5-0.5b-model-reward-minimal
trainer.nnodes=1 trainer.n_gpus_per_node=8
trainer.val_before_train=False trainer.test_freq=False
trainer.save_freq=-1 trainer.resume_mode=disable trainer.total_epochs=2
trainer.total_training_steps=1
Traceback (most recent call last):
File "<frozen runpy>", line 189, in _run_module_as_main
File "<frozen runpy>", line 112, in _get_module_details
File "/sgl-workspace/verl/__init__.py", line 22, in <module>
from .protocol import DataProto
File "/sgl-workspace/verl/protocol.py", line 30, in <module>
import tensordict
File "/usr/local/lib/python3.12/dist-packages/tensordict/__init__.py",
line 6, in <module>
import tensordict._reductions
File
"/usr/local/lib/python3.12/dist-packages/tensordict/_reductions.py",
line 11, in <module>
from tensordict._lazy import LazyStackedTensorDict
File "/usr/local/lib/python3.12/dist-packages/tensordict/_lazy.py", line
38, in <module>
from tensordict.memmap import MemoryMappedTensor
File "/usr/local/lib/python3.12/dist-packages/tensordict/memmap.py",
line 25, in <module>
from torch.multiprocessing.reductions import ForkingPickler
ImportError: cannot import name 'ForkingPickler' from
'torch.multiprocessing.reductions'
(/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py)
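A quick way to verify the fix (a hedged check, not the CI test; the exact version pin comes from this PR's diff):
```bash
# Upgrade tensordict to a release compatible with the installed torch,
# then confirm that the import that previously failed now succeeds.
pip install --upgrade tensordict
python -c "import tensordict, torch; print(tensordict.__version__, torch.__version__)"
```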
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [x] Rely on existing unit tests on CI that cover the code path.
Signed-off-by: Vicky Tsang <vtsang@amd.com>
### Checklist Before Starting
- [X] Search for similar PR(s).
### What does this PR do?
Add support for `RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`, and fix AMD
support.
### High-Level Design
The current approach for supporting AMD in verl is fundamentally
incorrect and only works by luck:
Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:
https://github.com/pytorch/pytorch/blob/c65ee728f0/torch/cuda/__init__.py#L342-L392
Setting `CUDA`/`HIP`/`ROCR_VISIBLE_DEVICES` after CUDA/ROCm has been
initialized will not take effect (please check
https://github.com/pytorch/pytorch/issues/141678), which means that all
the current code wrapped inside `[SUPPORT AMD: torch]` is mostly a
no-op, as the sketch below demonstrates.
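A minimal demonstration of the ordering problem (hedged: assumes a CUDA/ROCm machine with more than one GPU):
```bash
python - <<'EOF'
import os
import torch

torch.cuda.is_available()                 # initializes the CUDA/ROCm context
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # too late: the runtime is already up
print(torch.cuda.device_count())          # still reports all GPUs, not just 1
EOF
```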
`CUDA_VISIBLE_DEVICES` also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` at import time, e.g.:
- https://github.com/ROCm/TransformerEngine/pull/183
- https://github.com/vllm-project/vllm/pull/15246

Meanwhile, ray/vllm manipulate those `*_VISIBLE_DEVICES` variables at
runtime, which causes those `torch.cuda.*` calls to poison the current
process whenever the CUDA/ROCm environment is initialized before the
manipulation happens.
So it would be a good solution to use a single environment variable
(`CUDA_VISIBLE_DEVICES`) for consistency and hardware-agnosticism,
folding all the other `*_VISIBLE_DEVICES` variables into the CUDA one
(see the sketch after this paragraph). Note that we must pay attention
when both the HIP/CUDA and ROCR environment variables are set, as they
have different meanings: both accept either a list of ints or a list of
UUIDs, but the ROCR variable is processed first, which then reduces the
set of GPUs that HIP can select from (see
https://github.com/pytorch/pytorch/pull/144026). To avoid this
complexity, we simply raise an error if both are set (this also keeps
consistency with ray's practice as of 2.45.0).
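An illustrative sketch of the folding logic (not verl's actual implementation; shell form chosen for brevity):
```bash
# Fold the ROCm device lists into CUDA_VISIBLE_DEVICES, erroring out
# if both are set (consistent with ray's behavior as of 2.45.0).
if [ -n "${HIP_VISIBLE_DEVICES:-}" ] && [ -n "${ROCR_VISIBLE_DEVICES:-}" ]; then
    echo "Set only one of HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES" >&2
    exit 1
fi
if [ -n "${HIP_VISIBLE_DEVICES:-}${ROCR_VISIBLE_DEVICES:-}" ]; then
    export CUDA_VISIBLE_DEVICES="${HIP_VISIBLE_DEVICES:-${ROCR_VISIBLE_DEVICES:-}}"
    unset HIP_VISIBLE_DEVICES ROCR_VISIBLE_DEVICES
fi
```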
For the poisoning issue, until those two PRs are merged, we will need to
ask users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES` (see the example below), so
that ray no longer manipulates these variables, and to make verl work
when no `*_VISIBLE_DEVICES` is set.
Note that for the latest ray (after its switch to
`HIP_VISIBLE_DEVICES`), we also need this patch:
https://github.com/ray-project/ray/pull/52794
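For example (hedged: which variable to set depends on the ray version in use):
```bash
# Stop ray from rewriting the ROCm device lists before starting the cluster.
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1   # or the HIP variant
ray start --head
```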
### Test
Tested manually on both the Megatron and FSDP backends with vLLM.
### Additional Info.
- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang
### Checklist Before Submitting
- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
- Unify the functionality of SGLangRollout and AsyncSGLangRollout:
remove the original SGLangRollout and rename AsyncSGLangRollout to
SGLangRollout.
- Make trivial changes required by modifications in sglang==0.4.6.post5.
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```python
# Add code snippet or script demonstrating how to use this
```
### Test
> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]
### Checklist Before Submitting
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
---------
Co-authored-by: zyzshishui <@qq.com>
Co-authored-by: Xiang Long <mindsculptor@yeah.net>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: H <linhaibin.eric@gmail.com>
### Checklist Before Starting
- [x] Search for similar PR(s).
- Thanks to:
  - Close #1558, which mixed several PRs together.
  - Close #1449, which partially fixed the new sglang version issue.
  - Close #1300, which is part of the current PR.
- This PR is co-authored with @ocss884.
### What does this PR do?
> Add one-line overview of what this PR aims to achieve or accomplish.
- Bump sglang to 0.4.6.post4.
- Unify the `generate_sequences` API behavior of sglang and
sglang_async, e.g. image support.
- Fix the warning for the CUDA barrier at the start of fsdp_workers.
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
---------
Co-authored-by: ocss884 <ocss.lin@gmail.com>
[Done]
- Update the Dockerfile and Apptainer file to support the SGLang
engines.
- Add the third-party
[torch_memory_saver](https://github.com/ExtremeViscent/torch_memory_saver)
to the ROCm version of the Dockerfile (see the sketch below).
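A hedged sketch of how the dependency might be added inside the ROCm image (installation method assumed; see the repository for the exact instructions used):
```bash
# Build and install torch_memory_saver from source inside the ROCm image.
pip install git+https://github.com/ExtremeViscent/torch_memory_saver.git
```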