4 Commits

Author SHA1 Message Date
1cfa2be530 [Megatron][BREAKING] Allow override of transformer config to enable custom megatron features like variable PP layers distribution, with CI tests (#1555)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Allow overriding the transformer config to enable custom Megatron
features such as variable PP layer distribution, with CI tests. This is
needed for larger MoE models with 94 layers (Qwen3 MoE) or 61 layers
(DeepSeek V3).

We will first fix the e2e_prime CI by using fused kernels.

**Note that the imbalanced PP layer distribution is currently only
compatible with dist_ckpt load and save; direct Hugging Face load/save
is not supported.**

Other Megatron arguments can also be passed through scripts.

### API

Breaking APIs:

```py
class MegatronWorker(Worker):
    def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):
        ...  # body omitted; only the signature is shown

# ...and the model-building code
```

```yaml
  actor:
    megatron:
      override_transformer_config: {} # common transformer config for all models
```

To avoid entering the same transformer config arguments repeatedly, the
other models reuse the actor's config, so it only needs to be specified
once.
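
The reuse described above can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `ToyTransformerConfig`, `apply_overrides`, and the role names are hypothetical stand-ins, and the real Megatron transformer config has many more fields.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class ToyTransformerConfig:
    """Hypothetical stand-in for a Megatron transformer config."""

    num_layers: int
    num_layers_in_first_pipeline_stage: Optional[int] = None
    num_layers_in_last_pipeline_stage: Optional[int] = None


def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` applied; reject unknown keys."""
    known = {f.name for f in fields(base)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown transformer config keys: {sorted(unknown)}")
    return replace(base, **overrides)


# The overrides are written once, under the actor section of the config.
actor_overrides = {
    "num_layers_in_first_pipeline_stage": 13,
    "num_layers_in_last_pipeline_stage": 11,
}

# Every role receives the same overrides (role names are illustrative).
roles = ("actor", "critic", "ref")
configs = {
    role: apply_overrides(ToyTransformerConfig(num_layers=94), actor_overrides)
    for role in roles
}
```

Rejecting unknown keys is a design choice of this sketch; it surfaces typos in override names early instead of silently ignoring them.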

### Usage Example

```bash
run_ppo_trainer_megatron.sh \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```
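
To see why such overrides help, here is a small sketch of the layer arithmetic. It assumes that layers not pinned to the first or last stage are split evenly across the remaining stages; `pp_layer_layout` is a hypothetical helper, and the pipeline size of 7 is chosen only to make the numbers work, since the PR does not state one.

```python
def pp_layer_layout(num_layers, pp_size, first=None, last=None):
    """Return the number of layers on each pipeline stage.

    Assumption: layers not pinned to the first/last stage are divided
    evenly over the remaining stages; otherwise a ValueError is raised.
    """
    pinned = [n for n in (first, last) if n is not None]
    free_stages = pp_size - len(pinned)
    remaining = num_layers - sum(pinned)
    if free_stages < 0 or remaining < 0:
        raise ValueError("first/last stage sizes do not fit")
    if free_stages == 0:
        if remaining != 0:
            raise ValueError("layers left over with no stage to hold them")
        per_stage = 0
    elif remaining % free_stages != 0:
        raise ValueError(
            f"{remaining} layers cannot be split evenly over {free_stages} stages"
        )
    else:
        per_stage = remaining // free_stages

    layout = [per_stage] * pp_size
    if first is not None:
        layout[0] = first
    if last is not None:
        layout[-1] = last
    return layout


# 94 layers with 13 on the first stage and 11 on the last, 7 stages total.
layout = pp_layer_layout(94, 7, first=13, last=11)
```

Under these assumptions the five middle stages each hold 14 layers, whereas a uniform split of 94 layers over, say, 8 stages is not possible at all.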

### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
2025-05-22 13:38:34 +08:00
76084d36cb [AMD] upgrade: Upgrade dockerfile and verl codebase (#1369)
## Checklist Before Starting

- [x] Search for similar PR(s). 

## What does this PR do?

1. Base Docker Image: Upgraded the base sglang docker to
`lmsysorg/sglang:v0.4.6.post1-rocm630` along with `torch_memory_saver
(hip version)`, which resolves the ROCm/aiter compatibility
[issue](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/amd-verl-dev/dev.md).

2. vLLM-0.6.3 Rollout Fix: Adjusted the rollout logic so that the
latest VeRL upstream codebase remains compatible with both `vLLM
versions ≤ 0.6.3` (sync mechanism) and `vLLM versions >=
0.6.3` (async mechanism).

3. Updated the ray version to
[2.45.0](https://github.com/ray-project/ray/releases/tag/ray-2.45.0)
(see [PR#52794](https://github.com/ray-project/ray/pull/52794)) and
added support for `ray>=2.45.0` within verl, which resolves
[verl-issues#1399](https://github.com/volcengine/verl/issues/1399).
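
The version-dependent rollout choice in item 2 could be gated roughly as below. This is a sketch, not verl's actual code: `parse_version` and `rollout_mode` are hypothetical helpers, and because the description lists 0.6.3 on both sides, the sketch assumes that exactly 0.6.3 takes the sync path.

```python
import re


def parse_version(version):
    """Extract the leading numeric release, e.g. '0.6.3.post1' -> (0, 6, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def rollout_mode(vllm_version):
    """Pick the rollout mechanism for a given vLLM version.

    Assumption: versions up to and including 0.6.3 use the sync
    mechanism; anything newer uses the async mechanism.
    """
    return "sync" if parse_version(vllm_version) <= (0, 6, 3) else "async"


mode_old = rollout_mode("0.6.3")
mode_new = rollout_mode("0.8.5")
```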

- [To-do-1] 3rd party lib - `torch_memory_saver` - rocm virtual memory
allocator issue should be resolved within the [HIP
version](https://github.com/fzyzcjy/torch_memory_saver/issues/9).
- [To-do-2] New PR for hardware-agnostic vllm/sglang rollout.


## Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting)
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
2025-05-06 18:06:05 -07:00
b00f77d855 [dev] feat: immigrate from yapf & pylint to ruff based on pre-commit (#1010)
> [!WARNING]
> We are [migrating to `ruff` as the linter and formatter and
`pre-commit` as the managing
tool](https://github.com/volcengine/verl/pull/1010).
>
> If your branch is based on a previous commit using `yapf` and
`pylint`, simply merging might trigger overwhelming linting errors,
while **you are only expected to resolve ones in the files related to
your PR**.
>
> To resolve this issue, please try the following workaround to only
include the files you **really changed** in the PR:
>
> 1. In your branch, fix linting and format with `ruff`: `ruff check
--fix && ruff format`
> 2. Squash into a single commit in a new branch: `git reset --soft
$(git merge-base main HEAD) && git add -A && git commit -m "feat: ..."`
> 3. Merge with the latest main: `git merge origin/main`
> 4. Force push to your branch: `git push --force`

We add the reminder above to the documentation to tell contributors how
to avoid overwhelming linting errors.

### Motivation

According to the discussion in #896, this PR migrates from yapf & pylint
to ruff based on pre-commit, which allows unified version control and
automatic hooks on commit.

### Summary

The `pre-commit` hook and CI

- check staged / committed files in commits / PRs
- check all files each month (this should fail until we fix all the
files to the ruff standard)

### Explanation for the Failing CI Workflow `pre-commit`

For now, we only apply `ruff format` and `ruff check --fix` **without
resolving all the errors**, since there are too many errors to resolve,
which causes the CI workflow `pre-commit` to fail.

We leave the remaining errors to future commits.
Specifically, the `pre-commit` hook and CI will require every commit to
fix its related files with `ruff`, so that all the files are fixed
incrementally.

### Reviewing Suggestion

The commit
3d93f51ba8
is huge since we apply `ruff` to all the files. To review the main
changes, please check the commits before and after it.
2025-04-18 07:49:31 -07:00
cc612dbae6 [dev] feat: default VSCode repo settings to help consistency with CI (#894)
This PR adds default VSCode repo settings to help stay consistent with
the CI. The settings:

1. enable the `pylint` linter extension
2. set the default formatter to `yapf`
3. do not organize imports for now (since we do not have this
functionality yet)
2025-04-04 03:36:47 +08:00