[BREAKING] [perf] refactor: Profiler api refactor (#2894)

### What does this PR do?

Refactor the profiler configuration and its CI into a unified scheme.

TODO:

- nsys: use `save_path`
- nsys: discrete tests are disabled
- torch profiler

cc: @davidmlw 

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```
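The `_target_` keys let a config tree like the one above be instantiated into typed dataclass objects. The following is a minimal, non-recursive sketch of how such resolution could work; the `ProfilerConfig` dataclass and `instantiate` helper here are illustrative stand-ins, not verl's actual implementation:

```python
from dataclasses import dataclass, field
from importlib import import_module
from typing import Optional

@dataclass
class ProfilerConfig:
    # Illustrative stand-in mirroring the global_profiler YAML fields above.
    tool: Optional[str] = None
    steps: Optional[list] = None
    profile_continuous_steps: bool = False
    save_path: str = "outputs/profile"
    tool_config: dict = field(default_factory=dict)

def instantiate(cfg: dict):
    """Pop `_target_`, import the dotted class path, and construct
    the dataclass from the remaining keys (no nested resolution)."""
    cfg = dict(cfg)
    module_path, _, cls_name = cfg.pop("_target_").rpartition(".")
    cls = getattr(import_module(module_path), cls_name)
    return cls(**cfg)

# Build a config dict whose _target_ points at the local stand-in class.
cfg = {"_target_": f"{ProfilerConfig.__module__}.ProfilerConfig",
       "tool": "nsys", "steps": [1, 2]}
pc = instantiate(cfg)
```

In the real config the nested `tool_config` entries carry their own `_target_` and would be resolved recursively.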

Local profiler config:

```yaml
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # Profiling tool; defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # Whether to enable profiling for this role
  enable: False

  # Whether to profile all ranks
  all_ranks: False

  # The ranks that will be profiled: [] or [0,1,...]
  ranks: []

  # Path where profiling results are saved
  save_path: ${oc.select:global_profiler.save_path,null}

  # Tool-specific config; defaults to the global tool_config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```
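The `${oc.select:...}` interpolations above fall back to a default when the referenced global key is absent, which is what lets a role-local config inherit from `global_profiler`. A small pure-Python analogue of that lookup (illustrative only — the real behavior comes from OmegaConf's `oc.select` resolver):

```python
def select(cfg: dict, path: str, default=None):
    """Walk a dotted path through nested dicts, returning `default`
    when any segment is missing — the fallback behavior that
    `${oc.select:path,default}` provides in OmegaConf."""
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# A partial global config: `steps` is deliberately absent.
config = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
```

With this, `select(config, "global_profiler.tool")` yields `"nsys"`, while a missing key like `global_profiler.steps` resolves to the default instead of raising.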

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
---

Commit 545f899844 (parent 287ef7e262), authored by Blue Space on 2025-08-11 09:52:41 +08:00, committed by GitHub.
41 changed files with 1005 additions and 694 deletions.

---

**.gitignore** (vendored)

```diff
@@ -59,6 +59,7 @@ coverage.xml
 *,cover
 .hypothesis/
 pytest.ini
+output.txt

 # Translations
 *.mo
```

---

**NPU profiling guide (Chinese doc, translated), updated version:**

@@ -8,107 +8,87 @@ Last updated: 07/24/2025.

Configuration
-------------

Use two levels of profile settings to control data collection:

- Global collection control: use the config items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps.
- Role profile control: use the config items in each role to control per-role parameters.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

Control the collection steps and mode via the parameters in ppo_trainer.yaml:

- profiler: controls the collected ranks and the collection mode

  - tool: the profiling tool to use; the options are nsys, npu, torch, and torch_memory.
  - steps: a list of the steps to collect, e.g. [2, 4] collects step 2 and step 4. If set to null, no collection occurs.
  - save_path: the path where collected data is saved. Defaults to "outputs/profile".

Control the specific collection behavior via the parameters in ``profiler.tool_config.npu``:

- level: the collection level; the options are level_none, level0, level1, and level2.

  - level_none: disables all level-based data collection (turns profiler_level off).
  - level0: collects high-level application data, low-level NPU data, and operator execution details on the NPU.
  - level1: extends level0 with CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: extends level1 with CANN-layer Runtime data and AI CPU metrics.

- contents: a list of options controlling the collected content, e.g. npu, cpu, memory, shapes, module, stack.

  - npu: whether to collect device-side performance data.
  - cpu: whether to collect host-side performance data.
  - memory: whether to enable memory analysis.
  - shapes: whether to record tensor shapes.
  - module: whether to record framework-layer Python call-stack information.
  - stack: whether to record operator call-stack information.

- analysis: enables automatic data parsing.

Role profile control
~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: whether to enable profiling for this role.
- all_ranks: whether to collect data from all ranks.
- ranks: the list of ranks to collect data from. If empty, no data is collected.
- tool_config: the configuration of the profiling tool used by this role.

Examples
--------

Disabling collection
~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: null # disable profile

End-to-end collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: [1, 2, 5]
     discrete: False
   actor_rollout_ref:
     actor:
       profile:
         enable: True
         all_ranks: True
   # rollout & ref follow actor settings

Discrete-mode collection
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     discrete: True

Visualization

---

**NPU profiling guide (English doc), updated version:**

@@ -9,10 +9,10 @@ based on FSDP on Ascend devices.

Configuration
-------------

Leverage two levels of configuration to control data collection:

1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -20,31 +20,17 @@ Global collection control

Use parameters in ppo_trainer.yaml to control the collection mode and steps.

- profiler: Control the ranks and mode of profiling

  - tool: The profiling tool to use; the options are nsys, npu, torch, and torch_memory.
  - steps: This parameter can be set as a list of collection steps, such as [2, 4], which means it will collect steps 2 and 4. If set to null, no collection occurs.
  - save_path: The path to save the collected data. Default is "outputs/profile".

Use parameters in ``profiler.tool_config.npu`` to control npu profiler behavior:

- level: Collection level; the options are level_none, level0, level1, and level2.

@@ -58,15 +44,31 @@ Use parameters in npu_profile.yaml to control collection behavior:

  - level2: Extends level1 by adding CANN-layer Runtime data and AI CPU metrics.

- contents: A list of options to control the collection content, such as npu, cpu, memory, shapes, module, stack.

  - npu: Whether to collect device-side performance data.
  - cpu: Whether to collect host-side performance data.
  - memory: Whether to enable memory analysis.
  - shapes: Whether to record tensor shapes.
  - module: Whether to record framework-layer Python call-stack information.
  - stack: Whether to record operator call-stack information.

- analysis: Enables automatic data parsing.

Role collection control
~~~~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: Whether to enable profiling for this role.
- all_ranks: Whether to collect data from all ranks.
- ranks: A list of ranks to collect data from. If empty, no data is collected.
- tool_config: Configuration for the profiling tool used by this role.

Examples
--------

@@ -75,20 +77,22 @@ Disabling collection

.. code:: yaml

   profiler:
     steps: null # disable profile

End-to-End collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: [1, 2, 5]
     discrete: False
   actor_rollout_ref:
     actor:
       profiler:
         enable: True
         all_ranks: True

Discrete Mode Collection
~~~~~~~~~~~~~~~~~~~~~~~~

@@ -96,30 +100,8 @@ Discrete Mode Collection

.. code:: yaml

   profiler:
     discrete: True

Visualization

---

**Nsight Systems profiling guide (markdown doc), updated version:**

@@ -16,31 +16,29 @@ Nsight Systems version is important, please reference `docker/Dockerfile.vllm.sg

verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed on any node in the cluster, a message is printed in the logging to indicate the controller process node hostname and process id.

In `profiler`, three new config entries control the profiler behaviors:

* **`profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5, and `null` means no profiling.
* **`profiler.profile_continuous_steps`**. If true, and `profiler.discrete==False`, then the continuous steps in `profiler.steps` will be combined into one database. For example, steps 1 and 2 above go into one database, and step 5 into another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.

Nsys options for controller nodes and worker nodes are configured in `trainer`:

* **`trainer.controller_nsight_options`**. This config group is for the single controller. All fields in this config group are sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
* **`trainer.worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group are sent to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can change `capture-range-end` with some accurate calculation or just leave it `null`.

### Worker process profiling

verl manages multiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_, _Reward_, which are implemented in different Worker classes. These workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:

* **`all_ranks` and `ranks`**. When `all_ranks` is set `True`, all ranks will be profiled; when set `False`, the ranks in `ranks` will be profiled. By default, verl profiles the whole training process in a series of `worker_process_<PID>.<RID>.nsys-rep` files, one per process rank. PID is the process ID; RID is the capture range ID.
* **`discrete`**. When set `False`, all the roles' actions in one training step are dumped into one database. When set `True`, the actions annotated by `DistProfiler.annotate` are dumped into a discrete database; in this case, each role's action occupies one `<RID>`.
* **`actor_rollout_ref`**. This Worker can be configured to contain at most 3 roles that execute together, so `actor_rollout_ref` has a `profiler` config that all the inside roles inherit.
* **verl collocate mode**. verl can combine two Worker subclasses into one Worker Actor. In this case, the user should make sure the combined Workers have a consistent `discrete` setting. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database either way.

### Where to find the profiling data

By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` on each node. According to the Ray manual, this default directory is not changeable; ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).

Some users may find this inconvenient, but it is understandable: Ray may start hundreds of processes, and saving the files in one central place would put heavy pressure on the network file system.

@@ -49,51 +47,40 @@

To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:

### Disable profiler

```yaml
profiler:
  steps: null # disable profile
```

### Enable profiler and one database for one training step

```yaml
profiler:
  steps: [1, 2, 5]
  discrete: False
actor_rollout_ref:
  actor:
    profile:
      enable: True
      all_ranks: True
  # rollout & ref follow actor settings
critic:
  profile:
    enable: True
    all_ranks: True
reward_model:
  profile:
    enable: True
    all_ranks: True
```

### Enable profiler and multiple databases for one training step

```yaml
profiler:
  steps: [1, 2, 5]
  discrete: True
```

## Profiling Output

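The step and rank gating described in the worker-profiling docs (`profiler.steps` plus each role's `all_ranks`/`ranks`) can be sketched as follows; `should_profile` is an illustrative helper, not verl's actual API:

```python
def should_profile(step, rank, steps=None, all_ranks=False, ranks=()):
    """Return True when this training step is in the configured step list
    and this rank is selected: `all_ranks` profiles everyone, otherwise
    only ranks listed in `ranks` are profiled."""
    if not steps or step not in steps:
        return False  # steps=null (or an empty list) disables profiling
    return all_ranks or rank in ranks
```

For example, with `steps=[1, 2, 5]` and `ranks=[0, 1]`, only steps 1, 2, and 5 on ranks 0 and 1 would trigger the profiler.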
---

**Megatron performance-tuning doc: the legacy torch-profiler section is removed:**

```diff
@@ -275,27 +275,6 @@ For the critic, you can include these parameters.
     critic.megatron.grad_offload=True \
     critic.megatron.optimizer_offload=True \
 
-Profiler
-^^^^^^^^
-
-The profiler is a tool that helps you understand the performance of your
-model. It can be used to profile the time spent on different operations
-and identify the bottlenecks. You can get more information from
-`torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.
-
-In verl, the profiler is currently only supported for the actor role in
-Megatron. You can set the begin step and end step to profile. Note that
-one step means one gradient update, and the profile result will be saved
-in the save_path. If you just want to profile on specific ranks, you can
-set profile_ranks; by default, it is [0].
-
-.. code:: python
-
-   actor_rollout_ref.actor.profile.use_profile=True \
-   actor_rollout_ref.actor.profile.profile_ranks=[0] \
-   actor_rollout_ref.actor.profile.step_start=0 \
-   actor_rollout_ref.actor.profile.step_end=1 \
-   actor_rollout_ref.actor.profile.save_path="./profile"
 
 Related MCore Document
 ----------------------
```

---

**GRPO NPU profiling example script, updated version:**

```shell
@@ -9,14 +9,8 @@
PROFILE_RANKS="[1,2]"

# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
@@ -28,20 +22,20 @@
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
@@ -51,16 +45,6 @@
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -70,5 +54,12 @@
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
```

---

**Second NPU profiling example script, updated version:**

```shell
@@ -8,12 +8,7 @@
DISCRETE=False

# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True

python3 -m verl.trainer.main_ppo \
@@ -28,15 +23,16 @@
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
@@ -48,15 +44,6 @@
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -66,5 +53,12 @@
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
```

View File

```diff
@@ -13,9 +13,9 @@ train_files=${train_files:-"$gsm8k_train_path"}
 test_files=${test_files:-"$gsm8k_test_path"}
 # Nsight profiling configuration
-PROFILE_STEPS="[1,2,5]" # or [] or null
+PROFILE_STEPS="[1]" # or [] or null
 PROFILE_RANKS_ALL=False # or True
-PROFILE_RANKS=[0,4,8,12]
+PROFILE_RANKS=[0,4]
 DISCRETE=True # or True
 python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer'\
@@ -34,30 +34,32 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
 actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
 actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2 \
 actor_rollout_ref.actor.use_kl_loss=False \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
+actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
 actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
 actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
-actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
-actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
-actor_rollout_ref.profiler.discrete=$DISCRETE \
 critic.optim.lr=1e-5 \
 critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
 critic.ppo_micro_batch_size_per_gpu=4 \
+critic.profiler.enable=True \
 critic.profiler.ranks=$PROFILE_RANKS \
 critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
-critic.profiler.discrete=$DISCRETE \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.logger='["console","wandb"]' \
 trainer.project_name='verl_ppo_gsm8k_math_examples' \
 trainer.experiment_name='deepseek_llm_7b_megatron' \
 trainer.n_gpus_per_node=8 \
-trainer.nnodes=2 \
+trainer.nnodes=1 \
 trainer.save_freq=-1 \
 trainer.test_freq=-1 \
 trainer.total_epochs=100 \
-trainer.total_training_steps=6 \
-trainer.profile_steps=$PROFILE_STEPS $@
+trainer.total_training_steps=1 \
+profiler.tool=nsys \
+profiler.steps=$PROFILE_STEPS \
+profiler.tool_config.nsys.discrete=$DISCRETE $@
```


```diff
@@ -10,8 +10,8 @@ test_files=${test_files:-"$gsm8k_test_path"}
 PROFILE_STEPS="[1,2,5]" # or [] or null
 PROFILE_RANKS_ALL=False # or True
-PROFILE_RANKS=[0,4,8,12]
+PROFILE_RANKS=[0,4]
-DISCRETE=False # or True
+DISCRETE=True # or True
 python3 -m verl.trainer.main_ppo \
 algorithm.adv_estimator=gae \
@@ -30,17 +30,17 @@ python3 -m verl.trainer.main_ppo \
 actor_rollout_ref.actor.ppo_mini_batch_size=512 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
 actor_rollout_ref.actor.use_dynamic_bsz=True \
-actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
+actor_rollout_ref.actor.ppo_max_token_len_per_gpu=12000 \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.actor.use_kl_loss=False \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
+actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
 actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
 actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=24000 \
-actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
-actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
-actor_rollout_ref.profiler.discrete=$DISCRETE \
 critic.optim.lr=1e-5 \
 critic.model.use_remove_padding=True \
 critic.model.path=Qwen/Qwen2-7B-Instruct \
@@ -50,9 +50,9 @@ python3 -m verl.trainer.main_ppo \
 critic.ppo_max_token_len_per_gpu=98304 \
 critic.model.fsdp_config.param_offload=False \
 critic.model.fsdp_config.optimizer_offload=False \
+critic.profiler.enable=True \
 critic.profiler.ranks=$PROFILE_RANKS \
 critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
-critic.profiler.discrete=$DISCRETE \
 reward_model.enable=True \
 reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
 reward_model.model.use_remove_padding=True \
@@ -60,9 +60,9 @@ python3 -m verl.trainer.main_ppo \
 reward_model.micro_batch_size_per_gpu=32 \
 reward_model.use_dynamic_bsz=True \
 reward_model.forward_max_token_len_per_gpu=98304 \
+reward_model.profiler.enable=True \
 reward_model.profiler.ranks=$PROFILE_RANKS \
 reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
-reward_model.profiler.discrete=$DISCRETE \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.logger='["console","wandb"]' \
@@ -70,10 +70,12 @@ python3 -m verl.trainer.main_ppo \
 trainer.experiment_name='qwen2-7b_hybrid_rm_bsz8k_p4k_r4k_seq_packing' \
 trainer.n_gpus_per_node=8 \
 trainer.val_before_train=False \
-trainer.nnodes=2 \
+trainer.nnodes=1 \
 trainer.save_freq=-1 \
 trainer.test_freq=-1 \
 trainer.total_epochs=15 \
 trainer.total_training_steps=6 \
-trainer.profile_continuous_steps=True \
-trainer.profile_steps=$PROFILE_STEPS $@
+profiler.profile_continuous_steps=True \
+profiler.tool=nsys \
+profiler.steps=$PROFILE_STEPS \
+profiler.tool_config.nsys.discrete=$DISCRETE $@
```


```diff
@@ -97,8 +97,8 @@ class RayDAPOTrainer(RayPPOTrainer):
         prev_step_profile = False
         curr_step_profile = (
-            self.global_steps in self.config.trainer.profile_steps
-            if self.config.trainer.profile_steps is not None
+            self.global_steps in self.config.global_profiler.steps
+            if self.config.global_profiler.steps is not None
             else False
         )
         next_step_profile = False
@@ -114,7 +114,7 @@ class RayDAPOTrainer(RayPPOTrainer):
         with marked_timer("start_profile", timing_raw):
             self._start_profiling(
                 not prev_step_profile and curr_step_profile
-                if self.config.trainer.profile_continuous_steps
+                if self.config.global_profiler.profile_continuous_steps
                 else curr_step_profile
             )
@@ -350,13 +350,13 @@ class RayDAPOTrainer(RayPPOTrainer):
         with marked_timer("stop_profile", timing_raw):
             next_step_profile = (
-                self.global_steps + 1 in self.config.trainer.profile_steps
-                if self.config.trainer.profile_steps is not None
+                self.global_steps + 1 in self.config.global_profiler.steps
+                if self.config.global_profiler.steps is not None
                 else False
             )
             self._stop_profiling(
                 curr_step_profile and not next_step_profile
-                if self.config.trainer.profile_continuous_steps
+                if self.config.global_profiler.profile_continuous_steps
                 else curr_step_profile
             )
         prev_step_profile = curr_step_profile
```
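The prev/curr/next bookkeeping above can be sketched as a standalone helper (a hypothetical function assuming 1-indexed steps, not part of the verl API):

```python
def profiling_actions(total_steps, profile_steps, continuous):
    """Decide, per training step, whether to start/stop the profiler.

    Mirrors the trainer loop above: with continuous=True, adjacent
    profiled steps share one capture window; otherwise each profiled
    step starts and stops its own capture.
    """
    profile_steps = profile_steps or []
    actions = {}
    prev = False
    for step in range(1, total_steps + 1):
        curr = step in profile_steps
        nxt = (step + 1) in profile_steps
        start = (not prev and curr) if continuous else curr
        stop = (curr and not nxt) if continuous else curr
        actions[step] = (start, stop)
        prev = curr
    return actions
```

With `steps=[1, 2, 5]` and `profile_continuous_steps=True`, steps 1-2 form a single capture window (start at 1, stop at 2) while step 5 gets its own.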


```diff
@@ -45,10 +45,13 @@ def run_ppo(config) -> None:
     if (
         is_cuda_available
-        and OmegaConf.select(config.trainer, "profile_steps") is not None
-        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+        and config.global_profiler.tool == "nsys"
+        and OmegaConf.select(config.global_profiler, "steps") is not None
+        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
     ):
-        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+        nsight_options = OmegaConf.to_container(
+            config.global_profiler.global_tool_config.nsys.controller_nsight_options
+        )
         runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
     else:
         runner = TaskRunner.remote()
```
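The new gating condition reads as a small predicate; this illustrative sketch uses plain values in place of the OmegaConf nodes:

```python
def wants_controller_nsight(tool, steps, cuda_available=True):
    """Attach Nsight to the Ray controller process only when CUDA is
    available, the nsys tool is selected, and profile steps are set."""
    return bool(
        cuda_available
        and tool == "nsys"
        and steps is not None
        and len(steps) > 0
    )
```

Note the behavioral change: before this PR, any non-empty `trainer.profile_steps` attached Nsight; now the `nsys` tool must be selected explicitly.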


```diff
@@ -38,6 +38,7 @@ from verl.utils.fsdp_utils import (
 )
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import get_generation_config, update_model_config
+from verl.utils.profiler import ProfilerConfig
 from verl.workers.fsdp_workers import ActorRolloutRefWorker as ARRWorker
 from verl.workers.fsdp_workers import CriticWorker
@@ -131,8 +132,17 @@ class RolloutWorker(ActorRolloutRefWorker):
         # We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
         # as they provides DictConfig-like interface
         # The benefit of creating the dataclass config is to perform validation during __post_init__
-        profiler_config = omega_conf_to_dataclass(config.rollout.get("profiler", {}))
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self._is_rollout = True
         self._is_actor = False
```
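The tool_config selection above boils down to a lookup keyed by the chosen tool. A minimal sketch, with plain dicts standing in for the OmegaConf nodes and the dataclass conversion:

```python
def select_tool_config(profiler_cfg):
    """Pick the per-tool config block matching profiler_cfg['tool'];
    unknown or unset tools yield no tool config."""
    tool = profiler_cfg.get("tool")
    if tool in ("npu", "nsys", "torch"):
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None
```

Keeping the per-tool options under `tool_config.<tool>` is what lets a single `profiler.tool=...` override switch backends without touching the rest of the worker config.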


```diff
@@ -51,10 +51,11 @@ def run_ppo(config) -> None:
     # Create a remote instance of the TaskRunner class, and
     # Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
     if (
-        OmegaConf.select(config.trainer, "profile_steps") is not None
-        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+        config.global_profiler.tool == "nsys"
+        and OmegaConf.select(config.global_profiler, "steps") is not None
+        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
     ):
-        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+        nsight_options = OmegaConf.to_container(config.global_profiler.tool_config.nsys.controller_nsight_options)
         runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
     else:
         runner = TaskRunner.remote()
```


```diff
@@ -213,7 +213,6 @@ class OneStepOffRayTrainer(RayPPOTrainer):
             self.role_worker_mapping[Role.RefPolicy],
             config=self.config.actor_rollout_ref,
             role="ref",
-            profile_option=self.config.trainer.npu_profile.options,
         )
         self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@@ -233,13 +232,13 @@ class OneStepOffRayTrainer(RayPPOTrainer):
         wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
         if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                 "worker_nsight_options must be set when profile_steps is set"
             )
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.trainer, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
             )
         for resource_pool, class_dict in self.resource_pool_to_cls.items():
@@ -391,8 +390,8 @@ class OneStepOffRayTrainer(RayPPOTrainer):
         while batch_data_future is not None:
             do_profile = (
-                self.global_steps in self.config.trainer.profile_steps
-                if self.config.trainer.profile_steps is not None
+                self.global_steps in self.config.global_profiler.steps
+                if self.config.global_profiler.steps is not None
                 else False
             )
             if do_profile:
```
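The kwargs assembly above can be sketched with plain dicts standing in for the OmegaConf nodes (a hypothetical helper, not the verl API):

```python
def build_profiler_wg_kwargs(global_profiler):
    """Assemble RayWorkerGroup kwargs from the global profiler section:
    the profile steps plus the mandatory worker-side Nsight options."""
    wg_kwargs = {}
    steps = global_profiler.get("steps")
    if steps is not None:
        nsight = global_profiler.get("worker_nsight_options")
        # Worker processes need their own Nsight launch options, so this
        # key is required whenever step-based profiling is requested.
        assert nsight is not None, "worker_nsight_options must be set when profile_steps is set"
        wg_kwargs["profile_steps"] = steps
        wg_kwargs["worker_nsight_options"] = dict(nsight)
    return wg_kwargs
```

The same pattern appears in `RayPPOTrainer`; only the config section it reads from changed (`trainer.*` to `global_profiler.*`).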


```diff
@@ -37,6 +37,14 @@ class TestConfigComparison(unittest.TestCase):
             "activations_checkpoint_method",
             "activations_checkpoint_granularity",
             "activations_checkpoint_num_layers",
+            "discrete",
+            "profiler",
+            "profile",
+            "use_profile",
+            "npu_profile",
+            "profile_steps",
+            "worker_nsight_options",
+            "controller_nsight_options",
         ]
     def _compare_configs_recursively(
```
def _compare_configs_recursively( def _compare_configs_recursively(


```diff
@@ -79,7 +79,7 @@ class TestPrintCfgCommand(unittest.TestCase):
         # Run the command
         result = subprocess.run(
-            ["python3", "scripts/print_cfg.py", "critic.profiler.discrete=True", "+critic.profiler.extra.any_key=val"],
+            ["python3", "scripts/print_cfg.py", "+critic.profiler.extra.any_key=val"],
             capture_output=True,
             text=True,
         )
@@ -90,7 +90,6 @@ class TestPrintCfgCommand(unittest.TestCase):
         # Verify the output contains expected config information
         self.assertIn("critic", result.stdout)
         self.assertIn("profiler", result.stdout)
-        self.assertIn("discrete=True", result.stdout)
         self.assertIn("extra={'any_key': 'val'}", result.stdout)
```


```diff
@@ -17,7 +17,7 @@ import unittest
 from unittest.mock import MagicMock, patch
 from verl.utils import omega_conf_to_dataclass
-from verl.utils.profiler import ProfilerConfig
+from verl.utils.profiler.config import NsightToolConfig, ProfilerConfig
 from verl.utils.profiler.nvtx_profile import NsightSystemsProfiler
@@ -29,26 +29,25 @@ class TestProfilerConfig(unittest.TestCase):
         with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
             cfg = compose(config_name="ppo_trainer")
+        arr = cfg.actor_rollout_ref
         for config in [
-            cfg.actor_rollout_ref.actor.profiler,
-            cfg.actor_rollout_ref.rollout.profiler,
-            cfg.actor_rollout_ref.ref.profiler,
             cfg.critic.profiler,
+            arr.profiler,
             cfg.reward_model.profiler,
         ]:
             profiler_config = omega_conf_to_dataclass(config)
-            self.assertEqual(profiler_config.discrete, config.discrete)
+            self.assertEqual(profiler_config.tool, config.tool)
+            self.assertEqual(profiler_config.enable, config.enable)
             self.assertEqual(profiler_config.all_ranks, config.all_ranks)
             self.assertEqual(profiler_config.ranks, config.ranks)
+            self.assertEqual(profiler_config.save_path, config.save_path)
+            self.assertEqual(profiler_config.ranks, config.ranks)
             assert isinstance(profiler_config, ProfilerConfig)
             with self.assertRaises(AttributeError):
                 _ = profiler_config.non_existing_key
             assert config.get("non_existing_key") == profiler_config.get("non_existing_key")
             assert config.get("non_existing_key", 1) == profiler_config.get("non_existing_key", 1)
-            assert config["discrete"] == profiler_config["discrete"]
-            from dataclasses import FrozenInstanceError
-            with self.assertRaises(FrozenInstanceError):
-                profiler_config.discrete = False
     def test_frozen_config(self):
         """Test that modifying frozen keys in ProfilerConfig raises exceptions."""
@@ -57,11 +56,7 @@ class TestProfilerConfig(unittest.TestCase):
         from verl.utils.profiler.config import ProfilerConfig
         # Create a new ProfilerConfig instance
-        config = ProfilerConfig(discrete=True, all_ranks=False, ranks=[0], extra={"key": "value"})
-        # Test direct attribute assignment
-        with self.assertRaises(FrozenInstanceError):
-            config.discrete = False
+        config = ProfilerConfig(all_ranks=False, ranks=[0], extra={"key": "value"})
         with self.assertRaises(FrozenInstanceError):
             config.all_ranks = True
@@ -69,10 +64,6 @@ class TestProfilerConfig(unittest.TestCase):
         with self.assertRaises(FrozenInstanceError):
             config.ranks = [1, 2, 3]
-        # Test dictionary-style assignment
-        with self.assertRaises(TypeError):
-            config["discrete"] = False
         with self.assertRaises(TypeError):
             config["all_ranks"] = True
@@ -90,20 +81,19 @@ class TestNsightSystemsProfiler(unittest.TestCase):
     Test Plan:
     1. Initialization: Verify profiler state after creation
     2. Basic Profiling: Test start/stop functionality
-    3. Discrete Mode: Test discrete profiling behavior
+    3. Discrete Mode: TODO: Test discrete profiling behavior
     4. Annotation: Test the annotate decorator in both normal and discrete modes
     5. Config Validation: Verify proper config initialization from OmegaConf
     """
     def setUp(self):
-        self.config = ProfilerConfig(all_ranks=True)
+        self.config = ProfilerConfig(enable=True, all_ranks=True)
         self.rank = 0
-        self.profiler = NsightSystemsProfiler(self.rank, self.config)
+        self.profiler = NsightSystemsProfiler(self.rank, self.config, tool_config=NsightToolConfig(discrete=False))
     def test_initialization(self):
         self.assertEqual(self.profiler.this_rank, True)
         self.assertEqual(self.profiler.this_step, False)
-        self.assertEqual(self.profiler.discrete, False)
     def test_start_stop_profiling(self):
         with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
@@ -117,18 +107,18 @@ class TestNsightSystemsProfiler(unittest.TestCase):
             self.assertFalse(self.profiler.this_step)
             mock_stop.assert_called_once()
-    def test_discrete_profiling(self):
-        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-        profiler = NsightSystemsProfiler(self.rank, discrete_config)
-        with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
-            profiler.start()
-            self.assertTrue(profiler.this_step)
-            mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode
-            profiler.stop()
-            self.assertFalse(profiler.this_step)
-            mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
+    # def test_discrete_profiling(self):
+    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)
+    #     with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
+    #         profiler.start()
+    #         self.assertTrue(profiler.this_step)
+    #         mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode
+    #         profiler.stop()
+    #         self.assertFalse(profiler.this_step)
+    #         mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
     def test_annotate_decorator(self):
         mock_self = MagicMock()
@@ -152,29 +142,29 @@ class TestNsightSystemsProfiler(unittest.TestCase):
             mock_start.assert_not_called()  # Not discrete mode
             mock_stop.assert_not_called()  # Not discrete mode
-    def test_annotate_discrete_mode(self):
-        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-        profiler = NsightSystemsProfiler(self.rank, discrete_config)
-        mock_self = MagicMock()
-        mock_self.profiler = profiler
-        mock_self.profiler.this_step = True
-        @NsightSystemsProfiler.annotate(message="test")
-        def test_func(self, *args, **kwargs):
-            return "result"
-        with (
-            patch("torch.cuda.profiler.start") as mock_start,
-            patch("torch.cuda.profiler.stop") as mock_stop,
-            patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
-            patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
-        ):
-            result = test_func(mock_self)
-            self.assertEqual(result, "result")
-            mock_start_range.assert_called_once()
-            mock_end_range.assert_called_once()
-            mock_start.assert_called_once()  # Should start in discrete mode
-            mock_stop.assert_called_once()  # Should stop in discrete mode
+    # def test_annotate_discrete_mode(self):
+    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)
+    #     mock_self = MagicMock()
+    #     mock_self.profiler = profiler
+    #     mock_self.profiler.this_step = True
+    #     @NsightSystemsProfiler.annotate(message="test")
+    #     def test_func(self, *args, **kwargs):
+    #         return "result"
+    #     with (
+    #         patch("torch.cuda.profiler.start") as mock_start,
+    #         patch("torch.cuda.profiler.stop") as mock_stop,
+    #         patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
+    #         patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
+    #     ):
+    #         result = test_func(mock_self)
+    #         self.assertEqual(result, "result")
+    #         mock_start_range.assert_called_once()
+    #         mock_end_range.assert_called_once()
+    #         mock_start.assert_called_once()  # Should start in discrete mode
+    #         mock_stop.assert_called_once()  # Should stop in discrete mode
 if __name__ == "__main__":
```


```diff
@@ -184,29 +184,26 @@ class TestCriticConfig:
         optim = OptimizerConfig(lr=0.1)
         critic_config = CriticConfig(ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim)
         assert isinstance(critic_config.profiler, ProfilerConfig)
-        assert critic_config.profiler.discrete is False
         assert critic_config.profiler.all_ranks is False
         assert critic_config.profiler.ranks == []
-        custom_profiler = ProfilerConfig(discrete=True, all_ranks=True, ranks=[0, 1])
+        custom_profiler = ProfilerConfig(all_ranks=True, ranks=[0, 1])
         critic_config_custom = CriticConfig(
             profiler=custom_profiler, ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim
         )
         assert isinstance(critic_config_custom.profiler, ProfilerConfig)
-        assert critic_config_custom.profiler.discrete is True
         assert critic_config_custom.profiler.all_ranks is True
         assert critic_config_custom.profiler.ranks == [0, 1]
-        profiler1 = ProfilerConfig(discrete=True, ranks=[0, 1])
+        profiler1 = ProfilerConfig(enable=True, ranks=[0, 1])
         profiler2 = ProfilerConfig(all_ranks=True, ranks=[1, 2])
         union_result = profiler1.union(profiler2)
-        assert union_result.discrete is True
+        assert union_result.enable is True
         assert union_result.all_ranks is True
         assert set(union_result.ranks) == {0, 1, 2}
         intersect_result = profiler1.intersect(profiler2)
-        assert intersect_result.discrete is False
         assert intersect_result.all_ranks is False
         assert intersect_result.ranks == [1]
```


```diff
@@ -59,6 +59,25 @@ actor_rollout_ref:
       use_checkpoint_opt_param_scheduler: false
       override_optimizer_config: {}
     use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: false
+      all_ranks: false
+      ranks: []
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config:
+        nsys:
+          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+        npu:
+          _target_: verl.utils.profiler.config.NPUToolConfig
+          contents: []
+          level: level1
+          analysis: true
+        torch:
+          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+          step_start: 0
+          step_end: null
     data_loader_seed: null
     load_weight: true
     megatron:
@@ -85,12 +104,6 @@ actor_rollout_ref:
       recompute_method: null
       recompute_num_layers: null
       use_mbridge: false
-      profile:
-        use_profile: false
-        profile_ranks: null
-        step_start: -1
-        step_end: -1
-        save_path: null
   ref:
     strategy: megatron
     use_torch_compile: ${oc.select:actor_rollout_ref.actor.use_torch_compile,true}
@@ -98,6 +111,14 @@ actor_rollout_ref:
     log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
     log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
     megatron:
       _target_: verl.workers.config.MegatronEngineConfig
       param_offload: false
@@ -114,12 +135,6 @@ actor_rollout_ref:
       seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
      override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
       use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
-      profile:
-        use_profile: false
-        profile_ranks: null
-        step_start: -1
-        step_end: -1
-        save_path: null
     load_weight: true
   rollout:
     name: ???
@@ -184,6 +199,14 @@ actor_rollout_ref:
     token2text: false
     skip_rollout: false
     skip_dump_dir: /tmp/rollout_dump
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
     enable_chunked_prefill: false
     load_format: dummy_megatron
     layer_name_map:
@@ -201,63 +224,6 @@ actor_rollout_ref:
     freeze_moe_router: false
     use_fused_kernels: false
     trust_remote_code: false
-  profiler:
-    _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
-    all_ranks: false
-    ranks: []
-trainer:
-  npu_profile:
-    options:
-      save_path: ./profiler_data
-      roles:
-      - all
-      level: level1
-      with_memory: false
-      record_shapes: false
-      with_npu: true
-      with_cpu: true
-      with_module: false
-      with_stack: false
-      analysis: true
-  balance_batch: true
-  total_epochs: 30
-  total_training_steps: null
-  profile_steps: null
-  profile_continuous_steps: false
-  project_name: verl_examples
-  experiment_name: gsm8k
-  logger:
-  - console
-  - wandb
-  log_val_generations: 0
-  nnodes: 1
-  n_gpus_per_node: 8
-  save_freq: -1
-  esi_redundant_time: 0
-  resume_mode: auto
-  resume_from_path: null
-  del_local_ckpt_after_load: false
-  val_before_train: true
-  test_freq: -1
-  critic_warmup: 0
-  default_hdfs_dir: null
-  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
-  max_actor_ckpt_to_keep: null
-  max_critic_ckpt_to_keep: null
-  ray_wait_register_center_timeout: 300
-  device: cuda
-  controller_nsight_options:
-    trace: cuda,nvtx,cublas,ucx
-    cuda-memory-usage: 'true'
-    cuda-graph-trace: graph
-  worker_nsight_options:
-    trace: cuda,nvtx,cublas,ucx
-    cuda-memory-usage: 'true'
-    cuda-graph-trace: graph
-    capture-range: cudaProfilerApi
-    capture-range-end: null
-    kill: none
 data:
   tokenizer: null
   use_shm: false
@@ -344,9 +310,12 @@ critic:
       async_save: false
   profiler:
     _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
+    tool: ${oc.select:global_profiler.tool,null}
+    enable: false
     all_ranks: false
     ranks: []
+    save_path: ${oc.select:global_profiler.save_path,null}
+    tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
   nccl_timeout: 600
   megatron:
     _target_: verl.workers.config.McoreEngineConfig
@@ -390,9 +359,12 @@ reward_model:
     memory_limit_mb: 1024
   profiler:
     _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
+    tool: ${oc.select:global_profiler.tool,null}
+    enable: false
     all_ranks: false
     ranks: []
+    save_path: ${oc.select:global_profiler.save_path,null}
+    tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
```
nccl_timeout: 600 nccl_timeout: 600
megatron: megatron:
_target_: verl.workers.config.MegatronEngineConfig _target_: verl.workers.config.MegatronEngineConfig
@ -432,6 +404,52 @@ algorithm:
pf_ppo: pf_ppo:
reweight_method: pow reweight_method: pow
weight_pow: 2.0 weight_pow: 2.0
trainer:
balance_batch: true
total_epochs: 30
total_training_steps: null
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
del_local_ckpt_after_load: false
val_before_train: true
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
global_profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: null
steps: null
profile_continuous_steps: false
save_path: outputs/profile
global_tool_config:
nsys:
discrete: false
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
ray_init: ray_init:
num_cpus: null num_cpus: null
timeline_json_file: null timeline_json_file: null

View File

@@ -51,6 +51,25 @@ actor_rollout_ref:
num_cycles: 0.5
warmup_style: constant
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
+all_ranks: false
+ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config:
+nsys:
+discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+npu:
+_target_: verl.utils.profiler.config.NPUToolConfig
+contents: []
+level: level1
+analysis: true
+torch:
+_target_: verl.utils.profiler.config.TorchProfilerToolConfig
+step_start: 0
+step_end: null
grad_clip: 1.0
ulysses_sequence_parallel_size: 1
entropy_from_logits_with_chunking: false
@@ -73,6 +92,14 @@ actor_rollout_ref:
log_prob_micro_batch_size_per_gpu: null
log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
model: null
fsdp_config:
_target_: verl.workers.config.FSDPEngineConfig
@@ -147,6 +174,14 @@ actor_rollout_ref:
token2text: false
skip_rollout: false
skip_dump_dir: /tmp/rollout_dump
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
enable_chunked_prefill: true
load_format: dummy_dtensor
layered_summon: false
@@ -170,67 +205,6 @@ actor_rollout_ref:
fused_kernel_options:
impl_backend: torch
trust_remote_code: false
-profiler:
-_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
-all_ranks: false
-ranks: []
-trainer:
-npu_profile:
-options:
-save_path: ./profiler_data
-roles:
-- all
-level: level1
-with_memory: false
-record_shapes: false
-with_npu: true
-with_cpu: true
-with_module: false
-with_stack: false
-analysis: true
-balance_batch: true
-total_epochs: 30
-total_training_steps: null
-profile_steps: null
-profile_continuous_steps: false
-controller_nsight_options:
-trace: cuda,nvtx,cublas,ucx
-cuda-memory-usage: 'true'
-cuda-graph-trace: graph
-worker_nsight_options:
-trace: cuda,nvtx,cublas,ucx
-cuda-memory-usage: 'true'
-cuda-graph-trace: graph
-capture-range: cudaProfilerApi
-capture-range-end: null
-kill: none
-project_name: verl_examples
-experiment_name: gsm8k
-logger:
-- console
-- wandb
-log_val_generations: 0
-rollout_data_dir: null
-validation_data_dir: null
-nnodes: 1
-n_gpus_per_node: 8
-save_freq: -1
-esi_redundant_time: 0
-resume_mode: auto
-resume_from_path: null
-val_before_train: true
-val_only: false
-test_freq: -1
-critic_warmup: 0
-default_hdfs_dir: null
-del_local_ckpt_after_load: false
-default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
-max_actor_ckpt_to_keep: null
-max_critic_ckpt_to_keep: null
-ray_wait_register_center_timeout: 300
-device: cuda
-use_legacy_worker_impl: auto
data:
tokenizer: null
use_shm: false
@@ -322,9 +296,12 @@ critic:
async_save: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
all_ranks: false
ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
forward_micro_batch_size: ${oc.select:.ppo_micro_batch_size,null}
forward_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size_per_gpu,null}
ulysses_sequence_parallel_size: 1
@@ -361,9 +338,12 @@ reward_model:
memory_limit_mb: 1024
profiler:
_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
all_ranks: false
ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
ulysses_sequence_parallel_size: 1
custom_reward_function:
path: null
@@ -386,6 +366,57 @@ algorithm:
pf_ppo:
reweight_method: pow
weight_pow: 2.0
+trainer:
+balance_batch: true
+total_epochs: 30
+total_training_steps: null
+project_name: verl_examples
+experiment_name: gsm8k
+logger:
+- console
+- wandb
+log_val_generations: 0
+rollout_data_dir: null
+validation_data_dir: null
+nnodes: 1
+n_gpus_per_node: 8
+save_freq: -1
+esi_redundant_time: 0
+resume_mode: auto
+resume_from_path: null
+val_before_train: true
+val_only: false
+test_freq: -1
+critic_warmup: 0
+default_hdfs_dir: null
+del_local_ckpt_after_load: false
+default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
+max_actor_ckpt_to_keep: null
+max_critic_ckpt_to_keep: null
+ray_wait_register_center_timeout: 300
+device: cuda
+use_legacy_worker_impl: auto
+global_profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: null
+steps: null
+profile_continuous_steps: false
+save_path: outputs/profile
+global_tool_config:
+nsys:
+_target_: verl.utils.profiler.config.NsightToolConfig
+discrete: false
+controller_nsight_options:
+trace: cuda,nvtx,cublas,ucx
+cuda-memory-usage: 'true'
+cuda-graph-trace: graph
+worker_nsight_options:
+trace: cuda,nvtx,cublas,ucx
+cuda-memory-usage: 'true'
+cuda-graph-trace: graph
+capture-range: cudaProfilerApi
+capture-range-end: null
+kill: none
ray_init:
num_cpus: null
timeline_json_file: null

View File

@@ -128,3 +128,65 @@ optim:
# Whether to use custom fused kernels (e.g., FlashAttention, fused MLP)
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+# profile the actor model in `update_policy`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the actor
+enable: False
+# Whether to profile all ranks.
+all_ranks: False
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config:
+# nsys tool config
+nsys:
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+# npu config
+npu:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.NPUToolConfig
+# Contents to profile; can be empty
+# options: npu, cpu, memory, shapes, module, stack
+contents: []
+# Collection level; optional values: level_none, level0, level1, level2.
+level: "level1"
+# Whether to automatically parse the data.
+analysis: True
+# torch profiler config
+torch:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.TorchProfilerToolConfig
+# mini-batch index at which profiling starts during training
+# NOTICE: different from the global steps config, which refers to iterations;
+# this field refers only to mini-batches
+step_start: 0
+# mini-batch index at which profiling stops
+step_end: null
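The `${oc.select:key,default}` interpolations above make each role-local field fall back to a default when the referenced global key is missing. A minimal pure-Python mimic of that lookup, shown only to illustrate the fallback behavior (the real resolution is done by OmegaConf's `oc.select` resolver; the helper name is hypothetical):

```python
# Hypothetical stand-in for OmegaConf's ${oc.select:key,default} resolver.
def oc_select(cfg: dict, dotted_key: str, default=None):
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default  # key missing along the path -> use the default
        node = node[part]
    return node

cfg = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
oc_select(cfg, "global_profiler.tool")   # -> "nsys", inherited from the global config
oc_select(cfg, "global_profiler.steps")  # -> None, key absent so the default applies
```

This is why a worker-local `tool: ${oc.select:global_profiler.tool,null}` resolves to the global tool when one is configured and to null otherwise.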

View File

@@ -103,22 +103,4 @@ megatron:
recompute_num_layers: null
# oc.select: default val for ref.megatron.use_mbridge
use_mbridge: False
-# profile the actor model in `update_policy`
-profile:
-# turn it on when you want to profile the actor model
-use_profile: False
-# list, you can specify the ranks to profile
-profile_ranks: null
-# start step in update_policy
-step_start: -1
-# end step
-step_end: -1
-# the path to save the profile result
-save_path: null

View File

@@ -45,14 +45,12 @@ class ProfileConfig(BaseConfig):
The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.
Args:
-use_profile (bool): Whether to enable profiling.
profile_ranks (Optional[list[int]]): List of ranks to profile. None means all ranks.
step_start (int): Starting step for profiling.
step_end (int): Ending step for profiling.
save_path (Optional[str]): Path to save profiling results.
"""
-use_profile: bool = False
profile_ranks: Optional[list[int]] = None
step_start: int = -1
step_end: int = -1
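The legacy `use_profile` flag is dropped in favor of the per-role `profiler` blocks shown in the YAML diffs. As a rough sketch (assumption: field names are taken from those YAML blocks; the real class is `verl.utils.profiler.ProfilerConfig` and inherits from `BaseConfig`), the refactored per-role config looks roughly like:

```python
from dataclasses import dataclass, field
from typing import Optional

# Sketch only: mirrors the fields that appear in the new YAML profiler blocks.
@dataclass
class ProfilerConfigSketch:
    tool: Optional[str] = None   # one of: nsys, npu, torch
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)
    save_path: Optional[str] = None
    tool_config: Optional[dict] = None

cfg = ProfilerConfigSketch(enable=True, ranks=[0, 1])
```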

View File

@@ -95,18 +95,27 @@ checkpoint:
# Whether to save checkpoints asynchronously. Only effective for Megatron as of now.
async_save: False
-# profiler configs
-# the corresponding dataclass is verl.utils.profiler.ProfilerConfig.
+# profile the critic model in `update_policy`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the critic
+enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -4,8 +4,6 @@ defaults:
# <folder_name>@<field_name>.<field_name>: <yaml_file_name>
# actor_rollout_ref.actor: trainer/config/actor/megatron_actor.yaml
- actor@actor_rollout_ref.actor: megatron_actor
-# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
-- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
# load the reference default config, then apply the fields in the current yaml
@@ -57,12 +55,6 @@ actor_rollout_ref:
qkv_layer_name: qkv
gate_proj_layer_name: gate_up
-profiler:
-_target_: verl.utils.profiler.ProfilerConfig
-discrete: False
-all_ranks: False
-ranks: []
custom_reward_function:
path: null
name: compute_score
@@ -92,8 +84,6 @@ trainer:
balance_batch: True
total_epochs: 30
total_training_steps: null
-profile_steps: null # [1,2,5] or [] or null
-profile_continuous_steps: False
project_name: verl_examples
experiment_name: gsm8k
logger: ['console', 'wandb']
@@ -117,18 +107,62 @@ trainer:
# The timeout for ray worker group to wait for the register center to be ready
ray_wait_register_center_timeout: 300
device: cuda
-# see ppo_trainer.yaml for more details
-controller_nsight_options:
-trace: "cuda,nvtx,cublas,ucx"
-cuda-memory-usage: "true"
-cuda-graph-trace: "graph"
-worker_nsight_options:
-trace: "cuda,nvtx,cublas,ucx"
-cuda-memory-usage: "true"
-cuda-graph-trace: "graph"
-capture-range: "cudaProfilerApi"
-capture-range-end: null
-kill: none
+global_profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: null # choose between nsys, npu, torch
+steps: null # profile steps
+profile_continuous_steps: False
+save_path: "outputs/profile" # profiler saving path
+# Tool-specific configs; configure via +profiler.tool_config.[tool].xxx
+global_tool_config:
+# nsys config
+nsys:
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: False
+# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
+controller_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+worker_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
+capture-range: "cudaProfilerApi"
+# Specify the desired behavior when a capture range ends.
+# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
+# Valid values are "repeat-shutdown:n" or null.
+# For normal whole-step profiling, n = len(profile_steps);
+# for discrete profiling, n = len(profile_steps) * Number(subtasks).
+# Or leave it null and the program will use n = len(profile_steps) * 6.
+capture-range-end: null
+# Send a signal to the target application's process group. We let the program exit by itself.
+kill: none
ray_init:
num_cpus: null # `None` means using all CPUs, which might cause hang if limited in systems like SLURM. Please set to a number allowed then.
timeline_json_file: null

View File

@@ -11,9 +11,6 @@ defaults:
# actor_rollout_ref.actor: trainer/config/actor/dp_actor.yaml
- actor@actor_rollout_ref.actor: dp_actor
-# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
-- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
@@ -112,21 +109,6 @@ actor_rollout_ref:
# for huge model, layered summon can save memory (prevent OOM) but make it slower
layered_summon: False
-# profiler configs
-profiler:
-# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
-_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
-# Whether to profile all ranks.
-all_ranks: False
-# The ranks that will be profiled. [] or [0,1,...]
-ranks: []
# custom reward function definition
custom_reward_function:
@@ -203,54 +185,6 @@ trainer:
# Total training steps (can be set explicitly or derived from epochs)
total_training_steps: null
-# The steps that will be profiled. null means no profiling. null or [1,2,5,...]
-profile_steps: null
-# Whether to combine continuous steps into one database.
-## If True, worker.profiler.discrete must be False, [1,2] in one, [5] in another.
-## If False, [1] in one, [2] in another, [5] in another.
-profile_continuous_steps: False
-# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
-## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
-## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
-controller_nsight_options:
-# Select the API(s) to be traced.
-trace: "cuda,nvtx,cublas,ucx"
-# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
-cuda-memory-usage: "true"
-# CUDA graphs will be traced as a whole
-cuda-graph-trace: "graph"
-# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
-worker_nsight_options:
-# Select the API(s) to be traced.
-trace: "cuda,nvtx,cublas,ucx"
-# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
-cuda-memory-usage: "true"
-# CUDA graphs will be traced as a whole
-cuda-graph-trace: "graph"
-# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
-capture-range: "cudaProfilerApi"
-# Specify the desired behavior when a capture range ends.
-# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
-# Valid values are "repeat-shutdown:n" or null.
-# For normal whole-step profiling, n = len(profile_steps);
-# for discrete profiling, n = len(profile_steps) * Number(subtasks).
-# Or leave it null and the program will use n = len(profile_steps) * 6.
-capture-range-end: null
-# Send a signal to the target application's process group. We let the program exit by itself.
-kill: none
# Project name for experiment tracking (e.g., wandb)
project_name: verl_examples
@@ -331,6 +265,79 @@ trainer:
# mode: "auto", "enable", or "disable"
use_legacy_worker_impl: auto
+# profiler configs
+global_profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# Profiling tool: choose between nsys, npu, torch
+tool: null
+# profile steps
+steps: null
+# Whether to combine continuous steps into one database.
+## If True, worker.profiler.discrete must be False, [1,2] in one, [5] in another.
+## If False, [1] in one, [2] in another, [5] in another.
+profile_continuous_steps: False
+# Path to save profiling contents
+save_path: "outputs/profile"
+# Tool-specific configs; configure via +profiler.tool_config.[tool].xxx
+global_tool_config:
+# nsys config
+nsys:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.NsightToolConfig
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: False
+# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
+controller_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+worker_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
+capture-range: "cudaProfilerApi"
+# Specify the desired behavior when a capture range ends.
+# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
+# Valid values are "repeat-shutdown:n" or null.
+# For normal whole-step profiling, n = len(profile_steps);
+# for discrete profiling, n = len(profile_steps) * Number(subtasks).
+# Or leave it null and the program will use n = len(profile_steps) * 6.
+capture-range-end: null
+# Send a signal to the target application's process group. We let the program exit by itself.
+kill: none
# configs related to ray initialization
ray_init:
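The `capture-range-end` accounting described in the comments above (n = len(profile_steps) for whole-step profiling, multiplied by the number of nsys subtasks for discrete profiling, with 6 as the documented fallback multiplier) can be sketched as a small helper; the function is hypothetical, not part of verl:

```python
# Hypothetical helper deriving an nsys "repeat-shutdown:n" value per the
# capture-range-end rules documented above.
def capture_range_end(profile_steps: list, discrete: bool, num_subtasks: int = 6) -> str:
    n = len(profile_steps) * (num_subtasks if discrete else 1)
    return f"repeat-shutdown:{n}"

capture_range_end([1, 2, 5], discrete=False)  # -> "repeat-shutdown:3"
capture_range_end([1, 2, 5], discrete=True)   # -> "repeat-shutdown:18"
```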

View File

@@ -23,11 +23,4 @@ megatron:
override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
-profile:
-use_profile: False
-profile_ranks: null
-step_start: -1
-step_end: -1
-save_path: null
load_weight: True

View File

@@ -19,3 +19,28 @@ log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,fa
# the max token length per GPU
# same as actor_rollout_ref.actor.ppo_max_token_len_per_gpu if it exists, otherwise 16384
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+# profile the ref model in `compute_log_prob`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the ref model
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+# Whether to profile all ranks.
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -65,17 +65,27 @@ sandbox_fusion:
# Max memory limit for each sandbox process in MB
memory_limit_mb: 1024
-# profiler configs
+# profile the reward model in `compute_reward`
profiler:
-# hint for the target config dataclass
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the reward model
+enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -225,3 +225,28 @@ skip_rollout: False
# Specifies the filesystem path where rollout data should be cached when skip_rollout is enabled.
# Note: Giving path under /tmp/ray/session* is not recommended as these are temporary Ray cluster directories.
skip_dump_dir: /tmp/rollout_dump
+# profile the rollout model in `generate_sequence`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the rollout model
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+# Whether to profile all ranks.
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -64,13 +64,16 @@ def run_ppo(config) -> None:
# Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
if (
is_cuda_available
-and config.trainer.get("profile_steps") is not None
-and len(config.trainer.get("profile_steps", [])) > 0
+and config.global_profiler.tool == "nsys"
+and config.global_profiler.get("steps") is not None
+and len(config.global_profiler.get("steps", [])) > 0
):
from verl.utils.import_utils import is_nvtx_available
assert is_nvtx_available(), "nvtx is not available in CUDA platform. Please 'pip3 install nvtx'"
-nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+nsight_options = OmegaConf.to_container(
+config.global_profiler.global_tool_config.nsys.controller_nsight_options
+)
runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
else:
runner = TaskRunner.remote()
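The new gate above enables Nsight controller profiling only when CUDA is available, `global_profiler.tool` is `nsys`, and `global_profiler.steps` is a non-empty list. A standalone sketch of that predicate (plain dicts stand in for the OmegaConf config; the function name is hypothetical):

```python
# Mirrors the run_ppo condition above, using plain dicts instead of OmegaConf.
def should_profile_controller(config: dict, cuda_available: bool) -> bool:
    gp = config.get("global_profiler", {})
    steps = gp.get("steps")
    return cuda_available and gp.get("tool") == "nsys" and steps is not None and len(steps) > 0

should_profile_controller({"global_profiler": {"tool": "nsys", "steps": [1, 2]}}, True)  # -> True
should_profile_controller({"global_profiler": {"tool": "nsys", "steps": None}}, True)    # -> False
should_profile_controller({"global_profiler": {"tool": "torch", "steps": [1]}}, True)    # -> False
```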

View File

```diff
@@ -795,7 +795,6 @@ class RayPPOTrainer:
                 cls=self.role_worker_mapping[Role.ActorRollout],
                 config=self.config.actor_rollout_ref,
                 role="actor_rollout",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["actor_rollout"] = actor_rollout_cls
         else:
@@ -815,7 +814,6 @@ class RayPPOTrainer:
                 self.role_worker_mapping[Role.RefPolicy],
                 config=self.config.actor_rollout_ref,
                 role="ref",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@@ -835,13 +833,13 @@ class RayPPOTrainer:
         wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
         if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                 "worker_nsight_options must be set when profile_steps is set"
             )
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.trainer, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
             )
         wg_kwargs["device_name"] = self.device_name
@@ -1083,8 +1081,8 @@ class RayPPOTrainer:
                 prev_step_profile = False
                 curr_step_profile = (
-                    self.global_steps in self.config.trainer.profile_steps
-                    if self.config.trainer.profile_steps is not None
+                    self.global_steps in self.config.global_profiler.steps
+                    if self.config.global_profiler.steps is not None
                     else False
                 )
                 next_step_profile = False
@@ -1097,7 +1095,7 @@ class RayPPOTrainer:
                 with marked_timer("start_profile", timing_raw):
                     self._start_profiling(
                         not prev_step_profile and curr_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )
@@ -1341,13 +1339,13 @@ class RayPPOTrainer:
                 with marked_timer("stop_profile", timing_raw):
                     next_step_profile = (
-                        self.global_steps + 1 in self.config.trainer.profile_steps
-                        if self.config.trainer.profile_steps is not None
+                        self.global_steps + 1 in self.config.global_profiler.steps
+                        if self.config.global_profiler.steps is not None
                         else False
                     )
                     self._stop_profiling(
                         curr_step_profile and not next_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )
                     prev_step_profile = curr_step_profile
```
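The start/stop decisions above depend only on whether the previous, current, and next steps are profiled. A simplified recomputation of that logic (the trainer tracks `prev_step_profile` statefully; this sketch re-derives it from membership, and the function name is hypothetical):

```python
def profiling_actions(profile_steps, continuous, global_step):
    """Return (start, stop) for `global_step`: with profile_continuous_steps,
    adjacent profiled steps share one session; otherwise each profiled step
    starts and stops its own."""
    def in_steps(step):
        return profile_steps is not None and step in profile_steps

    prev_p, curr_p, next_p = in_steps(global_step - 1), in_steps(global_step), in_steps(global_step + 1)
    if continuous:
        return (not prev_p and curr_p, curr_p and not next_p)
    return (curr_p, curr_p)
```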


```diff
@@ -12,14 +12,74 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import warnings
 from dataclasses import dataclass, field
+from typing import Any, Optional
+
+from omegaconf import MISSING

 from verl.base_config import BaseConfig


+@dataclass
+class NsightToolConfig(BaseConfig):
+    """Nsight tool config."""
+
+    "True for each task has its own database, False for all tasks in one training step share one database."
+    discrete: bool = False
+
+    def __post_init__(self) -> None:
+        pass
+
+
+@dataclass
+class TorchProfilerToolConfig(BaseConfig):
+    """Torch profiler tool config.
+
+    Args:
+        step_start (int): Start step in update_policy.
+        step_end (int): End step.
+    """
+
+    step_start: int = -1
+    step_end: int = -1
+
+    def __post_init__(self) -> None:
+        """config validation logics go here"""
+        warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
+        assert isinstance(self.step_start, int), f"Profiler step_start must be of type int, got {type(self.step_start)}"
+
+
+@dataclass
+class NPUToolConfig(NsightToolConfig):
+    """NPU profiler tool config."""
+
+    # options: npu, cpu, memory, shapes, module, stack
+    contents: list[str] = field(default_factory=list)
+    # Collection level, optional values: level_none, level0, level1, level2.
+    level: str = "level1"
+    # Whether to automatically parse the data.
+    analysis: bool = False
+
+    def __post_init__(self) -> None:
+        """config validation logics go here"""
+        assert isinstance(self.contents, list), f"Profiler contents must be of type list, got {type(self.contents)}"
+        assert isinstance(self.level, str), f"Profiler level must be of type str, got {type(self.level)}"
+        assert isinstance(self.analysis, bool), f"Profiler analysis must be of type bool, got {type(self.analysis)}"
+        for content in self.contents:
+            assert content in ["npu", "cpu", "memory", "shapes", "module", "stack"], (
+                f"Profiler contents only supports npu, cpu, memory, shapes, module, stack, but gets {content}"
+            )
+        assert self.level in ["level_none", "level0", "level1", "level2"], (
+            f"Profiler level only supports level0, 1, 2, and level_none, but gets {self.level}"
+        )
+
+
 @dataclass
 class ProfilerConfig(BaseConfig):
-    """Worker profiler config. Currently only support Nsight system profiler.
+    """Worker profiler config.

     The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.
@@ -30,22 +90,33 @@ class ProfilerConfig(BaseConfig):
         ranks (list[int]): The ranks that will be profiled. Defaults to [].
     """

-    discrete: bool = False
+    tool: Optional[str] = MISSING
+    enable: bool = False
     all_ranks: bool = False
     ranks: list[int] = field(default_factory=list)
+    save_path: Optional[str] = MISSING
+    tool_config: Any = MISSING  # Just a placeholder, will use configs above directly

     def union(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, f"Cannot union ProfilerConfig with different tools: {self.tool} vs {other.tool}"
         return ProfilerConfig(
+            tool=self.tool,
+            enable=self.enable or other.enable,
             all_ranks=self.all_ranks or other.all_ranks,
             ranks=list(set(self.ranks or []) | set(other.ranks or [])),
-            discrete=self.discrete or other.discrete,
+            tool_config=self.tool_config,
         )

     def intersect(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, (
+            f"Cannot intersect ProfilerConfig with different tools: {self.tool} vs {other.tool}"
+        )
         return ProfilerConfig(
+            tool=self.tool,
+            enable=self.enable and other.enable,
             all_ranks=self.all_ranks and other.all_ranks,
             ranks=list(set(self.ranks or []) & set(other.ranks or [])),
-            discrete=self.discrete and other.discrete,
+            tool_config=self.tool_config,
        )

     def __post_init__(self) -> None:
```
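The new union/intersect semantics can be illustrated with a stripped-down stand-in for `ProfilerConfig` (a sketch using a plain dataclass, not the verl class itself):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniProfilerConfig:
    # Stand-in mirroring ProfilerConfig's merge semantics from the diff above
    tool: Optional[str] = None
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

    def union(self, other):
        assert self.tool == other.tool  # merging configs for different tools is rejected
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable or other.enable,
            all_ranks=self.all_ranks or other.all_ranks,
            ranks=sorted(set(self.ranks) | set(other.ranks)),
        )

    def intersect(self, other):
        assert self.tool == other.tool
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable and other.enable,
            all_ranks=self.all_ranks and other.all_ranks,
            ranks=sorted(set(self.ranks) & set(other.ranks)),
        )
```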


```diff
@@ -20,9 +20,9 @@ from contextlib import contextmanager
 from typing import Any, Callable, Optional

 import torch_npu
-from omegaconf import DictConfig
 from torch_npu.npu import mstx

+from .config import NPUToolConfig
 from .profile import DistProfiler, ProfilerConfig
@@ -86,7 +86,14 @@ def marked_timer(name: str, timing_raw: dict[str, float], *args: Any, **kwargs:
         mark_end_range(mark_range)


-def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_step: Optional[str] = None):
+def get_npu_profiler(
+    contents: list[str],
+    profile_level: str,
+    profile_save_path: str,
+    analysis: bool,
+    role: Optional[str] = None,
+    profile_step: Optional[str] = None,
+):
     """Generate and return an NPU profiler object.

     Args:
@@ -97,18 +104,7 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
         profile_step(str, optional):
             The current training step. Defaults to None.
     """
-    if option.level == "level_none":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level_none
-    elif option.level == "level0":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level0
-    elif option.level == "level1":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level1
-    elif option.level == "level2":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level2
-    else:
-        raise ValueError(f"level only supports level0, 1, 2, and level_none, but gets {option.level}")
-
-    profile_save_path = option.save_path
     if profile_step:
         profile_save_path = os.path.join(profile_save_path, profile_step)
     if role:
@@ -123,18 +119,18 @@
     )

     activites = []
-    if option.with_npu:
+    if contents is None or "npu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.NPU)
-    if option.with_cpu:
+    if contents is None or "cpu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.CPU)

     prof = torch_npu.profiler.profile(
-        with_modules=option.with_module,
-        with_stack=option.with_stack,
-        record_shapes=option.record_shapes,
-        profile_memory=option.with_memory,
+        with_modules=contents is None or "module" in contents,
+        with_stack=contents is None or "stack" in contents,
+        record_shapes=contents is None or "shapes" in contents,
+        profile_memory=contents is None or "memory" in contents,
         activities=activites,
-        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=option.analysis),
+        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=analysis),
         experimental_config=experimental_config,
     )
     return prof
@@ -147,7 +143,7 @@ class NPUProfiler(DistProfiler):

     _define_count = 0

-    def __init__(self, rank: int, config: ProfilerConfig, **kwargs):
+    def __init__(self, rank: int, config: ProfilerConfig, tool_config: NPUToolConfig, **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -155,12 +151,20 @@ class NPUProfiler(DistProfiler):
             config (Optional[ProfilerConfig]): Configuration for the profiler. If None, a default configuration is used.
         """
         if not config:
-            config = ProfilerConfig(ranks=[])
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be set when profiler is enabled"
+        self.enable: bool = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         self.profile_npu = None
-        self.profile_option = kwargs.get("option", None)
+        self.profile_contents = tool_config.contents
+        self.profile_level = tool_config.level
+        self.profile_save_path = config.save_path
+        self.analysis = tool_config.analysis
         if config.all_ranks:
             self.this_rank = True
         elif config.ranks:
@@ -169,15 +173,22 @@ class NPUProfiler(DistProfiler):
     def start(self, **kwargs):
         role, profile_step = kwargs.get("role", None), kwargs.get("profile_step", None)
         profile_step = str(profile_step) if profile_step is not None else None
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = True
             if not self.discrete and NPUProfiler._define_count == 0:
-                self.profile_npu = get_npu_profiler(option=self.profile_option, role=role, profile_step=profile_step)
+                self.profile_npu = get_npu_profiler(
+                    contents=self.profile_contents,
+                    profile_level=self.profile_level,
+                    profile_save_path=self.profile_save_path,
+                    analysis=self.analysis,
+                    role=role,
+                    profile_step=profile_step,
+                )
                 self.profile_npu.start()
                 NPUProfiler._define_count += 1

     def stop(self):
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = False
             if not self.discrete and NPUProfiler._define_count == 1:
                 self.profile_npu.step()
@@ -201,26 +212,23 @@ class NPUProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
                 profile_name = message or func.__name__
-                profile_this_role = True
                 discrete_mode = self.profiler.discrete
-                profile_enable = self.profiler.this_step and self.profile_option is not None
+                profile_enable = self.profiler.this_step and self.profiler.enable

                 if not profile_enable:
                     return func(self, *args, **kwargs)

-                if profile_enable and role is not None:
-                    target_roles = self.profile_option.get("roles", [])
-                    profile_this_role = "all" in target_roles or role in target_roles
-
                 if profile_enable:
                     if not discrete_mode:
                         mark_range = mark_start_range(message=profile_name)
                     else:
-                        if profile_this_role:
-                            profile_npu = get_npu_profiler(option=self.profile_option, role=role)
-                            profile_npu.start()
-                            mark_range = mark_start_range(message=profile_name)
+                        profile_npu = get_npu_profiler(option=self.profile_option, role=role)
+                        profile_npu.start()
+                        mark_range = mark_start_range(message=profile_name)

                 result = func(self, *args, **kwargs)
@@ -228,10 +236,9 @@ class NPUProfiler(DistProfiler):
                 if not discrete_mode:
                     mark_end_range(mark_range)
                 else:
-                    if profile_this_role:
-                        mark_end_range(mark_range)
-                        profile_npu.step()
-                        profile_npu.stop()
+                    mark_end_range(mark_range)
+                    profile_npu.step()
+                    profile_npu.stop()

                 return result
```
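With the `contents` list replacing the old per-flag options, the flag derivation in `get_npu_profiler` is a pure function. A sketch of that mapping, free of the `torch_npu` dependency (function name hypothetical; activities are represented as strings instead of `ProfilerActivity` members):

```python
def npu_profile_kwargs(contents):
    """Derive torch_npu.profiler.profile keyword flags from a `contents` list,
    mirroring get_npu_profiler above: contents=None enables everything, while an
    explicit list enables only the named items."""
    def pick(key):
        return contents is None or key in contents

    return {
        "with_modules": pick("module"),
        "with_stack": pick("stack"),
        "record_shapes": pick("shapes"),
        "profile_memory": pick("memory"),
        "activities": [a for a in ("npu", "cpu") if pick(a)],
    }
```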


```diff
@@ -20,6 +20,7 @@ from typing import Callable, Optional
 import nvtx
 import torch

+from .config import NsightToolConfig
 from .profile import DistProfiler, ProfilerConfig
@@ -113,7 +114,7 @@ class NsightSystemsProfiler(DistProfiler):
     """Nsight system profiler. Installed in a worker to control the Nsight system profiler."""

-    def __init__(self, rank: int, config: Optional[ProfilerConfig], **kwargs):
+    def __init__(self, rank: int, config: Optional[ProfilerConfig], tool_config: Optional[NsightToolConfig], **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -123,8 +124,13 @@ class NsightSystemsProfiler(DistProfiler):
         # If no configuration is provided, create a default ProfilerConfig with an empty list of ranks
         if not config:
             config = ProfilerConfig(ranks=[])
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         if config.all_ranks:
             self.this_rank = True
@@ -170,6 +176,9 @@ class NsightSystemsProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
                 profile_name = message or func.__name__
                 if self.profiler.this_step:
```


```diff
@@ -17,9 +17,8 @@ from typing import Callable, Optional
 import torch
 import torch.distributed
-from omegaconf import DictConfig, OmegaConf

-from .config import ProfilerConfig
+from .config import ProfilerConfig, TorchProfilerToolConfig


 class Profiler:
@@ -39,18 +38,23 @@ class Profiler:
         config: Configuration object containing profiling parameters
     """

-    def __init__(self, config):
+    def __init__(self, config: ProfilerConfig, tool_config: Optional[TorchProfilerToolConfig] = None):
         # note : if we do not set use_profile, it will be set as None, so that all function will be skip
-        if not isinstance(config, DictConfig):
-            config = OmegaConf.create(config)
+        if not config:
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.config = config
-        self.skip_prof = False
+        self.tool_config = tool_config
         self.saved = False
         self.prof = None
         self.rank = torch.distributed.get_rank()
         # we need to validate the config before using the profiler
         self._validate()
-        if config.use_profile and self.rank in self.config.profile_ranks:
+        if self.rank in self.config.profile_ranks:
             print(f"[Profiler] Profiler init for rank {self.rank}")

             self.prof = torch.profiler.profile(
@@ -59,9 +63,9 @@ class Profiler:
                     torch.profiler.ProfilerActivity.CUDA,
                 ],
                 schedule=torch.profiler.schedule(
-                    wait=max(self.config.step_start - 1, 0),
-                    warmup=1 if self.config.step_start > 0 else 0,
-                    active=self.config.step_end - self.config.step_start,
+                    wait=max(self.tool_config.step_start - 1, 0),
+                    warmup=1 if self.tool_config.step_start > 0 else 0,
+                    active=self.tool_config.step_end - self.tool_config.step_start,
                     repeat=1,
                 ),
                 record_shapes=True,
@@ -73,9 +77,9 @@ class Profiler:
         if self.config.profile_ranks is None:
             print("[WARNING] Profile ranks is not set, default to rank 0")
             self.config.profile_ranks = [0]
-        assert self.config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-        assert self.config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-        assert self.config.step_start < self.config.step_end, (
+        assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
+        assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
+        assert self.tool_config.step_start < self.tool_config.step_end, (
             "[ERROR] Profile step start must be less than step end"
         )
```
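The schedule arithmetic above maps `step_start`/`step_end` from `TorchProfilerToolConfig` onto torch's wait/warmup/active phases; a sketch of the computation as a plain dict (no torch dependency, function name hypothetical):

```python
def torch_profiler_schedule(step_start: int, step_end: int) -> dict:
    """Mirror the torch.profiler.schedule arguments built in Profiler.__init__:
    wait until just before step_start, warm up for one step when possible, then
    stay active through step_end, for a single cycle."""
    assert 0 <= step_start < step_end, "step_start must be >= 0 and < step_end"
    return {
        "wait": max(step_start - 1, 0),
        "warmup": 1 if step_start > 0 else 0,
        "active": step_end - step_start,
        "repeat": 1,
    }
```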


```diff
@@ -122,7 +122,7 @@ class MegatronPPOActor(BasePPOActor):
         self.tf_config = tf_config
         self.actor_module = actor_module
         self.actor_optimizer: DistributedOptimizer = actor_optimizer
-        self.prof = Profiler(self.config.profile)
+        self.prof = Profiler(self.config.profiler)
         self.use_fused_kernels = self.config.get("use_fused_kernels", False)
         if self.use_fused_kernels:
             from verl.models.mcore.model_forward_fused import patch_fused_forward
@@ -600,7 +600,8 @@ class MegatronPPOActor(BasePPOActor):
         """
         metrics = {}
-        self.prof.start()
+        if self.prof.enable:
+            self.prof.start()
         for data in dataloader:
             data.to(get_device_id())
             self.actor_optimizer.zero_grad()
@@ -639,9 +640,11 @@ class MegatronPPOActor(BasePPOActor):
                     pass
                 else:
                     raise NotImplementedError
-            self.prof.step()
+            if self.prof.enable:
+                self.prof.step()
         # add empty cache after each compute
-        self.prof.stop_and_save()
-        self.prof.stop_trace()
+        if self.prof.enable:
+            self.prof.stop_and_save()
+            self.prof.stop_trace()
         get_torch_device().empty_cache()
         return metrics
```


```diff
@@ -19,6 +19,7 @@ from omegaconf import MISSING
 from verl.base_config import BaseConfig
 from verl.trainer.config import CheckpointConfig
+from verl.utils.profiler.config import ProfilerConfig

 from .engine import FSDPEngineConfig, McoreEngineConfig
 from .optimizer import OptimizerConfig
@@ -109,6 +110,7 @@ class ActorConfig(BaseConfig):
     checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
     optim: OptimizerConfig = field(default_factory=OptimizerConfig)
     use_fused_kernels: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate actor configuration parameters."""
@@ -218,6 +220,7 @@ class FSDPActorConfig(ActorConfig):
     entropy_checkpointing: bool = False
     fsdp_config: FSDPEngineConfig = field(default_factory=FSDPEngineConfig)
     use_remove_padding: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate FSDP actor configuration parameters."""
```


```diff
@@ -72,7 +72,7 @@ from verl.utils.fsdp_utils import (
 )
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import compute_position_id_with_mask
-from verl.utils.profiler import DistProfiler, DistProfilerExtension, log_gpu_memory_usage, simple_timer
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig, log_gpu_memory_usage, simple_timer
 from verl.utils.profiler.performance import reduce_timing
 from verl.utils.py_functional import convert_to_regular_types
 from verl.workers.config import FSDPCriticConfig, FSDPEngineConfig
@@ -116,7 +116,6 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         Worker.__init__(self)
         self.config = config
-        self.profile_option = kwargs.get("profile_option", None)
         import torch.distributed

         if not torch.distributed.is_initialized():
@@ -170,9 +169,30 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         # We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
         # as they provides DictConfig-like interface
         # The benefit of creating the dataclass config is to perform validation during __post_init__
-        profiler_config = omega_conf_to_dataclass(config.get("profiler"))
+        if self._is_actor:
+            omega_profiler_config = config.actor.get("profiler", {})
+        elif self._is_rollout:
+            # NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
+            # This is for extendability in AsyncRL cases
+            omega_profiler_config = config.rollout.get("profiler", {})
+        elif self._is_ref:
+            omega_profiler_config = config.ref.get("profiler", {})
+        else:
+            raise ValueError(
+                f"Invalid role {self.role}, should be one of "
+                "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
+            )
+        # omega_profiler_config is DictConfig
+        # profiler_config is a ProfilerConfig dataclass
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=profiler_config, option=self.profile_option)
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )

         self._is_offload_param = False
@@ -938,7 +958,17 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config: FSDPCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         import torch.distributed

         self.config = config
@@ -1336,8 +1366,18 @@ class RewardModelWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         import torch.distributed
```
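Each worker repeats the same tool-selection pattern. Its essence, sketched on plain dicts (the real code operates on DictConfig objects and converts the result through `omega_conf_to_dataclass`; the function name is hypothetical):

```python
def select_tool_config(profiler_cfg: dict):
    """Pick the sub-config matching profiler_cfg["tool"], as the workers do;
    returns None when no supported tool is selected."""
    tool = profiler_cfg.get("tool", None)
    if tool in ["npu", "nsys", "torch"]:
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None
```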


@ -55,6 +55,7 @@ from verl.utils.profiler import (
DistProfiler, DistProfiler,
DistProfilerExtension, DistProfilerExtension,
GPUMemoryLogger, GPUMemoryLogger,
ProfilerConfig,
log_gpu_memory_usage, log_gpu_memory_usage,
simple_timer, simple_timer,
) )
@ -213,8 +214,31 @@ class ActorRolloutRefWorker(MegatronWorker, DistProfilerExtension):
self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"] self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"]
self._is_ref = self.role in ["ref", "actor_rollout_ref"] self._is_ref = self.role in ["ref", "actor_rollout_ref"]
profiler_config = omega_conf_to_dataclass(config.get("profiler")) if self._is_actor:
DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config)) omega_profiler_config = config.actor.get("profiler", {})
elif self._is_rollout:
# NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
# This is for extendability in AsyncRL cases
omega_profiler_config = config.rollout.get("profiler", {})
elif self._is_ref:
omega_profiler_config = config.ref.get("profiler", {})
else:
raise ValueError(
f"Invalid role {self.role}, should be one of "
"['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
)
# omega_profiler_config is DictConfig
# profiler_config is a ProfilerConfig dataclass
profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
tool_config = omega_conf_to_dataclass(
omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
)
else:
tool_config = None
DistProfilerExtension.__init__(
self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
)
# TODO(sgm): Currently, we only support reference model param offload # TODO(sgm): Currently, we only support reference model param offload
# will support other offload later # will support other offload later
```diff
@@ -804,7 +828,18 @@ class AsyncActorRolloutRefWorker(ActorRolloutRefWorker):
 class CriticWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config: McoreCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self.config: McoreCriticConfig = config
         # NOTE(sgm): We utilize colocate WorkerGroup by default.
```
```diff
@@ -1072,8 +1107,19 @@ class RewardModelWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
-        profiler_config = omega_conf_to_dataclass(config.get("profiler", {}), dataclass_type=ProfilerConfig)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         self.config = config
```


```diff
@@ -30,7 +30,7 @@ from verl.utils.device import (
     get_device_id,
     get_nccl_backend,
 )
-from verl.utils.profiler import DistProfiler, DistProfilerExtension
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig
 from verl.utils.py_functional import append_to_dict
 from verl.utils.torch_functional import masked_mean
 from verl.workers.engine import EngineRegistry
```
```diff
@@ -42,8 +42,16 @@ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )
         import torch.distributed
```
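The tool-config resolution repeated across all of these workers follows one pattern: resolve the profiler section into a `ProfilerConfig`, then only look up a `tool_config` entry for the known tools (`npu`, `nsys`, `torch`). A minimal sketch of that pattern, assuming plain dataclasses and dicts in place of verl's `omega_conf_to_dataclass` and OmegaConf `DictConfig` (the `NsightToolConfig` fields shown are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NsightToolConfig:
    # Stand-in for verl.utils.profiler.config.NsightToolConfig
    discrete: bool = False

def resolve_tool_config(omega_profiler_config: dict) -> Optional[object]:
    """Mirror the if/else from the diff: only npu/nsys/torch get a
    per-tool config; any other (or missing) tool resolves to None."""
    tool = omega_profiler_config.get("tool", None)
    if tool in ["npu", "nsys", "torch"]:
        # In verl this value is passed through omega_conf_to_dataclass;
        # a plain dict lookup stands in for that step here.
        return omega_profiler_config.get("tool_config", {}).get(tool)
    return None

cfg = {"tool": "nsys", "tool_config": {"nsys": NsightToolConfig(discrete=True)}}
assert resolve_tool_config(cfg).discrete is True
assert resolve_tool_config({"tool": None}) is None
```

Because the same lookup appears verbatim in four workers, a follow-up refactor could hoist it into a shared helper, but this PR keeps it inline per worker.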