mirror of
https://github.com/volcengine/verl.git
synced 2025-10-20 13:43:50 +08:00
[BREAKING] [perf] refactor: Profiler api refactor (#2894)
### What does this PR do?

Refactor profiler CI to a unified way.

TODO:

- nsys use `save_path`
- nsys discrete tests are disabled
- torch profiler

cc: @davidmlw

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```

Local profiler config:

```yaml
profiler:
  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig
  # profiler tool, default same as profiler.tool in global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}
  # whether enable profile on critic
  enable: False
  # Whether to profile all ranks.
  all_ranks: False
  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []
  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}
  # specific tool config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
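The `_target_:` keys in the configs above name the dataclass that verl's `verl.utils.omega_conf_to_dataclass` instantiates from the config. As a rough illustration of that pattern only (the resolver and the `NPUToolConfig` dataclass below are simplified stand-ins, not verl's code — a real resolver imports the dotted module path instead of using a registry):

```python
from dataclasses import dataclass, field


@dataclass
class NPUToolConfig:
    # Simplified stand-in for verl.utils.profiler.config.NPUToolConfig.
    discrete: bool = False
    contents: list = field(default_factory=list)
    level: str = "level1"
    analysis: bool = True


# A real resolver imports the dotted module path in `_target_`; this sketch
# looks the target up in a local registry instead.
REGISTRY = {"verl.utils.profiler.config.NPUToolConfig": NPUToolConfig}


def instantiate(cfg: dict):
    """Build the object named by '_target_', passing the remaining keys as kwargs."""
    cfg = dict(cfg)
    cls = REGISTRY[cfg.pop("_target_")]
    return cls(**cfg)


npu = instantiate({
    "_target_": "verl.utils.profiler.config.NPUToolConfig",
    "discrete": False,
    "contents": ["npu", "cpu"],
    "level": "level1",
    "analysis": True,
})
```

Unspecified fields fall back to the dataclass defaults, which is why the YAML above only needs to list the keys it overrides.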
.gitignore (vendored): 1 change
@@ -59,6 +59,7 @@ coverage.xml
 *,cover
 .hypothesis/
 pytest.ini
+output.txt

 # Translations
 *.mo
@@ -8,107 +8,87 @@ Last updated: 07/24/2025.
Configuration
-------------

Reuse the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps, and control parameters such as the collection level through the configuration items in verl/trainer/config/npu_profile/npu_profile.yaml.
Use two levels of profile settings to control data collection:

- Global collection control: use the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps.
- Per-role profile control: use the configuration items in each role to control parameters such as the ranks.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

Control the collection steps and mode through the parameters in ppo_trainer.yaml:

- trainer.profile_steps: can be set to a list of the steps to collect, for example [2, 4] collects the second and fourth steps; if set to null, nothing is collected.
- actor_rollout_ref.profiler: controls the ranks and mode of collection.
- profiler: controls the ranks and mode of collection.

  - all_ranks: when set to True, all ranks are collected.
  - ranks: when all_ranks is not True, the ranks to collect are given as a list, for example [0, 1].
  - discrete: controls the collection mode. When set to False, end-to-end data is collected; when set to True, data is collected per training phase in discrete mode.
  - tool: the collection tool to use; the options are nsys, npu, torch, and torch_memory.
  - steps: can be set to a list of the steps to collect, for example [2, 4] collects steps 2 and 4. If set to null, nothing is collected.
  - save_path: the path where the collected data is saved. Defaults to "outputs/profile".

Control the specific collection behavior through the parameters in npu_profile.yaml:
Control the specific collection behavior through the parameters in ``profiler.tool_config.npu``:

- save_path: storage path for the collected data.
- roles: the roles to collect; the options are listed below.
- level: collection level; the options are level_none, level0, level1 and level2.

  - rollout_generate: collect the generate_sequences phase of rollout.
  - actor_compute_log_prob: collect the compute_log_prob phase of the actor.
  - actor_update: collect the update_actor phase of the actor.
  - ref_compute_log_prob: collect the compute_ref_log_prob phase of the ref.
  - all: collect all of the above phases.
  - level_none: disable all level-based data collection (turns off profiler_level).
  - level0: collect high-level application data, low-level NPU data, and operator execution details on the NPU.
  - level1: extend level0 with CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: extend level1 with CANN-layer Runtime data and AI CPU metrics.

- level: collection level; the options are level_none, level0, level1 and level2.
- contents: a list of options controlling the collection content, for example npu, cpu, memory, shapes, module, stack.

  - level_none: do not collect any data governed by the level setting, i.e. turn off profiler_level.
  - level0: collect high-level application data, low-level NPU data, and information on the operators executed on the NPU.
  - level1: extend level0 with CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: extend level1 with CANN-layer Runtime data and AI CPU metrics.
  - npu: whether to collect device-side performance data.
  - cpu: whether to collect host-side performance data.
  - memory: whether to enable memory analysis.
  - shapes: whether to record tensor shapes.
  - module: whether to record framework-level Python call stack information.
  - stack: whether to record operator call stack information.

- record_shapes: whether to record tensor shapes.
- with_memory: whether to enable memory analysis.
- with_npu: whether to collect device-side performance data.
- with_cpu: whether to collect host-side performance data.
- with_module: whether to record framework-level Python call stack information.
- with_stack: whether to record operator call stack information.
- analysis: whether to parse the data automatically.
- analysis: enables automatic data parsing.

Per-role profile control
~~~~~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: whether to enable profiling for this role.
- all_ranks: whether to collect data from all ranks.
- ranks: a list of the ranks to collect data from. If empty, no data is collected.
- tool_config: configuration of the profiling tool used by this role.

Examples
--------

Disabling collection
~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: null # disable profile
   profiler:
       steps: null # disable profile

End-to-end collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   actor_rollout_ref:
       profiler:
           steps: [1, 2, 5]
           discrete: False
   actor_rollout_ref:
       actor:
           profile:
               enable: True
               all_ranks: True
               # rollout & ref follow actor settings

Discrete mode collection
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   actor_rollout_ref:
       profiler:
           discrete: True
           all_ranks: False
           ranks: [0, 1]

Discrete mode collection for the actor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   npu_profile:
       options:
           roles: ["actor_compute_log_prob", "actor_update"]
   actor_rollout_ref:
       profiler:
           discrete: True
           all_ranks: False
           ranks: [0, 1]

Visualization
@@ -9,10 +9,10 @@ based on FSDP on Ascend devices.
Configuration
-------------

Reuse the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps; you can also manage collection behaviors such as the collection level via verl/trainer/config/npu_profile/npu_profile.yaml.
Leverage two levels of configuration to control data collection:

1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -20,31 +20,17 @@ Global collection control
Use parameters in ppo_trainer.yaml to control the collection mode and steps.

- trainer.profile_steps: This parameter can be set as a list that has collection steps.
- profiler: Control the ranks and mode of profiling

  - tool: The profiling tool to use; options are nsys, npu, torch, torch_memory.
  - steps: This parameter can be set as a list that has collection steps, such as [2, 4], which means it will collect steps 2 and 4. If set to null, no collection occurs.
  - save_path: The path to save the collected data. Default is "outputs/profile".

- actor_rollout_ref.profiler: Control the ranks and mode of profiling

  - all_ranks: Collects data from all ranks when set to true.
  - ranks: This parameter specifies which ranks to collect (e.g., [0, 1]) when all_ranks is False.
  - discrete: Controls the collection mode. If False, end-to-end data is collected; if True, data is collected in discrete phases during training.

Use parameters in npu_profile.yaml to control collection behavior:

- save_path: Storage path for collected data.
- roles: Roles to collect. The following options are available:

  - rollout_generate: Collect the `generate_sequences` phase of the rollout worker.
  - actor_compute_log_prob: Collect the `compute_log_prob` phase of the actor worker.
  - actor_update: Collect the `update_actor` phase of the actor worker.
  - ref_compute_log_prob: Collect the `compute_ref_log_prob` phase of the ref worker.
  - all: Collect all of the above phases.

Use parameters in ``profiler.tool_config.npu`` to control npu profiler behavior:

- level: Collection level; options are level_none, level0, level1, and level2.

@@ -58,15 +44,31 @@ Use parameters in npu_profile.yaml to control collection behavior:

  - level2: Extends level1 by adding CANN-layer Runtime data and AI CPU metrics.

- record_shapes: Whether to record tensor shapes.
- with_memory: Whether to enable memory analysis.
- with_npu: Whether to collect device-side performance data.
- with_cpu: Whether to collect host-side performance data.
- with_module: Whether to record framework-layer Python call stack information.
- contents: A list of options to control the collection content, such as npu, cpu, memory, shapes, module, stack.

  - npu: Whether to collect device-side performance data.
  - cpu: Whether to collect host-side performance data.
  - memory: Whether to enable memory analysis.
  - shapes: Whether to record tensor shapes.
  - module: Whether to record framework-layer Python call stack information.
  - stack: Whether to record operator call stack information.

- with_stack: Whether to record operator call stack information.
- analysis: Enables automatic data parsing.

Role collection control
~~~~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: Whether to enable profiling for this role.
- all_ranks: Whether to collect data from all ranks.
- ranks: A list of ranks to collect data from. If empty, no data is collected.
- tool_config: Configuration for the profiling tool used by this role.
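The interaction of `enable`, `all_ranks`, and `ranks` in the role-level config can be sketched as a small predicate (a hypothetical helper for illustration, not verl's API):

```python
def should_profile(rank, enable=False, all_ranks=False, ranks=()):
    """Decide whether a worker of the given rank is profiled, following the
    role-level rules: profiling must be enabled; then either every rank is
    collected (all_ranks), or only the ranks listed in `ranks`."""
    if not enable:
        return False
    if all_ranks:
        return True
    return rank in ranks


# With enable=True, all_ranks=False, ranks=[0, 1]: only ranks 0 and 1 profile.
profiled = [r for r in range(4) if should_profile(r, enable=True, ranks=[0, 1])]
```

Note that with an empty `ranks` list and `all_ranks: False`, no rank is profiled, matching the "If empty, no data is collected" rule above.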
Examples
--------

@@ -75,19 +77,21 @@ Disabling collection

.. code:: yaml

   trainer:
       profile_steps: null # disable profile
   profiler:
       steps: null # disable profile

End-to-End collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   actor_rollout_ref:
       profiler:
           steps: [1, 2, 5]
           discrete: False
   actor_rollout_ref:
       actor:
           profiler:
               enable: True
               all_ranks: True

@@ -96,30 +100,8 @@ Discrete Mode Collection

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   actor_rollout_ref:
       profiler:
           discrete: True
           all_ranks: False
           ranks: [0, 1]

Enable actor collection in discrete mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   trainer:
       profile_steps: [1, 2, 5]
   npu_profile:
       options:
           roles: ["actor_compute_log_prob", "actor_update"]
   actor_rollout_ref:
       profiler:
           discrete: True
           all_ranks: False
           ranks: [0, 1]

Visualization
@@ -16,31 +16,29 @@ Nsight Systems version is important, please reference `docker/Dockerfile.vllm.sg

verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed on any node in the cluster, a message is printed in the logging to indicate the controller process node hostname and process id.

In `trainer`, three new config entries control the profiler behaviors:
In `profiler`, three new config entries control the profiler behaviors:

* **`trainer.profile_steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5, while ``null`` means no profiling.
* **`profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5, while ``null`` means no profiling.

* **`trainer.profile_continuous_steps`**. If true, and `profiler.discrete==False`, then the continuous steps in `profile_steps` will be combined into one database. For example, steps 1 and 2 above go into one database, and step 5 into another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.
* **`profiler.profile_continuous_steps`**. If true, and `profiler.discrete==False`, then the continuous steps in `profiler.steps` will be combined into one database. For example, steps 1 and 2 above go into one database, and step 5 into another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.

* **`controller_nsight_options`**. This config group is for the single controller. All fields in this config group will be sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
Nsys options in controller nodes and worker nodes are configured in `trainer`:

* **`worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group will be sent to Nsight Systems when Ray starts the worker processes. The capture range is used to control when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can change `capture-range-end` with some accurate calculation, or just leave it `null`.
* **`trainer.controller_nsight_options`**. This config group is for the single controller. All fields in this config group will be sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
* **`trainer.worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group will be sent to Nsight Systems when Ray starts the worker processes. The capture range is used to control when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can change `capture-range-end` with some accurate calculation, or just leave it `null`.
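The `profile_continuous_steps` behavior described above — consecutive step numbers sharing one database — can be illustrated with a small grouping function (illustrative sketch only, not verl's implementation):

```python
def group_continuous(steps):
    """Group step numbers into runs of consecutive steps. With
    profile_continuous_steps=True (and discrete=False), each run would
    share one profiling database; otherwise each step gets its own."""
    runs = []
    for s in sorted(steps):
        if runs and s == runs[-1][-1] + 1:
            runs[-1].append(s)  # extend the current consecutive run
        else:
            runs.append([s])    # start a new run (and a new database)
    return runs


# Steps [1, 2, 5] yield two databases: one covering steps 1-2, one for step 5.
databases = group_continuous([1, 2, 5])
```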
### Worker process profiling

Verl manages multiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_, _Reward_, which are implemented in different Worker classes. These workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:

* **`all_ranks` and `ranks`**. When `all_ranks` is set to `True`, all ranks will be profiled; when set to `False`, the ranks listed in `ranks` will be profiled. By default, verl profiles the whole training process in a series of `worker_process_<PID>.<RID>.nsys-rep` files, one per process rank. PID is the process ID; RID is the capture range ID.

* **`discrete`**. When set to `False`, all the roles' actions in one training step are dumped into one database. When set to `True`, the actions annotated by `DistProfiler.annotate` are dumped into discrete databases. In this case, each role's action occupies one `<RID>`.

* **`actor_rollout_ref`**. This Worker can be configured to contain at most 3 roles that execute together, so `actor_rollout_ref` has a `profiler` config that all the inside roles inherit.

* **Verl collocate mode**. Verl can combine two Worker subclasses into one Worker Actor. In this case, the user should take care that the combined Workers have consistent `discrete` settings. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database either way.

### Where to find the profiling data

By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` at each node. According to the Ray manual, this default directory is not changeable; ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).

Some users may find this inconvenient, but it is understandable: Ray may start hundreds of processes, and saving the files in one central place would put significant pressure on a network file system.
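Given the `worker_process_<PID>.<RID>.nsys-rep` naming above, collected reports can be grouped per worker process with a few lines (a sketch; the regex assumes numeric PID and RID, and the helper name is hypothetical):

```python
import re
from collections import defaultdict

# Matches verl's report names, e.g. "worker_process_1234.0.nsys-rep".
PATTERN = re.compile(r"worker_process_(?P<pid>\d+)\.(?P<rid>\d+)\.nsys-rep$")


def group_reports(filenames):
    """Map each worker PID to its sorted capture-range IDs (RIDs);
    non-matching filenames are ignored."""
    by_pid = defaultdict(list)
    for name in filenames:
        m = PATTERN.search(name)
        if m:
            by_pid[int(m.group("pid"))].append(int(m.group("rid")))
    return {pid: sorted(rids) for pid, rids in by_pid.items()}


reports = group_reports([
    "worker_process_1234.0.nsys-rep",
    "worker_process_1234.1.nsys-rep",
    "worker_process_5678.0.nsys-rep",
    "notes.txt",
])
```

With `discrete: True`, a PID mapping to several RIDs corresponds to the separate per-role databases described above.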
@@ -49,51 +47,40 @@ Some users may think it is not convenient, but it is understandable that Ray may

To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:

### Disable profiler

```yaml
trainer:
    profile_steps: null # disable profile
profiler:
    steps: null # disable profile
```

### Enable profiler and one database for one training step

```yaml
trainer:
    profile_steps: [1, 2, 5]
profiler:
    steps: [1, 2, 5]
    discrete: False
actor_rollout_ref:
    profiler:
        discrete: False
        all_ranks: False
        ranks: [0, 1]
    actor:
        profile:
            enable: True
            all_ranks: True
            # rollout & ref follow actor settings
critic:
    profiler:
        discrete: False
        all_ranks: False
        ranks: [0, 1]
    profile:
        enable: True
        all_ranks: True
reward_model:
    profiler:
        discrete: False
        all_ranks: False
        ranks: [0, 1]
    profile:
        enable: True
        all_ranks: True
```

### Enable profiler and multiple databases for one training step

```yaml
trainer:
    profile_steps: [1, 2, 5]
actor_rollout_ref:
    profiler:
        steps: [1, 2, 5]
        discrete: True
        all_ranks: False
        ranks: [0, 1]
critic:
    profiler:
        discrete: True
        all_ranks: False
        ranks: [0, 1]
reward_model:
    profiler:
        discrete: True
        all_ranks: False
        ranks: [0, 1]
```

## Profiling Output
@@ -275,27 +275,6 @@ For the critic, you can include these parameters.

   critic.megatron.grad_offload=True \
   critic.megatron.optimizer_offload=True \

Profiler
^^^^^^^^

The profiler is a tool that helps you understand the performance of your model. It can be used to profile the time spent on different operations and identify the bottlenecks. You can get more information from `torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.

In verl, the profiler currently only supports the actor role in Megatron. You can set the begin step and end step to profile; note that one step means one gradient update. The profiling result is saved under save_path. If you only want to profile specific ranks, set profile_ranks; by default it is [0].

.. code:: python

   actor_rollout_ref.actor.profile.use_profile=True \
   actor_rollout_ref.actor.profile.profile_ranks=[0] \
   actor_rollout_ref.actor.profile.step_start=0 \
   actor_rollout_ref.actor.profile.step_end=1 \
   actor_rollout_ref.actor.profile.save_path="./profile"
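The step window described above (profile from `step_start` on `profile_ranks` only, where one step is one gradient update) amounts to a gate like the following. This is a hypothetical helper for illustration, assuming a half-open `[step_start, step_end)` window; verl's real implementation wraps `torch.profiler` around the training loop:

```python
def in_profile_window(step, rank, step_start=0, step_end=1, profile_ranks=(0,)):
    """True when this gradient-update step on this rank should be profiled:
    the step lies in the half-open window [step_start, step_end) and the
    rank is one of the selected profile_ranks."""
    return step_start <= step < step_end and rank in profile_ranks


# step_start=0, step_end=1, profile_ranks=[0]: only step 0 on rank 0 profiles,
# matching the defaults shown in the config snippet above.
active = [(s, r) for s in range(3) for r in range(2)
          if in_profile_window(s, r, 0, 1, [0])]
```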
Related MCore Document
----------------------
@@ -9,14 +9,8 @@ PROFILE_RANKS="[1,2]"
# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
WITH_MEMORY=False
RECORD_SHAPES=False
WITH_NPU=True
WITH_CPU=True
WITH_MODULE=False
WITH_STACK=False
CONTENTS=['npu','cpu']
ANALYSIS=True
ROLES=["all"]

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
@@ -28,20 +22,20 @@ python3 -m verl.trainer.main_ppo \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.profiler.discrete=$DISCRETE \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
@@ -51,16 +45,6 @@ python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.npu_profile.options.save_path=$SAVE_PATH \
    trainer.npu_profile.options.level=$LEVEL \
    trainer.npu_profile.options.with_memory=$WITH_MEMORY \
    trainer.npu_profile.options.record_shapes=$RECORD_SHAPES \
    trainer.npu_profile.options.with_npu=$WITH_NPU \
    trainer.npu_profile.options.with_cpu=$WITH_CPU \
    trainer.npu_profile.options.with_module=$WITH_MODULE \
    trainer.npu_profile.options.with_stack=$WITH_STACK \
    trainer.npu_profile.options.analysis=$ANALYSIS \
    trainer.npu_profile.options.roles=$ROLES \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -70,5 +54,12 @@ python3 -m verl.trainer.main_ppo \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.profile_steps=$PROFILE_STEPS \
    trainer.device=npu $@
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
@@ -8,12 +8,7 @@ DISCRETE=False
# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
WITH_MEMORY=False
RECORD_SHAPES=False
WITH_NPU=True
WITH_CPU=True
WITH_MODULE=False
WITH_STACK=False
CONTENTS=['npu','cpu']
ANALYSIS=True

python3 -m verl.trainer.main_ppo \
@@ -28,15 +23,16 @@ python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.profiler.discrete=$DISCRETE \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
@@ -48,15 +44,6 @@ python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.npu_profile.options.save_path=$SAVE_PATH \
    trainer.npu_profile.options.level=$LEVEL \
    trainer.npu_profile.options.with_memory=$WITH_MEMORY \
    trainer.npu_profile.options.record_shapes=$RECORD_SHAPES \
    trainer.npu_profile.options.with_npu=$WITH_NPU \
    trainer.npu_profile.options.with_cpu=$WITH_CPU \
    trainer.npu_profile.options.with_module=$WITH_MODULE \
    trainer.npu_profile.options.with_stack=$WITH_STACK \
    trainer.npu_profile.options.analysis=$ANALYSIS \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -66,5 +53,12 @@ python3 -m verl.trainer.main_ppo \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.profile_steps=$PROFILE_STEPS \
    trainer.device=npu $@
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
@@ -13,9 +13,9 @@ train_files=${train_files:-"$gsm8k_train_path"}
test_files=${test_files:-"$gsm8k_test_path"}

# Nsight profiling configuration
PROFILE_STEPS="[1,2,5]" # or [] or null
PROFILE_STEPS="[1]" # or [] or null
PROFILE_RANKS_ALL=False # or True
PROFILE_RANKS=[0,4,8,12]
PROFILE_RANKS=[0,4]
DISCRETE=True # or False

python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
@@ -34,30 +34,32 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
    actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.profiler.discrete=$DISCRETE \
    critic.optim.lr=1e-5 \
    critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
    critic.ppo_micro_batch_size_per_gpu=4 \
    critic.profiler.enable=True \
    critic.profiler.ranks=$PROFILE_RANKS \
    critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
    critic.profiler.discrete=$DISCRETE \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_ppo_gsm8k_math_examples' \
    trainer.experiment_name='deepseek_llm_7b_megatron' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=-1 \
    trainer.total_epochs=100 \
    trainer.total_training_steps=6 \
    trainer.profile_steps=$PROFILE_STEPS $@
    trainer.total_training_steps=1 \
    profiler.tool=nsys \
    profiler.steps=$PROFILE_STEPS \
    profiler.tool_config.nsys.discrete=$DISCRETE $@
@ -10,8 +10,8 @@ test_files=${test_files:-"$gsm8k_test_path"}
@ -10,8 +10,8 @@ test_files=${test_files:-"$gsm8k_test_path"}

PROFILE_STEPS="[1,2,5]" # or [] or null
PROFILE_RANKS_ALL=False # or True
PROFILE_RANKS=[0,4,8,12]
DISCRETE=False # or True
PROFILE_RANKS=[0,4]
DISCRETE=True # or False

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
@ -30,17 +30,17 @@ python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.ppo_mini_batch_size=512 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=12000 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=24000 \
    actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.profiler.discrete=$DISCRETE \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2-7B-Instruct \
@ -50,9 +50,9 @@ python3 -m verl.trainer.main_ppo \
    critic.ppo_max_token_len_per_gpu=98304 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    critic.profiler.enable=True \
    critic.profiler.ranks=$PROFILE_RANKS \
    critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
    critic.profiler.discrete=$DISCRETE \
    reward_model.enable=True \
    reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1 \
    reward_model.model.use_remove_padding=True \
@ -60,9 +60,9 @@ python3 -m verl.trainer.main_ppo \
    reward_model.micro_batch_size_per_gpu=32 \
    reward_model.use_dynamic_bsz=True \
    reward_model.forward_max_token_len_per_gpu=98304 \
    reward_model.profiler.enable=True \
    reward_model.profiler.ranks=$PROFILE_RANKS \
    reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
    reward_model.profiler.discrete=$DISCRETE \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
@ -70,10 +70,12 @@ python3 -m verl.trainer.main_ppo \
    trainer.experiment_name='qwen2-7b_hybrid_rm_bsz8k_p4k_r4k_seq_packing' \
    trainer.n_gpus_per_node=8 \
    trainer.val_before_train=False \
    trainer.nnodes=2 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=-1 \
    trainer.total_epochs=15 \
    trainer.total_training_steps=6 \
    trainer.profile_continuous_steps=True \
    trainer.profile_steps=$PROFILE_STEPS $@
    profiler.profile_continuous_steps=True \
    profiler.tool=nsys \
    profiler.steps=$PROFILE_STEPS \
    profiler.tool_config.nsys.discrete=$DISCRETE $@
@ -97,8 +97,8 @@ class RayDAPOTrainer(RayPPOTrainer):

        prev_step_profile = False
        curr_step_profile = (
            self.global_steps in self.config.trainer.profile_steps
            if self.config.trainer.profile_steps is not None
            self.global_steps in self.config.global_profiler.steps
            if self.config.global_profiler.steps is not None
            else False
        )
        next_step_profile = False

@ -114,7 +114,7 @@ class RayDAPOTrainer(RayPPOTrainer):
                with marked_timer("start_profile", timing_raw):
                    self._start_profiling(
                        not prev_step_profile and curr_step_profile
                        if self.config.trainer.profile_continuous_steps
                        if self.config.global_profiler.profile_continuous_steps
                        else curr_step_profile
                    )

@ -350,13 +350,13 @@ class RayDAPOTrainer(RayPPOTrainer):

                with marked_timer("stop_profile", timing_raw):
                    next_step_profile = (
                        self.global_steps + 1 in self.config.trainer.profile_steps
                        if self.config.trainer.profile_steps is not None
                        self.global_steps + 1 in self.config.global_profiler.steps
                        if self.config.global_profiler.steps is not None
                        else False
                    )
                    self._stop_profiling(
                        curr_step_profile and not next_step_profile
                        if self.config.trainer.profile_continuous_steps
                        if self.config.global_profiler.profile_continuous_steps
                        else curr_step_profile
                    )
                    prev_step_profile = curr_step_profile
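The prev/curr/next gating in the trainer loop above can be isolated into a small pure function. This is an illustrative sketch (the name `profile_boundaries` is invented here, not verl API), assuming `profile_continuous_steps` means a single profiler session should span a run of consecutive profiled steps:

```python
def profile_boundaries(step: int, profile_steps, continuous: bool):
    """Return (start, stop) decisions for `step`.

    In continuous mode, start only on the first step of a run of
    consecutive profiled steps and stop only after the last one;
    otherwise start and stop on every profiled step.
    """
    steps = profile_steps or []
    prev_p = (step - 1) in steps
    curr_p = step in steps
    next_p = (step + 1) in steps
    if not continuous:
        return curr_p, curr_p
    return (not prev_p and curr_p), (curr_p and not next_p)
```

With `steps=[1, 2, 5]` in continuous mode, step 1 starts a session, step 2 ends it, and step 5 starts and stops its own.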
@ -45,10 +45,13 @@ def run_ppo(config) -> None:

    if (
        is_cuda_available
        and OmegaConf.select(config.trainer, "profile_steps") is not None
        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
        and config.global_profiler.tool == "nsys"
        and OmegaConf.select(config.global_profiler, "steps") is not None
        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
    ):
        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
        nsight_options = OmegaConf.to_container(
            config.global_profiler.global_tool_config.nsys.controller_nsight_options
        )
        runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
    else:
        runner = TaskRunner.remote()
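The launch-time condition above (nsys selected, and at least one profile step configured) reduces to a small predicate. Sketched here over a plain dict rather than an OmegaConf node, with a hypothetical function name:

```python
def wants_nsight_runner(global_profiler: dict) -> bool:
    """True when the controller should launch under Nsight Systems:
    the profiling tool is nsys and at least one step is configured."""
    steps = global_profiler.get("steps")
    return (
        global_profiler.get("tool") == "nsys"
        and steps is not None
        and len(steps) > 0
    )
```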
@ -38,6 +38,7 @@ from verl.utils.fsdp_utils import (
)
from verl.utils.import_utils import import_external_libs
from verl.utils.model import get_generation_config, update_model_config
from verl.utils.profiler import ProfilerConfig
from verl.workers.fsdp_workers import ActorRolloutRefWorker as ARRWorker
from verl.workers.fsdp_workers import CriticWorker

@ -131,8 +132,17 @@ class RolloutWorker(ActorRolloutRefWorker):
        # We can still use ProfilerConfig for testing purposes (tests/utils/test_nvtx_profile.py)
        # as it provides a DictConfig-like interface
        # The benefit of creating the dataclass config is to perform validation during __post_init__
        profiler_config = omega_conf_to_dataclass(config.rollout.get("profiler", {}))
        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
        omega_profiler_config = config.get("profiler", {})
        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
            tool_config = omega_conf_to_dataclass(
                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
            )
        else:
            tool_config = None
        DistProfilerExtension.__init__(
            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
        )
        self._is_rollout = True
        self._is_actor = False
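The worker-side selection above (pick the sub-config matching the configured tool, else fall back to `None`) can be sketched with plain dicts; `select_tool_config` is an invented name for illustration:

```python
def select_tool_config(profiler_cfg: dict):
    """Return the tool-specific sub-config for the configured tool,
    or None when no supported tool is selected or no sub-config exists."""
    tool = profiler_cfg.get("tool")
    if tool in ("npu", "nsys", "torch"):
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None
```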
@ -51,10 +51,11 @@ def run_ppo(config) -> None:
    # Create a remote instance of the TaskRunner class, and
    # execute the `run` method of the TaskRunner instance remotely and wait for it to complete
    if (
        OmegaConf.select(config.trainer, "profile_steps") is not None
        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
        config.global_profiler.tool == "nsys"
        and OmegaConf.select(config.global_profiler, "steps") is not None
        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
    ):
        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
        nsight_options = OmegaConf.to_container(config.global_profiler.tool_config.nsys.controller_nsight_options)
        runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
    else:
        runner = TaskRunner.remote()
@ -213,7 +213,6 @@ class OneStepOffRayTrainer(RayPPOTrainer):
            self.role_worker_mapping[Role.RefPolicy],
            config=self.config.actor_rollout_ref,
            role="ref",
            profile_option=self.config.trainer.npu_profile.options,
        )
        self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls

@ -233,13 +232,13 @@ class OneStepOffRayTrainer(RayPPOTrainer):
        wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
        if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
            wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                "worker_nsight_options must be set when profile_steps is set"
            )
            wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
                OmegaConf.select(self.config.trainer, "worker_nsight_options")
                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
            )

        for resource_pool, class_dict in self.resource_pool_to_cls.items():

@ -391,8 +390,8 @@ class OneStepOffRayTrainer(RayPPOTrainer):

        while batch_data_future is not None:
            do_profile = (
                self.global_steps in self.config.trainer.profile_steps
                if self.config.trainer.profile_steps is not None
                self.global_steps in self.config.global_profiler.steps
                if self.config.global_profiler.steps is not None
                else False
            )
            if do_profile:
@ -37,6 +37,14 @@ class TestConfigComparison(unittest.TestCase):
            "activations_checkpoint_method",
            "activations_checkpoint_granularity",
            "activations_checkpoint_num_layers",
            "discrete",
            "profiler",
            "profile",
            "use_profile",
            "npu_profile",
            "profile_steps",
            "worker_nsight_options",
            "controller_nsight_options",
        ]

    def _compare_configs_recursively(
@ -79,7 +79,7 @@ class TestPrintCfgCommand(unittest.TestCase):

        # Run the command
        result = subprocess.run(
            ["python3", "scripts/print_cfg.py", "critic.profiler.discrete=True", "+critic.profiler.extra.any_key=val"],
            ["python3", "scripts/print_cfg.py", "+critic.profiler.extra.any_key=val"],
            capture_output=True,
            text=True,
        )

@ -90,7 +90,6 @@ class TestPrintCfgCommand(unittest.TestCase):
        # Verify the output contains expected config information
        self.assertIn("critic", result.stdout)
        self.assertIn("profiler", result.stdout)
        self.assertIn("discrete=True", result.stdout)
        self.assertIn("extra={'any_key': 'val'}", result.stdout)
@ -17,7 +17,7 @@ import unittest
from unittest.mock import MagicMock, patch

from verl.utils import omega_conf_to_dataclass
from verl.utils.profiler import ProfilerConfig
from verl.utils.profiler.config import NsightToolConfig, ProfilerConfig
from verl.utils.profiler.nvtx_profile import NsightSystemsProfiler
@ -29,26 +29,25 @@ class TestProfilerConfig(unittest.TestCase):

        with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
            cfg = compose(config_name="ppo_trainer")
            arr = cfg.actor_rollout_ref
            for config in [
                cfg.actor_rollout_ref.actor.profiler,
                cfg.actor_rollout_ref.rollout.profiler,
                cfg.actor_rollout_ref.ref.profiler,
                cfg.critic.profiler,
                arr.profiler,
                cfg.reward_model.profiler,
            ]:
                profiler_config = omega_conf_to_dataclass(config)
                self.assertEqual(profiler_config.discrete, config.discrete)
                self.assertEqual(profiler_config.tool, config.tool)
                self.assertEqual(profiler_config.enable, config.enable)
                self.assertEqual(profiler_config.all_ranks, config.all_ranks)
                self.assertEqual(profiler_config.ranks, config.ranks)
                self.assertEqual(profiler_config.save_path, config.save_path)
                self.assertEqual(profiler_config.ranks, config.ranks)
                assert isinstance(profiler_config, ProfilerConfig)
                with self.assertRaises(AttributeError):
                    _ = profiler_config.non_existing_key
                assert config.get("non_existing_key") == profiler_config.get("non_existing_key")
                assert config.get("non_existing_key", 1) == profiler_config.get("non_existing_key", 1)
                assert config["discrete"] == profiler_config["discrete"]
                from dataclasses import FrozenInstanceError

                with self.assertRaises(FrozenInstanceError):
                    profiler_config.discrete = False

    def test_frozen_config(self):
        """Test that modifying frozen keys in ProfilerConfig raises exceptions."""

@ -57,11 +56,7 @@ class TestProfilerConfig(unittest.TestCase):
        from verl.utils.profiler.config import ProfilerConfig

        # Create a new ProfilerConfig instance
        config = ProfilerConfig(discrete=True, all_ranks=False, ranks=[0], extra={"key": "value"})

        # Test direct attribute assignment
        with self.assertRaises(FrozenInstanceError):
            config.discrete = False
        config = ProfilerConfig(all_ranks=False, ranks=[0], extra={"key": "value"})

        with self.assertRaises(FrozenInstanceError):
            config.all_ranks = True

@ -69,10 +64,6 @@ class TestProfilerConfig(unittest.TestCase):
        with self.assertRaises(FrozenInstanceError):
            config.ranks = [1, 2, 3]

        # Test dictionary-style assignment
        with self.assertRaises(TypeError):
            config["discrete"] = False

        with self.assertRaises(TypeError):
            config["all_ranks"] = True
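The immutability exercised by these tests comes from Python's frozen dataclasses: attribute assignment on a `@dataclass(frozen=True)` instance raises `FrozenInstanceError`. A minimal stand-alone illustration (`MiniProfilerConfig` is invented for this example, not a verl class):

```python
from dataclasses import dataclass, field, FrozenInstanceError

@dataclass(frozen=True)
class MiniProfilerConfig:
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

cfg = MiniProfilerConfig(all_ranks=True)
try:
    cfg.enable = True  # frozen dataclasses reject attribute assignment
    blocked = False
except FrozenInstanceError:
    blocked = True
```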
@ -90,20 +81,19 @@ class TestNsightSystemsProfiler(unittest.TestCase):
    Test Plan:
    1. Initialization: Verify profiler state after creation
    2. Basic Profiling: Test start/stop functionality
    3. Discrete Mode: Test discrete profiling behavior
    3. Discrete Mode: TODO: Test discrete profiling behavior
    4. Annotation: Test the annotate decorator in both normal and discrete modes
    5. Config Validation: Verify proper config initialization from OmegaConf
    """

    def setUp(self):
        self.config = ProfilerConfig(all_ranks=True)
        self.config = ProfilerConfig(enable=True, all_ranks=True)
        self.rank = 0
        self.profiler = NsightSystemsProfiler(self.rank, self.config)
        self.profiler = NsightSystemsProfiler(self.rank, self.config, tool_config=NsightToolConfig(discrete=False))

    def test_initialization(self):
        self.assertEqual(self.profiler.this_rank, True)
        self.assertEqual(self.profiler.this_step, False)
        self.assertEqual(self.profiler.discrete, False)

    def test_start_stop_profiling(self):
        with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:

@ -117,18 +107,18 @@ class TestNsightSystemsProfiler(unittest.TestCase):
            self.assertFalse(self.profiler.this_step)
            mock_stop.assert_called_once()

    def test_discrete_profiling(self):
        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
        profiler = NsightSystemsProfiler(self.rank, discrete_config)
    # def test_discrete_profiling(self):
    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)

        with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
            profiler.start()
            self.assertTrue(profiler.this_step)
            mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode
    #     with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
    #         profiler.start()
    #         self.assertTrue(profiler.this_step)
    #         mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode

            profiler.stop()
            self.assertFalse(profiler.this_step)
            mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
    #     profiler.stop()
    #     self.assertFalse(profiler.this_step)
    #     mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
    def test_annotate_decorator(self):
        mock_self = MagicMock()

@ -152,29 +142,29 @@ class TestNsightSystemsProfiler(unittest.TestCase):
            mock_start.assert_not_called()  # Not discrete mode
            mock_stop.assert_not_called()  # Not discrete mode

    def test_annotate_discrete_mode(self):
        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
        profiler = NsightSystemsProfiler(self.rank, discrete_config)
        mock_self = MagicMock()
        mock_self.profiler = profiler
        mock_self.profiler.this_step = True
    # def test_annotate_discrete_mode(self):
    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)
    #     mock_self = MagicMock()
    #     mock_self.profiler = profiler
    #     mock_self.profiler.this_step = True

        @NsightSystemsProfiler.annotate(message="test")
        def test_func(self, *args, **kwargs):
            return "result"
    #     @NsightSystemsProfiler.annotate(message="test")
    #     def test_func(self, *args, **kwargs):
    #         return "result"

        with (
            patch("torch.cuda.profiler.start") as mock_start,
            patch("torch.cuda.profiler.stop") as mock_stop,
            patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
            patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
        ):
            result = test_func(mock_self)
            self.assertEqual(result, "result")
            mock_start_range.assert_called_once()
            mock_end_range.assert_called_once()
            mock_start.assert_called_once()  # Should start in discrete mode
            mock_stop.assert_called_once()  # Should stop in discrete mode
    #     with (
    #         patch("torch.cuda.profiler.start") as mock_start,
    #         patch("torch.cuda.profiler.stop") as mock_stop,
    #         patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
    #         patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
    #     ):
    #         result = test_func(mock_self)
    #         self.assertEqual(result, "result")
    #         mock_start_range.assert_called_once()
    #         mock_end_range.assert_called_once()
    #         mock_start.assert_called_once()  # Should start in discrete mode
    #         mock_stop.assert_called_once()  # Should stop in discrete mode


if __name__ == "__main__":
@ -184,29 +184,26 @@ class TestCriticConfig:
        optim = OptimizerConfig(lr=0.1)
        critic_config = CriticConfig(ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim)
        assert isinstance(critic_config.profiler, ProfilerConfig)
        assert critic_config.profiler.discrete is False
        assert critic_config.profiler.all_ranks is False
        assert critic_config.profiler.ranks == []

        custom_profiler = ProfilerConfig(discrete=True, all_ranks=True, ranks=[0, 1])
        custom_profiler = ProfilerConfig(all_ranks=True, ranks=[0, 1])
        critic_config_custom = CriticConfig(
            profiler=custom_profiler, ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim
        )
        assert isinstance(critic_config_custom.profiler, ProfilerConfig)
        assert critic_config_custom.profiler.discrete is True
        assert critic_config_custom.profiler.all_ranks is True
        assert critic_config_custom.profiler.ranks == [0, 1]

        profiler1 = ProfilerConfig(discrete=True, ranks=[0, 1])
        profiler1 = ProfilerConfig(enable=True, ranks=[0, 1])
        profiler2 = ProfilerConfig(all_ranks=True, ranks=[1, 2])

        union_result = profiler1.union(profiler2)
        assert union_result.discrete is True
        assert union_result.enable is True
        assert union_result.all_ranks is True
        assert set(union_result.ranks) == {0, 1, 2}

        intersect_result = profiler1.intersect(profiler2)
        assert intersect_result.discrete is False
        assert intersect_result.all_ranks is False
        assert intersect_result.ranks == [1]
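The union/intersect semantics asserted above (flags OR-ed / AND-ed, rank lists merged / intersected) can be sketched on plain `(enable, all_ranks, ranks)` triples; these helper names are illustrative, not verl's `ProfilerConfig` methods:

```python
def union_ranks(a, b):
    """Union of two (enable, all_ranks, ranks) triples:
    boolean flags are OR-ed, rank lists merged."""
    return (a[0] or b[0], a[1] or b[1], sorted(set(a[2]) | set(b[2])))

def intersect_ranks(a, b):
    """Intersection: boolean flags are AND-ed, rank lists intersected."""
    return (a[0] and b[0], a[1] and b[1], sorted(set(a[2]) & set(b[2])))
```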
@ -59,6 +59,25 @@ actor_rollout_ref:
      use_checkpoint_opt_param_scheduler: false
      override_optimizer_config: {}
    use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: false
      all_ranks: false
      ranks: []
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config:
        nsys:
          _target_: verl.utils.profiler.config.NsightToolConfig
          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
        npu:
          _target_: verl.utils.profiler.config.NPUToolConfig
          contents: []
          level: level1
          analysis: true
        torch:
          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
          step_start: 0
          step_end: null
    data_loader_seed: null
    load_weight: true
    megatron:

@ -85,12 +104,6 @@ actor_rollout_ref:
      recompute_method: null
      recompute_num_layers: null
      use_mbridge: false
    profile:
      use_profile: false
      profile_ranks: null
      step_start: -1
      step_end: -1
      save_path: null
  ref:
    strategy: megatron
    use_torch_compile: ${oc.select:actor_rollout_ref.actor.use_torch_compile,true}
@ -98,6 +111,14 @@ actor_rollout_ref:
    log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
    log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
    megatron:
      _target_: verl.workers.config.MegatronEngineConfig
      param_offload: false

@ -114,12 +135,6 @@ actor_rollout_ref:
      seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
      override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
      use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
    profile:
      use_profile: false
      profile_ranks: null
      step_start: -1
      step_end: -1
      save_path: null
    load_weight: true
  rollout:
    name: ???
@ -184,6 +199,14 @@ actor_rollout_ref:
    token2text: false
    skip_rollout: false
    skip_dump_dir: /tmp/rollout_dump
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
    enable_chunked_prefill: false
    load_format: dummy_megatron
    layer_name_map:

@ -201,63 +224,6 @@ actor_rollout_ref:
      freeze_moe_router: false
      use_fused_kernels: false
      trust_remote_code: false
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      discrete: false
      all_ranks: false
      ranks: []
trainer:
  npu_profile:
    options:
      save_path: ./profiler_data
      roles:
        - all
      level: level1
      with_memory: false
      record_shapes: false
      with_npu: true
      with_cpu: true
      with_module: false
      with_stack: false
      analysis: true
  balance_batch: true
  total_epochs: 30
  total_training_steps: null
  profile_steps: null
  profile_continuous_steps: false
  project_name: verl_examples
  experiment_name: gsm8k
  logger:
    - console
    - wandb
  log_val_generations: 0
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  esi_redundant_time: 0
  resume_mode: auto
  resume_from_path: null
  del_local_ckpt_after_load: false
  val_before_train: true
  test_freq: -1
  critic_warmup: 0
  default_hdfs_dir: null
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  max_actor_ckpt_to_keep: null
  max_critic_ckpt_to_keep: null
  ray_wait_register_center_timeout: 300
  device: cuda
  controller_nsight_options:
    trace: cuda,nvtx,cublas,ucx
    cuda-memory-usage: 'true'
    cuda-graph-trace: graph
  worker_nsight_options:
    trace: cuda,nvtx,cublas,ucx
    cuda-memory-usage: 'true'
    cuda-graph-trace: graph
    capture-range: cudaProfilerApi
    capture-range-end: null
    kill: none
data:
  tokenizer: null
  use_shm: false
@ -344,9 +310,12 @@ critic:
    async_save: false
  profiler:
    _target_: verl.utils.profiler.ProfilerConfig
    discrete: false
    tool: ${oc.select:global_profiler.tool,null}
    enable: false
    all_ranks: false
    ranks: []
    save_path: ${oc.select:global_profiler.save_path,null}
    tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
  nccl_timeout: 600
  megatron:
    _target_: verl.workers.config.McoreEngineConfig
@ -390,9 +359,12 @@ reward_model:
    memory_limit_mb: 1024
  profiler:
    _target_: verl.utils.profiler.ProfilerConfig
    discrete: false
    tool: ${oc.select:global_profiler.tool,null}
    enable: false
    all_ranks: false
    ranks: []
    save_path: ${oc.select:global_profiler.save_path,null}
    tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
  nccl_timeout: 600
  megatron:
    _target_: verl.workers.config.MegatronEngineConfig
@ -432,6 +404,52 @@ algorithm:
  pf_ppo:
    reweight_method: pow
    weight_pow: 2.0
trainer:
  balance_batch: true
  total_epochs: 30
  total_training_steps: null
  project_name: verl_examples
  experiment_name: gsm8k
  logger:
    - console
    - wandb
  log_val_generations: 0
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  esi_redundant_time: 0
  resume_mode: auto
  resume_from_path: null
  del_local_ckpt_after_load: false
  val_before_train: true
  test_freq: -1
  critic_warmup: 0
  default_hdfs_dir: null
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  max_actor_ckpt_to_keep: null
  max_critic_ckpt_to_keep: null
  ray_wait_register_center_timeout: 300
  device: cuda
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  global_tool_config:
    nsys:
      discrete: false
      controller_nsight_options:
        trace: cuda,nvtx,cublas,ucx
        cuda-memory-usage: 'true'
        cuda-graph-trace: graph
      worker_nsight_options:
        trace: cuda,nvtx,cublas,ucx
        cuda-memory-usage: 'true'
        cuda-graph-trace: graph
        capture-range: cudaProfilerApi
        capture-range-end: null
        kill: none
ray_init:
  num_cpus: null
  timeline_json_file: null
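Given the `global_profiler` defaults above, enabling a profiling tool for a run reduces to overriding a few keys under that one section; an illustrative override fragment (the step list and discrete setting are arbitrary example values):

```yaml
# Enable Nsight Systems for steps 1, 2, and 5, keeping one session
# per run of consecutive profiled steps:
global_profiler:
  tool: nsys
  steps: [1, 2, 5]
  profile_continuous_steps: true
  global_tool_config:
    nsys:
      discrete: false
```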
@ -51,6 +51,25 @@ actor_rollout_ref:
      num_cycles: 0.5
      warmup_style: constant
    use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: false
      all_ranks: false
      ranks: []
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config:
        nsys:
          _target_: verl.utils.profiler.config.NsightToolConfig
          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
        npu:
          _target_: verl.utils.profiler.config.NPUToolConfig
          contents: []
          level: level1
          analysis: true
        torch:
          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
          step_start: 0
          step_end: null
    grad_clip: 1.0
    ulysses_sequence_parallel_size: 1
    entropy_from_logits_with_chunking: false

@ -73,6 +92,14 @@ actor_rollout_ref:
    log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
    log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
    model: null
    fsdp_config:
      _target_: verl.workers.config.FSDPEngineConfig
@ -147,6 +174,14 @@ actor_rollout_ref:
    token2text: false
    skip_rollout: false
    skip_dump_dir: /tmp/rollout_dump
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      tool: ${oc.select:global_profiler.tool,null}
      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
      save_path: ${oc.select:global_profiler.save_path,null}
      tool_config: ${oc.select:actor_rollout_ref.actor.profiler.tool_config,null}
    enable_chunked_prefill: true
    load_format: dummy_dtensor
    layered_summon: false
@ -170,67 +205,6 @@ actor_rollout_ref:
      fused_kernel_options:
        impl_backend: torch
      trust_remote_code: false
    profiler:
      _target_: verl.utils.profiler.ProfilerConfig
      discrete: false
      all_ranks: false
      ranks: []
trainer:
  npu_profile:
    options:
      save_path: ./profiler_data
      roles:
        - all
      level: level1
      with_memory: false
      record_shapes: false
      with_npu: true
      with_cpu: true
      with_module: false
      with_stack: false
      analysis: true
  balance_batch: true
  total_epochs: 30
  total_training_steps: null
  profile_steps: null
  profile_continuous_steps: false
  controller_nsight_options:
    trace: cuda,nvtx,cublas,ucx
    cuda-memory-usage: 'true'
    cuda-graph-trace: graph
  worker_nsight_options:
    trace: cuda,nvtx,cublas,ucx
    cuda-memory-usage: 'true'
    cuda-graph-trace: graph
    capture-range: cudaProfilerApi
    capture-range-end: null
    kill: none
  project_name: verl_examples
  experiment_name: gsm8k
  logger:
    - console
    - wandb
  log_val_generations: 0
  rollout_data_dir: null
  validation_data_dir: null
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  esi_redundant_time: 0
  resume_mode: auto
  resume_from_path: null
  val_before_train: true
  val_only: false
  test_freq: -1
  critic_warmup: 0
  default_hdfs_dir: null
  del_local_ckpt_after_load: false
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
  max_actor_ckpt_to_keep: null
  max_critic_ckpt_to_keep: null
  ray_wait_register_center_timeout: 300
  device: cuda
  use_legacy_worker_impl: auto
data:
  tokenizer: null
  use_shm: false
@ -322,9 +296,12 @@ critic:
|
||||
async_save: false
|
||||
profiler:
|
||||
_target_: verl.utils.profiler.ProfilerConfig
|
||||
discrete: false
|
||||
tool: ${oc.select:global_profiler.tool,null}
|
||||
enable: false
|
||||
all_ranks: false
|
||||
ranks: []
|
||||
save_path: ${oc.select:global_profiler.save_path,null}
|
||||
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
|
||||
forward_micro_batch_size: ${oc.select:.ppo_micro_batch_size,null}
|
||||
forward_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size_per_gpu,null}
|
||||
ulysses_sequence_parallel_size: 1
|
||||
@ -361,9 +338,12 @@ reward_model:
|
||||
memory_limit_mb: 1024
|
||||
profiler:
|
||||
_target_: verl.utils.profiler.ProfilerConfig
|
||||
discrete: false
|
||||
tool: ${oc.select:global_profiler.tool,null}
|
||||
enable: false
|
||||
all_ranks: false
|
||||
ranks: []
|
||||
save_path: ${oc.select:global_profiler.save_path,null}
|
||||
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
|
||||
ulysses_sequence_parallel_size: 1
|
||||
custom_reward_function:
|
||||
path: null
|
||||
@ -386,6 +366,57 @@ algorithm:
|
||||
pf_ppo:
|
||||
reweight_method: pow
|
||||
weight_pow: 2.0
|
||||
trainer:
|
||||
balance_batch: true
|
||||
total_epochs: 30
|
||||
total_training_steps: null
|
||||
project_name: verl_examples
|
||||
experiment_name: gsm8k
|
||||
logger:
|
||||
- console
|
||||
- wandb
|
||||
log_val_generations: 0
|
||||
rollout_data_dir: null
|
||||
validation_data_dir: null
|
||||
nnodes: 1
|
||||
n_gpus_per_node: 8
|
||||
save_freq: -1
|
||||
esi_redundant_time: 0
|
||||
resume_mode: auto
|
||||
resume_from_path: null
|
||||
val_before_train: true
|
||||
val_only: false
|
||||
test_freq: -1
|
||||
critic_warmup: 0
|
||||
default_hdfs_dir: null
|
||||
del_local_ckpt_after_load: false
|
||||
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
|
||||
max_actor_ckpt_to_keep: null
|
||||
max_critic_ckpt_to_keep: null
|
||||
ray_wait_register_center_timeout: 300
|
||||
device: cuda
|
||||
use_legacy_worker_impl: auto
|
||||
global_profiler:
|
||||
_target_: verl.utils.profiler.ProfilerConfig
|
||||
tool: null
|
||||
steps: null
|
||||
profile_continuous_steps: false
|
||||
save_path: outputs/profile
|
||||
global_tool_config:
|
||||
nsys:
|
||||
_target_: verl.utils.profiler.config.NsightToolConfig
|
||||
discrete: false
|
||||
controller_nsight_options:
|
||||
trace: cuda,nvtx,cublas,ucx
|
||||
cuda-memory-usage: 'true'
|
||||
cuda-graph-trace: graph
|
||||
worker_nsight_options:
|
||||
trace: cuda,nvtx,cublas,ucx
|
||||
cuda-memory-usage: 'true'
|
||||
cuda-graph-trace: graph
|
||||
capture-range: cudaProfilerApi
|
||||
capture-range-end: null
|
||||
kill: none
|
||||
ray_init:
|
||||
num_cpus: null
|
||||
timeline_json_file: null
|
||||
|
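The `${oc.select:...}` interpolations in the generated config above resolve a dotted path against the config root and fall back to a default when the key is missing. A minimal sketch of that lookup behavior with plain dicts (`oc_select` is a hypothetical stand-in for OmegaConf's `oc.select` resolver):

```python
def oc_select(cfg: dict, path: str, default=None):
    # Mimics OmegaConf's ``oc.select`` resolver: walk a dotted path from the
    # config root and return ``default`` when any segment is missing.
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

cfg = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
assert oc_select(cfg, "global_profiler.tool") == "nsys"
assert oc_select(cfg, "global_profiler.missing_key", None) is None
```

This is why per-role `tool`/`save_path` fields default to the global profiler section yet degrade to `null` when no global section is configured.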
@@ -128,3 +128,65 @@ optim:

# Whether to use custom fused kernels (e.g., FlashAttention, fused MLP)
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}

# profile the actor model in `update_policy`
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the actor
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # tool config specific to this role
  tool_config:

    # nsys tool config
    nsys:

      # True: each task has its own database; False: all tasks in one training step share one database.
      discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}

    # npu config
    npu:

      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
      _target_: verl.utils.profiler.config.NPUToolConfig

      # Contents to profile, can be empty
      # options: npu, cpu, memory, shapes, module, stack
      contents: []

      # Collection level, optional values: level_none, level0, level1, level2.
      level: "level1"

      # Whether to automatically parse the data.
      analysis: True

    # torch profiler config
    torch:

      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig

      # mini-batch at which profiling starts during training
      # NOTE: unlike the global `steps` config, which counts iterations,
      # this field counts mini-batches
      step_start: 0

      # mini-batch at which profiling stops during training
      step_end: null

@@ -104,21 +104,3 @@ megatron:

# oc.select: default val for ref.megatron.use_mbridge
use_mbridge: False

# profile the actor model in `update_policy`
profile:

  # turn it on when you want to profile the actor model
  use_profile: False

  # list, you can specify the ranks to profile
  profile_ranks: null

  # start step in update_policy
  step_start: -1

  # end step
  step_end: -1

  # the path to save the profile result
  save_path: null
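Several blocks above carry a `_target_` hint so that `verl.utils.omega_conf_to_dataclass` can locate and instantiate the matching dataclass. The resolution step can be sketched roughly as follows (illustrative only, not verl's implementation; `fractions.Fraction` stands in for a config dataclass):

```python
import importlib

def instantiate(cfg: dict):
    # Resolve the dotted ``_target_`` path to a class, then construct it
    # from the remaining keys (a sketch of _target_-style instantiation).
    module_path, cls_name = cfg["_target_"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**{k: v for k, v in cfg.items() if k != "_target_"})

obj = instantiate({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2})
assert str(obj) == "1/2"
```

The same pattern explains why every profiler block that is meant to become a `ProfilerConfig`, `NPUToolConfig`, or `TorchProfilerToolConfig` must keep its `_target_` key.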
@@ -45,14 +45,12 @@ class ProfileConfig(BaseConfig):
    The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.

    Args:
        use_profile (bool): Whether to enable profiling.
        profile_ranks (Optional[list[int]]): List of ranks to profile. None means all ranks.
        step_start (int): Starting step for profiling.
        step_end (int): Ending step for profiling.
        save_path (Optional[str]): Path to save profiling results.
    """

    use_profile: bool = False
    profile_ranks: Optional[list[int]] = None
    step_start: int = -1
    step_end: int = -1
@@ -95,18 +95,27 @@ checkpoint:
  # Whether to save checkpoints asynchronously. Only effective for Megatron as of now.
  async_save: False

# profiler configs
# the corresponding dataclass is verl.utils.profiler.ProfilerConfig.
# profile the critic model in `update_policy`
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # True: each task has its own database; False: all tasks in one training step share one database.
  discrete: False
  # profiler tool, defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the critic
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
@@ -4,8 +4,6 @@ defaults:
  # <folder_name>@<field_name>.<field_name>: <yaml_file_name>
  # actor_rollout_ref.actor: trainer/config/actor/megatron_actor.yaml
  - actor@actor_rollout_ref.actor: megatron_actor
  # trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
  - npu_profile@trainer.npu_profile: npu_profile
  # data: trainer/config/data/legacy_data.yaml
  - data@data: legacy_data
  # load the reference default config, then apply the fields in the current yaml
@@ -57,12 +55,6 @@ actor_rollout_ref:
    qkv_layer_name: qkv
    gate_proj_layer_name: gate_up

  profiler:
    _target_: verl.utils.profiler.ProfilerConfig
    discrete: False
    all_ranks: False
    ranks: []

custom_reward_function:
  path: null
  name: compute_score
@@ -92,8 +84,6 @@ trainer:
  balance_batch: True
  total_epochs: 30
  total_training_steps: null
  profile_steps: null # [1,2,5] or [] or null
  profile_continuous_steps: False
  project_name: verl_examples
  experiment_name: gsm8k
  logger: ['console', 'wandb']
@@ -117,18 +107,62 @@ trainer:
  # The timeout for ray worker group to wait for the register center to be ready
  ray_wait_register_center_timeout: 300
  device: cuda
  # see ppo_trainer.yaml for more details

global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null # choose between nsys, npu, torch
  steps: null # profile steps
  profile_continuous_steps: False
  save_path: "outputs/profile" # profiler saving path
  # Specific tool configs, can use +profiler.tool_config.[tool].xxx to config
  global_tool_config:

    # nsys config
    nsys:

      # True: each task has its own database; False: all tasks in one training step share one database.
      discrete: False

      # controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
      ## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
      ## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
      controller_nsight_options:

        # Select the API(s) to be traced.
        trace: "cuda,nvtx,cublas,ucx"

        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
        cuda-memory-usage: "true"

        # CUDA graphs will be traced as a whole
        cuda-graph-trace: "graph"

      # worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
      worker_nsight_options:

        # Select the API(s) to be traced.
        trace: "cuda,nvtx,cublas,ucx"

        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
        cuda-memory-usage: "true"

        # CUDA graphs will be traced as a whole
        cuda-graph-trace: "graph"

        # Profile only within the torch.cuda.profiler.start/stop range. Do not change this config.
        capture-range: "cudaProfilerApi"

        # Specify the desired behavior when a capture range ends.
        # In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
        # valid values are "repeat-shutdown:n" or null.
        # For normal whole-step profiling, n = len(profile_steps);
        # but for discrete profiling, n = len(profile_steps) * Number(subtasks).
        # Or leave it null and the program will use n = len(profile_steps) * 6.
        capture-range-end: null

        # Send a signal to the target application's process group. We let the program exit by itself.
        kill: none

ray_init:
  num_cpus: null # `null` means using all CPUs, which might hang when CPUs are limited in systems like SLURM; set an allowed number there.
  timeline_json_file: null
@@ -11,9 +11,6 @@ defaults:
  # actor_rollout_ref.actor: trainer/config/actor/dp_actor.yaml
  - actor@actor_rollout_ref.actor: dp_actor

  # trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
  - npu_profile@trainer.npu_profile: npu_profile

  # data: trainer/config/data/legacy_data.yaml
  - data@data: legacy_data

@@ -112,21 +109,6 @@ actor_rollout_ref:
    # for huge model, layered summon can save memory (prevent OOM) but make it slower
    layered_summon: False

  # profiler configs
  profiler:

    # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
    _target_: verl.utils.profiler.ProfilerConfig

    # True: each task has its own database; False: all tasks in one training step share one database.
    discrete: False

    # Whether to profile all ranks.
    all_ranks: False

    # The ranks that will be profiled. [] or [0,1,...]
    ranks: []

# custom reward function definition
custom_reward_function:

@@ -203,54 +185,6 @@ trainer:
  # Total training steps (can be set explicitly or derived from epochs)
  total_training_steps: null

  # The steps that will be profiled. null means no profiling. null or [1,2,5,...]
  profile_steps: null

  # Whether to combine continuous steps into one database.
  ## If True, worker.profiler.discrete must be False: [1,2] in one, [5] in another.
  ## If False: [1] in one, [2] in another, [5] in another.
  profile_continuous_steps: False

  # controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
  ## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
  ## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
  controller_nsight_options:

    # Select the API(s) to be traced.
    trace: "cuda,nvtx,cublas,ucx"

    # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
    cuda-memory-usage: "true"

    # CUDA graphs will be traced as a whole
    cuda-graph-trace: "graph"

  # worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
  worker_nsight_options:

    # Select the API(s) to be traced.
    trace: "cuda,nvtx,cublas,ucx"

    # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
    cuda-memory-usage: "true"

    # CUDA graphs will be traced as a whole
    cuda-graph-trace: "graph"

    # Profile only within the torch.cuda.profiler.start/stop range. Do not change this config.
    capture-range: "cudaProfilerApi"

    # Specify the desired behavior when a capture range ends.
    # In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
    # valid values are "repeat-shutdown:n" or null.
    # For normal whole-step profiling, n = len(profile_steps);
    # but for discrete profiling, n = len(profile_steps) * Number(subtasks).
    # Or leave it null and the program will use n = len(profile_steps) * 6.
    capture-range-end: null

    # Send a signal to the target application's process group. We let the program exit by itself.
    kill: none

  # Project name for experiment tracking (e.g., wandb)
  project_name: verl_examples
@@ -331,6 +265,79 @@ trainer:
  # mode: "auto", "enable", or "disable"
  use_legacy_worker_impl: auto


# profiler configs
global_profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # Profiling tool: choose between nsys, npu, torch
  tool: null

  # profile steps
  steps: null

  # Whether to combine continuous steps into one database.
  ## If True, worker.profiler.discrete must be False: [1,2] in one, [5] in another.
  ## If False: [1] in one, [2] in another, [5] in another.
  profile_continuous_steps: False

  # Path to save profiling contents
  save_path: "outputs/profile"

  # Specific tool configs, can use +profiler.tool_config.[tool].xxx to config
  global_tool_config:

    # nsys config
    nsys:

      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
      _target_: verl.utils.profiler.config.NsightToolConfig

      # True: each task has its own database; False: all tasks in one training step share one database.
      discrete: False

      # controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
      ## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
      ## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
      controller_nsight_options:

        # Select the API(s) to be traced.
        trace: "cuda,nvtx,cublas,ucx"

        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
        cuda-memory-usage: "true"

        # CUDA graphs will be traced as a whole
        cuda-graph-trace: "graph"

      # worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
      worker_nsight_options:

        # Select the API(s) to be traced.
        trace: "cuda,nvtx,cublas,ucx"

        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
        cuda-memory-usage: "true"

        # CUDA graphs will be traced as a whole
        cuda-graph-trace: "graph"

        # Profile only within the torch.cuda.profiler.start/stop range. Do not change this config.
        capture-range: "cudaProfilerApi"

        # Specify the desired behavior when a capture range ends.
        # In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
        # valid values are "repeat-shutdown:n" or null.
        # For normal whole-step profiling, n = len(profile_steps);
        # but for discrete profiling, n = len(profile_steps) * Number(subtasks).
        # Or leave it null and the program will use n = len(profile_steps) * 6.
        capture-range-end: null

        # Send a signal to the target application's process group. We let the program exit by itself.
        kill: none

# configs related to ray initialization
ray_init:
@@ -23,11 +23,4 @@ megatron:
  override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
  use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}

  profile:
    use_profile: False
    profile_ranks: null
    step_start: -1
    step_end: -1
    save_path: null

load_weight: True
@@ -19,3 +19,28 @@ log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,fa
# the max token length per GPU
# same as actor_rollout_ref.actor.ppo_max_token_len_per_gpu if it exists, otherwise 16384
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}

# profile the ref model in `compute_log_prob`
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the ref model
  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}

  # Whether to profile all ranks.
  all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
@@ -65,17 +65,27 @@ sandbox_fusion:
  # Max memory limit for each sandbox process in MB
  memory_limit_mb: 1024

# profiler configs
# profile the reward model in `compute_reward`
profiler:

  # hint for the target config dataclass
  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # True: each task has its own database; False: all tasks in one training step share one database.
  discrete: False
  # profiler tool, defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the reward model
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
@@ -225,3 +225,28 @@ skip_rollout: False
# Specifies the filesystem path where rollout data should be cached when skip_rollout is enabled.
# Note: Giving a path under /tmp/ray/session* is not recommended as these are temporary Ray cluster directories.
skip_dump_dir: /tmp/rollout_dump

# profile the rollout model in `generate_sequence`
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the rollout model
  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}

  # Whether to profile all ranks.
  all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
@@ -64,13 +64,16 @@ def run_ppo(config) -> None:
     # Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
     if (
         is_cuda_available
-        and config.trainer.get("profile_steps") is not None
-        and len(config.trainer.get("profile_steps", [])) > 0
+        and config.global_profiler.tool == "nsys"
+        and config.global_profiler.get("steps") is not None
+        and len(config.global_profiler.get("steps", [])) > 0
     ):
         from verl.utils.import_utils import is_nvtx_available

         assert is_nvtx_available(), "nvtx is not available in CUDA platform. Please 'pip3 install nvtx'"
-        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+        nsight_options = OmegaConf.to_container(
+            config.global_profiler.global_tool_config.nsys.controller_nsight_options
+        )
         runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
     else:
         runner = TaskRunner.remote()
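The nsys gating in `run_ppo` above boils down to three conditions: CUDA is available, nsys is the selected tool, and `global_profiler.steps` is a non-empty list. Reduced to plain dicts (a sketch, not verl's code; `needs_nsight` is a hypothetical helper):

```python
def needs_nsight(config: dict, cuda_available: bool = True) -> bool:
    # Mirrors the condition that attaches the Nsight runtime_env to TaskRunner:
    # CUDA present, nsys selected, and at least one step listed for profiling.
    gp = config.get("global_profiler", {})
    steps = gp.get("steps")
    return cuda_available and gp.get("tool") == "nsys" and steps is not None and len(steps) > 0

assert needs_nsight({"global_profiler": {"tool": "nsys", "steps": [1, 2]}})
assert not needs_nsight({"global_profiler": {"tool": "torch", "steps": [1]}})
assert not needs_nsight({"global_profiler": {"tool": "nsys", "steps": []}})
```

Note the explicit `tool == "nsys"` check is what the refactor adds: previously any non-empty `profile_steps` triggered the Nsight runtime environment.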
@@ -795,7 +795,6 @@ class RayPPOTrainer:
                 cls=self.role_worker_mapping[Role.ActorRollout],
                 config=self.config.actor_rollout_ref,
                 role="actor_rollout",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["actor_rollout"] = actor_rollout_cls
         else:
@@ -815,7 +814,6 @@ class RayPPOTrainer:
                 self.role_worker_mapping[Role.RefPolicy],
                 config=self.config.actor_rollout_ref,
                 role="ref",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls

@@ -835,13 +833,13 @@ class RayPPOTrainer:
         wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
         if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                 "worker_nsight_options must be set when profile_steps is set"
             )
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.trainer, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
             )
         wg_kwargs["device_name"] = self.device_name

@@ -1083,8 +1081,8 @@ class RayPPOTrainer:

         prev_step_profile = False
         curr_step_profile = (
-            self.global_steps in self.config.trainer.profile_steps
-            if self.config.trainer.profile_steps is not None
+            self.global_steps in self.config.global_profiler.steps
+            if self.config.global_profiler.steps is not None
             else False
         )
         next_step_profile = False
@@ -1097,7 +1095,7 @@ class RayPPOTrainer:
                 with marked_timer("start_profile", timing_raw):
                     self._start_profiling(
                         not prev_step_profile and curr_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )

@@ -1341,13 +1339,13 @@ class RayPPOTrainer:

                 with marked_timer("stop_profile", timing_raw):
                     next_step_profile = (
-                        self.global_steps + 1 in self.config.trainer.profile_steps
-                        if self.config.trainer.profile_steps is not None
+                        self.global_steps + 1 in self.config.global_profiler.steps
+                        if self.config.global_profiler.steps is not None
                         else False
                     )
                     self._stop_profiling(
                         curr_step_profile and not next_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )
                     prev_step_profile = curr_step_profile
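The trainer's start/stop conditions above implement edge detection over the profiled-step set when `profile_continuous_steps` is on: profiling starts on a rising edge (previous step not profiled, current one profiled) and stops on a falling edge, so consecutive steps share one capture. A standalone sketch of that logic:

```python
def should_start(prev_profiled: bool, curr_profiled: bool, continuous: bool) -> bool:
    # continuous mode: start only on the rising edge (off -> on)
    return (not prev_profiled and curr_profiled) if continuous else curr_profiled

def should_stop(curr_profiled: bool, next_profiled: bool, continuous: bool) -> bool:
    # continuous mode: stop only on the falling edge (on -> off)
    return (curr_profiled and not next_profiled) if continuous else curr_profiled

steps = {1, 2, 5}
starts = [s for s in range(1, 7) if should_start(s - 1 in steps, s in steps, True)]
stops = [s for s in range(1, 7) if should_stop(s in steps, s + 1 in steps, True)]
assert starts == [1, 5]  # steps 1-2 merge into one capture; 5 gets its own
assert stops == [2, 5]
```

This is also why `profile_continuous_steps: True` requires `discrete: False` in the worker tool config: a merged capture cannot be split into per-task databases.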
@ -12,14 +12,74 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import warnings
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Any, Optional
|
||||
|
||||
from omegaconf import MISSING
|
||||
|
||||
from verl.base_config import BaseConfig
|
||||
|
||||
|
||||
@dataclass
|
||||
class NsightToolConfig(BaseConfig):
|
||||
"""Nsight tool config."""
|
||||
|
||||
"True for each task has its own database, False for all tasks in one training step share one database."
|
||||
discrete: bool = False
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
pass
|
||||
|
||||
|
||||
@dataclass
|
||||
class TorchProfilerToolConfig(BaseConfig):
|
||||
"""Torch profiler tool config.
|
||||
|
||||
Args:
|
||||
step_start (int): Start step in update_policy.
|
||||
step_end (int): End step.
|
||||
"""
|
||||
|
||||
step_start: int = -1
|
||||
step_end: int = -1
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
"""config validation logics go here"""
|
||||
warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
|
||||
assert isinstance(self.step_start, int), f"Profiler step_start must be of type int, got {type(self.step_start)}"
|
||||
|
||||
|
||||
@dataclass
|
||||
class NPUToolConfig(NsightToolConfig):
|
||||
"""NPU profiler too; config."""
|
||||
|
||||
# options: npu, cpu, memory, shapes, module, stack
|
||||
contents: list[str] = field(default_factory=list)
|
||||
|
||||
# Collection level, optional values: level_none, level0, level1, level2.
|
||||
level: str = "level1"
|
||||
|
||||
# Whether to automatically parse the data.
|
||||
analysis: bool = False
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
"""config validation logics go here"""
|
||||
assert isinstance(self.contents, list), f"Profiler contents must be of type list, got {type(self.contents)}"
|
||||
assert isinstance(self.level, str), f"Profiler level must be of type str, got {type(self.level)}"
|
||||
assert isinstance(self.analysis, bool), f"Profiler analysis must be of type bool, got {type(self.analysis)}"
|
         for content in self.contents:
             assert content in ["npu", "cpu", "memory", "shapes", "module", "stack"], (
                 f"Profiler contents only supports npu, cpu, memory, shapes, module, stack, but gets {content}"
             )
         assert self.level in ["level_none", "level0", "level1", "level2"], (
             f"Profiler level only supports level0, 1, 2, and level_none, but gets {self.level}"
         )


 @dataclass
 class ProfilerConfig(BaseConfig):
-    """Worker profiler config. Currently only support Nsight system profiler.
+    """Worker profiler config.

     The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.

@@ -30,22 +90,33 @@ class ProfilerConfig(BaseConfig):
         ranks (list[int]): The ranks that will be profiled. Defaults to [].
     """

-    discrete: bool = False
+    tool: Optional[str] = MISSING
     enable: bool = False
     all_ranks: bool = False
     ranks: list[int] = field(default_factory=list)
+    save_path: Optional[str] = MISSING
+    tool_config: Any = MISSING  # Just a placeholder, will use configs above directly

     def union(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, f"Cannot union ProfilerConfig with different tools: {self.tool} vs {other.tool}"
         return ProfilerConfig(
+            tool=self.tool,
             enable=self.enable or other.enable,
             all_ranks=self.all_ranks or other.all_ranks,
             ranks=list(set(self.ranks or []) | set(other.ranks or [])),
-            discrete=self.discrete or other.discrete,
+            tool_config=self.tool_config,
         )

     def intersect(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, (
+            f"Cannot intersect ProfilerConfig with different tools: {self.tool} vs {other.tool}"
+        )
         return ProfilerConfig(
+            tool=self.tool,
             enable=self.enable and other.enable,
             all_ranks=self.all_ranks and other.all_ranks,
             ranks=list(set(self.ranks or []) & set(other.ranks or [])),
-            discrete=self.discrete and other.discrete,
+            tool_config=self.tool_config,
        )

     def __post_init__(self) -> None:
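The `union`/`intersect` merge semantics above can be exercised with a minimal stand-in dataclass (illustrative only; `MiniProfilerConfig` is a hypothetical simplification, not verl's actual `ProfilerConfig`):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniProfilerConfig:
    # Minimal stand-in for verl's ProfilerConfig, for illustration only.
    tool: Optional[str] = None
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

    def union(self, other: "MiniProfilerConfig") -> "MiniProfilerConfig":
        # Enabled if either side enables profiling; rank sets are merged.
        assert self.tool == other.tool, "cannot union configs with different tools"
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable or other.enable,
            all_ranks=self.all_ranks or other.all_ranks,
            ranks=sorted(set(self.ranks) | set(other.ranks)),
        )

    def intersect(self, other: "MiniProfilerConfig") -> "MiniProfilerConfig":
        # Enabled only if both sides enable profiling; rank sets are intersected.
        assert self.tool == other.tool, "cannot intersect configs with different tools"
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable and other.enable,
            all_ranks=self.all_ranks and other.all_ranks,
            ranks=sorted(set(self.ranks) & set(other.ranks)),
        )


a = MiniProfilerConfig(tool="nsys", enable=True, ranks=[0, 1])
b = MiniProfilerConfig(tool="nsys", enable=False, ranks=[1, 2])
print(a.union(b).ranks)       # [0, 1, 2]
print(a.intersect(b).enable)  # False
```

Union is the permissive combination (any worker wanting profiling keeps it on for the merged rank set); intersect is the restrictive one. Both insist on a single `tool`, matching the asserts in the diff.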
@@ -20,9 +20,9 @@ from contextlib import contextmanager
 from typing import Any, Callable, Optional

 import torch_npu
-from omegaconf import DictConfig
 from torch_npu.npu import mstx

+from .config import NPUToolConfig
 from .profile import DistProfiler, ProfilerConfig


@@ -86,7 +86,14 @@ def marked_timer(name: str, timing_raw: dict[str, float], *args: Any, **kwargs:
         mark_end_range(mark_range)


-def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_step: Optional[str] = None):
+def get_npu_profiler(
+    contents: list[str],
+    profile_level: str,
+    profile_save_path: str,
+    analysis: bool,
+    role: Optional[str] = None,
+    profile_step: Optional[str] = None,
+):
     """Generate and return an NPU profiler object.

     Args:
@@ -97,18 +104,7 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
         profile_step(str, optional):
             The current training step. Defaults to None.
     """
-    if option.level == "level_none":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level_none
-    elif option.level == "level0":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level0
-    elif option.level == "level1":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level1
-    elif option.level == "level2":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level2
-    else:
-        raise ValueError(f"level only supports level0, 1, 2, and level_none, but gets {option.level}")
-
-    profile_save_path = option.save_path
     if profile_step:
         profile_save_path = os.path.join(profile_save_path, profile_step)
     if role:
@@ -123,18 +119,18 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
     )

     activites = []
-    if option.with_npu:
+    if contents is None or "npu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.NPU)
-    if option.with_cpu:
+    if contents is None or "cpu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.CPU)

     prof = torch_npu.profiler.profile(
-        with_modules=option.with_module,
-        with_stack=option.with_stack,
-        record_shapes=option.record_shapes,
-        profile_memory=option.with_memory,
+        with_modules=contents is None or "module" in contents,
+        with_stack=contents is None or "stack" in contents,
+        record_shapes=contents is None or "shapes" in contents,
+        profile_memory=contents is None or "memory" in contents,
         activities=activites,
-        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=option.analysis),
+        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=analysis),
         experimental_config=experimental_config,
     )
     return prof
@@ -147,7 +143,7 @@ class NPUProfiler(DistProfiler):

     _define_count = 0

-    def __init__(self, rank: int, config: ProfilerConfig, **kwargs):
+    def __init__(self, rank: int, config: ProfilerConfig, tool_config: NPUToolConfig, **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -155,12 +151,20 @@ class NPUProfiler(DistProfiler):
             config (Optional[ProfilerConfig]): Configuration for the profiler. If None, a default configuration is used.
         """
         if not config:
-            config = ProfilerConfig(ranks=[])
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be set when profiler is enabled"
+        self.enable: bool = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         self.profile_npu = None
-        self.profile_option = kwargs.get("option", None)
+        self.profile_contents = tool_config.contents
+        self.profile_level = tool_config.level
+        self.profile_save_path = config.save_path
+        self.analysis = tool_config.analysis
         if config.all_ranks:
             self.this_rank = True
         elif config.ranks:
@@ -169,15 +173,22 @@ class NPUProfiler(DistProfiler):
     def start(self, **kwargs):
         role, profile_step = kwargs.get("role", None), kwargs.get("profile_step", None)
         profile_step = str(profile_step) if profile_step is not None else None
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = True
             if not self.discrete and NPUProfiler._define_count == 0:
-                self.profile_npu = get_npu_profiler(option=self.profile_option, role=role, profile_step=profile_step)
+                self.profile_npu = get_npu_profiler(
+                    contents=self.profile_contents,
+                    profile_level=self.profile_level,
+                    profile_save_path=self.profile_save_path,
+                    analysis=self.analysis,
+                    role=role,
+                    profile_step=profile_step,
+                )
                 self.profile_npu.start()
                 NPUProfiler._define_count += 1

     def stop(self):
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = False
             if not self.discrete and NPUProfiler._define_count == 1:
                 self.profile_npu.step()
@@ -201,23 +212,20 @@ class NPUProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
+
                 profile_name = message or func.__name__
                 profile_this_role = True
                 discrete_mode = self.profiler.discrete
-                profile_enable = self.profiler.this_step and self.profile_option is not None
+                profile_enable = self.profiler.this_step and self.profiler.enable

+                if not profile_enable:
+                    return func(self, *args, **kwargs)
+
                 if profile_enable and role is not None:
                     target_roles = self.profile_option.get("roles", [])
                     profile_this_role = "all" in target_roles or role in target_roles

                 if profile_enable:
                     if not discrete_mode:
                         mark_range = mark_start_range(message=profile_name)
                     else:
                         if profile_this_role:
                             profile_npu = get_npu_profiler(option=self.profile_option, role=role)
                             profile_npu.start()
                             mark_range = mark_start_range(message=profile_name)
@@ -228,7 +236,6 @@ class NPUProfiler(DistProfiler):
                 if not discrete_mode:
                     mark_end_range(mark_range)
                 else:
                     if profile_this_role:
                         mark_end_range(mark_range)
                         profile_npu.step()
                         profile_npu.stop()
@@ -20,6 +20,7 @@ from typing import Callable, Optional
 import nvtx
 import torch

+from .config import NsightToolConfig
 from .profile import DistProfiler, ProfilerConfig


@@ -113,7 +114,7 @@ def marked_timer(
 class NsightSystemsProfiler(DistProfiler):
     """Nsight system profiler. Installed in a worker to control the Nsight system profiler."""

-    def __init__(self, rank: int, config: Optional[ProfilerConfig], **kwargs):
+    def __init__(self, rank: int, config: Optional[ProfilerConfig], tool_config: Optional[NsightToolConfig], **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -123,8 +124,13 @@ class NsightSystemsProfiler(DistProfiler):
         # If no configuration is provided, create a default ProfilerConfig with an empty list of ranks
         if not config:
             config = ProfilerConfig(ranks=[])
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         if config.all_ranks:
             self.this_rank = True
@@ -170,6 +176,9 @@ class NsightSystemsProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
+
                 profile_name = message or func.__name__

                 if self.profiler.this_step:
@@ -17,9 +17,8 @@ from typing import Callable, Optional

 import torch
 import torch.distributed
-from omegaconf import DictConfig, OmegaConf

-from .config import ProfilerConfig
+from .config import ProfilerConfig, TorchProfilerToolConfig


 class Profiler:
@@ -39,18 +38,23 @@ class Profiler:
         config: Configuration object containing profiling parameters
     """

-    def __init__(self, config):
+    def __init__(self, config: ProfilerConfig, tool_config: Optional[TorchProfilerToolConfig] = None):
         # note : if we do not set use_profile, it will be set as None, so that all function will be skip
-        if not isinstance(config, DictConfig):
-            config = OmegaConf.create(config)
+        if not config:
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.config = config
-        self.skip_prof = False
+        self.tool_config = tool_config
         self.saved = False
         self.prof = None
         self.rank = torch.distributed.get_rank()
         # we need to validate the config before using the profiler
         self._validate()
-        if config.use_profile and self.rank in self.config.profile_ranks:
+        if self.rank in self.config.profile_ranks:
             print(f"[Profiler] Profiler init for rank {self.rank}")

             self.prof = torch.profiler.profile(
@@ -59,9 +63,9 @@ class Profiler:
                     torch.profiler.ProfilerActivity.CUDA,
                 ],
                 schedule=torch.profiler.schedule(
-                    wait=max(self.config.step_start - 1, 0),
-                    warmup=1 if self.config.step_start > 0 else 0,
-                    active=self.config.step_end - self.config.step_start,
+                    wait=max(self.tool_config.step_start - 1, 0),
+                    warmup=1 if self.tool_config.step_start > 0 else 0,
+                    active=self.tool_config.step_end - self.tool_config.step_start,
                     repeat=1,
                 ),
                 record_shapes=True,
@@ -73,9 +77,9 @@ class Profiler:
         if self.config.profile_ranks is None:
             print("[WARNING] Profile ranks is not set, default to rank 0")
             self.config.profile_ranks = [0]
-        assert self.config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-        assert self.config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-        assert self.config.step_start < self.config.step_end, (
+        assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
+        assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
+        assert self.tool_config.step_start < self.tool_config.step_end, (
             "[ERROR] Profile step start must be less than step end"
         )
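The `wait`/`warmup`/`active` arithmetic that maps `step_start`/`step_end` onto `torch.profiler.schedule` above can be checked in isolation. A plain-Python sketch of the same expressions (`profiler_schedule_args` is a hypothetical helper for illustration, not part of verl):

```python
def profiler_schedule_args(step_start: int, step_end: int) -> dict:
    # Mirrors the arithmetic used with torch.profiler.schedule above:
    # skip `wait` steps, spend one warmup step when profiling does not
    # begin at step 0, then record `active` steps, for a single repeat.
    assert 0 <= step_start < step_end, "step range must be non-negative and non-empty"
    return {
        "wait": max(step_start - 1, 0),
        "warmup": 1 if step_start > 0 else 0,
        "active": step_end - step_start,
        "repeat": 1,
    }


print(profiler_schedule_args(0, 2))  # {'wait': 0, 'warmup': 0, 'active': 2, 'repeat': 1}
print(profiler_schedule_args(3, 5))  # {'wait': 2, 'warmup': 1, 'active': 2, 'repeat': 1}
```

Note that `wait + warmup` always equals `step_start`, so the first recorded step is exactly `step_start`.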
@@ -122,7 +122,7 @@ class MegatronPPOActor(BasePPOActor):
         self.tf_config = tf_config
         self.actor_module = actor_module
         self.actor_optimizer: DistributedOptimizer = actor_optimizer
-        self.prof = Profiler(self.config.profile)
+        self.prof = Profiler(self.config.profiler)
         self.use_fused_kernels = self.config.get("use_fused_kernels", False)
         if self.use_fused_kernels:
             from verl.models.mcore.model_forward_fused import patch_fused_forward
@@ -600,6 +600,7 @@ class MegatronPPOActor(BasePPOActor):

         """
         metrics = {}
+        if self.prof.enable:
             self.prof.start()
         for data in dataloader:
             data.to(get_device_id())
@@ -639,8 +640,10 @@ class MegatronPPOActor(BasePPOActor):
                 pass
             else:
                 raise NotImplementedError
+            if self.prof.enable:
                 self.prof.step()
             # add empty cache after each compute
+        if self.prof.enable:
             self.prof.stop_and_save()
             self.prof.stop_trace()
         get_torch_device().empty_cache()
@@ -19,6 +19,7 @@ from omegaconf import MISSING

 from verl.base_config import BaseConfig
 from verl.trainer.config import CheckpointConfig
+from verl.utils.profiler.config import ProfilerConfig

 from .engine import FSDPEngineConfig, McoreEngineConfig
 from .optimizer import OptimizerConfig
@@ -109,6 +110,7 @@ class ActorConfig(BaseConfig):
     checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
     optim: OptimizerConfig = field(default_factory=OptimizerConfig)
     use_fused_kernels: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate actor configuration parameters."""
@@ -218,6 +220,7 @@ class FSDPActorConfig(ActorConfig):
     entropy_checkpointing: bool = False
     fsdp_config: FSDPEngineConfig = field(default_factory=FSDPEngineConfig)
     use_remove_padding: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate FSDP actor configuration parameters."""
@@ -72,7 +72,7 @@ from verl.utils.fsdp_utils import (
 )
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import compute_position_id_with_mask
-from verl.utils.profiler import DistProfiler, DistProfilerExtension, log_gpu_memory_usage, simple_timer
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig, log_gpu_memory_usage, simple_timer
 from verl.utils.profiler.performance import reduce_timing
 from verl.utils.py_functional import convert_to_regular_types
 from verl.workers.config import FSDPCriticConfig, FSDPEngineConfig
@@ -116,7 +116,6 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         Worker.__init__(self)

         self.config = config
-        self.profile_option = kwargs.get("profile_option", None)
         import torch.distributed

         if not torch.distributed.is_initialized():
@@ -170,9 +169,30 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         # We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
         # as they provides DictConfig-like interface
         # The benefit of creating the dataclass config is to perform validation during __post_init__
-        profiler_config = omega_conf_to_dataclass(config.get("profiler"))
+        if self._is_actor:
+            omega_profiler_config = config.actor.get("profiler", {})
+        elif self._is_rollout:
+            # NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
+            # This is for extendability in AsyncRL cases
+            omega_profiler_config = config.rollout.get("profiler", {})
+        elif self._is_ref:
+            omega_profiler_config = config.ref.get("profiler", {})
+        else:
+            raise ValueError(
+                f"Invalid role {self.role}, should be one of "
+                "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
+            )
+        # omega_profiler_config is DictConfig
+        # profiler_config is a ProfilerConfig dataclass
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=profiler_config, option=self.profile_option)
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )

         self._is_offload_param = False
@@ -938,7 +958,17 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config: FSDPCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         import torch.distributed

         self.config = config
@@ -1336,8 +1366,18 @@ class RewardModelWorker(Worker, DistProfilerExtension):

     def __init__(self, config):
         Worker.__init__(self)

+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )

         import torch.distributed
@@ -55,6 +55,7 @@ from verl.utils.profiler import (
     DistProfiler,
     DistProfilerExtension,
     GPUMemoryLogger,
+    ProfilerConfig,
     log_gpu_memory_usage,
     simple_timer,
 )
@@ -213,8 +214,31 @@ class ActorRolloutRefWorker(MegatronWorker, DistProfilerExtension):
         self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"]
         self._is_ref = self.role in ["ref", "actor_rollout_ref"]

-        profiler_config = omega_conf_to_dataclass(config.get("profiler"))
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
+        if self._is_actor:
+            omega_profiler_config = config.actor.get("profiler", {})
+        elif self._is_rollout:
+            # NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
+            # This is for extendability in AsyncRL cases
+            omega_profiler_config = config.rollout.get("profiler", {})
+        elif self._is_ref:
+            omega_profiler_config = config.ref.get("profiler", {})
+        else:
+            raise ValueError(
+                f"Invalid role {self.role}, should be one of "
+                "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
+            )
+        # omega_profiler_config is DictConfig
+        # profiler_config is a ProfilerConfig dataclass
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )

         # TODO(sgm): Currently, we only support reference model param offload
         # will support other offload later
@@ -804,7 +828,18 @@ class AsyncActorRolloutRefWorker(ActorRolloutRefWorker):
 class CriticWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config: McoreCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))

+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self.config: McoreCriticConfig = config

         # NOTE(sgm): We utilize colocate WorkerGroup by default.
@@ -1072,8 +1107,19 @@ class RewardModelWorker(MegatronWorker, DistProfilerExtension):

     def __init__(self, config):
         Worker.__init__(self)

-        profiler_config = omega_conf_to_dataclass(config.get("profiler", {}), dataclass_type=ProfilerConfig)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         self.config = config
@@ -30,7 +30,7 @@ from verl.utils.device import (
     get_device_id,
     get_nccl_backend,
 )
-from verl.utils.profiler import DistProfiler, DistProfilerExtension
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig
 from verl.utils.py_functional import append_to_dict
 from verl.utils.torch_functional import masked_mean
 from verl.workers.engine import EngineRegistry
@@ -42,8 +42,16 @@ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
        )
         import torch.distributed
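The tool-selection boilerplate repeated in each worker `__init__` above reduces to a single lookup: read the worker's profiler config, and fetch the tool-specific sub-config only when a supported tool is set. A standalone sketch with plain dicts (`select_tool_config` is a hypothetical helper standing in for the `omega_conf_to_dataclass` calls; it is not part of verl):

```python
KNOWN_TOOLS = ("npu", "nsys", "torch")


def select_tool_config(profiler_cfg: dict):
    # Return the tool-specific sub-config for the configured tool,
    # or None when no supported tool is selected (profiling disabled).
    tool = profiler_cfg.get("tool")
    if tool in KNOWN_TOOLS:
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None


cfg = {"tool": "nsys", "tool_config": {"nsys": {"discrete": False}}}
print(select_tool_config(cfg))             # {'discrete': False}
print(select_tool_config({"tool": None}))  # None
```

This mirrors the contract the refactor enforces everywhere: `tool_config` may be `None` only while `enable` is false, which is exactly what the `assert not config.enable` guards in the profiler constructors check.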