verl/docs/perf/nsight_profiling.md
Blue Space 545f899844 [BREAKING] [perf] refactor: Profiler api refactor (#2894)
### What does this PR do?

Refactor the profiler CI into a unified form.

TODO:

- nsys use `save_path`
- nsys discrete tests are disabled
- torch profiler

cc: @davidmlw 


### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```

Local profiler config:

```yaml
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, default same as profiler.tool in global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on critic
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```

2025-08-11 09:52:41 +08:00


NVIDIA Nsight Systems profiling in verl

Last updated: 06/20/2025.

This guide explains how to use NVIDIA Nsight Systems for profiling verl training runs.

Configuration

Profiling in verl is configured through several parameters in the trainer configuration file (ppo_trainer.yaml, or other files such as dapo_trainer.yaml).

Prerequisites

The Nsight Systems version matters. See docker/Dockerfile.vllm.sglang.megatron for the version we used.

Global profiling control

verl has a single controller process and multiple worker processes, and both can be profiled. Because the controller process can run on any node in the cluster, a message is printed in the log giving the hostname of the controller's node and the controller's process ID.

In profiler, the following config entries control the profiler's behavior:

  • profiler.steps. List of step numbers at which profiling is performed. For example, [1, 2, 5] profiles steps 1, 2, and 5. null means no profiling.

  • profiler.profile_continuous_steps. If true, and profiler.discrete is False, consecutive steps in profiler.steps are combined into one database. In the example above, steps 1 and 2 go into one database and step 5 into another. If false, every step occupies at least one database. This option exists so that program behavior between steps can be observed.
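
The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this guide, not verl's implementation; the helper name group_profile_steps is hypothetical.

```python
def group_profile_steps(steps, profile_continuous_steps):
    """Group step numbers into the sets that would share one database.

    Hypothetical helper for illustration only; it is not part of verl.
    """
    if not steps:  # null or an empty list means no profiling
        return []
    ordered = sorted(set(steps))
    if not profile_continuous_steps:
        # Every step gets its own group (at least one database per step).
        return [[step] for step in ordered]
    groups = [[ordered[0]]]
    for step in ordered[1:]:
        if step == groups[-1][-1] + 1:
            groups[-1].append(step)  # consecutive step: same database
        else:
            groups.append([step])    # gap in the sequence: new database
    return groups


print(group_profile_steps([1, 2, 5], True))
print(group_profile_steps([1, 2, 5], False))
```

With profile_continuous_steps enabled, steps 1 and 2 land in one group and step 5 in another, which matches the example in the list above.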

Nsys options for the controller and worker processes are configured in trainer:

  • trainer.controller_nsight_options. This config group is for the single controller. All fields in this group are passed directly to Nsight Systems when Ray starts the controller process. ppo_trainer.yaml provides a working example. See the Nsight Systems manual and the Ray user guide for more details.
  • trainer.worker_nsight_options. This config group is for the worker processes. Similarly, all fields in this group are passed directly to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so capture-range: "cudaProfilerApi" is fixed and should not be changed. Users can set capture-range-end based on a precise calculation or simply leave it null.
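
To make "passed directly to Nsight Systems" more concrete, the sketch below turns an option group into nsys profile command-line flags. It rests on assumptions: single-letter keys are treated as short flags, longer keys as --key=value, and null values are skipped. The helper to_nsys_flags and the sample option values are hypothetical, and the actual launch logic in Ray and verl may differ.

```python
def to_nsys_flags(options):
    """Convert an options mapping into nsys command-line flags.

    Hypothetical helper: the mapping rule is an assumption made for
    illustration, not a copy of Ray's or verl's launch code.
    """
    flags = []
    for key, value in options.items():
        if value is None:
            # Assumption: a null entry (e.g. capture-range-end) is omitted.
            continue
        if len(key) == 1:
            flags.extend([f"-{key}", str(value)])
        else:
            flags.append(f"--{key}={value}")
    return flags


# Illustrative values only; see ppo_trainer.yaml for the real defaults.
worker_nsight_options = {
    "trace": "cuda,nvtx",
    "capture-range": "cudaProfilerApi",
    "capture-range-end": None,
}

print(["nsys", "profile"] + to_nsys_flags(worker_nsight_options))
```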

Worker process profiling

verl manages multiple RL roles (Actor, Ref, Rollout, Critic, Reward), which are implemented in different Worker classes. These workers can be combined into one Ray Actor running in a process group. Each RL role has its own profiling config group, profiler, which consists of three fields:

  • all_ranks and ranks. When all_ranks is set to True, all ranks are profiled; when it is set to False, only the ranks listed in ranks are profiled. By default, verl profiles the whole training process and writes a series of worker_process_<PID>.<RID>.nsys-rep files for each process rank, where PID is the process ID and RID is the capture range ID.
  • discrete. When set to False, all roles' actions in one training step are dumped into one database. When set to True, each action annotated by DistProfiler.annotate is dumped into its own database. In this case, each role's action occupies one <RID>.
  • actor_rollout_ref. This Worker can be configured to contain at most three roles that execute together, so actor_rollout_ref has a profiler config and all the roles inside it inherit that config.
  • verl collocate mode. verl can combine two Worker subclasses into one Worker Actor. In this case, the user must make sure that the combined Workers use a consistent discrete setting. The Nsight Systems profiler uses a torch.cuda.profiler.start() and stop() pair to dump a <step> database in either case.
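
The rank selection rule for enable, all_ranks, and ranks can be sketched as a small predicate. This is a minimal illustration of the rule as described in this guide; the function should_profile_rank is hypothetical and not part of verl's API.

```python
def should_profile_rank(rank, enable, all_ranks, ranks):
    """Decide whether a given rank is profiled.

    Hypothetical helper mirroring the enable / all_ranks / ranks fields.
    """
    if not enable:
        return False       # profiling is switched off for this role
    if all_ranks:
        return True        # every rank is profiled
    return rank in ranks   # otherwise only the listed ranks


# all_ranks takes precedence over the ranks list.
print(should_profile_rank(3, enable=True, all_ranks=True, ranks=[]))
# With all_ranks False, only the listed ranks are selected.
print(should_profile_rank(3, enable=True, all_ranks=False, ranks=[0, 1]))
```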

Where to find the profiling data

By default, the *.nsys-rep files are saved in the directory /tmp/ray/session_latest/logs/nsight/ on each node. According to the Ray manual, this default directory cannot be changed: "however, Ray preserves the --output option of the default config".

Some users may find this inconvenient, but the choice is understandable: Ray may start hundreds of processes, and saving all their files in one central place would put heavy pressure on a network file system.
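
Because the reports stay on each node, a small script can help take inventory of them. The sketch below lists the worker reports in the default directory and extracts PID and RID from the file name pattern given earlier. It assumes both fields are decimal integers; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Default location stated in this guide.
NSIGHT_DIR = Path("/tmp/ray/session_latest/logs/nsight")

# Assumption: PID and RID are decimal integers.
REPORT_NAME = re.compile(r"^worker_process_(?P<pid>\d+)\.(?P<rid>\d+)\.nsys-rep$")


def parse_report_name(name):
    """Return (pid, rid) for a worker report file name, or None."""
    match = REPORT_NAME.match(name)
    if match is None:
        return None
    return int(match["pid"]), int(match["rid"])


def list_reports(directory=NSIGHT_DIR):
    """Return (pid, rid, path) tuples for worker reports in a directory."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    reports = []
    for path in sorted(directory.glob("*.nsys-rep")):
        parsed = parse_report_name(path.name)
        if parsed is not None:
            reports.append((parsed[0], parsed[1], path))
    return reports


for pid, rid, path in list_reports():
    print(pid, rid, path)
```

Run on each node, this prints one line per worker report, which makes it easier to decide which files to copy off the node for analysis.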

Usage Example

To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:

Disable profiler

    profiler:
        steps: null # disable profile

Enable profiler and one database for one training step

    profiler:
        steps: [1, 2, 5]
        discrete: False
    actor_rollout_ref:
        actor:
            profiler:
                enable: True
                all_ranks: True
        # rollout & ref follow actor settings
    critic:
        profiler:
            enable: True
            all_ranks: True
    reward_model:
        profiler:
            enable: True
            all_ranks: True

Enable profiler and multiple databases for one training step

    profiler:
        steps: [1, 2, 5]
        discrete: True

Profiling Output

When profiling is enabled, verl will generate Nsight Systems profiles for the specified components and steps. The profiles will include:

  • CUDA kernel execution
  • Memory operations
  • CPU-GPU synchronization
  • NVTX markers for key operations

Nsight Systems supports a multi-report view that opens multiple databases together. In this mode, different processes and steps can be aligned on one timeline for easier analysis.