Data collection based on FSDP (Fully Sharded Data Parallel) backend on Ascend devices (NPU)
===========================================================================================

Last updated: 07/24/2025.
This is a tutorial for data collection using the GRPO or DAPO algorithm based on FSDP on Ascend devices.
Configuration
-------------
Two levels of configuration control data collection:

1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
2. **Role profiler control**: Use parameters in each role's ``profiler`` field to control the collection mode for that role.
Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
- profiler: Controls the ranks and mode of profiling.

  - tool: The profiling tool to use; options are nsys, npu, torch, and torch_memory.
  - steps: A list of the steps to collect, such as [2, 4], which collects steps 2 and 4. If set to null, no collection occurs.
  - save_path: The path where collected data is saved. Default is "outputs/profile".
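For example, a global configuration that collects the npu tool's data at steps 2 and 4 might look like the following (a sketch combining the parameters described above; adjust values to your setup):

.. code:: yaml

   profiler:
     tool: npu
     steps: [2, 4]
     save_path: outputs/profile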
Use parameters in ``profiler.tool_config.npu`` to control the npu profiler behavior:
- level: Collection level; options are level_none, level0, level1, and level2.

  - level_none: Disables all level-based data collection (turns off profiler_level).
  - level0: Collects high-level application data, underlying NPU data, and operator execution details on the NPU.
  - level1: Extends level0 by adding CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: Extends level1 by adding CANN-layer Runtime data and AI CPU metrics.
- contents: A list of options that control the collection content, such as npu, cpu, memory, shapes, module, and stack.

  - npu: Whether to collect device-side performance data.
  - cpu: Whether to collect host-side performance data.
  - memory: Whether to enable memory analysis.
  - shapes: Whether to record tensor shapes.
  - module: Whether to record framework-layer Python call stack information.
  - stack: Whether to record operator call stack information.
- analysis: Whether to enable automatic parsing of the collected data.
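Putting these together, an npu tool configuration that collects level1 data plus memory analysis could be written as follows (a sketch; field names follow the list above):

.. code:: yaml

   profiler:
     tool_config:
       npu:
         level: level1
         contents: [npu, cpu, memory]
         analysis: True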
Role collection control
~~~~~~~~~~~~~~~~~~~~~~~

In each role's ``profiler`` field, you can control the collection mode for that role:
- enable: Whether to enable profiling for this role.
- all_ranks: Whether to collect data from all ranks.
- ranks: A list of ranks to collect data from. If empty, no data is collected.
- tool_config: Configuration for the profiling tool used by this role.
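For instance, to profile only ranks 0 and 1 of a role (shown here for the actor; other roles take the same field):

.. code:: yaml

   actor_rollout_ref:
     actor:
       profiler:
         enable: True
         all_ranks: False
         ranks: [0, 1]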
Examples
--------

Disabling collection
~~~~~~~~~~~~~~~~~~~~
.. code:: yaml

   profiler:
     steps: null # disable profiling
End-to-End collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: [1, 2, 5]
     discrete: False
   actor_rollout_ref:
     actor:
       profiler:
         enable: True
         all_ranks: True
Discrete Mode Collection
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     discrete: True
Visualization
-------------

Collected data is stored in the user-defined ``save_path`` and can be visualized with the `MindStudio Insight <https://www.hiascend.com/document/detail/zh/mindstudio/80RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html>`_ tool.

If the ``analysis`` parameter is set to False, offline parsing is required after data collection:
.. code:: python

   import torch_npu

   # Set profiler_path to the parent directory of the
   # "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder
   torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
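If you need to locate that parent directory programmatically, a small helper along these lines can find it (a hypothetical convenience function, not part of verl or torch_npu):

```python
from pathlib import Path
from typing import Optional

def find_profiler_parent(root: str) -> Optional[str]:
    """Return the parent directory of the first '*_ascend_pt' folder under root.

    Hypothetical helper: searches recursively for the result folder that the
    NPU profiler writes, e.g. 'localhost.localdomain_<PID>_<timestamp>_ascend_pt'.
    """
    for p in sorted(Path(root).rglob("*_ascend_pt")):
        if p.is_dir():
            return str(p.parent)
    return None
```

The returned path can then be passed as ``profiler_path`` to the ``analyse`` call above.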