[BREAKING] [perf] refactor: Profiler api refactor (#2894)

### What does this PR do?

Refactor the profiler API into a unified configuration scheme.

TODO:

- nsys use `save_path`
- nsys discrete tests are disabled
- torch profiler

cc: @davidmlw 

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results such as training curve plots, evaluation results, etc.

### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```
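As a rough illustration (not verl's actual implementation), the `_target_` entries above resolve to dataclasses whose fields mirror the YAML keys. A minimal sketch with hypothetical stand-in definitions:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for the dataclasses named by `_target_` above;
# field names mirror the YAML keys, but the real definitions live in verl.
@dataclass
class NsightToolConfig:
    discrete: bool = False

@dataclass
class TorchProfilerToolConfig:
    step_start: int = 0
    step_end: Optional[int] = None

@dataclass
class ProfilerConfig:
    tool: Optional[str] = None
    steps: Optional[list] = None
    profile_continuous_steps: bool = False
    save_path: str = "outputs/profile"
    tool_config: dict = field(default_factory=dict)

# Roughly what instantiating the YAML above would produce:
cfg = ProfilerConfig(tool="nsys", steps=[1, 2, 5],
                     tool_config={"nsys": NsightToolConfig(discrete=True)})
assert cfg.tool_config["nsys"].discrete
assert ProfilerConfig().save_path == "outputs/profile"
```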

Local profiler config:

```yaml
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, default same as profiler.tool in global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether enable profile on critic
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```
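The `${oc.select:...}` interpolations above fall back to a default when the global key is absent. A minimal pure-Python sketch of that resolution logic (mimicking OmegaConf's `oc.select` resolver, not its actual code):

```python
# Sketch of how ${oc.select:global_profiler.tool,null} resolves: walk the
# dotted key through the global config, returning the default when any
# segment is missing. Illustrative only; OmegaConf implements this itself.
def oc_select(cfg: dict, dotted_key: str, default=None):
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

global_cfg = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
assert oc_select(global_cfg, "global_profiler.tool") == "nsys"
assert oc_select(global_cfg, "global_profiler.tool_config") is None  # falls back
```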

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
Commit 545f899844 (parent 287ef7e262) by Blue Space, 2025-08-11 09:52:41 +08:00, committed by GitHub.
41 changed files with 1005 additions and 694 deletions

.gitignore vendored

@ -59,6 +59,7 @@ coverage.xml
*,cover
.hypothesis/
pytest.ini
output.txt
# Translations
*.mo


@ -8,107 +8,87 @@ Last updated: 07/24/2025.
Configuration
-------------
Reuse the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps,
and control parameters such as the collection level via verl/trainer/config/npu_profile/npu_profile.yaml.
Use two levels of profile settings to control data collection:
- Global collection control: use the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps.
- Role profile control: use the configuration items in each role to control per-role parameters.
Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~
Use parameters in ppo_trainer.yaml to control the collection steps and mode:
- trainer.profile_steps:
  This parameter can be set to a list of collection steps, e.g. [2, 4], meaning steps 2 and 4 will be collected. If set to null, no collection occurs.
- actor_rollout_ref.profiler:
  Controls the ranks and mode of collection.
- profiler: controls the ranks and mode of collection
- all_ranks: when set to True, data is collected from all ranks.
- ranks: when all_ranks is not True, this parameter specifies the ranks to collect as a list, e.g. [0, 1].
- discrete:
  Controls the collection mode. When set to False, end-to-end data is collected; when set to True, discrete mode is used and data is collected per training phase.
- tool: the profiling tool to use; options are nsys, npu, torch, torch_memory.
- steps: this parameter can be set to a list of collection steps, e.g. [2, 4], meaning steps 2 and 4 will be collected. If set to null, no collection occurs.
- save_path: the path to save collected data. Default is "outputs/profile".
Use parameters in npu_profile.yaml to control collection behavior:
Use parameters in ``profiler.tool_config.npu`` to control collection behavior:
- save_path: storage path for collected data
- roles: roles to collect; the following options are available
- level: collection level; options are level_none, level0, level1, and level2
  - rollout_generate: collect the generate_sequences phase of the rollout worker
  - actor_compute_log_prob: collect the compute_log_prob phase of the actor worker
  - actor_update: collect the update_actor phase of the actor worker
  - ref_compute_log_prob: collect the compute_ref_log_prob phase of the ref worker
  - all: collect all of the above phases
  - level_none: disable all level-based data collection (turns off profiler_level)
  - level0: collect high-level application data, low-level NPU data, and operator execution details on the NPU.
  - level1: extends level0 with CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: extends level1 with CANN-layer Runtime data and AI CPU metrics.
- level: collection level; options are level_none, level0, level1, and level2
- contents: a list of options controlling the collected content, such as
  npu, cpu, memory, shapes, module, stack.
  - npu: whether to collect device-side performance data.
  - cpu: whether to collect host-side performance data.
  - memory: whether to enable memory profiling.
  - shapes: whether to record tensor shapes.
  - module: whether to record framework-layer Python call stack information.
  - stack: whether to record operator call stack information.
- level_none: do not collect any level-controlled data (i.e., turn off profiler_level)
- level0: collect upper-layer application data, low-level NPU data, and information on operators executed on the NPU
- level1: on top of level0, additionally collect CANN-layer AscendCL data and AI Core performance metrics on the NPU
- level2: on top of level1, additionally collect CANN-layer Runtime data and AI CPU metrics
- analysis: enable automatic data parsing.
- record_shapes: whether to record tensor shapes
- with_memory: whether to enable memory profiling
- with_npu: whether to collect device-side performance data
- with_cpu: whether to collect host-side performance data
- with_module: whether to record framework-layer Python call stack information
- with_stack: whether to record operator call stack information
- analysis: whether to parse data automatically
Role profile control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each role's ``profile`` field, you can control the collection mode for that role.
- enable: whether to enable profiling for this role.
- all_ranks: whether to collect data from all ranks.
- ranks: the list of ranks to collect data from. If empty, no data is collected.
- tool_config: configuration of the profiling tool used by this role.
Examples
--------
Disabling collection
~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: null # disable profile
profiler:
steps: null # disable profile
End-to-end collection
~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: False
all_ranks: True
profiler:
steps: [1, 2, 5]
discrete: False
actor_rollout_ref:
actor:
profile:
enable: True
all_ranks: True
# rollout & ref follow actor settings
Discrete mode collection
~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Enable actor collection in discrete mode
~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
npu_profile:
options:
roles: ["actor_compute_log_prob", "actor_update"]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
profiler:
discrete: True
Visualization


@ -9,10 +9,10 @@ based on FSDP on Ascend devices.
Configuration
-------------
Reuse the configuration items in
verl/trainer/config/ppo_trainer.yaml to control the collection mode
and steps; you can also manage collection behaviors, such as the
collection level, via verl/trainer/config/npu_profile/npu_profile.yaml.
Leverage two levels of configuration to control data collection:
1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.
Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~
@ -20,31 +20,17 @@ Global collection control
Use parameters in ppo_trainer.yaml to control the collection mode
and steps.
- trainer.profile_steps: This parameter can be set as a list that has
collection steps, such as [2, 4], which means it will collect steps 2
and 4. If set to null, no collection occurs.
- actor_rollout_ref.profiler: Control the ranks and mode of profiling
- profiler: Control the ranks and mode of profiling
- all_ranks: Collects data from all ranks when set to true.
- ranks: This parameter specifies which ranks to collect (e.g., [0,
1]) when all_ranks is False.
- discrete: Controls the collection mode. If False, end-to-end data
is collected; if True, data is collected in discrete phases during
training.
- tool: The profiling tool to use, options are nsys, npu, torch,
torch_memory.
- steps: This parameter can be set as a list that has
collection steps, such as [2, 4], which means it will collect steps 2
and 4. If set to null, no collection occurs.
- save_path: The path to save the collected data. Default is
"outputs/profile".
Use parameters in npu_profile.yaml to control collection behavior:
- save_path: Storage path for collected data.
- roles: Roles to collect. The following options are available
- rollout_generate: Collect the `generate_sequences` phase
of rollout worker.
- actor_compute_log_prob: Collect the `compute_log_prob` phase
of the actor worker.
- actor_update: Collect the `update_actor` phase of the actor worker.
- ref_compute_log_prob: Collect the `compute_ref_log_prob` phase
of the ref worker.
- all: Collect all of the above phases.
Use parameters in ``profiler.tool_config.npu`` to control npu profiler behavior:
- level: Collection level; options are level_none, level0, level1, and
level2
@ -58,15 +44,31 @@ Use parameters in npu_profile.yaml to control collection behavior:
- level2: Extends level1 by adding CANN-layer Runtime data and AI
CPU metrics.
- record_shapes: Whether to record tensor shapes.
- with_memory: Whether to enable memory analysis.
- with_npu: Whether to collect device-side performance data.
- with_cpu: Whether to collect host-side performance data.
- with_module: Whether to record framework-layer Python call stack
information.
- with_stack: Whether to record operator call stack information.
- contents: A list of options to control the collection content, such as
npu, cpu, memory, shapes, module, stack.
- npu: Whether to collect device-side performance data.
- cpu: Whether to collect host-side performance data.
- memory: Whether to enable memory analysis.
- shapes: Whether to record tensor shapes.
- module: Whether to record framework-layer Python call stack
information.
- stack: Whether to record operator call stack information.
- analysis: Enables automatic data parsing.
Role collection control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each role's ``profile`` field, you can control the collection mode for that role.
- enable: Whether to enable profiling for this role.
- all_ranks: Whether to collect data from all ranks.
- ranks: A list of ranks to collect data from. If empty, no data is collected.
- tool_config: Configuration for the profiling tool used by this role.
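A minimal sketch of how these fields could combine to decide whether a given rank collects data (illustrative logic only, not verl's actual code):

```python
def should_profile(rank: int, enable: bool, all_ranks: bool, ranks: list) -> bool:
    """Decide whether this rank collects data, per the fields above (sketch)."""
    if not enable:
        return False          # profiling disabled for this role
    if all_ranks:
        return True           # collect from every rank
    return rank in ranks      # otherwise only the listed ranks

assert should_profile(0, enable=True, all_ranks=False, ranks=[0, 1])
assert not should_profile(2, enable=True, all_ranks=False, ranks=[0, 1])
assert not should_profile(0, enable=False, all_ranks=True, ranks=[])
```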
Examples
--------
@ -75,20 +77,22 @@ Disabling collection
.. code:: yaml
trainer:
profile_steps: null # disable profile
profiler:
steps: null # disable profile
End-to-End collection
~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
steps: [1, 2, 5]
discrete: False
actor_rollout_ref:
actor:
profiler:
discrete: False
all_ranks: True
enable: True
all_ranks: True
Discrete Mode Collection
@ -96,30 +100,8 @@ Discrete Mode Collection
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Enable actor collection in discrete mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
npu_profile:
options:
roles: ["actor_compute_log_prob", "actor_update"]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
profiler:
discrete: True
Visualization


@ -16,31 +16,29 @@ Nsight Systems version is important, please reference `docker/Dockerfile.vllm.sg
verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed on any node in the cluster, a message is printed in the log to indicate the controller process's node hostname and process id.
In `trainer`, three new config entries control the profiler behaviors:
In `profiler`, three new config entries control the profiler behaviors:
* **`trainer.profile_steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5. And ``null`` means no profiling.
* **`profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5. And ``null`` means no profiling.
* **`trainer.profile_continuous_steps`**. If true, and the following `profiler.discrete==False`, then the continuous steps in `profile_steps` will be combined into one database. For example the above step 1 and 2 are in one database, and 5 in another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.
* **`profiler.profile_continuous_steps`**. If true, and the following `profiler.discrete==False`, then the continuous steps in `profiler.steps` will be combined into one database. For example the above step 1 and 2 are in one database, and 5 in another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.
* **`controller_nsight_options`**. This config group is for the single controller. All fields in this config group are passed directly to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
Nsys options for controller nodes and worker nodes are configured in `trainer`:
* **`worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group are passed directly to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can set `capture-range-end` with an accurate calculation, or just leave it `null`.
* **`trainer.controller_nsight_options`**. This config group is for the single controller. All fields in this config group are passed directly to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
* **`trainer.worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group are passed directly to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can set `capture-range-end` with an accurate calculation, or just leave it `null`.
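The continuous-steps behavior (adjacent profiled steps sharing one database) can be sketched as a small grouping function, illustrative only:

```python
def group_continuous_steps(steps, continuous: bool):
    """Group profiled steps into databases: in continuous mode, adjacent step
    numbers share one database; otherwise each step gets its own (sketch)."""
    steps = sorted(steps)
    if not steps:
        return []
    if not continuous:
        return [[s] for s in steps]
    groups = [[steps[0]]]
    for s in steps[1:]:
        if s == groups[-1][-1] + 1:
            groups[-1].append(s)  # consecutive step joins current database
        else:
            groups.append([s])    # gap starts a new database
    return groups

# Matches the example above: steps 1 and 2 in one database, 5 in another.
assert group_continuous_steps([1, 2, 5], continuous=True) == [[1, 2], [5]]
assert group_continuous_steps([1, 2, 5], continuous=False) == [[1], [2], [5]]
```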
### Worker process profiling
Verl manages multiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_, _Reward_, which are implemented in different Worker classes. These workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:
* **`all_ranks` and `ranks`**. When `all_ranks` is set to `True`, all ranks will be profiled; when set to `False`, the ranks listed in `ranks` will be profiled. By default, verl profiles the whole training process into a series of `worker_process_<PID>.<RID>.nsys-rep` files, one per process rank. PID is the process ID; RID is the capture range ID.
* **`discrete`**. When set to `False`, all the roles' actions in one training step are dumped into one database. When set to `True`, the actions annotated by `DistProfiler.annotate` are dumped into discrete databases. In this case, each role's action occupies one `<RID>`.
* **`actor_rollout_ref`**. This Worker can be configured to contain at most three roles, which execute together. So `actor_rollout_ref` has a single `profiler` config, and all the roles inside inherit it.
* **Verl collocate mode**. Verl can combine two Worker subclasses into one Worker Actor. In this case, the user should ensure the combined Workers have a consistent `discrete` setting. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database either way.
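A hypothetical sketch of the discrete-mode annotation pattern: each action wrapped by a `DistProfiler.annotate`-style decorator opens and closes its own capture range. Names and mechanics here are illustrative only, not verl's implementation:

```python
import functools

# Illustrative decorator: in discrete mode each annotated role action gets its
# own start/stop capture range (standing in for torch.cuda.profiler calls).
def annotate(role_action):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.discrete:
                self.ranges.append(f"start:{role_action}")
            out = fn(self, *args, **kwargs)
            if self.discrete:
                self.ranges.append(f"stop:{role_action}")
            return out
        return wrapper
    return decorator

class Worker:
    def __init__(self, discrete: bool):
        self.discrete = discrete
        self.ranges = []  # records capture-range events for the sketch

    @annotate("update_actor")
    def update_actor(self):
        return "done"

w = Worker(discrete=True)
w.update_actor()
assert w.ranges == ["start:update_actor", "stop:update_actor"]
```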
### Where to find the profiling data
By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` at each node. According to the Ray manual, this default directory is not changeable. ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).
By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` at each node. According to the Ray manual, this default directory is not changeable. ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).
Some users may find this inconvenient, but it is understandable: Ray may start hundreds of processes, and saving all the files in one central place would put heavy pressure on the network file system.
@ -49,51 +47,40 @@ Some users may think it is not convenient, but it is understandable that Ray may
To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:
### Disable profiler
```yaml
trainer:
profile_steps: null # disable profile
profiler:
steps: null # disable profile
```
### Enable profiler and one database for one training step
```yaml
trainer:
profile_steps: [1, 2, 5]
profiler:
steps: [1, 2, 5]
discrete: False
actor_rollout_ref:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
actor:
profile:
enable: True
all_ranks: True
# rollout & ref follow actor settings
critic:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
profile:
enable: True
all_ranks: True
reward_model:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
profile:
enable: True
all_ranks: True
```
### Enable profiler and multiple databases for one training step
```yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
critic:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
reward_model:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
profiler:
steps: [1, 2, 5]
discrete: True
```
## Profiling Output


@ -275,27 +275,6 @@ For the critic, you can include these parameters.
critic.megatron.grad_offload=True \
critic.megatron.optimizer_offload=True \
Profiler
^^^^^^^^
The profiler is a tool that helps you understand the performance of your
model. It can be used to profile the time spent on different operations
and identify the bottlenecks. You can get more information from
`torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.
In verl, the profiler currently supports only the actor role in Megatron. You can set
the begin and end steps to profile. Note that one step means one gradient update, and
the profiling results are saved to save_path. If you only want to profile
specific ranks, set profile_ranks; by default, it is [0].
.. code:: python
actor_rollout_ref.actor.profile.use_profile=True \
actor_rollout_ref.actor.profile.profile_ranks=[0] \
actor_rollout_ref.actor.profile.step_start=0 \
actor_rollout_ref.actor.profile.step_end=1 \
actor_rollout_ref.actor.profile.save_path="./profile"
Related MCore Document
----------------------


@ -9,14 +9,8 @@ PROFILE_RANKS="[1,2]"
# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
WITH_MEMORY=False
RECORD_SHAPES=False
WITH_NPU=True
WITH_CPU=True
WITH_MODULE=False
WITH_STACK=False
CONTENTS=['npu','cpu']
ANALYSIS=True
ROLES=["all"]
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
@ -28,20 +22,20 @@ python3 -m verl.trainer.main_ppo \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.profiler.discrete=$DISCRETE \
actor_rollout_ref.actor.optim.lr=5e-8 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.profiler.enable=True \
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
@ -51,16 +45,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.npu_profile.options.save_path=$SAVE_PATH \
trainer.npu_profile.options.level=$LEVEL \
trainer.npu_profile.options.with_memory=$WITH_MEMORY \
trainer.npu_profile.options.record_shapes=$RECORD_SHAPES \
trainer.npu_profile.options.with_npu=$WITH_NPU \
trainer.npu_profile.options.with_cpu=$WITH_CPU \
trainer.npu_profile.options.with_module=$WITH_MODULE \
trainer.npu_profile.options.with_stack=$WITH_STACK \
trainer.npu_profile.options.analysis=$ANALYSIS \
trainer.npu_profile.options.roles=$ROLES \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name='verl_grpo_example_gsm8k' \
@ -70,5 +54,12 @@ python3 -m verl.trainer.main_ppo \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=5 \
trainer.profile_steps=$PROFILE_STEPS \
trainer.device=npu $@
trainer.device=npu \
profiler.tool=npu \
profiler.steps=$PROFILE_STEPS \
profiler.save_path=$SAVE_PATH \
profiler.tool_config.npu.discrete=$DISCRETE \
profiler.tool_config.npu.contents=$CONTENTS \
profiler.tool_config.npu.level=$LEVEL \
profiler.tool_config.npu.analysis=$ANALYSIS \
$@


@ -8,12 +8,7 @@ DISCRETE=False
# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
WITH_MEMORY=False
RECORD_SHAPES=False
WITH_NPU=True
WITH_CPU=True
WITH_MODULE=False
WITH_STACK=False
CONTENTS=['npu','cpu']
ANALYSIS=True
python3 -m verl.trainer.main_ppo \
@ -28,15 +23,16 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-8 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.profiler.discrete=$DISCRETE \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.profiler.enable=True \
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
@ -48,15 +44,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.npu_profile.options.save_path=$SAVE_PATH \
trainer.npu_profile.options.level=$LEVEL \
trainer.npu_profile.options.with_memory=$WITH_MEMORY \
trainer.npu_profile.options.record_shapes=$RECORD_SHAPES \
trainer.npu_profile.options.with_npu=$WITH_NPU \
trainer.npu_profile.options.with_cpu=$WITH_CPU \
trainer.npu_profile.options.with_module=$WITH_MODULE \
trainer.npu_profile.options.with_stack=$WITH_STACK \
trainer.npu_profile.options.analysis=$ANALYSIS \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name='verl_grpo_example_gsm8k' \
@ -66,5 +53,12 @@ python3 -m verl.trainer.main_ppo \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=5 \
trainer.profile_steps=$PROFILE_STEPS \
trainer.device=npu $@
trainer.device=npu \
profiler.tool=npu \
profiler.steps=$PROFILE_STEPS \
profiler.save_path=$SAVE_PATH \
profiler.tool_config.npu.discrete=$DISCRETE \
profiler.tool_config.npu.contents=$CONTENTS \
profiler.tool_config.npu.level=$LEVEL \
profiler.tool_config.npu.analysis=$ANALYSIS \
$@


@ -13,9 +13,9 @@ train_files=${train_files:-"$gsm8k_train_path"}
test_files=${test_files:-"$gsm8k_test_path"}
# Nsight profiling configuration
PROFILE_STEPS="[1,2,5]" # or [] or null
PROFILE_STEPS="[1]" # or [] or null
PROFILE_RANKS_ALL=False # or True
PROFILE_RANKS=[0,4,8,12]
PROFILE_RANKS=[0,4]
DISCRETE=True # or False
python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer'\
@ -34,30 +34,32 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.profiler.enable=True \
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.profiler.discrete=$DISCRETE \
critic.optim.lr=1e-5 \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.profiler.enable=True \
critic.profiler.ranks=$PROFILE_RANKS \
critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
critic.profiler.discrete=$DISCRETE \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_ppo_gsm8k_math_examples' \
trainer.experiment_name='deepseek_llm_7b_megatron' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
trainer.total_epochs=100 \
trainer.total_training_steps=6 \
trainer.profile_steps=$PROFILE_STEPS $@
trainer.total_training_steps=1 \
profiler.tool=nsys \
profiler.steps=$PROFILE_STEPS \
profiler.tool_config.nsys.discrete=$DISCRETE $@


@ -10,8 +10,8 @@ test_files=${test_files:-"$gsm8k_test_path"}
PROFILE_STEPS="[1,2,5]" # or [] or null
PROFILE_RANKS_ALL=False # or True
PROFILE_RANKS=[0,4,8,12]
DISCRETE=False # or True
PROFILE_RANKS=[0,4]
DISCRETE=True # or False
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
@ -30,17 +30,17 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.actor.ppo_mini_batch_size=512 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=12000 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.profiler.enable=True \
actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=24000 \
actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
actor_rollout_ref.profiler.discrete=$DISCRETE \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2-7B-Instruct \
@ -50,9 +50,9 @@ python3 -m verl.trainer.main_ppo \
critic.ppo_max_token_len_per_gpu=98304 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
critic.profiler.enable=True \
critic.profiler.ranks=$PROFILE_RANKS \
critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
critic.profiler.discrete=$DISCRETE \
reward_model.enable=True \
reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
reward_model.model.use_remove_padding=True \
@ -60,9 +60,9 @@ python3 -m verl.trainer.main_ppo \
reward_model.micro_batch_size_per_gpu=32 \
reward_model.use_dynamic_bsz=True \
reward_model.forward_max_token_len_per_gpu=98304 \
reward_model.profiler.enable=True \
reward_model.profiler.ranks=$PROFILE_RANKS \
reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
reward_model.profiler.discrete=$DISCRETE \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
@ -70,10 +70,12 @@ python3 -m verl.trainer.main_ppo \
trainer.experiment_name='qwen2-7b_hybrid_rm_bsz8k_p4k_r4k_seq_packing' \
trainer.n_gpus_per_node=8 \
trainer.val_before_train=False \
trainer.nnodes=2 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
trainer.total_epochs=15 \
trainer.total_training_steps=6 \
trainer.profile_continuous_steps=True \
trainer.profile_steps=$PROFILE_STEPS $@
profiler.profile_continuous_steps=True \
profiler.tool=nsys \
profiler.steps=$PROFILE_STEPS \
profiler.tool_config.nsys.discrete=$DISCRETE $@


@ -97,8 +97,8 @@ class RayDAPOTrainer(RayPPOTrainer):
prev_step_profile = False
curr_step_profile = (
self.global_steps in self.config.trainer.profile_steps
if self.config.trainer.profile_steps is not None
self.global_steps in self.config.global_profiler.steps
if self.config.global_profiler.steps is not None
else False
)
next_step_profile = False
@ -114,7 +114,7 @@ class RayDAPOTrainer(RayPPOTrainer):
with marked_timer("start_profile", timing_raw):
self._start_profiling(
not prev_step_profile and curr_step_profile
if self.config.trainer.profile_continuous_steps
if self.config.global_profiler.profile_continuous_steps
else curr_step_profile
)
@ -350,13 +350,13 @@ class RayDAPOTrainer(RayPPOTrainer):
with marked_timer("stop_profile", timing_raw):
next_step_profile = (
self.global_steps + 1 in self.config.trainer.profile_steps
if self.config.trainer.profile_steps is not None
self.global_steps + 1 in self.config.global_profiler.steps
if self.config.global_profiler.steps is not None
else False
)
self._stop_profiling(
curr_step_profile and not next_step_profile
if self.config.trainer.profile_continuous_steps
if self.config.global_profiler.profile_continuous_steps
else curr_step_profile
)
prev_step_profile = curr_step_profile
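The start/stop conditions in the hunks above implement a window: with `profile_continuous_steps` enabled, the profiler starts only on the first step of a contiguous run of profiled steps and stops only on the last one. A self-contained sketch of the same decision logic (standalone Python, not verl code; global steps assumed 1-indexed):

```python
def profiling_actions(total_steps, profile_steps, continuous):
    """Return a (start, stop) decision per step, mirroring the trainer logic above."""
    actions = []
    prev = False
    for step in range(1, total_steps + 1):
        curr = step in profile_steps if profile_steps is not None else False
        nxt = step + 1 in profile_steps if profile_steps is not None else False
        # Continuous mode: start on the first profiled step of a run,
        # stop on the last; otherwise start and stop every profiled step.
        start = (not prev and curr) if continuous else curr
        stop = (curr and not nxt) if continuous else curr
        actions.append((start, stop))
        prev = curr
    return actions

# With continuous profiling, steps 2-3 share one profiling session.
acts = profiling_actions(4, [2, 3], continuous=True)
```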

View File

@ -45,10 +45,13 @@ def run_ppo(config) -> None:
if (
is_cuda_available
-and OmegaConf.select(config.trainer, "profile_steps") is not None
-and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+and config.global_profiler.tool == "nsys"
+and OmegaConf.select(config.global_profiler, "steps") is not None
+and len(OmegaConf.select(config.global_profiler, "steps")) > 0
):
-nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+nsight_options = OmegaConf.to_container(
+config.global_profiler.global_tool_config.nsys.controller_nsight_options
+)
runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
else:
runner = TaskRunner.remote()

View File

@ -38,6 +38,7 @@ from verl.utils.fsdp_utils import (
)
from verl.utils.import_utils import import_external_libs
from verl.utils.model import get_generation_config, update_model_config
from verl.utils.profiler import ProfilerConfig
from verl.workers.fsdp_workers import ActorRolloutRefWorker as ARRWorker
from verl.workers.fsdp_workers import CriticWorker
@ -131,8 +132,17 @@ class RolloutWorker(ActorRolloutRefWorker):
# We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
# as they provide a DictConfig-like interface
# The benefit of creating the dataclass config is to perform validation during __post_init__
-profiler_config = omega_conf_to_dataclass(config.rollout.get("profiler", {}))
-DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
+omega_profiler_config = config.get("profiler", {})
+profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+tool_config = omega_conf_to_dataclass(
+omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+)
+else:
+tool_config = None
+DistProfilerExtension.__init__(
+self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+)
self._is_rollout = True
self._is_actor = False

View File

@ -51,10 +51,11 @@ def run_ppo(config) -> None:
# Create a remote instance of the TaskRunner class, and
# Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
if (
-OmegaConf.select(config.trainer, "profile_steps") is not None
-and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+config.global_profiler.tool == "nsys"
+and OmegaConf.select(config.global_profiler, "steps") is not None
+and len(OmegaConf.select(config.global_profiler, "steps")) > 0
):
-nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+nsight_options = OmegaConf.to_container(config.global_profiler.tool_config.nsys.controller_nsight_options)
runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
else:
runner = TaskRunner.remote()

View File

@ -213,7 +213,6 @@ class OneStepOffRayTrainer(RayPPOTrainer):
self.role_worker_mapping[Role.RefPolicy],
config=self.config.actor_rollout_ref,
role="ref",
-profile_option=self.config.trainer.npu_profile.options,
)
self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@ -233,13 +232,13 @@ class OneStepOffRayTrainer(RayPPOTrainer):
wg_kwargs = {} # Setting up kwargs for RayWorkerGroup
if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
"worker_nsight_options must be set when profile_steps is set"
)
wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-OmegaConf.select(self.config.trainer, "worker_nsight_options")
+OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
)
for resource_pool, class_dict in self.resource_pool_to_cls.items():
@ -391,8 +390,8 @@ class OneStepOffRayTrainer(RayPPOTrainer):
while batch_data_future is not None:
do_profile = (
-self.global_steps in self.config.trainer.profile_steps
-if self.config.trainer.profile_steps is not None
+self.global_steps in self.config.global_profiler.steps
+if self.config.global_profiler.steps is not None
else False
)
if do_profile:

View File

@ -37,6 +37,14 @@ class TestConfigComparison(unittest.TestCase):
"activations_checkpoint_method",
"activations_checkpoint_granularity",
"activations_checkpoint_num_layers",
"discrete",
"profiler",
"profile",
"use_profile",
"npu_profile",
"profile_steps",
"worker_nsight_options",
"controller_nsight_options",
]
def _compare_configs_recursively(

View File

@ -79,7 +79,7 @@ class TestPrintCfgCommand(unittest.TestCase):
# Run the command
result = subprocess.run(
["python3", "scripts/print_cfg.py", "critic.profiler.discrete=True", "+critic.profiler.extra.any_key=val"],
["python3", "scripts/print_cfg.py", "+critic.profiler.extra.any_key=val"],
capture_output=True,
text=True,
)
@ -90,7 +90,6 @@ class TestPrintCfgCommand(unittest.TestCase):
# Verify the output contains expected config information
self.assertIn("critic", result.stdout)
self.assertIn("profiler", result.stdout)
self.assertIn("discrete=True", result.stdout)
self.assertIn("extra={'any_key': 'val'}", result.stdout)

View File

@ -17,7 +17,7 @@ import unittest
from unittest.mock import MagicMock, patch
from verl.utils import omega_conf_to_dataclass
-from verl.utils.profiler import ProfilerConfig
+from verl.utils.profiler.config import NsightToolConfig, ProfilerConfig
from verl.utils.profiler.nvtx_profile import NsightSystemsProfiler
@ -29,26 +29,25 @@ class TestProfilerConfig(unittest.TestCase):
with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
cfg = compose(config_name="ppo_trainer")
arr = cfg.actor_rollout_ref
for config in [
cfg.actor_rollout_ref.actor.profiler,
cfg.actor_rollout_ref.rollout.profiler,
cfg.actor_rollout_ref.ref.profiler,
cfg.critic.profiler,
arr.profiler,
cfg.reward_model.profiler,
]:
profiler_config = omega_conf_to_dataclass(config)
self.assertEqual(profiler_config.discrete, config.discrete)
self.assertEqual(profiler_config.tool, config.tool)
self.assertEqual(profiler_config.enable, config.enable)
self.assertEqual(profiler_config.all_ranks, config.all_ranks)
self.assertEqual(profiler_config.ranks, config.ranks)
self.assertEqual(profiler_config.save_path, config.save_path)
self.assertEqual(profiler_config.ranks, config.ranks)
assert isinstance(profiler_config, ProfilerConfig)
with self.assertRaises(AttributeError):
_ = profiler_config.non_existing_key
assert config.get("non_existing_key") == profiler_config.get("non_existing_key")
assert config.get("non_existing_key", 1) == profiler_config.get("non_existing_key", 1)
assert config["discrete"] == profiler_config["discrete"]
from dataclasses import FrozenInstanceError
with self.assertRaises(FrozenInstanceError):
profiler_config.discrete = False
def test_frozen_config(self):
"""Test that modifying frozen keys in ProfilerConfig raises exceptions."""
@ -57,11 +56,7 @@ class TestProfilerConfig(unittest.TestCase):
from verl.utils.profiler.config import ProfilerConfig
# Create a new ProfilerConfig instance
-config = ProfilerConfig(discrete=True, all_ranks=False, ranks=[0], extra={"key": "value"})
-# Test direct attribute assignment
-with self.assertRaises(FrozenInstanceError):
-config.discrete = False
+config = ProfilerConfig(all_ranks=False, ranks=[0], extra={"key": "value"})
with self.assertRaises(FrozenInstanceError):
config.all_ranks = True
@ -69,10 +64,6 @@ class TestProfilerConfig(unittest.TestCase):
with self.assertRaises(FrozenInstanceError):
config.ranks = [1, 2, 3]
-# Test dictionary-style assignment
-with self.assertRaises(TypeError):
-config["discrete"] = False
with self.assertRaises(TypeError):
config["all_ranks"] = True
@ -90,20 +81,19 @@ class TestNsightSystemsProfiler(unittest.TestCase):
Test Plan:
1. Initialization: Verify profiler state after creation
2. Basic Profiling: Test start/stop functionality
-3. Discrete Mode: Test discrete profiling behavior
+3. Discrete Mode: TODO: Test discrete profiling behavior
4. Annotation: Test the annotate decorator in both normal and discrete modes
5. Config Validation: Verify proper config initialization from OmegaConf
"""
def setUp(self):
-self.config = ProfilerConfig(all_ranks=True)
+self.config = ProfilerConfig(enable=True, all_ranks=True)
self.rank = 0
-self.profiler = NsightSystemsProfiler(self.rank, self.config)
+self.profiler = NsightSystemsProfiler(self.rank, self.config, tool_config=NsightToolConfig(discrete=False))
def test_initialization(self):
self.assertEqual(self.profiler.this_rank, True)
self.assertEqual(self.profiler.this_step, False)
-self.assertEqual(self.profiler.discrete, False)
def test_start_stop_profiling(self):
with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
@ -117,18 +107,18 @@ class TestNsightSystemsProfiler(unittest.TestCase):
self.assertFalse(self.profiler.this_step)
mock_stop.assert_called_once()
-def test_discrete_profiling(self):
-discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-profiler = NsightSystemsProfiler(self.rank, discrete_config)
+# def test_discrete_profiling(self):
+# discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+# profiler = NsightSystemsProfiler(self.rank, discrete_config)
-with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
-profiler.start()
-self.assertTrue(profiler.this_step)
-mock_start.assert_not_called() # Shouldn't start immediately in discrete mode
+# with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
+# profiler.start()
+# self.assertTrue(profiler.this_step)
+# mock_start.assert_not_called() # Shouldn't start immediately in discrete mode
-profiler.stop()
-self.assertFalse(profiler.this_step)
-mock_stop.assert_not_called() # Shouldn't stop immediately in discrete mode
+# profiler.stop()
+# self.assertFalse(profiler.this_step)
+# mock_stop.assert_not_called() # Shouldn't stop immediately in discrete mode
def test_annotate_decorator(self):
mock_self = MagicMock()
@ -152,29 +142,29 @@ class TestNsightSystemsProfiler(unittest.TestCase):
mock_start.assert_not_called() # Not discrete mode
mock_stop.assert_not_called() # Not discrete mode
-def test_annotate_discrete_mode(self):
-discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-profiler = NsightSystemsProfiler(self.rank, discrete_config)
-mock_self = MagicMock()
-mock_self.profiler = profiler
-mock_self.profiler.this_step = True
+# def test_annotate_discrete_mode(self):
+# discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+# profiler = NsightSystemsProfiler(self.rank, discrete_config)
+# mock_self = MagicMock()
+# mock_self.profiler = profiler
+# mock_self.profiler.this_step = True
-@NsightSystemsProfiler.annotate(message="test")
-def test_func(self, *args, **kwargs):
-return "result"
+# @NsightSystemsProfiler.annotate(message="test")
+# def test_func(self, *args, **kwargs):
+# return "result"
-with (
-patch("torch.cuda.profiler.start") as mock_start,
-patch("torch.cuda.profiler.stop") as mock_stop,
-patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
-patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
-):
-result = test_func(mock_self)
-self.assertEqual(result, "result")
-mock_start_range.assert_called_once()
-mock_end_range.assert_called_once()
-mock_start.assert_called_once() # Should start in discrete mode
-mock_stop.assert_called_once() # Should stop in discrete mode
+# with (
+# patch("torch.cuda.profiler.start") as mock_start,
+# patch("torch.cuda.profiler.stop") as mock_stop,
+# patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
+# patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
+# ):
+# result = test_func(mock_self)
+# self.assertEqual(result, "result")
+# mock_start_range.assert_called_once()
+# mock_end_range.assert_called_once()
+# mock_start.assert_called_once() # Should start in discrete mode
+# mock_stop.assert_called_once() # Should stop in discrete mode
if __name__ == "__main__":

View File

@ -184,29 +184,26 @@ class TestCriticConfig:
optim = OptimizerConfig(lr=0.1)
critic_config = CriticConfig(ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim)
assert isinstance(critic_config.profiler, ProfilerConfig)
-assert critic_config.profiler.discrete is False
assert critic_config.profiler.all_ranks is False
assert critic_config.profiler.ranks == []
-custom_profiler = ProfilerConfig(discrete=True, all_ranks=True, ranks=[0, 1])
+custom_profiler = ProfilerConfig(all_ranks=True, ranks=[0, 1])
critic_config_custom = CriticConfig(
profiler=custom_profiler, ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim
)
assert isinstance(critic_config_custom.profiler, ProfilerConfig)
-assert critic_config_custom.profiler.discrete is True
assert critic_config_custom.profiler.all_ranks is True
assert critic_config_custom.profiler.ranks == [0, 1]
-profiler1 = ProfilerConfig(discrete=True, ranks=[0, 1])
+profiler1 = ProfilerConfig(enable=True, ranks=[0, 1])
profiler2 = ProfilerConfig(all_ranks=True, ranks=[1, 2])
union_result = profiler1.union(profiler2)
-assert union_result.discrete is True
+assert union_result.enable is True
assert union_result.all_ranks is True
assert set(union_result.ranks) == {0, 1, 2}
intersect_result = profiler1.intersect(profiler2)
-assert intersect_result.discrete is False
assert intersect_result.all_ranks is False
assert intersect_result.ranks == [1]
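The union/intersect assertions above imply OR/AND semantics on the boolean flags and set union/intersection on `ranks`. A standalone sketch consistent with those assertions (illustrative class, not verl's ProfilerConfig):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MiniProfilerConfig:
    # Illustrative subset of verl's ProfilerConfig matching the test's
    # union/intersect expectations; not the real class.
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

    def union(self, other):
        # OR the flags, union the rank sets.
        return MiniProfilerConfig(
            enable=self.enable or other.enable,
            all_ranks=self.all_ranks or other.all_ranks,
            ranks=sorted(set(self.ranks) | set(other.ranks)),
        )

    def intersect(self, other):
        # AND the flags, intersect the rank sets.
        return MiniProfilerConfig(
            enable=self.enable and other.enable,
            all_ranks=self.all_ranks and other.all_ranks,
            ranks=sorted(set(self.ranks) & set(other.ranks)),
        )

p1 = MiniProfilerConfig(enable=True, ranks=[0, 1])
p2 = MiniProfilerConfig(all_ranks=True, ranks=[1, 2])
```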

View File

@ -59,6 +59,25 @@ actor_rollout_ref:
use_checkpoint_opt_param_scheduler: false
override_optimizer_config: {}
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config:
nsys:
discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
npu:
_target_: verl.utils.profiler.config.NPUToolConfig
contents: []
level: level1
analysis: true
torch:
_target_: verl.utils.profiler.config.TorchProfilerToolConfig
step_start: 0
step_end: null
data_loader_seed: null
load_weight: true
megatron:
@ -85,12 +104,6 @@ actor_rollout_ref:
recompute_method: null
recompute_num_layers: null
use_mbridge: false
profile:
use_profile: false
profile_ranks: null
step_start: -1
step_end: -1
save_path: null
ref:
strategy: megatron
use_torch_compile: ${oc.select:actor_rollout_ref.actor.use_torch_compile,true}
@ -98,6 +111,14 @@ actor_rollout_ref:
log_prob_micro_batch_size_per_gpu: null
log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
megatron:
_target_: verl.workers.config.MegatronEngineConfig
param_offload: false
@ -114,12 +135,6 @@ actor_rollout_ref:
seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
profile:
use_profile: false
profile_ranks: null
step_start: -1
step_end: -1
save_path: null
load_weight: true
rollout:
name: ???
@ -184,6 +199,14 @@ actor_rollout_ref:
token2text: false
skip_rollout: false
skip_dump_dir: /tmp/rollout_dump
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
enable_chunked_prefill: false
load_format: dummy_megatron
layer_name_map:
@ -201,63 +224,6 @@ actor_rollout_ref:
freeze_moe_router: false
use_fused_kernels: false
trust_remote_code: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
all_ranks: false
ranks: []
trainer:
npu_profile:
options:
save_path: ./profiler_data
roles:
- all
level: level1
with_memory: false
record_shapes: false
with_npu: true
with_cpu: true
with_module: false
with_stack: false
analysis: true
balance_batch: true
total_epochs: 30
total_training_steps: null
profile_steps: null
profile_continuous_steps: false
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
del_local_ckpt_after_load: false
val_before_train: true
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
data:
tokenizer: null
use_shm: false
@ -344,9 +310,12 @@ critic:
async_save: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
nccl_timeout: 600
megatron:
_target_: verl.workers.config.McoreEngineConfig
@ -390,9 +359,12 @@ reward_model:
memory_limit_mb: 1024
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
nccl_timeout: 600
megatron:
_target_: verl.workers.config.MegatronEngineConfig
@ -432,6 +404,52 @@ algorithm:
pf_ppo:
reweight_method: pow
weight_pow: 2.0
trainer:
balance_batch: true
total_epochs: 30
total_training_steps: null
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
del_local_ckpt_after_load: false
val_before_train: true
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
global_profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: null
steps: null
profile_continuous_steps: false
save_path: outputs/profile
global_tool_config:
nsys:
discrete: false
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
ray_init:
num_cpus: null
timeline_json_file: null

View File

@ -51,6 +51,25 @@ actor_rollout_ref:
num_cycles: 0.5
warmup_style: constant
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config:
nsys:
discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
npu:
_target_: verl.utils.profiler.config.NPUToolConfig
contents: []
level: level1
analysis: true
torch:
_target_: verl.utils.profiler.config.TorchProfilerToolConfig
step_start: 0
step_end: null
grad_clip: 1.0
ulysses_sequence_parallel_size: 1
entropy_from_logits_with_chunking: false
@ -73,6 +92,14 @@ actor_rollout_ref:
log_prob_micro_batch_size_per_gpu: null
log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
model: null
fsdp_config:
_target_: verl.workers.config.FSDPEngineConfig
@ -147,6 +174,14 @@ actor_rollout_ref:
token2text: false
skip_rollout: false
skip_dump_dir: /tmp/rollout_dump
profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: ${oc.select:global_profiler.tool,null}
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
enable_chunked_prefill: true
load_format: dummy_dtensor
layered_summon: false
@ -170,67 +205,6 @@ actor_rollout_ref:
fused_kernel_options:
impl_backend: torch
trust_remote_code: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
all_ranks: false
ranks: []
trainer:
npu_profile:
options:
save_path: ./profiler_data
roles:
- all
level: level1
with_memory: false
record_shapes: false
with_npu: true
with_cpu: true
with_module: false
with_stack: false
analysis: true
balance_batch: true
total_epochs: 30
total_training_steps: null
profile_steps: null
profile_continuous_steps: false
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
rollout_data_dir: null
validation_data_dir: null
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
val_before_train: true
val_only: false
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
del_local_ckpt_after_load: false
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
use_legacy_worker_impl: auto
data:
tokenizer: null
use_shm: false
@ -322,9 +296,12 @@ critic:
async_save: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
forward_micro_batch_size: ${oc.select:.ppo_micro_batch_size,null}
forward_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size_per_gpu,null}
ulysses_sequence_parallel_size: 1
@ -361,9 +338,12 @@ reward_model:
memory_limit_mb: 1024
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: false
tool: ${oc.select:global_profiler.tool,null}
enable: false
all_ranks: false
ranks: []
save_path: ${oc.select:global_profiler.save_path,null}
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
ulysses_sequence_parallel_size: 1
custom_reward_function:
path: null
@ -386,6 +366,57 @@ algorithm:
pf_ppo:
reweight_method: pow
weight_pow: 2.0
trainer:
balance_batch: true
total_epochs: 30
total_training_steps: null
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
rollout_data_dir: null
validation_data_dir: null
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
val_before_train: true
val_only: false
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
del_local_ckpt_after_load: false
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
use_legacy_worker_impl: auto
global_profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: null
steps: null
profile_continuous_steps: false
save_path: outputs/profile
global_tool_config:
nsys:
_target_: verl.utils.profiler.config.NsightToolConfig
discrete: false
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
ray_init:
num_cpus: null
timeline_json_file: null

View File

@ -128,3 +128,65 @@ optim:
# Whether to use custom fused kernels (e.g., FlashAttention, fused MLP)
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
# profile the actor model in `update_policy`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# profiler tool, default same as profiler.tool in global config
# choices: nsys, npu, torch
tool: ${oc.select:global_profiler.tool,null}
# whether to enable profiling on the Actor
enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
# profile results saving path
save_path: ${oc.select:global_profiler.save_path,null}
# tool-specific config that applies only to this role
tool_config:
# nsys tool config
nsys:
# True: each task gets its own database; False: all tasks in one training step share one database.
discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
# npu config
npu:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.config.NPUToolConfig
# Contents to profile, can be empty
# options: npu, cpu, memory, shapes, module, stack
contents: []
# Collection level, optional values: level_none, level0, level1, level2.
level: "level1"
# Whether to automatically parse the data.
analysis: True
# torch profiler config
torch:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.config.TorchProfilerToolConfig
# start profiling at this mini-batch during training
# NOTICE: different from the global steps config, which refers to whole iterations;
# this field refers only to mini-batches
step_start: 0
# stop profiling at this mini-batch during training
step_end: null
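Under the mini-batch semantics described above, a torch-profiler window check might look like the sketch below; treating `step_end: null` as "no upper bound" is an assumption, not taken from verl's implementation:

```python
def should_profile_microbatch(idx, step_start=0, step_end=None):
    """Half-open window [step_start, step_end) over mini-batch indices.

    Assumption: step_end=None means 'no upper bound'; this mirrors the
    comments above but is not verl's actual code.
    """
    if idx < step_start:
        return False
    return step_end is None or idx < step_end
```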

View File

@ -103,22 +103,4 @@ megatron:
recompute_num_layers: null
# oc.select: default val for ref.megatron.use_mbridge
use_mbridge: False
# profile the actor model in `update_policy`
profile:
# turn it on when you want to profile the actor model
use_profile: False
# list, you can specify the ranks to profile
profile_ranks: null
# start step in update_policy
step_start: -1
# end step
step_end: -1
# the path to save the profile result
save_path: null
use_mbridge: False

View File

@ -45,14 +45,12 @@ class ProfileConfig(BaseConfig):
The inheritance from BaseConfig provides an omegaconf.DictConfig-like interface for a dataclass config.
Args:
use_profile (bool): Whether to enable profiling.
profile_ranks (Optional[list[int]]): List of ranks to profile. None means all ranks.
step_start (int): Starting step for profiling.
step_end (int): Ending step for profiling.
save_path (Optional[str]): Path to save profiling results.
"""
use_profile: bool = False
profile_ranks: Optional[list[int]] = None
step_start: int = -1
step_end: int = -1

View File

@ -95,18 +95,27 @@ checkpoint:
# Whether to save checkpoints asynchronously. Only effective for Megatron as of now.
async_save: False
# profiler configs
# the corresponding dataclass is verl.utils.profiler.ProfilerConfig.
# profile the critic model in `update_policy`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# True for each task has its own database, False for all tasks in one training step share one database.
discrete: False
# profiler tool, default same as profiler.tool in global config
# choices: nsys, npu, torch
tool: ${oc.select:global_profiler.tool,null}
# whether to enable profiling on the critic
enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
# profile results saving path
save_path: ${oc.select:global_profiler.save_path,null}
# specific tool config
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@ -4,8 +4,6 @@ defaults:
# <folder_name>@<field_name>.<field_name>: <yaml_file_name>
# actor_rollout_ref.actor: trainer/config/actor/megatron_actor.yaml
- actor@actor_rollout_ref.actor: megatron_actor
# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
# load the reference default config, then apply the fields in the current yaml
@ -57,12 +55,6 @@ actor_rollout_ref:
qkv_layer_name: qkv
gate_proj_layer_name: gate_up
profiler:
_target_: verl.utils.profiler.ProfilerConfig
discrete: False
all_ranks: False
ranks: []
custom_reward_function:
path: null
name: compute_score
@ -92,8 +84,6 @@ trainer:
balance_batch: True
total_epochs: 30
total_training_steps: null
profile_steps: null # [1,2,5] or [] or null
profile_continuous_steps: False
project_name: verl_examples
experiment_name: gsm8k
logger: ['console', 'wandb']
@ -117,18 +107,62 @@ trainer:
# The timeout for ray worker group to wait for the register center to be ready
ray_wait_register_center_timeout: 300
device: cuda
# see ppo_trainer.yaml for more details
controller_nsight_options:
trace: "cuda,nvtx,cublas,ucx"
cuda-memory-usage: "true"
cuda-graph-trace: "graph"
worker_nsight_options:
trace: "cuda,nvtx,cublas,ucx"
cuda-memory-usage: "true"
cuda-graph-trace: "graph"
capture-range: "cudaProfilerApi"
capture-range-end: null
kill: none
global_profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: null # choose between nsys, npu, torch
steps: null # profile steps
profile_continuous_steps: False
save_path: "outputs/profile" # profiler saving path
# Specific tool configs; values can be overridden via +profiler.tool_config.[tool].xxx
global_tool_config:
# nsys config
nsys:
# True: each task gets its own database; False: all tasks in one training step share one database.
discrete: False
# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
controller_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# worker Nvidia Nsight Systems Options. Must be set when profile_steps is not None.
worker_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
capture-range: "cudaProfilerApi"
# Specify the desired behavior when a capture range ends.
# In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
# Valid values are "repeat-shutdown:n" or null.
# For normal whole-step profiling, n = len(profile_steps);
# for discrete profiling, n = len(profile_steps) * Number(subtasks).
# Or just leave it null and the program will use n = len(profile_steps) * 6.
capture-range-end: null
# Send a signal to the target application's process group. We let the program exit by itself.
kill: none
ray_init:
num_cpus: null # `None` means using all CPUs, which might cause a hang if CPUs are limited by systems like SLURM. Set it to an allowed number in that case.
timeline_json_file: null
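The `capture-range-end` comments above can be sketched as a small helper (hypothetical, not part of verl) that derives the `repeat-shutdown:n` value from the profiling plan:

```python
def capture_range_end(profile_steps, discrete, num_subtasks=6):
    """Derive the nsys capture-range-end value.

    Whole-step profiling needs one start/stop pair per profiled step;
    discrete profiling needs one pair per subtask per step (the config
    comment above assumes 6 subtasks when the option is left null).
    """
    if profile_steps is None:
        return None  # no profiling planned, leave the option null
    n = len(profile_steps) * (num_subtasks if discrete else 1)
    return f"repeat-shutdown:{n}"
```

For example, `steps: [1, 2, 5]` with `discrete: False` would yield `repeat-shutdown:3`.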

View File

@ -11,9 +11,6 @@ defaults:
# actor_rollout_ref.actor: trainer/config/actor/dp_actor.yaml
- actor@actor_rollout_ref.actor: dp_actor
# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
@ -112,21 +109,6 @@ actor_rollout_ref:
# for huge model, layered summon can save memory (prevent OOM) but make it slower
layered_summon: False
# profiler configs
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# True means each task has its own database; False means all tasks in one training step share one database.
discrete: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
# custom reward function definition
custom_reward_function:
@ -203,54 +185,6 @@ trainer:
# Total training steps (can be set explicitly or derived from epochs)
total_training_steps: null
# The steps that will be profiled. null means no profiling. null or [1,2,5,...]
profile_steps: null
# Whether to combine continuous steps into one database.
## If True, worker.profiler.discrete must be False; continuous steps [1,2] share one database, while [5] gets another.
## If False, [1], [2], and [5] each get their own database.
profile_continuous_steps: False
# controller Nvidia Nsight Systems Options. Must be set when profile_steps is not None.
## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
controller_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# worker Nvidia Nsight Systems Options. Must be set when profile_steps is not None.
worker_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
capture-range: "cudaProfilerApi"
# Specify the desired behavior when a capture range ends.
# In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
# Valid values are "repeat-shutdown:n" or null.
# For normal whole-step profiling, n = len(profile_steps);
# for discrete profiling, n = len(profile_steps) * Number(subtasks).
# Or just leave it null and the program will use n = len(profile_steps) * 6.
capture-range-end: null
# Send a signal to the target application's process group. We let the program exit by itself.
kill: none
# Project name for experiment tracking (e.g., wandb)
project_name: verl_examples
@ -331,6 +265,79 @@ trainer:
# mode: "auto", "enable", or "disable"
use_legacy_worker_impl: auto
# profiler configs
global_profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# Profiling tool: choose between nsys, npu, torch
tool: null
# profile steps
steps: null
# Whether to combine continuous steps into one database.
## If True, worker.profiler.discrete must be False; continuous steps [1,2] share one database, while [5] gets another.
## If False, [1], [2], and [5] each get their own database.
profile_continuous_steps: False
# Path to save profiling contents
save_path: "outputs/profile"
# Specific tool configs; can be set via +profiler.tool_config.[tool].xxx
global_tool_config:
# nsys config
nsys:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.config.NsightToolConfig
# True means each task has its own database; False means all tasks in one training step share one database.
discrete: False
# controller Nvidia Nsight Systems Options. Must be set when profile_steps is not None.
## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
controller_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# worker Nvidia Nsight Systems Options. Must be set when profile_steps is not None.
worker_nsight_options:
# Select the API(s) to be traced.
trace: "cuda,nvtx,cublas,ucx"
# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
cuda-memory-usage: "true"
# CUDA graphs will be traced as a whole
cuda-graph-trace: "graph"
# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
capture-range: "cudaProfilerApi"
# Specify the desired behavior when a capture range ends.
# In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
# Valid values are "repeat-shutdown:n" or null.
# For normal whole-step profiling, n = len(profile_steps);
# for discrete profiling, n = len(profile_steps) * Number(subtasks).
# Or just leave it null and the program will use n = len(profile_steps) * 6.
capture-range-end: null
# Send a signal to the target application's process group. We let the program exit by itself.
kill: none
# configs related to ray initialization
ray_init:

View File

@ -23,11 +23,4 @@ megatron:
override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
profile:
use_profile: False
profile_ranks: null
step_start: -1
step_end: -1
save_path: null
load_weight: True

View File

@ -19,3 +19,28 @@ log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,fa
# the max token length per GPU
# same as actor_rollout_ref.actor.ppo_max_token_len_per_gpu if it exists, otherwise 16384
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
# profile the ref model in `compute_log_prob`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# profiler tool, default same as profiler.tool in global config
# choices: nsys, npu, torch
tool: ${oc.select:global_profiler.tool,null}
# whether enable profile on ref
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
# Whether to profile all ranks.
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
# The ranks that will be profiled. [] or [0,1,...]
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
# profile results saving path
save_path: ${oc.select:global_profiler.save_path,null}
# specific tool config
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
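The `${oc.select:...}` interpolations above fall back to a default when the referenced key is missing or null. A rough plain-dict stand-in for that lookup (illustrative only; OmegaConf handles this natively):

```python
def select(cfg, dotted_key, default=None):
    """Walk a nested dict by dotted path, mimicking ${oc.select:key,default}:
    return the default when any segment is missing or the value is None."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node if node is not None else default
```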

View File

@ -65,17 +65,27 @@ sandbox_fusion:
# Max memory limit for each sandbox process in MB
memory_limit_mb: 1024
# profiler configs
# profile the reward model in `compute_reward`
profiler:
# hint for the target config dataclass
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# True for each task has its own database, False for all tasks in one training step share one database.
discrete: False
# profiler tool, default same as profiler.tool in global config
# choices: nsys, npu, torch
tool: ${oc.select:global_profiler.tool,null}
# whether to enable profiling on the reward model
enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
# profile results saving path
save_path: ${oc.select:global_profiler.save_path,null}
# specific tool config
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@ -225,3 +225,28 @@ skip_rollout: False
# Specifies the filesystem path where rollout data should be cached when skip_rollout is enabled.
# Note: Giving path under /tmp/ray/session* is not recommended as these are temporary Ray cluster directories.
skip_dump_dir: /tmp/rollout_dump
# profile the rollout model in `generate_sequence`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
# profiler tool, default same as profiler.tool in global config
# choices: nsys, npu, torch
tool: ${oc.select:global_profiler.tool,null}
# whether to enable profiling on the rollout model
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
# Whether to profile all ranks.
all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
# The ranks that will be profiled. [] or [0,1,...]
ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
# profile results saving path
save_path: ${oc.select:global_profiler.save_path,null}
# specific tool config
tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@ -64,13 +64,16 @@ def run_ppo(config) -> None:
# Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
if (
is_cuda_available
and config.trainer.get("profile_steps") is not None
and len(config.trainer.get("profile_steps", [])) > 0
and config.global_profiler.tool == "nsys"
and config.global_profiler.get("steps") is not None
and len(config.global_profiler.get("steps", [])) > 0
):
from verl.utils.import_utils import is_nvtx_available
assert is_nvtx_available(), "nvtx is not available in CUDA platform. Please 'pip3 install nvtx'"
nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
nsight_options = OmegaConf.to_container(
config.global_profiler.global_tool_config.nsys.controller_nsight_options
)
runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
else:
runner = TaskRunner.remote()

View File

@ -795,7 +795,6 @@ class RayPPOTrainer:
cls=self.role_worker_mapping[Role.ActorRollout],
config=self.config.actor_rollout_ref,
role="actor_rollout",
profile_option=self.config.trainer.npu_profile.options,
)
self.resource_pool_to_cls[resource_pool]["actor_rollout"] = actor_rollout_cls
else:
@ -815,7 +814,6 @@ class RayPPOTrainer:
self.role_worker_mapping[Role.RefPolicy],
config=self.config.actor_rollout_ref,
role="ref",
profile_option=self.config.trainer.npu_profile.options,
)
self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@ -835,13 +833,13 @@ class RayPPOTrainer:
wg_kwargs = {} # Setting up kwargs for RayWorkerGroup
if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
if OmegaConf.select(self.config.global_profiler, "steps") is not None:
wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
"worker_nsight_options must be set when profile_steps is set"
)
wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
OmegaConf.select(self.config.trainer, "worker_nsight_options")
OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
)
wg_kwargs["device_name"] = self.device_name
@ -1083,8 +1081,8 @@ class RayPPOTrainer:
prev_step_profile = False
curr_step_profile = (
self.global_steps in self.config.trainer.profile_steps
if self.config.trainer.profile_steps is not None
self.global_steps in self.config.global_profiler.steps
if self.config.global_profiler.steps is not None
else False
)
next_step_profile = False
@ -1097,7 +1095,7 @@ class RayPPOTrainer:
with marked_timer("start_profile", timing_raw):
self._start_profiling(
not prev_step_profile and curr_step_profile
if self.config.trainer.profile_continuous_steps
if self.config.global_profiler.profile_continuous_steps
else curr_step_profile
)
@ -1341,13 +1339,13 @@ class RayPPOTrainer:
with marked_timer("stop_profile", timing_raw):
next_step_profile = (
self.global_steps + 1 in self.config.trainer.profile_steps
if self.config.trainer.profile_steps is not None
self.global_steps + 1 in self.config.global_profiler.steps
if self.config.global_profiler.steps is not None
else False
)
self._stop_profiling(
curr_step_profile and not next_step_profile
if self.config.trainer.profile_continuous_steps
if self.config.global_profiler.profile_continuous_steps
else curr_step_profile
)
prev_step_profile = curr_step_profile

View File

@ -12,14 +12,74 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import warnings
from dataclasses import dataclass, field
from typing import Any, Optional
from omegaconf import MISSING
from verl.base_config import BaseConfig
@dataclass
class NsightToolConfig(BaseConfig):
"""Nsight tool config."""
"True for each task has its own database, False for all tasks in one training step share one database."
discrete: bool = False
def __post_init__(self) -> None:
pass
@dataclass
class TorchProfilerToolConfig(BaseConfig):
"""Torch profiler tool config.
Args:
step_start (int): Start step in update_policy.
step_end (int): End step.
"""
step_start: int = -1
step_end: int = -1
def __post_init__(self) -> None:
"""config validation logics go here"""
warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
assert isinstance(self.step_start, int), f"Profiler step_start must be of type int, got {type(self.step_start)}"
@dataclass
class NPUToolConfig(NsightToolConfig):
"""NPU profiler too; config."""
# options: npu, cpu, memory, shapes, module, stack
contents: list[str] = field(default_factory=list)
# Collection level, optional values: level_none, level0, level1, level2.
level: str = "level1"
# Whether to automatically parse the data.
analysis: bool = False
def __post_init__(self) -> None:
"""config validation logics go here"""
assert isinstance(self.contents, list), f"Profiler contents must be of type list, got {type(self.contents)}"
assert isinstance(self.level, str), f"Profiler level must be of type str, got {type(self.level)}"
assert isinstance(self.analysis, bool), f"Profiler analysis must be of type bool, got {type(self.analysis)}"
for content in self.contents:
assert content in ["npu", "cpu", "memory", "shapes", "module", "stack"], (
f"Profiler contents only supports npu, cpu, memory, shapes, module, stack, but gets {content}"
)
assert self.level in ["level_none", "level0", "level1", "level2"], (
f"Profiler level only supports level0, 1, 2, and level_none, but gets {self.level}"
)
@dataclass
class ProfilerConfig(BaseConfig):
"""Worker profiler config. Currently only support Nsight system profiler.
"""Worker profiler config.
The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.
@ -30,22 +90,33 @@ class ProfilerConfig(BaseConfig):
ranks (list[int]): The ranks that will be profiled. Defaults to [].
"""
discrete: bool = False
tool: Optional[str] = MISSING
enable: bool = False
all_ranks: bool = False
ranks: list[int] = field(default_factory=list)
save_path: Optional[str] = MISSING
tool_config: Any = MISSING # Just a placeholder, will use configs above directly
def union(self, other: "ProfilerConfig") -> "ProfilerConfig":
assert self.tool == other.tool, f"Cannot union ProfilerConfig with different tools: {self.tool} vs {other.tool}"
return ProfilerConfig(
tool=self.tool,
enable=self.enable or other.enable,
all_ranks=self.all_ranks or other.all_ranks,
ranks=list(set(self.ranks or []) | set(other.ranks or [])),
discrete=self.discrete or other.discrete,
tool_config=self.tool_config,
)
def intersect(self, other: "ProfilerConfig") -> "ProfilerConfig":
assert self.tool == other.tool, (
f"Cannot intersect ProfilerConfig with different tools: {self.tool} vs {other.tool}"
)
return ProfilerConfig(
tool=self.tool,
enable=self.enable and other.enable,
all_ranks=self.all_ranks and other.all_ranks,
ranks=list(set(self.ranks or []) & set(other.ranks or [])),
discrete=self.discrete and other.discrete,
tool_config=self.tool_config,
)
def __post_init__(self) -> None:
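The union/intersect semantics above (booleans combine with or/and, rank lists as set union/intersection) can be illustrated with a stripped-down stand-in; `MiniProfilerConfig` is hypothetical, not verl's class:

```python
from dataclasses import dataclass, field


@dataclass
class MiniProfilerConfig:
    """Stripped-down stand-in for verl's ProfilerConfig union/intersect."""
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

    def union(self, other):
        # a rank is profiled if either config profiles it
        return MiniProfilerConfig(
            enable=self.enable or other.enable,
            all_ranks=self.all_ranks or other.all_ranks,
            ranks=sorted(set(self.ranks) | set(other.ranks)),
        )

    def intersect(self, other):
        # a rank is profiled only if both configs profile it
        return MiniProfilerConfig(
            enable=self.enable and other.enable,
            all_ranks=self.all_ranks and other.all_ranks,
            ranks=sorted(set(self.ranks) & set(other.ranks)),
        )
```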

View File

@ -20,9 +20,9 @@ from contextlib import contextmanager
from typing import Any, Callable, Optional
import torch_npu
from omegaconf import DictConfig
from torch_npu.npu import mstx
from .config import NPUToolConfig
from .profile import DistProfiler, ProfilerConfig
@ -86,7 +86,14 @@ def marked_timer(name: str, timing_raw: dict[str, float], *args: Any, **kwargs:
mark_end_range(mark_range)
def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_step: Optional[str] = None):
def get_npu_profiler(
contents: list[str],
profile_level: str,
profile_save_path: str,
analysis: bool,
role: Optional[str] = None,
profile_step: Optional[str] = None,
):
"""Generate and return an NPU profiler object.
Args:
@ -97,18 +104,7 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
profile_step(str, optional):
The current training step. Defaults to None.
"""
if option.level == "level_none":
profile_level = torch_npu.profiler.ProfilerLevel.Level_none
elif option.level == "level0":
profile_level = torch_npu.profiler.ProfilerLevel.Level0
elif option.level == "level1":
profile_level = torch_npu.profiler.ProfilerLevel.Level1
elif option.level == "level2":
profile_level = torch_npu.profiler.ProfilerLevel.Level2
else:
raise ValueError(f"level only supports level0, 1, 2, and level_none, but gets {option.level}")
profile_save_path = option.save_path
if profile_step:
profile_save_path = os.path.join(profile_save_path, profile_step)
if role:
@ -123,18 +119,18 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
)
activites = []
if option.with_npu:
if contents is None or "npu" in contents:
activites.append(torch_npu.profiler.ProfilerActivity.NPU)
if option.with_cpu:
if contents is None or "cpu" in contents:
activites.append(torch_npu.profiler.ProfilerActivity.CPU)
prof = torch_npu.profiler.profile(
with_modules=option.with_module,
with_stack=option.with_stack,
record_shapes=option.record_shapes,
profile_memory=option.with_memory,
with_modules=contents is None or "module" in contents,
with_stack=contents is None or "stack" in contents,
record_shapes=contents is None or "shapes" in contents,
profile_memory=contents is None or "memory" in contents,
activities=activites,
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=option.analysis),
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=analysis),
experimental_config=experimental_config,
)
return prof
@ -147,7 +143,7 @@ class NPUProfiler(DistProfiler):
_define_count = 0
def __init__(self, rank: int, config: ProfilerConfig, **kwargs):
def __init__(self, rank: int, config: ProfilerConfig, tool_config: NPUToolConfig, **kwargs):
"""Initialize the NsightSystemsProfiler.
Args:
@ -155,12 +151,20 @@ class NPUProfiler(DistProfiler):
config (Optional[ProfilerConfig]): Configuration for the profiler. If None, a default configuration is used.
"""
if not config:
config = ProfilerConfig(ranks=[])
config = ProfilerConfig(ranks=[], enable=False)
if not tool_config:
assert not config.enable, "tool_config must be set when profiler is enabled"
self.enable: bool = config.enable
if not config.enable:
return
self.this_step: bool = False
self.discrete: bool = config.discrete
self.discrete: bool = tool_config.discrete
self.this_rank: bool = False
self.profile_npu = None
self.profile_option = kwargs.get("option", None)
self.profile_contents = tool_config.contents
self.profile_level = tool_config.level
self.profile_save_path = config.save_path
self.analysis = tool_config.analysis
if config.all_ranks:
self.this_rank = True
elif config.ranks:
@ -169,15 +173,22 @@ class NPUProfiler(DistProfiler):
def start(self, **kwargs):
role, profile_step = kwargs.get("role", None), kwargs.get("profile_step", None)
profile_step = str(profile_step) if profile_step is not None else None
if self.this_rank and self.profile_option is not None:
if self.this_rank and self.enable:
self.this_step = True
if not self.discrete and NPUProfiler._define_count == 0:
self.profile_npu = get_npu_profiler(option=self.profile_option, role=role, profile_step=profile_step)
self.profile_npu = get_npu_profiler(
contents=self.profile_contents,
profile_level=self.profile_level,
profile_save_path=self.profile_save_path,
analysis=self.analysis,
role=role,
profile_step=profile_step,
)
self.profile_npu.start()
NPUProfiler._define_count += 1
def stop(self):
if self.this_rank and self.profile_option is not None:
if self.this_rank and self.enable:
self.this_step = False
if not self.discrete and NPUProfiler._define_count == 1:
self.profile_npu.step()
@ -201,26 +212,23 @@ class NPUProfiler(DistProfiler):
def decorator(func):
@functools.wraps(func)
def wrapper(self, *args, **kwargs):
if not self.profiler.enable:
return func(self, *args, **kwargs)
profile_name = message or func.__name__
profile_this_role = True
discrete_mode = self.profiler.discrete
profile_enable = self.profiler.this_step and self.profile_option is not None
profile_enable = self.profiler.this_step and self.profiler.enable
if not profile_enable:
return func(self, *args, **kwargs)
if profile_enable and role is not None:
target_roles = self.profile_option.get("roles", [])
profile_this_role = "all" in target_roles or role in target_roles
if profile_enable:
if not discrete_mode:
mark_range = mark_start_range(message=profile_name)
else:
if profile_this_role:
profile_npu = get_npu_profiler(option=self.profile_option, role=role)
profile_npu.start()
mark_range = mark_start_range(message=profile_name)
profile_npu = get_npu_profiler(option=self.profile_option, role=role)
profile_npu.start()
mark_range = mark_start_range(message=profile_name)
result = func(self, *args, **kwargs)
@ -228,10 +236,9 @@ class NPUProfiler(DistProfiler):
if not discrete_mode:
mark_end_range(mark_range)
else:
if profile_this_role:
mark_end_range(mark_range)
profile_npu.step()
profile_npu.stop()
mark_end_range(mark_range)
profile_npu.step()
profile_npu.stop()
return result
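The `contents` handling above follows one rule: a missing list enables everything, otherwise each switch turns on only when its keyword appears. A sketch of that mapping (hypothetical helper; the real code passes these flags to torch_npu directly):

```python
def npu_profiler_flags(contents):
    """Map NPUToolConfig.contents to the individual profiler switches:
    None means profile everything; otherwise a switch is on only when
    its keyword is listed."""
    keys = ["npu", "cpu", "module", "stack", "shapes", "memory"]
    return {k: contents is None or k in contents for k in keys}
```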

View File

@ -20,6 +20,7 @@ from typing import Callable, Optional
import nvtx
import torch
from .config import NsightToolConfig
from .profile import DistProfiler, ProfilerConfig
@ -113,7 +114,7 @@ def marked_timer(
class NsightSystemsProfiler(DistProfiler):
"""Nsight system profiler. Installed in a worker to control the Nsight system profiler."""
def __init__(self, rank: int, config: Optional[ProfilerConfig], **kwargs):
def __init__(self, rank: int, config: Optional[ProfilerConfig], tool_config: Optional[NsightToolConfig], **kwargs):
"""Initialize the NsightSystemsProfiler.
Args:
@ -123,8 +124,13 @@ class NsightSystemsProfiler(DistProfiler):
# If no configuration is provided, create a default ProfilerConfig with an empty list of ranks
if not config:
config = ProfilerConfig(ranks=[])
if not tool_config:
assert not config.enable, "tool_config must be provided when profiler is enabled"
self.enable = config.enable
if not config.enable:
return
self.this_step: bool = False
self.discrete: bool = config.discrete
self.discrete: bool = tool_config.discrete
self.this_rank: bool = False
if config.all_ranks:
self.this_rank = True
@ -170,6 +176,9 @@ class NsightSystemsProfiler(DistProfiler):
def decorator(func):
@functools.wraps(func)
def wrapper(self, *args, **kwargs):
if not self.profiler.enable:
return func(self, *args, **kwargs)
profile_name = message or func.__name__
if self.profiler.this_step:

View File

@ -17,9 +17,8 @@ from typing import Callable, Optional
import torch
import torch.distributed
from omegaconf import DictConfig, OmegaConf
from .config import ProfilerConfig
from .config import ProfilerConfig, TorchProfilerToolConfig
class Profiler:
@ -39,18 +38,23 @@ class Profiler:
config: Configuration object containing profiling parameters
"""
def __init__(self, config):
def __init__(self, config: ProfilerConfig, tool_config: Optional[TorchProfilerToolConfig] = None):
# note: if use_profile is not set, it will be None, so all profiling functions will be skipped
if not isinstance(config, DictConfig):
config = OmegaConf.create(config)
if not config:
config = ProfilerConfig(ranks=[], enable=False)
if not tool_config:
assert not config.enable, "tool_config must be provided when profiler is enabled"
self.enable = config.enable
if not config.enable:
return
self.config = config
self.skip_prof = False
self.tool_config = tool_config
self.saved = False
self.prof = None
self.rank = torch.distributed.get_rank()
# we need to validate the config before using the profiler
self._validate()
if config.use_profile and self.rank in self.config.profile_ranks:
if self.rank in self.config.profile_ranks:
print(f"[Profiler] Profiler init for rank {self.rank}")
self.prof = torch.profiler.profile(
@ -59,9 +63,9 @@ class Profiler:
torch.profiler.ProfilerActivity.CUDA,
],
schedule=torch.profiler.schedule(
wait=max(self.config.step_start - 1, 0),
warmup=1 if self.config.step_start > 0 else 0,
active=self.config.step_end - self.config.step_start,
wait=max(self.tool_config.step_start - 1, 0),
warmup=1 if self.tool_config.step_start > 0 else 0,
active=self.tool_config.step_end - self.tool_config.step_start,
repeat=1,
),
record_shapes=True,
@ -73,9 +77,9 @@ class Profiler:
if self.config.profile_ranks is None:
print("[WARNING] Profile ranks is not set, default to rank 0")
self.config.profile_ranks = [0]
assert self.config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
assert self.config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
assert self.config.step_start < self.config.step_end, (
assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be non-negative"
assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be non-negative"
assert self.tool_config.step_start < self.tool_config.step_end, (
"[ERROR] Profile step start must be less than step end"
)
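The schedule arguments above derive mechanically from `step_start`/`step_end`. A small sketch of that computation (hypothetical helper; the real code inlines it into `torch.profiler.schedule`):

```python
def torch_schedule_args(step_start, step_end):
    """Compute the torch.profiler.schedule arguments as the refactored
    Profiler does: wait until just before step_start, warm up for one
    step when profiling doesn't begin at step 0, then stay active for
    step_end - step_start steps."""
    assert 0 <= step_start < step_end, "step_start must be in [0, step_end)"
    return {
        "wait": max(step_start - 1, 0),
        "warmup": 1 if step_start > 0 else 0,
        "active": step_end - step_start,
        "repeat": 1,
    }
```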

View File

@ -122,7 +122,7 @@ class MegatronPPOActor(BasePPOActor):
self.tf_config = tf_config
self.actor_module = actor_module
self.actor_optimizer: DistributedOptimizer = actor_optimizer
self.prof = Profiler(self.config.profile)
self.prof = Profiler(self.config.profiler)
self.use_fused_kernels = self.config.get("use_fused_kernels", False)
if self.use_fused_kernels:
from verl.models.mcore.model_forward_fused import patch_fused_forward
@ -600,7 +600,8 @@ class MegatronPPOActor(BasePPOActor):
"""
metrics = {}
self.prof.start()
if self.prof.enable:
self.prof.start()
for data in dataloader:
data.to(get_device_id())
self.actor_optimizer.zero_grad()
@ -639,9 +640,11 @@ class MegatronPPOActor(BasePPOActor):
pass
else:
raise NotImplementedError
self.prof.step()
if self.prof.enable:
self.prof.step()
# add empty cache after each compute
self.prof.stop_and_save()
self.prof.stop_trace()
if self.prof.enable:
self.prof.stop_and_save()
self.prof.stop_trace()
get_torch_device().empty_cache()
return metrics

View File

@ -19,6 +19,7 @@ from omegaconf import MISSING
from verl.base_config import BaseConfig
from verl.trainer.config import CheckpointConfig
from verl.utils.profiler.config import ProfilerConfig
from .engine import FSDPEngineConfig, McoreEngineConfig
from .optimizer import OptimizerConfig
@ -109,6 +110,7 @@ class ActorConfig(BaseConfig):
checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
optim: OptimizerConfig = field(default_factory=OptimizerConfig)
use_fused_kernels: bool = False
profiler: ProfilerConfig = field(default_factory=ProfilerConfig)
def __post_init__(self):
"""Validate actor configuration parameters."""
@ -218,6 +220,7 @@ class FSDPActorConfig(ActorConfig):
entropy_checkpointing: bool = False
fsdp_config: FSDPEngineConfig = field(default_factory=FSDPEngineConfig)
use_remove_padding: bool = False
profiler: ProfilerConfig = field(default_factory=ProfilerConfig)
def __post_init__(self):
"""Validate FSDP actor configuration parameters."""

View File

@ -72,7 +72,7 @@ from verl.utils.fsdp_utils import (
)
from verl.utils.import_utils import import_external_libs
from verl.utils.model import compute_position_id_with_mask
from verl.utils.profiler import DistProfiler, DistProfilerExtension, log_gpu_memory_usage, simple_timer
from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig, log_gpu_memory_usage, simple_timer
from verl.utils.profiler.performance import reduce_timing
from verl.utils.py_functional import convert_to_regular_types
from verl.workers.config import FSDPCriticConfig, FSDPEngineConfig
@ -116,7 +116,6 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
Worker.__init__(self)
self.config = config
self.profile_option = kwargs.get("profile_option", None)
import torch.distributed
if not torch.distributed.is_initialized():
@ -170,9 +169,30 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
# We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
# as they provides DictConfig-like interface
# The benefit of creating the dataclass config is to perform validation during __post_init__
profiler_config = omega_conf_to_dataclass(config.get("profiler"))
if self._is_actor:
omega_profiler_config = config.actor.get("profiler", {})
elif self._is_rollout:
# NOTE: In colocation mode, the rollout config may not take effect (it follows the actor config)
# This is for extensibility in AsyncRL cases
omega_profiler_config = config.rollout.get("profiler", {})
elif self._is_ref:
omega_profiler_config = config.ref.get("profiler", {})
else:
raise ValueError(
f"Invalid role {self.role}, should be one of "
"['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
)
# omega_profiler_config is DictConfig
# profiler_config is a ProfilerConfig dataclass
profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
tool_config = omega_conf_to_dataclass(
omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
)
else:
tool_config = None
DistProfilerExtension.__init__(
self, DistProfiler(rank=self.rank, config=profiler_config, option=self.profile_option)
self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
)
self._is_offload_param = False
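The tool-config resolution above is repeated verbatim in every worker `__init__`. A minimal, dependency-free sketch of that selection logic (plain dicts stand in for OmegaConf's `DictConfig`; `resolve_tool_config` and `SUPPORTED_TOOLS` are illustrative names, not part of the verl API):

```python
# Hypothetical sketch of the tool_config selection each worker performs.
# Plain dicts stand in for DictConfig; this is not verl code.
SUPPORTED_TOOLS = ("npu", "nsys", "torch")


def resolve_tool_config(profiler_cfg: dict):
    """Return the config section for the selected tool, or None when no
    supported tool is configured."""
    tool = profiler_cfg.get("tool", None)
    if tool in SUPPORTED_TOOLS:
        # e.g. tool == "nsys" selects profiler_cfg["tool_config"]["nsys"]
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None


cfg = {"tool": "nsys", "tool_config": {"nsys": {"discrete": False}}}
print(resolve_tool_config(cfg))            # {'discrete': False}
print(resolve_tool_config({"tool": None})) # None
```

In the actual code the selected section is additionally passed through `omega_conf_to_dataclass` to instantiate the tool-specific dataclass (`NsightToolConfig`, `NPUToolConfig`, `TorchProfilerToolConfig`).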
@@ -938,7 +958,17 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config: FSDPCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         import torch.distributed

         self.config = config
@@ -1336,8 +1366,18 @@ class RewardModelWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         import torch.distributed


@@ -55,6 +55,7 @@ from verl.utils.profiler import (
     DistProfiler,
     DistProfilerExtension,
     GPUMemoryLogger,
+    ProfilerConfig,
     log_gpu_memory_usage,
     simple_timer,
 )
@@ -213,8 +214,31 @@ class ActorRolloutRefWorker(MegatronWorker, DistProfilerExtension):
         self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"]
         self._is_ref = self.role in ["ref", "actor_rollout_ref"]

-        profiler_config = omega_conf_to_dataclass(config.get("profiler"))
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
+        if self._is_actor:
+            omega_profiler_config = config.actor.get("profiler", {})
+        elif self._is_rollout:
+            # NOTE: In colocation mode, the rollout config may not take effect (it follows the actor config)
+            # This is for extensibility in AsyncRL cases
+            omega_profiler_config = config.rollout.get("profiler", {})
+        elif self._is_ref:
+            omega_profiler_config = config.ref.get("profiler", {})
+        else:
+            raise ValueError(
+                f"Invalid role {self.role}, should be one of "
+                "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
+            )
+        # omega_profiler_config is a DictConfig
+        # profiler_config is a ProfilerConfig dataclass
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
# TODO(sgm): Currently, we only support reference model param offload
# will support other offload later
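The actor/rollout/ref branching above amounts to a role-to-section lookup with a fixed precedence: the actor branch is checked first, so combined roles such as `actor_rollout_ref` resolve to the actor's profiler section. A hedged, self-contained sketch of that precedence (plain dicts, hypothetical helper name, not verl code):

```python
# Illustrative sketch of the role -> profiler-section precedence in
# ActorRolloutRefWorker.__init__; combined roles inherit the actor's
# profiler config because the actor branch is evaluated first.
def profiler_section_for(role: str, config: dict) -> dict:
    if role in ("actor", "actor_rollout", "actor_rollout_ref"):
        return config.get("actor", {}).get("profiler", {})
    if role == "rollout":
        return config.get("rollout", {}).get("profiler", {})
    if role == "ref":
        return config.get("ref", {}).get("profiler", {})
    raise ValueError(
        f"Invalid role {role}, should be one of "
        "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
    )


cfg = {
    "actor": {"profiler": {"tool": "nsys"}},
    "rollout": {"profiler": {"tool": "torch"}},
    "ref": {"profiler": {}},
}
print(profiler_section_for("actor_rollout_ref", cfg))  # {'tool': 'nsys'}
print(profiler_section_for("rollout", cfg))            # {'tool': 'torch'}
```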
@@ -804,7 +828,18 @@ class AsyncActorRolloutRefWorker(ActorRolloutRefWorker):
 class CriticWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config: McoreCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self.config: McoreCriticConfig = config
         # NOTE(sgm): We utilize colocate WorkerGroup by default.
@@ -1072,8 +1107,19 @@ class RewardModelWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
-        profiler_config = omega_conf_to_dataclass(config.get("profiler", {}), dataclass_type=ProfilerConfig)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         self.config = config


@@ -30,7 +30,7 @@ from verl.utils.device import (
     get_device_id,
     get_nccl_backend,
 )
-from verl.utils.profiler import DistProfiler, DistProfilerExtension
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig
 from verl.utils.py_functional import append_to_dict
 from verl.utils.torch_functional import masked_mean
 from verl.workers.engine import EngineRegistry
@@ -42,8 +42,16 @@ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )
         import torch.distributed