[BREAKING] [perf] refactor: Profiler api refactor (#2894)

### What does this PR do?

Refactor the profiler configuration and its CI into a unified scheme.

TODO:

- nsys: use `save_path`
- nsys: discrete tests are disabled
- torch profiler

cc: @davidmlw 

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```
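The `_target_` keys let a config tree like the one above be instantiated into typed dataclass objects. The following is a minimal, non-recursive sketch of how such resolution could work; the `ProfilerConfig` dataclass and `instantiate` helper here are illustrative stand-ins, not verl's actual implementation:

```python
from dataclasses import dataclass, field
from importlib import import_module
from typing import Optional

@dataclass
class ProfilerConfig:
    # Illustrative stand-in mirroring the global_profiler YAML fields above.
    tool: Optional[str] = None
    steps: Optional[list] = None
    profile_continuous_steps: bool = False
    save_path: str = "outputs/profile"
    tool_config: dict = field(default_factory=dict)

def instantiate(cfg: dict):
    """Pop `_target_`, import the dotted class path, and construct
    the dataclass from the remaining keys (no nested resolution)."""
    cfg = dict(cfg)
    module_path, _, cls_name = cfg.pop("_target_").rpartition(".")
    cls = getattr(import_module(module_path), cls_name)
    return cls(**cfg)

# Build a config dict whose _target_ points at the local stand-in class.
cfg = {"_target_": f"{ProfilerConfig.__module__}.ProfilerConfig",
       "tool": "nsys", "steps": [1, 2]}
pc = instantiate(cfg)
```

In the real config the nested `tool_config` entries carry their own `_target_` and would be resolved recursively.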

Local profiler config:

```yaml
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # Profiling tool; defaults to profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # Whether to enable profiling for this role
  enable: False

  # Whether to profile all ranks
  all_ranks: False

  # The ranks that will be profiled: [] or [0,1,...]
  ranks: []

  # Path where profiling results are saved
  save_path: ${oc.select:global_profiler.save_path,null}

  # Tool-specific config; defaults to the global tool_config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```
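The `${oc.select:...}` interpolations above fall back to a default when the referenced global key is absent, which is what lets a role-local config inherit from `global_profiler`. A small pure-Python analogue of that lookup (illustrative only — the real behavior comes from OmegaConf's `oc.select` resolver):

```python
def select(cfg: dict, path: str, default=None):
    """Walk a dotted path through nested dicts, returning `default`
    when any segment is missing — the fallback behavior that
    `${oc.select:path,default}` provides in OmegaConf."""
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# A partial global config: `steps` is deliberately absent.
config = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
```

With this, `select(config, "global_profiler.tool")` yields `"nsys"`, while a missing key like `global_profiler.steps` resolves to the default instead of raising.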

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
---

Commit 545f899844 (parent 287ef7e262), authored by Blue Space on 2025-08-11 09:52:41 +08:00, committed by GitHub.
41 changed files with 1005 additions and 694 deletions.

---

**.gitignore** (vendored)

```diff
@@ -59,6 +59,7 @@ coverage.xml
 *,cover
 .hypothesis/
 pytest.ini
+output.txt

 # Translations
 *.mo
```

---

**NPU profiling guide (Chinese doc, translated), updated version:**

@@ -8,107 +8,87 @@ Last updated: 07/24/2025.

Configuration
-------------

Use two levels of profile settings to control data collection:

- Global collection control: use the config items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps.
- Role profile control: use the config items in each role to control per-role parameters.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

Control the collection steps and mode via the parameters in ppo_trainer.yaml:

- profiler: controls the collected ranks and the collection mode

  - tool: the profiling tool to use; the options are nsys, npu, torch, and torch_memory.
  - steps: a list of the steps to collect, e.g. [2, 4] collects step 2 and step 4. If set to null, no collection occurs.
  - save_path: the path where collected data is saved. Defaults to "outputs/profile".

Control the specific collection behavior via the parameters in ``profiler.tool_config.npu``:

- level: the collection level; the options are level_none, level0, level1, and level2.

  - level_none: disables all level-based data collection (turns profiler_level off).
  - level0: collects high-level application data, low-level NPU data, and operator execution details on the NPU.
  - level1: extends level0 with CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  - level2: extends level1 with CANN-layer Runtime data and AI CPU metrics.

- contents: a list of options controlling the collected content, e.g. npu, cpu, memory, shapes, module, stack.

  - npu: whether to collect device-side performance data.
  - cpu: whether to collect host-side performance data.
  - memory: whether to enable memory analysis.
  - shapes: whether to record tensor shapes.
  - module: whether to record framework-layer Python call-stack information.
  - stack: whether to record operator call-stack information.

- analysis: enables automatic data parsing.

Role profile control
~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: whether to enable profiling for this role.
- all_ranks: whether to collect data from all ranks.
- ranks: the list of ranks to collect data from. If empty, no data is collected.
- tool_config: the configuration of the profiling tool used by this role.

Examples
--------

Disabling collection
~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: null # disable profile

End-to-end collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: [1, 2, 5]
     discrete: False
   actor_rollout_ref:
     actor:
       profile:
         enable: True
         all_ranks: True
   # rollout & ref follow actor settings

Discrete-mode collection
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     discrete: True

Visualization

---

**NPU profiling guide (English doc), updated version:**

@@ -9,10 +9,10 @@ based on FSDP on Ascend devices.

Configuration
-------------

Leverage two levels of configuration to control data collection:

1. **Global profiler control**: Use parameters in ``ppo_trainer.yaml`` to control the collection mode and steps.
2. **Role profile control**: Use parameters in each role's ``profile`` field to control the collection mode for each role.

Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -20,31 +20,17 @@ Global collection control

Use parameters in ppo_trainer.yaml to control the collection mode and steps.

- profiler: Control the ranks and mode of profiling

  - tool: The profiling tool to use; the options are nsys, npu, torch, and torch_memory.
  - steps: This parameter can be set as a list of collection steps, such as [2, 4], which means it will collect steps 2 and 4. If set to null, no collection occurs.
  - save_path: The path to save the collected data. Default is "outputs/profile".

Use parameters in ``profiler.tool_config.npu`` to control npu profiler behavior:

- level: Collection level; the options are level_none, level0, level1, and level2.

@@ -58,15 +44,31 @@ Use parameters in npu_profile.yaml to control collection behavior:

  - level2: Extends level1 by adding CANN-layer Runtime data and AI CPU metrics.

- contents: A list of options to control the collection content, such as npu, cpu, memory, shapes, module, stack.

  - npu: Whether to collect device-side performance data.
  - cpu: Whether to collect host-side performance data.
  - memory: Whether to enable memory analysis.
  - shapes: Whether to record tensor shapes.
  - module: Whether to record framework-layer Python call-stack information.
  - stack: Whether to record operator call-stack information.

- analysis: Enables automatic data parsing.

Role collection control
~~~~~~~~~~~~~~~~~~~~~~~

In each role's ``profile`` field, you can control the collection mode for that role.

- enable: Whether to enable profiling for this role.
- all_ranks: Whether to collect data from all ranks.
- ranks: A list of ranks to collect data from. If empty, no data is collected.
- tool_config: Configuration for the profiling tool used by this role.

Examples
--------

@@ -75,20 +77,22 @@ Disabling collection

.. code:: yaml

   profiler:
     steps: null # disable profile

End-to-End collection
~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   profiler:
     steps: [1, 2, 5]
     discrete: False
   actor_rollout_ref:
     actor:
       profiler:
         enable: True
         all_ranks: True

Discrete Mode Collection
~~~~~~~~~~~~~~~~~~~~~~~~

@@ -96,30 +100,8 @@ Discrete Mode Collection

.. code:: yaml

   profiler:
     discrete: True

Visualization

---

**Nsight Systems profiling guide (markdown doc), updated version:**

@@ -16,31 +16,29 @@ Nsight Systems version is important, please reference `docker/Dockerfile.vllm.sg

verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed on any node in the cluster, a message is printed in the logging to indicate the controller process node hostname and process id.

In `profiler`, three new config entries control the profiler behaviors:

* **`profiler.steps`**. List of step numbers at which profiling should be performed. For example: [1, 2, 5] will profile steps 1, 2, and 5, and `null` means no profiling.
* **`profiler.profile_continuous_steps`**. If true, and `profiler.discrete==False`, then the continuous steps in `profiler.steps` will be combined into one database. For example, steps 1 and 2 above go into one database, and step 5 into another. If false, every step occupies at least one database. The reason for this config is to observe the program behaviors between steps.

Nsys options for controller nodes and worker nodes are configured in `trainer`:

* **`trainer.controller_nsight_options`**. This config group is for the single controller. All fields in this config group are sent to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can reference the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
* **`trainer.worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group are sent to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can change `capture-range-end` with some accurate calculation or just leave it `null`.

### Worker process profiling

verl manages multiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_, _Reward_, which are implemented in different Worker classes. These workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:

* **`all_ranks` and `ranks`**. When `all_ranks` is set `True`, all ranks will be profiled; when set `False`, the ranks in `ranks` will be profiled. By default, verl profiles the whole training process in a series of `worker_process_<PID>.<RID>.nsys-rep` files, one per process rank. PID is the process ID; RID is the capture range ID.
* **`discrete`**. When set `False`, all the roles' actions in one training step are dumped into one database. When set `True`, the actions annotated by `DistProfiler.annotate` are dumped into a discrete database; in this case, each role's action occupies one `<RID>`.
* **`actor_rollout_ref`**. This Worker can be configured to contain at most 3 roles that execute together, so `actor_rollout_ref` has a `profiler` config that all the inside roles inherit.
* **verl collocate mode**. verl can combine two Worker subclasses into one Worker Actor. In this case, the user should make sure the combined Workers have a consistent `discrete` setting. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database either way.

### Where to find the profiling data

By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` on each node. According to the Ray manual, this default directory is not changeable; ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).

Some users may find this inconvenient, but it is understandable: Ray may start hundreds of processes, and saving the files in one central place would put heavy pressure on the network file system.

@@ -49,51 +47,40 @@

To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:

### Disable profiler

```yaml
profiler:
  steps: null # disable profile
```

### Enable profiler and one database for one training step

```yaml
profiler:
  steps: [1, 2, 5]
  discrete: False
actor_rollout_ref:
  actor:
    profile:
      enable: True
      all_ranks: True
  # rollout & ref follow actor settings
critic:
  profile:
    enable: True
    all_ranks: True
reward_model:
  profile:
    enable: True
    all_ranks: True
```

### Enable profiler and multiple databases for one training step

```yaml
profiler:
  steps: [1, 2, 5]
  discrete: True
```

## Profiling Output

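The step and rank gating described in the worker-profiling docs (`profiler.steps` plus each role's `all_ranks`/`ranks`) can be sketched as follows; `should_profile` is an illustrative helper, not verl's actual API:

```python
def should_profile(step, rank, steps=None, all_ranks=False, ranks=()):
    """Return True when this training step is in the configured step list
    and this rank is selected: `all_ranks` profiles everyone, otherwise
    only ranks listed in `ranks` are profiled."""
    if not steps or step not in steps:
        return False  # steps=null (or an empty list) disables profiling
    return all_ranks or rank in ranks
```

For example, with `steps=[1, 2, 5]` and `ranks=[0, 1]`, only steps 1, 2, and 5 on ranks 0 and 1 would trigger the profiler.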
---

**Megatron performance-tuning doc: the legacy torch-profiler section is removed:**

```diff
@@ -275,27 +275,6 @@ For the critic, you can include these parameters.
     critic.megatron.grad_offload=True \
     critic.megatron.optimizer_offload=True \
 
-Profiler
-^^^^^^^^
-
-The profiler is a tool that helps you understand the performance of your
-model. It can be used to profile the time spent on different operations
-and identify the bottlenecks. You can get more information from
-`torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.
-
-In verl, the profiler is currently only supported for the actor role in
-Megatron. You can set the begin step and end step to profile. Note that
-one step means one gradient update, and the profile result will be saved
-in the save_path. If you just want to profile on specific ranks, you can
-set profile_ranks; by default, it is [0].
-
-.. code:: python
-
-   actor_rollout_ref.actor.profile.use_profile=True \
-   actor_rollout_ref.actor.profile.profile_ranks=[0] \
-   actor_rollout_ref.actor.profile.step_start=0 \
-   actor_rollout_ref.actor.profile.step_end=1 \
-   actor_rollout_ref.actor.profile.save_path="./profile"
 
 Related MCore Document
 ----------------------
```

---

**GRPO NPU profiling example script, updated version:**

```shell
@@ -9,14 +9,8 @@
PROFILE_RANKS="[1,2]"

# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
@@ -28,20 +22,20 @@
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
@@ -51,16 +45,6 @@
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -70,5 +54,12 @@
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
```

---

**Second NPU profiling example script, updated version:**

```shell
@@ -8,12 +8,7 @@
DISCRETE=False

# profiling NPU options
SAVE_PATH="$HOME/profile_data"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True

python3 -m verl.trainer.main_ppo \
@@ -28,15 +23,16 @@
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.profiler.enable=True \
    actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
    actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
@@ -48,15 +44,6 @@
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=console \
    trainer.project_name='verl_grpo_example_gsm8k' \
@@ -66,5 +53,12 @@
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=5 \
    trainer.device=npu \
    profiler.tool=npu \
    profiler.steps=$PROFILE_STEPS \
    profiler.save_path=$SAVE_PATH \
    profiler.tool_config.npu.discrete=$DISCRETE \
    profiler.tool_config.npu.contents=$CONTENTS \
    profiler.tool_config.npu.level=$LEVEL \
    profiler.tool_config.npu.analysis=$ANALYSIS \
    $@
```

View File

```diff
@@ -13,9 +13,9 @@ train_files=${train_files:-"$gsm8k_train_path"}
 test_files=${test_files:-"$gsm8k_test_path"}
 # Nsight profiling configuration
-PROFILE_STEPS="[1,2,5]" # or [] or null
+PROFILE_STEPS="[1]" # or [] or null
 PROFILE_RANKS_ALL=False # or True
-PROFILE_RANKS=[0,4,8,12]
+PROFILE_RANKS=[0,4]
 DISCRETE=True # or True
 python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer'\
@@ -34,30 +34,32 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
 actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
 actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2 \
 actor_rollout_ref.actor.use_kl_loss=False \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
+actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
 actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
 actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
-actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
-actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
-actor_rollout_ref.profiler.discrete=$DISCRETE \
 critic.optim.lr=1e-5 \
 critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
 critic.ppo_micro_batch_size_per_gpu=4 \
+critic.profiler.enable=True \
 critic.profiler.ranks=$PROFILE_RANKS \
 critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
-critic.profiler.discrete=$DISCRETE \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.logger='["console","wandb"]' \
 trainer.project_name='verl_ppo_gsm8k_math_examples' \
 trainer.experiment_name='deepseek_llm_7b_megatron' \
 trainer.n_gpus_per_node=8 \
-trainer.nnodes=2 \
+trainer.nnodes=1 \
 trainer.save_freq=-1 \
 trainer.test_freq=-1 \
 trainer.total_epochs=100 \
-trainer.total_training_steps=6 \
-trainer.profile_steps=$PROFILE_STEPS $@
+trainer.total_training_steps=1 \
+profiler.tool=nsys \
+profiler.steps=$PROFILE_STEPS \
+profiler.tool_config.nsys.discrete=$DISCRETE $@
```


```diff
@@ -10,8 +10,8 @@ test_files=${test_files:-"$gsm8k_test_path"}
 PROFILE_STEPS="[1,2,5]" # or [] or null
 PROFILE_RANKS_ALL=False # or True
-PROFILE_RANKS=[0,4,8,12]
+PROFILE_RANKS=[0,4]
-DISCRETE=False # or True
+DISCRETE=True # or True
 python3 -m verl.trainer.main_ppo \
 algorithm.adv_estimator=gae \
@@ -30,17 +30,17 @@ python3 -m verl.trainer.main_ppo \
 actor_rollout_ref.actor.ppo_mini_batch_size=512 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
 actor_rollout_ref.actor.use_dynamic_bsz=True \
-actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
+actor_rollout_ref.actor.ppo_max_token_len_per_gpu=12000 \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.actor.use_kl_loss=False \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.ranks=$PROFILE_RANKS \
+actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL \
 actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
 actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=24000 \
-actor_rollout_ref.profiler.ranks=$PROFILE_RANKS \
-actor_rollout_ref.profiler.all_ranks=$PROFILE_RANKS_ALL \
-actor_rollout_ref.profiler.discrete=$DISCRETE \
 critic.optim.lr=1e-5 \
 critic.model.use_remove_padding=True \
 critic.model.path=Qwen/Qwen2-7B-Instruct \
@@ -50,9 +50,9 @@ python3 -m verl.trainer.main_ppo \
 critic.ppo_max_token_len_per_gpu=98304 \
 critic.model.fsdp_config.param_offload=False \
 critic.model.fsdp_config.optimizer_offload=False \
+critic.profiler.enable=True \
 critic.profiler.ranks=$PROFILE_RANKS \
 critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
-critic.profiler.discrete=$DISCRETE \
 reward_model.enable=True \
 reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
 reward_model.model.use_remove_padding=True \
@@ -60,9 +60,9 @@ python3 -m verl.trainer.main_ppo \
 reward_model.micro_batch_size_per_gpu=32 \
 reward_model.use_dynamic_bsz=True \
 reward_model.forward_max_token_len_per_gpu=98304 \
+reward_model.profiler.enable=True \
 reward_model.profiler.ranks=$PROFILE_RANKS \
 reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
-reward_model.profiler.discrete=$DISCRETE \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.logger='["console","wandb"]' \
@@ -70,10 +70,12 @@ python3 -m verl.trainer.main_ppo \
 trainer.experiment_name='qwen2-7b_hybrid_rm_bsz8k_p4k_r4k_seq_packing' \
 trainer.n_gpus_per_node=8 \
 trainer.val_before_train=False \
-trainer.nnodes=2 \
+trainer.nnodes=1 \
 trainer.save_freq=-1 \
 trainer.test_freq=-1 \
 trainer.total_epochs=15 \
 trainer.total_training_steps=6 \
-trainer.profile_continuous_steps=True \
-trainer.profile_steps=$PROFILE_STEPS $@
+profiler.profile_continuous_steps=True \
+profiler.tool=nsys \
+profiler.steps=$PROFILE_STEPS \
+profiler.tool_config.nsys.discrete=$DISCRETE $@
```


```diff
@@ -97,8 +97,8 @@ class RayDAPOTrainer(RayPPOTrainer):
         prev_step_profile = False
         curr_step_profile = (
-            self.global_steps in self.config.trainer.profile_steps
-            if self.config.trainer.profile_steps is not None
+            self.global_steps in self.config.global_profiler.steps
+            if self.config.global_profiler.steps is not None
             else False
         )
         next_step_profile = False
@@ -114,7 +114,7 @@ class RayDAPOTrainer(RayPPOTrainer):
         with marked_timer("start_profile", timing_raw):
             self._start_profiling(
                 not prev_step_profile and curr_step_profile
-                if self.config.trainer.profile_continuous_steps
+                if self.config.global_profiler.profile_continuous_steps
                 else curr_step_profile
             )
@@ -350,13 +350,13 @@ class RayDAPOTrainer(RayPPOTrainer):
         with marked_timer("stop_profile", timing_raw):
             next_step_profile = (
-                self.global_steps + 1 in self.config.trainer.profile_steps
-                if self.config.trainer.profile_steps is not None
+                self.global_steps + 1 in self.config.global_profiler.steps
+                if self.config.global_profiler.steps is not None
                 else False
             )
             self._stop_profiling(
                 curr_step_profile and not next_step_profile
-                if self.config.trainer.profile_continuous_steps
+                if self.config.global_profiler.profile_continuous_steps
                 else curr_step_profile
             )
         prev_step_profile = curr_step_profile
```
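The prev/curr/next bookkeeping above can be sketched as a standalone helper (a hypothetical function assuming 1-indexed steps, not part of the verl API):

```python
def profiling_actions(total_steps, profile_steps, continuous):
    """Decide, per training step, whether to start/stop the profiler.

    Mirrors the trainer loop above: with continuous=True, adjacent
    profiled steps share one capture window; otherwise each profiled
    step starts and stops its own capture.
    """
    profile_steps = profile_steps or []
    actions = {}
    prev = False
    for step in range(1, total_steps + 1):
        curr = step in profile_steps
        nxt = (step + 1) in profile_steps
        start = (not prev and curr) if continuous else curr
        stop = (curr and not nxt) if continuous else curr
        actions[step] = (start, stop)
        prev = curr
    return actions
```

With `steps=[1, 2, 5]` and `profile_continuous_steps=True`, steps 1-2 form a single capture window (start at 1, stop at 2) while step 5 gets its own.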


```diff
@@ -45,10 +45,13 @@ def run_ppo(config) -> None:
     if (
         is_cuda_available
-        and OmegaConf.select(config.trainer, "profile_steps") is not None
-        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+        and config.global_profiler.tool == "nsys"
+        and OmegaConf.select(config.global_profiler, "steps") is not None
+        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
     ):
-        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+        nsight_options = OmegaConf.to_container(
+            config.global_profiler.global_tool_config.nsys.controller_nsight_options
+        )
         runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
     else:
         runner = TaskRunner.remote()
```
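The new gating condition reads as a small predicate; this illustrative sketch uses plain values in place of the OmegaConf nodes:

```python
def wants_controller_nsight(tool, steps, cuda_available=True):
    """Attach Nsight to the Ray controller process only when CUDA is
    available, the nsys tool is selected, and profile steps are set."""
    return bool(
        cuda_available
        and tool == "nsys"
        and steps is not None
        and len(steps) > 0
    )
```

Note the behavioral change: before this PR, any non-empty `trainer.profile_steps` attached Nsight; now the `nsys` tool must be selected explicitly.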


```diff
@@ -38,6 +38,7 @@ from verl.utils.fsdp_utils import (
 )
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import get_generation_config, update_model_config
+from verl.utils.profiler import ProfilerConfig
 from verl.workers.fsdp_workers import ActorRolloutRefWorker as ARRWorker
 from verl.workers.fsdp_workers import CriticWorker
@@ -131,8 +132,17 @@ class RolloutWorker(ActorRolloutRefWorker):
         # We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
         # as they provides DictConfig-like interface
         # The benefit of creating the dataclass config is to perform validation during __post_init__
-        profiler_config = omega_conf_to_dataclass(config.rollout.get("profiler", {}))
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self._is_rollout = True
         self._is_actor = False
```
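The tool_config selection above boils down to a lookup keyed by the chosen tool. A minimal sketch, with plain dicts standing in for the OmegaConf nodes and the dataclass conversion:

```python
def select_tool_config(profiler_cfg):
    """Pick the per-tool config block matching profiler_cfg['tool'];
    unknown or unset tools yield no tool config."""
    tool = profiler_cfg.get("tool")
    if tool in ("npu", "nsys", "torch"):
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None
```

Keeping the per-tool options under `tool_config.<tool>` is what lets a single `profiler.tool=...` override switch backends without touching the rest of the worker config.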


```diff
@@ -51,10 +51,11 @@ def run_ppo(config) -> None:
     # Create a remote instance of the TaskRunner class, and
     # Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
     if (
-        OmegaConf.select(config.trainer, "profile_steps") is not None
-        and len(OmegaConf.select(config.trainer, "profile_steps")) > 0
+        config.global_profiler.tool == "nsys"
+        and OmegaConf.select(config.global_profiler, "steps") is not None
+        and len(OmegaConf.select(config.global_profiler, "steps")) > 0
     ):
-        nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+        nsight_options = OmegaConf.to_container(config.global_profiler.tool_config.nsys.controller_nsight_options)
         runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
     else:
         runner = TaskRunner.remote()
```


```diff
@@ -213,7 +213,6 @@ class OneStepOffRayTrainer(RayPPOTrainer):
             self.role_worker_mapping[Role.RefPolicy],
             config=self.config.actor_rollout_ref,
             role="ref",
-            profile_option=self.config.trainer.npu_profile.options,
         )
         self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@@ -233,13 +232,13 @@ class OneStepOffRayTrainer(RayPPOTrainer):
         wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
         if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                 "worker_nsight_options must be set when profile_steps is set"
             )
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.trainer, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
             )
         for resource_pool, class_dict in self.resource_pool_to_cls.items():
@@ -391,8 +390,8 @@ class OneStepOffRayTrainer(RayPPOTrainer):
         while batch_data_future is not None:
             do_profile = (
-                self.global_steps in self.config.trainer.profile_steps
-                if self.config.trainer.profile_steps is not None
+                self.global_steps in self.config.global_profiler.steps
+                if self.config.global_profiler.steps is not None
                 else False
             )
             if do_profile:
```
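The kwargs assembly above can be sketched with plain dicts standing in for the OmegaConf nodes (a hypothetical helper, not the verl API):

```python
def build_profiler_wg_kwargs(global_profiler):
    """Assemble RayWorkerGroup kwargs from the global profiler section:
    the profile steps plus the mandatory worker-side Nsight options."""
    wg_kwargs = {}
    steps = global_profiler.get("steps")
    if steps is not None:
        nsight = global_profiler.get("worker_nsight_options")
        # Worker processes need their own Nsight launch options, so this
        # key is required whenever step-based profiling is requested.
        assert nsight is not None, "worker_nsight_options must be set when profile_steps is set"
        wg_kwargs["profile_steps"] = steps
        wg_kwargs["worker_nsight_options"] = dict(nsight)
    return wg_kwargs
```

The same pattern appears in `RayPPOTrainer`; only the config section it reads from changed (`trainer.*` to `global_profiler.*`).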


```diff
@@ -37,6 +37,14 @@ class TestConfigComparison(unittest.TestCase):
             "activations_checkpoint_method",
             "activations_checkpoint_granularity",
             "activations_checkpoint_num_layers",
+            "discrete",
+            "profiler",
+            "profile",
+            "use_profile",
+            "npu_profile",
+            "profile_steps",
+            "worker_nsight_options",
+            "controller_nsight_options",
         ]
     def _compare_configs_recursively(
```
def _compare_configs_recursively( def _compare_configs_recursively(


```diff
@@ -79,7 +79,7 @@ class TestPrintCfgCommand(unittest.TestCase):
         # Run the command
         result = subprocess.run(
-            ["python3", "scripts/print_cfg.py", "critic.profiler.discrete=True", "+critic.profiler.extra.any_key=val"],
+            ["python3", "scripts/print_cfg.py", "+critic.profiler.extra.any_key=val"],
             capture_output=True,
             text=True,
         )
@@ -90,7 +90,6 @@ class TestPrintCfgCommand(unittest.TestCase):
         # Verify the output contains expected config information
         self.assertIn("critic", result.stdout)
         self.assertIn("profiler", result.stdout)
-        self.assertIn("discrete=True", result.stdout)
         self.assertIn("extra={'any_key': 'val'}", result.stdout)
```


```diff
@@ -17,7 +17,7 @@ import unittest
 from unittest.mock import MagicMock, patch
 from verl.utils import omega_conf_to_dataclass
-from verl.utils.profiler import ProfilerConfig
+from verl.utils.profiler.config import NsightToolConfig, ProfilerConfig
 from verl.utils.profiler.nvtx_profile import NsightSystemsProfiler
@@ -29,26 +29,25 @@ class TestProfilerConfig(unittest.TestCase):
         with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
             cfg = compose(config_name="ppo_trainer")
+        arr = cfg.actor_rollout_ref
         for config in [
-            cfg.actor_rollout_ref.actor.profiler,
-            cfg.actor_rollout_ref.rollout.profiler,
-            cfg.actor_rollout_ref.ref.profiler,
             cfg.critic.profiler,
+            arr.profiler,
             cfg.reward_model.profiler,
         ]:
             profiler_config = omega_conf_to_dataclass(config)
-            self.assertEqual(profiler_config.discrete, config.discrete)
+            self.assertEqual(profiler_config.tool, config.tool)
+            self.assertEqual(profiler_config.enable, config.enable)
             self.assertEqual(profiler_config.all_ranks, config.all_ranks)
             self.assertEqual(profiler_config.ranks, config.ranks)
+            self.assertEqual(profiler_config.save_path, config.save_path)
+            self.assertEqual(profiler_config.ranks, config.ranks)
             assert isinstance(profiler_config, ProfilerConfig)
             with self.assertRaises(AttributeError):
                 _ = profiler_config.non_existing_key
             assert config.get("non_existing_key") == profiler_config.get("non_existing_key")
             assert config.get("non_existing_key", 1) == profiler_config.get("non_existing_key", 1)
-            assert config["discrete"] == profiler_config["discrete"]
-            from dataclasses import FrozenInstanceError
-            with self.assertRaises(FrozenInstanceError):
-                profiler_config.discrete = False
     def test_frozen_config(self):
         """Test that modifying frozen keys in ProfilerConfig raises exceptions."""
@@ -57,11 +56,7 @@ class TestProfilerConfig(unittest.TestCase):
         from verl.utils.profiler.config import ProfilerConfig
         # Create a new ProfilerConfig instance
-        config = ProfilerConfig(discrete=True, all_ranks=False, ranks=[0], extra={"key": "value"})
-        # Test direct attribute assignment
-        with self.assertRaises(FrozenInstanceError):
-            config.discrete = False
+        config = ProfilerConfig(all_ranks=False, ranks=[0], extra={"key": "value"})
         with self.assertRaises(FrozenInstanceError):
             config.all_ranks = True
@@ -69,10 +64,6 @@ class TestProfilerConfig(unittest.TestCase):
         with self.assertRaises(FrozenInstanceError):
             config.ranks = [1, 2, 3]
-        # Test dictionary-style assignment
-        with self.assertRaises(TypeError):
-            config["discrete"] = False
         with self.assertRaises(TypeError):
             config["all_ranks"] = True
@@ -90,20 +81,19 @@ class TestNsightSystemsProfiler(unittest.TestCase):
     Test Plan:
     1. Initialization: Verify profiler state after creation
     2. Basic Profiling: Test start/stop functionality
-    3. Discrete Mode: Test discrete profiling behavior
+    3. Discrete Mode: TODO: Test discrete profiling behavior
     4. Annotation: Test the annotate decorator in both normal and discrete modes
     5. Config Validation: Verify proper config initialization from OmegaConf
     """
     def setUp(self):
-        self.config = ProfilerConfig(all_ranks=True)
+        self.config = ProfilerConfig(enable=True, all_ranks=True)
         self.rank = 0
-        self.profiler = NsightSystemsProfiler(self.rank, self.config)
+        self.profiler = NsightSystemsProfiler(self.rank, self.config, tool_config=NsightToolConfig(discrete=False))
     def test_initialization(self):
         self.assertEqual(self.profiler.this_rank, True)
         self.assertEqual(self.profiler.this_step, False)
-        self.assertEqual(self.profiler.discrete, False)
     def test_start_stop_profiling(self):
         with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
@@ -117,18 +107,18 @@ class TestNsightSystemsProfiler(unittest.TestCase):
             self.assertFalse(self.profiler.this_step)
             mock_stop.assert_called_once()
-    def test_discrete_profiling(self):
-        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-        profiler = NsightSystemsProfiler(self.rank, discrete_config)
-        with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
-            profiler.start()
-            self.assertTrue(profiler.this_step)
-            mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode
-            profiler.stop()
-            self.assertFalse(profiler.this_step)
-            mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
+    # def test_discrete_profiling(self):
+    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)
+    #     with patch("torch.cuda.profiler.start") as mock_start, patch("torch.cuda.profiler.stop") as mock_stop:
+    #         profiler.start()
+    #         self.assertTrue(profiler.this_step)
+    #         mock_start.assert_not_called()  # Shouldn't start immediately in discrete mode
+    #         profiler.stop()
+    #         self.assertFalse(profiler.this_step)
+    #         mock_stop.assert_not_called()  # Shouldn't stop immediately in discrete mode
     def test_annotate_decorator(self):
         mock_self = MagicMock()
@@ -152,29 +142,29 @@ class TestNsightSystemsProfiler(unittest.TestCase):
             mock_start.assert_not_called()  # Not discrete mode
             mock_stop.assert_not_called()  # Not discrete mode
-    def test_annotate_discrete_mode(self):
-        discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
-        profiler = NsightSystemsProfiler(self.rank, discrete_config)
-        mock_self = MagicMock()
-        mock_self.profiler = profiler
-        mock_self.profiler.this_step = True
-        @NsightSystemsProfiler.annotate(message="test")
-        def test_func(self, *args, **kwargs):
-            return "result"
-        with (
-            patch("torch.cuda.profiler.start") as mock_start,
-            patch("torch.cuda.profiler.stop") as mock_stop,
-            patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
-            patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
-        ):
-            result = test_func(mock_self)
-            self.assertEqual(result, "result")
-            mock_start_range.assert_called_once()
-            mock_end_range.assert_called_once()
-            mock_start.assert_called_once()  # Should start in discrete mode
-            mock_stop.assert_called_once()  # Should stop in discrete mode
+    # def test_annotate_discrete_mode(self):
+    #     discrete_config = ProfilerConfig(discrete=True, all_ranks=True)
+    #     profiler = NsightSystemsProfiler(self.rank, discrete_config)
+    #     mock_self = MagicMock()
+    #     mock_self.profiler = profiler
+    #     mock_self.profiler.this_step = True
+    #     @NsightSystemsProfiler.annotate(message="test")
+    #     def test_func(self, *args, **kwargs):
+    #         return "result"
+    #     with (
+    #         patch("torch.cuda.profiler.start") as mock_start,
+    #         patch("torch.cuda.profiler.stop") as mock_stop,
+    #         patch("verl.utils.profiler.nvtx_profile.mark_start_range") as mock_start_range,
+    #         patch("verl.utils.profiler.nvtx_profile.mark_end_range") as mock_end_range,
+    #     ):
+    #         result = test_func(mock_self)
+    #         self.assertEqual(result, "result")
+    #         mock_start_range.assert_called_once()
+    #         mock_end_range.assert_called_once()
+    #         mock_start.assert_called_once()  # Should start in discrete mode
+    #         mock_stop.assert_called_once()  # Should stop in discrete mode
 if __name__ == "__main__":
```


```diff
@@ -184,29 +184,26 @@ class TestCriticConfig:
         optim = OptimizerConfig(lr=0.1)
         critic_config = CriticConfig(ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim)
         assert isinstance(critic_config.profiler, ProfilerConfig)
-        assert critic_config.profiler.discrete is False
         assert critic_config.profiler.all_ranks is False
         assert critic_config.profiler.ranks == []
-        custom_profiler = ProfilerConfig(discrete=True, all_ranks=True, ranks=[0, 1])
+        custom_profiler = ProfilerConfig(all_ranks=True, ranks=[0, 1])
         critic_config_custom = CriticConfig(
             profiler=custom_profiler, ppo_micro_batch_size_per_gpu=1, strategy="fsdp2", optim=optim
         )
         assert isinstance(critic_config_custom.profiler, ProfilerConfig)
-        assert critic_config_custom.profiler.discrete is True
         assert critic_config_custom.profiler.all_ranks is True
         assert critic_config_custom.profiler.ranks == [0, 1]
-        profiler1 = ProfilerConfig(discrete=True, ranks=[0, 1])
+        profiler1 = ProfilerConfig(enable=True, ranks=[0, 1])
         profiler2 = ProfilerConfig(all_ranks=True, ranks=[1, 2])
         union_result = profiler1.union(profiler2)
-        assert union_result.discrete is True
+        assert union_result.enable is True
         assert union_result.all_ranks is True
         assert set(union_result.ranks) == {0, 1, 2}
         intersect_result = profiler1.intersect(profiler2)
-        assert intersect_result.discrete is False
         assert intersect_result.all_ranks is False
         assert intersect_result.ranks == [1]
```


```diff
@@ -59,6 +59,25 @@ actor_rollout_ref:
       use_checkpoint_opt_param_scheduler: false
       override_optimizer_config: {}
     use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: false
+      all_ranks: false
+      ranks: []
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config:
+        nsys:
+          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+        npu:
+          _target_: verl.utils.profiler.config.NPUToolConfig
+          contents: []
+          level: level1
+          analysis: true
+        torch:
+          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+          step_start: 0
+          step_end: null
     data_loader_seed: null
     load_weight: true
     megatron:
@@ -85,12 +104,6 @@ actor_rollout_ref:
       recompute_method: null
       recompute_num_layers: null
       use_mbridge: false
-      profile:
-        use_profile: false
-        profile_ranks: null
-        step_start: -1
-        step_end: -1
-        save_path: null
   ref:
     strategy: megatron
     use_torch_compile: ${oc.select:actor_rollout_ref.actor.use_torch_compile,true}
@@ -98,6 +111,14 @@ actor_rollout_ref:
     log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
     log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
     megatron:
       _target_: verl.workers.config.MegatronEngineConfig
       param_offload: false
@@ -114,12 +135,6 @@ actor_rollout_ref:
       seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
      override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
       use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
-      profile:
-        use_profile: false
-        profile_ranks: null
-        step_start: -1
-        step_end: -1
-        save_path: null
     load_weight: true
   rollout:
     name: ???
@@ -184,6 +199,14 @@ actor_rollout_ref:
     token2text: false
     skip_rollout: false
     skip_dump_dir: /tmp/rollout_dump
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
     enable_chunked_prefill: false
     load_format: dummy_megatron
     layer_name_map:
@@ -201,63 +224,6 @@ actor_rollout_ref:
     freeze_moe_router: false
     use_fused_kernels: false
     trust_remote_code: false
-  profiler:
-    _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
-    all_ranks: false
-    ranks: []
-trainer:
-  npu_profile:
-    options:
-      save_path: ./profiler_data
-      roles:
-      - all
-      level: level1
-      with_memory: false
-      record_shapes: false
-      with_npu: true
-      with_cpu: true
-      with_module: false
-      with_stack: false
-      analysis: true
-  balance_batch: true
-  total_epochs: 30
-  total_training_steps: null
-  profile_steps: null
-  profile_continuous_steps: false
-  project_name: verl_examples
-  experiment_name: gsm8k
-  logger:
-  - console
-  - wandb
-  log_val_generations: 0
-  nnodes: 1
-  n_gpus_per_node: 8
-  save_freq: -1
-  esi_redundant_time: 0
-  resume_mode: auto
-  resume_from_path: null
-  del_local_ckpt_after_load: false
-  val_before_train: true
-  test_freq: -1
-  critic_warmup: 0
-  default_hdfs_dir: null
-  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
-  max_actor_ckpt_to_keep: null
-  max_critic_ckpt_to_keep: null
-  ray_wait_register_center_timeout: 300
-  device: cuda
-  controller_nsight_options:
-    trace: cuda,nvtx,cublas,ucx
-    cuda-memory-usage: 'true'
-    cuda-graph-trace: graph
-  worker_nsight_options:
-    trace: cuda,nvtx,cublas,ucx
-    cuda-memory-usage: 'true'
-    cuda-graph-trace: graph
-    capture-range: cudaProfilerApi
-    capture-range-end: null
-    kill: none
 data:
   tokenizer: null
   use_shm: false
@@ -344,9 +310,12 @@ critic:
       async_save: false
   profiler:
     _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
+    tool: ${oc.select:global_profiler.tool,null}
+    enable: false
     all_ranks: false
     ranks: []
+    save_path: ${oc.select:global_profiler.save_path,null}
+    tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
   nccl_timeout: 600
   megatron:
     _target_: verl.workers.config.McoreEngineConfig
@@ -390,9 +359,12 @@ reward_model:
     memory_limit_mb: 1024
   profiler:
     _target_: verl.utils.profiler.ProfilerConfig
-    discrete: false
+    tool: ${oc.select:global_profiler.tool,null}
+    enable: false
     all_ranks: false
     ranks: []
+    save_path: ${oc.select:global_profiler.save_path,null}
+    tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
```
nccl_timeout: 600 nccl_timeout: 600
megatron: megatron:
_target_: verl.workers.config.MegatronEngineConfig _target_: verl.workers.config.MegatronEngineConfig
@ -432,6 +404,52 @@ algorithm:
pf_ppo: pf_ppo:
reweight_method: pow reweight_method: pow
weight_pow: 2.0 weight_pow: 2.0
trainer:
balance_batch: true
total_epochs: 30
total_training_steps: null
project_name: verl_examples
experiment_name: gsm8k
logger:
- console
- wandb
log_val_generations: 0
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
esi_redundant_time: 0
resume_mode: auto
resume_from_path: null
del_local_ckpt_after_load: false
val_before_train: true
test_freq: -1
critic_warmup: 0
default_hdfs_dir: null
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
max_actor_ckpt_to_keep: null
max_critic_ckpt_to_keep: null
ray_wait_register_center_timeout: 300
device: cuda
global_profiler:
_target_: verl.utils.profiler.ProfilerConfig
tool: null
steps: null
profile_continuous_steps: false
save_path: outputs/profile
global_tool_config:
nsys:
discrete: false
controller_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
worker_nsight_options:
trace: cuda,nvtx,cublas,ucx
cuda-memory-usage: 'true'
cuda-graph-trace: graph
capture-range: cudaProfilerApi
capture-range-end: null
kill: none
ray_init: ray_init:
num_cpus: null num_cpus: null
timeline_json_file: null timeline_json_file: null

View File

@@ -51,6 +51,25 @@ actor_rollout_ref:
num_cycles: 0.5
warmup_style: constant
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
+all_ranks: false
+ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config:
+nsys:
+discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+npu:
+_target_: verl.utils.profiler.config.NPUToolConfig
+contents: []
+level: level1
+analysis: true
+torch:
+_target_: verl.utils.profiler.config.TorchProfilerToolConfig
+step_start: 0
+step_end: null
grad_clip: 1.0
ulysses_sequence_parallel_size: 1
entropy_from_logits_with_chunking: false
@@ -73,6 +92,14 @@ actor_rollout_ref:
log_prob_micro_batch_size_per_gpu: null
log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
model: null
fsdp_config:
_target_: verl.workers.config.FSDPEngineConfig
@@ -147,6 +174,14 @@ actor_rollout_ref:
token2text: false
skip_rollout: false
skip_dump_dir: /tmp/rollout_dump
+profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: ${oc.select:global_profiler.tool,null}
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
enable_chunked_prefill: true
load_format: dummy_dtensor
layered_summon: false
@@ -170,67 +205,6 @@ actor_rollout_ref:
fused_kernel_options:
impl_backend: torch
trust_remote_code: false
-profiler:
-_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
-all_ranks: false
-ranks: []
-trainer:
-npu_profile:
-options:
-save_path: ./profiler_data
-roles:
-- all
-level: level1
-with_memory: false
-record_shapes: false
-with_npu: true
-with_cpu: true
-with_module: false
-with_stack: false
-analysis: true
-balance_batch: true
-total_epochs: 30
-total_training_steps: null
-profile_steps: null
-profile_continuous_steps: false
-controller_nsight_options:
-trace: cuda,nvtx,cublas,ucx
-cuda-memory-usage: 'true'
-cuda-graph-trace: graph
-worker_nsight_options:
-trace: cuda,nvtx,cublas,ucx
-cuda-memory-usage: 'true'
-cuda-graph-trace: graph
-capture-range: cudaProfilerApi
-capture-range-end: null
-kill: none
-project_name: verl_examples
-experiment_name: gsm8k
-logger:
-- console
-- wandb
-log_val_generations: 0
-rollout_data_dir: null
-validation_data_dir: null
-nnodes: 1
-n_gpus_per_node: 8
-save_freq: -1
-esi_redundant_time: 0
-resume_mode: auto
-resume_from_path: null
-val_before_train: true
-val_only: false
-test_freq: -1
-critic_warmup: 0
-default_hdfs_dir: null
-del_local_ckpt_after_load: false
-default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
-max_actor_ckpt_to_keep: null
-max_critic_ckpt_to_keep: null
-ray_wait_register_center_timeout: 300
-device: cuda
-use_legacy_worker_impl: auto
data:
tokenizer: null
use_shm: false
@@ -322,9 +296,12 @@ critic:
async_save: false
profiler:
_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
all_ranks: false
ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
forward_micro_batch_size: ${oc.select:.ppo_micro_batch_size,null}
forward_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size_per_gpu,null}
ulysses_sequence_parallel_size: 1
@@ -361,9 +338,12 @@ reward_model:
memory_limit_mb: 1024
profiler:
_target_: verl.utils.profiler.ProfilerConfig
-discrete: false
+tool: ${oc.select:global_profiler.tool,null}
+enable: false
all_ranks: false
ranks: []
+save_path: ${oc.select:global_profiler.save_path,null}
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}
ulysses_sequence_parallel_size: 1
custom_reward_function:
path: null
@@ -386,6 +366,57 @@ algorithm:
pf_ppo:
reweight_method: pow
weight_pow: 2.0
+trainer:
+balance_batch: true
+total_epochs: 30
+total_training_steps: null
+project_name: verl_examples
+experiment_name: gsm8k
+logger:
+- console
+- wandb
+log_val_generations: 0
+rollout_data_dir: null
+validation_data_dir: null
+nnodes: 1
+n_gpus_per_node: 8
+save_freq: -1
+esi_redundant_time: 0
+resume_mode: auto
+resume_from_path: null
+val_before_train: true
+val_only: false
+test_freq: -1
+critic_warmup: 0
+default_hdfs_dir: null
+del_local_ckpt_after_load: false
+default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
+max_actor_ckpt_to_keep: null
+max_critic_ckpt_to_keep: null
+ray_wait_register_center_timeout: 300
+device: cuda
+use_legacy_worker_impl: auto
+global_profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: null
+steps: null
+profile_continuous_steps: false
+save_path: outputs/profile
+global_tool_config:
+nsys:
+_target_: verl.utils.profiler.config.NsightToolConfig
+discrete: false
+controller_nsight_options:
+trace: cuda,nvtx,cublas,ucx
+cuda-memory-usage: 'true'
+cuda-graph-trace: graph
+worker_nsight_options:
+trace: cuda,nvtx,cublas,ucx
+cuda-memory-usage: 'true'
+cuda-graph-trace: graph
+capture-range: cudaProfilerApi
+capture-range-end: null
+kill: none
ray_init:
num_cpus: null
timeline_json_file: null

View File

@@ -128,3 +128,65 @@ optim:
# Whether to use custom fused kernels (e.g., FlashAttention, fused MLP)
use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+# profile the actor model in `update_policy`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the actor
+enable: False
+# Whether to profile all ranks.
+all_ranks: False
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config:
+# nsys tool config
+nsys:
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+# npu config
+npu:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.NPUToolConfig
+# Contents to profile; can be empty
+# options: npu, cpu, memory, shapes, module, stack
+contents: []
+# Collection level; optional values: level_none, level0, level1, level2.
+level: "level1"
+# Whether to automatically parse the data.
+analysis: True
+# torch profiler config
+torch:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.TorchProfilerToolConfig
+# mini-batch index at which profiling starts during training
+# NOTICE: different from the global steps config, which refers to iterations;
+# this field refers only to mini-batches
+step_start: 0
+# mini-batch index at which profiling stops
+step_end: null
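The `${oc.select:key,default}` interpolations above make each role-local field fall back to a default when the referenced global key is missing. A minimal pure-Python mimic of that lookup, shown only to illustrate the fallback behavior (the real resolution is done by OmegaConf's `oc.select` resolver; the helper name is hypothetical):

```python
# Hypothetical stand-in for OmegaConf's ${oc.select:key,default} resolver.
def oc_select(cfg: dict, dotted_key: str, default=None):
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default  # key missing along the path -> use the default
        node = node[part]
    return node

cfg = {"global_profiler": {"tool": "nsys", "save_path": "outputs/profile"}}
oc_select(cfg, "global_profiler.tool")   # -> "nsys", inherited from the global config
oc_select(cfg, "global_profiler.steps")  # -> None, key absent so the default applies
```

This is why a worker-local `tool: ${oc.select:global_profiler.tool,null}` resolves to the global tool when one is configured and to null otherwise.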

View File

@@ -103,22 +103,4 @@ megatron:
recompute_num_layers: null
# oc.select: default val for ref.megatron.use_mbridge
use_mbridge: False
-# profile the actor model in `update_policy`
-profile:
-# turn it on when you want to profile the actor model
-use_profile: False
-# list, you can specify the ranks to profile
-profile_ranks: null
-# start step in update_policy
-step_start: -1
-# end step
-step_end: -1
-# the path to save the profile result
-save_path: null

View File

@@ -45,14 +45,12 @@ class ProfileConfig(BaseConfig):
The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.
Args:
-use_profile (bool): Whether to enable profiling.
profile_ranks (Optional[list[int]]): List of ranks to profile. None means all ranks.
step_start (int): Starting step for profiling.
step_end (int): Ending step for profiling.
save_path (Optional[str]): Path to save profiling results.
"""
-use_profile: bool = False
profile_ranks: Optional[list[int]] = None
step_start: int = -1
step_end: int = -1
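The legacy `use_profile` flag is dropped in favor of the per-role `profiler` blocks shown in the YAML diffs. As a rough sketch (assumption: field names are taken from those YAML blocks; the real class is `verl.utils.profiler.ProfilerConfig` and inherits from `BaseConfig`), the refactored per-role config looks roughly like:

```python
from dataclasses import dataclass, field
from typing import Optional

# Sketch only: mirrors the fields that appear in the new YAML profiler blocks.
@dataclass
class ProfilerConfigSketch:
    tool: Optional[str] = None   # one of: nsys, npu, torch
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)
    save_path: Optional[str] = None
    tool_config: Optional[dict] = None

cfg = ProfilerConfigSketch(enable=True, ranks=[0, 1])
```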

View File

@@ -95,18 +95,27 @@ checkpoint:
# Whether to save checkpoints asynchronously. Only effective for Megatron as of now.
async_save: False
-# profiler configs
-# the corresponding dataclass is verl.utils.profiler.ProfilerConfig.
+# profile the critic model in `update_policy`
profiler:
# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the critic
+enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -4,8 +4,6 @@ defaults:
# <folder_name>@<field_name>.<field_name>: <yaml_file_name>
# actor_rollout_ref.actor: trainer/config/actor/megatron_actor.yaml
- actor@actor_rollout_ref.actor: megatron_actor
-# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
-- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
# load the reference default config, then apply the fields in the current yaml
@@ -57,12 +55,6 @@ actor_rollout_ref:
qkv_layer_name: qkv
gate_proj_layer_name: gate_up
-profiler:
-_target_: verl.utils.profiler.ProfilerConfig
-discrete: False
-all_ranks: False
-ranks: []
custom_reward_function:
path: null
name: compute_score
@@ -92,8 +84,6 @@ trainer:
balance_batch: True
total_epochs: 30
total_training_steps: null
-profile_steps: null # [1,2,5] or [] or null
-profile_continuous_steps: False
project_name: verl_examples
experiment_name: gsm8k
logger: ['console', 'wandb']
@@ -117,18 +107,62 @@ trainer:
# The timeout for ray worker group to wait for the register center to be ready
ray_wait_register_center_timeout: 300
device: cuda
-# see ppo_trainer.yaml for more details
-controller_nsight_options:
-trace: "cuda,nvtx,cublas,ucx"
-cuda-memory-usage: "true"
-cuda-graph-trace: "graph"
-worker_nsight_options:
-trace: "cuda,nvtx,cublas,ucx"
-cuda-memory-usage: "true"
-cuda-graph-trace: "graph"
-capture-range: "cudaProfilerApi"
-capture-range-end: null
-kill: none
+global_profiler:
+_target_: verl.utils.profiler.ProfilerConfig
+tool: null # choose between nsys, npu, torch
+steps: null # profile steps
+profile_continuous_steps: False
+save_path: "outputs/profile" # profiler saving path
+# Tool-specific configs; configure via +profiler.tool_config.[tool].xxx
+global_tool_config:
+# nsys config
+nsys:
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: False
+# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
+controller_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+worker_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
+capture-range: "cudaProfilerApi"
+# Specify the desired behavior when a capture range ends.
+# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
+# Valid values are "repeat-shutdown:n" or null.
+# For normal whole-step profiling, n = len(profile_steps);
+# for discrete profiling, n = len(profile_steps) * Number(subtasks).
+# Or leave it null and the program will use n = len(profile_steps) * 6.
+capture-range-end: null
+# Send a signal to the target application's process group. We let the program exit by itself.
+kill: none
ray_init:
num_cpus: null # `None` means using all CPUs, which might cause hang if limited in systems like SLURM. Please set to a number allowed then.
timeline_json_file: null

View File

@@ -11,9 +11,6 @@ defaults:
# actor_rollout_ref.actor: trainer/config/actor/dp_actor.yaml
- actor@actor_rollout_ref.actor: dp_actor
-# trainer.npu_profile: trainer/config/npu_profile/npu_profile.yaml
-- npu_profile@trainer.npu_profile: npu_profile
# data: trainer/config/data/legacy_data.yaml
- data@data: legacy_data
@@ -112,21 +109,6 @@ actor_rollout_ref:
# for huge model, layered summon can save memory (prevent OOM) but make it slower
layered_summon: False
-# profiler configs
-profiler:
-# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
-_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
-# Whether to profile all ranks.
-all_ranks: False
-# The ranks that will be profiled. [] or [0,1,...]
-ranks: []
# custom reward function definition
custom_reward_function:
@@ -203,54 +185,6 @@ trainer:
# Total training steps (can be set explicitly or derived from epochs)
total_training_steps: null
-# The steps that will be profiled. null means no profiling. null or [1,2,5,...]
-profile_steps: null
-# Whether to combine continuous steps into one database.
-## If True, worker.profiler.discrete must be False, [1,2] in one, [5] in another.
-## If False, [1] in one, [2] in another, [5] in another.
-profile_continuous_steps: False
-# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
-## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
-## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
-controller_nsight_options:
-# Select the API(s) to be traced.
-trace: "cuda,nvtx,cublas,ucx"
-# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
-cuda-memory-usage: "true"
-# CUDA graphs will be traced as a whole
-cuda-graph-trace: "graph"
-# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
-worker_nsight_options:
-# Select the API(s) to be traced.
-trace: "cuda,nvtx,cublas,ucx"
-# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
-cuda-memory-usage: "true"
-# CUDA graphs will be traced as a whole
-cuda-graph-trace: "graph"
-# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
-capture-range: "cudaProfilerApi"
-# Specify the desired behavior when a capture range ends.
-# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
-# Valid values are "repeat-shutdown:n" or null.
-# For normal whole-step profiling, n = len(profile_steps);
-# for discrete profiling, n = len(profile_steps) * Number(subtasks).
-# Or leave it null and the program will use n = len(profile_steps) * 6.
-capture-range-end: null
-# Send a signal to the target application's process group. We let the program exit by itself.
-kill: none
# Project name for experiment tracking (e.g., wandb)
project_name: verl_examples
@@ -331,6 +265,79 @@ trainer:
# mode: "auto", "enable", or "disable"
use_legacy_worker_impl: auto
+# profiler configs
+global_profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# Profiling tool: choose between nsys, npu, torch
+tool: null
+# profile steps
+steps: null
+# Whether to combine continuous steps into one database.
+## If True, worker.profiler.discrete must be False, [1,2] in one, [5] in another.
+## If False, [1] in one, [2] in another, [5] in another.
+profile_continuous_steps: False
+# Path to save profiling contents
+save_path: "outputs/profile"
+# Tool-specific configs; configure via +profiler.tool_config.[tool].xxx
+global_tool_config:
+# nsys config
+nsys:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.config.NsightToolConfig
+# True: each task has its own database; False: all tasks in one training step share one database.
+discrete: False
+# controller Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
+controller_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# worker Nvidia Nsight Systems options. Must be set when profile_steps is not None.
+worker_nsight_options:
+# Select the API(s) to be traced.
+trace: "cuda,nvtx,cublas,ucx"
+# Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+cuda-memory-usage: "true"
+# CUDA graphs will be traced as a whole
+cuda-graph-trace: "graph"
+# Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
+capture-range: "cudaProfilerApi"
+# Specify the desired behavior when a capture range ends.
+# In verl the torch.cuda.profiler.start/stop pair needs to repeat n times.
+# Valid values are "repeat-shutdown:n" or null.
+# For normal whole-step profiling, n = len(profile_steps);
+# for discrete profiling, n = len(profile_steps) * Number(subtasks).
+# Or leave it null and the program will use n = len(profile_steps) * 6.
+capture-range-end: null
+# Send a signal to the target application's process group. We let the program exit by itself.
+kill: none
# configs related to ray initialization
ray_init:
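The `capture-range-end` accounting described in the comments above (n = len(profile_steps) for whole-step profiling, multiplied by the number of nsys subtasks for discrete profiling, with 6 as the documented fallback multiplier) can be sketched as a small helper; the function is hypothetical, not part of verl:

```python
# Hypothetical helper deriving an nsys "repeat-shutdown:n" value per the
# capture-range-end rules documented above.
def capture_range_end(profile_steps: list, discrete: bool, num_subtasks: int = 6) -> str:
    n = len(profile_steps) * (num_subtasks if discrete else 1)
    return f"repeat-shutdown:{n}"

capture_range_end([1, 2, 5], discrete=False)  # -> "repeat-shutdown:3"
capture_range_end([1, 2, 5], discrete=True)   # -> "repeat-shutdown:18"
```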

View File

@@ -23,11 +23,4 @@ megatron:
override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
-profile:
-use_profile: False
-profile_ranks: null
-step_start: -1
-step_end: -1
-save_path: null
load_weight: True

View File

@@ -19,3 +19,28 @@ log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,fa
# the max token length per GPU
# same as actor_rollout_ref.actor.ppo_max_token_len_per_gpu if it exists, otherwise 16384
log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+# profile the ref model in `compute_log_prob`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the ref model
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+# Whether to profile all ranks.
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -65,17 +65,27 @@ sandbox_fusion:
# Max memory limit for each sandbox process in MB
memory_limit_mb: 1024
-# profiler configs
+# profile the reward model in `compute_reward`
profiler:
-# hint for the target config dataclass
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
_target_: verl.utils.profiler.ProfilerConfig
-# True for each task has its own database, False for all tasks in one training step share one database.
-discrete: False
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the reward model
+enable: False
# Whether to profile all ranks.
all_ranks: False
# The ranks that will be profiled. [] or [0,1,...]
ranks: []
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -225,3 +225,28 @@ skip_rollout: False
# Specifies the filesystem path where rollout data should be cached when skip_rollout is enabled.
# Note: Giving path under /tmp/ray/session* is not recommended as these are temporary Ray cluster directories.
skip_dump_dir: /tmp/rollout_dump
+# profile the rollout model in `generate_sequence`
+profiler:
+# Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+_target_: verl.utils.profiler.ProfilerConfig
+# profiler tool; defaults to the tool set in the global profiler config
+# choices: nsys, npu, torch
+tool: ${oc.select:global_profiler.tool,null}
+# whether to enable profiling on the rollout model
+enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+# Whether to profile all ranks.
+all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+# The ranks that will be profiled. [] or [0,1,...]
+ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+# path to save profiling results
+save_path: ${oc.select:global_profiler.save_path,null}
+# tool config specific to this role
+tool_config: ${oc.select:actor_rollout_ref.actor.tool_config,null}

View File

@@ -64,13 +64,16 @@ def run_ppo(config) -> None:
# Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
if (
is_cuda_available
-and config.trainer.get("profile_steps") is not None
-and len(config.trainer.get("profile_steps", [])) > 0
+and config.global_profiler.tool == "nsys"
+and config.global_profiler.get("steps") is not None
+and len(config.global_profiler.get("steps", [])) > 0
):
from verl.utils.import_utils import is_nvtx_available
assert is_nvtx_available(), "nvtx is not available in CUDA platform. Please 'pip3 install nvtx'"
-nsight_options = OmegaConf.to_container(config.trainer.controller_nsight_options)
+nsight_options = OmegaConf.to_container(
+config.global_profiler.global_tool_config.nsys.controller_nsight_options
+)
runner = TaskRunner.options(runtime_env={"nsight": nsight_options}).remote()
else:
runner = TaskRunner.remote()
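The new gate above enables Nsight controller profiling only when CUDA is available, `global_profiler.tool` is `nsys`, and `global_profiler.steps` is a non-empty list. A standalone sketch of that predicate (plain dicts stand in for the OmegaConf config; the function name is hypothetical):

```python
# Mirrors the run_ppo condition above, using plain dicts instead of OmegaConf.
def should_profile_controller(config: dict, cuda_available: bool) -> bool:
    gp = config.get("global_profiler", {})
    steps = gp.get("steps")
    return cuda_available and gp.get("tool") == "nsys" and steps is not None and len(steps) > 0

should_profile_controller({"global_profiler": {"tool": "nsys", "steps": [1, 2]}}, True)  # -> True
should_profile_controller({"global_profiler": {"tool": "nsys", "steps": None}}, True)    # -> False
should_profile_controller({"global_profiler": {"tool": "torch", "steps": [1]}}, True)    # -> False
```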

View File

```diff
@@ -795,7 +795,6 @@ class RayPPOTrainer:
                 cls=self.role_worker_mapping[Role.ActorRollout],
                 config=self.config.actor_rollout_ref,
                 role="actor_rollout",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["actor_rollout"] = actor_rollout_cls
         else:
@@ -815,7 +814,6 @@ class RayPPOTrainer:
                 self.role_worker_mapping[Role.RefPolicy],
                 config=self.config.actor_rollout_ref,
                 role="ref",
-                profile_option=self.config.trainer.npu_profile.options,
             )
             self.resource_pool_to_cls[resource_pool]["ref"] = ref_policy_cls
@@ -835,13 +833,13 @@ class RayPPOTrainer:
         wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
         if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
             wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
-        if OmegaConf.select(self.config.trainer, "profile_steps") is not None:
-            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.trainer, "profile_steps")
-            assert OmegaConf.select(self.config.trainer, "worker_nsight_options") is not None, (
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            assert OmegaConf.select(self.config.global_profiler, "worker_nsight_options") is not None, (
                 "worker_nsight_options must be set when profile_steps is set"
             )
             wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
-                OmegaConf.select(self.config.trainer, "worker_nsight_options")
+                OmegaConf.select(self.config.global_profiler, "worker_nsight_options")
             )
         wg_kwargs["device_name"] = self.device_name
@@ -1083,8 +1081,8 @@ class RayPPOTrainer:
                 prev_step_profile = False
                 curr_step_profile = (
-                    self.global_steps in self.config.trainer.profile_steps
-                    if self.config.trainer.profile_steps is not None
+                    self.global_steps in self.config.global_profiler.steps
+                    if self.config.global_profiler.steps is not None
                     else False
                 )
                 next_step_profile = False
@@ -1097,7 +1095,7 @@ class RayPPOTrainer:
                 with marked_timer("start_profile", timing_raw):
                     self._start_profiling(
                         not prev_step_profile and curr_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )
@@ -1341,13 +1339,13 @@ class RayPPOTrainer:
                 with marked_timer("stop_profile", timing_raw):
                     next_step_profile = (
-                        self.global_steps + 1 in self.config.trainer.profile_steps
-                        if self.config.trainer.profile_steps is not None
+                        self.global_steps + 1 in self.config.global_profiler.steps
+                        if self.config.global_profiler.steps is not None
                         else False
                     )
                     self._stop_profiling(
                         curr_step_profile and not next_step_profile
-                        if self.config.trainer.profile_continuous_steps
+                        if self.config.global_profiler.profile_continuous_steps
                         else curr_step_profile
                     )
                     prev_step_profile = curr_step_profile
```
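The start/stop decisions above depend only on whether the previous, current, and next steps are profiled. A simplified recomputation of that logic (the trainer tracks `prev_step_profile` statefully; this sketch re-derives it from membership, and the function name is hypothetical):

```python
def profiling_actions(profile_steps, continuous, global_step):
    """Return (start, stop) for `global_step`: with profile_continuous_steps,
    adjacent profiled steps share one session; otherwise each profiled step
    starts and stops its own."""
    def in_steps(step):
        return profile_steps is not None and step in profile_steps

    prev_p, curr_p, next_p = in_steps(global_step - 1), in_steps(global_step), in_steps(global_step + 1)
    if continuous:
        return (not prev_p and curr_p, curr_p and not next_p)
    return (curr_p, curr_p)
```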


```diff
@@ -12,14 +12,74 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import warnings
 from dataclasses import dataclass, field
+from typing import Any, Optional
+
+from omegaconf import MISSING

 from verl.base_config import BaseConfig


+@dataclass
+class NsightToolConfig(BaseConfig):
+    """Nsight tool config."""
+
+    "True for each task has its own database, False for all tasks in one training step share one database."
+    discrete: bool = False
+
+    def __post_init__(self) -> None:
+        pass
+
+
+@dataclass
+class TorchProfilerToolConfig(BaseConfig):
+    """Torch profiler tool config.
+
+    Args:
+        step_start (int): Start step in update_policy.
+        step_end (int): End step.
+    """
+
+    step_start: int = -1
+    step_end: int = -1
+
+    def __post_init__(self) -> None:
+        """config validation logics go here"""
+        warnings.warn("Torch profiler tool config is not fully supported now.", stacklevel=1)
+        assert isinstance(self.step_start, int), f"Profiler step_start must be of type int, got {type(self.step_start)}"
+
+
+@dataclass
+class NPUToolConfig(NsightToolConfig):
+    """NPU profiler tool config."""
+
+    # options: npu, cpu, memory, shapes, module, stack
+    contents: list[str] = field(default_factory=list)
+    # Collection level, optional values: level_none, level0, level1, level2.
+    level: str = "level1"
+    # Whether to automatically parse the data.
+    analysis: bool = False
+
+    def __post_init__(self) -> None:
+        """config validation logics go here"""
+        assert isinstance(self.contents, list), f"Profiler contents must be of type list, got {type(self.contents)}"
+        assert isinstance(self.level, str), f"Profiler level must be of type str, got {type(self.level)}"
+        assert isinstance(self.analysis, bool), f"Profiler analysis must be of type bool, got {type(self.analysis)}"
+        for content in self.contents:
+            assert content in ["npu", "cpu", "memory", "shapes", "module", "stack"], (
+                f"Profiler contents only supports npu, cpu, memory, shapes, module, stack, but gets {content}"
+            )
+        assert self.level in ["level_none", "level0", "level1", "level2"], (
+            f"Profiler level only supports level0, 1, 2, and level_none, but gets {self.level}"
+        )
+
+
 @dataclass
 class ProfilerConfig(BaseConfig):
-    """Worker profiler config. Currently only support Nsight system profiler.
+    """Worker profiler config.

     The inheritance from BaseConfig provides omegaconf.DictConfig-like interface for a dataclass config.
@@ -30,22 +90,33 @@ class ProfilerConfig(BaseConfig):
         ranks (list[int]): The ranks that will be profiled. Defaults to [].
     """

-    discrete: bool = False
+    tool: Optional[str] = MISSING
+    enable: bool = False
     all_ranks: bool = False
     ranks: list[int] = field(default_factory=list)
+    save_path: Optional[str] = MISSING
+    tool_config: Any = MISSING  # Just a placeholder, will use configs above directly

     def union(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, f"Cannot union ProfilerConfig with different tools: {self.tool} vs {other.tool}"
         return ProfilerConfig(
+            tool=self.tool,
+            enable=self.enable or other.enable,
             all_ranks=self.all_ranks or other.all_ranks,
             ranks=list(set(self.ranks or []) | set(other.ranks or [])),
-            discrete=self.discrete or other.discrete,
+            tool_config=self.tool_config,
         )

     def intersect(self, other: "ProfilerConfig") -> "ProfilerConfig":
+        assert self.tool == other.tool, (
+            f"Cannot intersect ProfilerConfig with different tools: {self.tool} vs {other.tool}"
+        )
         return ProfilerConfig(
+            tool=self.tool,
+            enable=self.enable and other.enable,
             all_ranks=self.all_ranks and other.all_ranks,
             ranks=list(set(self.ranks or []) & set(other.ranks or [])),
-            discrete=self.discrete and other.discrete,
+            tool_config=self.tool_config,
        )

     def __post_init__(self) -> None:
```
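The new union/intersect semantics can be illustrated with a stripped-down stand-in for `ProfilerConfig` (a sketch using a plain dataclass, not the verl class itself):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniProfilerConfig:
    # Stand-in mirroring ProfilerConfig's merge semantics from the diff above
    tool: Optional[str] = None
    enable: bool = False
    all_ranks: bool = False
    ranks: list = field(default_factory=list)

    def union(self, other):
        assert self.tool == other.tool  # merging configs for different tools is rejected
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable or other.enable,
            all_ranks=self.all_ranks or other.all_ranks,
            ranks=sorted(set(self.ranks) | set(other.ranks)),
        )

    def intersect(self, other):
        assert self.tool == other.tool
        return MiniProfilerConfig(
            tool=self.tool,
            enable=self.enable and other.enable,
            all_ranks=self.all_ranks and other.all_ranks,
            ranks=sorted(set(self.ranks) & set(other.ranks)),
        )
```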


```diff
@@ -20,9 +20,9 @@ from contextlib import contextmanager
 from typing import Any, Callable, Optional

 import torch_npu
-from omegaconf import DictConfig
 from torch_npu.npu import mstx

+from .config import NPUToolConfig
 from .profile import DistProfiler, ProfilerConfig
@@ -86,7 +86,14 @@ def marked_timer(name: str, timing_raw: dict[str, float], *args: Any, **kwargs:
         mark_end_range(mark_range)


-def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_step: Optional[str] = None):
+def get_npu_profiler(
+    contents: list[str],
+    profile_level: str,
+    profile_save_path: str,
+    analysis: bool,
+    role: Optional[str] = None,
+    profile_step: Optional[str] = None,
+):
     """Generate and return an NPU profiler object.

     Args:
@@ -97,18 +104,7 @@ def get_npu_profiler(option: DictConfig, role: Optional[str] = None, profile_ste
         profile_step(str, optional):
             The current training step. Defaults to None.
     """
-    if option.level == "level_none":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level_none
-    elif option.level == "level0":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level0
-    elif option.level == "level1":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level1
-    elif option.level == "level2":
-        profile_level = torch_npu.profiler.ProfilerLevel.Level2
-    else:
-        raise ValueError(f"level only supports level0, 1, 2, and level_none, but gets {option.level}")
-
-    profile_save_path = option.save_path
     if profile_step:
         profile_save_path = os.path.join(profile_save_path, profile_step)
     if role:
@@ -123,18 +119,18 @@
     )

     activites = []
-    if option.with_npu:
+    if contents is None or "npu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.NPU)
-    if option.with_cpu:
+    if contents is None or "cpu" in contents:
         activites.append(torch_npu.profiler.ProfilerActivity.CPU)

     prof = torch_npu.profiler.profile(
-        with_modules=option.with_module,
-        with_stack=option.with_stack,
-        record_shapes=option.record_shapes,
-        profile_memory=option.with_memory,
+        with_modules=contents is None or "module" in contents,
+        with_stack=contents is None or "stack" in contents,
+        record_shapes=contents is None or "shapes" in contents,
+        profile_memory=contents is None or "memory" in contents,
         activities=activites,
-        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=option.analysis),
+        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(profile_save_path, analyse_flag=analysis),
         experimental_config=experimental_config,
     )
     return prof
@@ -147,7 +143,7 @@ class NPUProfiler(DistProfiler):

     _define_count = 0

-    def __init__(self, rank: int, config: ProfilerConfig, **kwargs):
+    def __init__(self, rank: int, config: ProfilerConfig, tool_config: NPUToolConfig, **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -155,12 +151,20 @@ class NPUProfiler(DistProfiler):
             config (Optional[ProfilerConfig]): Configuration for the profiler. If None, a default configuration is used.
         """
         if not config:
-            config = ProfilerConfig(ranks=[])
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be set when profiler is enabled"
+        self.enable: bool = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         self.profile_npu = None
-        self.profile_option = kwargs.get("option", None)
+        self.profile_contents = tool_config.contents
+        self.profile_level = tool_config.level
+        self.profile_save_path = config.save_path
+        self.analysis = tool_config.analysis
         if config.all_ranks:
             self.this_rank = True
         elif config.ranks:
@@ -169,15 +173,22 @@ class NPUProfiler(DistProfiler):
     def start(self, **kwargs):
         role, profile_step = kwargs.get("role", None), kwargs.get("profile_step", None)
         profile_step = str(profile_step) if profile_step is not None else None
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = True
             if not self.discrete and NPUProfiler._define_count == 0:
-                self.profile_npu = get_npu_profiler(option=self.profile_option, role=role, profile_step=profile_step)
+                self.profile_npu = get_npu_profiler(
+                    contents=self.profile_contents,
+                    profile_level=self.profile_level,
+                    profile_save_path=self.profile_save_path,
+                    analysis=self.analysis,
+                    role=role,
+                    profile_step=profile_step,
+                )
                 self.profile_npu.start()
                 NPUProfiler._define_count += 1

     def stop(self):
-        if self.this_rank and self.profile_option is not None:
+        if self.this_rank and self.enable:
             self.this_step = False
             if not self.discrete and NPUProfiler._define_count == 1:
                 self.profile_npu.step()
@@ -201,26 +212,23 @@ class NPUProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
                 profile_name = message or func.__name__
-                profile_this_role = True
                 discrete_mode = self.profiler.discrete
-                profile_enable = self.profiler.this_step and self.profile_option is not None
+                profile_enable = self.profiler.this_step and self.profiler.enable

                 if not profile_enable:
                     return func(self, *args, **kwargs)

-                if profile_enable and role is not None:
-                    target_roles = self.profile_option.get("roles", [])
-                    profile_this_role = "all" in target_roles or role in target_roles
-
                 if profile_enable:
                     if not discrete_mode:
                         mark_range = mark_start_range(message=profile_name)
                     else:
-                        if profile_this_role:
-                            profile_npu = get_npu_profiler(option=self.profile_option, role=role)
-                            profile_npu.start()
-                            mark_range = mark_start_range(message=profile_name)
+                        profile_npu = get_npu_profiler(option=self.profile_option, role=role)
+                        profile_npu.start()
+                        mark_range = mark_start_range(message=profile_name)

                 result = func(self, *args, **kwargs)
@@ -228,10 +236,9 @@ class NPUProfiler(DistProfiler):
                 if not discrete_mode:
                     mark_end_range(mark_range)
                 else:
-                    if profile_this_role:
-                        mark_end_range(mark_range)
-                        profile_npu.step()
-                        profile_npu.stop()
+                    mark_end_range(mark_range)
+                    profile_npu.step()
+                    profile_npu.stop()

                 return result
```
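With the `contents` list replacing the old per-flag options, the flag derivation in `get_npu_profiler` is a pure function. A sketch of that mapping, free of the `torch_npu` dependency (function name hypothetical; activities are represented as strings instead of `ProfilerActivity` members):

```python
def npu_profile_kwargs(contents):
    """Derive torch_npu.profiler.profile keyword flags from a `contents` list,
    mirroring get_npu_profiler above: contents=None enables everything, while an
    explicit list enables only the named items."""
    def pick(key):
        return contents is None or key in contents

    return {
        "with_modules": pick("module"),
        "with_stack": pick("stack"),
        "record_shapes": pick("shapes"),
        "profile_memory": pick("memory"),
        "activities": [a for a in ("npu", "cpu") if pick(a)],
    }
```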


```diff
@@ -20,6 +20,7 @@ from typing import Callable, Optional
 import nvtx
 import torch

+from .config import NsightToolConfig
 from .profile import DistProfiler, ProfilerConfig
@@ -113,7 +114,7 @@ class NsightSystemsProfiler(DistProfiler):
     """Nsight system profiler. Installed in a worker to control the Nsight system profiler."""

-    def __init__(self, rank: int, config: Optional[ProfilerConfig], **kwargs):
+    def __init__(self, rank: int, config: Optional[ProfilerConfig], tool_config: Optional[NsightToolConfig], **kwargs):
         """Initialize the NsightSystemsProfiler.

         Args:
@@ -123,8 +124,13 @@ class NsightSystemsProfiler(DistProfiler):
         # If no configuration is provided, create a default ProfilerConfig with an empty list of ranks
         if not config:
             config = ProfilerConfig(ranks=[])
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.this_step: bool = False
-        self.discrete: bool = config.discrete
+        self.discrete: bool = tool_config.discrete
         self.this_rank: bool = False
         if config.all_ranks:
             self.this_rank = True
@@ -170,6 +176,9 @@ class NsightSystemsProfiler(DistProfiler):
         def decorator(func):
             @functools.wraps(func)
             def wrapper(self, *args, **kwargs):
+                if not self.profiler.enable:
+                    return func(self, *args, **kwargs)
                 profile_name = message or func.__name__
                 if self.profiler.this_step:
```


```diff
@@ -17,9 +17,8 @@ from typing import Callable, Optional
 import torch
 import torch.distributed
-from omegaconf import DictConfig, OmegaConf

-from .config import ProfilerConfig
+from .config import ProfilerConfig, TorchProfilerToolConfig


 class Profiler:
@@ -39,18 +38,23 @@ class Profiler:
         config: Configuration object containing profiling parameters
     """

-    def __init__(self, config):
+    def __init__(self, config: ProfilerConfig, tool_config: Optional[TorchProfilerToolConfig] = None):
         # note : if we do not set use_profile, it will be set as None, so that all function will be skip
-        if not isinstance(config, DictConfig):
-            config = OmegaConf.create(config)
+        if not config:
+            config = ProfilerConfig(ranks=[], enable=False)
+        if not tool_config:
+            assert not config.enable, "tool_config must be provided when profiler is enabled"
+        self.enable = config.enable
+        if not config.enable:
+            return
         self.config = config
-        self.skip_prof = False
+        self.tool_config = tool_config
         self.saved = False
         self.prof = None
         self.rank = torch.distributed.get_rank()
         # we need to validate the config before using the profiler
         self._validate()
-        if config.use_profile and self.rank in self.config.profile_ranks:
+        if self.rank in self.config.profile_ranks:
             print(f"[Profiler] Profiler init for rank {self.rank}")

             self.prof = torch.profiler.profile(
@@ -59,9 +63,9 @@ class Profiler:
                     torch.profiler.ProfilerActivity.CUDA,
                 ],
                 schedule=torch.profiler.schedule(
-                    wait=max(self.config.step_start - 1, 0),
-                    warmup=1 if self.config.step_start > 0 else 0,
-                    active=self.config.step_end - self.config.step_start,
+                    wait=max(self.tool_config.step_start - 1, 0),
+                    warmup=1 if self.tool_config.step_start > 0 else 0,
+                    active=self.tool_config.step_end - self.tool_config.step_start,
                     repeat=1,
                 ),
                 record_shapes=True,
@@ -73,9 +77,9 @@ class Profiler:
         if self.config.profile_ranks is None:
             print("[WARNING] Profile ranks is not set, default to rank 0")
             self.config.profile_ranks = [0]
-        assert self.config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-        assert self.config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-        assert self.config.step_start < self.config.step_end, (
+        assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
+        assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
+        assert self.tool_config.step_start < self.tool_config.step_end, (
             "[ERROR] Profile step start must be less than step end"
         )
```
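The schedule arithmetic above maps `step_start`/`step_end` from `TorchProfilerToolConfig` onto torch's wait/warmup/active phases; a sketch of the computation as a plain dict (no torch dependency, function name hypothetical):

```python
def torch_profiler_schedule(step_start: int, step_end: int) -> dict:
    """Mirror the torch.profiler.schedule arguments built in Profiler.__init__:
    wait until just before step_start, warm up for one step when possible, then
    stay active through step_end, for a single cycle."""
    assert 0 <= step_start < step_end, "step_start must be >= 0 and < step_end"
    return {
        "wait": max(step_start - 1, 0),
        "warmup": 1 if step_start > 0 else 0,
        "active": step_end - step_start,
        "repeat": 1,
    }
```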


```diff
@@ -122,7 +122,7 @@ class MegatronPPOActor(BasePPOActor):
         self.tf_config = tf_config
         self.actor_module = actor_module
         self.actor_optimizer: DistributedOptimizer = actor_optimizer
-        self.prof = Profiler(self.config.profile)
+        self.prof = Profiler(self.config.profiler)
         self.use_fused_kernels = self.config.get("use_fused_kernels", False)
         if self.use_fused_kernels:
             from verl.models.mcore.model_forward_fused import patch_fused_forward
@@ -600,7 +600,8 @@ class MegatronPPOActor(BasePPOActor):
         """
         metrics = {}
-        self.prof.start()
+        if self.prof.enable:
+            self.prof.start()
         for data in dataloader:
             data.to(get_device_id())
             self.actor_optimizer.zero_grad()
@@ -639,9 +640,11 @@ class MegatronPPOActor(BasePPOActor):
                     pass
                 else:
                     raise NotImplementedError
-            self.prof.step()
+            if self.prof.enable:
+                self.prof.step()
         # add empty cache after each compute
-        self.prof.stop_and_save()
-        self.prof.stop_trace()
+        if self.prof.enable:
+            self.prof.stop_and_save()
+            self.prof.stop_trace()
         get_torch_device().empty_cache()
         return metrics
```


```diff
@@ -19,6 +19,7 @@ from omegaconf import MISSING
 from verl.base_config import BaseConfig
 from verl.trainer.config import CheckpointConfig
+from verl.utils.profiler.config import ProfilerConfig

 from .engine import FSDPEngineConfig, McoreEngineConfig
 from .optimizer import OptimizerConfig
@@ -109,6 +110,7 @@ class ActorConfig(BaseConfig):
     checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
     optim: OptimizerConfig = field(default_factory=OptimizerConfig)
     use_fused_kernels: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate actor configuration parameters."""
@@ -218,6 +220,7 @@ class FSDPActorConfig(ActorConfig):
     entropy_checkpointing: bool = False
     fsdp_config: FSDPEngineConfig = field(default_factory=FSDPEngineConfig)
     use_remove_padding: bool = False
+    profiler: ProfilerConfig = field(default_factory=ProfilerConfig)

     def __post_init__(self):
         """Validate FSDP actor configuration parameters."""
```


```diff
@@ -72,7 +72,7 @@ from verl.utils.fsdp_utils import (
 )
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import compute_position_id_with_mask
-from verl.utils.profiler import DistProfiler, DistProfilerExtension, log_gpu_memory_usage, simple_timer
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig, log_gpu_memory_usage, simple_timer
 from verl.utils.profiler.performance import reduce_timing
 from verl.utils.py_functional import convert_to_regular_types
 from verl.workers.config import FSDPCriticConfig, FSDPEngineConfig
@@ -116,7 +116,6 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         Worker.__init__(self)
         self.config = config
-        self.profile_option = kwargs.get("profile_option", None)
         import torch.distributed

         if not torch.distributed.is_initialized():
@@ -170,9 +169,30 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
         # We can still use ProfilerConfig for testing purpose (tests/utils/test_nvtx_profile.py)
         # as they provides DictConfig-like interface
         # The benefit of creating the dataclass config is to perform validation during __post_init__
-        profiler_config = omega_conf_to_dataclass(config.get("profiler"))
+        if self._is_actor:
+            omega_profiler_config = config.actor.get("profiler", {})
+        elif self._is_rollout:
+            # NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
+            # This is for extendability in AsyncRL cases
+            omega_profiler_config = config.rollout.get("profiler", {})
+        elif self._is_ref:
+            omega_profiler_config = config.ref.get("profiler", {})
+        else:
+            raise ValueError(
+                f"Invalid role {self.role}, should be one of "
+                "['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
+            )
+        # omega_profiler_config is DictConfig
+        # profiler_config is a ProfilerConfig dataclass
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=profiler_config, option=self.profile_option)
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )

         self._is_offload_param = False
@@ -938,7 +958,17 @@ class ActorRolloutRefWorker(Worker, DistProfilerExtension):
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config: FSDPCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         import torch.distributed

         self.config = config
@@ -1336,8 +1366,18 @@ class RewardModelWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         import torch.distributed
```
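Each worker repeats the same tool-selection pattern. Its essence, sketched on plain dicts (the real code operates on DictConfig objects and converts the result through `omega_conf_to_dataclass`; the function name is hypothetical):

```python
def select_tool_config(profiler_cfg: dict):
    """Pick the sub-config matching profiler_cfg["tool"], as the workers do;
    returns None when no supported tool is selected."""
    tool = profiler_cfg.get("tool", None)
    if tool in ["npu", "nsys", "torch"]:
        return profiler_cfg.get("tool_config", {}).get(tool)
    return None
```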


@ -55,6 +55,7 @@ from verl.utils.profiler import (
DistProfiler, DistProfiler,
DistProfilerExtension, DistProfilerExtension,
GPUMemoryLogger, GPUMemoryLogger,
ProfilerConfig,
log_gpu_memory_usage, log_gpu_memory_usage,
simple_timer, simple_timer,
) )
@ -213,8 +214,31 @@ class ActorRolloutRefWorker(MegatronWorker, DistProfilerExtension):
self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"] self._is_rollout = self.role in ["rollout", "actor_rollout", "actor_rollout_ref"]
self._is_ref = self.role in ["ref", "actor_rollout_ref"] self._is_ref = self.role in ["ref", "actor_rollout_ref"]
profiler_config = omega_conf_to_dataclass(config.get("profiler")) if self._is_actor:
DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=profiler_config)) omega_profiler_config = config.actor.get("profiler", {})
elif self._is_rollout:
# NOTE: In colocation mode, rollout config may not take effect (follow the actor config)
# This is for extendability in AsyncRL cases
omega_profiler_config = config.rollout.get("profiler", {})
elif self._is_ref:
omega_profiler_config = config.ref.get("profiler", {})
else:
raise ValueError(
f"Invalid role {self.role}, should be one of "
"['actor', 'rollout', 'ref', 'actor_rollout', 'actor_rollout_ref']"
)
# omega_profiler_config is DictConfig
# profiler_config is a ProfilerConfig dataclass
profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
tool_config = omega_conf_to_dataclass(
omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
)
else:
tool_config = None
DistProfilerExtension.__init__(
self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
)
# TODO(sgm): Currently, we only support reference model param offload # TODO(sgm): Currently, we only support reference model param offload
# will support other offload later # will support other offload later
```diff
@@ -804,7 +828,18 @@ class AsyncActorRolloutRefWorker(ActorRolloutRefWorker):
 class CriticWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config: McoreCriticConfig):
         Worker.__init__(self)
-        DistProfilerExtension.__init__(self, DistProfiler(rank=self.rank, config=config.get("profiler")))
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
+        DistProfilerExtension.__init__(
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
+        )
         self.config: McoreCriticConfig = config
         # NOTE(sgm): We utilize colocate WorkerGroup by default.
```
```diff
@@ -1072,8 +1107,19 @@ class RewardModelWorker(MegatronWorker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
-        profiler_config = omega_conf_to_dataclass(config.get("profiler", {}), dataclass_type=ProfilerConfig)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self,
+            DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config),
         )
         self.config = config
```


```diff
@@ -30,7 +30,7 @@ from verl.utils.device import (
     get_device_id,
     get_nccl_backend,
 )
-from verl.utils.profiler import DistProfiler, DistProfilerExtension
+from verl.utils.profiler import DistProfiler, DistProfilerExtension, ProfilerConfig
 from verl.utils.py_functional import append_to_dict
 from verl.utils.torch_functional import masked_mean
 from verl.workers.engine import EngineRegistry
```
```diff
@@ -42,8 +42,16 @@ logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 class CriticWorker(Worker, DistProfilerExtension):
     def __init__(self, config):
         Worker.__init__(self)
+        omega_profiler_config = config.get("profiler", {})
+        profiler_config = omega_conf_to_dataclass(omega_profiler_config, dataclass_type=ProfilerConfig)
+        if omega_profiler_config.get("tool", None) in ["npu", "nsys", "torch"]:
+            tool_config = omega_conf_to_dataclass(
+                omega_profiler_config.get("tool_config", {}).get(omega_profiler_config.get("tool"))
+            )
+        else:
+            tool_config = None
         DistProfilerExtension.__init__(
-            self, DistProfiler(rank=self.rank, config=omega_conf_to_dataclass(config.get("profiler")))
+            self, DistProfiler(rank=self.rank, config=profiler_config, tool_config=tool_config)
         )
         import torch.distributed
```
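The tool-config resolution repeated across all of these workers follows one pattern: resolve the profiler section into a `ProfilerConfig`, then only look up a `tool_config` entry for the known tools (`npu`, `nsys`, `torch`). A minimal sketch of that pattern, assuming plain dataclasses and dicts in place of verl's `omega_conf_to_dataclass` and OmegaConf `DictConfig` (the `NsightToolConfig` fields shown are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NsightToolConfig:
    # Stand-in for verl.utils.profiler.config.NsightToolConfig
    discrete: bool = False

def resolve_tool_config(omega_profiler_config: dict) -> Optional[object]:
    """Mirror the if/else from the diff: only npu/nsys/torch get a
    per-tool config; any other (or missing) tool resolves to None."""
    tool = omega_profiler_config.get("tool", None)
    if tool in ["npu", "nsys", "torch"]:
        # In verl this value is passed through omega_conf_to_dataclass;
        # a plain dict lookup stands in for that step here.
        return omega_profiler_config.get("tool_config", {}).get(tool)
    return None

cfg = {"tool": "nsys", "tool_config": {"nsys": NsightToolConfig(discrete=True)}}
assert resolve_tool_config(cfg).discrete is True
assert resolve_tool_config({"tool": None}) is None
```

Because the same lookup appears verbatim in four workers, a follow-up refactor could hoist it into a shared helper, but this PR keeps it inline per worker.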