Mirror of https://github.com/volcengine/verl.git (synced 2025-10-20 13:43:50 +08:00)
[deployment, doc] feat: Add SkyPilot integration examples (#3333)
### What does this PR do?

Adds SkyPilot integration examples for running verl training jobs on Kubernetes/cloud platforms with GPUs. Includes configurations for PPO, GRPO, and multi-turn tool usage training.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+skypilot
- [x] Format the PR title as `[{modules}] {type}: {description}`

### Test

Validated SkyPilot YAML configurations for Ray cluster initialization, dataset downloading, and distributed training setup with H100 GPUs.

### API and Usage Example

```bash
# Launch PPO training on 2 nodes
sky launch -c verl-ppo examples/skypilot/verl-ppo.yaml --secret WANDB_API_KEY -y

# Launch GRPO training
sky launch -c verl-grpo examples/skypilot/verl-grpo.yaml --secret WANDB_API_KEY -y

# Launch multi-turn tool usage training
sky launch -c verl-multiturn examples/skypilot/verl-multiturn-tools.yaml --secret WANDB_API_KEY --secret HF_TOKEN -y
```

### Design & Code Changes

- Added 3 SkyPilot YAML configurations for PPO, GRPO, and multi-turn training
- Added `examples/skypilot/README.md` with setup guide
- Added `docs/examples/skypilot_examples.rst` documentation
- Updated `docs/index.rst` and `docs/start/multinode.rst` with references

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
docs/examples/skypilot_examples.rst (new file, 144 lines)
@@ -0,0 +1,144 @@
SkyPilot Examples
=================

This guide provides examples of running verl reinforcement learning training on Kubernetes clusters or cloud platforms with GPU nodes using `SkyPilot <https://github.com/skypilot-org/skypilot>`_.

Installation and Configuration
-------------------------------

Step 1: Install SkyPilot
~~~~~~~~~~~~~~~~~~~~~~~~~

Choose the installation based on your target platform:

.. code-block:: bash

   # For Kubernetes only
   pip install "skypilot[kubernetes]"

   # For AWS
   pip install "skypilot[aws]"

   # For Google Cloud Platform
   pip install "skypilot[gcp]"

   # For Azure
   pip install "skypilot[azure]"

   # For multiple platforms
   pip install "skypilot[kubernetes,aws,gcp,azure]"

Step 2: Configure Your Platform
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See https://docs.skypilot.co/en/latest/getting-started/installation.html
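
After configuring credentials, you can confirm that SkyPilot sees your platform with ``sky check``, which reports which infrastructures are enabled:

.. code-block:: bash

   # Verify that the configured platform(s) are enabled
   sky check
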
Step 3: Set Up Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Export necessary API keys for experiment tracking:

.. code-block:: bash

   # For Weights & Biases tracking
   export WANDB_API_KEY="your-wandb-api-key"

   # For HuggingFace gated models (if needed)
   export HF_TOKEN="your-huggingface-token"

Examples
--------

All example configurations are available in the `examples/skypilot/ <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ directory on GitHub. See the `README <https://github.com/volcengine/verl/blob/main/examples/skypilot/README.md>`_ for additional details.
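
The launch commands below reference the YAML files by relative path, so run them from ``examples/skypilot/`` in a local checkout of verl. A minimal sketch:

.. code-block:: bash

   # Clone the repository and enter the SkyPilot examples directory
   git clone https://github.com/volcengine/verl.git
   cd verl/examples/skypilot
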
PPO Training
~~~~~~~~~~~~

.. code-block:: bash

   sky launch -c verl-ppo verl-ppo.yaml --secret WANDB_API_KEY -y

Runs PPO training on the GSM8K dataset using the Qwen2.5-0.5B-Instruct model across 2 nodes with H100 GPUs. Based on examples in ``examples/ppo_trainer/``.

`View verl-ppo.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-ppo.yaml>`_

GRPO Training
~~~~~~~~~~~~~

.. code-block:: bash

   sky launch -c verl-grpo verl-grpo.yaml --secret WANDB_API_KEY -y

Runs GRPO (Group Relative Policy Optimization) training on the MATH dataset using the Qwen2.5-7B-Instruct model, with a memory-optimized configuration for 2 nodes. Based on examples in ``examples/grpo_trainer/``.

`View verl-grpo.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-grpo.yaml>`_

Multi-turn Tool Usage Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   sky launch -c verl-multiturn verl-multiturn-tools.yaml \
     --secret WANDB_API_KEY --secret HF_TOKEN -y

Single-node training with 8xH100 GPUs for multi-turn tool usage with Qwen2.5-3B-Instruct. Includes tool and interaction configurations for GSM8K. Based on examples in ``examples/sglang_multiturn/`` but uses vLLM instead of SGLang.

`View verl-multiturn-tools.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-multiturn-tools.yaml>`_

Configuration
-------------

The example YAML files are pre-configured with:

- **Infrastructure**: Kubernetes clusters (``infra: k8s``) - can be changed to ``infra: aws``, ``infra: gcp``, etc. (see the example after this list)
- **Docker Image**: verl's official Docker image with CUDA 12.6 support
- **Setup**: Automatically clones and installs verl from source
- **Datasets**: Downloads required datasets during the setup phase
- **Ray Cluster**: Configures distributed training across nodes
- **Logging**: Supports Weights & Biases via ``--secret WANDB_API_KEY``
- **Models**: Supports gated HuggingFace models via ``--secret HF_TOKEN``
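
For example, to retarget the PPO recipe at AWS instead of Kubernetes, copy the config and change the ``infra`` field (the copied file name below is only an illustration):

.. code-block:: bash

   # Make a copy of the PPO config that targets AWS instead of Kubernetes
   cp verl-ppo.yaml verl-ppo-aws.yaml
   sed -i 's/infra: k8s/infra: aws/' verl-ppo-aws.yaml
   sky launch -c verl-ppo-aws verl-ppo-aws.yaml --secret WANDB_API_KEY -y
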
Launch Command Options
----------------------

- ``-c <name>``: Cluster name for managing the job
- ``--secret KEY``: Pass secrets for API keys (can be used multiple times)
- ``-y``: Skip confirmation prompt

Monitoring Your Jobs
--------------------

Check Cluster Status
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   sky status

View Logs
~~~~~~~~~

.. code-block:: bash

   sky logs verl-ppo  # View logs for the PPO job

SSH into Head Node
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   ssh verl-ppo

Access Ray Dashboard
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   sky status --endpoint 8265 verl-ppo  # Get dashboard URL

Stop a Cluster
~~~~~~~~~~~~~~

.. code-block:: bash

   sky down verl-ppo
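
If you would rather not tear the cluster down immediately, SkyPilot can also stop it after a period of inactivity; whether stopping is supported depends on the backing infrastructure, so treat this as a sketch:

.. code-block:: bash

   # Automatically stop the cluster after 30 idle minutes
   sky autostop verl-ppo -i 30
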
docs/index.rst
@@ -62,6 +62,7 @@ verl is fast with:

   examples/ppo_code_architecture
   examples/gsm8k_example
   examples/multi_modal_example
   examples/skypilot_examples

.. toctree::
   :maxdepth: 1
docs/start/multinode.rst
@@ -69,6 +69,15 @@ Submit job to ray cluster

Option 2: Launch via SkyPilot on Kubernetes or clouds
------------------------------------------------------

.. note::
   Ready-to-use SkyPilot example configurations are available in the `examples/skypilot/ <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ directory:

   - ``verl-ppo.yaml`` - PPO training with the GSM8K dataset
   - ``verl-grpo.yaml`` - GRPO training with the MATH dataset
   - ``verl-multiturn-tools.yaml`` - Multi-turn tool usage training

See the `SkyPilot examples README <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ for detailed usage instructions.
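
For example, to launch the PPO recipe from the repository root:

.. code-block:: bash

   # Launch PPO training on 2 nodes
   sky launch -c verl-ppo examples/skypilot/verl-ppo.yaml --secret WANDB_API_KEY -y
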
Step 1: Setup SkyPilot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SkyPilot supports multiple clouds; here we use GCP as an example. See `install skypilot <https://docs.skypilot.co/en/latest/getting-started/installation.html>`_.
examples/skypilot/README.md (new file, 107 lines)
@@ -0,0 +1,107 @@
# verl with SkyPilot

Run verl reinforcement learning training jobs on Kubernetes clusters or cloud platforms with GPU nodes using [SkyPilot](https://github.com/skypilot-org/skypilot).

## Installation and Configuration

### Step 1: Install SkyPilot

Choose the installation based on your target platform:

```bash
# For Kubernetes only
pip install "skypilot[kubernetes]"

# For AWS
pip install "skypilot[aws]"

# For Google Cloud Platform
pip install "skypilot[gcp]"

# For Azure
pip install "skypilot[azure]"

# For multiple platforms
pip install "skypilot[kubernetes,aws,gcp,azure]"
```

### Step 2: Configure Your Platform

See https://docs.skypilot.co/en/latest/getting-started/installation.html

### Step 3: Set Up Environment Variables

Export necessary API keys for experiment tracking:

```bash
# For Weights & Biases tracking
export WANDB_API_KEY="your-wandb-api-key"

# For HuggingFace gated models (if needed)
export HF_TOKEN="your-huggingface-token"
```
## Examples

### PPO Training

```bash
sky launch -c verl-ppo verl-ppo.yaml --secret WANDB_API_KEY -y
```

Runs PPO training on the GSM8K dataset using the Qwen2.5-0.5B-Instruct model across 2 nodes with H100 GPUs. Based on examples in [`../ppo_trainer/`](../ppo_trainer/).

### GRPO Training

```bash
sky launch -c verl-grpo verl-grpo.yaml --secret WANDB_API_KEY -y
```

Runs GRPO (Group Relative Policy Optimization) training on the MATH dataset using the Qwen2.5-7B-Instruct model, with a memory-optimized configuration for 2 nodes. Based on examples in [`../grpo_trainer/`](../grpo_trainer/).

### Multi-turn Tool Usage Training

```bash
sky launch -c verl-multiturn verl-multiturn-tools.yaml --secret WANDB_API_KEY --secret HF_TOKEN -y
```

Single-node training with 8xH100 GPUs for multi-turn tool usage with Qwen2.5-3B-Instruct. Includes tool and interaction configurations for GSM8K. Based on examples in [`../sglang_multiturn/`](../sglang_multiturn/) but uses vLLM instead of SGLang.

## Configuration

The example YAML files are pre-configured with:

- **Infrastructure**: Kubernetes clusters (`infra: k8s`) - can be changed to `infra: aws`, `infra: gcp`, etc.
- **Docker Image**: verl's official Docker image with CUDA 12.6 support
- **Setup**: Automatically clones and installs verl from source
- **Datasets**: Downloads required datasets during the setup phase
- **Ray Cluster**: Configures distributed training across nodes
- **Logging**: Supports Weights & Biases via `--secret WANDB_API_KEY`
- **Models**: Supports gated HuggingFace models via `--secret HF_TOKEN`
## Launch Command Options

- `-c <name>`: Cluster name for managing the job
- `--secret KEY`: Pass secrets for API keys (can be used multiple times)
- `-y`: Skip confirmation prompt

## Monitoring Your Jobs

### Check cluster status

```bash
sky status
```

### View logs

```bash
sky logs verl-ppo  # View logs for the PPO job
```
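
Each `sky launch` submits a job to the cluster's job queue. If you have launched several runs on the same cluster, you can list them and fetch logs for a specific job ID (the ID `1` below is just an illustration):

```bash
# List jobs submitted to the cluster
sky queue verl-ppo

# View logs for a specific job ID
sky logs verl-ppo 1
```
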
### SSH into head node

```bash
ssh verl-ppo
```

### Access Ray dashboard

```bash
sky status --endpoint 8265 verl-ppo  # Get dashboard URL
```

### Stop a cluster

```bash
sky down verl-ppo
```
examples/skypilot/verl-grpo.yaml (new file, 99 lines)
@@ -0,0 +1,99 @@
resources:
  infra: k8s
  accelerators: H100:1
  memory: 128+
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  ports: 8265

num_nodes: 2

secrets:
  WANDB_API_KEY:

setup: |
  rm -rf verl
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip3 install -v -e .[vllm]
  pip3 install flashinfer-python
  echo "Downloading Math dataset..."
  mkdir -p ~/data/math
  python3 "$(pwd)/examples/data_preprocess/math_dataset.py" --local_dir ~/data/math
  echo "Math dataset download completed"

run: |
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NUM_NODES=$SKYPILOT_NUM_NODES
  NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE

  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    echo "Starting Ray head node..."
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats \
      --port=6379 \
      --dashboard-host=0.0.0.0 \
      --dashboard-port=8265

    # Wait for all worker nodes to join
    retry_count=0
    max_retries=30
    while [ $retry_count -lt $max_retries ]; do
      connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
      echo "Connected nodes: $connected_nodes/$NUM_NODES (attempt $((retry_count+1))/$max_retries)"

      if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
        echo "All nodes connected to Ray cluster"
        break
      fi

      retry_count=$((retry_count+1))
      sleep 10
    done

    python3 -m verl.trainer.main_ppo \
      algorithm.adv_estimator=grpo \
      data.train_files=$HOME/data/math/train.parquet \
      data.val_files=$HOME/data/math/test.parquet \
      data.train_batch_size=32 \
      data.max_prompt_length=256 \
      data.max_response_length=256 \
      data.filter_overlong_prompts=True \
      data.truncation='error' \
      actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.model.use_remove_padding=True \
      actor_rollout_ref.actor.ppo_mini_batch_size=16 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
      actor_rollout_ref.actor.ppo_epochs=1 \
      actor_rollout_ref.actor.use_kl_loss=False \
      actor_rollout_ref.actor.entropy_coeff=0 \
      actor_rollout_ref.model.enable_gradient_checkpointing=True \
      actor_rollout_ref.actor.fsdp_config.param_offload=True \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.rollout.n=1 \
      actor_rollout_ref.rollout.enable_chunked_prefill=True \
      actor_rollout_ref.rollout.max_num_batched_tokens=2048 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      algorithm.use_kl_in_reward=False \
      trainer.critic_warmup=0 \
      trainer.logger=[console,wandb] \
      trainer.project_name=verl_math_grpo_demo \
      trainer.experiment_name=qwen25_7b_grpo \
      trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
      trainer.nnodes=$NUM_NODES \
      trainer.save_freq=-1 \
      trainer.test_freq=-1 \
      trainer.total_epochs=1

  else
    sleep 15
    echo "Starting Ray worker node..."
    ps aux | grep ray | grep $HEAD_IP:6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
    sleep 10
  fi

  echo "Node setup and Ray start script finished for rank $SKYPILOT_NODE_RANK."
examples/skypilot/verl-multiturn-tools.yaml (new file, 91 lines)
@@ -0,0 +1,91 @@
resources:
  infra: k8s
  accelerators: H100:8
  memory: 128+
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  ports: 8265

num_nodes: 1

secrets:
  WANDB_API_KEY:
  HF_TOKEN:  # in case you're using gated models from the HF hub

setup: |
  rm -rf verl
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip3 install -v -e .[vllm]
  pip3 install flashinfer-python
  pip install "transformers<4.54.0"  # https://github.com/vllm-project/vllm-ascend/issues/2046
  # Download GSM8K dataset for multiturn tool training
  echo "Downloading GSM8K dataset..."
  mkdir -p ~/data/gsm8k
  python3 "$(pwd)/examples/data_preprocess/gsm8k.py" --local_dir ~/data/gsm8k
  echo "GSM8K dataset download completed"

run: |
  NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE
  PROJECT_DIR="$(pwd)/verl"
  CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"

  # Single node setup - no worker coordination needed
  echo "Starting Ray head node..."
  ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats \
    --port=6379 \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265

  cd verl

  python3 -m verl.trainer.main_ppo \
    --config-path="$CONFIG_PATH" \
    --config-name='gsm8k_multiturn_grpo' \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=512 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=True \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=512 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=64 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=64 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=[console,wandb] \
    trainer.project_name=verl_multiturn_tools \
    trainer.experiment_name=qwen25_7b_gsm8k_multiturn_tools \
    trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
    trainer.nnodes=1 \
    trainer.save_freq=10 \
    trainer.test_freq=5 \
    trainer.total_epochs=10 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=8192 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=8192 \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=8192 \
    critic.ppo_max_token_len_per_gpu=8192 \
    critic.forward_max_token_len_per_gpu=8192 \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
    actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/interaction_config/gsm8k_interaction_config.yaml" \
    actor_rollout_ref.rollout.multi_turn.max_user_turns=1

  echo "Node setup and Ray start script finished for rank $SKYPILOT_NODE_RANK."
examples/skypilot/verl-ppo.yaml (new file, 109 lines)
@@ -0,0 +1,109 @@
resources:
  infra: k8s
  accelerators: H100:1
  memory: 128+
  image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
  ports: 8265

num_nodes: 2

secrets:
  WANDB_API_KEY:

setup: |
  rm -rf verl
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip3 install -v -e .[vllm]
  pip3 install flashinfer-python
  # Download GSM8K dataset
  echo "Downloading GSM8K dataset..."
  mkdir -p ~/data/gsm8k
  # Check that the preprocessing script exists and use an absolute path
  if [ -f "$(pwd)/examples/data_preprocess/gsm8k.py" ]; then
    python3 "$(pwd)/examples/data_preprocess/gsm8k.py" --local_dir ~/data/gsm8k
  else
    echo "Warning: gsm8k.py script not found, skipping dataset download"
    # You might want to download the dataset manually or use a different approach
  fi
  echo "GSM8K dataset download completed"

run: |
  # Get the head node's IP and the total number of nodes
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NUM_NODES=$SKYPILOT_NUM_NODES

  # Optional explicit wandb login (the WANDB_API_KEY secret is already exported)
  # python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"

  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    # Head node starts the Ray head
    echo "Starting Ray head node..."
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats \
      --port=6379 \
      --dashboard-host=0.0.0.0 \
      --dashboard-port=8265

    # Wait for all worker nodes to join the cluster
    echo "Waiting for all nodes to join Ray cluster..."
    retry_count=0
    max_retries=30
    while [ $retry_count -lt $max_retries ]; do
      connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
      echo "Connected nodes: $connected_nodes/$NUM_NODES (attempt $((retry_count+1))/$max_retries)"

      if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
        echo "All nodes connected to Ray cluster"
        break
      fi

      retry_count=$((retry_count+1))
      sleep 10
    done

    if [ $retry_count -eq $max_retries ]; then
      echo "WARNING: Not all nodes connected to Ray cluster after $max_retries attempts"
      echo "Current Ray status:"
      ray status
    fi

    python3 -m verl.trainer.main_ppo \
      data.train_files=$HOME/data/gsm8k/train.parquet \
      data.val_files=$HOME/data/gsm8k/test.parquet \
      data.train_batch_size=256 \
      data.max_prompt_length=512 \
      data.max_response_length=256 \
      actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=64 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
      critic.optim.lr=1e-5 \
      critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
      critic.ppo_micro_batch_size_per_gpu=4 \
      algorithm.kl_ctrl.kl_coef=0.001 \
      trainer.logger=[console,wandb] \
      trainer.val_before_train=False \
      trainer.default_hdfs_dir=null \
      trainer.n_gpus_per_node=1 \
      trainer.nnodes=2 \
      trainer.save_freq=20 \
      trainer.test_freq=20 \
      trainer.total_epochs=2 \
      trainer.project_name=verl_examples \
      trainer.experiment_name=experiment_name_gsm8k

  else
    # Wait for the Ray head to start
    sleep 15
    # Worker node joins the Ray cluster
    echo "Starting Ray worker node..."
    ps aux | grep ray | grep $HEAD_IP:6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
    sleep 10
  fi

  echo "Node setup and Ray start script finished for rank $SKYPILOT_NODE_RANK."