mirror of
https://github.com/volcengine/verl.git
synced 2025-10-20 13:43:50 +08:00
### Summary

#### Minimize Test Workloads

This PR minimizes the test workloads while keeping them meaningful, reducing the time cost of a test from >10 min to 1-2 min. Specifically, we

1. set batch sizes and step counts to small but still meaningful numbers (note the prompt-level sizes divide by the number of responses per prompt, as the inline comments indicate):

```bash
train_traj_micro_bsz_per_gpu=2 # b
n_resp_per_prompt=4 # g

train_traj_micro_bsz=$((train_traj_micro_bsz_per_gpu * NUM_GPUS)) # b * n
train_traj_mini_bsz=$((train_traj_micro_bsz * 2)) # 2 * b * n
train_prompt_mini_bsz=$((train_traj_mini_bsz / n_resp_per_prompt)) # 2 * b * n / g
train_prompt_bsz=$((train_prompt_mini_bsz * 2)) # 4 * b * n / g
# ...
TOT_TRAIN_STEPS=${TOT_TRAIN_STEPS:-1}
```

2. disable validation (which is expensive), saving, and resuming for training tests by default, leaving them to specialized tests:

```bash
# Validation
VAL_BEFORE_TRAIN=${VAL_BEFORE_TRAIN:-False}
TEST_FREQ=${TEST_FREQ:--1}
# Save & Resume
RESUME_MODE=${RESUME_MODE:-disable}
SAVE_FREQ=${SAVE_FREQ:--1}
```

#### Improve Triggering Mode

This PR introduces a more comprehensive triggering logic. Specifically, we

1. consider all Python code by default
2. include related entrypoints (the workflow config, the scripts it uses, the Hydra config, etc.)
3. exclude unrelated Python code from other components (e.g., recipes, examples, Megatron, SFT, generation, and evaluation for FSDP training)

An example from `e2e_ppo_trainer`:

```yaml
on:
  paths:
    - "**/*.py"
    # Entrypoints
    - ".github/workflows/e2e_ppo_trainer.yml"
    - "examples/data_preprocess/gsm8k.py"
    - "examples/data_preprocess/geo3k.py"
    - "tests/e2e/ppo_trainer"
    - "verl/trainer/main_ppo.py"
    - "verl/trainer/config/ppo_trainer.yaml"
    - "!examples"
    - "!verl/trainer/main_*.py"
    - "!verl/trainer/fsdp_sft_trainer.py"
    # Recipes
    - "!recipe"
    # Megatron
    - "!verl/workers/**/megatron_*.py"
```

#### Avoid Missing Errors

Some test scripts did not end with the main Python command, so a failure in it could go unnoticed.
To address this issue, this PR introduces the following options:

```bash
set -xeuo pipefail
```

which mean:

- `x`: print each command before executing it (useful for debugging)
- `e`: exit immediately if any command fails (returns a non-zero exit status)
- `u`: treat unset variables as an error
- `o pipefail`: return the exit status of the last command in the pipeline that failed, or zero if all succeeded

Together, these options make the script fail fast and produce verbose output, which helps with debugging and ensures the script does not continue after encountering an error.

#### Others

Besides, we also

1. unify runner labels into `"L20x8"` to enable preemptive scheduling of jobs
2. consolidate test scripts that differ only minimally, grouped by entrypoint (e.g., `ppo_trainer`, `ppo_megatron_trainer`, recipes), into a single base script driven by options
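The batch-size arithmetic above can be checked with concrete numbers. A minimal sketch, assuming `NUM_GPUS=8` purely for illustration; the prompt-level sizes divide by `n_resp_per_prompt`, matching the inline comments (`2 * b * n / g`):

```shell
#!/usr/bin/env bash
# Evaluate the batch-size chain with b=2, g=4, and 8 GPUs (illustrative values).
NUM_GPUS=8
train_traj_micro_bsz_per_gpu=2                                     # b
n_resp_per_prompt=4                                                # g
train_traj_micro_bsz=$((train_traj_micro_bsz_per_gpu * NUM_GPUS))  # b * n   = 16
train_traj_mini_bsz=$((train_traj_micro_bsz * 2))                  # 2*b*n   = 32
train_prompt_mini_bsz=$((train_traj_mini_bsz / n_resp_per_prompt)) # 2*b*n/g = 8
train_prompt_bsz=$((train_prompt_mini_bsz * 2))                    # 4*b*n/g = 16
echo "traj_micro=${train_traj_micro_bsz} traj_mini=${train_traj_mini_bsz} prompt_mini=${train_prompt_mini_bsz} prompt=${train_prompt_bsz}"
```

With these values, every trajectory batch stays a small multiple of the GPU count, so a single step exercises the full pipeline without inflating runtime.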
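The value of `pipefail` for catching masked failures can be seen in a minimal standalone sketch (no verl-specific code assumed):

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's status is that of its last command, so a
# failing producer is masked by a succeeding consumer.
set +e
set +o pipefail
false | cat
status_without_pipefail=$?   # 0: cat succeeded, the failure of `false` is hidden
set -o pipefail
false | cat
status_with_pipefail=$?      # 1: the failure of `false` propagates
echo "without=${status_without_pipefail} with=${status_with_pipefail}"
```

This is exactly the case where a test script that pipes its main Python command through `tee` or `grep` would otherwise report success despite a crash.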
18 lines
706 B
Bash
```bash
#!/usr/bin/env bash

set -uxo pipefail

export VERL_HOME=${VERL_HOME:-"${HOME}/verl"}
export TRAIN_FILE=${TRAIN_FILE:-"${VERL_HOME}/data/dapo-math-17k.parquet"}
export TEST_FILE=${TEST_FILE:-"${VERL_HOME}/data/aime-2024.parquet"}
export OVERWRITE=${OVERWRITE:-0}

mkdir -p "${VERL_HOME}/data"

if [ ! -f "${TRAIN_FILE}" ] || [ "${OVERWRITE}" -eq 1 ]; then
  wget -O "${TRAIN_FILE}" "https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/resolve/main/data/dapo-math-17k.parquet?download=true"
fi

if [ ! -f "${TEST_FILE}" ] || [ "${OVERWRITE}" -eq 1 ]; then
  wget -O "${TEST_FILE}" "https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024/resolve/main/data/aime-2024.parquet?download=true"
fi
```
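The script relies on the `${VAR:-default}` expansion throughout, so callers can override any setting via the environment without editing the file. A minimal sketch of the idiom:

```shell
#!/usr/bin/env bash
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# and to the default otherwise.
unset OVERWRITE
default_val=${OVERWRITE:-0}   # OVERWRITE unset -> falls back to 0
OVERWRITE=1
override_val=${OVERWRITE:-0}  # OVERWRITE set -> caller's value wins
echo "default=${default_val} override=${override_val}"
```

In practice this means an invocation like `OVERWRITE=1 bash <script>` re-downloads the datasets, while a plain run skips files that already exist.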