verl/.gitignore
Blue Space ccab83654c Megatron checkpoint: do not save hf_models by default, and provide a model merge tool. (#780)
Because CI is too slow, the checkpoint-related features and fixes are
combined here in one PR.

# Add layer index to decoder layers

It turns out to be hard to attach the "correct" layer number to each
layer: in verl's current Megatron implementation, every pp and vpp
rank's layers start from index 0, which is inconvenient for the merging
tool.

The difficulty mainly comes from the `torch.nn.ModuleList`
implementation, [which suggests, and in fact forces, indexing layers by
their plain position rather than by a custom layer
number](8a40fca9a1/torch/nn/modules/container.py (L302C5-L324C66)).
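
For illustration, a minimal standalone example (not verl code) of why the checkpoint keys restart at 0 on every rank: `nn.ModuleList` always registers its children under their positional index.

```python
import torch.nn as nn

# nn.ModuleList registers each child under str(position), so the
# state_dict keys of a pp/vpp shard always start at layer 0, no matter
# where the shard sits in the full model.
layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])
print(list(layers.state_dict().keys()))
# ['0.weight', '0.bias', '1.weight', '1.bias']
```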

The current solution is to rewrite each layer number to its actual
global index (derived from the pp and vpp offsets) when saving a
Megatron checkpoint, and to restore the local index when loading. The
merging tool then needs no extra scans, as sketched below.
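
A minimal sketch of the save-time key remapping, assuming the usual convention for Megatron's interleaved (vpp) schedule, `global = (vpp_rank * pp_size + pp_rank) * layers_per_chunk + local`; the helper names are hypothetical and not the actual verl implementation, and loading applies the inverse mapping.

```python
import re

def local_to_global(local_idx, pp_rank, vpp_rank, pp_size, layers_per_chunk):
    # Assumed convention for the interleaved schedule: each vpp chunk
    # owns a contiguous block of layers on every pp rank.
    return (vpp_rank * pp_size + pp_rank) * layers_per_chunk + local_idx

def remap_layer_keys(state_dict, pp_rank, vpp_rank, pp_size, layers_per_chunk):
    """Rewrite '...layers.<i>...' keys to use the global layer index."""
    pattern = re.compile(r"(\.layers\.)(\d+)(\.)")

    def repl(m):
        g = local_to_global(int(m.group(2)), pp_rank, vpp_rank,
                            pp_size, layers_per_chunk)
        return f"{m.group(1)}{g}{m.group(3)}"

    return {pattern.sub(repl, k): v for k, v in state_dict.items()}
```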

# HuggingFace model loader logic simplified

Since every rank has access to the `state_dict`, there is actually no
need to broadcast the weights from rank 0 across the mp and dp groups
at all. The previous implementation was too costly and could cause OOM,
because every rank could end up holding the whole model in GPU memory.

The loader logic was also not straightforward: each rank only needs to
load its own vpp_size worth of layers, so there is no reason to iterate
over the whole num_layers.

So the current solution is that every rank loads its own sharded
weights from the `state_dict`, as sketched below.
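
A rough sketch of the per-rank loading loop, under the same layer-offset assumption as above (names are hypothetical, and the HF-to-Megatron parameter-name mapping is elided):

```python
def load_own_shard(hf_state_dict, model_chunks, pp_rank, pp_size, layers_per_chunk):
    # Every rank reads only the layers it owns from the full HF
    # state_dict, instead of rank 0 loading everything and broadcasting.
    for vpp_rank, chunk in enumerate(model_chunks):  # one chunk per vpp stage
        for local_idx, layer in enumerate(chunk.layers):
            g = (vpp_rank * pp_size + pp_rank) * layers_per_chunk + local_idx
            prefix = f"model.layers.{g}."
            shard = {k[len(prefix):]: v
                     for k, v in hf_state_dict.items() if k.startswith(prefix)}
            layer.load_state_dict(shard, strict=False)  # name mapping elided
```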

But this requires that the storage is reachable from every compute
node. For users who can only keep the HuggingFace model on rank 0, we
move the original implementation into a deprecated module alongside the
new version of the file.

# Modify test scripts to reuse downloaded HuggingFace models

This avoids errors when connecting to HuggingFace to fetch model metadata.
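
For example (an assumed pattern, not necessarily the exact change in the test scripts), pointing `from_pretrained` at the local copy with `local_files_only=True` keeps transformers from issuing any Hub metadata request:

```python
from transformers import AutoConfig, AutoTokenizer

# Hypothetical local path to a model the CI machine already downloaded.
local_path = "/data/models/Qwen2.5-0.5B-Instruct"

# local_files_only=True makes transformers resolve everything from
# disk, so no request is sent to the HuggingFace Hub.
config = AutoConfig.from_pretrained(local_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(local_path, local_files_only=True)
```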

# Modify CI workflows to load-balance CI machines

Currently the L20-0 machine takes 6 more jobs than L20-1; rebalancing
them reduces the pipeline bubble of each task.

**/*.pt
**/checkpoints
**/wget-log
**/_build/
**/*.ckpt
**/outputs
**/*.tar.gz
**/playground
**/wandb
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
dataset/*
tensorflow/my_graph/*
.idea/
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
tmp/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
.venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# vscode
.vscode
# Mac
.DS_Store
# output logs
tests/e2e/toy_examples/deepspeed/synchronous/output.txt
# vim
*.swp
# lock files
*.lock
# data
*.parquet
# local logs
logs
log
outputs