!225 Remove PreTrainer-related content

Merge pull request !225 from humphrey007/master
This commit is contained in:
humphrey007
2025-06-05 07:47:45 +00:00
committed by i-robot
parent 46bffba06d
commit 0adb659e5c
16 changed files with 78 additions and 2133 deletions

View File

@ -1,124 +0,0 @@
# PreTrainer Module APIs
## openmind.PreTrainer Class
The `PreTrainer` class provides common functions for pre-training process management.
**Parameters**
| Parameter | Type | Description | Default Value |
| ---------------- | ------------------------------------------- |---------------|------|
| pretrain_args | PreTrainingArguments | Pre-training arguments | - |
| accelerator | Accelerator | Accelerate instance| None |
| model | torch.nn.Module | Torch model | None |
| optimizer | accelerate.utils.MegatronLMOptimizerWrapper | Optimizer | None |
| lr_scheduler | accelerate.utils.MegatronLMSchedulerWrapper | Scheduler | None |
| train_dataloader | torch.utils.data.DataLoader | Training data loader | None |
| eval_dataloader | torch.utils.data.DataLoader | Evaluation data loader | None |
### train
Starts pre-training.
**Prototype**
```python
def train()
```
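The following minimal sketch shows the intended call pattern; the arguments and the dataloader are placeholders and assume a prepared training setup (see the model pre-training tutorial for a complete example).
```python
from openmind import PreTrainer, PreTrainingArguments

# Placeholder arguments; see the PreTrainingArguments class below for details.
pretrain_args = PreTrainingArguments(num_training_steps=1000, micro_batch_size=4, dp=1)

# Assumed to be built by your own data pipeline (a torch.utils.data.DataLoader).
train_dataloader = ...

pretrainer = PreTrainer(pretrain_args=pretrain_args, train_dataloader=train_dataloader)
pretrainer.train()
```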
## openmind.PreTrainingArguments Class
The `PreTrainingArguments` class configures parameters of a training job, including hyperparameters required during training, model save path, and learning rate.
**Parameters**
| Parameter | Type| Description | Default Value for PyTorch |
| --------------------------- | ---- |-------------------|-----------------------|
| num_training_steps | int | Number of training steps | - |
| micro_batch_size | int | Size of a micro batch | - |
| dp | int | Degree of data parallelism | - |
| gradient_accumulation_steps | int | Number of gradient accumulation steps | 1 |
| seq_length | int | Maximum length of a sequence | None |
| megatron_dataset_flag | bool | Whether the dataset is Megatron-formatted | None |
| data_path | str | Dataset path | None |
| save_dir | str | Model saving path | None |
| save_interval | int | Model saving interval | None |
| eval_interval | int | Model evaluation interval | None |
| openmind_model_path | str | Model path | None |
| dtype | str | Runtime data type | bf16 |
| plugin_args | dict | [Accelerate plugin parameter](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMPlugin) | None |
| dataloader_config | dict | [Loader configuration parameter](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMDummyDataLoader) | None |
| report_to | str | Accelerate log output object| None |
| project_name | str | Project name | "accelerate-megatron" |
### from_yaml
Loads configurations from the YAML configuration file.
**Prototype**
```python
def from_yaml(config_path: str)
```
**Parameters**
| Parameter | Description | Supported Type|
| ----------- |-------------| -------- |
| config_path | Path of the YAML configuration file| str |
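Example (the path is a placeholder for a local YAML file containing the fields listed above):
```python
from openmind import PreTrainingArguments

# Placeholder path; replace it with a local YAML configuration file.
pretrain_args = PreTrainingArguments.from_yaml("llama2_config/llama2-megatron-json-dataset.yaml")
```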
### get_mixed_precision
Obtains the mixed precision type.
**Prototype**
```python
def get_mixed_precision()
```
### get_torch_dtype
Obtains the runtime data type.
**Prototype**
```python
def get_torch_dtype()
```
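A small sketch of the two getters above; the commented values are expectations based on `dtype: bf16`, not guaranteed return formats.
```python
from openmind import PreTrainingArguments

args = PreTrainingArguments(num_training_steps=10, micro_batch_size=1, dp=1, dtype="bf16")
print(args.get_mixed_precision())  # expected: the mixed-precision mode derived from dtype (bf16)
print(args.get_torch_dtype())      # expected: the corresponding torch dtype (e.g. torch.bfloat16)
```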
### get_distributed_train_args
Obtains distributed pre-training parameters.
**Prototype**
```python
def get_distributed_train_args()
```
### update_distributed_train_args
Updates distributed pre-training parameters.
**Prototype**
```python
def update_distributed_train_args(extra_args: dict)
```
**Parameters**
| Parameter | Description | Supported Type|
| ---------- |-------------| -------- |
| extra_args | Additional parameter for distributed pre-training| dict |
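For example, the pre-training tutorial uses `extra_args` to register custom Megatron functions; the sketch below registers a single hypothetical `get_batch` implementation.
```python
from openmind import PreTrainingArguments

def my_get_batch(data_iterator):
    """Hypothetical stand-in for a Megatron-style get_batch implementation."""
    raise NotImplementedError

# Placeholder path; replace it with a local YAML configuration file.
pretrain_args = PreTrainingArguments.from_yaml("llama2_config/llama2-megatron-json-dataset.yaml")
pretrain_args.update_distributed_train_args(
    extra_args={"custom_get_batch_function": my_get_batch}
)
```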
### get_dataloader_config
Obtains the configuration parameters of the data loader.
**Prototype**
```python
def get_dataloader_config()
```

View File

@ -1,450 +0,0 @@
# Model Pre-training
## Basic Concepts
**Pre-training** is a training strategy for deep learning models that is usually performed on a large-scale dataset. The goal of pre-training is to train the model on a related, large task so that it learns general features and representations. However, as model parameters and the amount of training data required grow rapidly, the resources of a single machine can no longer meet training requirements, which is why distributed training was introduced.
**Distributed training** splits a deep learning task into multiple subtasks that are trained in parallel on multiple computing devices. It greatly increases the training speed of large models and substantially reduces the overall training time.
The PreTrainer described in this document builds on Accelerate to provide the distributed capabilities of multiple frameworks (Megatron, DeepSpeed, and FSDP) and offers common functions for pre-training process management.
## Environment Setup
```shell
torch: 2.1.0
transformers: 4.45.2
accelerate: 0.28.0
deepspeed: 0.15.2
megatron_core: 0.4.0rc0
```
### Installing the Megatron-LM Distributed Framework
To use the Megatron-LM distributed framework, perform the following steps:
1. Install Megatron. For details, see the [Megatron installation method of MindSpeed](https://gitee.com/ascend/MindSpeed#3-obtain-megatron-lm-and-specify-commit-id).
```shell
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout bcce6f54e075e3c3374ea67adefe54f3f2da2b07
pip install --no-use-pep517 -e . # "--no-use-pep517 -e" can install all Megatron files.
```
2. Install MindSpeed.
```shell
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout origin/1.0.RC1
pip install -r requirements.txt
pip install -e .
```
3. Use pip to install the openmind_accelerate plugin of the Modelers community.
```shell
#AArch64 platform
pip install openmind-accelerate
#x86 platform
pip install openmind-accelerate --extra-index-url https://download.pytorch.org/whl/cpu
```
4. Install Accelerate and DeepSpeed.
```shell
pip install deepspeed==0.15.2
pip install accelerate==0.28.0
```
### openMind Library Environment Setup
```shell
#Installation in the AArch64 environment
pip install openmind[pt]
#Installation in the x86 environment
pip install openmind[pt] --extra-index-url https://download.pytorch.org/whl/cpu
```
For details about how to install the openMind Library dependency environment, see [openMind Library Installation Guide](../install.md).
After the installation is complete, use `pip list` to check the version dependencies. If the Accelerate or Transformers version was changed while installing the packages above, reinstall the versions specified above.
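The check can also be scripted; the following is only a convenience sketch using the Python standard library and is not part of openMind.
```python
# Sketch: verify that the key packages still match the versions listed above.
import importlib.metadata as metadata

expected = {"torch": "2.1.0", "transformers": "4.45.2", "accelerate": "0.28.0", "deepspeed": "0.15.2"}
for package, version in expected.items():
    installed = metadata.version(package)
    status = "OK" if installed == version else f"expected {version}"
    print(f"{package}: {installed} ({status})")
```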
## Quick Start
[Sample configuration files and startup scripts](https://modelers.cn/models/AI-Research/accelerate_examples/tree/main/examples) are provided for easy access.
### PreTrainer Use Procedure
#### Preparing the Dataset
Prepare your own pre-training dataset, for example, the [alpaca_en](https://modelers.cn/datasets/HaM/alpaca_en/tree/main) dataset.
If you need to use the Megatron-LM distributed framework, see [Megatron Data Processing](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing).
#### Preparing a Model
Prepare a model file, for example, [Llama 2](https://modelers.cn/models/AI_Connect/llama2_7b/tree/main).
If you want to use the Megatron-LM distributed framework, you only need to prepare the **config.json** and **tokenizer** files.
#### Preparing Pre-training Parameters
The pre-training parameters can be automatically generated by loading the [llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml) file. See the [sample configuration file](#llama2_megatron) for a fine-tuning dataset in JSON format.
#### Startup
- For details about the Accelerate configuration file, see [accelerate_config/accelerate_megatron_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_megatron_config.yaml).
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: [ ]
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
- For details about the model configuration file, see [llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml).
<a id="llama2_megatron"></a>
```yaml
num_training_steps: 1000
micro_batch_size: &micro_batch_size 4
dp: 1
gradient_accumulation_steps: &gradient_accumulation_steps 8
### The value of **seq_length** must be less than or equal to the value of **max_position_embeddings** in the model weight configuration file **config.json**.
seq_length: &seq_length 4096
megatron_dataset_flag: False
### data_path: Enter the path of the local fine-tuning dataset.
data_path: &data_path '/path/to/alpaca_en/alpaca_data_en_52k.json'
### Path for saving the fine-tuning model weight
save_dir: './saves'
save_interval: 10000
eval_interval: 10000
### openmind_model_path: Enter the path of the local model weight folder.
openmind_model_path: '/path/to/llama2-7b-hf'
dtype: 'bf16'
plugin_args:
tp_degree: 8
pp_degree: 1
num_micro_batches: *gradient_accumulation_steps
gradient_clipping: 1.0
use_distributed_optimizer: False
sequence_parallelism: False
other_megatron_args:
### tokenizer_model: path of the tokenizer.model file in the local model weight file.
tokenizer_model: &tokenizer_model '/path/to/llama2-7b-hf/tokenizer.model'
tokenizer_type: &tokenizer_type 'Llama2Tokenizer'
finetune: False
recompute_granularity: "full"
recompute_method: "block"
recompute_num_layers: 32
optimizer: "adam"
lr: 1e-5
min_lr: 1e-6
adam_beta2: 0.95
add_bias_linear: False
async_tensor_model_parallel_allreduce: False
attention_dropout: 0.0
attention_softmax_in_fp32: False
bias_gelu_fusion: False
ffn_hidden_size: 11008
hidden_dropout: 0.0
init_method_std: 0.01
initial_loss_scale: 65536.0
lr_decay_style: "cosine"
lr_warmup_fraction: 0.01
masked_softmax_fusion: False
normalization: "RMSNorm"
split: &split "100,0,0"
swiglu: True
untie_embeddings_and_output_weights: True
use_flash_attn: False
weight_decay: 0.1
no_load_optim: True
no_load_rng: True
eval_iters: &eval_iters 10
position_embedding_type: "rope"
dataloader_config:
return_tensors: 'pt'
padding: 'max_length'
pad_to_multiple_of: *seq_length
max_length: *seq_length
```
- For details about the pre-training program file, see [train_with_megatron_json_dataset.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_megatron_json_dataset.py). This Python script cannot be run directly. To run it, clone the following repository to obtain the utils code and copy **accelerate_examples/examples/utils** to the same directory as the script.
```shell
git clone https://modelers.cn/AI-Research/accelerate_examples.git
cp -r accelerate_examples/examples/utils ./ # Replace the destination path with the directory containing train_with_megatron_json_dataset.py.
```
```python
import os
import openmind_accelerate
from openmind import PreTrainingArguments, PreTrainer
from utils.config import get_pretrain_config_file
from utils.accelerator import make_accelerator
from utils.data import make_train_and_eval_dataloader
from utils.tokenizer import get_tokenizer
pretrain_args = PreTrainingArguments.from_yaml(get_pretrain_config_file())
os.makedirs(pretrain_args.save_dir, exist_ok=True)
accelerator = make_accelerator(pretrain_args=pretrain_args)
tokenizer = get_tokenizer(tokenizer_path=pretrain_args.openmind_model_path, use_fast=False)
transformer_dataloader_config = pretrain_args.get_dataloader_config()
train_dataloader, eval_dataloader = make_train_and_eval_dataloader(
dataloader_config=transformer_dataloader_config,
micro_batch_size=pretrain_args.micro_batch_size,
data_files=pretrain_args.data_path,
max_length=pretrain_args.seq_length,
tokenizer=tokenizer,
accelerator=accelerator
)
pretrainer = PreTrainer(pretrain_args=pretrain_args,
train_dataloader=train_dataloader,
eval_dataloader=eval_dataloader,
)
pretrainer.train()
```
After completing the environment setup and preparing the configuration files, run the following command to start training. Ensure that the paths of the training script and configuration files point to their actual local locations.
```shell
accelerate launch --config_file accelerate_config/accelerate_megatron_config.yaml train_with_megatron_json_dataset.py --pretrain_config_file llama2_config/llama2-megatron-json-dataset.yaml
```
## Advanced Use
### Defining Pre-training Parameters
Before defining a PreTrainer, you need to create a PreTrainingArguments instance that contains all hyperparameters used by PreTrainer for training and evaluation. You can initialize the pre-training parameters either from a configuration file or by passing parameters directly.
#### Using the Configuration File
The pre-training parameters can be automatically generated by loading the YAML file. For more YAML examples, see [Samples Link](https://modelers.cn/models/AI-Research/accelerate_examples/tree/main/examples/llama2_config).
```python
from openmind import PreTrainingArguments
# Replace the path with a local path.
pretrain_args = PreTrainingArguments.from_yaml(
"openmind-accelerate/examples/llama2_config/llama2-megatron.yaml"
)
```
#### Directly Passing Parameters
Pre-training parameters can also be instantiated by passing parameters directly. The following shows how to initialize the pre-trainer for training a Megatron-format dataset with the Megatron framework.
For details about the parameters, see [PreTrainingArguments Description](#pretrainingarguments-description).
```python
from openmind import PreTrainingArguments
# Replace the path with a local path.
pretrain_args = PreTrainingArguments(
megatron_dataset_flag=True,
data_path="HaM/alpaca_en",
num_training_steps=1000,
micro_batch_size=4,
dp=1,
gradient_accumulation_steps=8,
seq_length=2048,
)
```
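With the arguments above, the effective global batch size works out as follows (a common data-parallel convention; confirm it against the framework you launch with):
```python
# Global batch size implied by the example arguments above.
micro_batch_size = 4
dp = 1
gradient_accumulation_steps = 8
global_batch_size = micro_batch_size * dp * gradient_accumulation_steps
print(global_batch_size)  # 4 * 1 * 8 = 32 samples per optimizer step
```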
### Pre-training a Model Using the Megatron Framework
After configuring the pre-training parameters, you can start the Megatron model pre-training.
- For details about the configuration file for Accelerate and Megatron interconnection, see [accelerate_config/accelerate_megatron_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_megatron_config.yaml).
- For details about how to use the Megatron framework to train the JSON dataset, see [train_with_megatron_json_dataset.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_megatron_json_dataset.py).
- For details about the configuration file of JSON pre-training dataset, see [llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml).
You only need to pass the prepared `train_dataloader` (`eval_dataloader` is optional) to PreTrainer to pre-train the model with your custom dataloader.
```shell
accelerate launch --config_file accelerate_config/accelerate_megatron_config.yaml train_with_megatron_json_dataset.py --pretrain_config_file llama2_config/llama2-megatron-json-dataset.yaml
```
#### (Optional) Customizing the Processing Flow of the Megatron Framework
##### Customizing Functions
When using Megatron for pre-training, you can customize any of the `datasets_provider`, `model_provider`, `get_batch`, and `loss_function` functions as needed and assign them to the following attributes. For details about how to implement the custom functions, see the official sample [pretrain_gpt.py](https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_gpt.py).
- `custom_megatron_datasets_provider_function`: provides the training and validation datasets of Megatron.
- `custom_get_batch_function`: generates batch data.
- `custom_model_provider_function`: builds models.
- `custom_loss_function`: returns the loss function.
```python
import openmind_accelerate
from openmind import PreTrainingArguments
from pretrain_gpt import (
train_valid_test_datasets_provider,
get_batch as megatron_gpt_get_batch,
model_provider as megatron_gpt_model_provider,
loss_func as megatron_gpt_loss_func,
)
# Replace the path with a local path.
pretrain_args = PreTrainingArguments.from_yaml(
"openmind-accelerate/examples/llama2_config/llama2-megatron-json-dataset.yaml"
)
train_valid_test_datasets_provider.is_distributed = True
pretrain_args.update_distributed_train_args(
extra_args={
"custom_megatron_datasets_provider_function": train_valid_test_datasets_provider,
"custom_get_batch_function": megatron_gpt_get_batch,
"custom_model_provider_function": megatron_gpt_model_provider,
"custom_loss_function": megatron_gpt_loss_func,
}
)
```
##### Customizing the Model Configuration Parsing Function
You can customize the function that parses the model configuration file, following the format Accelerate uses for parsing model configurations. The following is PreTrainer's built-in parsing function for the Llama model configuration file, which you can adapt as needed.
```python
import openmind_accelerate
from accelerate.utils import add_model_config_to_megatron_parser
@add_model_config_to_megatron_parser("llama")
def parse_llama_config(megatron_lm_plugin, model, batch_data):
model_type_name = "gpt"
num_layers = model.config.num_hidden_layers
pretraining_flag = True
hidden_size = model.config.hidden_size
num_attention_heads = model.config.num_attention_heads
orig_vocab_size = model.config.vocab_size
max_position_embeddings = getattr(model.config, "max_position_embeddings")
seq_length = getattr(model.config, "max_sequence_length", None)
if megatron_lm_plugin.seq_length is None:
if seq_length is not None:
megatron_lm_plugin.seq_length = seq_length
elif megatron_lm_plugin.decoder_seq_length is not None:
megatron_lm_plugin.seq_length = megatron_lm_plugin.decoder_seq_length
elif batch_data is not None:
megatron_lm_plugin.seq_length = batch_data["input_ids"].shape[1]
else:
megatron_lm_plugin.seq_length = max_position_embeddings
megatron_lm_plugin.megatron_lm_default_args["return_logits"] = megatron_lm_plugin.return_logits
megatron_lm_plugin.megatron_lm_default_args["tokenizer_type"] = "Llama2Tokenizer"
megatron_lm_plugin.megatron_lm_default_args["model_type_name"] = model_type_name
megatron_lm_plugin.megatron_lm_default_args["num_layers"] = num_layers
megatron_lm_plugin.megatron_lm_default_args["pretraining_flag"] = pretraining_flag
megatron_lm_plugin.megatron_lm_default_args["hidden_size"] = hidden_size
megatron_lm_plugin.megatron_lm_default_args["num_attention_heads"] = num_attention_heads
megatron_lm_plugin.megatron_lm_default_args["orig_vocab_size"] = orig_vocab_size
megatron_lm_plugin.megatron_lm_default_args["max_position_embeddings"] = max_position_embeddings
megatron_lm_plugin.megatron_lm_default_args["seq_length"] = megatron_lm_plugin.seq_length
megatron_lm_plugin.megatron_lm_default_args["model_return_dict"] = model.config.return_dict
```
### Using Other Frameworks to Pre-train Models
PreTrainer builds its multi-framework distributed capability on Accelerate, so in addition to Megatron it also supports the DeepSpeed and FSDP distributed frameworks. The following uses DeepSpeed as an example.
After configuring the JSON pre-training parameters, you can start the DeepSpeed model pre-training.
- For details about the configuration file for Accelerate and DeepSpeed interconnection, see [accelerate_config/accelerate_deepspeed_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_deepspeed_config.yaml).
- For details about how to use the DeepSpeed framework to train the JSON dataset, see [train_with_deepspeed.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_deepspeed.py).
- For details about the configuration file of JSON pre-training dataset, see [llama2_config/llama2-deepspeed.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-deepspeed.yaml).
```yaml
num_training_steps: 1000
micro_batch_size: 1
dp: 8
gradient_accumulation_steps: 8
seq_length: 4096
megatron_dataset_flag: False
data_path: '/path/to/alpaca_en/alpaca_data_en_52k.json'
save_dir: './saves'
save_interval: 10000
eval_interval: 10000
openmind_model_path: '/path/to/llama2-7b-hf'
dtype: 'bf16'
dataloader_config:
return_tensors: 'pt'
padding: 'max_length'
pad_to_multiple_of: 4096
max_length: 4096
### The values of **seq_length**, **max_length**, and **padding** must be less than or equal to the value of **max_position_embeddings** in the model weight configuration file **config.json**.
```
```shell
accelerate launch --config_file accelerate_config/accelerate_deepspeed_config.yaml train_with_deepspeed.py --pretrain_config_file llama2_config/llama2-deepspeed.yaml
```
## PreTrainingArguments Description
| **Name** | **Description** | **Type**| **Default Value**| Mandatory/Optional |
|-----------------------------|-----------------------|--------|---------|---------|
| num_training_steps | Total number of steps for training a model. | int | - | Mandatory |
| micro_batch_size | Batch size of each model instance. | int | - | Mandatory |
| dp | Degree of data parallelism. | int | - | Mandatory |
| gradient_accumulation_steps | Number of gradient steps to be accumulated before model parameters are updated. | int | 1 | Optional |
| seq_length | Maximum length of the sequence to be processed. | int | None | Optional |
| megatron_dataset_flag | Whether the dataset is in Megatron format. | bool | None | Optional |
| data_path | Training dataset path. | str | None | Optional |
| save_dir | Output directory to which the checkpoint is to be saved. | str | None | Optional |
| save_interval | Iteration interval for saving checkpoints. | int | None | Optional |
| eval_interval | Iteration interval for evaluation. | int | None | Optional |
| openmind_model_path | Path of the openMind model to be trained. | str | None | Optional |
| dtype | Dtype mode of the running model. | str | bf16 | Optional |
| plugin_args | [Accelerate plugin parameters](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMPlugin) | dict | None | Optional |
| dataloader_config | [Dataloader configuration parameters](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMDummyDataLoader) | dict | None | Optional |
| report_to | Location to which Accelerate logs are reported. | str | None | Optional |
| project_name | Project name. | str | None | Optional |
## PreTrainer Description
The PreTrainer API creates either a Megatron pre-trainer or another pre-trainer depending on whether Accelerate uses the Megatron-LM distributed acceleration library (specifically, whether the environment variable `ACCELERATE_USE_MEGATRON_LM` equals `"true"`).
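The selection can be illustrated with the following sketch; it mirrors the rule described above rather than the exact internal implementation.
```python
import os

# Accelerate exports ACCELERATE_USE_MEGATRON_LM when distributed_type is MEGATRON_LM.
if os.environ.get("ACCELERATE_USE_MEGATRON_LM", "false") == "true":
    print("A Megatron pre-trainer will be created.")
else:
    print("A non-Megatron pre-trainer (e.g. DeepSpeed or FSDP) will be created.")
```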
### Megatron Pre-trainer
| No.| Constraint Description |
| ---- |-----------------------------------------------------------------------|
| 1 | The Megatron dependencies need to be installed. |
| 2 | The openmind_accelerate dependencies need to be installed. |
| 3 | Megatron manages accumulated gradients. Therefore, the `gradient_accumulation_steps` parameter of Accelerate must be set to **1**.|
| 4 | `train_dataloader` needs to be provided during initialization or `data_path` needs to be provided in **PreTrainingArguments**. |
| 5 | `model` needs to be provided during initialization or `openmind_model_path` needs to be provided in **PreTrainingArguments**. |
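Under these constraints, a Megatron pre-trainer can be initialized from `PreTrainingArguments` alone, as in the following sketch (all paths are placeholders):
```python
from openmind import PreTrainer, PreTrainingArguments

# Constraints 4 and 5 are satisfied through data_path and openmind_model_path,
# so no dataloader or model object is passed explicitly.
pretrain_args = PreTrainingArguments(
    num_training_steps=1000,
    micro_batch_size=4,
    dp=1,
    seq_length=2048,
    megatron_dataset_flag=True,
    data_path="/path/to/megatron_dataset_prefix",
    openmind_model_path="/path/to/llama2-7b-hf",
)
pretrainer = PreTrainer(pretrain_args=pretrain_args)
pretrainer.train()
```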
### Other Pre-trainers
| No. | Constraint |
| ---- |----------------------------------------------------------------|
| 1 | `train_dataloader` needs to be provided during initialization. |
| 2 | `optimizer` needs to be provided during initialization. |
| 3 | `lr_scheduler` needs to be provided during initialization. |
| 4 | `model` needs to be provided during initialization or `openmind_model_path` needs to be provided in **PreTrainingArguments**.|
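For these pre-trainers, the model, optimizer, scheduler, and training dataloader must all be supplied by the caller. The sketch below only illustrates the call signature; the four placeholder objects are assumed to come from your own training setup.
```python
from openmind import PreTrainer, PreTrainingArguments

pretrain_args = PreTrainingArguments(num_training_steps=1000, micro_batch_size=1, dp=8)

model = ...             # torch.nn.Module
optimizer = ...         # e.g. torch.optim.AdamW(model.parameters(), lr=1e-5)
lr_scheduler = ...      # e.g. a torch.optim.lr_scheduler instance
train_dataloader = ...  # torch.utils.data.DataLoader

pretrainer = PreTrainer(
    pretrain_args=pretrain_args,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    train_dataloader=train_dataloader,
)
pretrainer.train()
```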
*Thanks to the community contributors for the Llama 2 model and the alpaca_en dataset.*

View File

@ -4,8 +4,6 @@ openMind Library is an open-source deep learning development kit. It supports mo
## openMind Library Features
+ To cope with the challenges of distributed training of foundation models, openMind Library provides pre-training APIs and acceleration libraries such as MindSpeed and Accelerate to help you quickly and smoothly train foundation models. For details, see [model pre-training](basic_tutorial/pretrainer.md).
+ openMind Library encapsulates APIs such as Transformers, MindFormers AutoClass, Pipeline, and Trainer, enhances functions, and provides the capability of automatic download and load of models from the Modelers community. In addition, the Ascend NPU affinity feature is added, effectively improves the performance of model training and inference on Ascend NPUs. For details, see [Model Fine-Tuning](basic_tutorial/finetune/overview.md) and [Model Inference](basic_tutorial/pipeline.md).
+ openMind Library provides simple and easy-to-use command-line interfaces (CLIs) for quickly uploading, downloading, inferring, dialog, and deploying models with low code. For details, see the [command line interface](basic_tutorial/cli.md).

View File

@ -50,13 +50,6 @@
"en": "Data Load" "en": "Data Load"
} }
}, },
{
"id": "pretrainer",
"label": {
"zh": "模型预训练",
"en": "Model Pre-training"
}
},
{
"id": "train",
"label": {
@ -343,13 +336,6 @@
"en": "Pipelines" "en": "Pipelines"
} }
}, },
{
"id": "pretrainer_api",
"label": {
"zh": "PreTrainer",
"en": "PreTrainer"
}
},
{
"id": "trainer_api",
"label": {

View File

@ -1,124 +0,0 @@
# PreTrainer 模块接口
## openmind.PreTrainer类
`PreTrainer`类提供了通用的预训练流程管理功能。
**参数列表**
| 参数名 | 类型 | 描述 | 默认值 |
| ---------------- | ------------------------------------------- |---------------|------|
| pretrain_args | PreTrainingArguments | 预训练参数。 | - |
| accelerator | Accelerator | accelerate实例。 | None |
| model | torch.nn.Module | torch模型。 | None |
| optimizer | accelerate.utils.MegatronLMOptimizerWrapper | 优化器。 | None |
| lr_scheduler | accelerate.utils.MegatronLMSchedulerWrapper | 调度器。 | None |
| train_dataloader | torch.utils.data.DataLoader | 训练数据加载器。 | None |
| eval_dataloader | torch.utils.data.DataLoader | 评估数据加载器。 | None |
### train
预训练启动。
**接口原型**
```python
def train()
```
## openmind.PreTrainingArguments类
`PreTrainingArguments`类用于配置训练任务的参数,包括训练过程中所需的超参数、模型保存路径和学习率等。
**参数列表**
| 参数名 | 类型 | 描述 | PyTorch默认值 |
| --------------------------- | ---- |-------------------|-----------------------|
| num_training_steps | int | 训练步数。 | - |
| micro_batch_size | int | 微批大小。 | - |
| dp | int | 数据并行度。 | - |
| gradient_accumulation_steps | int | 梯度累计步数。 | 1 |
| seq_length | int | 最大处理序列长度。 | None |
| megatron_dataset_flag | bool | 是否为megatron格式数据集。 | None |
| data_path | str | 数据集路径。 | None |
| save_dir | str | 模型保存路径。 | None |
| save_interval | int | 模型保存间隔。 | None |
| eval_interval | int | 模型评估间隔。 | None |
| openmind_model_path | str | 模型路径。 | None |
| dtype | str | 运行时数据类型。 | bf16 |
| plugin_args | dict | [Accelerate插件参数。](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMPlugin) | None |
| dataloader_config | dict | [加载器配置参数。](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMDummyDataLoader) | None |
| report_to | str | accelerate日志输出对象。 | None |
| project_name | str | 项目名称。 | "accelerate-megatron" |
### from_yaml
从yaml配置文件加载配置。
**接口原型**
```python
def from_yaml(config_path: str)
```
**参数列表**
| 参数名 | 描述 | 支持类型 |
| ----------- |-------------| -------- |
| config_path | yaml配置文件路径。 | str |
### get_mixed_precision
获取混合精度类型。
**接口原型**
```python
def get_mixed_precision()
```
### get_torch_dtype
获取运行时数据类型。
**接口原型**
```python
def get_torch_dtype()
```
### get_distributed_train_args
获取分布式预训练参数。
**接口原型**
```python
def get_distributed_train_args()
```
### update_distributed_train_args
更新分布式预训练参数。
**接口原型**
```python
def update_distributed_train_args(extra_args: dict)
```
**参数列表**
| 参数名 | 描述 | 支持类型 |
| ---------- |-------------| -------- |
| extra_args | 分布式预训练额外参数。 | dict |
### get_dataloader_config
获取数据加载器配置参数。
**接口原型**
```python
def get_dataloader_config()
```

View File

@ -1,450 +0,0 @@
# 模型预训练
## 基础概念
**预训练**是一种深度学习模型训练的策略,通常在大规模的数据集上进行。预训练的目标是通过在一个相关但较大的任务上训练模型,使得模型学习到通用的特征表示。但是随着大模型参数和所需训练数据量的急剧增长,单个机器的资源上限已无法满足训练要求,于是就引出了分布式训练的概念。
**分布式训练**指的是将深度学习模型任务分解为多个子任务,并在多个计算设备上并行的进行训练。分布式训练极大地提升了大模型的训练速度,可以大幅降低模型训练的总体时间。
本文档中的PreTrainer是基于Accelerate实现了多框架Megatron、DeepSpeed以及FSDP的分布式能力并提供了通用的预训练流程管理功能。
## 环境准备
```shell
torch: 2.1.0
transformers: 4.45.2
accelerate: 0.28.0
deepspeed: 0.15.2
megatron_core: 0.4.0rc0
```
### 安装Megatron-LM分布式框架
若用户需要使用Megatron-LM分布式框架则还需执行以下步骤。
1. 安装Megatron[参考MindSpeed的Megatron安装方式](https://gitee.com/ascend/MindSpeed#3-获取-megatron-lm-并指定-commit-id)
```shell
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout bcce6f54e075e3c3374ea67adefe54f3f2da2b07
pip install --no-use-pep517 -e . # 使用"--no-use-pep517 -e"安装megatron全部文件
```
2. 安装MindSpeed
```shell
git clone https://gitee.com/ascend/MindSpeed.git
cd MindSpeed
git checkout origin/1.0.RC1
pip install -r requirements.txt
pip install -e .
```
3. 使用pip安装魔乐社区openmind_accelerate插件
```shell
#aarch64平台
pip install openmind-accelerate
#x86平台
pip install openmind-accelerate --extra-index-url https://download.pytorch.org/whl/cpu
```
4. 安装accelerate与deepspeed
```shell
pip install deepspeed==0.15.2
pip install accelerate==0.28.0
```
### openMind Library环境准备
```shell
#aarch64环境下安装
pip install openmind[pt]
#x86环境下安装
pip install openmind[pt] --extra-index-url https://download.pytorch.org/whl/cpu
```
openMind Library依赖环境安装请参考[openMind Library安装指南](../install.md)。
安装完成后请使用`pip list`检查版本依赖,如果在安装上述依赖的时候accelerate或transformers版本被刷新,请重新刷回指定版本。
## 快速使用
我们提供了[样例配置文件和启动脚本](https://modelers.cn/models/AI-Research/accelerate_examples/tree/main/examples),方便用户一键使用。
### PreTrainer的使用步骤如下所示
#### 准备数据
用户需要准备好自己的预训练数据,例如[alpaca_en](https://modelers.cn/datasets/HaM/alpaca_en/tree/main)数据。
如果用户需要使用Megatron-LM分布式框架可参考[Megatron的数据处理方法](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing) 进行处理。
#### 准备模型
用户需要准备好模型文件,例如[llama2模型](https://modelers.cn/models/AI_Connect/llama2_7b/tree/main)。
如果用户需要使用Megatron-LM分布式框架则只需要准备config.json和tokenizer相关文件即可。
#### 准备预训练参数
预训练参数可以通过加载 [llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml) 文件自动生成,用户可参考[此处](#llama2_megatron)基于json格式微调数据集的样例配置文件
#### 启动
- Accelerate配置文件可参考[accelerate_config/accelerate_megatron_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_megatron_config.yaml)
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: [ ]
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
- 模型配置文件可参考:[llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml)
<a id="llama2_megatron"></a>
```yaml
num_training_steps: 1000
micro_batch_size: &micro_batch_size 4
dp: 1
gradient_accumulation_steps: &gradient_accumulation_steps 8
### seq_length需要小于或等于模型权重配置文件config.json中"max_position_embeddings"字段的值
seq_length: &seq_length 4096
megatron_dataset_flag: False
### data_path请传入本地微调数据集所在路径
data_path: &data_path '/path/to/alpaca_en/alpaca_data_en_52k.json'
### 微调模型权重保存路径
save_dir: './saves'
save_interval: 10000
eval_interval: 10000
### openmind_model_path请传入本地模型权重文件夹所在路径
openmind_model_path: '/path/to/llama2-7b-hf'
dtype: 'bf16'
plugin_args:
tp_degree: 8
pp_degree: 1
num_micro_batches: *gradient_accumulation_steps
gradient_clipping: 1.0
use_distributed_optimizer: False
sequence_parallelism: False
other_megatron_args:
### tokenizer_model请传入本地模型权重文件中tokenizer.model文件所在路径
tokenizer_model: &tokenizer_model '/path/to/llama2-7b-hf/tokenizer.model'
tokenizer_type: &tokenizer_type 'Llama2Tokenizer'
finetune: False
recompute_granularity: "full"
recompute_method: "block"
recompute_num_layers: 32
optimizer: "adam"
lr: 1e-5
min_lr: 1e-6
adam_beta2: 0.95
add_bias_linear: False
async_tensor_model_parallel_allreduce: False
attention_dropout: 0.0
attention_softmax_in_fp32: False
bias_gelu_fusion: False
ffn_hidden_size: 11008
hidden_dropout: 0.0
init_method_std: 0.01
initial_loss_scale: 65536.0
lr_decay_style: "cosine"
lr_warmup_fraction: 0.01
masked_softmax_fusion: False
normalization: "RMSNorm"
split: &split "100,0,0"
swiglu: True
untie_embeddings_and_output_weights: True
use_flash_attn: False
weight_decay: 0.1
no_load_optim: True
no_load_rng: True
eval_iters: &eval_iters 10
position_embedding_type: "rope"
dataloader_config:
return_tensors: 'pt'
padding: 'max_length'
pad_to_multiple_of: *seq_length
max_length: *seq_length
```
- 预训练程序文件可参考[train_with_megatron_json_dataset.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_megatron_json_dataset.py)。此python脚本不能直接运行,如需运行请自行下载如下仓库获取utils相关代码,然后将accelerate_examples/examples/utils复制到此脚本同目录下。
```shell
git clone https://modelers.cn/AI-Research/accelerate_examples.git
cp -r accelerate_examples/examples/utils ./ # 自行替换目的路径为train_with_megatron_json_dataset.py所在路径
```
```python
import os
import openmind_accelerate
from openmind import PreTrainingArguments, PreTrainer
from utils.config import get_pretrain_config_file
from utils.accelerator import make_accelerator
from utils.data import make_train_and_eval_dataloader
from utils.tokenizer import get_tokenizer
pretrain_args = PreTrainingArguments.from_yaml(get_pretrain_config_file())
os.makedirs(pretrain_args.save_dir, exist_ok=True)
accelerator = make_accelerator(pretrain_args=pretrain_args)
tokenizer = get_tokenizer(tokenizer_path=pretrain_args.openmind_model_path, use_fast=False)
transformer_dataloader_config = pretrain_args.get_dataloader_config()
train_dataloader, eval_dataloader = make_train_and_eval_dataloader(
dataloader_config=transformer_dataloader_config,
micro_batch_size=pretrain_args.micro_batch_size,
data_files=pretrain_args.data_path,
max_length=pretrain_args.seq_length,
tokenizer=tokenizer,
accelerator=accelerator
)
pretrainer = PreTrainer(pretrain_args=pretrain_args,
train_dataloader=train_dataloader,
eval_dataloader=eval_dataloader,
)
pretrainer.train()
```
在完成上述环境配置以及配置文件准备后,即可通过如下命令启动微调,请确保其中的训练脚本和配置文件为本地实际路径。
```shell
accelerate launch --config_file accelerate_config/accelerate_megatron_config.yaml train_with_megatron_json_dataset.py --pretrain_config_file llama2_config/llama2-megatron-json-dataset.yaml
```
## 进阶使用
### 定义预训练参数
在我们定义PreTrainer之前,首先需要定义一个PreTrainingArguments类,它将包含PreTrainer用于训练和评估的所有超参数。用户可以通过配置文件或者直接传参初始化预训练参数。
#### 使用配置文件
预训练参数可以通过加载yaml文件自动生成,更多yaml样例可参考[样例链接](https://modelers.cn/models/AI-Research/accelerate_examples/tree/main/examples/llama2_config)。
```python
from openmind import PreTrainingArguments
# 路径需要替换为本地路径
pretrain_args = PreTrainingArguments.from_yaml(
"openmind-accelerate/examples/llama2_config/llama2-megatron.yaml"
)
```
#### 直接传参
预训练参数也可以通过传参的方式实例化。使用Megatron模型训练Megatron数据集的预训练器初始化流程如下。
参数链接请点击:[PreTrainingArguments说明](#pretrainingarguments说明)。
```python
from openmind import PreTrainingArguments
# 路径需要替换为本地路径
pretrain_args = PreTrainingArguments(
megatron_dataset_flag=True,
data_path="HaM/alpaca_en",
num_training_steps=1000,
micro_batch_size=4,
dp=1,
gradient_accumulation_steps=8,
seq_length=2048,
)
```
### 使用Megatron框架预训练模型
用户完成预训练参数配置后即可启动Megatron模型预训练。
- Accelerate对接Megatron的配置文件可参考[accelerate_config/accelerate_megatron_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_megatron_config.yaml)
- 使用Megatron框架训练Json数据运行示例可参考[train_with_megatron_json_dataset.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_megatron_json_dataset.py)。
- Json格式数据预训练配置文件示例可参考[llama2_config/llama2-megatron-json-dataset.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-megatron-json-dataset.yaml)。
用户只需要将准备好的`train_dataloader`(`eval_dataloader`非必选)传给PreTrainer,即可使用用户自定义的dataloader预训练模型。
```shell
accelerate launch --config_file accelerate_config/accelerate_megatron_config.yaml train_with_megatron_json_dataset.py --pretrain_config_file llama2_config/llama2-megatron-json-dataset.yaml
```
#### 自定义Megatron框架处理流程(可选)
##### 自定义处理函数
如下代码所示,PreTrainer接口在使用Megatron预训练时,支持用户根据实际场景按需自定义`datasets_provider`、`model_provider`、`get_batch`和`loss_function`中的任意函数,并将函数指针赋值到如下属性中。自定义函数的实现可参考官方样例[pretrain_gpt.py](https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_gpt.py)。
- `custom_megatron_datasets_provider_function`:用于提供Megatron的训练和验证数据集。
- `custom_get_batch_function`:用于生成批次数据。
- `custom_model_provider_function`:用于构建模型。
- `custom_loss_function`:返回损失函数。
```python
import openmind_accelerate
from openmind import PreTrainingArguments
from pretrain_gpt import (
train_valid_test_datasets_provider,
get_batch as megatron_gpt_get_batch,
model_provider as megatron_gpt_model_provider,
loss_func as megatron_gpt_loss_func,
)
# 路径需要替换为本地路径
pretrain_args = PreTrainingArguments.from_yaml(
"openmind-accelerate/examples/llama2_config/llama2-megatron-json-dataset.yaml"
)
train_valid_test_datasets_provider.is_distributed = True
pretrain_args.update_distributed_train_args(
extra_args={
"custom_megatron_datasets_provider_function": train_valid_test_datasets_provider,
"custom_get_batch_function": megatron_gpt_get_batch,
"custom_model_provider_function": megatron_gpt_model_provider,
"custom_loss_function": megatron_gpt_loss_func,
}
)
```
##### 自定义解析模型配置文件
用户可依据Accelerate解析模型配置的格式自定义模型配置文件解析函数。以下为PreTrainer内置的llama模型配置文件解析函数用户可以根据实际情况参考。
```python
import openmind_accelerate
from accelerate.utils import add_model_config_to_megatron_parser
@add_model_config_to_megatron_parser("llama")
def parse_llama_config(megatron_lm_plugin, model, batch_data):
model_type_name = "gpt"
num_layers = model.config.num_hidden_layers
pretraining_flag = True
hidden_size = model.config.hidden_size
num_attention_heads = model.config.num_attention_heads
orig_vocab_size = model.config.vocab_size
max_position_embeddings = getattr(model.config, "max_position_embeddings")
seq_length = getattr(model.config, "max_sequence_length", None)
if megatron_lm_plugin.seq_length is None:
if seq_length is not None:
megatron_lm_plugin.seq_length = seq_length
elif megatron_lm_plugin.decoder_seq_length is not None:
megatron_lm_plugin.seq_length = megatron_lm_plugin.decoder_seq_length
elif batch_data is not None:
megatron_lm_plugin.seq_length = batch_data["input_ids"].shape[1]
else:
megatron_lm_plugin.seq_length = max_position_embeddings
megatron_lm_plugin.megatron_lm_default_args["return_logits"] = megatron_lm_plugin.return_logits
megatron_lm_plugin.megatron_lm_default_args["tokenizer_type"] = "Llama2Tokenizer"
megatron_lm_plugin.megatron_lm_default_args["model_type_name"] = model_type_name
megatron_lm_plugin.megatron_lm_default_args["num_layers"] = num_layers
megatron_lm_plugin.megatron_lm_default_args["pretraining_flag"] = pretraining_flag
megatron_lm_plugin.megatron_lm_default_args["hidden_size"] = hidden_size
megatron_lm_plugin.megatron_lm_default_args["num_attention_heads"] = num_attention_heads
megatron_lm_plugin.megatron_lm_default_args["orig_vocab_size"] = orig_vocab_size
megatron_lm_plugin.megatron_lm_default_args["max_position_embeddings"] = max_position_embeddings
megatron_lm_plugin.megatron_lm_default_args["seq_length"] = megatron_lm_plugin.seq_length
megatron_lm_plugin.megatron_lm_default_args["model_return_dict"] = model.config.return_dict
```
### 使用其他框架预训练模型
PreTrainer是基于Accelerate实现的多框架分布式能力,所以PreTrainer除了支持Megatron框架,还支持DeepSpeed和FSDP分布式框架。如下以DeepSpeed分布式框架为例:
用户完成Json格式预训练参数配置后即可启动DeepSpeed模型预训练。
- Accelerate对接DeepSpeed的配置文件示例可参考[accelerate_config/accelerate_deepspeed_config.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/accelerate_config/accelerate_deepspeed_config.yaml)。
- 使用DeepSpeed框架训练Json数据运行示例可参考[train_with_deepspeed.py](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/train_with_deepspeed.py)。
- Json格式数据预训练配置文件示例可参考[llama2_config/llama2-deepspeed.yaml](https://modelers.cn/models/AI-Research/accelerate_examples/blob/main/examples/llama2_config/llama2-deepspeed.yaml)。
```yaml
num_training_steps: 1000
micro_batch_size: 1
dp: 8
gradient_accumulation_steps: 8
seq_length: 4096
megatron_dataset_flag: False
data_path: '/path/to/alpaca_en/alpaca_data_en_52k.json'
save_dir: './saves'
save_interval: 10000
eval_interval: 10000
openmind_model_path: '/path/to/llama2-7b-hf'
dtype: 'bf16'
dataloader_config:
return_tensors: 'pt'
padding: 'max_length'
pad_to_multiple_of: 4096
max_length: 4096
### seq_length、max_length以及padding的值均需要小于或等于模型权重配置文件config.json中"max_position_embeddings"字段的值
```
```shell
accelerate launch --config_file accelerate_config/accelerate_deepspeed_config.yaml train_with_deepspeed.py --pretrain_config_file llama2_config/llama2-deepspeed.yaml
```
## PreTrainingArguments说明
| **参数名** | **描述** | **类型** | **默认值** | 是否可选 |
|-----------------------------|-----------------------|--------|---------|---------|
| num_training_steps | 训练模型的总步数。 | int | - | 必选 |
| micro_batch_size | 每个模型实例的批处理大小。 | int | - | 必选 |
| dp | 数据并行度。 | int | - | 必选 |
| gradient_accumulation_steps | 在更新模型参数之前要累积的梯度步数。 | int | 1 | 可选 |
| seq_length | 要处理的最大序列长度。 | int | None | 可选 |
| megatron_dataset_flag | 是否使用Megatron类型数据集的标志。 | bool | None | 可选 |
| data_path | 训练数据集的路径。 | str | None | 可选 |
| save_dir | 要将检查点保存到的输出目录。 | str | None | 可选 |
| save_interval | 检查点保存的迭代间隔。 | int | None | 可选 |
| eval_interval | 验证集评估的迭代间隔。 | int | None | 可选 |
| openmind_model_path | 待训练的openMind模型的路径。 | str | None | 可选 |
| dtype | 运行模型的dtype模式。 | str | bf16 | 可选 |
| plugin_args | [Accelerate插件参数。](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMPlugin) | dict | None | 可选 |
| dataloader_config | [加载器配置参数。](https://huggingface.co/docs/accelerate/v0.28.0/en/package_reference/megatron_lm#accelerate.utils.MegatronLMDummyDataLoader) | dict | None | 可选 |
| report_to | Accelerate日志上报到何处。 | str | None | 可选 |
| project_name | 项目的名称。 | str | None | 可选 |
## PreTrainer说明
PreTrainer接口会根据Accelerate是否使用Megatron-LM分布式加速库(以环境变量`ACCELERATE_USE_MEGATRON_LM=="true"`为依据),来选择创建Megatron预训练器或其他预训练器。
### Megatron预训练器
| 序号 | 约束描述 |
| ---- |-----------------------------------------------------------------------|
| 1 | 需要预先安装Megatron依赖。 |
| 2 | 需要预先安装openmind_accelerate插件依赖。 |
| 3 | Megatron会自管理累积梯度所以Accelerate的`gradient_accumulation_steps`参数需要指定为 1。 |
| 4 | 初始化时需要提供`train_dataloader`或在PreTrainingArguments里提供`data_path`。 |
| 5 | 初始化时需要提供`model`或在PreTrainingArguments里提供`openmind_model_path`。 |
### 其他预训练器
| 序号 | 约束描述 |
| ---- |----------------------------------------------------------------|
| 1 | 初始化时需要提供`train_dataloader`。 |
| 2 | 初始化时需要提供`optimizer`。 |
| 3 | 初始化时需要提供`lr_scheduler`。 |
| 4 | 初始化时需要提供`model`或在PreTrainingArguments里提供`openmind_model_path`。 |
*感谢社区贡献的 llama2 模型以及 alpaca_en 数据集*

View File

@ -79,6 +79,75 @@ You are a helpful assistant.<|im_end|>
</tr>
</thead>
<tbody>
<!-- Qwen3 -->
<tr>
<td rowspan="11">Qwen3</td>
<td>Qwen3-32B-Chat</td>
<td>Models_Ecosystem/Qwen3-32B</td>
<td>Qwen/Qwen3-32B</td>
<td rowspan="11">qwen</td>
<td></td>
</tr>
<tr>
<td>Qwen3-14B-Chat</td>
<td>Models_Ecosystem/Qwen3-14B</td>
<td>Qwen/Qwen3-14B</td>
<td></td>
</tr>
<tr>
<td>Qwen3-14B</td>
<td>Models_Ecosystem/Qwen3-14B-Base</td>
<td>Qwen/Qwen3-14B-Base</td>
<td></td>
</tr>
<tr>
<td>Qwen3-8B-Chat</td>
<td>Models_Ecosystem/Qwen3-8B</td>
<td>Qwen/Qwen3-8B</td>
<td></td>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Models_Ecosystem/Qwen3-8B-Base</td>
<td>Qwen/Qwen3-8B-Base</td>
<td></td>
</tr>
<tr>
<td>Qwen3-4B-Chat</td>
<td>Models_Ecosystem/Qwen3-4B</td>
<td>Qwen/Qwen3-4B</td>
<td></td>
</tr>
<tr>
<td>Qwen3-4B</td>
<td>Models_Ecosystem/Qwen3-4B-Base</td>
<td>Qwen/Qwen3-4B-Base</td>
<td></td>
</tr>
<tr>
<td>Qwen3-1.7B-Chat</td>
<td>Models_Ecosystem/Qwen3-1.7B</td>
<td>Qwen/Qwen3-1.7B</td>
<td></td>
</tr>
<tr>
<td>Qwen3-1.7B</td>
<td>Models_Ecosystem/Qwen3-1.7B-Base</td>
<td>Qwen/Qwen3-1.7B-Base</td>
<td></td>
</tr>
<tr>
<td>Qwen3-0.6B-Chat</td>
<td>Models_Ecosystem/Qwen3-0.6B</td>
<td>Qwen/Qwen3-0.6B</td>
<td></td>
</tr>
<tr>
<td>Qwen3-0.6B</td>
<td>Models_Ecosystem/Qwen3-0.6B-Base</td>
<td>Qwen/Qwen3-0.6B-Base</td>
<td></td>
</tr>
<!-- Qwen2.5 -->
<tr>
<td rowspan="3">Qwen2.5</td>
@ -100,6 +169,15 @@ You are a helpful assistant.<|im_end|>
<td>Qwen/Qwen2.5-32B</td>
<td></td>
</tr>
<!-- Qwen2.5-VL -->
<tr>
<td>Qwen2.5-VL</td>
<td>Qwen2.5-VL-7B-Instruct</td>
<td>PyTorch-NPU/Qwen2.5-VL-7B-Instruct</td>
<td>Qwen/Qwen2.5-VL-7B-Instruct</td>
<td>qwen2_vl</td>
<td></td>
</tr>
<!-- Qwen2 -->
<tr>
<td rowspan="3">Qwen2</td>
@ -256,15 +334,6 @@ You are a helpful assistant.<|im_end|>
<td>deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B</td>
<td></td>
</tr>
<!-- Qwen2.5-VL -->
<tr>
<td>Qwen2.5-VL</td>
<td>Qwen2.5-VL-7B-Instruct</td>
<td>PyTorch-NPU/Qwen2.5-VL-7B-Instruct</td>
<td>Qwen/Qwen2.5-VL-7B-Instruct</td>
<td>qwen2_vl</td>
<td></td>
</tr>
</tbody>
</table>

View File

@ -4,8 +4,6 @@ openMind Library是一个深度学习开发套件通过简单易用的API支
## openMind Library特性
+ 为了应对大模型分布式训练的挑战openMind Library提供了预训练接口支持MindSpeed、Accelerate等加速库帮助开发者顺畅快速地训练大模型具体可参考[模型预训练](basic_tutorial/pretrainer.md)章节。
+ openMind Library基于[transformers库](https://github.com/huggingface/transformers)集成了PyTorch框架下主流第三方工具的功能提供了一键式的封装的微调命令行接口解决方案涵盖了从数据处理、权重加载到低参数训练、量化适配训练和跟踪的全流程功能更多细节可查看[模型训练](basic_tutorial/train/overview.md)。
+ openMind Library对Transformers和MindFormers的AutoClass、Pipeline、Trainer等接口进行封装并增强了其功能提供了对应的SDK。还提供了从魔乐社区自动下载和加载模型的能力同时扩展新增了昇腾NPU亲和的特性有效提升在昇腾NPU上进行模型训练推理的性能具体可参考[模型训练](basic_tutorial/train/overview.md)和[模型推理](basic_tutorial/pipeline.md)章节。

View File

@ -51,8 +51,6 @@ if TYPE_CHECKING:
from .archived.trainers import (
Trainer,
TrainingArguments,
PreTrainer,
PreTrainingArguments,
)
from .archived.pipelines import pipeline
from .omdatasets import OmDataset

View File

@ -18,16 +18,12 @@ from openmind.utils import _LazyModule
if TYPE_CHECKING:
from .trainer import Trainer
from .training_args import TrainingArguments
from .pretrainer import PreTrainer
from .pretraining_args import PreTrainingArguments
else:
import sys
_import_structure = {
"trainer": ["Trainer"],
"training_args": ["TrainingArguments"],
"pretrainer": ["PreTrainer"],
"pretraining_args": ["PreTrainingArguments"],
}
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

View File

@ -1,452 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
import dataclasses
import importlib
import importlib.util
import os
import time
import warnings
from accelerate import Accelerator, init_empty_weights
try:
import torch
except ImportError as e:
raise ImportError("Please install torch package before using this PreTrainer.") from e
import torch.utils.data
from transformers import AutoConfig, AutoModelForCausalLM
from .pretrainer_utils import print_in_last_rank, print_in_main_process
from .pretraining_args import PreTrainingArguments
warnings.warn(
"The class 'PreTrainer' is deprecated and will be removed in version 1.1.0. ",
FutureWarning,
)
class _PreTrainerCommon:
def __init__(
self,
pretrain_args: PreTrainingArguments,
accelerator: Accelerator = None,
model: torch.nn.Module = None,
optimizer=None,
lr_scheduler=None,
train_dataloader: torch.utils.data.DataLoader = None,
eval_dataloader: torch.utils.data.DataLoader = None,
*args,
**kwargs,
):
self.model = model
self.pretrain_args = pretrain_args
self.train_dataloader = train_dataloader
self.optimizer = optimizer
self.lr_scheduler = lr_scheduler
self.accelerator = accelerator
self.eval_dataloader = eval_dataloader
self.completed_steps = 0
self._post_init()
def train(self):
self._pre_training()
batch_loss_sum = 0
start_time = time.time()
while self.completed_steps < self.pretrain_args.num_training_steps:
for batch in self.train_dataloader:
outputs = self._train_step(batch)
loss_ = outputs.loss.detach().float()
batch_loss_sum += loss_.item()
if self.accelerator.sync_gradients:
self.completed_steps += 1
else:
continue # for accelerator's gradient_accumulation
lr = self._get_lr()
batch_loss_avg = self._get_batch_loss_avg(batch_loss_sum=batch_loss_sum)
elapsed_time = (time.time() - start_time) * 1000 # ms
self._train_step_log(step=self.completed_steps, loss=batch_loss_avg, lr=lr, elapsed_time=elapsed_time)
batch_loss_sum = 0
if (
self.pretrain_args.save_interval
and self.completed_steps % self.pretrain_args.save_interval == 0
and self.pretrain_args.save_dir
):
self._save_state(save_dir=self.pretrain_args.save_dir)
if (
self.pretrain_args.eval_interval
and self.completed_steps % self.pretrain_args.eval_interval == 0
and self.eval_dataloader is not None
):
self._eval(eval_dataloader=self.eval_dataloader, completed_steps=self.completed_steps)
start_time = time.time()
if self.completed_steps >= self.pretrain_args.num_training_steps:
break
self.accelerator.end_training()
self.accelerator.wait_for_everyone()
self._post_training()
def _post_init(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _init_trackers(self):
experiment_config = {}
experiment_config.update(dataclasses.asdict(self.pretrain_args))
self.accelerator.init_trackers(self.pretrain_args.project_name, experiment_config)
def _get_gradient_accumulation_steps(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _get_batch_loss_avg(self, batch_loss_sum):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _get_lr(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _train_step(self, batch):
self.model.train()
with self.accelerator.accumulate(self.model):
outputs = self.model(**batch)
loss = outputs.loss
self.accelerator.backward(loss)
self.optimizer.step()
self.lr_scheduler.step()
self.optimizer.zero_grad()
return outputs
def _train_step_log(self, loss, lr, elapsed_time, step):
log_str = (
f"step: {step} | elapsed time per iteration (ms): {elapsed_time:.1f} | learning rate: {lr:.3E} | "
f"lm loss: {loss:.6E}"
)
print_in_last_rank(log_str)
# tracker
self.accelerator.log(
{
"train_loss": loss,
"learning_rate": lr,
},
step=step,
)
def _print_training_info(self):
print_in_main_process("***** Running training *****")
print_in_main_process(
f" Num examples = {self.pretrain_args.num_training_steps * self.pretrain_args.batch_size}"
)
print_in_main_process(f" Instantaneous batch size per device = {self.pretrain_args.micro_batch_size}")
print_in_main_process(
f" Total train batch size (w. parallel, distributed & accumulation) = {self.pretrain_args.batch_size}"
)
print_in_main_process(f" Gradient Accumulation steps = {self._get_gradient_accumulation_steps()}")
print_in_main_process(f" Total steps = {self.pretrain_args.num_training_steps}")
def _pre_training(self):
self._print_training_info()
print_in_main_process(f"[before the start of training step] datetime: {time.strftime('%Y-%m-%d %H:%M:%S')}")
self.completed_steps = 0
def _post_training(self):
print_in_main_process(f"[after training is done] datetime: {time.strftime('%Y-%m-%d %H:%M:%S')}")
self._save(save_dir=self.pretrain_args.save_dir)
def _get_eval_loss(self, loss):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _eval(self, eval_dataloader, completed_steps=None):
if completed_steps is not None:
self.completed_steps = completed_steps
losses = []
for _, batch in enumerate(eval_dataloader):
outputs = self._eval_step(batch)
loss = outputs.loss
losses.append(self._get_eval_loss(loss))
self._eval_log(losses=losses)
def _eval_step(self, batch):
self.model.eval()
with torch.no_grad():
outputs = self.model(**batch)
return outputs
def _handle_eval_losses(self, losses):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _eval_log(self, losses):
losses = self._handle_eval_losses(losses)
eval_loss = torch.mean(losses)
print_in_last_rank(f"validation at step: {self.completed_steps} | eval_loss: {eval_loss}")
self.accelerator.log(
{
"eval_loss": eval_loss,
},
step=self.completed_steps,
)
def _save_state(self, save_dir):
self.accelerator.save_state(save_dir)
def _save(self, save_dir):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _read_model(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _prepare(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
def _make_accelerator(self):
raise NotImplementedError("_PreTrainerCommon : Not implemented!")
class _PreTrainerMegatron(_PreTrainerCommon):
def _make_megatron_dataloader(self):
from accelerate.utils import MegatronLMDummyDataLoader
data_path = self.pretrain_args.data_path
megatron_dataloader_config = {
"data_path": data_path if isinstance(data_path, list) else [data_path],
"seq_length": self.pretrain_args.seq_length,
"micro_batch_size": self.pretrain_args.micro_batch_size,
"eval_interval": self.pretrain_args.eval_interval,
}
if self.pretrain_args.dataloader_config:
for key, value in self.pretrain_args.dataloader_config.items():
if key in megatron_dataloader_config.keys():
print_in_main_process(
f"PreTrainerMegatron dataloader overriding arguments for "
f"{key}:{megatron_dataloader_config[key]} with {key}:{value}"
)
megatron_dataloader_config[key] = value
megatron_dataloader = MegatronLMDummyDataLoader(**megatron_dataloader_config)
self.train_dataloader = megatron_dataloader
self.accelerator.state.megatron_lm_plugin.megatron_dataset_flag = True
def _get_megatron_lm_plugin(self):
from accelerate.utils import MegatronLMPlugin
plugin_args = {
"train_iters": self.pretrain_args.num_training_steps,
"seq_length": self.pretrain_args.seq_length,
"num_micro_batches": self.pretrain_args.gradient_accumulation_steps,
"megatron_dataset_flag": self.pretrain_args.megatron_dataset_flag,
"eval_interval": self.pretrain_args.eval_interval,
}
if self.pretrain_args.plugin_args:
for key, value in self.pretrain_args.plugin_args.items():
if key in plugin_args.keys():
msg = (
f"WARNING: PreTrainerMegatron plugin overriding arguments for "
f"{key}:{plugin_args[key]} with {key}:{value}"
)
print_in_main_process(msg)
plugin_args[key] = value
return MegatronLMPlugin(**plugin_args)
def _make_accelerator(self):
accelerate_kwargs = {
"log_with": self.pretrain_args.report_to,
"project_dir": self.pretrain_args.save_dir,
"mixed_precision": self.pretrain_args.get_mixed_precision(),
}
megatron_lm_plugin = self._get_megatron_lm_plugin()
accelerate_kwargs["megatron_lm_plugin"] = megatron_lm_plugin
self.accelerator = Accelerator(**accelerate_kwargs)
def _post_init(self):
if importlib.util.find_spec("megatron") is None or importlib.util.find_spec("megatron.data") is None:
raise EnvironmentError("You must use '--no-use-pep517' to pip install nvidia's megatron from source.")
if importlib.util.find_spec("openmind_accelerate") is None:
raise EnvironmentError("You must pip install openmind_accelerate.")
import openmind_accelerate # noqa:F401
if self.accelerator is None:
self._make_accelerator()
if self.accelerator.gradient_accumulation_steps != 1:
raise ValueError(
"When using Megatron, gradient accumulation is done in Megatron, "
"so gradient_accumulation_steps in Accelerator needs to be set to 1."
)
if self.train_dataloader is None:
if not self.pretrain_args.data_path:
raise ValueError("`PreTrainer` requires either a `train_dataloader` or `args.data_path` argument")
self._make_megatron_dataloader()
self.accelerator.state.megatron_lm_plugin.megatron_lm_default_args["train_iters"] = (
self.pretrain_args.num_training_steps
)
if self.model is None:
if not self.pretrain_args.openmind_model_path:
raise ValueError("`PreTrainer` requires either a `model` or `args.openmind_model_path` argument")
self._read_model()
self._prepare()
self._init_trackers()
def _pre_training(self):
from megatron import get_args
super()._pre_training()
args = get_args()
self.model.iteration = args.iteration
self.completed_steps = args.iteration
def _eval(self, eval_dataloader, completed_steps=None):
from megatron import get_args
if completed_steps is not None:
self.completed_steps = completed_steps
args = get_args()
losses = []
iteration = 0
for _, batch in enumerate(eval_dataloader):
outputs = self._eval_step(batch)
loss = outputs.loss
losses.append(self._get_eval_loss(loss))
iteration += 1
if iteration >= args.eval_iters:
break
self._eval_log(losses=losses)
def _get_gradient_accumulation_steps(self):
return self.accelerator.state.megatron_lm_plugin.num_micro_batches
def _get_batch_loss_avg(self, batch_loss_sum):
return batch_loss_sum
def _get_lr(self):
return self.lr_scheduler.get_lr()
def _get_eval_loss(self, loss):
return loss
def _handle_eval_losses(self, losses):
return torch.tensor(losses)
def _save(self, save_dir):
self.accelerator.save_state(save_dir)
def _read_model(self):
model_config = AutoConfig.from_pretrained(self.pretrain_args.openmind_model_path)
with init_empty_weights():
self.model = AutoModelForCausalLM.from_config(model_config)
self.model.config.use_cache = False
def _prepare(self):
from accelerate.utils import MegatronLMOptimizerWrapper, MegatronLMSchedulerWrapper
        # With Megatron, the same dummy dataloader is deliberately passed for both the train and
        # eval slots; Accelerate's Megatron-LM integration builds the real loaders inside Megatron.
        self.model, self.train_dataloader, self.eval_dataloader = self.accelerator.prepare(
            self.model, self.train_dataloader, self.train_dataloader
        )
self.optimizer = MegatronLMOptimizerWrapper(self.model.optimizer)
self.lr_scheduler = MegatronLMSchedulerWrapper(self.model.scheduler, self.model.optimizer)
class _PreTrainerOther(_PreTrainerCommon):
def _make_accelerator(self):
accelerate_kwargs = {
"log_with": self.pretrain_args.report_to,
"project_dir": self.pretrain_args.save_dir,
"mixed_precision": self.pretrain_args.get_mixed_precision(),
}
self.accelerator = Accelerator(**accelerate_kwargs)
def _post_init(self):
if self.accelerator is None:
self._make_accelerator()
if self.train_dataloader is None:
raise ValueError("When not using Megatron, `PreTrainer` requires `train_dataloader`")
if self.optimizer is None:
raise ValueError("When not using Megatron, `PreTrainer` requires `optimizer`")
if self.lr_scheduler is None:
raise ValueError("When not using Megatron, `PreTrainer` requires `lr_scheduler`")
if self.model is None:
if not self.pretrain_args.openmind_model_path:
raise ValueError("`PreTrainer` requires either a `model` or `args.openmind_model_path` argument")
self._read_model()
self._prepare()
self._init_trackers()
def _get_gradient_accumulation_steps(self):
return self.accelerator.gradient_accumulation_steps
def _get_batch_loss_avg(self, batch_loss_sum):
return batch_loss_sum / self._get_gradient_accumulation_steps()
def _get_lr(self):
return self.lr_scheduler.get_last_lr()[0]
def _get_eval_loss(self, loss):
return self.accelerator.gather_for_metrics(loss.repeat(self.pretrain_args.batch_size))
def _handle_eval_losses(self, losses):
return torch.cat(losses)
def _save(self, save_dir):
unwrapped_model = self.accelerator.unwrap_model(self.model)
unwrapped_model.save_pretrained(
save_dir, is_main_process=self.accelerator.is_main_process, save_function=self.accelerator.save
)
def _read_model(self):
self.model = AutoModelForCausalLM.from_pretrained(
self.pretrain_args.openmind_model_path,
torch_dtype=self.pretrain_args.get_torch_dtype(),
)
self.model.gradient_checkpointing_enable()
self.model.config.use_cache = False
def _prepare(self):
if self.eval_dataloader:
(
self.model,
self.train_dataloader,
self.eval_dataloader,
self.optimizer,
self.lr_scheduler,
) = self.accelerator.prepare(
self.model, self.train_dataloader, self.eval_dataloader, self.optimizer, self.lr_scheduler
)
else:
self.model, self.train_dataloader, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
self.model, self.train_dataloader, self.optimizer, self.lr_scheduler
)
class PreTrainer(_PreTrainerCommon):
def __new__(cls, *args, **kwargs):
if os.environ.get("ACCELERATE_USE_MEGATRON_LM", "false") == "true":
return _PreTrainerMegatron(*args, **kwargs)
return _PreTrainerOther(*args, **kwargs)
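
The dispatch above keys solely off the `ACCELERATE_USE_MEGATRON_LM` environment variable (set by `accelerate launch` when a Megatron-LM config is used), so callers always construct `PreTrainer` rather than the private classes. Below is a minimal sketch of the non-Megatron path, not a verbatim usage from the repository: the model path and `raw_text_dataset` are placeholders, and transformers' Auto classes are assumed for loading the placeholder model.

```python
import os

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

from openmind import PreTrainer, PreTrainingArguments

os.environ["ACCELERATE_USE_MEGATRON_LM"] = "false"  # select the _PreTrainerOther branch

args = PreTrainingArguments(
    num_training_steps=100,
    micro_batch_size=4,
    dp=1,
    gradient_accumulation_steps=8,
    save_dir="./pretrain_output",  # placeholder output directory
    dtype="bf16",
)

model = AutoModelForCausalLM.from_pretrained("./my-causal-lm")  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained("./my-causal-lm")


def collate(texts):
    # Turn raw text into the keyword arguments the model's forward pass expects.
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)
    enc["labels"] = enc["input_ids"].clone()
    return enc


# raw_text_dataset is a user-defined dataset of strings.
train_dataloader = DataLoader(raw_text_dataset, batch_size=args.micro_batch_size, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=args.num_training_steps)

trainer = PreTrainer(
    pretrain_args=args,
    model=model,
    train_dataloader=train_dataloader,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
)
trainer.train()
```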

View File

@ -1,39 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
import logging
import os
import torch
from openmind.utils.logging import get_logger
openmind_logger = get_logger(__name__)
openmind_logger.setLevel(logging.INFO)
def print_in_main_process(msg):
local_rank = int(os.environ.get("LOCAL_RANK", -1))
if local_rank in [0, -1]:
openmind_logger.info(msg)
def is_last_rank():
return torch.distributed.get_rank() == (torch.distributed.get_world_size() - 1)
def print_in_last_rank(msg):
if torch.distributed.is_initialized():
if is_last_rank():
openmind_logger.info(msg)
else:
openmind_logger.info(msg)
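
These helpers keep multi-process logs to a single line per message: `print_in_main_process` checks the `LOCAL_RANK` environment variable (set by launchers such as `torchrun`), while `print_in_last_rank` goes through `torch.distributed` when a process group is initialized. A minimal sketch of their behavior, assuming the module path used by the repository's tests and a two-process launch (`log_demo.py` is a placeholder file name):

```python
# Run with: torchrun --nproc_per_node=2 log_demo.py
import torch.distributed as dist

from openmind.archived.trainers.pretrainer_utils import print_in_last_rank, print_in_main_process

dist.init_process_group(backend="gloo")  # "gloo" keeps the sketch CPU-only

print_in_main_process("printed once, by LOCAL_RANK 0")    # other ranks stay silent
print_in_last_rank("printed once, by the highest rank")   # here, rank 1 of 2

dist.destroy_process_group()
```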

View File

@ -1,115 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
import dataclasses
from dataclasses import dataclass, field
import re
import warnings
import torch
import yaml
from .pretrainer_utils import print_in_main_process
warnings.warn(
"The class 'PreTrainingArguments' is deprecated and will be removed in version 1.1.0. ",
FutureWarning,
)
_dtype_map = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}
@dataclass
class PreTrainingArguments:
    num_training_steps: int = field(metadata={"help": "Total number of steps to train the model."})
micro_batch_size: int = field(metadata={"help": "Batch size per model instance."})
dp: int = field(metadata={"help": "Degree of Parallelism."})
gradient_accumulation_steps: int = field(
default=1, metadata={"help": "The number of gradient steps to accumulate before updating the model parameters."}
)
seq_length: int = field(default=None, metadata={"help": "Maximum sequence length to process."})
    megatron_dataset_flag: bool = field(
        default=None, metadata={"help": "Whether the dataset is in Megatron format."}
    )
data_path: str = field(default=None, metadata={"help": "Path to the training dataset."})
save_dir: str = field(default=None, metadata={"help": "Output directory to save checkpoints to."})
save_interval: int = field(default=None, metadata={"help": "Number of iterations between checkpoint saves."})
eval_interval: int = field(
default=None, metadata={"help": "Interval between running evaluation on validation set."}
)
    openmind_model_path: str = field(default=None, metadata={"help": "The path of the openMind model to be trained."})
dtype: str = field(default="bf16", metadata={"help": "The dtype mode that the model is running on."})
plugin_args: dict = field(default=None, metadata={"help": "Parameters related to accelerate plugins."})
dataloader_config: dict = field(default=None, metadata={"help": "The parameters of dataloader."})
    report_to: str = field(default=None, metadata={"help": "The tracker integration that Accelerate reports logs to."})
    project_name: str = field(default="accelerate-megatron", metadata={"help": "The name of the project."})
@staticmethod
def from_yaml(config_path: str):
with open(config_path, "r") as file:
config_data = yaml.safe_load(file)
return PreTrainingArguments(**config_data)
def __post_init__(self):
self.batch_size = self.micro_batch_size * self.gradient_accumulation_steps * self.dp
if self.data_path is not None and self.megatron_dataset_flag is None:
raise ValueError(
"Since you filled in data_path in PreTrainArguments, you have to specify the "
"megatron_dataset_flag parameter at the same time."
)
self.dtype = self.dtype.lower()
if self.dtype not in _dtype_map:
raise ValueError(f"Unknown dtype:{self.dtype}. Supported dtypes:{','.join(_dtype_map.keys())}")
for f in dataclasses.fields(self):
value = getattr(self, f.name)
if value:
if f.type is str:
if re.match(r"^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)$", value):
setattr(self, f.name, float(value))
print_in_main_process(
f"WARNING: PreTrainingArguments transferring the type of {f.name} from str to float!"
)
if f.type is dict:
self._scientific_str_to_float(value)
def get_mixed_precision(self):
if self.dtype == "fp32":
return "no"
return self.dtype
def get_torch_dtype(self):
return _dtype_map.get(self.dtype)
def get_distributed_train_args(self):
return self.plugin_args.copy()
def update_distributed_train_args(self, extra_args: dict):
self.plugin_args.update(extra_args)
def get_dataloader_config(self):
return self.dataloader_config.copy()
def _scientific_str_to_float(self, config_dict: dict):
for key, value in config_dict.items():
if isinstance(value, str):
if re.match(r"^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)$", value):
config_dict[key] = float(value)
print_in_main_process(
f"WARNING: PreTrainingArguments transferring the type of {key} from str to float!"
)
if isinstance(value, dict):
self._scientific_str_to_float(value)
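
`from_yaml` feeds the parsed YAML mapping straight into the dataclass constructor, so the top-level keys must match the field names; `__post_init__` then derives `batch_size` and converts scientific-notation strings inside nested dicts such as `plugin_args`. A minimal sketch with illustrative values (the file name is arbitrary):

```python
from openmind import PreTrainingArguments

yaml_text = """\
num_training_steps: 1000
micro_batch_size: 4
dp: 1
gradient_accumulation_steps: 8
seq_length: 2048
dtype: bf16
save_dir: ./pretrain_output
plugin_args:
  lr: 1.0e-5          # parsed by YAML as a float
  min_lr: "1e-6"      # quoted string; __post_init__ converts it to 1e-06
"""

with open("pretrain_config.yaml", "w") as f:
    f.write(yaml_text)

args = PreTrainingArguments.from_yaml("pretrain_config.yaml")
print(args.batch_size)             # 32 = micro_batch_size * gradient_accumulation_steps * dp
print(args.plugin_args["min_lr"])  # 1e-06, converted from the quoted string
```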

View File

@ -1,238 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
import dataclasses
from unittest import TestCase, skip
from unittest.mock import MagicMock
import pytest
from tests.utils_for_test import require_torch
class TestPreTrainerCommon(TestCase):
@pytest.fixture(scope="function", autouse=True)
def global_setup(self):
from openmind import PreTrainer, PreTrainingArguments
self.pretrain_args = PreTrainingArguments(
num_training_steps=10,
micro_batch_size=4,
dp=1,
gradient_accumulation_steps=8,
seq_length=4096,
megatron_dataset_flag=True,
data_path="llama2-mt_text_document",
save_dir="llama-2-7b-hf_save",
save_interval=10000,
eval_interval=10000,
openmind_model_path="hf",
dtype="bf16",
plugin_args={
"tp_degree": 4,
"pp_degree": 1,
"num_micro_batches": 8,
"gradient_clipping": 1.0,
"use_distributed_optimizer": False,
"sequence_parallelism": False,
"other_megatron_args": {
"tokenizer_model": "tokenizer.model",
"tokenizer_type": "Llama2Tokenizer",
"finetune": False,
"recompute_granularity": "full",
"recompute_method": "block",
"recompute_num_layers": 32,
"optimizer": "adam",
"lr": 1e-5,
"min_lr": 1e-6,
"adam_beta2": 0.95,
"add_bias_linear": False,
"async_tensor_model_parallel_allreduce": True,
"attention_dropout": 0.0,
"attention_softmax_in_fp32": True,
"bias_gelu_fusion": False,
"ffn_hidden_size": 11008,
"hidden_dropout": 0.0,
"init_method_std": 0.01,
"initial_loss_scale": 65536.0,
"lr_decay_style": "cosine",
"lr_warmup_fraction": 0.01,
"masked_softmax_fusion": False,
"normalization": "RMSNorm",
"sequence_parallel": True,
"split": "100,0,0",
"swiglu": True,
"untie_embeddings_and_output_weights": True,
"use_flash_attn": True,
"weight_decay": 0.1,
"no_load_optim": True,
"no_load_rng": True,
"eval_iters": 10000,
"position_embedding_type": "rope",
},
},
dataloader_config={
"data_path": ["llama2-mt_text_document"],
"seq_length": 4096,
"micro_batch_size": 4,
"split": "100,0,0",
"eval_iters": 10000,
"tokenizer_model": "tokenizer.model",
"tokenizer_type": "Llama2Tokenizer",
},
)
self.pretrainer = PreTrainer
self.obj = MagicMock()
self.obj.pretrain_args = self.pretrain_args
self.obj.accelerate = MagicMock()
def test_init_trackers(self):
self.obj.accelerator.init_trackers = MagicMock()
self.pretrainer._init_trackers(self.obj)
self.obj.accelerator.init_trackers.assert_called_once_with(
self.obj.pretrain_args.project_name, dataclasses.asdict(self.obj.pretrain_args)
)
def test_get_gradient_accumulation_steps(self):
with self.assertRaises(NotImplementedError):
self.pretrainer._get_gradient_accumulation_steps(self.obj)
def test_get_batch_loss_avg(self):
batch_loss_sum = 100.0
with self.assertRaises(NotImplementedError):
self.pretrainer._get_batch_loss_avg(self.obj, batch_loss_sum)
def test_get_lr(self):
with self.assertRaises(NotImplementedError):
self.pretrainer._get_lr(self.obj)
def test_train(self):
self.obj.completed_steps = 0
self.obj.train_dataloader = [MagicMock()]
self.obj.eval_dataloader = MagicMock()
self.obj._pre_training = MagicMock()
self.obj._train_step = MagicMock()
self.obj._get_lr = MagicMock(return_value=0.001)
self.obj._get_batch_loss_avg = MagicMock(return_value=0.5)
self.obj._train_step_log = MagicMock()
self.obj._save_state = MagicMock()
self.obj._eval = MagicMock()
self.obj._post_training = MagicMock()
self.obj.accelerate.sync_gradients = True
self.pretrainer.train(self.obj)
self.assertTrue(self.obj._pre_training.called)
self.assertTrue(self.obj.accelerator.end_training.called)
self.assertTrue(self.obj.accelerator.wait_for_everyone.called)
self.assertTrue(self.obj._post_training.called)
def test_train_step(self):
self.obj.model = MagicMock()
self.obj.optimizer = MagicMock()
self.obj.lr_scheduler = MagicMock()
batch = {"input": "data"}
outputs = self.pretrainer._train_step(self.obj, batch)
self.obj.model.train.assert_called_once()
self.obj.accelerator.accumulate.assert_called_once_with(self.obj.model)
self.obj.model.assert_called_once_with(**batch)
self.obj.accelerator.backward.assert_called_once_with(outputs.loss)
self.obj.optimizer.step.assert_called_once()
self.obj.lr_scheduler.step.assert_called_once()
self.obj.optimizer.zero_grad.assert_called_once()
self.assertEqual(outputs, self.obj.model.return_value)
def test_train_step_log(self):
loss = 0.123
lr = 0.001
elapsed_time = 10.5
step = 100
self.pretrainer._train_step_log(self.obj, loss, lr, elapsed_time, step)
self.obj.accelerator.log.assert_called_with({"train_loss": loss, "learning_rate": lr}, step=step)
def test_pre_training(self):
self.obj._print_training_info = MagicMock()
self.pretrainer._pre_training(self.obj)
self.obj._print_training_info.assert_called_once()
self.assertEqual(self.obj.completed_steps, 0)
def test_post_training(self):
self.obj._save = MagicMock()
self.pretrainer._post_training(self.obj)
self.obj._save.assert_called_once_with(save_dir=self.obj.pretrain_args.save_dir)
def test_get_eval_loss(self):
loss = 1.0
with self.assertRaises(NotImplementedError):
self.pretrainer._get_eval_loss(self.obj, loss)
@skip
def test_eval(self):
eval_dataloader = [(1, "batch1"), (2, "batch2")]
completed_steps = 100
self.obj._eval_log = MagicMock()
self.pretrainer._eval(self.obj, eval_dataloader, completed_steps)
self.obj._eval_log.assert_called_once()
@require_torch
def test_eval_step(self):
import torch
batch = {"input_ids": torch.tensor([[1, 2, 3]]), "attention_mask": torch.tensor([[1, 1, 1]])}
self.obj.model = MagicMock()
outputs = self.pretrainer._eval_step(self.obj, batch)
self.assertTrue(self.obj.model.eval.called)
self.assertIsNotNone(outputs)
def test_handle_eval_losses(self):
losses = [0.1, 0.2]
with self.assertRaises(NotImplementedError):
self.pretrainer._handle_eval_losses(self.obj, losses)
@require_torch
def test_eval_log(self):
import torch
losses = torch.tensor([0.5, 0.3])
self.obj._handle_eval_losses = MagicMock(return_value=losses)
self.obj.accelerator.log = MagicMock()
self.pretrainer._eval_log(self.obj, losses)
self.assertTrue(self.obj._handle_eval_losses.called)
self.assertEqual(self.obj.accelerator.log.call_count, 1)
def test_save_state(self):
save_dir = "/path/to/save"
self.obj.accelerator.save_state = MagicMock()
self.pretrainer._save_state(self.obj, save_dir)
self.obj.accelerator.save_state.assert_called_once_with(save_dir)
def test_save(self):
save_dir = "/path/to/save"
with self.assertRaises(NotImplementedError):
self.pretrainer._save(self.obj, save_dir)
def test_read_model(self):
with self.assertRaises(NotImplementedError):
self.pretrainer._read_model(self.obj)
def test_prepare(self):
with self.assertRaises(NotImplementedError):
self.pretrainer._prepare(self.obj)
def test_make_accelerator(self):
with self.assertRaises(NotImplementedError):
self.pretrainer._make_accelerator(self.obj)

View File

@ -1,40 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
import os
import logging
from unittest.mock import patch
from openmind.utils.logging import get_logger
from tests.utils_for_test import require_torch
openmind_logger = get_logger(__name__)
openmind_logger.setLevel(logging.INFO)
def test_print_in_main_process_with_local_rank_0(caplog):
caplog.set_level(logging.INFO)
with patch.dict(os.environ, {"LOCAL_RANK": "0"}):
openmind_logger.info("Test message.")
log_msg = [record.message for record in caplog.records]
assert "Test message." in log_msg
@require_torch
def test_print_in_main_process_with_local_rank_1(capsys):
from openmind.archived.trainers.pretrainer_utils import print_in_main_process
with patch.dict(os.environ, {"LOCAL_RANK": "1"}):
print_in_main_process("Test message.")
captured = capsys.readouterr()
assert captured.out == ""

View File

@ -1,68 +0,0 @@
# Copyright (c) 2024 Huawei Technologies Co., Ltd.
#
# openMind is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
#
# http://license.coscl.org.cn/MulanPSL2
#
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
from unittest import TestCase
import pytest
from tests.utils_for_test import require_torch
@require_torch
class TestPreTrainingArguments(TestCase):
@pytest.fixture(scope="function", autouse=True)
def global_setup(self):
from openmind import PreTrainingArguments
self.pretrain_args = PreTrainingArguments(
num_training_steps=1000,
micro_batch_size=4,
dp=1,
gradient_accumulation_steps=8,
seq_length=2048,
megatron_dataset_flag=True,
data_path="DATA_PATH",
save_dir="SAVE_PATH",
save_interval=10000,
eval_interval=0,
openmind_model_path="BASE_MODEL",
plugin_args={"lr": 1.23e-4},
dataloader_config={"batch": 20},
)
def test_from_yaml(self):
config_path = "CONFIG_PATH"
try:
self.pretrain_args.from_yaml(config_path)
except Exception as exception:
self.assertIsInstance(exception, FileNotFoundError)
def test_get_torch_dtype(self):
import torch
self.assertEqual(self.pretrain_args.get_torch_dtype(), torch.bfloat16)
def test_get_distributed_train_args(self):
self.assertEqual(self.pretrain_args.get_distributed_train_args()["lr"], 1.23e-4)
def test_update_distributed_train_args(self):
self.pretrain_args.update_distributed_train_args({"tp_degree": 4})
self.assertEqual(self.pretrain_args.plugin_args["lr"], 1.23e-4)
self.assertEqual(self.pretrain_args.plugin_args["tp_degree"], 4)
def test_get_dataloader_config(self):
self.assertEqual(self.pretrain_args.get_dataloader_config()["batch"], 20)
def test_scientific_str_to_float(self):
self.pretrain_args._scientific_str_to_float(self.pretrain_args.plugin_args)
self.assertEqual(self.pretrain_args.plugin_args["lr"], 0.000123)