openmind/docs/zh/best_practice/deepseek_r1.md
A1waysBeenHere 75622d3079 !103 fix typo
Merge pull request !103 from A1waysBeenHere/master
2025-02-28 08:31:29 +00:00


Model Distillation and Fine-Tuning of DeepSeek-R1-Distill Series Models on NPUs

This tutorial describes how to use the openMind toolkit on NPUs to perform model distillation and to fine-tune DeepSeek-R1-Distill series LLMs.


Model Distillation

Environment Setup

For the basic environment configuration, see the environment setup documentation.

git clone https://gitee.com/ascend/openmind.git
cd openmind
pip install .[pt]

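Model Preparation

Taking Qwen2.5-7B-Instruct as an example: openMind has built-in support for Qwen2.5-7B-Instruct, and the model weight files are downloaded automatically during the training workflow.

You can also download the model from the Modelers community (魔乐社区) using git with LFS. openMind accepts a local weight path as the model argument.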

git clone https://modelers.cn/AI-Research/Qwen2.5-7B-Instruct.git

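Dataset Configuration

Fine-tuning supports the two mainstream formats, alpaca and sharegpt, stored as JSON files.

Method 1: Use the built-in dataset

openMind ships with a built-in CoT dataset, Sky-T1_data_17k, generated by a large-parameter model, which users can use directly.

Method 2: Use a custom dataset

openMind also supports model distillation with datasets that users generate themselves with DeepSeek-R1. The following is a multi-turn sample of a generated O1-style chain of thought in sharegpt format: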

    {
        "system": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:",
        "conversations": [
            {
                "from": "user",
                "value": "What is the critical temperature of a superconducting thin film made of lead with a thickness of 100 nm? And, how does the critical temperature vary with the thickness of the lead film, assuming that the film is perfectly clean and free from any defects?"
            },
            {
                "from": "assistant",
                "value": "<|begin_of_thought|>\n\nAlright, I'm trying to figure out the critical temperature of a superconducting thin film made of lead with a thickness of 100 nm. I know that lead is a type II superconductor, and its bulk critical temperature is around 7.2 K. But since this is a thin film, I wonder if the thickness affects the critical temperature.\n\nI recall that in thin films, the critical temperature can be lower than the bulk value due to the reduction in dimensionality and possible quantum effects. There might be some formulas or models that describe how Tc changes with thickness.\n\nI should look into the Ginzburg-Landau theory or maybe the dirty limit theory for superconductors. Also, I need to consider if the film is perfectly clean and free from defects, which might simplify things.\n\nPerhaps there's a relation that considers the thickness and the coherence length of lead. I need to find the coherence length for lead at its bulk Tc.\n\nWait, the coherence length (xi) is an important parameter in superconductivity. For lead, I think it's on the order of nanometers. If the film thickness is comparable to the coherence length, there might be significant effects on Tc.\n\nI should also consider the surface effects in thin films. Since the film is only 100 nm thick, the surface-to-volume ratio is high, which could influence the superconducting properties.\n\nMaybe there's a specific equation that relates Tc to the thickness (d) for thin films. I think it might involve the ratio of d to the coherence length.\n\nI should check if the film is in the regime where the reduced dimensionality causes a depression of Tc. There might be a power-law dependence or something similar.\n\nI also need to remember that the critical temperature can be affected by the geometry of the film. A thin film might have different behavior compared to a bulk sample.\n\nLet me try to find a formula or a known result for the critical temperature of thin lead films. 
I believe it's been studied before.\n\nAfter some research, I found that for thin films, the critical temperature can be lower than the bulk value, and the dependence can be approximated by Tc = Tc0 * (1 - (d0/d)^n), where d0 is a characteristic thickness and n is an exponent.\n\nBut I'm not sure about the exact values for lead. I need to find specific values for d0 and n for Pb films.\n\nAlternatively, there might be a more precise model that takes into account the coherence length and the thickness more directly.\n\nI should also consider if the film is clean, as mentioned, which might mean that impurities aren't a factor, so the main effect comes from the reduced dimensionality.\n\nMaybe I can look into the work by Giaever or other researchers who studied superconducting thin films.\n\nUpon further investigation, I found that for very thin films, the critical temperature can be suppressed due to the finite size effects and the energy associated with the superconducting state.\n\nThere might be a dependence where Tc decreases as the thickness decreases below a certain value, which could be related to the coherence length.\n\nIn some cases, for extremely thin films, Tc can approach zero, but for 100 nm, it's probably still significantly above zero.\n\nI need to find a quantitative way to express how Tc varies with thickness for lead films.\n\nPerhaps there is an empirical formula or some theoretical prediction that I can use.\n\nAlternatively, I can look into the concept of the superconducting gap and how it might be affected by the film thickness.\n\nWait, I should consider the energy gap in the superconducting state and how thickness might influence that.\n\nBut I'm getting a bit confused. 
Let me try to summarize:\n\n\\- Bulk Tc of lead is 7.2 K.\n\n\\- Thin films can have lower Tc due to reduced dimensionality.\n\n\\- The dependence might be approximated by Tc = Tc0 * (1 - (d0/d)^n).\n\n\\- I need to find values specific to lead films.\n\n\\- Alternatively, consider the coherence length and thickness ratio.\n\nGiven that the coherence length for lead is around 40 nm at T=0 K and increases with temperature, for a 100 nm film, the ratio d/xi is around 2.5, which might not be extremely small, so the effect on Tc might not be too drastic.\n\nHowever, I'm not entirely sure about the exact relationship.\n\nI should also consider if the film is in the dirty limit or clean limit. Since it's perfectly clean and free from defects, it might be in the clean limit, which could have different implications for Tc.\n\nMaybe I can look up some references or textbooks on superconductivity in thin films to get a more precise answer.\n\nAfter checking some resources, I found that for thin films, the critical temperature can be described by the relation Tc = Tc0 * (1 - (xi/d)^2), where xi is the coherence length.\n\nAssuming xi is around 40 nm for lead, and d is 100 nm, then Tc = 7.2 * (1 - (40/100)^2) = 7.2 * (1 - 0.16) = 7.2 * 0.84 = 6.048 K.\n\nSo, the critical temperature for a 100 nm thick lead film would be approximately 6.05 K.\n\nAs for how Tc varies with thickness, according to this relation, Tc increases as the thickness increases, approaching the bulk Tc value as d becomes much larger than xi.\n\nTherefore, Tc decreases as the thickness decreases below a certain value, and the relation is roughly Tc = Tc0 * (1 - (xi/d)^2).\n\nI think this makes sense, and it's a reasonable estimate for the critical temperature of a 100 nm thick lead film.<|end_of_thought|>\n\n<|begin_of_solution|>\n\nThe critical temperature of a superconducting thin film made of lead with a thickness of 100 nm can be estimated using the relation:\n\n$$T_c = T_{c0} \\left( 1 - \\left( 
\\frac{\\xi}{d} \\right)^2 \\right)$$\n\nwhere:\n\n- $T_{c0}$ is the bulk critical temperature of lead (7.2 K),\n\n- $\\xi$ is the coherence length (approximately 40 nm for lead),\n\n- $d$ is the thickness of the film (100 nm).\n\nSubstituting the values:\n\n$$T_c = 7.2 \\left( 1 - \\left( \\frac{40}{100} \\right)^2 \\right) = 7.2 \\left( 1 - 0.16 \\right) = 7.2 \\times 0.84 = 6.048 \\, \\text{K}$$\n\nThus, the critical temperature for a 100 nm thick lead film is approximately:\n\n\\boxed{6.05 , \\text{K}}\n\nAs for how the critical temperature varies with thickness, $T_c$ increases as the thickness increases, approaching the bulk $T_{c0}$ value as $d$ becomes much larger than $\\xi$. The relationship is given by:\n\n$$T_c = T_{c0} \\left( 1 - \\left( \\frac{\\xi}{d} \\right)^2 \\right)$$\n\nThis indicates that $T_c$ decreases as the thickness decreases below a certain value.<|end_of_solution|>"
            }
        ]
    }
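The record layout above can be sketched in Python as follows. This is a minimal sketch, not an openMind API: `build_record` and `check_record` are hypothetical helpers, the strict user/assistant alternation is an assumption, and wrapping the records in a JSON array is assumed to be the on-disk layout.

```python
import json

# Delimiters taken from the system prompt in the sample above.
THOUGHT_OPEN, THOUGHT_CLOSE = "<|begin_of_thought|>", "<|end_of_thought|>"
SOLUTION_OPEN, SOLUTION_CLOSE = "<|begin_of_solution|>", "<|end_of_solution|>"
MARKERS = (THOUGHT_OPEN, THOUGHT_CLOSE, SOLUTION_OPEN, SOLUTION_CLOSE)


def build_record(system, question, thought, solution):
    """Assemble one sharegpt-style record (hypothetical helper)."""
    answer = (
        f"{THOUGHT_OPEN}\n\n{thought}{THOUGHT_CLOSE}\n\n"
        f"{SOLUTION_OPEN}\n\n{solution}{SOLUTION_CLOSE}"
    )
    return {
        "system": system,
        "conversations": [
            {"from": "user", "value": question},
            {"from": "assistant", "value": answer},
        ],
    }


def check_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    turns = record.get("conversations", [])
    if not turns:
        problems.append("no conversations")
    for i, turn in enumerate(turns):
        # Assumption: turns alternate user, assistant, user, ...
        expected = "user" if i % 2 == 0 else "assistant"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected role {expected!r}")
        if turn.get("from") == "assistant":
            for marker in MARKERS:
                if marker not in turn.get("value", ""):
                    problems.append(f"turn {i}: missing {marker}")
    return problems


record = build_record("You are a careful assistant.", "What is 2 + 2?", "2 + 2 = 4.", "4")
# Assumption: the data file holds a JSON array of such records.
print(json.dumps([record], ensure_ascii=False, indent=4))
```

Running a check like this over generated data before training helps catch truncated responses that lost their closing markers.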

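Once the training data is prepared in this format, write a custom_dataset_info.json file that describes the data. The configuration below states that the dataset is located at /data/custom_data.json, maps the data's field names to the standard names, and names the dataset custom_data_name.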

{
  "custom_data_name": {
    "file_name": "custom_data.json",
    "local_path": "/data/",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "system": "system"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }
}
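To illustrate what the `columns` and `tags` mappings express, the sketch below applies them to one record and produces a list of role/content messages. It is an illustration under assumptions, not openMind's loader: `data_path` and `normalize` are hypothetical names, and the `role`/`content` output keys are chosen only for this example.

```python
import os

# Same content as the custom_dataset_info.json entry above.
dataset_info = {
    "file_name": "custom_data.json",
    "local_path": "/data/",
    "formatting": "sharegpt",
    "columns": {"messages": "conversations", "system": "system"},
    "tags": {
        "role_tag": "from",
        "content_tag": "value",
        "user_tag": "user",
        "assistant_tag": "assistant",
        "system_tag": "system",
    },
}


def data_path(info):
    """Join local_path and file_name into the data file location."""
    return os.path.join(info["local_path"], info["file_name"])


def normalize(record, info):
    """Map a raw record onto role/content messages using the declared names."""
    cols, tags = info["columns"], info["tags"]
    role_names = {
        tags["user_tag"]: "user",
        tags["assistant_tag"]: "assistant",
        tags["system_tag"]: "system",
    }
    messages = []
    system_text = record.get(cols["system"])
    if system_text:
        # The top-level system field becomes the first message.
        messages.append({"role": "system", "content": system_text})
    for turn in record[cols["messages"]]:
        role = role_names[turn[tags["role_tag"]]]
        messages.append({"role": role, "content": turn[tags["content_tag"]]})
    return messages


raw = {
    "system": "Be concise.",
    "conversations": [
        {"from": "user", "value": "Hi"},
        {"from": "assistant", "value": "Hello"},
    ],
}
print(data_path(dataset_info))
print(normalize(raw, dataset_info))
```

If a custom dataset used different key names (for example `role` and `text` instead of `from` and `value`), only the `tags` entries would need to change.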

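For more information about datasets, see the documentation.

Training Configuration and Launch

openMind provides a low-code, configuration-driven way to launch training: you only need to write a train_sample.yaml configuration file that defines the parameters required for training.

Taking the Qwen2.5-7B-Instruct distillation task as an example, openMind already provides a sample script. Run the following command from the openMind source directory, using ASCEND_RT_VISIBLE_DEVICES to control the number and IDs of the NPU devices.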

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 openmind-cli train examples/qwen2.5/train_distill_full_qwen2_5_7b_instruct.yaml
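When launching from a script, the same environment variable can be set programmatically. The following is a minimal sketch: `build_launch` is a hypothetical helper, and it only assembles the command and environment shown above without running anything.

```python
import os


def build_launch(device_ids, config_path):
    """Return (command, environment) for a training run on the given NPU IDs."""
    env = dict(os.environ)
    # Comma-separated device IDs, as in the shell command above.
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)
    cmd = ["openmind-cli", "train", config_path]
    return cmd, env


cmd, env = build_launch(
    range(4), "examples/qwen2.5/train_distill_full_qwen2_5_7b_instruct.yaml"
)
print(env["ASCEND_RT_VISIBLE_DEVICES"], " ".join(cmd))
# To actually start training: subprocess.run(cmd, env=env, check=True)
```

Here the run is restricted to the first four devices; passing `range(8)` reproduces the eight-device command from the text.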

openMind also lets users configure tasks with a custom yaml file. Full-parameter fine-tuning is used as the example here.

# model
model_id: Qwen2.5-7B-Chat

# method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json

# dataset
dataset: Sky-T1_data_17k
cutoff_len: 1024
template: qwen

# output
output_dir: saves/qwen2.5_7b_chat_full
logging_steps: 1
save_steps: 20000
overwrite_output_dir: true

# train
per_device_train_batch_size: 8
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
max_steps: 5000
seed: 1234

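Key parameters: the configuration sets the path of the original model, sets the fine-tuning method to full, sets template to qwen, references the dataset configuration defined in step 3 above, uses the DeepSpeed ZeRO-2 algorithm (see the corresponding configuration file under examples/deepspeed in the openMind source), and sets the model save path to saves/qwen2.5_7b_chat_full.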
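As a rough worked example of what the training values in this configuration imply, the sketch below computes the learning rate at a given step and the number of samples consumed per optimizer step. It assumes the common linear-warmup-then-cosine-decay shape used by Hugging Face style trainers, 8 devices as in the launch command, and a gradient accumulation of 1 (not set in the configuration). Whether openMind's scheduler matches this exactly is an assumption.

```python
import math


def lr_at(step, base_lr=1.0e-5, max_steps=5000, warmup_ratio=0.1):
    """Learning rate at `step` with linear warmup followed by cosine decay."""
    warmup_steps = round(max_steps * warmup_ratio)  # 500 steps for these values
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def samples_per_step(per_device_batch=8, num_devices=8, grad_accum=1):
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * num_devices * grad_accum


for step in (0, 250, 500, 2750, 5000):
    print(step, lr_at(step))
print("samples per step:", samples_per_step())
```

Under these assumptions the rate peaks at 1.0e-5 after 500 warmup steps, and 5000 steps at 64 samples per step process 320,000 samples, which is many passes over a 17k-sample dataset.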

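For the complete parameter reference, see the documentation.

To enable LoRA parameter-efficient fine-tuning, simply change finetuning_type to lora. Correspondingly, only the LoRA parameters are saved during training. A sample parameter-efficient fine-tuning file is shown below.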

# model
model_id: Qwen2.5-7B-Chat

# method
stage: sft
do_train: true
finetuning_type: lora
deepspeed: examples/deepspeed/ds_z2_config.json

# dataset
dataset: Sky-T1_data_17k
cutoff_len: 1024
template: qwen

# output
output_dir: saves/qwen2.5_7b_chat_lora
logging_steps: 1
save_steps: 20000
overwrite_output_dir: true

# train
per_device_train_batch_size: 8
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
max_steps: 5000
seed: 1234

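During training, logs that include the loss and other metrics are printed. Once training completes, the fine-tuned model weights are available in the output_dir set in the configuration.

Distillation Results

After distillation, the model's scores improve significantly on some evaluation tasks. Taking Qwen2.5-7B-Instruct as an example, the scores on some evaluation tasks before and after distillation are as follows.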

| Evaluation task | Before distillation | After distillation |
| --- | --- | --- |
| mgsm_cot_native | qwen2.5_7b_instruct_mgsm_cot_native.png | qwen2.5_7b_distill_mgsm_cot_native.png |
| mgsm_direct | qwen2.5_7b_instruct_mgsm_direct.png | qwen2.5_7b_distill_mgsm_direct.png |

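Fine-Tuning DeepSeek-R1-Distill Models

For environment setup, model download, and dataset configuration, see the sections above.

LoRA Parameter-Efficient Fine-Tuning

openMind provides a sample script; simply run the following command.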

openmind-cli train examples/deepseek_r1/train_sft_lora_deepseek_r1.yaml

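Merging LoRA Weights

After LoRA training completes, the LoRA weights must be merged into the base model to obtain the complete weights. You can use openMind's built-in export command. First write an export_sample.yaml: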

model_name_or_path: /model/DeepSeek-R1-Distill-Qwen-7B
adapter_models: saves/deepseek_r1_distill_qwen_7b_lora
template: deepseek_r1
finetuning_type: lora
output_dir: target_path
trust_remote_code: True

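This defines the path of the original weights, the path of the fine-tuned LoRA weights, and the target path for the merged weights. For more parameters, see the documentation.

Once configured, run the following command to merge the model weights.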

openmind-cli export export_sample.yaml
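Conceptually, merging folds the low-rank update into each adapted weight matrix. The sketch below shows the standard LoRA formulation W' = W + (alpha / r) * B * A on tiny matrices in pure Python. It is an illustration only: the rank, alpha, and matrix values are made up, and whether the export command performs exactly this computation internally is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: out_features=2, in_features=3, rank=1, alpha=2.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # (r, in_features)
lora_b = [[0.5], [-1.0]]     # (out_features, r)
merged = merge_lora(w, lora_a, lora_b, alpha=2.0)
print(merged)
```

After merging, a single matrix multiply with the merged weights gives the same output as the base layer plus the scaled adapter path, which is why the exported model no longer needs the adapter files at inference time.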