Mirror of https://github.com/huggingface/transformers.git (synced 2025-11-03 03:14:36 +08:00)

Compare commits: update_lla ... run_tiny_w (7 commits)

| Author | SHA1 | Date |
|---|---|---|
| | e68eefcc2f | |
| | f84c122c04 | |
| | 085ea7e56c | |
| | 7ecd229ba4 | |
| | ced9fd86f5 | |
| | 0e402e1478 | |
| | a5bee89c9d | |
.github/workflows/check_tiny_models.yml (vendored): 2 changed lines
@@ -3,7 +3,7 @@ name: Check Tiny Models

on:
  push:
    branches:
      - check_tiny_models*
      - run_tiny_with_fix_tiny_model_creation
  repository_dispatch:
  schedule:
    - cron: "0 2 * * *"
@@ -47,6 +47,8 @@
      title: Overview
    - local: big_models
      title: Instantiating a big model
    - local: debugging
      title: Debugging
    title: Performance and scalability
  - sections:
    - local: task_summary
docs/source/zh/debugging.md (new file): 308 added lines
@@ -0,0 +1,308 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Debugging

## Debugging multi-GPU network issues

When training or running inference with `DistributedDataParallel` and multiple GPUs, if you run into problems with inter-communication between processes and/or nodes, you can use the following script to diagnose network issues.

```bash
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
```

For example, to test how two GPUs interact, run:

```bash
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

If both processes can talk to each other and allocate GPU memory, each will print an "OK" status.

For more GPUs or nodes, adjust the arguments in the script.

You will find many more details inside the diagnostics script, including instructions on how to run it in a SLURM environment.

An additional level of debugging is to add the `NCCL_DEBUG=INFO` environment variable, as follows:

```bash
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

This dumps a lot of NCCL-related debug information, which you can then search online if you find that some problem is reported. Or, if you're not sure how to interpret the output, you can share the log file in an issue.


## Underflow and overflow detection

<Tip>

This feature is currently available for PyTorch only.

</Tip>

<Tip>

For multi-GPU training it requires DDP (`torch.distributed.launch`).

</Tip>

<Tip>

This feature can be used with any `nn.Module`-based model.

</Tip>

If you start getting `loss=NaN`, or the model exhibits some other abnormal behavior due to `inf` or `nan` in activations or weights, you need to discover where the first underflow or overflow happens and what led to it. Luckily, you can do that automatically by activating a special module that performs the detection.

If you're using [`Trainer`], you just need to add:

```bash
--debug underflow_overflow
```

to the normal command-line arguments, or pass `debug="underflow_overflow"` when creating the [`TrainingArguments`] object.
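For example, a minimal sketch of the [`TrainingArguments`] route (the `output_dir` value here is only a placeholder, not something prescribed by this guide):

```python
from transformers import TrainingArguments

# "./out" is a placeholder output directory; the `debug` argument is what enables the detector
args = TrainingArguments(
    output_dir="./out",
    debug="underflow_overflow",  # activates the underflow/overflow detector during training
)
```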
If you're using your own training loop or another Trainer, you can accomplish the same with:

```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)
```

[`debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each forward call, test the input and output variables as well as the corresponding module's weights. As soon as `inf` or `nan` is detected in at least one element of the activations or weights, the program asserts and prints a report like this (this one was caught with `google/mt5-small` under fp16 mixed precision):

```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
                  encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
             None output[1]
2.25e-01 1.00e+04 output[2]
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
```

The example output has been trimmed in the middle for brevity.

The second column shows the value of the absolute largest element, so if you take a closer look at the last few frames, the inputs and outputs were in the range of `1e4`. So when this training was done under fp16 mixed precision, the very last step overflowed (since under `fp16` the largest number before `inf` is `64e3`). To avoid overflows under `fp16`, the activations must stay well below `1e4`, because `1e4 * 1e4 = 1e8`, so any matrix multiplication with large activations is going to lead to a numerical overflow.
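As a quick standalone illustration of these fp16 limits (not part of the report above):

```python
import torch

# the largest finite value representable in fp16
print(torch.finfo(torch.float16).max)  # 65504.0

# activations around 1e4 overflow as soon as they are multiplied together in fp16
a = torch.tensor(1e4, dtype=torch.float16)
print(a * a)  # tensor(inf, dtype=torch.float16)
```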
At the very start of the trace you can see the batch number where the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).

Each reported frame starts by declaring which module the frame is reporting for. If we look at just this frame:

```
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
```

Here, `encoder.block.2.layer.1.layer_norm` indicates that it was the layer norm of the first layer of the second block of the encoder, and the specific `forward` call was `T5LayerNorm`.

Let's look at the last few frames of that report:

```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
[...]
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
```

The last frame reports for the `Dropout.forward` function, with the first entry being its only input and the second its only output. You can see that it was called from the `dropout` attribute inside the `DenseReluDense` class. We can see that it happened during the first layer of the second block, during the very first batch. Finally, the absolute largest input element was `6.27e+04`, while the output was `inf`.

You can see here that `T5DenseGatedGeluDense.forward` produced output activations whose absolute maximum value was around 62.7K, very close to fp16's top limit of 64K. In the next frame we have `Dropout`, which renormalizes the weights after zeroing out some of the elements, and that pushes the absolute maximum value above 64K, giving us an overflow (`inf`).
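As a small sketch of that renormalization effect (the dropout probability of 0.1 below is made up for illustration): during training, dropout rescales the surviving elements by `1/(1-p)`, which is enough to push a value that is already close to the fp16 ceiling over it:

```python
import torch

x = torch.tensor([6.27e4], dtype=torch.float16)  # close to the fp16 maximum of 65504
scale = 1.0 / (1.0 - 0.1)  # dropout with p=0.1 rescales kept elements by ~1.11
print(x * scale)  # tensor([inf], dtype=torch.float16)
```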
As you can see, it is the previous frames that we need to look at, where the numbers start becoming very large for fp16.

Let's match the report to the code from `models/t5/modeling_t5.py`:

```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```

Now it's easy to see the `dropout` call, and all the previous calls as well.

Since the detection happens in a forward hook, these reports are printed immediately after each `forward` returns.

Going back to the full report, to act on it and fix the problem we need to go a few frames up to where the numbers started to climb, and most likely switch to fp32 mode there so that the numbers don't overflow when multiplied or summed up. Of course, there might be other solutions. For example, we could turn off `amp` temporarily, if it is enabled, after moving the original `forward` into a helper wrapper, like so:

```python
def _forward(self, hidden_states):
    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.wo(hidden_states)
    return hidden_states


import torch


def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)
```

Since the automatic detector only reports the inputs and outputs of full frames, once you know where to look you may also want to analyze the intermediate stages of a specific `forward` function. In that case you can use the `detect_overflow` helper function to inject the detector wherever you want it, for example:

```python
from transformers.debug_utils import detect_overflow


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)
```

You can see that we added two of these detectors, and now we can track whether `inf` or `nan` was detected somewhere in between for `forwarded_states`.

In fact, the detector already reports these, because each of the calls in the example above is an `nn.Module`; but if you had some local direct calculations, this is how you would do it.
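For instance, a hypothetical hand-written computation instrumented directly (the tensor names here are invented for illustration; only `detect_overflow` comes from `transformers.debug_utils`):

```python
import torch
from transformers.debug_utils import detect_overflow

q = torch.randn(8, 64)
k = torch.randn(8, 64)
scores = q @ k.T / 64**0.5  # some local computation that is not wrapped in an nn.Module
detect_overflow(scores, "attention scores")  # prints a report only if inf/nan is present
```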
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames it prints from the default, e.g.:

```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```

### Specific batch absolute min and max value tracing

The same debugging class can be used for per-batch tracing, with the underflow/overflow detection feature turned off.

Let's say you want to watch the absolute minimum and maximum values of all the ingredients of each `forward` call of a given batch, and only do so for batches 1 and 3. Then you instantiate the class as:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
```

Now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector.

Batches are 0-indexed.

This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward right to that area. Here is sample truncated output for such a configuration:

```
                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
     not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                  T5ForConditionalGeneration
     not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```

Here you will get a huge number of frames dumped, as many as there were forward calls in your model, which may or may not be what you want, but sometimes it can be easier to use for debugging than a normal debugger. For example, if a problem starts happening at batch number 150, you can dump the traces for batches 149 and 150 and compare where the numbers started to diverge.
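Continuing that example, a sketch of how that would look (reusing the `model` from the earlier snippets; the batch numbers are just the ones mentioned above):

```python
from transformers.debug_utils import DebugUnderflowOverflow

# trace only the batches around the suspected divergence
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[149, 150])
```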
You can also specify the batch number after which to stop the training, with:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
```
@@ -26,12 +26,14 @@ from ...generation.logits_process import (
    BarkEosPrioritizerLogitsProcessor,
    SuppressTokensLogitsProcessor,
)
from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
from ...modeling_outputs import CausalLMOutputWithPast, MaskedLMOutput
from ...modeling_utils import PreTrainedModel, get_parameter_device
from ...utils import (
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    is_accelerate_available,
    is_flash_attn_2_available,
    logging,
)
from ..auto import AutoModel
@@ -49,6 +51,11 @@ from .generation_configuration_bark import (
)


if is_flash_attn_2_available():
    from flash_attn import flash_attn_func, flash_attn_varlen_func
    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa


logger = logging.get_logger(__name__)
@@ -62,6 +69,19 @@ BARK_PRETRAINED_MODEL_ARCHIVE_LIST = [
]


# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )


class BarkSelfAttention(nn.Module):
    # adapted from GPTNeoSelfAttention and Bark code
    # BarkSelfAttention can have two attention type, i.e full attention or causal attention
@@ -187,6 +207,177 @@ class BarkSelfAttention(nn.Module):
        return outputs


class BarkSelfFlashAttention2(BarkSelfAttention):
    """
    Bark flash attention module. This module inherits from `BarkSelfAttention` as the weights of the module stays
    untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
    flash attention and deal with padding tokens in case the input contains any of them.
    """

    def _split_heads(self, tensor, num_heads, attn_head_size):
        """
        Splits hidden_size dim into attn_head_size and num_heads
        """
        new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
        tensor = tensor.view(new_shape)
        # Flash attention requires the input to have the shape
        # batch_size x seq_length x head_dim x hidden_dim - (batch, seq_length, head, head_features)
        return tensor

    def _merge_heads(self, tensor, num_heads, attn_head_size):
        """
        Merges attn_head_size dim and num_attn_heads dim into hidden_size
        """
        # re-assemble all head outputs side by side
        # (batch, seq_len, num_heads, attn_head_size) -> (batch, seq_len, num_heads*attn_head_size)
        tensor = tensor.view(tensor.size()[:-2] + (num_heads * attn_head_size,))
        return tensor

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        past_key_values=None,
        head_mask=None,
        use_cache=False,
        output_attentions=False,
    ):
        batch_size, query_len, _ = hidden_states.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        query, key, value = self.att_proj(hidden_states).split(self.embed_dim, dim=2)

        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        if past_key_values is not None:
            # (batch, head, seq_length, head_features) -> (batch, seq_length, head, head_features)
            past_key = past_key_values[0].transpose(1, 2)
            past_value = past_key_values[1].transpose(1, 2)
            # and merge on seq_length
            key = torch.cat((past_key, key), dim=1)
            value = torch.cat((past_value, value), dim=1)

        if use_cache is True:
            # (batch, head, seq_length, head_features)
            present = (key.transpose(1, 2), value.transpose(1, 2))
        else:
            present = None

        attn_output = self._flash_attention_forward(query, key, value, attention_mask, query_len, dropout=self.dropout)

        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = self.out_proj(attn_output)
        attn_output = self.resid_dropout(attn_output)

        outputs = (attn_output, present)
        if output_attentions:
            attn_weights = None
            outputs += (attn_weights,)

        return outputs

    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward
    def _flash_attention_forward(
        self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
    ):
        """
        Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
        first unpad the input, then computes the attention scores and pad the final attention scores.

        Args:
            query_states (`torch.Tensor`):
                Input query states to be passed to Flash Attention API
            key_states (`torch.Tensor`):
                Input key states to be passed to Flash Attention API
            value_states (`torch.Tensor`):
                Input value states to be passed to Flash Attention API
            attention_mask (`torch.Tensor`):
                The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
                position of padding tokens and 1 for the position of non-padding tokens.
            dropout (`int`, *optional*):
                Attention dropout
            softmax_scale (`float`, *optional*):
                The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
        """
        # Contains at least one padding token in the sequence
        if attention_mask is not None:
            batch_size = query_states.shape[0]
            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
                query_states, key_states, value_states, attention_mask, query_length
            )

            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens

            attn_output_unpad = flash_attn_varlen_func(
                query_states,
                key_states,
                value_states,
                cu_seqlens_q=cu_seqlens_q,
                cu_seqlens_k=cu_seqlens_k,
                max_seqlen_q=max_seqlen_in_batch_q,
                max_seqlen_k=max_seqlen_in_batch_k,
                dropout_p=dropout,
                softmax_scale=softmax_scale,
                causal=self.is_causal,
            )

            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
        else:
            attn_output = flash_attn_func(
                query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=self.is_causal
            )

        return attn_output

    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input
    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape

        key_layer = index_first_axis(
            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        value_layer = index_first_axis(
            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        if query_length == kv_seq_len:
            query_layer = index_first_axis(
                query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
            )
            cu_seqlens_q = cu_seqlens_k
            max_seqlen_in_batch_q = max_seqlen_in_batch_k
            indices_q = indices_k
        elif query_length == 1:
            max_seqlen_in_batch_q = 1
            cu_seqlens_q = torch.arange(
                batch_size + 1, dtype=torch.int32, device=query_layer.device
            )  # There is a memcpy here, that is very bad.
            indices_q = cu_seqlens_q[:-1]
            query_layer = query_layer.squeeze(1)
        else:
            # The -q_len: slice assumes left padding.
            attention_mask = attention_mask[:, -query_length:]
            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)

        return (
            query_layer,
            key_layer,
            value_layer,
            indices_q,
            (cu_seqlens_q, cu_seqlens_k),
            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
        )


BARK_ATTENTION_CLASSES = {
    "default": BarkSelfAttention,
    "flash_attention_2": BarkSelfFlashAttention2,
}


class BarkLayerNorm(nn.Module):
    """LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False."""

@@ -229,7 +420,8 @@ class BarkBlock(nn.Module):
            self.layernorm_1 = nn.LayerNorm(config.hidden_size)
            self.layernorm_2 = nn.LayerNorm(config.hidden_size)

        self.attn = BarkSelfAttention(config, is_causal=is_causal)
        attn_type = "flash_attention_2" if getattr(config, "_flash_attn_2_enabled", False) else "default"
        self.attn = BARK_ATTENTION_CLASSES[attn_type](config, is_causal=is_causal)

        self.mlp = BarkMLP(config)

@@ -277,6 +469,7 @@ class BarkPreTrainedModel(PreTrainedModel):

    config_class = BarkConfig
    supports_gradient_checkpointing = False
    _supports_flash_attn_2 = True

    def _init_weights(self, module):
        """Initialize the weights."""
@@ -596,21 +789,13 @@ class BarkCausalModel(BarkPreTrainedModel):
        if attention_mask is not None:
            if batch_size <= 0:
                raise ValueError("batch_size has to be defined and > 0")
            attention_mask = attention_mask.view(batch_size, -1)
            # We create a 3D attention mask from a 2D tensor mask.
            # Sizes are [batch_size, 1, 1, to_seq_length]
            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
            # this attention mask is more simple than the triangular masking of causal attention
            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
            attention_mask = attention_mask[:, None, None, :]

            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
            # masked positions, this operation will create a tensor which is 0.0 for
            # positions we want to attend and the dtype's smallest value for masked positions.
            # Since we are adding it to the raw scores before the softmax, this is
            # effectively the same as removing these entirely.
            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility
            attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
            if getattr(self.config, "_flash_attn_2_enabled", False):
                attention_mask = attention_mask if 0 in attention_mask else None
            else:
                attention_mask = attention_mask.view(batch_size, -1)
                # [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length]
                # from_seq_length is 1 to easily broadcast
                attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1)

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
@@ -1233,10 +1418,12 @@ class BarkFineModel(BarkPreTrainedModel):
        if attention_mask is not None:
            if batch_size <= 0:
                raise ValueError("batch_size has to be defined and > 0")
            attention_mask = attention_mask.view(batch_size, -1)
            attention_mask = attention_mask[:, None, None, :]
            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility
            attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
            if getattr(self.config, "_flash_attn_2_enabled", False):
                attention_mask = attention_mask if 0 in attention_mask else None
            else:
                # [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length]
                # from_seq_length is 1 to easily broadcast
                attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1)

        head_mask = self.get_head_mask(head_mask, self.config.num_layers)

@@ -1669,3 +1856,32 @@ class BarkModel(BarkPreTrainedModel):
            return audio, output_lengths

        return audio

    @classmethod
    def _check_and_enable_flash_attn_2(
        cls, config, torch_dtype: Optional[torch.dtype] = None, device_map: Optional[Union[str, Dict[str, int]]] = None
    ):
        """
        `_check_and_enable_flash_attn_2` originally don't expand flash attention enabling to the model
        sub-configurations. We override the original method to make sure that Bark sub-models are using Flash Attention
        if necessary.

        If you don't know about Flash Attention, check out the official repository of flash attention:
        https://github.com/Dao-AILab/flash-attention

        For using Flash Attention 1.0 you can do it directly via the `BetterTransformer` API, have a look at this
        specific section of the documentation to learn more about it:
        https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#decoder-models

        The method checks if the current setup is compatible with Flash Attention as it requires the model to be in
        half precision and not ran on CPU.

        If all checks pass, the method will create an attribute in the config `_flash_attn_2_enabled` so that the model
        can initialize the correct attention module
        """
        config = super()._check_and_enable_flash_attn_2(config, torch_dtype, device_map)

        config.semantic_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
        config.coarse_acoustics_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
        config.fine_acoustics_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
        return config
@@ -149,9 +149,9 @@ class CodeLlamaTokenizer(PreTrainedTokenizer):
    ):
        requires_backends(self, "protobuf")
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token

        self.use_default_system_prompt = use_default_system_prompt
        # mark tokens special to skip them
@@ -447,9 +447,6 @@ class LlamaTokenizer(PreTrainedTokenizer):
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if loop_messages|length == 0 and system_message %}"  # Special handling when only sys message present
            "{{ bos_token + '[INST] <<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n [/INST]' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
@@ -243,9 +243,6 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if loop_messages|length == 0 and system_message %}"  # Special handling when only sys message present
            "{{ bos_token + '[INST] <<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n [/INST]' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
@@ -48,7 +48,7 @@ from transformers.testing_utils import (
    slow,
)
from transformers.trainer_utils import get_last_checkpoint, set_seed
from transformers.utils import WEIGHTS_NAME, is_torch_bf16_gpu_available
from transformers.utils import SAFE_WEIGHTS_NAME, is_torch_bf16_gpu_available


if is_torch_available():
@@ -565,8 +565,7 @@ class TrainerIntegrationDeepSpeed(TrainerIntegrationDeepSpeedWithCustomConfig, T

    def check_saved_checkpoints_deepspeed(self, output_dir, freq, total, stage, dtype):
        # adapted from TrainerIntegrationCommon.check_saved_checkpoints

        file_list = [WEIGHTS_NAME, "training_args.bin", "trainer_state.json", "config.json"]
        file_list = [SAFE_WEIGHTS_NAME, "training_args.bin", "trainer_state.json", "config.json"]

        if stage == ZERO2:
            ds_file_list = ["mp_rank_00_model_states.pt"]
@@ -581,7 +580,6 @@ class TrainerIntegrationDeepSpeed(TrainerIntegrationDeepSpeedWithCustomConfig, T
        for step in range(freq, total, freq):
            checkpoint = os.path.join(output_dir, f"checkpoint-{step}")
            self.assertTrue(os.path.isdir(checkpoint), f"[{stage}] {checkpoint} dir is not found")

            # common files
            for filename in file_list:
                path = os.path.join(checkpoint, filename)
@@ -20,6 +20,8 @@ import inspect
import tempfile
import unittest

from pytest import mark

from transformers import (
    BarkCoarseConfig,
    BarkConfig,
@@ -33,6 +35,7 @@ from transformers.models.bark.generation_configuration_bark import (
    BarkSemanticGenerationConfig,
)
from transformers.testing_utils import (
    require_flash_attn,
    require_torch,
    require_torch_fp16,
    require_torch_gpu,
@@ -872,6 +875,122 @@ class BarkFineModelTest(ModelTesterMixin, unittest.TestCase):
        # Check that the model can still do a forward pass successfully (every parameter should be resized)
        model(**self._prepare_for_class(inputs_dict, model_class))

    @require_flash_attn
    @require_torch_gpu
    @mark.flash_attn_test
    @slow
    def test_flash_attn_2_inference(self):
        for model_class in self.all_model_classes:
            if not model_class._supports_flash_attn_2:
                return

            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model_fa = model_class.from_pretrained(
                    tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=True
                )
                model_fa.to(torch_device)

                model = model_class.from_pretrained(
                    tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=False
                )
                model.to(torch_device)

                dummy_input = inputs_dict["input_ids"][:1]
                if dummy_input.dtype in [torch.float32, torch.float16]:
                    dummy_input = dummy_input.to(torch.bfloat16)

                dummy_attention_mask = inputs_dict.get("attention_mask", None)

                if dummy_attention_mask is not None:
                    dummy_attention_mask = dummy_attention_mask[:1]
                    dummy_attention_mask[:, 1:] = 1
                    dummy_attention_mask[:, :1] = 0

                outputs = model(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
                outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)

                logits = outputs.hidden_states[-1]
                logits_fa = outputs_fa.hidden_states[-1]

                assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)

                other_inputs = {"output_hidden_states": True}
                if dummy_attention_mask is not None:
                    other_inputs["attention_mask"] = dummy_attention_mask

                outputs = model(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
                outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)

                logits = outputs.hidden_states[-1]
                logits_fa = outputs_fa.hidden_states[-1]

                assert torch.allclose(logits_fa[1:], logits[1:], atol=4e-2, rtol=4e-2)

                # check with inference + dropout
                model.train()
                _ = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)

    @require_flash_attn
    @require_torch_gpu
    @mark.flash_attn_test
    @slow
    def test_flash_attn_2_inference_padding_right(self):
        for model_class in self.all_model_classes:
            if not model_class._supports_flash_attn_2:
                return

            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model_fa = model_class.from_pretrained(
                    tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=True
                )
                model_fa.to(torch_device)

                model = model_class.from_pretrained(
                    tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=False
                )
                model.to(torch_device)

                dummy_input = inputs_dict["input_ids"][:1]
                if dummy_input.dtype in [torch.float32, torch.float16]:
                    dummy_input = dummy_input.to(torch.bfloat16)

                dummy_attention_mask = inputs_dict.get("attention_mask", None)

                if dummy_attention_mask is not None:
                    dummy_attention_mask = dummy_attention_mask[:1]
                    dummy_attention_mask[:, :-1] = 1
                    dummy_attention_mask[:, -1:] = 0

                outputs = model(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
                outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)

                logits = outputs.hidden_states[-1]
                logits_fa = outputs_fa.hidden_states[-1]

                assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)

                other_inputs = {
                    "output_hidden_states": True,
                }
                if dummy_attention_mask is not None:
                    other_inputs["attention_mask"] = dummy_attention_mask

                outputs = model(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
                outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)

                logits = outputs.hidden_states[-1]
                logits_fa = outputs_fa.hidden_states[-1]

                assert torch.allclose(logits_fa[:-1], logits[:-1], atol=4e-2, rtol=4e-2)


@require_torch
class BarkModelIntegrationTests(unittest.TestCase):
@@ -150,6 +150,8 @@ class CodeLlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
        self.tokenizers_list = [
            (self.rust_tokenizer_class, "hf-internal-testing/llama-code-tokenizer", {}),
            (self.tokenizer_class, "hf-internal-testing/llama-code-tokenizer", {}),
            (self.tokenizer_class, "codellama/CodeLlama-34b-Instruct-hf", {}),
            (self.rust_tokenizer_class, "codellama/CodeLlama-34b-Instruct-hf", {}),
        ]
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
@@ -519,7 +519,7 @@ class IBertModelIntegrationTest(unittest.TestCase):
        gelu_q = IntGELU(quant_mode=True)
        gelu_dq = nn.GELU()

        x_int = torch.range(-10000, 10000, 1)
        x_int = torch.arange(-10000, 10001, 1)
        x_scaling_factor = torch.tensor(0.001)
        x = x_int * x_scaling_factor
@@ -534,7 +534,7 @@ class IBertModelIntegrationTest(unittest.TestCase):
        self.assertTrue(torch.allclose(q_int, q_int.round(), atol=1e-4))

    def test_force_dequant_gelu(self):
        x_int = torch.range(-10000, 10000, 1)
        x_int = torch.arange(-10000, 10001, 1)
        x_scaling_factor = torch.tensor(0.001)
        x = x_int * x_scaling_factor
@@ -565,7 +565,6 @@ class IBertModelIntegrationTest(unittest.TestCase):
        softmax_q = IntSoftmax(output_bit, quant_mode=True)
        softmax_dq = nn.Softmax()

        # x_int = torch.range(-10000, 10000, 1)
        def _test(array):
            x_int = torch.tensor(array)
            x_scaling_factor = torch.tensor(0.1)
@@ -81,6 +81,8 @@ def get_test_classes(test_file):
    test_module = get_test_module(test_file)
    for attr in dir(test_module):
        attr_value = getattr(test_module, attr)
        if not (isinstance(attr_value, type) and "ModelTesterMixin" in [x.__name__ for x in attr_value.__bases__]):
            continue
        # (TF/Flax)ModelTesterMixin is also an attribute in specific model test module. Let's exclude them by checking
        # `all_model_classes` is not empty (which also excludes other special classes).
        model_classes = getattr(attr_value, "all_model_classes", [])