Compare commits

..

7 Commits

Author SHA1 Message Date
e68eefcc2f fix 2023-11-09 11:19:13 +01:00
f84c122c04 fix 2023-11-09 10:27:04 +01:00
085ea7e56c [CodeLlamaTokenizer] Nit, update __init__ to make sure the AddedTokens are not normalized because they are special (#27359)
* make sure tokens are properly initialized for codellama slow

* add more pretrained models

* style

* test more tokenizers checkpoints
2023-11-09 10:15:10 +01:00
7ecd229ba4 Smangrul/fix failing ds ci tests (#27358)
* fix failing DeepSpeed CI tests due to `safetensors` being default

* debug

* remove debug statements

* resolve comments

* Update test_deepspeed.py
2023-11-09 11:47:24 +05:30
ced9fd86f5 translate debugging.md to chinese (#27374)
* update

* update
2023-11-08 14:04:06 -08:00
0e402e1478 Update deprecated torch.range in test_modeling_ibert.py (#27355)
* Update deprecated torch.range

* Remove comment
2023-11-08 20:58:36 +01:00
a5bee89c9d Add Flash Attention 2 support to Bark (#27364)
* change handmade attention mask to _prepare_4d_attention_mask

* add flashattention2 support in Bark

* add flashattention2 tests on BarkSemanticModel

* make style

* fix flashattention and tests + make style

* fix memory leak and allow Bark to pass flash attention to sub-models

* make style

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* remove unnecessary code from tests + justify overriding

* Update tests/models/bark/test_modeling_bark.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* make style

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-11-08 17:06:35 +00:00
12 changed files with 677 additions and 37 deletions

View File

@@ -3,7 +3,7 @@ name: Check Tiny Models
on:
push:
branches:
- check_tiny_models*
- run_tiny_with_fix_tiny_model_creation
repository_dispatch:
schedule:
- cron: "0 2 * * *"

View File

@@ -47,6 +47,8 @@
title: 综述
- local: big_models
title: 实例化大模型
- local: debugging
title: 问题定位及解决
title: 性能和可扩展性
- sections:
- local: task_summary

308
docs/source/zh/debugging.md Normal file
View File

@@ -0,0 +1,308 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Debugging
## Multi-GPU Network Issues Debug
When training or inferencing with `DistributedDataParallel` and multiple GPUs, if you run into problems of intercommunication between processes and/or nodes, you can use the following script to diagnose network issues.
```bash
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
```
For example, to test how 2 GPUs interact, run:
```bash
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```
If both processes can talk to each other and allocate GPU memory, each one will print an "OK" status.
For more GPUs or nodes, adjust the arguments in the script.
You will find a lot more details inside the diagnostics script, and even a recipe for how to run it in a SLURM environment.
An additional level of debug is to add the `NCCL_DEBUG=INFO` environment variable as follows:
```bash
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```
This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported. Or, if you're not sure how to interpret the output, you can share the log file in an Issue.
## Underflow and Overflow Detection
<Tip>
This feature is currently available for PyTorch only.
</Tip>
<Tip>
For multi-GPU training it requires DDP (`torch.distributed.launch`).
</Tip>
<Tip>
This feature can be used with any `nn.Module`-based model.
</Tip>
If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in activations or weights, one needs to discover where the first underflow or overflow happens and what led to it. Luckily, you can accomplish that easily by activating a special module that will do the detection automatically.
If you're using [`Trainer`], you just need to add:
```bash
--debug underflow_overflow
```
to the normal command line arguments, or pass `debug="underflow_overflow"` when creating the [`TrainingArguments`] object.
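For instance, a minimal sketch of enabling the detector through [`TrainingArguments`] (the `model` and `train_dataset` objects below are placeholders for whatever you are training):
```python
from transformers import Trainer, TrainingArguments

# debug="underflow_overflow" attaches the DebugUnderflowOverflow hooks to the
# model managed by the Trainer; everything else is the usual training setup.
args = TrainingArguments(output_dir="out", debug="underflow_overflow")
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```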
If you're using your own training loop or another Trainer, you can accomplish the same with:
```python
from transformers.debug_utils import DebugUnderflowOverflow
debug_overflow = DebugUnderflowOverflow(model)
```
[`debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each forward call, test the input and output variables and also the weights of the corresponding module. As soon as `inf` or `nan` is detected in at least one element of the activations or weights, the program will assert and print a report like this (this one was caught with `google/mt5-small` under fp16 mixed precision):
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
None output[1]
2.25e-01 1.00e+04 output[2]
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The example output has been trimmed in the middle for brevity.
The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames, the inputs and outputs were in the range of `1e4`. So when this training was done under fp16 mixed precision, the very last step overflowed (since under `fp16` the largest number before `inf` is `64e3`). To avoid overflows under `fp16`, the activations must stay way below `1e4`, because `1e4 * 1e4 = 1e8`, so any matrix multiplication with large activations is going to lead to a numerical overflow condition.
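To make the `fp16` ceiling concrete, here is a small standalone check, independent of any model:
```python
import torch

# fp16 can represent values only up to 65504; anything beyond that saturates to inf.
print(torch.finfo(torch.float16).max)  # 65504.0

x = torch.tensor(1e4, dtype=torch.float16)
print(x * x)                  # tensor(inf, dtype=torch.float16): 1e8 overflows fp16
print(x.float() * x.float())  # finite in fp32
```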
At the very start of the trace you can discover at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).
Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting for. If we look just at this frame:
```
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
```
Here, `encoder.block.2.layer.1.layer_norm` indicates that it was a layer norm for the first layer of the second block of the encoder, and the specific `forward` call was `T5LayerNorm`.
Let's look at the last few frames of that report:
```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
[...]
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
```
The last frame reports for the `Dropout.forward` function, with the first entry for the only input and the second for the only output. You can see that it was called from an attribute `dropout` inside the `DenseReluDense` class. We can see that it happened during the first layer of the 2nd block, during the very first batch. Finally, the absolute largest input element was `6.27e+04` and the output was `inf`.
You can see here that `T5DenseGatedGeluDense.forward` resulted in output activations whose absolute max value was around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout`, which renormalizes the weights after zeroing some of the elements, pushing the absolute max value above 64K and causing an overflow (`inf`).
As you can see, it is the previous frames that we need to look into, where the numbers start getting very large for fp16.
Let's match the report to the code from `models/t5/modeling_t5.py`:
```python
class T5DenseGatedGeluDense(nn.Module):
def __init__(self, config):
super().__init__()
self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
self.dropout = nn.Dropout(config.dropout_rate)
self.gelu_act = ACT2FN["gelu_new"]
def forward(self, hidden_states):
hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
hidden_linear = self.wi_1(hidden_states)
hidden_states = hidden_gelu * hidden_linear
hidden_states = self.dropout(hidden_states)
hidden_states = self.wo(hidden_states)
return hidden_states
```
Now it's easy to see the `dropout` call, and all the previous calls as well.
Since the detection is happening in a forward hook, these reports are printed immediately after each `forward` returns.
Going back to the full report, to act on it and fix the problem we need to go a few frames up, to where the numbers started to grow, and most likely switch to `fp32` mode there so that the numbers don't overflow when multiplied or summed up. Of course, there might be other solutions. For example, we could turn off `amp` temporarily if it's enabled, after moving the original `forward` into a helper wrapper, like so:
```python
def _forward(self, hidden_states):
hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
hidden_linear = self.wi_1(hidden_states)
hidden_states = hidden_gelu * hidden_linear
hidden_states = self.dropout(hidden_states)
hidden_states = self.wo(hidden_states)
return hidden_states
import torch
def forward(self, hidden_states):
if torch.is_autocast_enabled():
with torch.cuda.amp.autocast(enabled=False):
return self._forward(hidden_states)
else:
return self._forward(hidden_states)
```
Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look you may also want to analyse the intermediary stages of a specific `forward` function. In such a case you can use the `detect_overflow` helper function to inject the detector where you want it, for example:
```python
from debug_utils import detect_overflow
class T5LayerFF(nn.Module):
[...]
def forward(self, hidden_states):
forwarded_states = self.layer_norm(hidden_states)
detect_overflow(forwarded_states, "after layer_norm")
forwarded_states = self.DenseReluDense(forwarded_states)
detect_overflow(forwarded_states, "after DenseReluDense")
return hidden_states + self.dropout(forwarded_states)
```
You can see that we added 2 of these, and now we track whether `inf` or `nan` was detected for `forwarded_states` somewhere in between.
Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but if you had some local direct calculations, this is how you would do it.
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from its default, e.g.:
```python
from transformers.debug_utils import DebugUnderflowOverflow
debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```
### Specific batch absolute min and max value tracing
The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
Say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given batch, and only do that for batches 1 and 3. Then you instantiate this class as:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
```
And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector uses.
Batches are 0-indexed.
This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward right to that area. Here is a sample truncated output for such a configuration:
```
*** Starting batch number=1 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
decoder T5Stack
not a tensor output
lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
T5ForConditionalGeneration
not a tensor output
*** Starting batch number=3 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```
Here you will get a huge number of frames dumped, as many as there were forward calls in your model, so it may or may not be what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example, if a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the numbers started to diverge.
You can also specify the batch number after which to stop the training, with:
```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
```

View File

@@ -26,12 +26,14 @@ from ...generation.logits_process import (
BarkEosPrioritizerLogitsProcessor,
SuppressTokensLogitsProcessor,
)
from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
from ...modeling_outputs import CausalLMOutputWithPast, MaskedLMOutput
from ...modeling_utils import PreTrainedModel, get_parameter_device
from ...utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_accelerate_available,
is_flash_attn_2_available,
logging,
)
from ..auto import AutoModel
@@ -49,6 +51,11 @@ from .generation_configuration_bark import (
)
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
logger = logging.get_logger(__name__)
@@ -62,6 +69,19 @@ BARK_PRETRAINED_MODEL_ARCHIVE_LIST = [
]
# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
class BarkSelfAttention(nn.Module):
# adapted from GPTNeoSelfAttention and Bark code
# BarkSelfAttention can have two attention type, i.e full attention or causal attention
@@ -187,6 +207,177 @@ class BarkSelfAttention(nn.Module):
return outputs
class BarkSelfFlashAttention2(BarkSelfAttention):
"""
Bark flash attention module. This module inherits from `BarkSelfAttention` as the weights of the module stays
untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
flash attention and deal with padding tokens in case the input contains any of them.
"""
def _split_heads(self, tensor, num_heads, attn_head_size):
"""
Splits hidden_size dim into attn_head_size and num_heads
"""
new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
tensor = tensor.view(new_shape)
# Flash attention requires the input to have the shape
# batch_size x seq_length x head_dim x hidden_dim - (batch, seq_length, head, head_features)
return tensor
def _merge_heads(self, tensor, num_heads, attn_head_size):
"""
Merges attn_head_size dim and num_attn_heads dim into hidden_size
"""
# re-assemble all head outputs side by side
# (batch, seq_len, num_heads, attn_head_size) -> (batch, seq_len, num_heads*attn_head_size)
tensor = tensor.view(tensor.size()[:-2] + (num_heads * attn_head_size,))
return tensor
def forward(
self,
hidden_states,
attention_mask=None,
past_key_values=None,
head_mask=None,
use_cache=False,
output_attentions=False,
):
batch_size, query_len, _ = hidden_states.size()
# calculate query, key, values for all heads in batch and move head forward to be the batch dim
query, key, value = self.att_proj(hidden_states).split(self.embed_dim, dim=2)
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)
if past_key_values is not None:
# (batch, head, seq_length, head_features) -> (batch, seq_length, head, head_features)
past_key = past_key_values[0].transpose(1, 2)
past_value = past_key_values[1].transpose(1, 2)
# and merge on seq_length
key = torch.cat((past_key, key), dim=1)
value = torch.cat((past_value, value), dim=1)
if use_cache is True:
# (batch, head, seq_length, head_features)
present = (key.transpose(1, 2), value.transpose(1, 2))
else:
present = None
attn_output = self._flash_attention_forward(query, key, value, attention_mask, query_len, dropout=self.dropout)
attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
attn_output = self.out_proj(attn_output)
attn_output = self.resid_dropout(attn_output)
outputs = (attn_output, present)
if output_attentions:
attn_weights = None
outputs += (attn_weights,)
return outputs
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward
def _flash_attention_forward(
self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
):
"""
Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
first unpad the input, then computes the attention scores and pad the final attention scores.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
dropout (`int`, *optional*):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
"""
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask, query_length
)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=self.is_causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
attn_output = flash_attn_func(
query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=self.is_causal
)
return attn_output
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
BARK_ATTENTION_CLASSES = {
"default": BarkSelfAttention,
"flash_attention_2": BarkSelfFlashAttention2,
}
class BarkLayerNorm(nn.Module):
"""LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False."""
@@ -229,7 +420,8 @@ class BarkBlock(nn.Module):
self.layernorm_1 = nn.LayerNorm(config.hidden_size)
self.layernorm_2 = nn.LayerNorm(config.hidden_size)
self.attn = BarkSelfAttention(config, is_causal=is_causal)
attn_type = "flash_attention_2" if getattr(config, "_flash_attn_2_enabled", False) else "default"
self.attn = BARK_ATTENTION_CLASSES[attn_type](config, is_causal=is_causal)
self.mlp = BarkMLP(config)
@@ -277,6 +469,7 @@ class BarkPreTrainedModel(PreTrainedModel):
config_class = BarkConfig
supports_gradient_checkpointing = False
_supports_flash_attn_2 = True
def _init_weights(self, module):
"""Initialize the weights."""
@@ -596,21 +789,13 @@ class BarkCausalModel(BarkPreTrainedModel):
if attention_mask is not None:
if batch_size <= 0:
raise ValueError("batch_size has to be defined and > 0")
attention_mask = attention_mask.view(batch_size, -1)
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
# this attention mask is more simple than the triangular masking of causal attention
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
attention_mask = attention_mask[:, None, None, :]
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and the dtype's smallest value for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility
attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
if getattr(self.config, "_flash_attn_2_enabled", False):
attention_mask = attention_mask if 0 in attention_mask else None
else:
attention_mask = attention_mask.view(batch_size, -1)
# [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length]
# from_seq_length is 1 to easily broadcast
attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1)
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
@@ -1233,10 +1418,12 @@ class BarkFineModel(BarkPreTrainedModel):
if attention_mask is not None:
if batch_size <= 0:
raise ValueError("batch_size has to be defined and > 0")
attention_mask = attention_mask.view(batch_size, -1)
attention_mask = attention_mask[:, None, None, :]
attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility
attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
if getattr(self.config, "_flash_attn_2_enabled", False):
attention_mask = attention_mask if 0 in attention_mask else None
else:
# [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length]
# from_seq_length is 1 to easily broadcast
attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1)
head_mask = self.get_head_mask(head_mask, self.config.num_layers)
@@ -1669,3 +1856,32 @@ class BarkModel(BarkPreTrainedModel):
return audio, output_lengths
return audio
@classmethod
def _check_and_enable_flash_attn_2(
cls, config, torch_dtype: Optional[torch.dtype] = None, device_map: Optional[Union[str, Dict[str, int]]] = None
):
"""
`_check_and_enable_flash_attn_2` originally don't expand flash attention enabling to the model
sub-configurations. We override the original method to make sure that Bark sub-models are using Flash Attention
if necessary.
If you don't know about Flash Attention, check out the official repository of flash attention:
https://github.com/Dao-AILab/flash-attention
For using Flash Attention 1.0 you can do it directly via the `BetterTransformer` API, have a look at this
specific section of the documentation to learn more about it:
https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#decoder-models
The method checks if the current setup is compatible with Flash Attention as it requires the model to be in
half precision and not ran on CPU.
If all checks pass, the method will create an attribute in the config `_flash_attn_2_enabled` so that the model
can initialize the correct attention module
"""
config = super()._check_and_enable_flash_attn_2(config, torch_dtype, device_map)
config.semantic_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
config.coarse_acoustics_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
config.fine_acoustics_config._flash_attn_2_enabled = getattr(config, "_flash_attn_2_enabled", False)
return config
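To put this diff in context, here is a hedged usage sketch of loading Bark with the new attention path; the checkpoint name and prompt are illustrative, and `use_flash_attention_2=True` assumes a CUDA GPU with the `flash-attn` package installed plus a half-precision dtype:
```python
import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained(
    "suno/bark-small",
    torch_dtype=torch.float16,
    use_flash_attention_2=True,  # routes BarkSelfAttention to BarkSelfFlashAttention2
).to("cuda")

inputs = processor("Hello, my dog is cooler than you!").to("cuda")
speech = model.generate(**inputs)
```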

View File

@@ -149,9 +149,9 @@ class CodeLlamaTokenizer(PreTrainedTokenizer):
):
requires_backends(self, "protobuf")
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token
self.use_default_system_prompt = use_default_system_prompt
# mark tokens special to skip them
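For illustration, the effect of the two changed keywords on the `AddedToken` side (a minimal sketch; `AddedToken` is re-exported by `transformers` from the `tokenizers` library):
```python
from transformers import AddedToken

# normalized=False means the token text is matched against the raw input rather
# than the normalized input, and special=True marks it as a control token that
# decoding can skip; both are what you want for "<s>" / "</s>" / "<unk>".
bos_token = AddedToken("<s>", normalized=False, special=True)
eos_token = AddedToken("</s>", normalized=False, special=True)
print(bos_token.content, bos_token.normalized, bos_token.special)
```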

View File

@@ -447,9 +447,6 @@ class LlamaTokenizer(PreTrainedTokenizer):
"{% set loop_messages = messages %}"
"{% set system_message = false %}"
"{% endif %}"
"{% if loop_messages|length == 0 and system_message %}" # Special handling when only sys message present
"{{ bos_token + '[INST] <<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n [/INST]' }}"
"{% endif %}"
"{% for message in loop_messages %}" # Loop over all non-system messages
"{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
"{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"

View File

@@ -243,9 +243,6 @@ class LlamaTokenizerFast(PreTrainedTokenizerFast):
"{% set loop_messages = messages %}"
"{% set system_message = false %}"
"{% endif %}"
"{% if loop_messages|length == 0 and system_message %}" # Special handling when only sys message present
"{{ bos_token + '[INST] <<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n [/INST]' }}"
"{% endif %}"
"{% for message in loop_messages %}" # Loop over all non-system messages
"{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
"{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"

View File

@@ -48,7 +48,7 @@ from transformers.testing_utils import (
slow,
)
from transformers.trainer_utils import get_last_checkpoint, set_seed
from transformers.utils import WEIGHTS_NAME, is_torch_bf16_gpu_available
from transformers.utils import SAFE_WEIGHTS_NAME, is_torch_bf16_gpu_available
if is_torch_available():
@@ -565,8 +565,7 @@ class TrainerIntegrationDeepSpeed(TrainerIntegrationDeepSpeedWithCustomConfig, T
def check_saved_checkpoints_deepspeed(self, output_dir, freq, total, stage, dtype):
# adapted from TrainerIntegrationCommon.check_saved_checkpoints
file_list = [WEIGHTS_NAME, "training_args.bin", "trainer_state.json", "config.json"]
file_list = [SAFE_WEIGHTS_NAME, "training_args.bin", "trainer_state.json", "config.json"]
if stage == ZERO2:
ds_file_list = ["mp_rank_00_model_states.pt"]
@@ -581,7 +580,6 @@ class TrainerIntegrationDeepSpeed(TrainerIntegrationDeepSpeedWithCustomConfig, T
for step in range(freq, total, freq):
checkpoint = os.path.join(output_dir, f"checkpoint-{step}")
self.assertTrue(os.path.isdir(checkpoint), f"[{stage}] {checkpoint} dir is not found")
# common files
for filename in file_list:
path = os.path.join(checkpoint, filename)
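The switch above follows from `safetensors` now being the default `Trainer` serialization format, so checkpoints contain `model.safetensors` (`SAFE_WEIGHTS_NAME`) instead of `pytorch_model.bin` (`WEIGHTS_NAME`). A hedged sketch of opting back out, should a test ever need the old file name:
```python
from transformers import TrainingArguments

# save_safetensors defaults to True in recent releases; setting it to False
# makes Trainer write pytorch_model.bin (WEIGHTS_NAME) into each checkpoint again.
args = TrainingArguments(output_dir="out", save_safetensors=False)
```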

View File

@@ -20,6 +20,8 @@ import inspect
import tempfile
import unittest
from pytest import mark
from transformers import (
BarkCoarseConfig,
BarkConfig,
@@ -33,6 +35,7 @@ from transformers.models.bark.generation_configuration_bark import (
BarkSemanticGenerationConfig,
)
from transformers.testing_utils import (
require_flash_attn,
require_torch,
require_torch_fp16,
require_torch_gpu,
@@ -872,6 +875,122 @@ class BarkFineModelTest(ModelTesterMixin, unittest.TestCase):
# Check that the model can still do a forward pass successfully (every parameter should be resized)
model(**self._prepare_for_class(inputs_dict, model_class))
@require_flash_attn
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_inference(self):
for model_class in self.all_model_classes:
if not model_class._supports_flash_attn_2:
return
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model_fa = model_class.from_pretrained(
tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=True
)
model_fa.to(torch_device)
model = model_class.from_pretrained(
tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=False
)
model.to(torch_device)
dummy_input = inputs_dict["input_ids"][:1]
if dummy_input.dtype in [torch.float32, torch.float16]:
dummy_input = dummy_input.to(torch.bfloat16)
dummy_attention_mask = inputs_dict.get("attention_mask", None)
if dummy_attention_mask is not None:
dummy_attention_mask = dummy_attention_mask[:1]
dummy_attention_mask[:, 1:] = 1
dummy_attention_mask[:, :1] = 0
outputs = model(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
logits = outputs.hidden_states[-1]
logits_fa = outputs_fa.hidden_states[-1]
assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)
other_inputs = {"output_hidden_states": True}
if dummy_attention_mask is not None:
other_inputs["attention_mask"] = dummy_attention_mask
outputs = model(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
logits = outputs.hidden_states[-1]
logits_fa = outputs_fa.hidden_states[-1]
assert torch.allclose(logits_fa[1:], logits[1:], atol=4e-2, rtol=4e-2)
# check with inference + dropout
model.train()
_ = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
@require_flash_attn
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_inference_padding_right(self):
for model_class in self.all_model_classes:
if not model_class._supports_flash_attn_2:
return
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model_fa = model_class.from_pretrained(
tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=True
)
model_fa.to(torch_device)
model = model_class.from_pretrained(
tmpdirname, torch_dtype=torch.bfloat16, use_flash_attention_2=False
)
model.to(torch_device)
dummy_input = inputs_dict["input_ids"][:1]
if dummy_input.dtype in [torch.float32, torch.float16]:
dummy_input = dummy_input.to(torch.bfloat16)
dummy_attention_mask = inputs_dict.get("attention_mask", None)
if dummy_attention_mask is not None:
dummy_attention_mask = dummy_attention_mask[:1]
dummy_attention_mask[:, :-1] = 1
dummy_attention_mask[:, -1:] = 0
outputs = model(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, output_hidden_states=True)
logits = outputs.hidden_states[-1]
logits_fa = outputs_fa.hidden_states[-1]
assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)
other_inputs = {
"output_hidden_states": True,
}
if dummy_attention_mask is not None:
other_inputs["attention_mask"] = dummy_attention_mask
outputs = model(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
outputs_fa = model_fa(inputs_dict["codebook_idx"], dummy_input, **other_inputs)
logits = outputs.hidden_states[-1]
logits_fa = outputs_fa.hidden_states[-1]
assert torch.allclose(logits_fa[:-1], logits[:-1], atol=4e-2, rtol=4e-2)
@require_torch
class BarkModelIntegrationTests(unittest.TestCase):
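A note on exercising the new tests: they are gated by `@require_flash_attn`, `@require_torch_gpu`, `@mark.flash_attn_test` and `@slow`, so an indicative (not verified) local invocation looks roughly like this:
```python
import os

import pytest

# RUN_SLOW must be set before transformers.testing_utils is imported, otherwise
# the @slow decorator skips everything; a CUDA GPU with flash-attn is also needed.
os.environ["RUN_SLOW"] = "1"
pytest.main(["-m", "flash_attn_test", "tests/models/bark/test_modeling_bark.py"])
```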

View File

@@ -150,6 +150,8 @@ class CodeLlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.tokenizers_list = [
(self.rust_tokenizer_class, "hf-internal-testing/llama-code-tokenizer", {}),
(self.tokenizer_class, "hf-internal-testing/llama-code-tokenizer", {}),
(self.tokenizer_class, "codellama/CodeLlama-34b-Instruct-hf", {}),
(self.rust_tokenizer_class, "codellama/CodeLlama-34b-Instruct-hf", {}),
]
for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):

View File

@@ -519,7 +519,7 @@ class IBertModelIntegrationTest(unittest.TestCase):
gelu_q = IntGELU(quant_mode=True)
gelu_dq = nn.GELU()
x_int = torch.range(-10000, 10000, 1)
x_int = torch.arange(-10000, 10001, 1)
x_scaling_factor = torch.tensor(0.001)
x = x_int * x_scaling_factor
@@ -534,7 +534,7 @@ class IBertModelIntegrationTest(unittest.TestCase):
self.assertTrue(torch.allclose(q_int, q_int.round(), atol=1e-4))
def test_force_dequant_gelu(self):
x_int = torch.range(-10000, 10000, 1)
x_int = torch.arange(-10000, 10001, 1)
x_scaling_factor = torch.tensor(0.001)
x = x_int * x_scaling_factor
@@ -565,7 +565,6 @@ class IBertModelIntegrationTest(unittest.TestCase):
softmax_q = IntSoftmax(output_bit, quant_mode=True)
softmax_dq = nn.Softmax()
# x_int = torch.range(-10000, 10000, 1)
def _test(array):
x_int = torch.tensor(array)
x_scaling_factor = torch.tensor(0.1)
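For context on the substitution: `torch.range` is deprecated and includes its end point, while `torch.arange` excludes it, hence the bump from `10000` to `10001`. A quick standalone check:
```python
import torch

# arange excludes the end point, so 10001 reproduces the 20001 values
# (-10000 .. 10000) that the deprecated torch.range(-10000, 10000, 1) produced.
x = torch.arange(-10000, 10001, 1)
print(x.numel(), x[0].item(), x[-1].item())  # 20001 -10000 10000
# arange with integer arguments yields int64, whereas torch.range returned floats;
# the dtype difference washes out once x is scaled by the float scaling factor.
```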

View File

@@ -81,6 +81,8 @@ def get_test_classes(test_file):
test_module = get_test_module(test_file)
for attr in dir(test_module):
attr_value = getattr(test_module, attr)
if not (isinstance(attr_value, type) and "ModelTesterMixin" in [x.__name__ for x in attr_value.__bases__]):
continue
# (TF/Flax)ModelTesterMixin is also an attribute in specific model test module. Let's exclude them by checking
# `all_model_classes` is not empty (which also excludes other special classes).
model_classes = getattr(attr_value, "all_model_classes", [])
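For context, a self-contained sketch of what the added guard filters for (the class names below are made up; the real check only looks at direct bases literally named `ModelTesterMixin`):
```python
class ModelTesterMixin:
    pass

class BertModelTest(ModelTesterMixin):
    all_model_classes = ("BertModel",)

class BertModelTester:  # helper object, not a test suite
    all_model_classes = ()

def looks_like_test_class(attr_value):
    # same shape as the added guard: keep only classes whose direct bases
    # include a class named ModelTesterMixin
    return isinstance(attr_value, type) and "ModelTesterMixin" in [
        x.__name__ for x in attr_value.__bases__
    ]

print(looks_like_test_class(BertModelTest))    # True
print(looks_like_test_class(BertModelTester))  # False
```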