Compare commits

...

11 Commits

SHA1 Message Date
53b5ba6dbd fix style 2023-01-23 12:21:52 +01:00
47c2b2e6c8 change checkpoint name in tests 2023-01-23 11:56:54 +01:00
b4157707ff remove upcasting tests 2023-01-23 11:56:32 +01:00
7511e64c80 remove pruning tests 2023-01-23 11:56:16 +01:00
fb53ff6ee6 fill info 2023-01-20 19:01:47 +01:00
03b18d9d7b add MQA Readme 2023-01-20 18:43:06 +01:00
072e71de9f undo unnecessary changes 2023-01-20 18:41:24 +01:00
4705801404 fix style 2023-01-20 18:40:06 +01:00
95975eacaa update config 2023-01-20 17:53:50 +01:00
905346778d update modeling file 2023-01-20 17:53:37 +01:00
67a460aefd run template 2023-01-20 17:25:56 +01:00
59 changed files with 3072 additions and 76 deletions

View File

@ -335,6 +335,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

View File

@ -328,6 +328,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

View File

@ -300,6 +300,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया।
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा।
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv.org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।

View File

@ -362,6 +362,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/)
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI から) Ben Wang and Aran Komatsuzaki から公開されたレポジトリー [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/)
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden から) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren から公開された研究論文: [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf)
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (Microsoft から) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu から公開された研究論文: [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234).
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)

View File

@ -277,6 +277,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 의 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 논문과 함께 발표했습니다.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden 에서) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 의 [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) 논문과 함께 발표했습니다.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu 의 [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) 논문과 함께 발표했습니다.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다.

View File

@ -301,6 +301,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。

View File

@ -313,6 +313,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](https://huggingface.co/docs/transformers/main/model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

View File

@ -114,6 +114,7 @@ The documentation is organized into five sections:
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GPT2MQA](model_doc/gpt2mqa)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
1. **[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
@ -292,6 +293,7 @@ Flax), PyTorch, and/or TensorFlow.
| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
| GPT2MQA | ❌ | ❌ | ✅ | ❌ | ❌ |
| Graphormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |

View File

@ -0,0 +1,69 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# GPT2MQA
## Overview
The GPT2MQA model was proposed in [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Ben Allal et al.
It adds Multi-Query Attention (MQA) to the GPT-2 architecture, which reduces the memory footprint of the model, especially at large batch sizes.
The MQA approach was proposed in [Fast Transformer Decoding: One Write-Head is All You Need](https://arxiv.org/abs/1911.02150) by Noam Shazeer.
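As a rough back-of-the-envelope illustration (not part of the PR), multi-query attention keeps a single key/value head that is shared across all query heads, so the per-layer key/value cache at generation time shrinks by a factor of `n_head`:

```python
# Per-layer KV-cache elements at generation time (illustrative numbers only).
batch, seq_len, n_head, head_dim = 256, 2048, 16, 64

mha_cache = 2 * batch * n_head * seq_len * head_dim  # multi-head: one K and one V per head
mqa_cache = 2 * batch * 1 * seq_len * head_dim  # multi-query: K and V shared by all heads

print(mha_cache // mqa_cache)  # 16, i.e. n_head times smaller
```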
The abstract from the paper is the following:
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and evaluate them on the MultiPL-E text-to-code benchmark (Cassano et al., 2022). We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.
Tips:

- The model can be used like any decoder model but can handle extremely large batches thanks to its reduced key/value cache.
This model was contributed by [lvwerra](https://huggingface.co/lvwerra).
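A minimal generation sketch: `bigcode/santacoder` is the checkpoint named in this PR's config archive map, and loading it directly with these classes is an assumption of the sketch.

```python
from transformers import AutoTokenizer, GPT2MQALMHeadModel

# Checkpoint name taken from GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP in this PR.
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = GPT2MQALMHeadModel.from_pretrained("bigcode/santacoder")

inputs = tokenizer("def print_hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```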
## GPT2MQAConfig
[[autodoc]] GPT2MQAConfig
## GPT2MQA specific outputs
[[autodoc]] models.gpt2mqa.modeling_gpt2mqa.GPT2MQADoubleHeadsModelOutput
## GPT2MQAModel
[[autodoc]] GPT2MQAModel
- forward
- parallelize
- deparallelize
## GPT2MQALMHeadModel
[[autodoc]] GPT2MQALMHeadModel
- forward
- parallelize
- deparallelize
## GPT2MQADoubleHeadsModel
[[autodoc]] GPT2MQADoubleHeadsModel
- forward
## GPT2MQAForSequenceClassification
[[autodoc]] GPT2MQAForSequenceClassification
- forward
## GPT2MQAForTokenClassification
[[autodoc]] GPT2MQAForTokenClassification
- forward

View File

@ -82,6 +82,7 @@ Ready-made configurations include the following architectures:
- GPT Neo
- GPT-J
- GPT-Sw3
- GPT2MQA
- GroupViT
- I-BERT
- ImageGPT

View File

@ -62,7 +62,7 @@ Qualsiasi parametro addizionale per il tuo compito può essere incluso nella [`p
>>> generator(
... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
... num_return_sequences=2,
... ) # doctest: +SKIP
>>> ) # doctest: +SKIP
```
### Scegliere modello e tokenizer

View File

@ -267,6 +267,7 @@ _import_structure = {
"models.git": ["GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GitConfig", "GitProcessor", "GitVisionConfig"],
"models.glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"],
"models.gpt2": ["GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2Config", "GPT2Tokenizer"],
"models.gpt2mqa": ["GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2MQAConfig"],
"models.gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"],
"models.gpt_neox": ["GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXConfig"],
"models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"],
@ -1513,6 +1514,18 @@ else:
"load_tf_weights_in_gpt2",
]
)
_import_structure["models.gpt2mqa"].extend(
[
"GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST",
"GPT2MQADoubleHeadsModel",
"GPT2MQAForSequenceClassification",
"GPT2MQAForTokenClassification",
"GPT2MQALMHeadModel",
"GPT2MQAModel",
"GPT2MQAPreTrainedModel",
"load_tf_weights_in_gpt2mqa",
]
)
_import_structure["models.gpt_neo"].extend(
[
"GPT_NEO_PRETRAINED_MODEL_ARCHIVE_LIST",
@ -3689,6 +3702,7 @@ if TYPE_CHECKING:
from .models.git import GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GitConfig, GitProcessor, GitVisionConfig
from .models.glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig
from .models.gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config, GPT2Tokenizer
from .models.gpt2mqa import GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2MQAConfig
from .models.gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig
from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig
from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig
@ -4747,6 +4761,16 @@ if TYPE_CHECKING:
GPT2PreTrainedModel,
load_tf_weights_in_gpt2,
)
from .models.gpt2mqa import (
GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST,
GPT2MQADoubleHeadsModel,
GPT2MQAForSequenceClassification,
GPT2MQAForTokenClassification,
GPT2MQALMHeadModel,
GPT2MQAModel,
GPT2MQAPreTrainedModel,
load_tf_weights_in_gpt2mqa,
)
from .models.gpt_neo import (
GPT_NEO_PRETRAINED_MODEL_ARCHIVE_LIST,
GPTNeoForCausalLM,

View File

@ -78,6 +78,7 @@ from . import (
git,
glpn,
gpt2,
gpt2mqa,
gpt_neo,
gpt_neox,
gpt_neox_japanese,

View File

@ -83,6 +83,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("glpn", "GLPNConfig"),
("gpt-sw3", "GPT2Config"),
("gpt2", "GPT2Config"),
("gpt2mqa", "GPT2MQAConfig"),
("gpt_neo", "GPTNeoConfig"),
("gpt_neox", "GPTNeoXConfig"),
("gpt_neox_japanese", "GPTNeoXJapaneseConfig"),
@ -246,6 +247,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("git", "GIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("glpn", "GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt2", "GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt2mqa", "GPT2MQAMQA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neo", "GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -409,6 +411,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("glpn", "GLPN"),
("gpt-sw3", "GPT-Sw3"),
("gpt2", "OpenAI GPT-2"),
("gpt2mqa", "GPT2MQAMQA"),
("gpt_neo", "GPT Neo"),
("gpt_neox", "GPT NeoX"),
("gpt_neox_japanese", "GPT NeoX Japanese"),

View File

@ -82,6 +82,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("glpn", "GLPNModel"),
("gpt-sw3", "GPT2Model"),
("gpt2", "GPT2Model"),
("gpt2mqa", "GPT2MQAModel"),
("gpt_neo", "GPTNeoModel"),
("gpt_neox", "GPTNeoXModel"),
("gpt_neox_japanese", "GPTNeoXJapaneseModel"),
@ -209,6 +210,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
("funnel", "FunnelForPreTraining"),
("gpt-sw3", "GPT2LMHeadModel"),
("gpt2", "GPT2LMHeadModel"),
("gpt2mqa", "GPT2MQALMHeadModel"),
("ibert", "IBertForMaskedLM"),
("layoutlm", "LayoutLMForMaskedLM"),
("longformer", "LongformerForMaskedLM"),
@ -273,6 +275,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
("git", "GitForCausalLM"),
("gpt-sw3", "GPT2LMHeadModel"),
("gpt2", "GPT2LMHeadModel"),
("gpt2mqa", "GPT2MQALMHeadModel"),
("gpt_neo", "GPTNeoForCausalLM"),
("gpt_neox", "GPTNeoXForCausalLM"),
("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
@ -339,6 +342,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
("git", "GitForCausalLM"),
("gpt-sw3", "GPT2LMHeadModel"),
("gpt2", "GPT2LMHeadModel"),
("gpt2mqa", "GPT2MQALMHeadModel"),
("gpt_neo", "GPTNeoForCausalLM"),
("gpt_neox", "GPTNeoXForCausalLM"),
("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
@ -617,6 +621,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
("funnel", "FunnelForSequenceClassification"),
("gpt-sw3", "GPT2ForSequenceClassification"),
("gpt2", "GPT2ForSequenceClassification"),
("gpt2mqa", "GPT2MQAForSequenceClassification"),
("gpt_neo", "GPTNeoForSequenceClassification"),
("gptj", "GPTJForSequenceClassification"),
("ibert", "IBertForSequenceClassification"),
@ -756,6 +761,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
("funnel", "FunnelForTokenClassification"),
("gpt-sw3", "GPT2ForTokenClassification"),
("gpt2", "GPT2ForTokenClassification"),
("gpt2mqa", "GPT2MQAForTokenClassification"),
("ibert", "IBertForTokenClassification"),
("layoutlm", "LayoutLMForTokenClassification"),
("layoutlmv2", "LayoutLMv2ForTokenClassification"),

View File

@ -140,6 +140,7 @@ else:
("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt2mqa", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("gpt_neox", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("gpt_neox_japanese", ("GPTNeoXJapaneseTokenizer", None)),

View File

@ -559,7 +559,7 @@ class EncoderDecoderModel(PreTrainedModel):
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained(
... "bert-base-uncased", "bert-base-uncased"
... ) # initialize Bert2Bert from pre-trained checkpoints
>>> ) # initialize Bert2Bert from pre-trained checkpoints
>>> # training
>>> model.config.decoder_start_token_id = tokenizer.cls_token_id

View File

@ -542,7 +542,7 @@ class TFEncoderDecoderModel(TFPreTrainedModel, TFCausalLanguageModelingLoss):
>>> # forward
>>> input_ids = tokenizer.encode(
... "Hello, my dog is cute", add_special_tokens=True, return_tensors="tf"
... ) # Batch size 1
>>> ) # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)
>>> # training

View File

@ -1158,7 +1158,7 @@ class FlaubertForQuestionAnswering(FlaubertPreTrainedModel):
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(
... 0
... ) # Batch size 1
>>> ) # Batch size 1
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

View File

@ -1016,7 +1016,7 @@ class TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):
>>> embedding_layer = model.resize_token_embeddings(
... len(tokenizer)
... ) # Update the model embeddings with the new vocabulary size
>>> ) # Update the model embeddings with the new vocabulary size
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
>>> encoded_choices = [tokenizer.encode(s) for s in choices]

View File

@ -0,0 +1,89 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)
_import_structure = {
"configuration_gpt2mqa": ["GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2MQAConfig", "GPT2MQAOnnxConfig"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_gpt2mqa"] = [
"GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST",
"GPT2MQADoubleHeadsModel",
"GPT2MQAForSequenceClassification",
"GPT2MQAForTokenClassification",
"GPT2MQALMHeadModel",
"GPT2MQAModel",
"GPT2MQAPreTrainedModel",
"load_tf_weights_in_gpt2mqa",
]
if TYPE_CHECKING:
from .configuration_gpt2mqa import GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2MQAConfig, GPT2MQAOnnxConfig
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_gpt2mqa import (
GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST,
GPT2MQADoubleHeadsModel,
GPT2MQAForSequenceClassification,
GPT2MQAForTokenClassification,
GPT2MQALMHeadModel,
GPT2MQAModel,
GPT2MQAPreTrainedModel,
load_tf_weights_in_gpt2mqa,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
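For reference, a sketch of what the lazy module buys (assuming torch is installed): the heavy modeling module is only imported on first attribute access.

```python
import transformers

# _LazyModule defers the import of modeling_gpt2mqa until this attribute access.
print(transformers.GPT2MQAModel)
```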

View File

@ -0,0 +1,277 @@
# coding=utf-8
# Copyright 2023 The OpenAI Team Authors and HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" GPT2MQA configuration"""
from collections import OrderedDict
from typing import Any, List, Mapping, Optional
from transformers import PreTrainedTokenizer, TensorType, is_torch_available
from ...configuration_utils import PretrainedConfig
from ...onnx import OnnxConfigWithPast, PatchingSpec
from ...utils import logging
logger = logging.get_logger(__name__)
GPT2MQA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"bigcode/santacoder": "https://huggingface.co/bigcode/santacoder/resolve/main/config.json",
}
MULTI_HEAD = "multihead"
MULTI_QUERY = "multiquery"
class GPT2MQAConfig(PretrainedConfig):
"""
    This is the configuration class to store the configuration of a [`GPT2MQAModel`]. It is used to instantiate a
    GPT2MQA model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the
    [bigcode/santacoder](https://huggingface.co/bigcode/santacoder) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50257):
            Vocabulary size of the GPT2MQA model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`GPT2MQAModel`].
n_positions (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
n_embd (`int`, *optional*, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
n_inner (`int`, *optional*, defaults to None):
Dimensionality of the inner feed-forward layers. `None` will set it to 4 times n_embd
activation_function (`str`, *optional*, defaults to `"gelu"`):
Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (`int`, *optional*, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
The epsilon to use in the layer normalization layers.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (`string`, *optional*, defaults to `"cls_index"`):
            Argument used when doing sequence summary, used in the model [`GPT2MQADoubleHeadsModel`].
Has to be one of the following options:
- `"last"`: Take the last token hidden state (like XLNet).
- `"first"`: Take the first token hidden state (like BERT).
- `"mean"`: Take the mean of all tokens hidden states.
- `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- `"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (`bool`, *optional*, defaults to `True`):
            Argument used when doing sequence summary, used in the model [`GPT2MQADoubleHeadsModel`].
Whether or not to add a projection after the vector extraction.
summary_activation (`str`, *optional*):
            Argument used when doing sequence summary. Used for the multiple choice head in
            [`GPT2MQADoubleHeadsModel`].
Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (`bool`, *optional*, defaults to `True`):
            Argument used when doing sequence summary, used in the model [`GPT2MQADoubleHeadsModel`].
Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes.
summary_first_dropout (`float`, *optional*, defaults to 0.1):
            Argument used when doing sequence summary, used in the model [`GPT2MQADoubleHeadsModel`].
The dropout ratio to be used after the projection and activation.
scale_attn_weights (`bool`, *optional*, defaults to `True`):
            Scale attention weights by dividing by sqrt(hidden_size).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`):
            Whether to additionally scale attention weights by `1 / (layer_idx + 1)`.
reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`):
Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention
dot-product/softmax to float() when training with mixed precision.
attention_head_type (`str`, *optional*, defaults to `"multiquery"`):
            The attention variant to use: `"multiquery"` for multi-query attention (the default) or `"multihead"`
            for standard multi-head attention.
Example:
```python
>>> from transformers import GPT2MQAConfig, GPT2MQAModel
>>> # Initializing a GPT2MQA configuration
>>> configuration = GPT2MQAConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = GPT2MQAModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "gpt2mqa"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {
"hidden_size": "n_embd",
"max_position_embeddings": "n_positions",
"num_attention_heads": "n_head",
"num_hidden_layers": "n_layer",
}
def __init__(
self,
vocab_size=50257,
n_positions=1024,
n_embd=768,
n_layer=12,
n_head=12,
n_inner=None,
activation_function="gelu_new",
resid_pdrop=0.1,
embd_pdrop=0.1,
attn_pdrop=0.1,
layer_norm_epsilon=1e-5,
initializer_range=0.02,
summary_type="cls_index",
summary_use_proj=True,
summary_activation=None,
summary_proj_to_labels=True,
summary_first_dropout=0.1,
scale_attn_weights=True,
use_cache=True,
bos_token_id=50256,
eos_token_id=50256,
scale_attn_by_inverse_layer_idx=False,
reorder_and_upcast_attn=False,
attention_head_type=MULTI_QUERY,
**kwargs,
):
self.vocab_size = vocab_size
self.n_positions = n_positions
self.n_embd = n_embd
self.n_layer = n_layer
self.n_head = n_head
self.n_inner = n_inner
self.activation_function = activation_function
self.resid_pdrop = resid_pdrop
self.embd_pdrop = embd_pdrop
self.attn_pdrop = attn_pdrop
self.layer_norm_epsilon = layer_norm_epsilon
self.initializer_range = initializer_range
self.summary_type = summary_type
self.summary_use_proj = summary_use_proj
self.summary_activation = summary_activation
self.summary_first_dropout = summary_first_dropout
self.summary_proj_to_labels = summary_proj_to_labels
self.scale_attn_weights = scale_attn_weights
self.use_cache = use_cache
self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
self.reorder_and_upcast_attn = reorder_and_upcast_attn
self.attention_head_type = attention_head_type
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
class GPT2MQAOnnxConfig(OnnxConfigWithPast):
def __init__(
self,
config: PretrainedConfig,
task: str = "default",
patching_specs: List[PatchingSpec] = None,
use_past: bool = False,
):
super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
if not getattr(self._config, "pad_token_id", None):
# TODO: how to do that better?
self._config.pad_token_id = 0
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
if self.use_past:
self.fill_with_past_key_values_(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch", 1: "past_sequence + sequence"}
else:
common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
return common_inputs
@property
def num_layers(self) -> int:
return self._config.n_layer
@property
def num_attention_heads(self) -> int:
return self._config.n_head
def generate_dummy_inputs(
self,
tokenizer: PreTrainedTokenizer,
batch_size: int = -1,
seq_length: int = -1,
is_pair: bool = False,
framework: Optional[TensorType] = None,
) -> Mapping[str, Any]:
common_inputs = super(OnnxConfigWithPast, self).generate_dummy_inputs(
tokenizer, batch_size=batch_size, seq_length=seq_length, is_pair=is_pair, framework=framework
)
# We need to order the input in the way they appears in the forward()
ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
# Need to add the past_keys
if self.use_past:
if not is_torch_available():
raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
else:
import torch
batch, seqlen = common_inputs["input_ids"].shape
# Not using the same length for past_key_values
past_key_values_length = seqlen + 2
past_shape = (
batch,
self.num_attention_heads,
past_key_values_length,
self._config.hidden_size // self.num_attention_heads,
)
ordered_inputs["past_key_values"] = [
(torch.zeros(past_shape), torch.zeros(past_shape)) for _ in range(self.num_layers)
]
ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
if self.use_past:
mask_dtype = ordered_inputs["attention_mask"].dtype
ordered_inputs["attention_mask"] = torch.cat(
[ordered_inputs["attention_mask"], torch.ones(batch, past_key_values_length, dtype=mask_dtype)], dim=1
)
return ordered_inputs
@property
def default_onnx_opset(self) -> int:
return 13
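A hedged sketch of driving this ONNX config through `transformers.onnx.export`; the import path for `GPT2MQAOnnxConfig` follows this PR's file layout, and the randomly initialized model stands in for a trained checkpoint:

```python
from pathlib import Path

from transformers import AutoTokenizer, GPT2MQAConfig, GPT2MQAModel
from transformers.models.gpt2mqa.configuration_gpt2mqa import GPT2MQAOnnxConfig
from transformers.onnx import export

model = GPT2MQAModel(GPT2MQAConfig())
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT2MQA reuses the GPT-2 tokenizer
onnx_config = GPT2MQAOnnxConfig(model.config)

# export(preprocessor, model, config, opset, output) writes the ONNX graph.
export(tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("gpt2mqa.onnx"))
```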

View File

@ -0,0 +1,75 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert OpenAI GPT checkpoint."""
import argparse
import torch
from transformers import GPT2MQAConfig, GPT2MQAModel, load_tf_weights_in_gpt2mqa
from transformers.utils import CONFIG_NAME, WEIGHTS_NAME, logging
logging.set_verbosity_info()
def convert_gpt2mqa_checkpoint_to_pytorch(gpt2mqa_checkpoint_path, gpt2mqa_config_file, pytorch_dump_folder_path):
# Construct model
if gpt2mqa_config_file == "":
config = GPT2MQAConfig()
else:
config = GPT2MQAConfig.from_json_file(gpt2mqa_config_file)
model = GPT2MQAModel(config)
# Load weights from numpy
load_tf_weights_in_gpt2mqa(model, config, gpt2mqa_checkpoint_path)
# Save pytorch-model
pytorch_weights_dump_path = pytorch_dump_folder_path + "/" + WEIGHTS_NAME
pytorch_config_dump_path = pytorch_dump_folder_path + "/" + CONFIG_NAME
print(f"Save PyTorch model to {pytorch_weights_dump_path}")
torch.save(model.state_dict(), pytorch_weights_dump_path)
print(f"Save configuration file to {pytorch_config_dump_path}")
with open(pytorch_config_dump_path, "w", encoding="utf-8") as f:
f.write(config.to_json_string())
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--gpt2mqa_checkpoint_path",
default=None,
type=str,
required=True,
help="Path to the TensorFlow checkpoint path.",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
parser.add_argument(
"--gpt2mqa_config_file",
default="",
type=str,
help=(
"An optional config json file corresponding to the pre-trained OpenAI model. \n"
"This specifies the model architecture."
),
)
args = parser.parse_args()
convert_gpt2mqa_checkpoint_to_pytorch(
args.gpt2mqa_checkpoint_path, args.gpt2mqa_config_file, args.pytorch_dump_folder_path
)
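A hedged sketch of invoking the converter from Python; the module path is hypothetical (the diff view does not show this file's name) and the paths are placeholders:

```python
from transformers.models.gpt2mqa.convert_gpt2mqa_original_tf_checkpoint_to_pytorch import (  # hypothetical module name
    convert_gpt2mqa_checkpoint_to_pytorch,
)

convert_gpt2mqa_checkpoint_to_pytorch(
    gpt2mqa_checkpoint_path="/path/to/tf_checkpoint",
    gpt2mqa_config_file="",  # empty string falls back to the default GPT2MQAConfig()
    pytorch_dump_folder_path="/path/to/output",
)
```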

File diff suppressed because it is too large

View File

@ -998,7 +998,7 @@ class ImageGPTForCausalImageModeling(ImageGPTPreTrainedModel):
>>> samples = output[:, 1:].cpu().detach().numpy()
>>> samples_img = [
... np.reshape(np.rint(127.5 * (clusters[s] + 1.0)), [height, width, 3]).astype(np.uint8) for s in samples
... ] # convert color cluster tokens back to pixels
>>> ] # convert color cluster tokens back to pixels
>>> f, axes = plt.subplots(1, batch_size, dpi=300)
>>> for img, ax in zip(samples_img, axes):

View File

@ -1682,10 +1682,10 @@ class LongformerModel(LongformerPreTrainedModel):
>>> attention_mask = torch.ones(
... input_ids.shape, dtype=torch.long, device=input_ids.device
... ) # initialize to local attention
>>> ) # initialize to local attention
>>> global_attention_mask = torch.zeros(
... input_ids.shape, dtype=torch.long, device=input_ids.device
... ) # initialize to global attention to be deactivated for all tokens
>>> ) # initialize to global attention to be deactivated for all tokens
>>> global_attention_mask[
... :,
... [
@ -1693,7 +1693,7 @@ class LongformerModel(LongformerPreTrainedModel):
... 4,
... 21,
... ],
... ] = 1 # Set global attention to random tokens for the sake of this example
>>> ] = 1 # Set global attention to random tokens for the sake of this example
>>> # Usually, set global attention based on the task. For example,
>>> # classification: the <s> token
>>> # QA: question tokens
@ -2083,7 +2083,7 @@ class LongformerForQuestionAnswering(LongformerPreTrainedModel):
>>> answer_tokens = all_tokens[torch.argmax(start_logits) : torch.argmax(end_logits) + 1]
>>> answer = tokenizer.decode(
... tokenizer.convert_tokens_to_ids(answer_tokens)
... ) # remove space prepending space token
>>> ) # remove space prepending space token
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict

View File

@ -2128,7 +2128,7 @@ FLAX_LONGT5_MODEL_DOCSTRING = """
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="np"
... ).input_ids
>>> ).input_ids
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="np").input_ids
>>> # forward pass

View File

@ -1835,7 +1835,7 @@ class LongT5Model(LongT5PreTrainedModel):
>>> # Let's try a very long encoder input.
>>> input_ids = tokenizer(
... 100 * "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
@ -2202,7 +2202,7 @@ class LongT5EncoderModel(LongT5PreTrainedModel):
>>> model = LongT5EncoderModel.from_pretrained("google/long-t5-local-base")
>>> input_ids = tokenizer(
... 100 * "Studies have been shown that owning a dog is good for you ", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids)
>>> last_hidden_states = outputs.last_hidden_state
```"""

View File

@ -1098,11 +1098,11 @@ class LukeModel(LukePreTrainedModel):
>>> entities = [
... "Beyoncé",
... "Los Angeles",
... ] # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
>>> ] # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
>>> entity_spans = [
... (0, 7),
... (17, 28),
... ] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> ] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> encoding = tokenizer(
... text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt"
@ -1589,7 +1589,7 @@ class LukeForEntityPairClassification(LukePreTrainedModel):
>>> entity_spans = [
... (0, 7),
... (17, 28),
... ] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> ] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

View File

@ -2414,13 +2414,13 @@ class Mask2FormerForUniversalSegmentation(Mask2FormerPreTrainedModel):
>>> # Perform post-processing to get semantic, instance or panoptic segmentation maps
>>> pred_semantic_map = image_processor.post_process_semantic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]
>>> )[0]
>>> pred_instance_map = image_processor.post_process_instance_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> )[0]["segmentation"]
>>> pred_panoptic_map = image_processor.post_process_panoptic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> )[0]["segmentation"]
```
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

View File

@ -1763,7 +1763,7 @@ class MaskFormerForInstanceSegmentation(MaskFormerPreTrainedModel):
>>> # you can pass them to image_processor for postprocessing
>>> predicted_semantic_map = image_processor.post_process_semantic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]
>>> )[0]
>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> list(predicted_semantic_map.shape)

View File

@ -1396,7 +1396,7 @@ class MT5Model(MT5PreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for MT5Model.
@ -1636,7 +1636,7 @@ class MT5ForConditionalGeneration(MT5PreTrainedModel):
>>> # inference
>>> input_ids = tokenizer(
... "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model.generate(input_ids)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you.
@ -1915,7 +1915,7 @@ class MT5EncoderModel(MT5PreTrainedModel):
>>> model = MT5EncoderModel.from_pretrained("mt5-small")
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids)
>>> last_hidden_states = outputs.last_hidden_state
```"""

View File

@ -3115,7 +3115,7 @@ class OneFormerForUniversalSegmentation(OneFormerPreTrainedModel):
>>> # you can pass them to feature_extractor for semantic postprocessing
>>> predicted_semantic_map = feature_extractor.post_process_semantic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]
>>> )[0]
>>> f"👉 Semantic Predictions Shape: {list(predicted_semantic_map.shape)}"
'👉 Semantic Predictions Shape: [512, 683]'
@ -3132,7 +3132,7 @@ class OneFormerForUniversalSegmentation(OneFormerPreTrainedModel):
>>> # you can pass them to feature_extractor for instance postprocessing
>>> predicted_instance_map = feature_extractor.post_process_instance_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> )[0]["segmentation"]
>>> f"👉 Instance Predictions Shape: {list(predicted_instance_map.shape)}"
'👉 Instance Predictions Shape: [512, 683]'
@ -3149,7 +3149,7 @@ class OneFormerForUniversalSegmentation(OneFormerPreTrainedModel):
>>> # you can pass them to feature_extractor for panoptic postprocessing
>>> predicted_panoptic_map = feature_extractor.post_process_panoptic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> )[0]["segmentation"]
>>> f"👉 Panoptic Predictions Shape: {list(predicted_panoptic_map.shape)}"
'👉 Panoptic Predictions Shape: [512, 683]'
```

View File

@ -683,7 +683,7 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
>>> model = OpenAIGPTDoubleHeadsModel.from_pretrained("openai-gpt")
>>> tokenizer.add_special_tokens(
... {"cls_token": "[CLS]"}
... ) # Add a [CLS] to the vocabulary (we should train it also!)
>>> ) # Add a [CLS] to the vocabulary (we should train it also!)
>>> model.resize_token_embeddings(len(tokenizer))
>>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]

View File

@ -722,9 +722,9 @@ class TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}
>>> inputs["mc_token_ids"] = tf.constant(
... [inputs["input_ids"].shape[-1] - 1, inputs["input_ids"].shape[-1] - 1]
... )[
>>> )[
... None, :
... ] # Batch size 1
>>> ] # Batch size 1
>>> outputs = model(inputs)
>>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
```"""

View File

@ -1836,7 +1836,7 @@ class ProphetNetModel(ProphetNetPreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
@ -1964,7 +1964,7 @@ class ProphetNetForConditionalGeneration(ProphetNetPreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
@ -2234,7 +2234,7 @@ class ProphetNetForCausalLM(ProphetNetPreTrainedModel):
>>> input_ids = tokenizer_enc(ARTICLE, return_tensors="pt").input_ids
>>> labels = tokenizer_dec(
... "us rejects charges against its ambassador in bolivia", return_tensors="pt"
... ).input_ids
>>> ).input_ids
>>> outputs = model(input_ids=input_ids, decoder_input_ids=labels[:, :-1], labels=labels[:, 1:])
>>> loss = outputs.loss

View File

@ -830,7 +830,7 @@ class RagSequenceForGeneration(RagPreTrainedModel):
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
... question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
... context_input_ids=docs_dict["context_input_ids"],
@ -1298,7 +1298,7 @@ class RagTokenForGeneration(RagPreTrainedModel):
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
... question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
... context_input_ids=docs_dict["context_input_ids"],

View File

@ -353,7 +353,7 @@ class RagRetriever:
>>> dataset = (
... ...
... ) # dataset must be a datasets.Datasets object with columns "title", "text" and "embeddings", and it must have a faiss index
>>> ) # dataset must be a datasets.Datasets object with columns "title", "text" and "embeddings", and it must have a faiss index
>>> retriever = RagRetriever.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", indexed_dataset=dataset)
>>> # To load your own indexed dataset built with the datasets library that was saved on disk. More info in examples/rag/use_own_knowledge_dataset.py

View File

@ -1796,7 +1796,7 @@ class RealmForOpenQA(RealmPreTrainedModel):
... add_special_tokens=False,
... return_token_type_ids=False,
... return_attention_mask=False,
... ).input_ids
>>> ).input_ids
>>> reader_output, predicted_answer_ids = model(**question_ids, answer_ids=answer_ids, return_dict=False)
>>> predicted_answer = tokenizer.decode(predicted_answer_ids)


@ -1406,7 +1406,7 @@ class TFSpeech2TextForConditionalGeneration(TFSpeech2TextPreTrainedModel, TFCausalLanguageModelingLoss):
>>> input_features = processor(
... ds["speech"][0], sampling_rate=16000, return_tensors="tf"
... ).input_features # Batch size 1
>>> ).input_features # Batch size 1
>>> generated_ids = model.generate(input_features)
>>> transcription = processor.batch_decode(generated_ids)


@ -1412,7 +1412,7 @@ class SwitchTransformersModel(SwitchTransformersPreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for SwitchTransformersModel.
@ -1604,7 +1604,7 @@ class SwitchTransformersForConditionalGeneration(SwitchTransformersPreTrainedModel):
>>> # inference
>>> input_ids = tokenizer(
... "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model.generate(input_ids)
>>> # . To, lets say you have a dog. To summarize:
>>> # Since the model has been trained on MLM, this will output gibberish
@ -1877,7 +1877,7 @@ class SwitchTransformersEncoderModel(SwitchTransformersPreTrainedModel):
>>> model = SwitchTransformersEncoderModel.from_pretrained("google/switch-base-8")
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids)
>>> last_hidden_states = outputs.last_hidden_state
```"""


@ -1387,7 +1387,7 @@ FLAX_T5_MODEL_DOCSTRING = """
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="np"
... ).input_ids
>>> ).input_ids
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="np").input_ids
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for T5Model.


@ -1393,7 +1393,7 @@ class T5Model(T5PreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for T5Model.
@ -1604,7 +1604,7 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
>>> # inference
>>> input_ids = tokenizer(
... "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model.generate(input_ids)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you.
@ -1850,7 +1850,7 @@ class T5EncoderModel(T5PreTrainedModel):
>>> model = T5EncoderModel.from_pretrained("t5-small")
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids)
>>> last_hidden_states = outputs.last_hidden_state
```"""


@ -1189,7 +1189,7 @@ class TFT5Model(TFT5PreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="tf"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="tf").input_ids # Batch size 1
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for T5Model.
@ -1381,7 +1381,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel, TFCausalLanguageModelingLoss):
>>> # inference
>>> inputs = tokenizer(
... "summarize: studies have shown that owning a dog is good for you", return_tensors="tf"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model.generate(inputs)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you
@ -1583,7 +1583,7 @@ class TFT5EncoderModel(TFT5PreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="tf"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> outputs = model(input_ids)
```"""


@ -1056,7 +1056,7 @@ class TapasForMaskedLM(TapasPreTrainedModel):
... )
>>> labels = tokenizer(
... table=table, queries="How many movies has George Clooney played in?", return_tensors="pt"
... )["input_ids"]
>>> )["input_ids"]
>>> outputs = model(**inputs, labels=labels)
>>> logits = outputs.logits


@ -1122,7 +1122,7 @@ class TFTapasForMaskedLM(TFTapasPreTrainedModel, TFMaskedLanguageModelingLoss):
... )
>>> labels = tokenizer(
... table=table, queries="How many movies has George Clooney played in?", return_tensors="tf"
... )["input_ids"]
>>> )["input_ids"]
>>> outputs = model(**inputs, labels=labels)
>>> logits = outputs.logits


@ -313,7 +313,7 @@ class TFVisionEncoderDecoderModel(TFPreTrainedModel, TFCausalLanguageModelingLoss):
>>> output_ids = model.generate(
... pixel_values, max_length=16, num_beams=4, return_dict_in_generate=True
... ).sequences
>>> ).sequences
>>> preds = decoder_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
>>> preds = [pred.strip() for pred in preds]


@ -262,7 +262,7 @@ class VisionEncoderDecoderModel(PreTrainedModel):
>>> output_ids = model.generate(
... pixel_values, max_length=16, num_beams=4, return_dict_in_generate=True
... ).sequences
>>> ).sequences
>>> preds = decoder_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
>>> preds = [pred.strip() for pred in preds]


@ -1084,7 +1084,7 @@ FLAX_WAV2VEC2_MODEL_DOCSTRING = """
>>> input_values = processor(
... ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values # Batch size 1
>>> ).input_values # Batch size 1
>>> hidden_states = model(input_values).last_hidden_state
```
"""
@ -1203,7 +1203,7 @@ FLAX_WAV2VEC2_FOR_CTC_DOCSTRING = """
>>> input_values = processor(
... ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values # Batch size 1
>>> ).input_values # Batch size 1
>>> logits = model(input_values).logits
>>> predicted_ids = jnp.argmax(logits, axis=-1)


@ -1467,7 +1467,7 @@ class Wav2Vec2ForPreTraining(Wav2Vec2PreTrainedModel):
>>> model = model.train()
>>> loss = model(
... input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
... ).loss
>>> ).loss
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict


@ -1511,7 +1511,7 @@ class Wav2Vec2ConformerForPreTraining(Wav2Vec2ConformerPreTrainedModel):
>>> model = model.train()
>>> loss = model(
... input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
... ).loss
>>> ).loss
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict


@ -1041,7 +1041,7 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(
... 0
... ) # Batch size 1
>>> ) # Batch size 1
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])


@ -1860,7 +1860,7 @@ class XLMProphetNetModel(XLMProphetNetPreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
@ -1991,7 +1991,7 @@ class XLMProphetNetForConditionalGeneration(XLMProphetNetPreTrainedModel):
>>> input_ids = tokenizer(
... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids # Batch size 1
>>> ).input_ids # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
@ -2264,7 +2264,7 @@ class XLMProphetNetForCausalLM(XLMProphetNetPreTrainedModel):
>>> input_ids = tokenizer_enc(ARTICLE, return_tensors="pt").input_ids
>>> labels = tokenizer_dec(
... "us rejects charges against its ambassador in bolivia", return_tensors="pt"
... ).input_ids
>>> ).input_ids
>>> outputs = model(input_ids=input_ids, decoder_input_ids=labels[:, :-1], labels=labels[:, 1:])
>>> loss = outputs.loss


@ -1297,17 +1297,17 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
>>> # We show how to setup inputs to predict a next token using a bi-directional context.
>>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=True))[
... None, :
... ] # We will predict the masked token
>>> ] # We will predict the masked token
>>> perm_mask = np.zeros((1, input_ids.shape[1], input_ids.shape[1]))
>>> perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
>>> target_mapping = np.zeros(
... (1, 1, input_ids.shape[1])
... ) # Shape [1, 1, seq_length] => let's predict one token
>>> ) # Shape [1, 1, seq_length] => let's predict one token
>>> target_mapping[
... 0, 0, -1
... ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> outputs = model(
... input_ids,
@ -1317,7 +1317,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
>>> next_token_logits = outputs[
... 0
... ] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
>>> ] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
```"""
transformer_outputs = self.transformer(
input_ids=input_ids,


@ -1403,47 +1403,47 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
>>> # We show how to setup inputs to predict a next token using a bi-directional context.
>>> input_ids = torch.tensor(
... tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=False)
... ).unsqueeze(
>>> ).unsqueeze(
... 0
... ) # We will predict the masked token
>>> ) # We will predict the masked token
>>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
>>> perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
>>> target_mapping = torch.zeros(
... (1, 1, input_ids.shape[1]), dtype=torch.float
... ) # Shape [1, 1, seq_length] => let's predict one token
>>> ) # Shape [1, 1, seq_length] => let's predict one token
>>> target_mapping[
... 0, 0, -1
... ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
>>> next_token_logits = outputs[
... 0
... ] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
>>> ] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
>>> # The same way can the XLNetLMHeadModel be used to be trained by standard auto-regressive language modeling.
>>> input_ids = torch.tensor(
... tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=False)
... ).unsqueeze(
>>> ).unsqueeze(
... 0
... ) # We will predict the masked token
>>> ) # We will predict the masked token
>>> labels = torch.tensor(tokenizer.encode("cute", add_special_tokens=False)).unsqueeze(0)
>>> assert labels.shape[0] == 1, "only one word will be predicted"
>>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
>>> perm_mask[
... :, :, -1
... ] = 1.0 # Previous tokens don't see last token as is done in standard auto-regressive lm training
>>> ] = 1.0 # Previous tokens don't see last token as is done in standard auto-regressive lm training
>>> target_mapping = torch.zeros(
... (1, 1, input_ids.shape[1]), dtype=torch.float
... ) # Shape [1, 1, seq_length] => let's predict one token
>>> ) # Shape [1, 1, seq_length] => let's predict one token
>>> target_mapping[
... 0, 0, -1
... ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> ] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
>>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)
>>> loss = outputs.loss
>>> next_token_logits = (
... outputs.logits
... ) # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
>>> ) # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
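Because the reflowed `perm_mask`/`target_mapping` lines are hard to follow in diff form, here is the same setup as a compact, self-contained sketch (toy length, dummy values):

import torch

seq_len = 6  # toy sequence; the last position is the token to predict

# perm_mask[b, i, j] = 1.0 means token i may NOT attend to token j.
perm_mask = torch.zeros((1, seq_len, seq_len), dtype=torch.float)
perm_mask[:, :, -1] = 1.0  # previous tokens don't see the last token

# target_mapping picks the predicted positions: one prediction, the last token.
target_mapping = torch.zeros((1, 1, seq_len), dtype=torch.float)
target_mapping[0, 0, -1] = 1.0
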
@ -1983,7 +1983,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(
... 0
... ) # Batch size 1
>>> ) # Batch size 1
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)


@ -277,7 +277,7 @@ PT_SEQUENCE_CLASSIFICATION_SAMPLE = r"""
>>> labels = torch.sum(
... torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
```
"""


@ -2761,6 +2761,55 @@ def load_tf_weights_in_gpt2(*args, **kwargs):
requires_backends(load_tf_weights_in_gpt2, ["torch"])
GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST = None


class GPT2MQADoubleHeadsModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GPT2MQAForSequenceClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GPT2MQAForTokenClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GPT2MQALMHeadModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GPT2MQAModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GPT2MQAPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


def load_tf_weights_in_gpt2mqa(*args, **kwargs):
    requires_backends(load_tf_weights_in_gpt2mqa, ["torch"])


GPT_NEO_PRETRAINED_MODEL_ARCHIVE_LIST = None
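The classes above follow the library's backend-placeholder pattern: the names stay importable without PyTorch, and any real use fails fast with an informative error. A simplified sketch of the idea, not the actual `transformers.utils` implementation:

class DummyObject(type):
    """Metaclass that makes public attribute access fail with a helpful error."""

    def __getattribute__(cls, key):
        if key.startswith("_"):
            return super().__getattribute__(key)
        raise ImportError(f"{cls.__name__} requires the torch backend.")


class GPT2MQAModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        raise ImportError("GPT2MQAModel requires the torch backend.")


# GPT2MQAModel.from_pretrained -> ImportError before any torch code runs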


@ -0,0 +1,817 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import math
import unittest
from transformers import GPT2MQAConfig, is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from transformers import (
GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST,
GPT2MQADoubleHeadsModel,
GPT2MQAForSequenceClassification,
GPT2MQAForTokenClassification,
GPT2MQALMHeadModel,
GPT2MQAModel,
GPT2Tokenizer,
)
class GPT2MQAModelTester:
def __init__(
self,
parent,
batch_size=14,
seq_length=7,
is_training=True,
use_token_type_ids=True,
use_input_mask=True,
use_labels=True,
use_mc_token_ids=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
num_choices=4,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_token_type_ids = use_token_type_ids
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.use_mc_token_ids = use_mc_token_ids
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = None
self.bos_token_id = vocab_size - 1
self.eos_token_id = vocab_size - 1
self.pad_token_id = vocab_size - 1
def get_large_model_config(self):
return GPT2MQAConfig.from_pretrained("bigcode/santacoder")
def prepare_config_and_inputs(
self, gradient_checkpointing=False, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False
):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
mc_token_ids = None
if self.use_mc_token_ids:
mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
sequence_labels = None
token_labels = None
choice_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
config = self.get_config(
gradient_checkpointing=gradient_checkpointing,
scale_attn_by_inverse_layer_idx=scale_attn_by_inverse_layer_idx,
reorder_and_upcast_attn=reorder_and_upcast_attn,
)
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
return (
config,
input_ids,
input_mask,
head_mask,
token_type_ids,
mc_token_ids,
sequence_labels,
token_labels,
choice_labels,
)
def get_config(
self, gradient_checkpointing=False, scale_attn_by_inverse_layer_idx=False, reorder_and_upcast_attn=False
):
return GPT2MQAConfig(
vocab_size=self.vocab_size,
n_embd=self.hidden_size,
n_layer=self.num_hidden_layers,
n_head=self.num_attention_heads,
n_inner=self.intermediate_size,
activation_function=self.hidden_act,
resid_pdrop=self.hidden_dropout_prob,
attn_pdrop=self.attention_probs_dropout_prob,
n_positions=self.max_position_embeddings,
type_vocab_size=self.type_vocab_size,
initializer_range=self.initializer_range,
use_cache=True,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
pad_token_id=self.pad_token_id,
gradient_checkpointing=gradient_checkpointing,
scale_attn_by_inverse_layer_idx=scale_attn_by_inverse_layer_idx,
reorder_and_upcast_attn=reorder_and_upcast_attn,
)
def get_pipeline_config(self):
config = self.get_config()
config.vocab_size = 300
return config
def prepare_config_and_inputs_for_decoder(self):
(
config,
input_ids,
input_mask,
head_mask,
token_type_ids,
mc_token_ids,
sequence_labels,
token_labels,
choice_labels,
) = self.prepare_config_and_inputs()
encoder_hidden_states = floats_tensor([self.batch_size, self.seq_length, self.hidden_size])
encoder_attention_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
return (
config,
input_ids,
input_mask,
head_mask,
token_type_ids,
sequence_labels,
token_labels,
choice_labels,
encoder_hidden_states,
encoder_attention_mask,
)
def create_and_check_gpt2mqa_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
model = GPT2MQAModel(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask)
result = model(input_ids, token_type_ids=token_type_ids)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(len(result.past_key_values), config.n_layer)
def create_and_check_gpt2mqa_model_past(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
model = GPT2MQAModel(config=config)
model.to(torch_device)
model.eval()
# first forward pass
outputs = model(input_ids, token_type_ids=token_type_ids, use_cache=True)
outputs_use_cache_conf = model(input_ids, token_type_ids=token_type_ids)
outputs_no_past = model(input_ids, token_type_ids=token_type_ids, use_cache=False)
self.parent.assertTrue(len(outputs) == len(outputs_use_cache_conf))
self.parent.assertTrue(len(outputs) == len(outputs_no_past) + 1)
output, past = outputs.to_tuple()
        # create hypothetical next token and extend next_input_ids
next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size)
next_token_types = ids_tensor([self.batch_size, 1], self.type_vocab_size)
# append to next input_ids and token_type_ids
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_token_type_ids = torch.cat([token_type_ids, next_token_types], dim=-1)
output_from_no_past = model(next_input_ids, token_type_ids=next_token_type_ids)["last_hidden_state"]
output_from_past = model(next_tokens, token_type_ids=next_token_types, past_key_values=past)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -1, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, 0, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
def create_and_check_gpt2mqa_model_attention_mask_past(
self, config, input_ids, input_mask, head_mask, token_type_ids, *args
):
model = GPT2MQAModel(config=config)
model.to(torch_device)
model.eval()
# create attention mask
attn_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
half_seq_length = self.seq_length // 2
attn_mask[:, half_seq_length:] = 0
# first forward pass
output, past = model(input_ids, attention_mask=attn_mask).to_tuple()
        # create hypothetical next token and extend next_input_ids
next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size)
# change a random masked slice from input_ids
random_seq_idx_to_change = ids_tensor((1,), half_seq_length).item() + 1
random_other_next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size).squeeze(-1)
input_ids[:, -random_seq_idx_to_change] = random_other_next_tokens
# append to next input_ids and attn_mask
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
attn_mask = torch.cat(
[attn_mask, torch.ones((attn_mask.shape[0], 1), dtype=torch.long, device=torch_device)],
dim=1,
)
# get two different outputs
output_from_no_past = model(next_input_ids, attention_mask=attn_mask)["last_hidden_state"]
output_from_past = model(next_tokens, past_key_values=past, attention_mask=attn_mask)["last_hidden_state"]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -1, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, 0, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
def create_and_check_gpt2mqa_model_past_large_inputs(
self, config, input_ids, input_mask, head_mask, token_type_ids, *args
):
model = GPT2MQAModel(config=config)
model.to(torch_device)
model.eval()
# first forward pass
outputs = model(input_ids, token_type_ids=token_type_ids, attention_mask=input_mask, use_cache=True)
output, past = outputs.to_tuple()
        # create hypothetical next token and extend next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_token_types = ids_tensor([self.batch_size, 3], self.type_vocab_size)
next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
# append to next input_ids and token_type_ids
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_token_type_ids = torch.cat([token_type_ids, next_token_types], dim=-1)
next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)
output_from_no_past = model(
next_input_ids, token_type_ids=next_token_type_ids, attention_mask=next_attention_mask
)["last_hidden_state"]
output_from_past = model(
next_tokens, token_type_ids=next_token_types, attention_mask=next_attention_mask, past_key_values=past
)["last_hidden_state"]
self.parent.assertTrue(output_from_past.shape[1] == next_tokens.shape[1])
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
def create_and_check_lm_head_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
model = GPT2MQALMHeadModel(config)
model.to(torch_device)
model.eval()
result = model(input_ids, token_type_ids=token_type_ids, labels=input_ids)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
def create_and_check_forward_and_backwards(
self, config, input_ids, input_mask, head_mask, token_type_ids, *args, gradient_checkpointing=False
):
model = GPT2MQALMHeadModel(config)
model.to(torch_device)
if gradient_checkpointing:
model.gradient_checkpointing_enable()
result = model(input_ids, token_type_ids=token_type_ids, labels=input_ids)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
result.loss.backward()
def create_and_check_double_lm_head_model(
self, config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, *args
):
model = GPT2MQADoubleHeadsModel(config)
model.to(torch_device)
model.eval()
multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
multiple_choice_input_mask = input_mask.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
multiple_choice_token_type_ids = token_type_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
inputs = {
"input_ids": multiple_choice_inputs_ids,
"mc_token_ids": mc_token_ids,
"attention_mask": multiple_choice_input_mask,
"token_type_ids": multiple_choice_token_type_ids,
"labels": multiple_choice_inputs_ids,
}
result = model(**inputs)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(
result.logits.shape, (self.batch_size, self.num_choices, self.seq_length, self.vocab_size)
)
self.parent.assertEqual(result.mc_logits.shape, (self.batch_size, self.num_choices))
def create_and_check_gpt2mqa_for_sequence_classification(
self, config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, *args
):
config.num_labels = self.num_labels
model = GPT2MQAForSequenceClassification(config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_gpt2mqa_for_token_classification(
self, config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, *args
):
config.num_labels = self.num_labels
model = GPT2MQAForTokenClassification(config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.num_labels))
def create_and_check_gpt2mqa_weight_initialization(self, config, *args):
model = GPT2MQAModel(config)
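        # GPT-2-style init scales residual projections ("c_proj") by
        # 1 / sqrt(2 * n_layer), hence the expected std checked below.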
model_std = model.config.initializer_range / math.sqrt(2 * model.config.n_layer)
for key in model.state_dict().keys():
if "c_proj" in key and "weight" in key:
self.parent.assertLessEqual(abs(torch.std(model.state_dict()[key]) - model_std), 0.001)
self.parent.assertLessEqual(abs(torch.mean(model.state_dict()[key]) - 0.0), 0.01)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
input_mask,
head_mask,
token_type_ids,
mc_token_ids,
sequence_labels,
token_labels,
choice_labels,
) = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"token_type_ids": token_type_ids,
"head_mask": head_mask,
}
return config, inputs_dict
@require_torch
class GPT2MQAModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (
(
GPT2MQAModel,
GPT2MQALMHeadModel,
GPT2MQADoubleHeadsModel,
GPT2MQAForSequenceClassification,
GPT2MQAForTokenClassification,
)
if is_torch_available()
else ()
)
all_generative_model_classes = (GPT2MQALMHeadModel, GPT2MQADoubleHeadsModel) if is_torch_available() else ()
all_parallelizable_model_classes = (GPT2MQALMHeadModel, GPT2MQADoubleHeadsModel) if is_torch_available() else ()
fx_compatible = False
test_missing_keys = False
test_model_parallel = True
# special case for DoubleHeads model
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
if return_labels:
if model_class.__name__ == "GPT2MQADoubleHeadsModel":
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.num_choices, self.model_tester.seq_length),
dtype=torch.long,
device=torch_device,
)
inputs_dict["input_ids"] = inputs_dict["labels"]
inputs_dict["token_type_ids"] = inputs_dict["labels"]
inputs_dict["mc_token_ids"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.num_choices),
dtype=torch.long,
device=torch_device,
)
inputs_dict["mc_labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
return inputs_dict
def setUp(self):
self.model_tester = GPT2MQAModelTester(self)
self.config_tester = ConfigTester(self, config_class=GPT2MQAConfig, n_embd=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_gpt2mqa_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_model(*config_and_inputs)
def test_gpt2mqa_model_past(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_model_past(*config_and_inputs)
def test_gpt2mqa_model_att_mask_past(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_model_attention_mask_past(*config_and_inputs)
def test_gpt2mqa_model_past_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_model_past_large_inputs(*config_and_inputs)
def test_gpt2mqa_lm_head_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_lm_head_model(*config_and_inputs)
def test_gpt2mqa_double_lm_head_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_double_lm_head_model(*config_and_inputs)
def test_gpt2mqa_sequence_classification_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_for_sequence_classification(*config_and_inputs)
def test_gpt2mqa_token_classification_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_for_token_classification(*config_and_inputs)
def test_gpt2mqa_gradient_checkpointing(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_forward_and_backwards(*config_and_inputs, gradient_checkpointing=True)
def test_gpt2mqa_scale_attn_by_inverse_layer_idx(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(scale_attn_by_inverse_layer_idx=True)
self.model_tester.create_and_check_forward_and_backwards(*config_and_inputs)
@unittest.skip("GPT2MQA does not support head pruning.")
def test_head_pruning(self):
pass
@unittest.skip("GPT2MQA does not support head pruning.")
def test_head_pruning_integration(self):
pass
@unittest.skip("GPT2MQA does not support head pruning.")
def test_head_pruning_save_load_from_config_init(self):
pass
@unittest.skip("GPT2MQA does not support head pruning.")
def test_head_pruning_save_load_from_pretrained(self):
pass
def test_gpt2mqa_weight_initialization(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_gpt2mqa_weight_initialization(*config_and_inputs)
@slow
def test_batch_generation(self):
model = GPT2MQALMHeadModel.from_pretrained("bigcode/santacoder")
model.to(torch_device)
tokenizer = GPT2Tokenizer.from_pretrained("bigcode/santacoder")
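        # Decoder-only generation continues from the rightmost token, so batched
        # prompts of different lengths must be left-padded; right padding would
        # put pad tokens between a prompt and its continuation.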
tokenizer.padding_side = "left"
        # Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
# use different length sentences to test batching
sentences = [
"Hello, my dog is a little",
"Today, I",
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
input_ids = inputs["input_ids"].to(torch_device)
token_type_ids = torch.cat(
[
input_ids.new_full((input_ids.shape[0], input_ids.shape[1] - 1), 0),
input_ids.new_full((input_ids.shape[0], 1), 500),
],
dim=-1,
)
outputs = model.generate(
input_ids=input_ids,
attention_mask=inputs["attention_mask"].to(torch_device),
)
outputs_tt = model.generate(
input_ids=input_ids,
attention_mask=inputs["attention_mask"].to(torch_device),
token_type_ids=token_type_ids,
)
inputs_non_padded = tokenizer(sentences[0], return_tensors="pt").input_ids.to(torch_device)
output_non_padded = model.generate(input_ids=inputs_non_padded)
num_paddings = inputs_non_padded.shape[-1] - inputs["attention_mask"][-1].long().sum().cpu().item()
inputs_padded = tokenizer(sentences[1], return_tensors="pt").input_ids.to(torch_device)
output_padded = model.generate(input_ids=inputs_padded, max_length=model.config.max_length - num_paddings)
batch_out_sentence = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch_out_sentence_tt = tokenizer.batch_decode(outputs_tt, skip_special_tokens=True)
non_padded_sentence = tokenizer.decode(output_non_padded[0], skip_special_tokens=True)
padded_sentence = tokenizer.decode(output_padded[0], skip_special_tokens=True)
expected_output_sentence = [
"Hello, my dog is a little bit of a mess. I'm not sure if he's going",
"Today, I'm going to be doing a lot of research on this. I",
]
self.assertListEqual(expected_output_sentence, batch_out_sentence)
self.assertTrue(batch_out_sentence_tt != batch_out_sentence) # token_type_ids should change output
self.assertListEqual(expected_output_sentence, [non_padded_sentence, padded_sentence])
@slow
def test_batch_generation_2heads(self):
model = GPT2MQADoubleHeadsModel.from_pretrained("bigcode/santacoder")
model.to(torch_device)
tokenizer = GPT2Tokenizer.from_pretrained("bigcode/santacoder")
tokenizer.padding_side = "left"
# This tokenizer has no pad token, so we have to set it in some way
        # Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
# use different length sentences to test batching
sentences = [
"Hello, my dog is a little",
"Today, I",
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
input_ids = inputs["input_ids"].to(torch_device)
token_type_ids = torch.cat(
[
input_ids.new_full((input_ids.shape[0], input_ids.shape[1] - 1), 0),
input_ids.new_full((input_ids.shape[0], 1), 500),
],
dim=-1,
)
outputs = model.generate(
input_ids=input_ids,
attention_mask=inputs["attention_mask"].to(torch_device),
)
outputs_tt = model.generate(
input_ids=input_ids,
attention_mask=inputs["attention_mask"].to(torch_device),
token_type_ids=token_type_ids,
)
inputs_non_padded = tokenizer(sentences[0], return_tensors="pt").input_ids.to(torch_device)
output_non_padded = model.generate(input_ids=inputs_non_padded)
num_paddings = inputs_non_padded.shape[-1] - inputs["attention_mask"][-1].long().sum().cpu().item()
inputs_padded = tokenizer(sentences[1], return_tensors="pt").input_ids.to(torch_device)
output_padded = model.generate(input_ids=inputs_padded, max_length=model.config.max_length - num_paddings)
batch_out_sentence = tokenizer.batch_decode(outputs, skip_special_tokens=True)
batch_out_sentence_tt = tokenizer.batch_decode(outputs_tt, skip_special_tokens=True)
non_padded_sentence = tokenizer.decode(output_non_padded[0], skip_special_tokens=True)
padded_sentence = tokenizer.decode(output_padded[0], skip_special_tokens=True)
expected_output_sentence = [
"Hello, my dog is a little bit of a mess. I'm not sure if he's going",
"Today, I'm going to be doing a lot of research on this. I",
]
self.assertListEqual(expected_output_sentence, batch_out_sentence)
self.assertTrue(batch_out_sentence_tt != batch_out_sentence) # token_type_ids should change output
self.assertListEqual(expected_output_sentence, [non_padded_sentence, padded_sentence])
@slow
def test_model_from_pretrained(self):
for model_name in GPT2MQA_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = GPT2MQAModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@require_torch
class GPT2MQAModelLanguageGenerationTest(unittest.TestCase):
def _test_lm_generate_gpt2mqa_helper(
self,
gradient_checkpointing=False,
reorder_and_upcast_attn=False,
scale_attn_by_inverse_layer_idx=False,
verify_outputs=True,
):
model = GPT2MQALMHeadModel.from_pretrained(
"bigcode/santacoder",
reorder_and_upcast_attn=reorder_and_upcast_attn,
scale_attn_by_inverse_layer_idx=scale_attn_by_inverse_layer_idx,
)
if gradient_checkpointing:
model.gradient_checkpointing_enable()
else:
model.gradient_checkpointing_disable()
model.to(torch_device)
# The dog
input_ids = torch.tensor([[464, 3290]], dtype=torch.long, device=torch_device)
# The dog was found in a field near the intersection of West and West Streets.\n\nThe dog
# fmt: off
expected_output_ids = [
464, 3290, 373, 1043, 287, 257, 2214, 1474, 262, 16246, 286, 2688, 290, 2688, 27262, 13, 198, 198, 464, 3290,
]
# fmt: on
output_ids = model.generate(input_ids, do_sample=False)
if verify_outputs:
self.assertListEqual(output_ids[0].tolist(), expected_output_ids)
@slow
def test_lm_generate_gpt2mqa(self):
self._test_lm_generate_gpt2mqa_helper()
@slow
def test_lm_generate_gpt2mqa_with_gradient_checkpointing(self):
self._test_lm_generate_gpt2mqa_helper(gradient_checkpointing=True)
@slow
def test_lm_generate_gpt2mqa_with_reorder_and_upcast_attn(self):
self._test_lm_generate_gpt2mqa_helper(reorder_and_upcast_attn=True)
@slow
def test_lm_generate_gpt2mqa_with_scale_attn_by_inverse_layer_idx(self):
self._test_lm_generate_gpt2mqa_helper(scale_attn_by_inverse_layer_idx=True, verify_outputs=False)
@slow
def test_gpt2mqa_sample(self):
tokenizer = GPT2Tokenizer.from_pretrained("bigcode/santacoder")
model = GPT2MQALMHeadModel.from_pretrained("bigcode/santacoder")
model.to(torch_device)
torch.manual_seed(0)
tokenized = tokenizer("Today is a nice day and", return_tensors="pt", return_token_type_ids=True)
input_ids = tokenized.input_ids.to(torch_device)
output_ids = model.generate(input_ids, do_sample=True)
output_str = tokenizer.decode(output_ids[0], skip_special_tokens=True)
token_type_ids = tokenized.token_type_ids.to(torch_device)
output_seq = model.generate(input_ids=input_ids, do_sample=True, num_return_sequences=5)
output_seq_tt = model.generate(
input_ids=input_ids, token_type_ids=token_type_ids, do_sample=True, num_return_sequences=5
)
output_seq_strs = tokenizer.batch_decode(output_seq, skip_special_tokens=True)
output_seq_tt_strs = tokenizer.batch_decode(output_seq_tt, skip_special_tokens=True)
EXPECTED_OUTPUT_STR = (
"Today is a nice day and if you don't know anything about the state of play during your holiday"
)
self.assertEqual(output_str, EXPECTED_OUTPUT_STR)
self.assertTrue(
all([output_seq_strs[idx] != output_seq_tt_strs[idx] for idx in range(len(output_seq_tt_strs))])
) # token_type_ids should change output
@slow
def test_gpt2mqa_sample_max_time(self):
tokenizer = GPT2Tokenizer.from_pretrained("bigcode/santacoder")
model = GPT2MQALMHeadModel.from_pretrained("bigcode/santacoder")
model.to(torch_device)
torch.manual_seed(0)
tokenized = tokenizer("Today is a nice day and", return_tensors="pt", return_token_type_ids=True)
input_ids = tokenized.input_ids.to(torch_device)
MAX_TIME = 0.5
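        # Each time-bounded run below should take a bit more than MAX_TIME but
        # stay under 1.5 * MAX_TIME; the final unbounded run is expected to run
        # longer than that.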
start = datetime.datetime.now()
model.generate(input_ids, do_sample=True, max_time=MAX_TIME, max_length=256)
duration = datetime.datetime.now() - start
self.assertGreater(duration, datetime.timedelta(seconds=MAX_TIME))
self.assertLess(duration, datetime.timedelta(seconds=1.5 * MAX_TIME))
start = datetime.datetime.now()
model.generate(input_ids, do_sample=False, max_time=MAX_TIME, max_length=256)
duration = datetime.datetime.now() - start
self.assertGreater(duration, datetime.timedelta(seconds=MAX_TIME))
self.assertLess(duration, datetime.timedelta(seconds=1.5 * MAX_TIME))
start = datetime.datetime.now()
model.generate(input_ids, do_sample=False, num_beams=2, max_time=MAX_TIME, max_length=256)
duration = datetime.datetime.now() - start
self.assertGreater(duration, datetime.timedelta(seconds=MAX_TIME))
self.assertLess(duration, datetime.timedelta(seconds=1.5 * MAX_TIME))
start = datetime.datetime.now()
model.generate(input_ids, do_sample=True, num_beams=2, max_time=MAX_TIME, max_length=256)
duration = datetime.datetime.now() - start
self.assertGreater(duration, datetime.timedelta(seconds=MAX_TIME))
self.assertLess(duration, datetime.timedelta(seconds=1.5 * MAX_TIME))
start = datetime.datetime.now()
model.generate(input_ids, do_sample=False, max_time=None, max_length=256)
duration = datetime.datetime.now() - start
self.assertGreater(duration, datetime.timedelta(seconds=1.5 * MAX_TIME))
@slow
def test_contrastive_search_gpt2mqa(self):
article = (
"DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research "
"laboratory founded in 2010. DeepMind was acquired by Google in 2014. The company is based"
)
gpt2mqa_tokenizer = GPT2Tokenizer.from_pretrained("bigcode/santacoder")
gpt2mqa_model = GPT2MQALMHeadModel.from_pretrained("bigcode/santacoder").to(torch_device)
input_ids = gpt2mqa_tokenizer(article, return_tensors="pt").input_ids.to(torch_device)
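        # Contrastive search: candidates from the top_k most likely tokens are
        # rescored with a degeneration penalty weighted by penalty_alpha to
        # discourage repetitive continuations.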
outputs = gpt2mqa_model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=256)
generated_text = gpt2mqa_tokenizer.batch_decode(outputs, skip_special_tokens=True)
self.assertListEqual(
generated_text,
[
"DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research "
"laboratory founded in 2010. DeepMind was acquired by Google in 2014. The company is based in London, "
"United Kingdom\n\nGoogle has a lot of data on its users and uses it to improve its products, such as "
"Google Now, which helps users find the information they're looking for on the web. But the company "
"is not the only one to collect data on its users. Facebook, for example, has its own facial "
"recognition technology, as well as a database of millions of photos that it uses to personalize its "
"News Feed.\n\nFacebook's use of data is a hot topic in the tech industry, with privacy advocates "
"concerned about the company's ability to keep users' information private. In a blog post last "
'year, Facebook CEO Mark Zuckerberg said his company would "do our best to be transparent about our '
'data use and how we use it."\n\n"We have made it clear that we do not sell or share your data with '
'third parties," Zuckerberg wrote. "If you have questions or concerns, please reach out to us at '
'privacy@facebook.com."\n\nGoogle declined to comment on the privacy implications of its use of data, '
"but said in a statement to The Associated Press that"
],
)
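
Since the whole PR hinges on multi-query attention, it may help to close with a minimal shape sketch of the idea (plain scaled dot-product attention, all sizes hypothetical): queries keep one projection per attention head, while a single shared key/value head is broadcast across them, which is what shrinks the KV cache relative to GPT-2.

import torch

batch, seq, num_heads, head_dim = 2, 8, 4, 16  # hypothetical sizes

q = torch.randn(batch, num_heads, seq, head_dim)
k = torch.randn(batch, 1, seq, head_dim)  # one shared key head
v = torch.randn(batch, 1, seq, head_dim)  # one shared value head

# The singleton head dimension of k/v broadcasts across all query heads.
scores = q @ k.transpose(-1, -2) / head_dim**0.5  # (batch, num_heads, seq, seq)
attn = scores.softmax(dim=-1) @ v                 # (batch, num_heads, seq, head_dim)
assert attn.shape == (batch, num_heads, seq, head_dim)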